Question: Deep serialization of an object graph--how deep should it go?
So, I've been thinking lately about serializing/remoting object graphs. The Entity Framework currently serializes an entire object graph when binary serialization is used but only serializes one entity at a time in XML/DataContract scenarios. I'm working on a sample designed to show how graphs can be serialized, and we're looking into ways to make this even simpler/more automatic. In the process, though, an issue has come up that has me concerned: What if the graph in memory can vary in size? Might you want to serialize only a subset of the graph, and if so, how should that subset be specified? Would it always be the same, or might you want to specify different subsets for different operations?
To give some context, let's take an example: Assume we have a model with Customers, Orders, OrderLines and Products. Now let's assume that the reason we're serializing things is that we have a web method which returns a customer. With the EF today and a method that just directly returns a customer, you would only get the customer, but let's assume for the moment that you could easily indicate that a whole graph should be returned rather than just a single entity. If that were the case, then there are a few possibilities:
1) The entire graph connected to the customer is returned every time. If you are building a stateless web service, then you would likely construct a new ObjectContext instance each time the method is called, retrieve from the DB just those entities you want to return and then return them. In this scenario, returning the entire graph every time works just fine because the entire graph contains exactly what you want to return.
What if the context is maintained across multiple operations, though? Then everything would still be fine as long as other data retrieved into that context is disjoint from the graph containing the customer you want to return. You could, for instance, retrieve customer1, all of that customer's orders and all of those orders' orderlines as well as customer2 and all of their orders and orderlines and returning customer1 would be unaffected by the fact that customer2 had been loaded into the context. The moment you retrieve into that context a product which has been ordered by both customer1 and customer2, though, those two subgraphs become part of a larger graph, and returning either customer would actually cause the full graph including both customers and all their orders to be returned.
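To make the merging effect concrete, here is a minimal sketch in Python (not EF code). It assumes that "the entire graph" means everything reachable from the root by following relationships in either direction; the entity names and the `reachable` helper are hypothetical and only mirror the example above.

```python
from collections import deque

def reachable(start, links):
    """Return every entity reachable from `start`, following links in both directions."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical long-lived context: two customers, each with one order and one order line.
links = [
    ("customer1", "order1"), ("order1", "line1"),
    ("customer2", "order2"), ("order2", "line2"),
]
before = reachable("customer1", links)

# Loading a product referenced by both order lines joins the two subgraphs.
links += [("line1", "productX"), ("line2", "productX")]
after = reachable("customer1", links)
```

In this sketch `before` holds only customer1's own entities, while `after` also contains customer2, order2 and line2, even though nothing about customer1 itself changed.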
2) One way to address the potential issue with option 1 would be to remove certain navigation properties or annotate some of them to indicate that serialization should stop at that point. So, in the specific example above, the relationship from product back to the orderlines containing that product could be marked so that serialization wouldn't travel over it. This would allow a customer graph including products to be returned without that ever leading to the graph for multiple customers being returned all at once.
The problem with this approach, though, is that you might want some web methods to serialize different subgraphs than others. What if, alongside the method that returns a customer, you wanted to add a different method which returns a product and all of the orders that contain an instance of that product? In that case, the annotation indicating that products should not serialize the order lines that reference them (necessary to make the customer-returning method work correctly) would prevent the product-returning method from working as intended.
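The trade-off in option 2 can be sketched as follows, again in plain Python rather than EF code. The edge list, the property names (`orders`, `lines`, `order_lines`, and so on) and the `serialize_from` helper are all assumptions made for illustration; the `blocked` set stands in for a model-wide "stop here" annotation.

```python
# Each edge is (source entity, navigation property, target entity); all names are hypothetical.
EDGES = [
    ("customer1", "orders", "order1"), ("order1", "customer", "customer1"),
    ("order1", "lines", "line1"), ("line1", "order", "order1"),
    ("line1", "product", "productX"), ("productX", "order_lines", "line1"),
    ("customer2", "orders", "order2"), ("order2", "customer", "customer2"),
    ("order2", "lines", "line2"), ("line2", "order", "order2"),
    ("line2", "product", "productX"), ("productX", "order_lines", "line2"),
]

def serialize_from(root, edges, blocked=frozenset()):
    """Collect entities reachable from `root`, never crossing a property listed in `blocked`."""
    seen = {root}
    stack = [root]
    while stack:
        current = stack.pop()
        for source, prop, target in edges:
            if source == current and prop not in blocked and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

# The same model-wide annotation applies to every method.
customer_graph = serialize_from("customer1", EDGES, blocked={"order_lines"})
product_graph = serialize_from("productX", EDGES, blocked={"order_lines"})
```

Here `customer_graph` stops cleanly at productX, which is what the customer method wants, but `product_graph` contains nothing except productX, because the one annotation that protects the first method also cuts off the second.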
3) So, another approach altogether would be to have some mechanism to indicate on a method-by-method basis what subgraph to return. Naturally, this kind of mechanism provides the most flexibility, but it's also the most complicated to build and to explain, and it generates other questions like: Is it OK to always serialize all members of a collection as long as that collection is included, or are there scenarios where you would want to perform a filtered serialization where only part of a collection is serialized even though the whole thing is present in memory?
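One possible shape for such a per-method mechanism is sketched below in Python. This is only an illustration under stated assumptions, not a proposed EF design: entities are plain dicts, a "span" is a nested dict naming the navigation properties a method may follow, and the optional `keep` predicate stands in for filtered serialization of a collection. All of these names are hypothetical.

```python
def serialize(entity, span, keep=lambda prop, item: True):
    """Copy scalar fields, then follow only the navigation properties named in `span`."""
    out = {k: v for k, v in entity.items() if not isinstance(v, (list, dict))}
    for prop, child_span in span.items():
        value = entity.get(prop)
        if isinstance(value, list):
            out[prop] = [serialize(item, child_span, keep)
                         for item in value if keep(prop, item)]
        elif isinstance(value, dict):
            out[prop] = serialize(value, child_span, keep)
    return out

# Hypothetical in-memory graph with cycles, shared by both methods.
product_x = {"name": "productX"}
line1 = {"id": 1, "quantity": 2, "product": product_x}
line2 = {"id": 2, "quantity": 5, "product": product_x}
product_x["order_lines"] = [line1, line2]
order1 = {"id": 10, "lines": [line1]}
order2 = {"id": 20, "lines": [line2]}
line1["order"], line2["order"] = order1, order2
customer1 = {"name": "customer1", "orders": [order1]}
order1["customer"] = customer1

# Each method declares its own span over the same graph.
CUSTOMER_SPAN = {"orders": {"lines": {"product": {}}}}
PRODUCT_SPAN = {"order_lines": {"order": {}}}

customer_result = serialize(customer1, CUSTOMER_SPAN)
product_result = serialize(product_x, PRODUCT_SPAN)

# Filtered serialization: only order lines with a quantity above 2.
filtered_result = serialize(
    product_x, PRODUCT_SPAN,
    keep=lambda prop, item: prop != "order_lines" or item["quantity"] > 2,
)
```

Because each span is a finite tree, the traversal terminates even though the in-memory graph is cyclic, and the customer method and the product method no longer interfere with each other. The cost is the one noted above: every method now needs its own specification, and the filter question has to be answered explicitly.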
So, what do you think? I could really use the feedback. Binary serialization already uses option #1 above, and part of me thinks that option #1 may well be good enough for almost all scenarios. There's no doubt that it would be a LOT simpler to build and to explain, but if there are important, common scenarios where it isn't good enough, then maybe we need to take on #3.
Thanks,
Danny
Comments
Anonymous
November 19, 2007
It would be nice if you could default the behaviour to #1 but offer some mechanism to do #3. Most of the time I find myself typing a lot of code to do #1 where there never is a possibility for the graph to explode. I think we have an 80/20 situation here, where the usefulness of the EF will be judged on having a default #1 for 80% of the cases AND having a way to have #3 for the other 20% (which of course will then be used for 80% of the time, or harbour 80% of the complexity).
Anonymous
November 20, 2007
Hi, I'm starting to love your blog... I think the serialization should "return what was requested". I think there are two possible scenarios:
#1 You want to pass the context around. That is perfectly OK. As the caller is able to deserialize the context, there is a contract between them which relies on the type definition of the context (caller and callee share a common type system, maybe through proxies). In this case all relationships defined in the context have to be traveled, and all content of course should be serialized. (One should be careful when defining the schema of the context not to include things that are not needed.)
#2 You do NOT want to pass the context around, but rather want the context (and its type system) to be hidden. Then you have to establish a contract between the caller and the callee, thus defining a type system both can share (this holds true even when serialization is used as a persistence mechanism). Otherwise deserialization is not possible. One can use a subset of the type system of the context, of course, but is not required to! Consider a service maintaining different contexts (pointing to different persistence stores, not limited to databases; think of LDAP) and combining the data retrieved from any of the stores (maybe ensuring distributed transactions).
Back to your example: Suppose there is a contract defining the customer, order and product entities (no matter what else is in the context). In your example one asked for customer1, so all his orders and the related products should be serialized. As no one asked for productX (which is related to customer1 and customer2), I expect the orders collection of productX (when deserialized) to contain only orders from customer1 but not from customer2. As the orders collection is a navigational property, it is a modeling artifact, so I personally don't expect the collection to fulfill the entity contract. So not containing all orders is OK for me.
In scenario #2 there should be a possibility to define the shared type system as a projection/combination of the type systems of several contexts. It should also be possible to filter the content of the contexts with respect to either the implementation (fixed, by method) and/or the arguments provided by the caller (the caller may provide 1..n queries). Even in this case, the shared type system defines which relationships to travel. The queries then define which content to serialize.
- Martin
Anonymous
November 20, 2007
Jesam, Yes, I agree that we want to build an 80%/20% solution. One of the difficulties here is that we don't currently have a way to automatically do #1 above (the 80% solution) in a way that is interoperable. There are some solutions that work only with WCF on both sides, but I don't believe that's a reasonable constraint for a general-purpose framework when we are talking about web services. Yes, it should be possible for app developers to create something on top of the EF which assumes WCF on both sides, but the EF should not automatically impose that restriction, since many times web services are used for interoperability. This is what has kept us from automatically serializing graphs before now (no interoperable, general-purpose solution). I am working, however, on a sample for how you could add general-purpose graph serialization if you are willing to assume .net on both sides, which raises the question of how to specify what you want to serialize. Further, that's a question we have to address someday if we do find a way to make graph serialization more general purpose. The more I think about it, the more I'm becoming convinced that not only will serializing the entire graph (at least what's in memory) address the 80% case, but it's also possible to spin up a separate context, perform temporary surgery on your graph, etc. if you need to serialize a subset. So quite probably that's the approach I will take. Thanks for the feedback!
Anonymous
November 20, 2007
The comment has been removed
Anonymous
November 21, 2007
Danny, this is a great post. In our scenario, we are using stateless services and creating a new context with each request as you mentioned in example 1. So for starters, we just want a very simple way to serialize graphs, detach, set modified, attach and save. We would explicitly load the collections of the graph, controlling how deep we went. Nothing fancy, no state management, change tracking, concurrency etc. In the future I think we’d really like to have a context-per-session type of environment. We have considered a caching strategy to try and keep the context alive, but it seems to open up a whole lot of other issues… The new workflow services seem appealing, by maybe being able to persist an object context in a workflow, but I tend to wonder about performance penalties with the additional overhead. So thinking about where we are now, and where we’d like to be in the future, scenario 3 seems like an interesting option. After reading your post I started to envision the SubGraph type that could be applied to a given method. Can you share some of your ideas on what that might look like? Expand on the concept a little? In a response to a comment you stated: First off, it's important to realize that the context is NOT the DataSet. That is, it is not just a disconnected container of data--it represents not only something that tracks data but also a connection to the database and full metadata to describe the conceptual model, the storage model and the mapping between them. Throwing ideas out there, could the subgraph represent a mini-context which acts like a disconnected data container? A ‘smart-er contract’ which could interact with the context and do things like detach, attach, and set its members modified? It would even be really cool if that graph had a lightweight state mechanism / change tracker that could live with it. In the case of interoperability, this could just be a plain old ‘dumb’ xsd of some sort.
Anonymous
November 25, 2007
Hi Danny, thanks for your useful information. I've thought a lot about your replies. I've looked for such detailed information for quite a while, but had not found any so far. I like this kind of background information, because I first have to understand the intentions and design decisions behind EF before I can use it properly. But most of the posts in the blogs deal only with technical details. So many thanks for enlightening me. Nevertheless I want to add some thoughts. As the context can't be usefully serialized, what does a remote caller want from the context? I can imagine that the caller is interested in some of the persisted data the context holds. As Jarod suggests, that can be seen as a SubGraph. This object should be serializable (the database can't contain any cycles). In any case, there has to be a contract between the caller and the service. So I think it is useful to have a possibility to create the contract out of the context (like DataContract in WCF). It should be possible to use not only one context for creation of the DataContract but to "merge" the type systems of different contexts (in my scenario #2 I didn't want a context to maintain more than one conceptual model, but to leave this to the programmer). However, one needs possibilities to restrict the span of the object graph to be serialized. Like you, I see three options, and I agree with #1 and #3, but I would realize #2 differently. The main problem with your suggestion is the "backpropagation of reachability" (that is, adding the orders of customer2 once productX has been reached and its orders property is enumerated). To prevent that from happening, one should follow any relationship in only ONE direction when serializing.
- Martin
Anonymous
November 26, 2007
#3 is huge... as for a way of selecting which objects in a graph to serialize (and even which properties...), could a LINQ query on the object graph be sufficient to specify the subgraph? If not, some way of specifying the root object(s) and then a list of which relationships can be traversed (i.e. defining the partition or point-cuts). This would need the relationships to be identified explicitly in some way. I have a feeling LINQ could be enough... Cheers, -Matthew Hobbs
Anonymous
November 26, 2007
Thanks for the feedback folks. One interesting thing I've learned in discussion with some of the folks on the WCF team is that it is possible to do something like #3 (per-operation override of serialization) with the OperationFormatter. This certainly isn't as easy as a declarative "span" specification or something, but the fact that it is possible in the relatively less common cases is leading me more and more in the direction of just serializing the whole graph by default in this kind of scenario.
- Danny
Anonymous
June 17, 2008
I'm going to take a somewhat different approach to this. From my point of view, I'm primarily interested in an SO approach. To that extent, the concept of a 'context' is useless. I mean a Java client, or a .NET client for that matter (about which I have no direct knowledge), is going to find dealing with a context somewhat difficult (assuming you could serialize it in the first place). The whole concept of a context assumes 'logic' to handle the context, which basically means it's useless for interop services (unless you provide an implementation for each client). I've been doing my own style of object graph serialization and I've always taken option #3 in my designs, as I figure I don't want all the graph (could be huge!). I tend to specify the exact object tree I want/need. This means I might get duplicate objects in two branches, but I treat it all as data anyway (this could be fixed by processing the tree and re-linking duplicate records). Current implementations are also limited to all records of the child, but I think with the Linq syntax I'll have to figure out how to improve on that. I do find Linq an interesting technology; however, unless a disconnected, stateless, contextless design is delivered, I question its use outside of a DAL layer. If it is used within a DAL layer only, then its usefulness (in my opinion) is quite limited, as that's really the boiler room of the app, not where 'business logic' is being applied. I guess I'll leave you with this last thought... If the EF != SOA friendly, will it appeal to the enterprise market? (which I'm assuming is the target market)
Anonymous
June 17, 2008
Of course SOA friendly is the goal. The question is how much of the overall vision arrives in what time frame. Also, to be clear about option 3 vs. option 1, when I talk about a context in option 1 I'm talking about a very specific artifact of the EF programming model. In an SOA kind of world, I would be building a stateless solution, and I would create one of these contexts each time someone calls one of my services. In that case I would easily retrieve just the set of data I want to return and serialize all of it. Option 3 sort of assumes you have some data retrieved and reused across multiple service calls and you want to be able to specify returning some subset of the data you have in memory on different calls. Based on feedback here and other thinking, I have become increasingly convinced that this isn't all that useful. Now to the question of interop and multiple clients, etc.: Yes, this is critical. We're working on ways to make this as simple as it reasonably can be. That said, we also have feedback from some folks focused on a narrower set of scenarios where the clients of the services are all running .net and they want a simpler to implement set of services/clients for those cases. The trick for us is how to balance the various needs. The good news is that I think we're making some progress in figuring this out, and as we get a little further down the road we'll be working on ways to share out the thinking and designs and get feedback from many folks with the hope that our v2 delivery of the EF will be much more compelling for these scenarios (both where the client is .net and the more general interoperable case).
- Danny