

ObjectSpaces: The Devil is in the Demand

On object persistence I'm object persnickety.  I've looked at this problem space quite a bit over the last five years.  I've built systems and frameworks galore that provide a multitude of different styles of object-relational mapping with a variety of different object representations.  I've coined the phrase 'object span' to describe the ability to front-load a graph-based query with enough information to cherry-pick the best bits of data, so retrieval is optimized for the application.  I've even written about that here.  Yet I've never felt one hundred percent right about any of it.  Not the span exactly, but the whole concept of systems designed to retrieve segments of a data graph into the client space, with interceptors (or swizzlers) that fault in more segments on demand.  I've come to the conclusion that demand loading is just a terribly bad idea.


Now the way I look at it is that data stored in the server is of course the data of record.  Any fragment of that data that you pull down to another machine or another process space is by its very essence just a copy of the original, and a stale copy to boot.  Unless you enforce pessimistic concurrency by locking out anyone else whenever you even glance at a record, you are always dealing with stale data.  If the fragment of data you bring back is actually composed of records from a variety of tables, through relationships and collections available on your objects, then you have an even worse problem.  You might as well lock the whole database and serialize any and all interaction with it.  That is, if you want to keep up the charade that your client copy of the data is somehow accurate.


That's what you'd have to do, or a close approximation, if you wanted any reliability when it comes to faulting in data that was not retrieved with your original query.  But this is exactly what we keep on thinking we can do when we design our objects with built-in behaviors that try to govern constraints over the data and its relationships.  Certainly, with perfect knowledge, this is absolutely the right thing to do.  For objects in your application that only exist in memory, it is possible.  But for objects that exist logically in the database, and only transiently in memory, any such rules baked into the object can never honestly be upheld.  You just don't have all the data.  You don't know the current state.  You don't know that a collection of related objects has changed membership.  How can you control that membership in code on the client, unless you've frozen out every other possible interaction?


Sure, you can wrap your client code in a transaction and rely on optimistic concurrency to throw back anything that fundamentally violates these constraints.  But these violations are evaluated on the server, not in your code.  Your code will try to judge these things long before submitting the lot to the back end.  The back end can only catch constraints that you missed; it can't help you undo constraints that you enforced incorrectly with stale information.
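Since the server is the judge, the client's job is just to send its changes along with enough information for the server to detect staleness.  Here is a minimal sketch of that idea in plain ADO.NET (this is not the ObjectSpaces API; the Customers table and its rowversion column named Version are hypothetical):

    using System;
    using System.Data.SqlClient;

    public class CustomerWriter
    {
        // originalVersion is the rowversion value read by the original query.
        public static void SaveName(SqlConnection conn, int id, string name, byte[] originalVersion)
        {
            SqlCommand cmd = new SqlCommand(
                "UPDATE Customers SET Name = @name " +
                "WHERE Id = @id AND Version = @version", conn);
            cmd.Parameters.AddWithValue("@name", name);
            cmd.Parameters.AddWithValue("@id", id);
            cmd.Parameters.AddWithValue("@version", originalVersion);

            // Zero rows updated means somebody changed the row since we read it.
            // Only the server can know that, and only at this moment.
            if (cmd.ExecuteNonQuery() == 0)
                throw new InvalidOperationException(
                    "Customer was changed by someone else; reload and retry.");
        }
    }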


It seems self-evident, therefore, that you can only perform constraint checking over data you know you have, all of it retrieved in a consistent state at the same time.  This might make you think you can add constraints to property accessors that make certain you don't modify a field in a way that would violate some simple constraint check.  This makes sense for data-type constraints, like the range of values legal for a particular integer.  But it breaks down if the constraint evaluates against other data as well, even if that data was originally all consistent.  You can't just restrict piecewise modifications to the data.  You've got to leave the programmer some wiggle room to work the data before submitting it as valid.  You see this all the time with poorly written data-entry systems: the ones that throw a fit if you enter an invalid value and won't let you progress until you enter a valid one.  But the value was only invalid given the state of another field, which you can't get to yet.  These kinds of things drive me batty.  The developer has to be able to choose when to apply the constraints to check for validity.  It should not be automatic.
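To make the distinction concrete, here is a minimal sketch (all names hypothetical) of a client object that checks a simple data-type constraint in the property accessor but defers the cross-field constraint to an explicit Validate call, so the developer chooses when validity is judged:

    using System;

    public class Appointment
    {
        private DateTime start;
        private DateTime end;

        public DateTime Start
        {
            get { return start; }
            set
            {
                // A self-contained data-type constraint is safe to check piecewise.
                if (value.Year < 1900)
                    throw new ArgumentOutOfRangeException("value");
                start = value;
            }
        }

        // No check against Start here; the user needs wiggle room to set
        // fields in any order before declaring the whole object valid.
        public DateTime End
        {
            get { return end; }
            set { end = value; }
        }

        // Cross-field constraints run only when the caller chooses to apply
        // them, typically just before submitting changes to the server.
        public void Validate()
        {
            if (end < start)
                throw new InvalidOperationException("End must not precede Start.");
        }
    }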


What I'm getting at is that the only place with knowledge perfect enough to enforce constraints beyond simple data-type constraints is the place that has all the information about the data.  In a client-server system, that place is the server.  This even pertains to code running on the server, even in the same process as the server, if the data is a stale copy of the official data.  Therefore there is no use in pretending that your client objects are something more than they are.  They are a working copy of information, and you can make no inferences about their validity until the whole wad is packaged up and sent back to the server.


Still, you might think it reasonable for your client objects to employ demand-loading features, just so you don't have to bake into the system which data is retrieved all at once.  I agree that baking that in is a terribly bad thing.  Applications tend to have a variety of data usage patterns, so optimizing for one generally makes all the others unbearable.  But by demand loading you are implying that data fetched now is intrinsically as good as data retrieved in the original request.  Yet, even though you could have inferred some sort of consistency when the data was retrieved all at once, you can no longer do so with data that is retrieved at just any-old-when.  Demand loading perpetuates the myth that the client objects somehow represent accurate proxies into the server data.


If this doesn't make the hair stand up on the back of your neck, think about all the potential security problems that go with objects that inherently carry around context allowing them to reach back into the database to get more data.  If you ever intend to hand off any of these objects to some piece of untrusted code, you can just forget it.  You'd be handing off your credentials along with them.
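Here is a contrived sketch of why (names hypothetical): to fault in data on demand, the object must carry something that can reach the database, so whoever holds the object holds that access too.

    using System.Collections.Generic;

    public class Order { }

    public class LazyCustomer
    {
        private readonly string connectionString;  // the credentials ride along with the object
        private List<Order> orders;                // faulted in on first touch

        public LazyCustomer(string connectionString)
        {
            this.connectionString = connectionString;
        }

        public List<Order> Orders
        {
            get
            {
                // Any code handed this object can trigger a database round trip.
                if (orders == null)
                    orders = LoadOrders();
                return orders;
            }
        }

        private List<Order> LoadOrders()
        {
            // A real mapper would open a connection here with connectionString
            // and query the Orders table, using whatever rights that login has.
            return new List<Order>();
        }
    }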


So in reality, you don't want to make any claims about the integrity of your client objects.  You really just want these objects to represent the result of your query and nothing more.  You still want to represent relationships to other objects as properties and collections, but you don't want these in-memory references to imply anything that might be construed as consistency with the back end.  You want this data simply to represent the data you asked for from the server, at the time that you asked for it.  Beyond that, since you can't accurately apply constraints on the client, the user of these objects should be able to modify them to their heart's content, changing values, adding and removing objects from collections, and none of it should mean anything until you choose to submit some of those changes to the back end.


So to summarize:

1) Span good, demand load bad.

2) Client objects are just data, not behavior.


Of course, this is not the ideal that ObjectSpaces is offering.  You can get it, however, by ignoring the demand-loading data types and sticking to simple generic collections, List<T>, and plain object references for 1-to-1 relationships.
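As a minimal sketch of that "just data" shape (names hypothetical), relationships are plain List<T> collections and object references that hold exactly what the query returned and imply nothing about the server's current state:

    using System.Collections.Generic;

    public class Address
    {
        public string Street;
        public string City;
    }

    public class Order
    {
        public int Id;
        public decimal Total;
    }

    public class Customer
    {
        public int Id;
        public string Name;

        // Holds whatever orders the query asked for (all of them, or only the
        // most recent few) and never faults in more on its own.
        public List<Order> Orders = new List<Order>();

        // A 1-to-1 relationship is just a plain reference; null if the query
        // didn't span to it.
        public Address ShippingAddress;
    }

Callers can change values and collection membership freely; none of it means anything until the changes are explicitly submitted.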


Let me know what you think.


Matt

Comments

  • Anonymous
    April 20, 2004
    I generally think the idea of loading 'in total' is dubious to try to achieve. You often only want to see a portion of related objects in any given application. For example, suppose you relate a customer and his orders together, using a collection to represent the customer's orders. The idea of totality works okay for the average customer with only a few orders, but then there are a few with quite a large number, and your app might only need to be concerned with the most recent. So you wouldn't want to imply that the orders collection always strictly refers to the entire set of orders for a given customer. That would preclude ever using that particular data class/shape to represent the customer and only a subset of orders, which might be a common usage pattern.
  • Anonymous
    April 20, 2004
    Forgive me, I've never actually tried any of this.

    Would it not be possible to "check-out" the dataset of an object from the DB and set up a trigger so that a message can be sent to the client whenever some of the data behind an object is altered in the DB?

    The client code could decide whether or not to refresh the object state from the DB before it tries to demand-load another part of the graph.

    The user could be told "someone else has just changed this data, do you want to see what changes they have made", or the program could have some default behavior.

    It would be less overhead than pessimistic locking, but would reduce the chance of the client breaking a constraint when submitting an update to a database that has been altered since the initial data was retrieved.

    You can't guarantee that a client would get the message, so you still have to constraint-check all the updates, but it might make things a bit smoother.

    With Yukon I'd have thought such things were do-able, since you can create new triggers within queries.

    I expect a lot of research has already been done on such a solution; is there any particular reason why it wouldn't be effective?
  • Anonymous
    April 20, 2004
    If I understand you correctly, the constraints problem is not solved with span loading.

    Suppose you have a business rule that says that the Customer.CreditLimit needs to be higher than the amount of all the Invoice.Total for the customer.

    Suppose you load an existing Invoice with its InvoiceLines and with the Invoice.Customer object (using span loading). You change some values in the lines. Then you compare the total with the CreditLimit, and it satisfies the rule, so you then persist the changes.

    If someone else changed the CreditLimit after you read it, then your constraint might no longer be satisfied, but you won't know.

    I see two ways of solving this. One is to do optimistic locking on the Customer object even if it's not updated.

    The other is locking the Customer record by reading it again with a pessimistic lock, performing the check, and committing. (A sketch of this second option appears after the comments.)

    I don't see how span loading can solve these scenarios.

    Regards,

    Andres
  • Anonymous
    April 20, 2004
    An addition.

    Doing optimistic locking on the Customer even if it's not updated is not really what you want to do. You just want to use the value stored in the database when you are going to persist, so the only good solution is the second one.

  • Anonymous
    April 20, 2004
    You may have misunderstood me. I was making the point that you cannot implement the constraints correctly on the client, so there is no need to pretend the objects are a consistent replica of the server data. This means no demand loading, which leaves spans as the only way to query for a non-trivial section of a graph.

  • Anonymous
    April 20, 2004
    While I agree with you that data in an object should be treated as a copy, and thus by definition a stale piece of data, you make a mistake along the way in your conclusions.

    If I load a customer object into memory, I know that it is a copy. If I then want to read its order rows, I can load these on demand. Now, what am I asking for when I read the orders at time T? ALL orders of that given customer. Because I load them at time T, I have the freshest set of orders for that customer I can get. Say user U1 loads that customer, then gets some coffee. User U2 creates a new order for that customer in the meantime. U1 comes back and clicks open the 'Orders' collection in the GUI. Voila, U1 sees the order U2 just created. With spans, this would not have happened and the data would be less accurate.

    I say 'less accurate' and not 'wrong', because, as you described correctly, all data moved to a position outside the datastore is stale. It now comes down to the following things:
    1) Every user of .NET who wants to read/write data from/into a database has to know that the data in memory is stale.
    2) To keep the effect of stale data as small as possible, data should be read as late as possible, so the copy in core is more likely to reflect the data in the database.
    3) To keep the effect of stale data as small as possible, developers should build in functionality locking, i.e. locking at the application level of functionality, so users don't step on each other's toes by doing the same actions on the same data (which forces organizations to schedule work more efficiently, which in the end makes the software more efficiently applied).

    Spans also don't free you from stale data; every element in a span is a union, which might look like a single-query action but it isn't. Furthermore, spans can be horrible in performance in situation A and fast in situation B, while load on demand can be fast in A and horrible in B.

    Loading 100 customers in a GUI which can drill down to orders, order rows and products is very inefficient with spans, because you have to load a lot of data you probably will never use. (This is the area where pessimistic locking falls flat on its face too: it locks a lot of rows which are probably never used.) Load on demand, however, can be very efficient then.

    In a remoting scenario it is the other way around. Loading a graph into a root object and passing that root object back to the caller via remoting is far more efficient than a chatty application which loads data on demand.

    Andres: pessimistic locking and optimistic locking won't help you in any situation. They give you a false sense of safety which isn't there. First, you can always work around pessimistic locking; second, optimistic and pessimistic concurrency will both cause loss of work. It's the REASON why 2 or more processes are altering the same data that needs fixing, not the RESULT.
  • Anonymous
    April 20, 2004
    Andres,

    I suspect that, since the server has to validate upon update anyway, there's not much use in having the client lock all the data remotely and check the constraints before submitting.

  • Anonymous
    April 20, 2004
    Would comparisons on a timestamp field help the situation any?

    An interval at which the timestamp in the database is compared with the timestamp in the client's in-memory data, prompting the user to re-get newer data

    OR

    A comparison of the client and server timestamp values before a persist-type operation is performed?
  • Anonymous
    April 20, 2004
    "For the GUI example, I'd have the query extract just customers. Then when the UI wishes to drill down, an additional query is fired to get the data for that customer. The behavior is built into the app, and not into the object model."
    But isn't that load-on-demand? Looking at your conclusions, I see spans are good and load on demand is bad. What I wanted to say was that that conclusion is too black-and-white.
  • Anonymous
    April 20, 2004
    I have to second this.

    Full load is idiotic. It is arrogant.

    When I write an object, I cannot make assumptions about how the object will be used later.

    Order? Auto load order details. Great - all I want to show is a list of the largest orders we ever had, without details - thanks for loading the garbage.
    Order? Auto load order details, articles? All I want is a list of orders that still are unpaid - I am not interested in the articles.

    This means on demand. Only on demand is flexible enough to handle the object being loaded as it should be. When it is needed.

    Now, performance for this is interesting (ask me: the EntityBroker had no support for spans until a week ago; we call them "Prefetch"). This is what counters it. Here the UI or user of the object, which is the ONLY thing that knows what it will do with the objects and which knows what other objects it needs, can define what other objects to load. Nice, isn't it? The only solution. Possibly, naturally, with a good cache for static objects in addition.

    But this is about as good as it can get. And then, the programmer also needs to know what he is doing. A good O/R mapper can solve the paradigm issues, but concurrency and outdated data are something he has to think about.
  • Anonymous
    April 20, 2004
    Thomas,

    Matt is not saying that you should always load the whole graph; he's saying that you should explicitly tell the O/R mapper what to load up-front. If you need order headers, ask for them. If you need headers and lines, ask for both.

    Frans,

    Denying the existence of a problem does not make it go away ;). If your application cannot deal with concurrency issues then it does not work.

    Imagine amazon.com has to decrease the inventory of a product each time you confirm an order. How will it implement functionality locking for that? Should it not allow two people to add the same product to their carts?

    If you cannot have a lock on data before checking constraints, then your application does not work.

    Also, when you do span load, you know you are retrieving a set of data that was consistent at some moment of time. If you do delayed loading you don't.

    Say I have a forum web app with users and posts. I load a user, and it has a field with the number of posts. Then I delay-load the posts. The number of posts can be different, so I get data that is not consistent, and that could lead my code to wrong decisions. If I load the user and the posts at the same time, they are consistent.

  • Anonymous
    April 20, 2004
    Frans, (way back up the list)

    I think we are agreeing. I never meant to exclude the ability to fetch data later, just that it should not be automatic. Certainly, you can write code that assembles your in-memory graph using as many queries as you would like. But you at least know you did this and are willing to deal with the possibility of inconsistencies.

    Everyone else, thanks for joining in. I'll respond to more later.
  • Anonymous
    April 21, 2004
    Andres: my point was that any low-level concurrency method always causes loss of work and always results in a horrible way of 'correcting the error'. After all, the error is seen in the low levels of the application; how do you correct it there in the way the user expects the application to work?

    "If you cannot have a lock on data before checking constraints, then your application does not work. "
    Data locking from outside the database can be pretty bad for an application in general (but sometimes unavoidable, admittedly). If a web user locks a row with an action on a website, how long should the lock hold if the user's modem suddenly drops the carrier? :) I think that's the point of Matt's article: the data is outside the db, it's therefore stale, and you can't assume it's not.
  • Anonymous
    April 21, 2004
    Frans,

    That's exactly my point. To apply the constraint you need to be in the server. Before committing anything you should read all the data that matters with a lock, check the constraints, and commit. If you are using delayed loading, you need to make sure you load that data (i.e., the Customer credit limit) again in the context of a transaction.

    Of course, I'm not saying that the web application should hold a lock in the server.

    I agree that most of the ways to correct a concurrency error are bad, but think of how merge replication conflicts are solved. You can have rules to apply when a conflict happens, and that way you can have a good automatic response in some cases. Anyway, the problem does exist and in some cases you cannot avoid it. Of course I'd like to be able to avoid it, but you cannot. Can you find any way to solve the amazon.com example using functionality locking?

    Regards,

    Andres.
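For Andres's second option above, re-reading the Customer under a pessimistic lock inside the transaction so the credit-limit rule is checked against fresh data, a minimal sketch might look like this (table, column, and method names are all hypothetical; a fuller version would lock the invoice rows too):

    using System;
    using System.Data;
    using System.Data.SqlClient;

    public class InvoiceWriter
    {
        public static void SubmitInvoice(SqlConnection conn, int customerId, decimal invoiceTotal)
        {
            using (SqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted))
            {
                // UPDLOCK holds the customer row until commit, so the limit
                // cannot change between the check and the insert.
                SqlCommand check = new SqlCommand(
                    "SELECT CreditLimit - (SELECT COALESCE(SUM(Total), 0) FROM Invoices " +
                    "WHERE CustomerId = @id) " +
                    "FROM Customers WITH (UPDLOCK) WHERE Id = @id", conn, tx);
                check.Parameters.AddWithValue("@id", customerId);
                decimal headroom = (decimal)check.ExecuteScalar();

                if (invoiceTotal > headroom)
                {
                    tx.Rollback();
                    throw new InvalidOperationException("Credit limit exceeded.");
                }

                SqlCommand insert = new SqlCommand(
                    "INSERT INTO Invoices (CustomerId, Total) VALUES (@id, @total)", conn, tx);
                insert.Parameters.AddWithValue("@id", customerId);
                insert.Parameters.AddWithValue("@total", invoiceTotal);
                insert.ExecuteNonQuery();

                tx.Commit();
            }
        }
    }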