The Event-Driven Persistent Data Grid
Not a web control. I'm talking about the notion of applying grid computing to large-scale distributed data provisioning. I'd like to suggest a pattern and see if anyone can tell me whether a product provides this, or whether it is described elsewhere. I'd like to buy it.
The data grid is not a new concept (see https://www.gemstone.com/solutions/gridcomputing.php and https://www.gigaspaces.com/pr_ce.html). The idea lets you deliver data very quickly across a distributed infrastructure. This is useful for grid computing applications that allow massively parallel execution in a simplified environment, where data bottlenecks can starve your virtual supercomputer and completely screw up your ability to deliver.
One assumption of the data grid is that memory is large enough to store and retrieve the data. What if there is a LOT of data... gigabytes? What if it is not feasible to keep it in memory? For example, if someone were to query the Microsoft customer database and ask for all customers in Kansas, they'd get millions of records. Simply creating an infrastructure that can respond to such a request prevents us from using in-memory structures... but only for requests for very large amounts of data.
Requests for fairly small amounts of data can easily be served from memory.
Therefore, small data domains should be served from memory. They can be preloaded and made ready by distributing the domains to many servers "in the cloud." However, the in-memory data grid is not enough.
Let's assume that I have customers all over the world, and I need to deliver gigabytes of fresh, real-time data to all of them. The source systems can live anywhere. The consuming systems should not need to know where. Data grids are good, but they don't cover the need for large data stores.
I need to combine the data replication of RDBMS systems with the speed and distributed nature of the data grid. Add to that: I'd prefer it to be event driven (although I can write an event adapter for a source system that cannot, in and of itself, generate events).
So the notion works like this: I distribute, around the world, a set of database servers, highly redundant and reliable. On top of them, I place data grid servers (one, two, twenty, whatever). That creates a data grid cluster. I put in a directory service that allows an app to start up anywhere and find the nearest data grid cluster.
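The directory lookup described above can be sketched roughly as follows. This is a minimal illustration, not a product API: the `Directory` and `GridCluster` names, the example cluster locations, and the choice of great-circle distance as the meaning of "nearest" are all assumptions (a real directory service might use network latency or site topology instead).

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class GridCluster:
    name: str   # hypothetical cluster label
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points on Earth."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


class Directory:
    """Hypothetical directory service: clusters register, apps look one up."""

    def __init__(self):
        self._clusters = []

    def register(self, cluster):
        self._clusters.append(cluster)

    def nearest(self, lat, lon):
        # "Nearest" is assumed to mean geographic distance in this sketch.
        if not self._clusters:
            raise LookupError("no data grid clusters registered")
        return min(self._clusters,
                   key=lambda c: great_circle_km(lat, lon, c.lat, c.lon))


directory = Directory()
directory.register(GridCluster("seattle", 47.6, -122.3))
directory.register(GridCluster("dublin", 53.3, -6.3))
directory.register(GridCluster("singapore", 1.35, 103.8))
```

An app starting up in London would call `directory.nearest(51.5, -0.1)` and be handed the Dublin cluster, without knowing anything about where the source systems live.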
When a source application creates a new data element, it sends an event to the nearest data grid cluster informing it of the primary keys and some base data. That element is replicated around the world, first to memory and then to persistent storage. Depending on policies, the grid clusters can request full data for the data item from the source system, or they can wait until full data is requested by an app.
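Here is one way the event flow might look in code. It is a sketch under stated assumptions: `ElementCreated`, `FetchPolicy`, and `GridCluster` are hypothetical names, plain dictionaries stand in for both grid memory and persistent storage, and `fetch_full` is a placeholder callable for whatever interface the source system actually exposes.

```python
from dataclasses import dataclass
from enum import Enum


class FetchPolicy(Enum):
    EAGER = "eager"  # request full data as soon as the event arrives
    LAZY = "lazy"    # wait until an app asks for the full record


@dataclass
class ElementCreated:
    key: tuple  # primary key(s) of the new data element
    base: dict  # base data carried in the event


class GridCluster:
    def __init__(self, name, policy, fetch_full):
        self.name = name
        self.policy = policy
        self.fetch_full = fetch_full  # callable(key) -> dict (assumed source API)
        self.memory = {}              # in-memory grid
        self.store = {}               # stand-in for persistent storage
        self.full_keys = set()        # keys whose full data has been loaded
        self.peers = []               # other clusters to replicate to

    def publish(self, event):
        """Entry point for a source app: apply locally, then replicate."""
        self._apply(event)
        for peer in self.peers:
            peer._apply(event)

    def _apply(self, event):
        record = dict(event.base)
        if self.policy is FetchPolicy.EAGER:
            self._complete(event.key, record)
        self._save(event.key, record)

    def _complete(self, key, record):
        record.update(self.fetch_full(key))
        self.full_keys.add(key)

    def _save(self, key, record):
        self.memory[key] = record     # first to memory...
        self.store[key] = dict(record)  # ...then to persistent storage

    def get(self, key):
        """Return the full record, fetching from the source only if needed."""
        record = self.memory.get(key)
        if record is None:
            record = dict(self.store[key])
        if key not in self.full_keys:
            self._complete(key, record)
        self._save(key, record)
        return record
```

With this shape, a cluster with an eager policy pays the cost of the source call when the event arrives, while a lazy cluster holds only the keys and base data until some app calls `get`.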
The local grid cluster is highly redundant and persistent. Members all contribute memory to storing different data elements, but all of the data is stored in persistent storage as well. That way, if a data request needs a large amount of data, the data grid can force an ETL process between its persistent data store and the requester's database system, potentially moving millions of rows of data without having to package each row in an XML transaction, send it across the wire, and interpret it into a local database. This Database-Refresh style request is what really differentiates this pattern from a 'standard' Data Grid.
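To make the split between small and Database-Refresh style requests concrete, here is a rough sketch. It assumes Python's built-in `sqlite3` as a stand-in for both the cluster's persistent store and the requester's database, a hypothetical `customers` table, and an arbitrary `BULK_THRESHOLD`; a real deployment would use the bulk-load tooling of whatever RDBMS sits under the grid.

```python
import sqlite3

BULK_THRESHOLD = 1000  # hypothetical cut-off between "small" and "large"

SELECT_BY_STATE = "SELECT id, name, state FROM customers WHERE state = ?"


def make_db():
    """Create an in-memory database with the hypothetical customers table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
    )
    return db


def serve(store, cache, state, requester_db):
    """Answer small requests from memory; refresh the requester's DB for large ones."""
    count = store.execute(
        "SELECT COUNT(*) FROM customers WHERE state = ?", (state,)
    ).fetchone()[0]

    if count <= BULK_THRESHOLD:
        # Small data domain: load into the in-memory cache once, then reuse it.
        if state not in cache:
            cache[state] = store.execute(SELECT_BY_STATE, (state,)).fetchall()
        return ("rows", cache[state])

    # Large request: stream rows store-to-database instead of wrapping each
    # row in its own message and sending it to the app.
    cursor = store.execute(SELECT_BY_STATE, (state,))
    requester_db.executemany(
        "INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", cursor
    )
    requester_db.commit()
    return ("bulk", count)
```

The caller only sees the difference in the return value: either the rows themselves, or a notice that its own database has been refreshed with that many rows.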
OK. I want one. I'm working to understand and define this, and figure out how much of this is out of the box with SQL Server 2005.
ROI: every app gets rapid access to millions of rows of data, worldwide, without needing to know the source of the data or the parameters of the source system's ability to feed that data to the data cache infrastructure. Basically, every app loses its data access layer "into the cloud."
Do you know of a product that provides a persistent, distributed, in-memory data cache based on event-driven data propagation models, preferably one that uses canonical schemas and makes its data available over web services?
Comments
Anonymous
September 09, 2006
Hmmm, interesting idea. Especially if applied to the ESB.
Anonymous
October 06, 2006
Nick, isn't this the scale-out feature http://www.microsoft.com/technet/prodtechnol/sql/2005/scddrtng.mspx you are talking about? Yes, of course, this opens up a new way of looking at federation and data routing. However, I think some similar strategy is available today, but I don't have a URL or an instance with me. One could combine this with WSCF and SQLEveryWhere as well, and this might really turn out to be a superb architecture. I say "might" since I haven't implemented one. Anyway, this is a nice post and I hope to see more. Thanks, shreeman
Anonymous
October 08, 2006
Hi Shreeman, No. My idea above is not described in the article that you mention. That article assumes that the middle layer of the app knows about the distributed nature of the data, which is absolutely silly. It is a kind of tight coupling that simply defeats flexibility. This is an example of the failure to imagine. --- Nick
Anonymous
October 08, 2006
Nick, the reason I went into scale-out was that you mentioned p2p replication (although I could have provided the MSDN link, where scale-out has more options like p2p), and it is also data routing (based on AD and the nearest grid cluster). Your point, though, was more about a loosely coupled routing architecture (including SODA-based event subscription). You have also added persistence to it, which is why I called it analogous to smart client offline capabilities, including MSDE and merge-replication-based sync. I termed those wrongly as WSCF in my last post; that was a typo. The links you provided do talk about those two, but neither was complete, and your suggestion combines the two while putting more stress on loosely coupled subscription and routing. However, there are still a few scenarios that are not clear, namely synchronization of local storage and network congestion (load balancing) in a cluster-based scenario. I can term this grid computing + SODA, but the primitive side of grid computing, that is, the cluster, the load and/or network balancer, and data routing, is still the primary concern. I would be more than happy if you shared your view on data routing, the event subscription layer, and synchronization from the client, and finally what your grid cluster looks like. shreeman
Anonymous
October 16, 2006
When it comes to scale out for a cluster or Data Grid, I think what you are looking for is something like Coherence for .NET .. see http://www.tangosol.com/ Peace.