multiple architectures - many ghosts in the machine
I got an interesting comment on my post about a persistent data grid... that the idea is interesting when considered in the context of an ESB. (I assume this particular TLA stands for Enterprise Service Bus.) I don't know if the person leaving the comment meant to say that the two are essentially the same, or just complementary. If he thought that I meant the same thing, then I failed to be clear.
The thing about the ESB is that it places the messages "into the cloud." The persistent data grid places cached data "into the cloud." Different, but complementary.
When I was describing this idea to two other architects the other day, one asked, "What happens on update? Does the cache update from the message?" The answer is no. The message may intend for data to be updated; it may even command that data be updated. But until the data is actually updated in the source system, it has no place in the cache.
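To make that rule concrete, here is a minimal sketch in Java, assuming a hypothetical CacheNode that sees both command messages and source-system commit events (all of the names here are invented for illustration; they do not come from any particular product):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the cache ignores "update" commands and only
// refreshes an entry once the source system confirms the commit.
public class CacheNode {

    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // A command message travels the bus, but it does NOT touch the cache.
    public void onUpdateCommand(String key, Object proposedValue) {
        // Intent is not truth: do nothing until the source system commits.
    }

    // Only a confirmed commit from the source system updates the cache.
    public void onSourceCommitted(String key, Object committedValue) {
        cache.put(key, committedValue);
    }

    public Object get(String key) {
        return cache.get(key);
    }
}
```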
In a very real sense, while the data grid may leverage an ESB as a portion of its architecture, it is separate from it. The distributed data, which should allow very fast data access even at great distances, is not a message. Intelligent and seamless routing and distribution is essential, but it does not deliver large datasets across great distances.
While I cannot know for certain whether my idea would deliver that, I can tell you that the ESB, in and of itself, does not. So, in this situation, at least two architectures are needed.
Add to that the need for business intelligence. In a BI world, the data needs to be delivered "as of a particular date" in order to be useful to the recipient analytic systems. This 'date relevance' is needed to get proper roll-ups of the data, in order to create a truly valid snapshot of the business.
For example, if you have one system recording inventory levels, another recording shipments in transit, and another showing sales in stores, you need to know that your data in the analytics represents "truth" as of a particular time (say, midnight GMT). Otherwise, you may end up counting an item in inventory at 1am, in a shipment at 7am, and sold by 10am. Count it thrice... go ahead. Hope you don't value your job, or your company's future.
That requires data pulls that represent the data as of a particular time, even if the pull happens considerably later. For example, we may only be able to get our data from the inventory system at midnight local time, let's say Pacific Standard Time, when the server is not too busy. That is eight hours behind GMT, so the query has to pull the data as it stood at midnight GMT, not midnight local time.
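To sketch what such a pull might look like, here is a rough Java/JDBC example. The table, columns, and connection string are all invented for illustration, and I'm assuming the source system keeps row history so that a snapshot predicate is possible:

```java
import java.sql.*;

public class AsOfExtract {
    public static void main(String[] args) throws SQLException {
        // Invented table and column names; the point is the "as of" predicate.
        // The job may run at midnight Pacific time, but the cutoff is
        // midnight GMT, eight hours earlier than the local server clock.
        String sql =
            "SELECT item_id, quantity " +
            "FROM inventory_history " +
            "WHERE valid_from <= ? AND valid_to > ?";   // snapshot predicate

        // Midnight GMT for the business date being extracted. A real job
        // would compute this from a calendar and handle zones explicitly.
        Timestamp midnightGmt = Timestamp.valueOf("2006-09-10 00:00:00");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://inventory-host;databaseName=inventory");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setTimestamp(1, midnightGmt);
            stmt.setTimestamp(2, midnightGmt);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("item_id")
                            + " = " + rs.getInt("quantity"));
                }
            }
        }
    }
}
```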
This type of query is not well suited to a data-grid-style cache, and while the message can travel through the ESB, the actual movement of the data is probably best handled by an ETL (Extract, Transform, Load) process using an advanced system like SQL Server Integration Services (the replacement for SQL Server DTS).
Alas, in our data architecture, I've described no fewer than three different data movement mechanisms. Yet I still have not mentioned the local creation of mastered data. If the enterprise architecture indicates that a centralized CRM system is the actual 'master' system for customer data, then the CRM will use local data access to read and write that data. That is a fourth architecture.
OK... so where do reports get their data? That's a fun one. Do they pull directly from the source system? If so, that's a direct connection. What if the source system is 10,000 miles away? Can we configure the cache system to automatically refresh a set of datasets for the timely pull of operational reporting data? That would be a variation on my persistent data cache: the pre-scheduled data cache refresh. This would require a separate data store from the active cache itself. That amounts to data architecture number five.
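To illustrate that fifth mechanism, here is a minimal sketch, assuming a hypothetical refresh job and a reporting store separate from the active cache (none of these names belong to a real product):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical pre-scheduled cache refresh: a fixed schedule pulls a
// known set of datasets into a reporting store that is kept separate
// from the demand-driven cache.
public class ScheduledCacheRefresh {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Refresh the operational-reporting datasets once a day.
        scheduler.scheduleAtFixedRate(this::refreshAll, 0, 24, TimeUnit.HOURS);
    }

    private void refreshAll() {
        for (String dataset : new String[] {"inventory", "shipments", "sales"}) {
            // Pull from the (possibly distant) source system and write the
            // result into the local reporting store, not the active cache.
            System.out.println("refreshing dataset: " + dataset);
        }
    }
}
```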
Recap... how many data architectures do we need, all running at once?
- Message-based data movement
- Cached data 'in the cloud'
- Business Intelligence data through large ETL loads
- Direct data connections for locally mastered data
- Prescheduled data cache refresh for operational reporting
That's a lot. But not unreasonably so. Heck, my Toyota Prius has a bunch of different electric motors in it, in addition to the one engaged in the powertrain. Sophisticated systems are complex. That is their nature.
So when I go off on 'simplification' as a way to reduce costs, I'm not talking about an overly simplistic infrastructure. I'm talking about reducing unneeded redundancy, not useful sophistication. It is just fine to have more than one way to move data.
Comments
- Anonymous
September 10, 2006
My last comment was about being complementary with the ESB.
My vision is to add data grid control functionality to the ESB orchestration service, so that cached data gets to a specific service container as soon as possible, based on that service container's metadata/workflow.
Let's use your example as a base: "if you have one system recording inventory levels, another recording shipments in transit, and another showing sales in stores, you need to know that your data in the analytics represents "truth" as of a particular time," and map this to the ESB. Now we have (let's simplify a bit) three service containers, "Inventory", "Shipping", and "Sales", plugged into the ESB. Each of these services has its own data layer with dependencies on specific tables.
What if we describe each service container with metadata about its DAL? For example, using the Service Modeling Language (by the way, we are working on the specification right now together with Microsoft, IBM, Sun, and other companies) or something like that. As a result, we have a data map that shows which data/tables are required. This Service Data Map (let's name it so) should be under the control of the ESB orchestration and be combined with the workflow.
After that, we add the data grid layer: controlling services responsible for replicating DB data to the nearest zone node, the one next to our service container, based on our SDM.
In other words, as soon as a client adds new records to Inventory, the ESB orchestration sends an event to the Data Grid services asking them to update the data cache that is in the same zone as the next service in the workflow.
For every new service container added to the ESB, we just update our Service Data Map and workflow, and the ESB orchestration/Data Grid will update the data cache next to that service's zone.
Could you give feedback?
PS: we could discuss this via email (laflour at gmail dot com) - Anonymous
September 10, 2006
Hello Michael,
I can see that you've been thinking about the same things. Your ideas are sound.
I'm curious about the statement "This Service Data Map (let's name it so) should be under control of ESB orchestration and be combined with workflow."
I was thinking of the data cache system being more independent of the messaging infrastructure but built on top of it. It sounds like your vision would bind the data cache system to control elements in the messaging infrastructure. Did I get that right?
While I don't have a problem with focusing on moving data close to the next step in the workflow, I'd say that this decision is more configured and less 'controlled.' In other words, I wouldn't make the developer of the ESB orchestration indicate the locale of each step in the workflow. They shouldn't know that.
Therefore, the data cache should simply be responsive to the needs of the subscribers. Systems that need data, including ones that collaborate in a workflow, will place demands on the data cache infrastructure, including a demand that the data be up to date for its participation in the workflow. Our sales reporting system, for example, can demand that shipment information be kept current.
This means that the data grid, upon receipt of the 'sale' event from the order entry system, would broadcast that event (using the ESB) to the various nodes worldwide, and the one nearest the sales reporting system would respond with requests for more information.
Other nodes may respond as well, asking for more information. Only the first request would go back to the source system, however, because after that point the further information needed would be in the cache node closest to the order entry system.
This 'demand based' configuration is flexible because it does not require the ESB orchestration to 'know' anything about the locale of the particular steps in the workflow, nor does it prevent those locales from changing independently or scaling to meet a distributed need (like reporting from multiple continents).
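Sketched very roughly, and with names I'm inventing on the spot (there is no real product behind this), the demand-based flow might look like:

```java
import java.util.*;

// Rough sketch of the demand-based flow: subscribers register demands,
// the grid broadcasts events over the bus, and only interested nodes
// pull further detail. The first pull hits the source system; later
// pulls are served from the cache node nearest the source.
public class DemandBasedGrid {

    private final Map<String, Set<String>> demandsByTopic = new HashMap<>();
    private final Map<String, Object> nearSourceCache = new HashMap<>();

    // A subscriber (say, the sales reporting system) declares its demand.
    public void registerDemand(String nodeId, String topic) {
        demandsByTopic.computeIfAbsent(topic, t -> new HashSet<>()).add(nodeId);
    }

    // The grid broadcasts an event; only nodes with a registered demand react.
    public void onBusinessEvent(String topic, String recordKey) {
        for (String nodeId : demandsByTopic.getOrDefault(topic, new HashSet<>())) {
            fetchDetail(nodeId, recordKey);
        }
    }

    private void fetchDetail(String nodeId, String recordKey) {
        // The first request goes back to the source system; the answer is
        // kept in the cache node nearest the source, so later requests
        // never leave the cache.
        Object detail = nearSourceCache.computeIfAbsent(
                recordKey, this::querySourceSystem);
        System.out.println("delivered " + detail + " to " + nodeId);
    }

    private Object querySourceSystem(String recordKey) {
        return "detail-for-" + recordKey; // stand-in for a real remote call
    }
}
```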
(Note: I'm going to alter your e-mail address a bit to keep the automatic spamming apps from picking it up.) - Anonymous
September 10, 2006
Yep, it's exactly what I was talking about, but from a slightly different perspective: having only one ESB.
Do you have a sample of that kind of working system? Has anybody already built one? We are going to develop this kind of system to integrate enterprise apps/mainframes, and it would be interesting to discuss some aspects of this. - Anonymous
September 10, 2006
>> I'm curious about the statement "This Service Data Map (let's name it so) should be under control of ESB orchestration and be combined with workflow."
Let me explain what I meant:
There is an SML (Service Modeling Language) specification (http://www.microsoft.com/windowsserversystem/dsi/serviceml.mspx) used to describe systems, their structure, services, and constraints. (The specification is currently under development.) We can use this specification to describe the data for each service that is pluggable into the ESB, and use this description in the Data Grid to get a better view of where and which data is required. - Anonymous
September 12, 2006
Hi Michael,
Thanks for the update. I'll look into SML. Note: if you are using SML to model your services, that does NOT mean that the Service Data Map is under the control of ESB orchestration. Rather, it sounds like the ESB is a consumer of the service description, not its controller.
Therefore, it is not tight coupling for the data cache to also leverage the same service description for its caching. They would still be quite independent of one another. In fact, the services themselves may be called by the data cache, and may feed data into it, without knowing that the data cache exists. It is simply another consumer of data.
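One way to picture that independence, as a purely hypothetical sketch: both the ESB and the data cache read the same service description, and neither one controls it or the other:

```java
import java.util.List;

// Hypothetical: a shared, read-only service description (for example,
// derived from an SML document) with two independent consumers.
interface ServiceDescription {
    List<String> requiredTables(String serviceName);
}

class EsbOrchestrator {
    private final ServiceDescription description;
    EsbOrchestrator(ServiceDescription d) { this.description = d; }

    void planWorkflowStep(String serviceName) {
        // Reads the description to plan routing; a consumer, not a controller.
        description.requiredTables(serviceName);
    }
}

class DataCache {
    private final ServiceDescription description;
    DataCache(ServiceDescription d) { this.description = d; }

    void warmFor(String serviceName) {
        // Pre-loads the tables the description says this service needs,
        // without any coordination with the ESB orchestration.
        for (String table : description.requiredTables(serviceName)) {
            System.out.println("warming cache for table: " + table);
        }
    }
}
```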
September 25, 2006
I would say that yet another architecture might be worth considering: the Data Grid is the data; not a cache of the data, but the primary data source itself. In this architecture, databases are used as attached storage, to represent data in a convenient way and to serve as data warehouses.
The Data Grid is as reliable as a database (and vendors like GigaSpaces claim that a data grid is even more reliable and scalable than an RDBMS). In this situation we could use the Data Grid as the data provider and distributor for all the attached processes, which might act as listeners to the data events, pretty much like triggers in a database. Implementations of the idea include GigaSpaces, GemFire, and Oracle Fusion.