Unstructured Data, the Achilles Heel of SOA
You may be wondering why I am saying that unstructured data is the Achilles Heel of SOA. Very simply, SOA has been so focused on the plumbing and connectors that it has completely lost sight of the user. At the end of the day, SOA was supposed to make life more bearable for business users by providing them with systems that were more agile and offered a rich integration experience. I think that there has been some great work done by both Microsoft and others, but I think we got lost down one of the trails. Unstructured data is that connector to the business user.
Gartner projected that in 2003 business users would spend anywhere from 30 to 40 percent of their time managing documents. Similarly, Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data.
If you have seen some of my recent presentations, you know I talk a lot about this topic. To date I have not had a focused session on it; I have merely referred to the issue many times when presenting Office Business Applications (OBAs). I think I will need to talk about this more.
I am getting an increasing number of questions from customers about what my opinion is on this thing called unstructured data. Specifically, I was asked the question below:
What does “unstructured data” mean to my bank and what are the underlying issues?
This sounds like it could be a “technological solution to a political problem” silver bullet that the customer is looking for. And I’ve found silver bullets usually end up making interestingly shaped holes in feet.
I have removed names to protect the innocent... :) At any rate, I would like to comment that this is a very serious problem in the enterprise today, not just in Financial Services. I believe that it is not a political problem; rather, it describes how humans behave with technology. This information supports many of the human-based processes that make us all productive. However, when IT tries to force structure on a process that does not require it, the process is inhibited.
From the image above you can see that humans create and consume information differently than applications do. Business users are productive by creating an Excel worksheet to calculate figures or rates rather than going into a structured environment. Many times a structured environment is too restrictive for the business user.
Who needs unstructured data? Should we just migrate to more structured data options? I think that is a really bad idea for many reasons. But the biggest reason is that systems look for syntax and humans look for patterns.
So what are some of these unstructured data objects?
- E-Mails
- Reports
- Quick notes to workers
- Excel Files
- Word Documents
- PDF Documents
- Images (e.g., .jpg or .gif)
- Media (e.g., .mp3, .wma, or .wmv)
- Text Files
- PowerPoint Presentations
- HTML to a lesser degree
Below is a chart that I like to use that illustrates the differences between structured and unstructured data.
I broke these down using a method similar to Gartner's. However, there was a bit missing, so I kept the structure but added content. If you would like to see the Gartner article and you have a subscription, here is the article name:
The New Data Integration Frontier: Unifying Structured and Unstructured Data
What is Microsoft's Story?
Microsoft has a very strong office productivity presence. Couple that with the new 2007 Office Server System and you have a great story.
Yes I am saying OBA yet again...
So let's talk about how OBAs can facilitate the creation, storage, information management, routing, downstream system consumption, synchronization and security of unstructured data.
Content Creation
Business users create content in familiar Office applications such as Outlook, Excel, PowerPoint or Word. The majority of business users do not work directly in a database platform, keying in data.
The majority of these tools use the Office Open XML (Open XML) file formats. This means that while you and I work in an unstructured way, under the covers the data is being represented in a way systems can understand.
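To illustrate what "a way systems can understand" looks like in practice: a Word 2007 .docx file is a ZIP package whose main body is stored as XML in `word/document.xml`. The sketch below builds a stripped-down stand-in for such a package in memory and then reads the paragraph text back out using only the Python standard library. The helper names (`build_sample_package`, `extract_paragraphs`) are made up for this example, and a real Word file contains additional parts that are omitted here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_package(paragraphs):
    """Build a stand-in for a .docx: a ZIP that holds only word/document.xml.

    A real Word file also carries content types, relationships and styles;
    they are left out because only the body is needed for this illustration.
    """
    ET.register_namespace("w", W_NS)
    document = ET.Element(f"{{{W_NS}}}document")
    body = ET.SubElement(document, f"{{{W_NS}}}body")
    for text in paragraphs:
        p = ET.SubElement(body, f"{{{W_NS}}}p")   # paragraph
        r = ET.SubElement(p, f"{{{W_NS}}}r")      # run
        t = ET.SubElement(r, f"{{{W_NS}}}t")      # text
        t.text = text
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", ET.tostring(document, encoding="utf-8"))
    return buffer.getvalue()


def extract_paragraphs(package_bytes):
    """Return the text of each paragraph found in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


sample = build_sample_package(["Loan summary", "Rate: 6.25 percent"])
print(extract_paragraphs(sample))
```

The point is that no Office automation is needed: a downstream system can open the package and read the XML directly.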
By using these formats, the information integrates quite nicely with the new Microsoft Office SharePoint Server (MOSS). This leads us to our next topic... Storage
Storage
Hosted within MOSS is a service for storing information. This service is called SharePoint Document Libraries. What I really like about this environment is that it's not just a place where documents reside and go to their grave.
Services can be configured rather than coded:
- Workflow
- Policies
- Document Expirations and Retention Policies
- Questioning of Information
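As a rough illustration of "configured rather than coded", retention rules can be expressed as a small table of data that a generic routine interprets. The sketch below is not the MOSS policy engine or its API; the document types, retention periods and function names are all assumptions made up for the example.

```python
from datetime import date, timedelta

# Hypothetical policy table: changing retention means editing data, not code.
RETENTION_DAYS = {
    "rate sheet": 90,
    "home title": 3650,
}
DEFAULT_RETENTION_DAYS = 365  # assumed fallback for unlisted document types


def expiry_date(doc_type, created):
    """Return the date on which a document of this type expires."""
    days = RETENTION_DAYS.get(doc_type, DEFAULT_RETENTION_DAYS)
    return created + timedelta(days=days)


def is_expired(doc_type, created, today):
    """True once today is past the configured expiry date."""
    return today > expiry_date(doc_type, created)


print(expiry_date("rate sheet", date(2007, 1, 1)))
print(is_expired("rate sheet", date(2007, 1, 1), today=date(2007, 7, 23)))
```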
Classifying Information
It is one thing to store information, but how do you classify the data? All Office documents can have meta-data wrapped around them. For non-Office data, MOSS provides the facility to wrap meta-data around any item stored in a document library.
Essentially, MOSS is a meta-data repository for all your unstructured data. As an example, let's say that a home title was provided by a title agency in electronic format as a .tiff image. Meta-data could be wrapped around that document to describe both its contents and purpose.
Downstream systems can then pull that information based on that meta-data. Additionally, business users can find this data more easily.
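The idea of wrapping meta-data around an opaque file and letting downstream systems pull by that meta-data can be sketched in a few lines. This is an in-memory toy, not the SharePoint object model; `DocumentLibrary`, `LibraryItem` and the field names such as `doc_type` and `loan_id` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LibraryItem:
    """An opaque file plus the meta-data that describes it."""
    filename: str
    metadata: dict = field(default_factory=dict)


class DocumentLibrary:
    """Toy stand-in for a document library that classifies items by meta-data."""

    def __init__(self):
        self._items = []

    def add(self, filename, **metadata):
        item = LibraryItem(filename, dict(metadata))
        self._items.append(item)
        return item

    def find(self, **criteria):
        """Return items whose meta-data matches every given key/value pair."""
        return [
            item for item in self._items
            if all(item.metadata.get(key) == value for key, value in criteria.items())
        ]


library = DocumentLibrary()
library.add("title_4471.tiff", doc_type="home title", source="title agency", loan_id="L-1001")
library.add("rates.xlsx", doc_type="rate sheet", owner="pricing desk")

# A downstream system asks for the title image without parsing the image itself.
titles = library.find(doc_type="home title", loan_id="L-1001")
print([item.filename for item in titles])
```

The .tiff image stays unstructured; only the description around it is structured, which is what makes it findable.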
Routing and Human Consumption
As I pointed out, Document Libraries are a great way to get information into MOSS. But how do you route this information within your enterprise? Architects can eliminate the need to write these workflows themselves, since Windows Workflow Foundation (WF) provides routing services for information. A great example of this is being able to subscribe to content by RSS or e-mail alerts. For routing actionable information, Outlook tasks can be sent.
For an example of what the workflow could look like, here is one from the OR-LOS solution. This workflow shows how to extend simple flows and add approval-based processes. The approval workflow shows how to save information to logs and trigger events. All of this was click and drag rather than coding of any sort.
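The OR-LOS workflow itself is built in a designer, not in code. Purely to show the shape of an approval-based flow that logs each step and triggers events, here is a small state machine in Python. The states, actions and the `on_event` callback are assumptions for illustration and do not reflect the Windows Workflow Foundation API.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    PENDING_APPROVAL = "pending approval"
    APPROVED = "approved"
    REJECTED = "rejected"


# Allowed moves: (current state, action) -> next state.
TRANSITIONS = {
    (State.DRAFT, "submit"): State.PENDING_APPROVAL,
    (State.PENDING_APPROVAL, "approve"): State.APPROVED,
    (State.PENDING_APPROVAL, "reject"): State.REJECTED,
    (State.REJECTED, "revise"): State.DRAFT,
}


class ApprovalWorkflow:
    """Tracks one document through approval, logging every transition."""

    def __init__(self, document, on_event=None):
        self.document = document
        self.state = State.DRAFT
        self.log = []
        self._on_event = on_event  # hypothetical hook, e.g. to send an alert

    def apply(self, action, actor):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"'{action}' is not allowed while {self.state.value}")
        previous = self.state
        self.state = TRANSITIONS[key]
        self.log.append(f"{actor}: {action} ({previous.value} -> {self.state.value})")
        if self._on_event is not None:
            self._on_event(self.document, self.state)
        return self.state


events = []
flow = ApprovalWorkflow("loan-application.xml", on_event=lambda doc, state: events.append(state))
flow.apply("submit", actor="loan officer")
flow.apply("approve", actor="underwriter")
print(flow.state, flow.log)
```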
System Consumption
Since the MOSS workflow is based entirely on WF, system-centric routing of information can be easy to achieve by using Windows Communication Foundation. Obviously, this can be stubbed out in many different ways depending on your architecture decisions. For example, the OR-LOS solution uses XML web service proxies throughout the architecture. This allows for the maximum amount of interoperability between systems. WF provides many options here.
Synchronization
Syncing data between these two worlds is critical. Many times a business user is working on a form, for example. There may be initial data entry stages, and then a back-end process must kick off. This is a very common scenario.
In the OR-LOS solution this occurs often. The illustration below shows all the communication touch points in the architecture.
The Master Loan Flow (MLF), which is run from MOSS, holds all of our unstructured data. In this case InfoPath is used as the entry mechanism. These forms capture all the loan data that is required by the system. Once the data entry reaches a point at which the system can take some actions, web services are called down to the Lending Message Bus, which in turn traverses the layers.
The data is massaged and automated tasks are executed. Once human intervention is needed, an event is triggered in the MLF (Windows Workflow Foundation). In this process it kicks off alerts and updates work pipelines.
It's important to note that there needs to be a governing system. Windows Workflow Foundation provides this. It facilitates the communication between the human and system workflows.
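The hand-off between human and system work can be sketched as a single decision function that the governing workflow calls after every change to the form. The required fields, the `needs_review` flag and the step descriptions below are invented for the example; they are not taken from the OR-LOS solution.

```python
# Hypothetical fields that must be filled in before back-end processing can start.
REQUIRED_FIELDS = ("applicant", "amount", "property_address")


def missing_fields(form):
    """List the required fields that are still empty on the form."""
    return [name for name in REQUIRED_FIELDS if not form.get(name)]


def next_step(form):
    """Decide who acts next: a human or the back-end system.

    Returns an (owner, description) tuple that a governing workflow
    could use to raise an alert or call a web service.
    """
    missing = missing_fields(form)
    if missing:
        return ("human", "complete form: " + ", ".join(missing))
    if form.get("needs_review"):
        return ("human", "review flagged loan")
    return ("system", "submit to back-end services")


draft = {"applicant": "J. Smith", "amount": 250000}
print(next_step(draft))

complete = dict(draft, property_address="12 Main St")
print(next_step(complete))
```

Keeping the decision in one place mirrors the role of the governing workflow: neither the form nor the back-end needs to know about the other.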
Security
When dealing with data, most times there is information that should be protected. MOSS provides a platform that integrates nicely into a Single Sign On (SSO) enabled enterprise. This is nothing new on the market. What does differentiate MOSS from other solutions is that Information Rights Management (IRM) is available out of the box.
With IRM, messages can be routed to participants while the enterprise keeps control of that information. You can lock down information by eliminating forwarding, copy and paste, and print screen activities.
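Conceptually, a rights policy travels with the content and every client action is checked against it. The sketch below models only that idea; it is not the IRM API, and the action names and the `RightsPolicy` class are hypothetical.

```python
from dataclasses import dataclass

# Actions this toy model knows about (names are assumptions for illustration).
KNOWN_ACTIONS = frozenset({"read", "forward", "copy", "print_screen"})


@dataclass(frozen=True)
class RightsPolicy:
    """The set of actions a recipient may perform on a protected message."""
    allowed: frozenset

    def permits(self, action):
        if action not in KNOWN_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        return action in self.allowed


# A locked-down policy: recipients can read but not forward, copy or capture.
READ_ONLY = RightsPolicy(allowed=frozenset({"read"}))

for action in sorted(KNOWN_ACTIONS):
    print(action, READ_ONLY.permits(action))
```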
There are many options, and I would encourage you to read the MSDN articles that describe the architecture. You can choose to deploy IRM into the same Outlook user interface that you use today. Additionally, IRM can be extended to Smart Clients.
Conclusion
I hear from our customers that their SOA attempts have been hindered or have failed because they could not show the business iterative results that make it more productive. Businesses are still very hopeful that SOA will provide tangible business value. To do so, uniting the unstructured world and the structured world is essential.
Remember there are no silver bullets, but MOSS provides a great platform to build upon to facilitate the unification of unstructured and structured data.
Comments
- Anonymous
July 23, 2007
This past week seems to be the week for the call to action for Enterprise Architects to start blogging.