BizTalk Server: How to Simplify Complex XML Schemas
Introduction
XML Schemas are used for two main tasks:
- processing XML documents (XML document validation and XML document transformation);
- defining domain-specific standards.
XML Schemas and Domain Standards
If we look at domain standards such as EDI, RosettaNet, NIEM, ebXML, Global Justice XML Data Model, SWIFT, OpenTravel, Maritime Data Standards, HIPAA, and HL7, we can see that schemas capture domain knowledge in a form that can be formally and officially validated. XML Schemas are helpful for defining industry and domain standards.
Compare standards defined as XML Schemas with standards defined as text documents. When a standard is defined in a text document, it is almost impossible to verify whether the data satisfies it. When it is defined as an XML Schema, the data can be validated, and validated automatically.
Domain specialists use XML Schemas to define standards in an unambiguous, machine-verifiable format.
These schemas tend to be large and very detailed, because their purpose is to capture as much domain knowledge as possible.
But when we use XML Schemas for the first task, processing XML documents, we need something completely different: small schemas.
Most applications don't need everything a standard schema offers. We need only a small portion of the schema to validate or transform the significant part of the data.
Instead, we upload megabyte-size standard schemas and build maps for them, and all processing takes a long time and consumes a huge amount of CPU and memory.
In most integration projects we don't need to verify that the data satisfies the standard. We want to transfer data between systems as fast as possible with minimal development effort.
How do we work with these rich schemas? How do we make our integration faster, both at run time and in development?
First we have to decide: does our application require the whole schema or not?
If the answer is "No", we have a solution.
How to Simplify?
The solution is to simplify the schema: cut out all unused parts.
The first step is to decide which parts of the original schema we want to pass on, that is, map to another schema. We keep these parts unchanged and simplify all the other, unnecessary parts.
The second step is to find out whether the target system validates its input data. A good system usually does. Validation includes the data format (is this field an integer or a date, or does it match a regular expression?), the data range (is this string too long, or is this integer too big?), the encoding (does this value belong to the code table?), and so on.
If the target system performs this validation, it doesn't make sense for us to perform the same validation in the integration layer. We just pass the data to the target system without validating it. Let that system validate the data and decide what to do with errors: send them back to the source system, try to repair the data, or something else. In fact, it is poor architecture for an intermediary (our integration system) to make such validations and decisions. It spreads the business logic between systems, with the target system delegating its data validation logic to the intermediary. The integration system should deal with data validation only when it is needed.
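To make the three kinds of checks concrete (format, range, code table), here is a minimal Python sketch of what a target system might do on its side. The field names, the 60-character limit, and the code table are hypothetical and are not taken from HIPAA or from any real system.

```python
import re
from datetime import datetime

# Hypothetical code table; a real system would load it from its own reference data.
GENDER_CODES = {"F", "M", "U"}


def validate_record(record):
    """Return a list of validation errors for one input record (empty if valid)."""
    errors = []

    # Format check: the value must parse as a YYYYMMDD date.
    try:
        datetime.strptime(record["birth_date"], "%Y%m%d")
    except ValueError:
        errors.append("birth_date: not a valid YYYYMMDD date")

    # Format check: the value must match a regular expression.
    if not re.fullmatch(r"\d{5}(-\d{4})?", record["zip"]):
        errors.append("zip: does not match the expected pattern")

    # Range check: the string must not be too long.
    if len(record["last_name"]) > 60:
        errors.append("last_name: longer than 60 characters")

    # Code table check: the value must belong to the code table.
    if record["gender"] not in GENDER_CODES:
        errors.append("gender: value is not in the code table")

    return errors
```

If the target system already runs checks like these, repeating them in the integration layer only duplicates that logic.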
Example: HIPAA Schema Simplification in the BizTalk Server
[Download code here: HIPAA Schema Simplification for the BizTalk Server application]
Now let's get more technical. The following example uses BizTalk Server and the HIPAA schemas, but the same principles apply to other systems and standards.
The first step in the schema simplification is the structural modification. It is pretty simple: we replace the unused schema parts with <any> tags [http://www.w3.org/TR/xmlschema-0/#any]. If we still want to map such a schema part, but without any details, we can use the Mass Copy functoid.
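In the article this structural change is made by hand in the schema editor. Purely as an illustration of what the edit does, the following Python sketch replaces the inline content of one element with an xs:any wildcard. The sample schema, the element names (Claim, Provider), and the helper name are hypothetical, and the sketch only handles elements whose type is defined inline.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"
XS = "{" + XS_NS + "}"
ET.register_namespace("xs", XS_NS)

# Hypothetical miniature schema; real HIPAA schemas are far larger.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Claim">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="ClaimId" type="xs:string"/>
        <xs:element name="Provider">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Name" type="xs:string"/>
              <xs:element name="Address" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def replace_content_with_any(schema_xml, element_name):
    """Replace the detailed content of the named element with an xs:any wildcard."""
    root = ET.fromstring(schema_xml)
    # Snapshot the element list first, because the tree is modified in the loop.
    for elem in list(root.iter(XS + "element")):
        if elem.get("name") != element_name:
            continue
        # Drop the detailed inline definition and any type reference.
        for child in list(elem):
            elem.remove(child)
        elem.attrib.pop("type", None)
        # Accept any well-formed content without validating it.
        complex_type = ET.SubElement(elem, XS + "complexType")
        sequence = ET.SubElement(complex_type, XS + "sequence")
        ET.SubElement(sequence, XS + "any", {
            "processContents": "skip",
            "minOccurs": "0",
            "maxOccurs": "unbounded",
        })
    return ET.tostring(root, encoding="unicode")


simplified = replace_content_with_any(SCHEMA, "Provider")
```

The ClaimId element that we still want to map in detail is left untouched; only the Provider subtree loses its details.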
The second step of the schema simplification is the type simplification.
For the HIPAA schemas I use two regex replacements, applied as follows.
Open your schema in XML (Text) Editor mode:
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_1.png
Press Ctrl+Shift+H (Find and Replace in Files) and check the “Use Regular Expressions” option:
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb.png
Make two replacements:
- type="X12_.*" --> type="xs:string"
- <xs:restriction base="X12_.*">.*\n.*\n.*\n.*</xs:restriction> --> <xs:restriction base="xs:string"/>
Save and close.
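For readers who prefer to script the same two replacements outside Visual Studio, here is a minimal Python sketch. It is an adaptation, not the author's procedure: it uses `[^"]*` instead of `.*` inside the quoted attribute so a match cannot run past the closing quote, and the sample fragment with its type names is made up for illustration. Like the original pattern, it assumes each restriction body spans exactly two lines between the opening and closing tags.

```python
import re

# type="X12_..." attribute on elements and attributes.
TYPE_ATTR = re.compile(r'type="X12_[^"]*"')

# A restriction based on an X12_ type, with its facets on the next two lines.
RESTRICTION = re.compile(
    r'<xs:restriction base="X12_[^"]*">.*\n.*\n.*\n.*</xs:restriction>'
)


def simplify_types(schema_text):
    """Apply the two replacements from the list above to the schema text."""
    schema_text = TYPE_ATTR.sub('type="xs:string"', schema_text)
    return RESTRICTION.sub('<xs:restriction base="xs:string"/>', schema_text)


# Made-up fragment in the style of an X12-based schema.
SAMPLE = """<xs:element name="N101" type="X12_ID_98"/>
<xs:simpleType name="X12_AN_1_60">
  <xs:restriction base="X12_AN">
    <xs:minLength value="1"/>
    <xs:maxLength value="60"/>
  </xs:restriction>
</xs:simpleType>"""

result = simplify_types(SAMPLE)
```

Work on a copy of the schema file, since the replacements cannot be undone once the file is saved.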
Open the schema again in the Schema Editor, make any small change, and undo it. The editor recalculates the type information and pops up the **Clean Up Global Data Types** window. Check all types and click OK.
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_2.png
This removes all unused global data types.
They are unused because we previously replaced every reference to them with the “xs:string” type.
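To see why those types become removable, the following Python sketch lists the named global types that nothing references any more. It is only an illustration of the idea, not how the BizTalk Schema Editor works internally. It compares local names only, looks only at `type` and `base` attributes, and the sample schema and type names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def unused_global_types(schema_xml):
    """Return the names of top-level types that no type= or base= attribute uses."""
    root = ET.fromstring(schema_xml)

    # Global types are the named simpleType/complexType children of xs:schema.
    declared = {
        child.get("name")
        for child in root
        if child.tag in (XS + "simpleType", XS + "complexType") and child.get("name")
    }

    # Collect every referenced type name, dropping the namespace prefix.
    referenced = set()
    for node in root.iter():
        for attr in ("type", "base"):
            value = node.get(attr)
            if value:
                referenced.add(value.split(":")[-1])

    return sorted(declared - referenced)


# Hypothetical schema after the regex replacements.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="N101" type="xs:string"/>
  <xs:element name="N102" type="Kept_Type"/>
  <xs:simpleType name="X12_ID_98">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:simpleType name="Kept_Type">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>"""

unused = unused_global_types(SAMPLE)
```

In this sample only `X12_ID_98` is reported, because `N101` now points at `xs:string` while `Kept_Type` is still referenced by `N102`.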
The replacement takes about 5 minutes. What is the result?
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_3.png
The modified schema is half the size of the original.
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_4.png - the DLL size with the original schema.
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_5.png - the DLL size with the modified schema.
The assembly for the modified schema is also half the size.
Not a bad result for a 5-minute job.
How do these simplified schemas change performance?
All projects with schemas and maps compile notably faster in Visual Studio. As a developer, I like this improvement.
What about the run-time performance?
I made a simple proof-of-concept project to check the performance changes.
Test project
The test solution consists of two BizTalk applications and two BizTalk Visual Studio projects. Do not do this in production projects! One Visual Studio solution should hold one and exactly one BizTalk application.
Each project contains one HIPAA schema, one very simple schema, one “complex” map (HIPAA schema to HIPAA schema), and one simple map (HIPAA schema to the very simple schema).
The first project works with the original HIPAA schema and the second project with the simplified HIPAA schema.
Build and deploy one project.
Each BizTalk application consists of a file receive location and a file send port. The receive location uses the EdiReceive pipeline to convert the text EDI documents into XML documents, so we need to add a reference to the “BizTalk EDI Application”:
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_6.png
After deployment, import the binding file, which you will find in the project folder. Create the In and Out folders and grant the necessary permissions on them. Change the folder paths in the file locations to point to your folders.
There is also a UnitTests project with several unit tests. Change the folder paths in the test code.
Run the tests.
Then delete the application, deploy the second BizTalk project, and run the tests again.
Do not deploy both projects side by side.
Performance results:
Note: Before each test, run the Backup BizTalk job to clean up the MessageBox a little.
https://gwb.blob.core.windows.net/leonidganeline/Windows-Live-Writer/2e794a7fb954_10D58/image_thumb_8.png
Tests with 1, 10, and 100 messages showed no visible difference. In my environment the difference became noticeable in the 1000-message and 3K-message batch tests. The table above shows the results for the 3K batch tests.
The performance gain is about 10%. It is not breathtaking, but it is not bad for 5 minutes of effort.
Conclusion
The schema type simplification is worth doing if the application expects sustained high loads or high peak loads, and anywhere else you want the best possible performance.
See Also
Another important place to find a huge amount of BizTalk related articles is the TechNet Wiki itself. The best entry point is BizTalk Server Resources on the TechNet Wiki.