

Removing XML elements from an input document

 

I am working on a BizTalk application that processes messages from a Point of Sale (POS) system. As it turns out, these messages contain a lot of data that we don't need, such as receipt data in HTML format embedded in an element, and a number of other elements that contain data specific to the POS system. This extra data bloated the messages, and we didn't want to incur the performance overhead of processing it. The thing to do in this situation was to strip the unneeded data before processing the message.

 

So, the first thing, of course, was to check whether something like this had already been done and whether anyone had blogged about it. Nothing. So, on the plane home, the XMLStripper pipeline component was started.

 

The first decision was that this was going to be done as a pipeline component, which was a pretty simple decision. Next, it had to process the XML in a streaming fashion, since I knew that the messages were going to be large and I didn't want to take the memory hit of loading them into the DOM. I also wanted processing to continue even if the stripping couldn't take place (which meant the removed elements still needed to be in the schema). Lastly, it needed to take a list of elements to exclude, and it needed to pass through any node type it encountered.

 

Since I needed to get a list of elements to remove, I used the property bag functionality, with the Load and Save methods (Save is shown below), to allow this list to be passed to the pipeline from the properties page in Visual Studio.

 

        public virtual void Save(Microsoft.BizTalk.Component.Interop.IPropertyBag pb, bool fClearDirty, bool fSaveAllProperties)
        {
            this.WritePropertyBag(pb, "RemoveElements", this.RemoveElements);
        }
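
The RemoveElements property travels through the property bag as a single comma-separated string, so it has to be split into individual element names before it can be used for lookups. The following is only a sketch of one way to do that. The class name RemoveElementsParser and the element names Receipt and PosData are made up for illustration and are not part of the component. It trims whitespace and drops empty entries, so a value typed as "Receipt, PosData" on the properties page still matches.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original component.
public static class RemoveElementsParser
{
    // Splits a comma-separated property value such as "Receipt, PosData"
    // into a set of element names, trimming whitespace and dropping empties.
    public static HashSet<string> Parse(string removeElements)
    {
        var names = new HashSet<string>(StringComparer.Ordinal);
        if (string.IsNullOrWhiteSpace(removeElements))
        {
            return names; // nothing configured: strip nothing
        }

        foreach (string part in removeElements.Split(','))
        {
            string name = part.Trim();
            if (name.Length > 0)
            {
                names.Add(name);
            }
        }
        return names;
    }
}
```

A set is used here instead of a list only because the lookup happens once per element in the message; for a handful of names either works.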

 

Next was the Execute method. In this method I used the XmlTextReader to get the streaming functionality that I needed. I also used a List object to hold the names of the elements that needed to be stripped from the XML message. The real work happens in the while block, where I inspect each node. If the node is an element (specifically not an EndElement), I check whether its name is contained in the List object. If it is, I flip a flag (encounteredRemovedTag), which is used in the XmlNodeType.Text and XmlNodeType.EndElement case blocks. This flag is important because if a node that needs to be stripped is encountered, all of its embedded nodes also have to be stripped.

 

In the XmlNodeType.EndElement case block I check whether the name of the end node is equal to the value stored in currentElementToBeRemoved. If so, I know that the flag (encounteredRemovedTag) needs to be flipped back so that the following nodes are processed normally.
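
To make the flag logic easier to follow outside of BizTalk, here is a minimal, self-contained sketch that applies the same idea to a string. It is not the component's code: the class name XmlStripSketch, the variable names, and the sample element names are made up for illustration, it uses XmlReader.Create rather than XmlTextReader, and it only handles element, end element, and text nodes. It also records the depth of the removed element, which is an addition of this sketch, so that a nested element with the same name does not end the skipping early.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical stand-alone illustration of the skip-flag technique.
public static class XmlStripSketch
{
    // Copies xml to a new string, omitting every element whose name is in
    // 'remove' along with everything nested inside it.
    public static string Strip(string xml, ICollection<string> remove)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            bool skipping = false;    // plays the role of encounteredRemovedTag
            string skippedName = "";  // plays the role of currentElementToBeRemoved
            int skippedDepth = -1;    // depth of the element being removed

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (skipping) break; // nested inside a removed element
                        if (remove.Contains(reader.Name))
                        {
                            // <x/> produces no EndElement node, so there is
                            // nothing to wait for and no flag to set.
                            if (!reader.IsEmptyElement)
                            {
                                skipping = true;
                                skippedName = reader.Name;
                                skippedDepth = reader.Depth;
                            }
                            break;
                        }
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();
                        break;

                    case XmlNodeType.EndElement:
                        if (!skipping)
                        {
                            writer.WriteFullEndElement();
                        }
                        else if (reader.Name == skippedName && reader.Depth == skippedDepth)
                        {
                            skipping = false; // matching close tag: resume copying
                        }
                        break;

                    case XmlNodeType.Text:
                        if (!skipping) writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

For example, calling XmlStripSketch.Strip on `<Sale><Id>1</Id><Receipt><b>x</b></Receipt><Total>5</Total></Sale>` with a list containing "Receipt" yields `<Sale><Id>1</Id><Total>5</Total></Sale>`.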

 

The rest of the case blocks process all other node types encountered and continue to build the output XML message. At the very end of the process I take the new XML message, which is now contained in the MemoryStream, set its position back to 0, assign it to the body part's Data property, and register it with the IPipelineContext ResourceTracker.
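
The flush-and-rewind step is easy to forget, so here is a small sketch of it in isolation. The class name RewindSketch and the sample element names are made up for illustration. The writer is flushed rather than closed, because closing an XmlTextWriter also closes the underlying stream, and the position is reset because writing leaves it at the end of the stream.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical illustration of handing a freshly written stream to a reader.
public static class RewindSketch
{
    // Writes a small document into a MemoryStream and returns the stream
    // positioned at the start, ready for a downstream reader.
    public static MemoryStream BuildRewoundStream()
    {
        var ms = new MemoryStream();
        // UTF8Encoding(false) avoids emitting a byte order mark.
        var writer = new XmlTextWriter(ms, new UTF8Encoding(false));

        writer.WriteStartElement("Sale");
        writer.WriteElementString("Total", "5");
        writer.WriteEndElement();

        writer.Flush();   // push buffered output into the stream
        ms.Position = 0;  // without this, a reader would see an empty stream
        return ms;
    }
}
```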

 

The code for the Execute method (only the relevant code) of the pipeline component looks like this:

 

 

        public …... Execute(……….)
        {
            try
            {
                IBaseMessagePart bodyPart = inmsg.BodyPart;
                MemoryStream ms = new MemoryStream();

                if (bodyPart != null)
                {
                    Stream originalStream = bodyPart.GetOriginalDataStream();

                    if (originalStream != null)
                    {
                        XmlTextReader Xtr = new XmlTextReader(originalStream);

                        XmlTextWriter Xtw = new XmlTextWriter(ms, Encoding.UTF8);
                        Xtw.Formatting = Formatting.Indented;

                        List<string> RemoveElementsList = new List<string>();

                        RemoveElementsList.AddRange(RemoveElements.Split(','));

                        Xtr.WhitespaceHandling = WhitespaceHandling.None;

                        bool continueProcessing = true;
                        bool encounteredRemovedTag = false;
                        string currentElementToBeRemoved = string.Empty;

                        if (Xtr.Read() == false) continueProcessing = false;

 

                        // Depth of the element currently being removed, so that a nested
                        // element with the same name does not end the skipping early.
                        int removedDepth = -1;

                        while (continueProcessing)
                        {
                            switch (Xtr.NodeType)
                            {
                                case XmlNodeType.Element:
                                    if (encounteredRemovedTag)
                                    {
                                        // Nested inside a removed element: skip it.
                                        break;
                                    }
                                    if (RemoveElementsList.Contains(Xtr.Name))
                                    {
                                        // An empty element (<x/>) produces no EndElement node,
                                        // so only set the flag when a closing tag will follow.
                                        if (!Xtr.IsEmptyElement)
                                        {
                                            currentElementToBeRemoved = Xtr.Name;
                                            removedDepth = Xtr.Depth;
                                            encounteredRemovedTag = true;
                                        }
                                        break;
                                    }
                                    Xtw.WriteStartElement(Xtr.Prefix, Xtr.LocalName, Xtr.NamespaceURI);
                                    Xtw.WriteAttributes(Xtr, true);
                                    if (Xtr.IsEmptyElement)
                                    {
                                        Xtw.WriteEndElement();
                                    }
                                    break;
                                case XmlNodeType.EndElement:
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteFullEndElement();
                                    }
                                    else if (Xtr.Name == currentElementToBeRemoved && Xtr.Depth == removedDepth)
                                    {
                                        // Closing tag of the removed element: resume normal processing.
                                        encounteredRemovedTag = false;
                                        currentElementToBeRemoved = string.Empty;
                                    }
                                    break;
                                case XmlNodeType.Text:
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteString(Xtr.Value);
                                    }
                                    break;
                                case XmlNodeType.Whitespace:
                                case XmlNodeType.SignificantWhitespace:
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteWhitespace(Xtr.Value);
                                    }
                                    break;
                                case XmlNodeType.CDATA:
                                    // CDATA inside a removed element must be dropped as well.
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteCData(Xtr.Value);
                                    }
                                    break;
                                case XmlNodeType.EntityReference:
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteEntityRef(Xtr.Name);
                                    }
                                    break;
                                case XmlNodeType.XmlDeclaration:
                                case XmlNodeType.ProcessingInstruction:
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteProcessingInstruction(Xtr.Name, Xtr.Value);
                                    }
                                    break;
                                case XmlNodeType.DocumentType:
                                    Xtw.WriteDocType(Xtr.Name, Xtr.GetAttribute("PUBLIC"), Xtr.GetAttribute("SYSTEM"), Xtr.Value);
                                    break;
                                case XmlNodeType.Comment:
                                    if (!encounteredRemovedTag)
                                    {
                                        Xtw.WriteComment(Xtr.Value);
                                    }
                                    break;
                            }
                            continueProcessing = Xtr.Read();
                        }

 

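                        Xtw.Flush();

                        // Rewind and swap in the stripped stream only when a body was actually
                        // read; otherwise the original message passes through untouched.
                        ms.Position = 0;
                        bodyPart.Data = ms;
                        pc.ResourceTracker.AddResource(ms);
                    }
                }

                return inmsg;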

            }
