Debugging Routing Failures
Okay, I disappeared for a while. There is this crazy thing called "work" which my bosses (for those of you who are confused, Scott Woodgate is not my boss :) seem to think I need to do sometimes. Some positive news is that BizTalk 2004 SP1 should be released early next year which is a great thing. When it comes out, I recommend getting it, running it through your test environments and then plopping it into your systems. There are some good things done here and I have seen it improve some performance scenarios up to 10% (although I don't gaurantee any of that just consider it another way for me to make you more likely to try it out :). We are definitely trying.
In other news, it looks like I might be talking again at teched 2005. Plans are to do more performance related talks (similar to last years expect probalby with more concrete scenarios as examples and with more lessons learned from what customers actually try to do). I am also slated to give a talk on operational health ... how to setup a system for high availability and how to monitor your system to make sure it stays healthy and what to do when certain indicators go off. If you have any other areas you think would be good talks for me to give, let me know. Always curious. :)
Okay, now to the topic I mentioned here. I am guessing that by now, if you have been using BTS for long enough, you have figured this out already, but I might as well put some information here anyways so people can find it.
First, what is a routing failure. My previous post describes how BizTalk sits on top of a pub/sub routing engine which is part of the MessageBox. When the messaging engine or orchestration engines publish a message to the messagebox, if no one is subscribing to that message, that is considered a routing failure and an event will be raised, a routing faliure report (described later) will be generated, and possibly a message / instance will be suspended. There are a couple of expections to this like for NACKs where the engine knows that routing failures are exceptable but these are only for internal messages. This "Ignore Routing Failures" type functionality is not something you can configure and while I am sure that the hunt is on now, you cannot hack this up either and it wouldn't be supported. :) Back to the real story. So how do you figure out why it failed to route??
The routing failure report is literaly a dummy message associated with a dummy instance. The only really interesting part to this message is the context. This is the context of the message which failed to route at the time of the routing failure. It is possible (probable) that the message which gets suspended will not have all of the contextual information which was there when the message failed to route since we ususally suspend the adapter's message, not the message out of the pipeline which we tried to publish. That is why we generate these reports so that you have a chance to see what really happened. If you open up the context in HAT, you can see the list of properties which were promoted. Now why didn't it route. 99.99% of the time routing failures occur is when you are in testing. They usually occur because you have a misconfigured filter on your sendport or your activation receive. The easiest way to see this is to use the subscription viewer located in the <installdir>\sdk\utilites directory. This tool is a bit rough (sorry, I am not really a UI guy :), but it gives you a view of what your subscriptions are. Ideally you have some idea of where you expected this message to go. Most subscriptions will have readable names so you can find the one associated with the sendport / orchestration you were expecting it to route to and check the subscription. Simply compare the properties which we are subscribing to to the properties which were promoted in the context. A couple of gotchas which I think are more difficult to see and not well displayed. First, you cannot route a message to 2 solicit response ports. We do not support that cause we have no idea what that means. You sent out one message but got two responses. Request response is considered a one - to -one operation. I know there are lots of scenarios for doing more, but to cleanly support those would require exposing a lot more in the engine like how many people you actually routed the message to so that you can correctly consume all of the responses. This is not something we are planning on doing anytime soon. So, you should know that a routing failure will be generated if you try to route to multiple solicit reponse ports. Another boundary case is when you try to route a previously encrypted message. The host to which the message is being routed (be it the orchestatraion's host or the sendport's host) must "own" the decryption certificate. This is because we do not consider the receive port as the destination of the message. It's job is simply to convert the message into a desired format and extract all relevant information including information required to route the message. The orchestration / sendport is the destination of the message. As such, they need to have the certificate to demonstrate that they could have decrypted the message if it hadn't already been done for them. Adding the certificate can be done in the Admin MMC via the proprties setting for the host. I am not sure if you get a different error in the eventlog for these two boundary cases. Not sure. All of these cases, though, can be debugged with the Routing Faliure report in HAT, the subscription viewer, the eventlog and a bit of knowledge of what you system is actually trying to do.
If you get routing failures in production, it is going to be something in your design. The most common cases I know of is you have a request response call out of an orchestration, but the response receive shape is in a listen with a timeout. Hence if you hit the timeout and terminate the orchestration, and then the response gets sent back, well the messaging engine could get a routing failure. In general, these types of scenarios either end up in zombies or routing failures since it is simply a race. Not all zombie scenarios cause routing faliures since it is often the case that if the instance is gone, the message might trigger the creation of a new instance (as is the case for a lot of convoy scenarios). You can read more about these subjects in earlier blog entries. In general, though, in this case, it is up to you to decide how / what you want to do with this response since the original sending service is gone. I can't really think of other scenarios where you would hit this in production. It is going to be built into your design, somewhere. Some race condition exists in your design that can cause this ... almost always because you have a listen with a timeout or perhaps a listen with a receive on both sides and both can happen (like with control terminate messages).
Hope that gives some insight. Hopefully I will get a couple more posts in this year. :) Have a happy holiday season.
Lee