Content enrichment service scaling and aggregation
WCF Routing Service with content based routing
The configuration of the content enrichment feature only supports a single web service endpoint. This can be limiting for a number of common scenarios:
- You want to integrate more than one content enrichment web service into content processing.
- Your web service is time-consuming and you want to load balance between different instances of the web service.
- You have a scaled out search topology with multiple content processing components, and you want to scale up the web service to match the load.
- You want fault tolerance for the web service.
There are several possible technologies one could consider to solve these scenarios. This blog post focuses on using the Windows Communication Foundation (WCF) Routing Service technology included in .NET Framework 4.0. You can also check out our upcoming blog post on how to deploy a network load balancing cluster for high performance and availability.
WCF Routing enables development of complex routing logic, load balancing, and fault tolerance. All of these mechanisms support our underlying requirement for scaling out, but not all of them need to be implemented for all scenarios. In this blog post we’ll look more closely at routing logic in particular.
In short, the benefits of WCF Routing are:
- Simple and quick technology to implement and deploy.
- Allows for a varying degree of routing logic complexity.
- A set of pre-defined filters that can be customized.
- Completely custom filter implementations.
- Service aggregation through routing rules that inspect managed properties.
- Fault tolerance through backup endpoints.
- Load balancing through custom filter implementations.
Search topology and WCF Routing
We’ll start by recapping some basics, and then move on to a concrete example.
A search topology can consist of anywhere from 1 to n content processing components. The role of a content processing component is to parse and transform the data coming into the system before delivering it to the indexing component. This processing takes place in discrete processing flows that can range from 0 to n instances within a specific content processing component. The number of active flow instances depends on available resources and the amount of data being crawled. As a ballpark figure, it is roughly the number of physical cores on the host multiplied by three, although there’s no guarantee that this calculation will hold in the future.
When content enrichment is enabled for the Search service application, all active flow instances will potentially call out to the configured web service endpoint for every document. Assuming a web service that adds no processing delay of its own, the web service will receive roughly the same number of calls per second as the crawl rate (documents per second) of the farm. How much of a bottleneck the web service becomes, if at all, depends on the following factors:
- The amount of resources consumed by the web service implementation.
- The hardware specification of the web service host.
- The number of calls per second.
- The size of the configured payload to send and receive between the content processing component and the web service.
  - Including network topology.
- The number of concurrent calls per second.
  - Depends on the number of content processing components and active flow instances.
The following is a simple visual representation of how a search topology with two content processing components can be configured to communicate with a single WCF Routing Service. The WCF Routing Service in turn distributes incoming requests to the appropriately registered service endpoints based on a set of defined filters and the content of the received SOAP envelope. Each service implementation has a backup endpoint that will ensure high availability in case of a failure situation. Typically a CommunicationException or TimeoutException will cause the router to try the backup endpoint.
Even though a single connector appears between nodes in the drawing, there will most likely be multiple HTTP connections at run time. The number of allowed active connections can be throttled through the service throttling subsection of the service behavior section in the web configuration file (for IIS hosting). By default the underlying connections will be persistent, which creates less overhead than re-creating an HTTP connection for every call.
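As a rough sketch of the throttling mentioned above, the following fragment adds a serviceThrottling element to a service behavior. The behavior name matches the one used in the example configuration in this post; the numeric limits are illustrative assumptions, not recommendations, and should be tuned against your own crawl rate and hardware.

```xml
<behaviors>
  <serviceBehaviors>
    <behavior name="RoutingServiceBehavior">
      <!-- Other behavior elements omitted for brevity. -->
      <!-- Assumed example limits: cap the number of messages
           processed at the same time and the number of service
           instances. Adjust after measuring real load. -->
      <serviceThrottling
        maxConcurrentCalls="64"
        maxConcurrentInstances="64" />
    </behavior>
  </serviceBehaviors>
</behaviors>
```

Setting the limits too low can make the content processing components wait on the router, so it is probably best to start generous and tighten the values once you have measurements.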
Example implementation of content based routing
There may be situations where you have different web service implementations aimed at different types of content. You can pack all of them into a single service and handle requests differently depending on content, but in other cases you may know a priori that some content will be tougher to process and that it’s desirable to dedicate a particularly beefy host to those documents. Just as important, the maintainability of your service implementations may suffer if the business logic is not kept separate. To show you how to achieve this, we’ll walk through an implementation of a WCF Routing Service where we do content-based routing predicated on the content source of an item.
Creating the WCF Routing Service
The following fictitious values are used in the example.
| Role | Value |
| --- | --- |
| Web Service 1 | servicehost1.contoso.com |
| Web Service 1 backup | servicehost2.contoso.com |
| Web Service 2 | servicehost3.contoso.com |
| Web Service 2 backup | servicehost4.contoso.com |
| WCF Routing Host | routinghost.contoso.com |
While there are different ways of implementing a WCF Routing Service, and different levels of complexity, we’ll focus on a very simple router that we can express mostly declaratively through the web configuration file. Initially you’ll need to have Internet Information Services (IIS) set up on a server and create a new site (including a new directory on your local drive for the site).
Let’s start with the web.config file and look at the different sections in it separately before tying it all together in a full example. Every section described below will be a descendant of the <system.serviceModel> node. We’ll start with the binding used by both the router’s exposed service and the clients it talks to.
Bindings
We’ve created a single basicHttpBinding where we’ve configured large values for the readerQuotas and the maxReceivedMessageSize. These values can be reduced later on once you know the limits you want to have in place. They are used to limit the allowed size and complexity of the received SOAP envelope.
<basicHttpBinding>
  <binding
    name="basicHttpBinding_IContentProcessingEnrichmentService"
    maxReceivedMessageSize="8388608">
    <readerQuotas
      maxDepth="32"
      maxStringContentLength="2147483647"
      maxArrayLength="2147483647"
      maxBytesPerRead="2147483647"
      maxNameTableCharCount="2147483647" />
    <security mode="None" />
  </binding>
</basicHttpBinding>
Services
This is where we define the endpoint that the router uses to expose itself. We will configure the content enrichment feature in SharePoint to use this endpoint through the cmdlets later. The baseAddress attribute is not required when hosting in IIS; it is included here only to make it clear which host this service is for.
<service
  behaviorConfiguration="RoutingServiceBehavior"
  name="System.ServiceModel.Routing.RoutingService">
  <host>
    <baseAddresses>
      <!-- Plain http scheme, because the binding uses security mode "None" -->
      <add baseAddress="http://routinghost.contoso.com:800" />
    </baseAddresses>
  </host>
  <endpoint
    name="RoutingServiceEndpoint"
    address=""
    binding="basicHttpBinding"
    bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
    contract="System.ServiceModel.Routing.IRequestReplyRouter" />
</service>
Clients
Here we define the endpoints to the content enrichment web service implementations that the router will route to. These are not different from a normal implementation that you host in a single-service scenario. As can be seen in the following example, we’re configuring a total of four client endpoints. These cover our two different service implementations, with an additional backup for each in case of a failure.
<client>
  <!-- Plain http scheme, because the binding uses security mode "None" -->
  <endpoint
    name="Service1"
    address="http://servicehost1.contoso.com:800/ContentEnrichmentService.svc"
    binding="basicHttpBinding"
    bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
    contract="*" />
  <endpoint
    name="Service1Backup"
    address="http://servicehost2.contoso.com:800/ContentEnrichmentService.svc"
    binding="basicHttpBinding"
    bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
    contract="*" />
  <endpoint
    name="Service2"
    address="http://servicehost3.contoso.com:800/ContentEnrichmentService.svc"
    binding="basicHttpBinding"
    bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
    contract="*" />
  <endpoint
    name="Service2Backup"
    address="http://servicehost4.contoso.com:800/ContentEnrichmentService.svc"
    binding="basicHttpBinding"
    bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
    contract="*" />
</client>
Service behavior
We need to create a service behavior where we reference the name of the filter table that will be defined in the next step. In addition, to enable full inspection of the SOAP envelopes in our XPath filters, we set the attribute routeOnHeadersOnly to false.
<behavior name="RoutingServiceBehavior">
  <routing
    filterTableName="ContentSourceFilters"
    routeOnHeadersOnly="False" />
</behavior>
Routing
Here we define the filters and the filter table where we map the filters to normal endpoints and backup endpoints. The XPath expressions look for all Property nodes in the SOAP envelope by using a predicate that specifies the name of the property and the value. This predicate is used to match against specific content sources. Keep in mind that XPath string comparison is case-sensitive, so the value in the filter must match the content source name exactly. There are various types of filters that we can use, but the XPath type is sufficient in speed and functionality for this example. To develop more complex scenarios, take a look at custom filters in the online WCF documentation.
<routing>
  <namespaceTable>
    <!-- Define prefix for Content Enrichment namespace,
         used in XPath filters. The namespace URI must be
         written on a single line, without whitespace. -->
    <add
      prefix="cc"
      namespace="http://schemas.microsoft.com/office/server/search/contentprocessing/2012/01/ContentProcessingEnrichment" />
  </namespaceTable>
  <!-- Filter definitions -->
  <filters>
    <filter
      name="Sharepoint"
      filterType="XPath"
      filterData="//cc:Property[cc:Name[. = 'ContentSource'] and cc:Value[. = 'Local Sharepoint Sites']]" />
    <filter
      name="Fileshare"
      filterType="XPath"
      filterData="//cc:Property[cc:Name[. = 'ContentSource'] and cc:Value[. = 'Large Fileshare']]" />
  </filters>
  <!-- Filter mappings -->
  <filterTables>
    <filterTable name="ContentSourceFilters">
      <add
        filterName="Sharepoint"
        endpointName="Service1"
        backupList="BackupSharepoint" />
      <add
        filterName="Fileshare"
        endpointName="Service2"
        backupList="BackupFileshare" />
    </filterTable>
  </filterTables>
  <!-- Backup lists -->
  <backupLists>
    <backupList name="BackupSharepoint">
      <add endpointName="Service1Backup" />
    </backupList>
    <backupList name="BackupFileshare">
      <add endpointName="Service2Backup" />
    </backupList>
  </backupLists>
</routing>
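The two filters above only match items from the two named content sources. Items from any other content source would match no filter, and the router would then have nowhere to send the request. One possible way to handle this, sketched below, is a catch-all filter of type MatchAll with a lower priority than the XPath filters, so that it is only consulted when neither of them matches. The filter name CatchAll and the choice of Service1 as the default target are assumptions made for illustration.

```xml
<routing>
  <filters>
    <!-- The "Sharepoint" and "Fileshare" XPath filters
         are defined here as before. -->
    <!-- Hypothetical catch-all filter: matches every message. -->
    <filter name="CatchAll" filterType="MatchAll" />
  </filters>
  <filterTables>
    <filterTable name="ContentSourceFilters">
      <!-- Higher priority values are evaluated first. -->
      <add
        filterName="Sharepoint"
        endpointName="Service1"
        backupList="BackupSharepoint"
        priority="1" />
      <add
        filterName="Fileshare"
        endpointName="Service2"
        backupList="BackupFileshare"
        priority="1" />
      <!-- Evaluated only when no priority 1 filter matches.
           Routing to Service1 is an assumed default. -->
      <add
        filterName="CatchAll"
        endpointName="Service1"
        priority="0" />
    </filterTable>
  </filterTables>
</routing>
```

The priorities matter because a request-reply router can deliver a message to only one endpoint. Without them, the catch-all filter would match alongside one of the XPath filters and the message would have more than one candidate destination.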
Web.config
It’s time to tie it all together in a single configuration. The following example uses all the previous pieces to build a complete configuration file.
<?xml version="1.0"?>
<configuration>
  <system.serviceModel>
    <bindings>
      <basicHttpBinding>
        <binding
          name="basicHttpBinding_IContentProcessingEnrichmentService"
          maxReceivedMessageSize="8388608">
          <readerQuotas
            maxDepth="32"
            maxStringContentLength="2147483647"
            maxArrayLength="2147483647"
            maxBytesPerRead="2147483647"
            maxNameTableCharCount="2147483647" />
          <security mode="None" />
        </binding>
      </basicHttpBinding>
    </bindings>
    <services>
      <service
        behaviorConfiguration="RoutingServiceBehavior"
        name="System.ServiceModel.Routing.RoutingService">
        <host>
          <baseAddresses>
            <!-- Plain http scheme, because the binding uses security mode "None" -->
            <add baseAddress="http://routinghost.contoso.com:800" />
          </baseAddresses>
        </host>
        <endpoint
          name="RoutingServiceEndpoint"
          address=""
          binding="basicHttpBinding"
          bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
          contract="System.ServiceModel.Routing.IRequestReplyRouter" />
      </service>
    </services>
    <client>
      <endpoint
        name="Service1"
        address="http://servicehost1.contoso.com:800/ContentEnrichmentService.svc"
        binding="basicHttpBinding"
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
        contract="*" />
      <endpoint
        name="Service1Backup"
        address="http://servicehost2.contoso.com:800/ContentEnrichmentService.svc"
        binding="basicHttpBinding"
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
        contract="*" />
      <endpoint
        name="Service2"
        address="http://servicehost3.contoso.com:800/ContentEnrichmentService.svc"
        binding="basicHttpBinding"
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
        contract="*" />
      <endpoint
        name="Service2Backup"
        address="http://servicehost4.contoso.com:800/ContentEnrichmentService.svc"
        binding="basicHttpBinding"
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService"
        contract="*" />
    </client>
    <behaviors>
      <serviceBehaviors>
        <behavior name="RoutingServiceBehavior">
          <routing
            filterTableName="ContentSourceFilters"
            routeOnHeadersOnly="False" />
        </behavior>
      </serviceBehaviors>
    </behaviors>
    <routing>
      <namespaceTable>
        <!-- The namespace URI must be written on a single line, without whitespace -->
        <add
          prefix="cc"
          namespace="http://schemas.microsoft.com/office/server/search/contentprocessing/2012/01/ContentProcessingEnrichment" />
      </namespaceTable>
      <filters>
        <filter
          name="Sharepoint"
          filterType="XPath"
          filterData="//cc:Property[cc:Name = 'ContentSource' and cc:Value = 'Local Sharepoint Sites']" />
        <filter
          name="Fileshare"
          filterType="XPath"
          filterData="//cc:Property[cc:Name = 'ContentSource' and cc:Value = 'Large Fileshare']" />
      </filters>
      <filterTables>
        <filterTable name="ContentSourceFilters">
          <add
            filterName="Sharepoint"
            endpointName="Service1"
            backupList="BackupSharepoint" />
          <add
            filterName="Fileshare"
            endpointName="Service2"
            backupList="BackupFileshare" />
        </filterTable>
      </filterTables>
      <backupLists>
        <backupList name="BackupSharepoint">
          <add endpointName="Service1Backup" />
        </backupList>
        <backupList name="BackupFileshare">
          <add endpointName="Service2Backup" />
        </backupList>
      </backupLists>
    </routing>
  </system.serviceModel>
</configuration>
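Because the XPath filters depend on the exact shape and content of the incoming SOAP envelope, it can help to look at a few real messages while developing them. The following sketch turns on WCF message logging for the router. The log file path and the message limit are assumptions for illustration, and logging entire messages should probably be switched off again once the filters work, since it adds overhead and writes document content to disk.

```xml
<configuration>
  <system.serviceModel>
    <diagnostics>
      <!-- Log complete envelopes as they arrive at the router. -->
      <messageLogging
        logEntireMessage="true"
        logMalformedMessages="true"
        logMessagesAtServiceLevel="false"
        logMessagesAtTransportLevel="true"
        maxMessagesToLog="100" />
    </diagnostics>
  </system.serviceModel>
  <system.diagnostics>
    <sources>
      <source name="System.ServiceModel.MessageLogging">
        <listeners>
          <!-- Hypothetical path; the application pool identity
               needs write access to this folder. -->
          <add
            name="messages"
            type="System.Diagnostics.XmlWriterTraceListener"
            initializeData="C:\logs\routing-messages.svclog" />
        </listeners>
      </source>
    </sources>
  </system.diagnostics>
</configuration>
```

These elements would be merged into the existing web.config rather than replacing it. The resulting .svclog file can be opened with the Service Trace Viewer tool to check which property names and values actually appear in the envelope.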
The service file
The markup of the service file needs to reference the RoutingService class and the Routing assembly, rather than your own implementation and assembly as you would normally do. The code part can be just an empty implementation since it won’t be used.
<%@ ServiceHost
  Language="C#"
  Debug="true"
  Service="System.ServiceModel.Routing.RoutingService, System.ServiceModel.Routing, version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
%>
Final remarks
To summarize, we’ve shown how it is possible to overcome some of the limitations of a single web service endpoint through the use of WCF Routing. The router itself is still a single point of failure, but that can be addressed with other load balancing mechanisms such as Network Load Balancing (NLB).
If you want to learn more about how to customize search with content enrichment, check out the official documentation on MSDN, and the other blog posts on content enrichment.