SP2013 Crawling *Explained: Enumeration/Discovery (Part 3b)

With VerboseEx logging enabled, the crawl of a single item (or interchangeably, a “document”) can generate more than 6000 ULS events, at times making troubleshooting a particular document analogous to finding the needle in the haystack. In this series of posts, I first described the high level orchestration for crawling a document, deep dived on the steps involved with starting a crawl, and then in part 3a covered the concept of enumeration. In this part 3b, I'll apply the concepts from part 3a to the same ULS logs, picking up where I left off in part 2. Finally, in upcoming posts, I’ll continue to show the same overall flow using key ULS events to illustrate a deep dive through an expected path when crawling an item.

Semi-unrelated side note: Since the last post, our MSDN blogs got migrated over to WordPress. Although it is a better interface on the authoring front, I haven't [taken the time to] figure out how to replicate some of the formatting from our previous tool set. That being said, I will do my best to keep colors consistent between those previous posts and this one (and those directly related going forward)... but I admit that you'll see subtle differences if comparing old-to-new posts in this series.

Recap: Starting the crawl (Part 2)

In this scenario, I first performed a full crawl, added a new document (OfficeSpace.docx), increased logging to VerboseEx, and then captured ULS while performing an incremental crawl. From the Crawl Log, we can see this item has a DocID of 1047491, which becomes a valuable data point as we progress through these ULS events.
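
As an aside, this scenario can be scripted with the standard SharePoint PowerShell cmdlets. The following is just a minimal sketch from my lab (the SSA name, Content Source name, and the GetCrawledUrls arguments are examples to adapt, not required values):

# Bump Search logging to VerboseEx (revert with Clear-SPLogLevel when done)
Set-SPLogLevel -Identity "SharePoint Server Search:*" -TraceSeverity VerboseEx

# Start the incremental crawl on the applicable Content Source
$ssa = Get-SPEnterpriseSearchServiceApplication "SSA"
$cs = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Portal_Content"
$cs.StartIncrementalCrawl()

# Once the crawl completes, look up the item in the Crawl Log
$crawlLog = New-Object Microsoft.Office.Server.Search.Administration.CrawlLog $ssa
$crawlLog.GetCrawledUrls($false, 10, "https://taxes.chotchkies.lab/Shared Documents/OfficeSpace.docx", $true, -1, -1, -1, [datetime]::MinValue, [datetime]::MaxValue)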

In part 2, we followed events in ULS by the Legacy [Crawl] Admin component (e.g. $SSA.AdminComponent) and then by the "master" Crawl Component to illustrate the detailed steps to start a crawl, which included:

  • Seeding the Crawl Queue with the start addresses
  • Notifying the Crawl Components to join this crawl

Up to this point in the series, the concept of “crawling” referred to the content source level operation (e.g. starting the crawl on Content Source ID 2).

But from here, any reference to “crawling” should typically refer to an action done on a specific item (e.g. crawling a document such as "https://foo/page.aspx").

With this crawl (CrawlID 228) now Started (e.g. Central Admin displays the state of this Content Source as “Crawling”), the Gatherer thread in the "master" Crawl Component then spawns a new thread intended to start loading batches of items from the crawl queue.

02/27/2015 18:11:12.17    mssearch.exe (0x17F8)    0x2620
SharePoint Server Search    Crawler:Gatherer Plugin    e5jl    Verbose
gatherer SQL thread 0x2620 started

02/27/2015 18:11:12.17    mssearch.exe (0x17F8)    0x2620
SharePoint Server Search    Crawler:Gatherer Plugin    e5jp    Verbose
Gatherer LoadTransactions thread running, Crawl 228

Recap: Enumeration/Discovery (Part 3a)

Going back to part 1 of this series, I simplistically defined Enumeration/Discovery as the process where the crawl component:

  • Asks the content repository for the links to items from a container in that content repository (e.g. what items do you have within this start address? ...within this folder? ...etc)
  • And then stores these emitted/discovered [child] links in the “crawl queue” (the MSSCrawlQueue table in the Crawl Store Database)

In part 3a, I attempted to provide a deep dive conceptual explanation for the enumeration process by expanding this concept of container objects and their child items. To reiterate, a container can be a folder in a file share, a root entity in a BDC model, an HTML web page seed list, a SharePoint Web Application, Site Collection, Site [web], or List/Library. When defining a SharePoint Content Source, each of the start addresses is a root container (e.g. the web application URL).

During the crawl start phase, the crawl queue gets seeded with the start addresses for this Content Source. Additionally, when starting an incremental crawl, the crawl queue also gets seeded with additional links such as items marked for re-crawl or items that failed in the previous crawl (e.g. the ACTION_MASTER_SEED_INC_CRAWL step described in part 2).

From this point, each of the Gatherers [Crawl Components] works independently by calling the stored procedure proc_MSS_GetNextCrawlBatch in the Crawl Store DB to earmark a “batch” of documents from the crawl queue. Simplistically, consider the crawl queue as a table of links-to-be-crawled ...that is, links we have discovered, but have not yet retrieved from the content repository.
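
If you're curious how deep that table is at any given moment in a lab environment, you can peek at it with a read-only query. This is only a troubleshooting sketch (the server and database names are placeholders, and querying the search databases directly is only appropriate for diagnostics, not something to build on):

# Count the links currently sitting in the crawl queue of the Crawl Store DB
Invoke-Sqlcmd -ServerInstance "SQLSERVER" -Database "SSA_CrawlStoreDB" -Query "SELECT COUNT(*) AS LinksInQueue FROM dbo.MSSCrawlQueue WITH (NOLOCK)"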

Once a batch of documents has been earmarked by a Gatherer, this is considered the start of a [search] transaction for these documents. This batch of document links is returned to the applicable Gatherer thread, which will begin retrieving each from the content repository.

Conceptually, we are spilling over into the concept of Gathering at this point ...and the reason is that the Gatherer has no actual code path that differentiates gathering an item versus enumerating a container. Although it is way beyond the scope of this particular post, I described this before in a much older post of mine about custom Search connectors:

-----------------------------------------------------
The [Gatherer thread in the] crawler logically works the *same regardless of the type of content it’s crawling – web, file shares, SP (there are a few slight *nuances for SP content), etc. The [Gatherer] just processes links (aka, “documents”), and it is up to the [protocol handler] to understand *how to talk to the content or know *what to do with that link (e.g. [to determine whether to] enumerate it using the finder() method or retrieve it using the specificFinder() ).

Thinking of everything – both containers and items – generically as a “document” and realizing the details to handle this “document” get implemented in the [protocol handler] was actually an aha! moment that helped me generalize this problem and better helped get my mind wrapped around it.
-----------------------------------------------------

In other words, enumeration is simply the by-product of gathering a container object such as a SharePoint Web Application (or folder, or HTML web page, a root entity in a BDC model, etc).

And in this part 3b, I'm keeping the focus on the ULS events related to enumeration and those events showing the crawl queue getting populated. Then, in part 4 related to "Gathering", I will dive into a bit more nuance around the various protocol handlers used to interface with different types of content repositories.

Enumeration ...a view from ULS

With this incremental crawl started and queue initially seeded, we are at the point where the new Gatherer thread calls the stored procedure proc_MSS_GetNextCrawlBatch to get a “batch” of document links from the crawl queue:

02/27/2015 18:11:12.17    mssearch.exe (0x17F8)    0x2620
SharePoint Server Search    Crawler:Gatherer Plugin    e5jp    Verbose
Gatherer LoadTransactions thread running, Crawl 228

In response to this request, we see an event for each link in the batch, such as the first few listed below:

-- The root page of the web application (*notice it starts with "http")
02/27/2015 18:11:12.17    mssearch.exe (0x17F8)    0x2620
SharePoint Server Search    Crawler:Gatherer Plugin    e5e7    VerboseEx
CCGatherer::LoadTransactionsFromCrawl, Application=SSA,
Project=Portal_Content , crawlID=228,
docID=1500, URL=https://taxes.chotchkies.lab,
dwTransactionLoaded=1

-- The web application object (*notice it starts with the SharePoint protocol for content, "sts4")
02/27/2015 18:11:12.17    mssearch.exe (0x17F8)    0x2620
SharePoint Server Search    Crawler:Gatherer Plugin    e5e7    VerboseEx
CCGatherer::LoadTransactionsFromCrawl, Application=SSA,
Project=Portal_Content, crawlID=228,
docID=1501, URL=sts4://taxes.chotchkies.lab,
dwTransactionLoaded=2

...along with several other specific items marked for re-crawl.

With these transactions loaded, the Gatherer "robot" threads can begin their work gathering these items from the content repository and then feeding them to the CPC (Content Processing Component) for processing.

As a side note:  The "PerformanceLevel" setting at the Search Service level determines the total number of robot threads that an individual Crawl Component can run. This setting also determines the default crawler impact rule (e.g. the maximum number of robot threads that can be used for any given host domain being crawled). Use the following PowerShell cmdlets to view or change the current PerformanceLevel setting:
[Get|Set]-SPEnterpriseSearchService
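
For example (a quick sketch; the valid levels are Reduced, PartlyReduced, and Maximum):

# View the current crawler performance level
(Get-SPEnterpriseSearchService).PerformanceLevel

# Change it for the Search service
Set-SPEnterpriseSearchService -PerformanceLevel "PartlyReduced"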

At this point in the ULS, no robot threads had yet been started for the SSA, so we begin to see several of them get started in messages such as:

02/27/2015 18:11:12.19 mssearch.exe (0x17F8) 0x06AC SharePoint Server Search Crawler:Gatherer Service dsg8 Verbose
robot thread 0x06AC started

Followed by the Gatherer delegating the actual task of retrieval to the Filter Daemon (mssdmn.exe):

02/27/2015 18:11:12.19 mssearch.exe (0x17F8) 0x06AC SharePoint Server Search Crawler:Gatherer Service ds86 VerboseEx
url sts4://taxes.chotchkies.lab/ data 300000 connect 300000
f561161d-2242-4fc0-a067-2b513d692bcd

From this message:

  • You have a correlation ID...
    • In this case f561161d-2242-4fc0-a067-2b513d692bcd
    • Filter the ULS on this correlation ID
  • The filtered ULS will have events from both the mssearch.exe and mssdmn.exe processes
  • But there are still several events missing - to expose those:
    • For events from mssdmn.exe, the first one effectively reports what task the Gatherer delegated to it
    • In this case:

02/27/2015 18:11:12.19 mssdmn.exe (0x1774) 0x14B0 SharePoint Server Search Connectors:SharePoint dvb0 VerboseEx
New url: sts4://taxes.chotchkies.lab/,
crawl type = incremental,
HighDateTime = 00000000, LowDateTime = 00000001 [sts3handler.cxx:279]
f561161d-2242-4fc0-a067-2b513d692bcd

  • In ULSViewer, change the filter to include events that either:
    • Match the correlation ID (f561161d-2242-4fc0-a067-2b513d692bcd)
    • OR that match the thread ID of the mssdmn.exe process (e.g. 0x14B0)
  • These resulting filtered ULS events will cover the Gathering, Feeding, and the completion of the Search transaction for this particular item:
    • In other words, all the events related to steps 1b, 1c, 2a, 2b, 3 ...and then 6b and 6c
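
As an aside, the same correlation-based filtering can also be scripted rather than done interactively in ULSViewer. Here's a minimal sketch using the standard Merge-SPLogFile cmdlet (the output path is just an example, and note this only captures events stamped with the correlation ID, so the thread ID trick above is still needed for the remaining mssdmn.exe events):

# Collect the correlation-tagged ULS events for this search transaction from all servers in the farm
Merge-SPLogFile -Path "C:\Temp\OfficeSpace_Crawl.log" -Correlation "f561161d-2242-4fc0-a067-2b513d692bcd"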

As I referenced above, this is REALLY where the concepts of gathering and enumeration cross over, so I'm intentionally cherry-picking the next events below that highlight enumeration.

In part 3a of this series, I explained that for SharePoint content, the Crawl Component interfaces with the Site Data Web Service hosted by the Web Application being crawled. In ULS, we see this event right before the web service gets called:

02/27/2015 18:11:12.20 mssdmn.exe (0x1774) 0x14B0 SharePoint Server Search Connectors:SharePoint dm0c Verbose
GetVirtualServerPolicy is about to call WS for virtual server, strVirtualServerUrl https://taxes.chotchkies.lab
f561161d-2242-4fc0-a067-2b513d692bcd

Followed by the response:

02/27/2015 18:11:12.23 mssdmn.exe (0x1774) 0x14B0 SharePoint Server Search Connectors:SharePoint ac3jn Verbose
GetContent CorrelationID 0702ee9c-39f7-00ee-fb00-520e2e2e7ed9
ObjectType VirtualServer
Url https://taxes.chotchkies.lab/_vti_bin/sitedata.asmx
Search RequestDuration 45 SPIISLatency 0 SP RequestDuration 13
f561161d-2242-4fc0-a067-2b513d692bcd

And the full SOAP response from the Site Data Web Service:

02/27/2015 18:11:12.25 mssdmn.exe (0x1774) 0x14B0 SharePoint Server Search Connectors:SharePoint dm0d VerboseEx
GetVirtualServerPolicy received WS response
<VirtualServer>
<Metadata ID="{d525ed46-79a7-4b86-8f55-917322baf1a9}" Version="15.0.4675.1000" URL="https://taxes.chotchkies.lab/" URLZone="Default" URLIsHostHeader="False" />
<ContentDatabases>
<ContentDatabase ID="{4bb174f0-b860-4848-9b06-b1b3c0a0b4d0}" />
</ContentDatabases>
<Policies AnonymousGrantMask="0" AnonymousDenyMask="0">
<PolicyUser LoginName="NT AUTHORITY\LOCAL SERVICE" BinaryIdentifier="S-1-5-19" Sid="S-1-5-19" BinaryIdentifierType="SecurityIdentifier" GrantMask="4611686224789442657" DenyMask="0" />
<PolicyUser LoginName="i:0#.w|chotchkies\crawlsvc" BinaryIdentifier="i:0).w|s-1-5-21-643199992-1059713991-1699802858-1127" Sid="" BinaryIdentifierType="UserKey" GrantMask="4611686224789442657" DenyMask="0" />
<PolicyUser LoginName="CHOTCHKIES\spwebsvc" BinaryIdentifier="S-1-5-21-643199992-1059713991-1699802858-1112" Sid="S-1-5-21-643199992-1059713991-1699802858-1112" BinaryIdentifierType="SecurityIdentifier" GrantMask="9223372036854775807" DenyMask="0" />
<PolicyUser LoginName="i:0#.w|chotchkies\spwebsvc" BinaryIdentifier="i:0).w|s-1-5-21-643199992-1059713991-1699802858-1112" Sid="" BinaryIdentifierType="UserKey" GrantMask="9223372036854775807" DenyMask="0" />
<PolicyUser LoginName="i:0#.w|chotchkies\farmsvc" BinaryIdentifier="i:0).w|s-1-5-21-643199992-1059713991-1699802858-1110" Sid="" BinaryIdentifierType="UserKey" GrantMask="9223372036854775807" DenyMask="0" />
<PolicyUser LoginName="CHOTCHKIES\farmSvc" BinaryIdentifier="S-1-5-21-643199992-1059713991-1699802858-1110" Sid="S-1-5-21-643199992-1059713991-1699802858-1110" BinaryIdentifierType="SecurityIdentifier" GrantMask="9223372036854775807" DenyMask="0" />
</Policies>
<Servers />
<AlternateDomains Count="2" Name="SharePoint Taxes Host">
<AlternateDomain>
<IncomingUrl>https://taxes.chotchkies.lab</IncomingUrl>
<UrlZone>Default</UrlZone>
<MappedUrl>https://taxes.chotchkies.lab</MappedUrl>
<RedirectUrl />
</AlternateDomain>
<AlternateDomain>
<IncomingUrl>https://offthebooks.chotchkies.lab</IncomingUrl>
<UrlZone>Custom</UrlZone>
<MappedUrl>https://offthebooks.chotchkies.lab</MappedUrl>
<RedirectUrl />
</AlternateDomain>
</AlternateDomains>
</VirtualServer>
f561161d-2242-4fc0-a067-2b513d692bcd

From this response, the SP Protocol Handler emits a link representing each Content database (e.g. the child items within the web application container). In this case from the XML above, we can see that this web application only had one Content DB:
<ContentDatabase ID="{4bb174f0-b860-4848-9b06-b1b3c0a0b4d0}" />

And from this, the Gatherer generates a link that gets inserted into a TempTable in the Crawl Store DB:

02/27/2015 18:11:12.28 mssearch.exe (0x17F8) 0x06AC SharePoint Server Search Crawler:Gatherer Plugin dw3a VerboseEx
CGatherAddLink::InsertLink: InsertRow on TempTable succeeded,
URL sts4://taxes.chotchkies.lab/contentdbid={4bb174f0-b860-4848-9b06-b1b3c0a0b4d0}, CrawlID 228, SourceDocID 1501
f561161d-2242-4fc0-a067-2b513d692bcd

Assuming no crawl rules or boundaries (e.g. for web crawls, page depth or server hops) are exceeded, the item is considered a "crawl candidate" and moved from the TempTable into the crawl queue (but that processing all occurs within the context of the Crawl Store DB).
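
Incidentally, if an emitted link unexpectedly never becomes a crawl candidate, the crawl rules are a good first thing to check. A quick sketch (assuming $ssa holds the Search Service Application as in the earlier snippet):

# List the crawl rules that candidate links get evaluated against
Get-SPEnterpriseSearchCrawlRule -SearchApplication $ssa | Format-Table Path, Type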

Shampoo, rinse... repeat (with conditioner)

Eventually, in another thread (very possibly in another Crawl Component), a Gatherer thread will pull a batch of items from the crawl queue that includes the previously emitted link above to the Content DB. However, because this is an incremental crawl, the call to the Site Data Web Service changes from GetContent() to GetChanges().

Using the same strategies above to find a robot thread picking up the link for the Content DB:

02/27/2015 18:11:12.19 mssearch.exe (0x17F8) 0x08B4 SharePoint Server Search Crawler:Gatherer Service ds86 VerboseEx
url sts4://taxes.chotchkies.lab/contentdbid={4bb174f0-b860-4848-9b06-b1b3c0a0b4d0}/ data 300000 connect 300000
e64f034a-6d40-42bd-958e-0bef8e0d2dee

Then filter the ULS on this new correlation ID, find the new thread ID of the mssdmn.exe handling this particular item...

02/27/2015 18:11:12.19 mssdmn.exe (0x1774) 0x0C34 SharePoint Server Search Connectors:SharePoint dvb0 VerboseEx
New url: sts4://taxes.chotchkies.lab/contentdbid={4bb174f0-b860-4848-9b06-b1b3c0a0b4d0}/, crawl type = incremental, HighDateTime = 00000000, LowDateTime = 00000001
e64f034a-6d40-42bd-958e-0bef8e0d2dee

...and then in ULSViewer, change the filter to include events that either:

  • Match the correlation ID (e64f034a-6d40-42bd-958e-0bef8e0d2dee)
  • OR that match the thread ID of the mssdmn.exe process (e.g. 0x0C34)

In this filtered view, we once again see an event just before a call is made to the Site Data Web Service for this Content DB using the change log cookie (e.g. see the LastChange value in the request parameters below):

02/27/2015 18:11:12.27 mssdmn.exe (0x1774) 0x0C34 SharePoint Server Search Connectors:SharePoint dm19 Verbose
GetChanges(before ws call)
URL https://taxes.chotchkies.lab,
Content Database {4bb174f0-b860-4848-9b06-b1b3c0a0b4d0}, AmountOfChanges 100,
LastChange 1;0;4bb174f0-b860-4848-9b06-b1b3c0a0b4d0;635601068426200000;665848,
CurrentChange
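
That LastChange value is just the serialized SharePoint change token persisted from the previous crawl (roughly: version; scope; content DB GUID; time in ticks; change number). If you want to see what the content database's current token looks like for comparison, a small sketch (assuming the site URL from this lab and farm-level read access):

# Show the content database's current change token (same serialized format as LastChange above)
(Get-SPContentDatabase -Site "https://taxes.chotchkies.lab").CurrentChangeToken.ToString()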

Upon completion of the request, we see:

02/27/2015 18:11:12.59 mssdmn.exe (0x1774) 0x0C34 SharePoint Server Search Connectors:SharePoint aiwts Verbose
GetChanges CorrelationID 0702ee9c-19fa-00ee-fb00-5aab8c4613ff
Url https://taxes.chotchkies.lab/_vti_bin/sitedata.asmx
Search RequestDuration 332 SPIISLatency 0 SP RequestDuration 315

Quick reminder: As a brief recap of this whole scenario, I first performed a full crawl, added a new document (OfficeSpace.docx), increased logging to VerboseEx, and then captured ULS while performing an incremental crawl... so this new document is pretty much the only change to process.

And again, we see the SOAP response for this request:

02/27/2015 18:11:12.59 mssdmn.exe (0x1774) 0x0C34 SharePoint Server Search Connectors:SharePoint allng VerboseEx
GetChanges received WS response 
<GetChangesResult>
<SPContentDatabase Change="Unchanged" ItemCount="2">
<SPSite Change="Unchanged" ItemCount="2" Id="{8f746e28-dca0-40f4-aa78-86e7db9f3ceb}">
<Messages />
<SPWeb Change="UpdateShallow" ItemCount="1" UpdateSecurity="False" UpdateSecurityScope="False" Id="{e2ea4e1e-6546-453c-b8b8-d4ffd1679ef0}" ParentId="{8f746e28-dca0-40f4-aa78-86e7db9f3ceb}" InternalUrl="/siteurl=/siteid={8f746e28-dca0-40f4-aa78-86e7db9f3ceb}/weburl=/webid={e2ea4e1e-6546-453c-b8b8-d4ffd1679ef0}/" DisplayUrl="/" ServerUrl="https://taxes.chotchkies.lab" CRC="742474233" SequenceNumber="0">
<SPList Change="Unchanged" ItemCount="1" >
<SPListItem Change="Add" ItemCount="0" UpdateSecurity="False" UpdateSecurityScope="False" Id="{9cc0ebdf-3704-408f-a7b6-b2029e0f969b}" ParentId="{a107482c-b902-441a-9863-450254f10a96}_" InternalUrl="/siteurl=/siteid={8f746e28-dca0-40f4-aa78-86e7db9f3ceb}/weburl=/webid={e2ea4e1e-6546-453c-b8b8-d4ffd1679ef0}/listid={a107482c-b902-441a-9863-450254f10a96}/folderurl=/itemid=1060" DisplayUrl="/Shared Documents/OfficeSpace.docx" ServerUrl="https://taxes.chotchkies.lab" CRC="0" SequenceNumber="1" Url="OfficeSpace.docx">
<ListItem>
...list item details and schema trimmed for brevity...
</ListItem>
</SPListItem>
</SPList>
</SPWeb>
</SPSite>
</SPContentDatabase>
<MoreChanges>False</MoreChanges>
<StartChangeId> 1;0;4bb174f0-b860-4848-9b06-b1b3c0a0b4d0;635606861984470000;665850</StartChangeId>
<EndChangeId> 1;0;4bb174f0-b860-4848-9b06-b1b3c0a0b4d0;635606861984470000;665850</EndChangeId>
</GetChangesResult>

And from this, the Gatherer emits a link for the changed item to the crawl queue:

02/27/2015 18:11:12.61 mssdmn.exe (0x1774) 0x0C34 SharePoint Server Search Connectors:SharePoint dv9u VerboseEx
Emit change log crawl link sts4://taxes.chotchkies.lab/siteurl=/siteid={8f746e28-dca0-40f4-aa78-86e7db9f3ceb}/weburl=/webid={e2ea4e1e-6546-453c-b8b8-d4ffd1679ef0}/listid={a107482c-b902-441a-9863-450254f10a96}/folderurl=/itemid=1060,
DisplayURL = https://taxes.chotchkies.lab/Shared Documents/OfficeSpace.docx,
url=sts4://taxes.chotchkies.lab/contentdbid={4bb174f0-b860-4848-9b06-b1b3c0a0b4d0} 

And you might have guessed it... eventually, in another thread (very possibly in another Crawl Component), a Gatherer thread will pull a batch of items from the crawl queue that includes the link emitted above for the document OfficeSpace.docx, at which point it will be retrieved from SharePoint, fed to the CPC, processed, and submitted to the index.
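
Once that last transaction completes, a quick way to sanity check the end-to-end result (beyond re-checking the Crawl Log) is to ask the index for the document through the SharePoint 2013 search REST endpoint. A minimal sketch (run as an account that can query the site; the filename-based query is just one way to find it):

# Query the index for the newly crawled document
$query = "https://taxes.chotchkies.lab/_api/search/query?querytext='filename:OfficeSpace.docx'&selectproperties='Title,Path'"
Invoke-RestMethod -Uri $query -UseDefaultCredentials -Headers @{ Accept = "application/json;odata=verbose" }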

In coming posts: Continuing to peel back the onion…

So far in this series, we covered:

  1. A high-level view to orchestrating the crawl
  2. A deep dive look at starting the crawl
  3. Enumeration/Discovery
    1. Concepts
    2. Illustrating through ULS                /* this post */
    3. Simulating with PowerShell

And in coming posts, we will then deep dive into each of the following areas:

  • Enumeration/Discovery: The process where the Crawl Component:
    • Asks for the links to items from the content repository (e.g. “what items do you have within this start address?”)
    • And then stores these emitted/discovered links in the “crawl queue” (the MSSCrawlQueue table in the Crawl Store Database)
  • Gathering: The process of the Crawl Component retrieving [think: downloading] the enumerated items (e.g. using links from the crawl queue)
    • Each Crawl Component independently earmarks a small sub-set of items from the MSSCrawlQueue to be gathered/processed; once earmarked, the item is considered part of a “search transaction”
  • Feeding: The process of the Crawl Component submitting the gathered items to the Content Processing Component(s)
  • Processing: The process of the Content Processing Component converting the documents into a serialized object of Managed Properties (aka: “Mars Document”)
    • The CPC will produce one Mars Document for each item gathered/fed for processing. In other words, the Mars Document is the end product from processing
  • Index Submission: The process of submitting the processed document (e.g. the Mars Document) to Index Components where it will be inserted into the index
    • Just before submission, a collection of Mars Documents get batched into a “Content Group”, which gets submitted to the Index as an atomic operation
  • Callback Handling: Following Index Submission:
    • A callback message gets returned back to Content Processing Component regarding the status of the Index Submission

    • Then, another callback gets returned from the Content Processing Component back to the Crawl Component regarding the status of processing (which implicitly includes the status from Index Submission as well – In other words, processing cannot report success if Index Submission failed)

    • Finally, the transaction for the item completes when its status is persisted to the MSSCrawlUrl table and the item gets cleared from the MSSCrawlQueue table


…to be continued :-)

Comments

  • Anonymous
    August 18, 2016
    Excellent Post. Great Learning opportunity. Looking forward to the following post on this topic!
  • Anonymous
    April 29, 2017
    Awesome informative post! You should update the links in the old posts of the series. There is no link to this 3b part. Also when is the rest of the series coming? It already spans 2 years ;-)
    • Anonymous
      June 09, 2017
      Admittedly, I got sidetracked from writing when I began building the Search Health Reports (SRx) and shifting my focus to the Cloud Hybrid SSA… That being said, through that, I’ve been trying to do knowledge sharing via tools rather than blogs. Most likely, if I’m able to carve out the time to finish that series, I’ll pivot from where the series currently stands and start diving more into the flow for the Cloud SSA (*most of what I’ve already written still applies to both – so it would largely be a look at how the Crawler processes the gathered content and pushes it to SPO).