Exploring The NetCF WebCrawler Sample III - Visual Studio 2005 Beta 2
In the first part of this series, I discussed how the WebCrawler uses an HTTP HEAD request to determine whether a URL can be crawled (that is, whether it points to an HTML page). The Visual Studio .NET 2003 version of the WebCrawler used a very simple check: the Content-Type header had to be exactly equal to "text/html".
When I was reviewing the sample for the Visual Studio 2005 Beta 2 release, I ran the WebCrawler under the debugger and noticed that some URLs were failing unexpectedly. While stepping through the Crawler.PageIsHtml method, I noticed that one of the failing sites was specifying the text encoding as part of the content type (for example, "text/html; charset=utf-8"). Because the WebCrawler required an exact match on the content type, this site was not being crawled.
To fix this, I changed the content type comparison to use String.StartsWith instead of String.Equals. After making this change, the previously failing URL was correctly identified and the WebCrawler successfully crawled the page.
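To make the difference concrete, here is a small sketch that is not part of the sample. It runs both comparisons against a header value. The class and method names are made up for illustration, and I am assuming TypeHTML holds "text/html", as the description above implies.

```csharp
using System;
using System.Globalization;

// Illustration only: these names do not appear in the WebCrawler sample.
public static class ContentTypeComparison
{
    // Assumed value of the sample's TypeHTML constant.
    private const string TypeHTML = "text/html";

    // The old behavior: the whole header must equal "text/html".
    public static bool MatchesExactly(string contentType)
    {
        string lowered = contentType.ToLower(CultureInfo.InvariantCulture);
        return lowered.Equals(TypeHTML);
    }

    // The new behavior: the header only has to begin with "text/html".
    public static bool MatchesPrefix(string contentType)
    {
        string lowered = contentType.ToLower(CultureInfo.InvariantCulture);
        return lowered.StartsWith(TypeHTML, StringComparison.Ordinal);
    }
}
```

For "text/html; charset=utf-8", MatchesExactly returns false and MatchesPrefix returns true. For a plain "text/html", both return true.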
The code fragment below is an excerpt from the Crawler.PageIsHtml method and shows the updated content type check. For a complete source listing, please consult the files that were installed with Visual Studio 2005 / .NET Framework SDK Beta 2.
// check the content type
string contentType = headers["Content-type"];
if (contentType != null)
{
    contentType = contentType.ToLower(CultureInfo.InvariantCulture);
    if (contentType.StartsWith(TypeHTML))
    {
        isHtml = true;
    }
}
For reference, the original code looked like this:
// check the content type
string contentType = headers["Content-type"];
if(contentType != null)
{
    contentType = contentType.ToLower();
    if(contentType.Equals(TypeHTML))
    {
        isHtml = true;
    }
}
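One thing to keep in mind: a prefix check also accepts values that merely begin with "text/html", such as a hypothetical "text/htmlx". If that matters for your crawler, one possible alternative is to drop everything after the first semicolon and compare only the media type. The helper below is a sketch of that idea. It is not code from the sample, and the name IsHtmlContentType is my own.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the WebCrawler sample.
public static class MediaTypeCheck
{
    public static bool IsHtmlContentType(string contentType)
    {
        if (contentType == null)
        {
            return false;
        }

        // Drop any parameters, such as "; charset=utf-8".
        int semicolon = contentType.IndexOf(';');
        string mediaType = (semicolon >= 0)
            ? contentType.Substring(0, semicolon)
            : contentType;

        // Media types are case-insensitive, so normalize before comparing.
        mediaType = mediaType.Trim().ToLower(CultureInfo.InvariantCulture);
        return mediaType == "text/html";
    }
}
```

With this approach, "text/html; charset=utf-8" and "TEXT/HTML" are both treated as HTML, while "text/htmlx" and "image/png" are not.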
Until next time,
-- DK
Disclaimer(s):
This posting is provided "AS IS" with no warranties, and confers no rights.
Some of the information contained within this post may be in relation to beta software. Any and all details are subject to change.