Exploring the NetCF WebCrawler sample (Visual Studio .NET 2003)
Every so often, I talk to people about the NetCF WebCrawler sample that shipped as part of Visual Studio .NET 2003 (\Program Files\Microsoft Visual Studio .NET 2003\CompactFrameworkSDK\v1.0.5000\Windows CE\Samples\VC#\Pocket PC\WebCrawler). One of the things I get asked is how the sample keeps track of the pages it has visited and how it knows when to stop. The WebCrawler sample handles both through the StatusCode member of its LinkInfo class.
The LinkInfo.StatusCode member is declared as type HttpStatusCode, an enum defined in the System.Net namespace. Whenever a link is discovered, the WebCrawler creates an instance of LinkInfo, stores the Url in the LinkPath member, and sets StatusCode to 0. As shown in the code fragment below, 0 indicates that no connection to this Url has been attempted yet. Two other special values are noted as well: -1 and -2, which denote application-specific cases (a generic failure and a link to non-HTML data, respectively).
NOTE: The code in this post has been edited for clarity and to reduce size. For full sample code, please consult the files that were installed with Visual Studio .NET 2003.
public class LinkInfo
{
    /// The link's address (ex: https://www.microsoft.com)
    public readonly string LinkPath;

    /// HttpStatusCode received when we attempted to
    /// connect to the link
    /// NOTE: Other possible values
    ///  0 == no connection attempted
    /// -1 == generic failure
    /// -2 == link does not point to html data
    public HttpStatusCode StatusCode;

    // constructor taking the link path and initial status; this is
    // the two-argument constructor used by the links.Add call shown
    // below (parameter names here are illustrative)
    public LinkInfo(string linkPath, HttpStatusCode statusCode)
    {
        LinkPath = linkPath;
        StatusCode = statusCode;
    }
}
Each LinkInfo object that is created by the WebCrawler (in WebCrawler.Crawler.GetPageLinks) is stored for future reference -- the WebCrawler sample uses a Hashtable for storage. Each Url found by the WebCrawler is stored only once. If the same Url is identified later in the crawl, the new instance is discarded. This helps to avoid circular crawls (a page that links to another, which links back to the first).
// add the link
links.Add(linkString, new LinkInfo(linkString, (HttpStatusCode)0));
In the above, you can see the use of 0 (not visited) as the LinkInfo.StatusCode value. As mentioned before, System.Net.HttpStatusCode is an enum, and since every enum has an underlying integral type (short, int, etc.), integer values can be cast to the enum type (in this case, HttpStatusCode).
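For illustration, here is a minimal sketch of the duplicate check described above, assuming the Hashtable is named links and is keyed by the Url string (the actual sample may structure this check differently):
// only record a Url the first time it is seen; later occurrences
// are discarded, which avoids circular crawls
if(!links.Contains(linkString))
{
    // casting 0 to HttpStatusCode marks the link as "not yet visited"
    links.Add(linkString, new LinkInfo(linkString, (HttpStatusCode)0));
}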
Later in the crawl, the discovered links are read from the Hashtable and, for each link whose StatusCode is still 0, an HTTP HEAD request is issued (in WebCrawler.Crawler.PageIsHtml):
// create the web request
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(pageAddress);
// get headers only
req.Method = "HEAD";
// make the connection
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
// read the headers
WebHeaderCollection headers = resp.Headers;
// check the content type
string contentType = headers["Content-type"];
if(contentType != null)
{
    contentType = contentType.ToLower();
    if(contentType.Equals(TypeHTML))
    {
        isHtml = true;
    }
}
// get the status code (should be 200)
status = resp.StatusCode;
// close the connection
resp.Close();
In the above example, TypeHTML is defined as "text/html".
The result of the header request (HttpWebResponse.StatusCode) is stored in the link's LinkInfo.StatusCode field. If the request fails, the exception handler (not shown) stores the value of WebException.Response.StatusCode, or -1 (generic failure) if the exception is anything other than a WebException. If the Content-Type header reports anything other than "text/html", the caller sets LinkInfo.StatusCode to -2 (defined in LinkInfo as denoting a link to non-HTML data).
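For reference, here is a minimal sketch of the behavior just described (this is illustrative only -- the variable names and structure are mine, not the sample's, and I have added a null check for the case where the WebException carries no response):
try
{
    // make the connection and record the status code
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
    status = resp.StatusCode;
    resp.Close();
}
catch(WebException we)
{
    // the server responded, but with an error status (404, 500, etc.)
    HttpWebResponse errorResp = we.Response as HttpWebResponse;
    if(errorResp != null)
    {
        status = errorResp.StatusCode;
    }
    else
    {
        // no response available (timeout, DNS failure, etc.)
        status = (HttpStatusCode)(-1);
    }
}
catch(Exception)
{
    // anything other than a WebException == generic failure
    status = (HttpStatusCode)(-1);
}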
Once it has been determined that the Url points to HTML, the crawler requests the full page, stores the StatusCode and, if successful, searches the page data for additional links (in WebCrawler.Crawler.Crawl). As you can see below, the WebCrawler sample supports links found in the a, frame, area and link HTML tags.
string pageData = "";
li.StatusCode = GetPageData(ref pageUri,
out pageData);
// if we successfully retrieved the page data
if(HttpStatusCode.OK == li.StatusCode)
{
// <a href=
GetPageLinks(pageUri,
pageData,
"a",
"href",
found);
// <frame src=
GetPageLinks(pageUri,
pageData,
"frame",
"src",
found);
// <area href=
GetPageLinks(pageUri,
pageData,
"area",
"href",
found);
// <link href=
GetPageLinks(pageUri,
pageData,
"link",
"href",
found);
}
Once the crawler finds that there are no more unvisited links (LinkInfo.StatusCode == 0) in its Hashtable, it stops.
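In other words, the stop condition amounts to scanning the Hashtable for any entry still marked with 0. A sketch of that check, with illustrative names and assuming the Hashtable maps Url strings to LinkInfo objects:
// check whether at least one discovered link has not yet been visited
bool haveUnvisitedLinks = false;
foreach(DictionaryEntry entry in masterLinks)
{
    LinkInfo li = (LinkInfo)entry.Value;
    if((HttpStatusCode)0 == li.StatusCode)
    {
        haveUnvisitedLinks = true;
        break;
    }
}
// when haveUnvisitedLinks is false, the crawl is complete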
It is important to note that the WebCrawler sample keeps a master Hashtable (not shown in the examples above) that holds all of the Urls it discovers (regardless of whether it was able to visit them successfully), as well as a temporary Hashtable used while collecting links from a given Url. This technique avoids modifying the list of discovered links while iterating through it. Once all of the links on a page have been collected, the temporary Hashtable is merged into the master Hashtable and then discarded, to be cleaned up by the garbage collector.
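A sketch of that merge step, again with illustrative names (masterLinks for the master Hashtable, foundLinks for the temporary one):
// copy newly discovered links into the master table,
// skipping any Url that has already been recorded
foreach(DictionaryEntry entry in foundLinks)
{
    if(!masterLinks.Contains(entry.Key))
    {
        masterLinks.Add(entry.Key, entry.Value);
    }
}
// foundLinks is no longer referenced after the merge and will
// be reclaimed by the garbage collector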
As we have seen, the WebCrawler uses its master Hashtable to keep track of the Urls it has found and uses the data stored in each Url's associated LinkInfo object to determine whether it has previously visited the link. As I mentioned earlier, this post touches on only a very small part of the WebCrawler sample code -- for the complete sample, please consult your Visual Studio .NET 2003 installation (\Program Files\Microsoft Visual Studio .NET 2003\CompactFrameworkSDK\v1.0.5000\Windows CE\Samples\VC#\Pocket PC\WebCrawler).
Until next time,
-- DK
Disclaimer(s):
This posting is provided "AS IS" with no warranties, and confers no rights.