Customizing the Blog Crawler for different formats
I’ve had several requests that require customizing the Blog Crawler.
The entire source code of the Blog Crawler is available, so it can be modified to crawl blogs other than https://blogs.msdn.com.
Currently, it saves the entire HTML retrieved from a blog’s URL. It converts relative links to absolute like so:
From href="/Themes/Blogs/hover/style/style.css"
To href="https://blogs.msdn.com/Themes/Blogs/hover/style/style.css"
This allows the web control to render the page with its CSS references and keeps all the links on the page live. When the page is rendered, linked resources such as CSS and images are retrieved as needed, which is fairly slow and requires an online connection.
The FoxPro version of the crawler actually saves the HTML page as an MHT file (from IE, File->Save As->Type->Web Archive, single file), which means all images and CSS are stored in the file, so no online content is retrieved and the pages render much faster. I might update the VB version to save as an MHT file, perhaps as an option. Which way would you prefer?
A blog endpoint is an actual blog post permalink. Several blog URLs are not endpoints: for example, they may be a summary of postings for the month, by category, etc.
The crawler determines if a URL is an endpoint by counting the number of “/” characters in it with one line of code. This can be changed easily to accommodate other blogs.
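Here's a minimal sketch of the relative-to-absolute conversion using the .NET System.Uri class. It is not the crawler's actual code: the module name LinkFixer and the function ToAbsoluteUrl are made up for illustration.

```vb
Imports System

Module LinkFixer
    ' Hypothetical helper: resolve an href against the URL of the page it was found on.
    ' An href that is already absolute comes back unchanged.
    Public Function ToAbsoluteUrl(pageUrl As String, href As String) As String
        Dim baseUri As New Uri(pageUrl)
        Dim resolved As New Uri(baseUri, href)
        Return resolved.AbsoluteUri
    End Function

    Sub Main()
        ' Prints https://blogs.msdn.com/Themes/Blogs/hover/style/style.css
        Console.WriteLine(ToAbsoluteUrl("https://blogs.msdn.com/calvin_hsia/", "/Themes/Blogs/hover/style/style.css"))
    End Sub
End Module
```

Letting Uri do the resolution also handles hrefs like "../style.css" or "images/pic.gif", which plain string concatenation would get wrong.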
fIsPostedEntry = cUrl.Replace("/", "").Length + 8 = cUrl.Length ' if there are 8 forward slashes, then it's a blog entry ("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")
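To adapt this test to another blog, the slash count could be pulled out into a parameter. The following is only a sketch: CountSlashes, IsPostedEntry and the expectedSlashes parameter are names I made up, and the monthly archive URL in Main is a hypothetical example of a non-endpoint.

```vb
Imports System

Module EndpointCheck
    ' Count "/" characters the same way the crawler does:
    ' compare the length before and after removing them.
    Public Function CountSlashes(url As String) As Integer
        Return url.Length - url.Replace("/", "").Length
    End Function

    ' expectedSlashes is a hypothetical per-blog setting;
    ' 8 matches permalinks on blogs.msdn.com.
    Public Function IsPostedEntry(url As String, Optional expectedSlashes As Integer = 8) As Boolean
        Return CountSlashes(url) = expectedSlashes
    End Function

    Sub Main()
        ' A permalink with 8 slashes: prints True
        Console.WriteLine(IsPostedEntry("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx"))
        ' A hypothetical monthly summary URL with 6 slashes: prints False
        Console.WriteLine(IsPostedEntry("https://blogs.msdn.com/calvin_hsia/archive/2006/05.aspx"))
    End Sub
End Module
```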
The crawler assumes that every blog entry starts with the same root URL, like “https://blogs.msdn.com/calvin_hsia”. It crawls the blog’s home page, like https://blogs.msdn.com/calvin_hsia and finds all the links with the same root and adds them to a table. If it’s an endpoint, the page is saved into the table as well.
The way the crawler parses out the published date is probably very specific to https://blogs.msdn.com:
Case "div" ' Parse out the Publish date
    If fIsPostedEntry Then
        Dim oC As Object
        oC = .Attributes.GetNamedItem("class")
        ' AndAlso short-circuits, so oC.value is never touched when oC is Nothing
        If Not oC Is Nothing AndAlso Not oC.value Is Nothing Then
            Dim cClass As String = oC.Value
            If (cClass = "postfoot" OrElse cClass = "posthead") AndAlso Not .innerText Is Nothing Then
                ' Trim first so the AM/PM index below lines up with the string being cut
                Dim cText As String = .innerText.Replace("Published", "").Trim
                If cText.Length > 0 Then
                    Try
                        ' Keep everything up to and including the AM or PM marker
                        cText = cText.Substring(0, cText.IndexOf(CStr(IIf(cText.IndexOf("AM") > 0, "AM", "PM"))) + 2)
                        dtPubDate = DateTime.Parse(cText)
                        fGotPubDate = True
                    Catch ex As Exception
                        System.Diagnostics.Debug.WriteLine("Date parse err: " + ex.Message)
                    End Try
                End If
            End If
        End If
    End If
To parse out links for endpoints, the crawler looks for “archive/2” (as in “archive/2006”) in the URL. There were some links on my blogs from comment spam which needed to be filtered out too.
Case "a" ' it's a link
    cLink = .Attributes("href").value.Replace("%5f", "_").ToString.ToLower
    If cLink.StartsWith(cBlogUrl) AndAlso cLink <> cCurrentLink Then ' if it's to the blog
        If (Not cLink.Contains("#")) AndAlso cLink.Contains("archive/2") Then ' like archive/2006
            If cLink.Contains("<") OrElse cLink.Contains("%") Then ' some comment spam
            Else
                << got good link >>
            End If
        End If
    End If
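The same link tests can be gathered into one function so they are easier to change for another blog. This is a sketch rather than the crawler's code: IsCandidateLink and its archiveMarker parameter are hypothetical names, and blogUrl is assumed to be lowercase already, because the link is lowercased before the comparison.

```vb
Imports System

Module LinkFilter
    ' Hypothetical helper: True if href points at a blog entry worth crawling.
    ' archiveMarker is an assumed per-blog setting; "archive/2" matches archive/2006 etc.
    Public Function IsCandidateLink(href As String, blogUrl As String, currentLink As String, Optional archiveMarker As String = "archive/2") As Boolean
        Dim link As String = href.Replace("%5f", "_").ToLower()
        ' Must be on the same blog and not the page being crawled
        If Not link.StartsWith(blogUrl, StringComparison.Ordinal) OrElse link = currentLink Then Return False
        ' Skip in-page anchors and anything that is not under the archive path
        If link.Contains("#") OrElse Not link.Contains(archiveMarker) Then Return False
        ' Skip comment spam with markup or leftover escapes in the URL
        If link.Contains("<") OrElse link.Contains("%") Then Return False
        Return True
    End Function

    Sub Main()
        Dim blog As String = "https://blogs.msdn.com/calvin_hsia"
        ' A normal permalink: prints True
        Console.WriteLine(IsCandidateLink("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx", blog, blog))
        ' The same permalink with an in-page anchor: prints False
        Console.WriteLine(IsCandidateLink("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx#comments", blog, blog))
    End Sub
End Module
```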
Changing the code to work with blogs other than https://blogs.msdn.com means seeing how much they differ in format and changing these areas of code.
For example, https://blogs is an internal Microsoft blogging site. It says “Posted on” rather than “Published on”, so that would need to be changed.
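One way to handle that difference is to pass the label in instead of hard-coding it. The sketch below is an assumption-heavy illustration, not the crawler's code: TryParsePubDate and its label parameter are hypothetical, the sample footer text in Main is invented, and the date is assumed to be in US English format.

```vb
Imports System
Imports System.Globalization

Module PubDateParser
    ' Hypothetical helper: label is whatever the blog prints before the date,
    ' e.g. "Published" or "Posted on".
    Public Function TryParsePubDate(text As String, label As String, ByRef pubDate As DateTime) As Boolean
        pubDate = DateTime.MinValue
        If String.IsNullOrEmpty(text) Then Return False
        Dim cText As String = text.Replace(label, "").Trim()
        ' Find the first " AM" or " PM" marker and cut the text off just after it
        Dim nAm As Integer = cText.IndexOf(" AM", StringComparison.Ordinal)
        Dim nPm As Integer = cText.IndexOf(" PM", StringComparison.Ordinal)
        Dim nEnd As Integer
        If nAm >= 0 AndAlso (nPm < 0 OrElse nAm < nPm) Then
            nEnd = nAm
        Else
            nEnd = nPm
        End If
        If nEnd < 0 Then Return False
        cText = cText.Substring(0, nEnd + 3)
        ' Assumes US English dates; TryParse avoids the Try/Catch
        Return DateTime.TryParse(cText, New CultureInfo("en-US"), DateTimeStyles.None, pubDate)
    End Function

    Sub Main()
        Dim dt As DateTime
        ' Invented sample of what a post footer might contain
        If TryParsePubDate("Posted on May 16, 2006 10:23 AM by someone", "Posted on", dt) Then
            ' Prints 2006-05-16 10:23
            Console.WriteLine(dt.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture))
        End If
    End Sub
End Module
```

With this shape, switching a crawl from “Published” to “Posted on” is a matter of changing one argument.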
Comments
- Anonymous
June 15, 2006
Hi Calvin,
I am still thinking about the post below:
http://blogs.msdn.com/calvin_hsia/archive/2006/05/11/595562.aspx
but I do not understand it. I want to run MessageBoxA in FoxPro code. Can you look into it some more?
thank you