Customizing the Blog Crawler for different formats

I’ve had several requests that require customizing the Blog Crawler.

The entire source code of the Blog Crawler is available, so it can be modified to crawl blogs other than https://blogs.msdn.com.

Currently, it saves the entire HTML retrieved from a blog’s URL. It converts relative links to absolute like so:

From href="/Themes/Blogs/hover/style/style.css"

To href="https://blogs.msdn.com/Themes/Blogs/hover/style/style.css"

This allows the web control to render the page with the CSS references as well as making all the links on the page live. When it’s rendered, links like CSS and images are retrieved as needed. This is fairly slow, and requires an online connection.
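One way to do this kind of relative-to-absolute conversion is with System.Uri. This is only a minimal sketch, not the crawler's actual code: the module and function names (LinkFixer, ToAbsoluteUrl) are made up for illustration, and it assumes cPageUrl is already a well-formed absolute URL.

```vbnet
Imports System

Public Module LinkFixer
    ' Hypothetical helper (not from the crawler source): resolve an href
    ' found on the page at cPageUrl into an absolute URL.
    ' Assumes cPageUrl is itself an absolute URL; New Uri throws otherwise.
    Public Function ToAbsoluteUrl(ByVal cPageUrl As String, ByVal cHref As String) As String
        Dim oResult As Uri = Nothing
        If Uri.TryCreate(New Uri(cPageUrl), cHref, oResult) Then
            Return oResult.AbsoluteUri
        End If
        Return cHref ' leave anything that can't be resolved untouched
    End Function
End Module
```

Called with the page URL "https://blogs.msdn.com/calvin_hsia/" and the href "/Themes/Blogs/hover/style/style.css", this returns the absolute style sheet URL. An href that is already absolute comes back unchanged.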

The FoxPro version of the crawler actually saves the HTML page as an MHT file (in IE: File->Save As->Type->Web Archive, single file). All images and CSS are stored in that file, so no online content is retrieved and the pages render much faster. I might update the VB version to save as an MHT file, perhaps as an option. Which way would you prefer?

A blog endpoint is an actual blog post permalink. Several blog URLs are not endpoints: for example, they may be a summary of postings for the month, by category, etc.

The crawler determines if it’s an endpoint by counting the number of “/” in the URL with 1 line of code. This can be changed easily to accommodate other blogs.

        fIsPostedEntry = cUrl.Replace("/", "").Length + 8 = cUrl.Length ' if there are 8 forward slashes, then it's a blog entry ("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")
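If you want to point the crawler at a blog with a different permalink depth, the slash count can be pulled out into a parameter. The sketch below is an illustration only: EndpointCheck, CountSlashes, IsPostedEntry and the nSlashes parameter are names I made up, and the default of 8 is simply the value that fits blogs.msdn.com permalinks.

```vbnet
Imports System

Public Module EndpointCheck
    ' Count the "/" characters the same way the one-liner above does:
    ' compare the length with and without them.
    Public Function CountSlashes(ByVal cUrl As String) As Integer
        Return cUrl.Length - cUrl.Replace("/", "").Length
    End Function

    ' nSlashes is a hypothetical parameter; 8 matches blogs.msdn.com permalinks.
    ' Other blog sites will likely need a different value.
    Public Function IsPostedEntry(ByVal cUrl As String, Optional ByVal nSlashes As Integer = 8) As Boolean
        Return CountSlashes(cUrl) = nSlashes
    End Function
End Module
```

With the default, "https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx" counts as an endpoint, while the blog root "https://blogs.msdn.com/calvin_hsia" (3 slashes) does not.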

The crawler assumes that every blog entry starts with the same root URL, like “https://blogs.msdn.com/calvin_hsia”. It crawls the blog’s home page, like https://blogs.msdn.com/calvin_hsia and finds all the links with the same root and adds them to a table. If it’s an endpoint, the page is saved into the table as well.

The way the crawler parses out the published date is probably very specific to https://blogs.msdn.com:

                Case "div" ' Parse out the publish date
                    If fIsPostedEntry Then
                        Dim oC As Object
                        oC = .Attributes.GetNamedItem("class")
                        ' AndAlso short-circuits, so oC.Value is never touched when oC is Nothing
                        If Not oC Is Nothing AndAlso Not oC.Value Is Nothing Then
                            Dim cClass As String = oC.Value
                            If (cClass = "postfoot" OrElse cClass = "posthead") AndAlso Not .innerText Is Nothing Then
                                ' Trim first so the index computed below applies to the same string
                                Dim cText As String = .innerText.Replace("Published", "").Trim
                                If cText.Length > 0 Then
                                    Try
                                        ' Keep everything up to and including the AM/PM marker
                                        cText = cText.Substring(0, cText.IndexOf(CStr(IIf(cText.IndexOf("AM") > 0, "AM", "PM"))) + 2)
                                        dtPubDate = DateTime.Parse(cText)
                                        fGotPubDate = True
                                    Catch ex As Exception
                                        System.Diagnostics.Debug.WriteLine("Date parse err: " + ex.Message)
                                    End Try
                                End If
                            End If
                        End If
                    End If

To find links to endpoints, the crawler looks for “archive/2” (as in “archive/2006”) in the URL. There were some links on my blogs from comment spam which needed to be filtered out too.

                Case "a" ' it's a link
                    cLink = .Attributes("href").value.Replace("%5f", "_").ToString.ToLower
                    If cLink.StartsWith(cBlogUrl) AndAlso cLink <> cCurrentLink Then ' if it's to the blog
                        If (Not cLink.Contains("#")) AndAlso cLink.Contains("archive/2") Then ' like archive/2006
                            If cLink.Contains("<") OrElse cLink.Contains("%") Then
                                ' some comment spam: ignore it
                            Else
                                ' << got good link >>
                            End If
                        End If
                    End If
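The same tests can be gathered into one function, which makes the site-specific part easier to change. This is a sketch under assumptions, not code from the crawler: LinkFilter, IsGoodLink and the cMarker parameter are hypothetical names, and "archive/2" is just the marker that works for blogs.msdn.com.

```vbnet
Imports System

Public Module LinkFilter
    ' Hypothetical helper mirroring the checks in the Case "a" block above.
    ' cMarker is an assumed parameter: the substring that identifies
    ' archive links on the target blog.
    Public Function IsGoodLink(ByVal cLink As String, ByVal cBlogUrl As String, _
                               ByVal cCurrentLink As String, _
                               Optional ByVal cMarker As String = "archive/2") As Boolean
        ' Same normalization order as the original: unescape "_" first, then lowercase
        cLink = cLink.Replace("%5f", "_").ToLower()
        ' Must point into this blog and not back at the page being crawled
        If Not cLink.StartsWith(cBlogUrl) OrElse cLink = cCurrentLink Then Return False
        ' Skip in-page anchors and anything that isn't an archive link
        If cLink.Contains("#") OrElse Not cLink.Contains(cMarker) Then Return False
        ' Leftover "<" or "%" was a sign of comment spam
        If cLink.Contains("<") OrElse cLink.Contains("%") Then Return False
        Return True
    End Function
End Module
```

Because the link is lowercased before the comparison, cBlogUrl and cCurrentLink are expected in lowercase as well.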

Changing the code to work with blogs other than https://blogs.msdn.com means seeing how much they differ in format and changing these areas of code.

For example, https://blogs is an internal Microsoft blogging site. It says “Posted on” rather than “Published on”, so that string would need to be changed.
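One way to handle such differences is to pass the label text in rather than hard-coding it. The following is a rough sketch, not the crawler's code: PubDateParser, TryParsePubDate and the aLabels parameter are made-up names, and it assumes (as the original parsing does) that the date text ends with an AM or PM marker. It takes whichever marker appears first and uses the invariant culture, which may need changing for blogs that write dates in another locale.

```vbnet
Imports System
Imports System.Globalization

Public Module PubDateParser
    ' aLabels: hypothetical list of label strings to strip before parsing,
    ' e.g. "Published" for blogs.msdn.com or "Posted on" for another site.
    Public Function TryParsePubDate(ByVal cText As String, ByVal aLabels As String(), _
                                    ByRef dtPubDate As DateTime) As Boolean
        For Each cLabel As String In aLabels
            cText = cText.Replace(cLabel, "")
        Next
        cText = cText.Trim()

        ' Find the earliest AM/PM marker; text after it (author name etc.) is dropped
        Dim nAm As Integer = cText.IndexOf("AM")
        Dim nPm As Integer = cText.IndexOf("PM")
        Dim nPos As Integer = nAm
        If nPos < 0 OrElse (nPm >= 0 AndAlso nPm < nPos) Then nPos = nPm
        If nPos < 0 Then Return False

        Return DateTime.TryParse(cText.Substring(0, nPos + 2), _
                                 CultureInfo.InvariantCulture, DateTimeStyles.None, dtPubDate)
    End Function
End Module
```

A caller would pass something like New String() {"Published", "Posted on"} along with the div's innerText, and treat a False return the same way the original treats a parse exception.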
