Do you like reading a blog author? Retrieve all blog entries locally for reading/searching using XML, XSLT, XPATH

If you like reading a particular blog and want to read more from the same author, you can subscribe to the blog using RSS and any number of blog readers, such as NewsGator or IntraVnews. This will get a few current entries (the feed is typically limited to 15 entries) and any subsequent ones, but how do you retrieve all of the blog’s archived content?

There are some sites that will allow you to search blogs, such as Technorati and Feedster. If you are a blog author, you can add a search button (like the one on this page) that uses Google to search only your own blog, although the engine might not have crawled your most recent entries yet.

I wrote a little program that takes every blog entry from a particular author and puts it into a table that stores both the HTML and the extracted plaintext.

That way, querying the table is as simple as writing a SQL SELECT statement like:

SELECT * FROM blog where ATC("CallWindowProc",blograw)>0
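For readers without the original program, here is a minimal Python sketch of the same kind of case-insensitive search, using SQLite in place of the original table. The table name `blog` and the column `blograw` come from the query above; the `url` column, the sample rows, and the `search_blog` helper are assumptions made for illustration.

```python
import sqlite3

def search_blog(conn, term):
    """Return (url, blograw) rows whose blograw contains term, ignoring case."""
    # instr() returns the 1-based position of the match, or 0 when absent.
    sql = ("SELECT url, blograw FROM blog "
           "WHERE instr(lower(blograw), lower(?)) > 0")
    return conn.execute(sql, (term,)).fetchall()

# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog (url TEXT, blograw TEXT)")
conn.executemany("INSERT INTO blog VALUES (?, ?)", [
    ("/archive/2005/01/10/a.aspx", "Subclassing a window with CallWindowProc"),
    ("/archive/2005/01/11/b.aspx", "Notes on XSLT"),
])
matches = search_blog(conn, "callwindowproc")
```

Passing the search term as a parameter rather than pasting it into the SQL string avoids quoting problems when the term contains an apostrophe.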

You can modify the code to write your own blog reader/subscriber: add code that runs periodically to retrieve any new content from the RSS. You can write a blog viewer using many techniques, such as using the IE Web Control or automating Outlook, Internet Explorer, or MSWord.

It can also be modified to retrieve content from non-blog sources, such as articles from online columnists.
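The periodic retrieval idea mentioned above could look roughly like the Python sketch below. It assumes an RSS 2.0 feed (items under `rss/channel/item`, each with `title` and `link`); the function name `new_rss_items` and the `seen_links` set are hypothetical, and scheduling the periodic run is left out.

```python
import xml.etree.ElementTree as ET

def new_rss_items(rss_xml, seen_links):
    """Return (title, link) pairs for feed items whose link is not yet known."""
    root = ET.fromstring(rss_xml)          # root element is <rss>
    fresh = []
    for item in root.iterfind("./channel/item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if link and link not in seen_links:
            fresh.append((title, link))
    return fresh
```

A caller would keep the set of links already stored in the table, call this on each freshly downloaded feed, and fetch only the entries it returns.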

How the code works:

There is a routine called GetHTML which uses the WinHTTP object to retrieve the HTML of any URL. It’s called to retrieve the HTML of the main blog page, which contains a listing of prior posts titled “Archives”.
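As an illustration of what a GetHTML-style routine does, here is a small Python sketch that uses the standard library's `urllib` instead of the WinHTTP object. The name `get_html`, the 30-second timeout, and the UTF-8 fallback are assumptions, not details of the original routine.

```python
import urllib.request

def get_html(url, timeout=30):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Example call (requires network access, so it is left commented out):
# html = get_html("https://weblogs.asp.net")
```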

Next, the code parses the HTML to put the archive hyperlinks into a table. To query the HTML for these links, it first converts the HTML to XML using the MSWord SaveAs method with wdFormatXML. Then it uses the selectSingleNode method of the XMLDOM, which uses the XPath query language to find the “Archives” section.
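The XPath step can be sketched in Python with `xml.etree.ElementTree`, which supports a subset of XPath. This assumes the page has already been converted to well-formed XML and that the “Archives” heading is an `h3` inside a `div` that also holds the links; that layout and the name `archive_links` are hypothetical, and the XML that MSWord produces will look different.

```python
import xml.etree.ElementTree as ET

def archive_links(xml_text, heading="Archives"):
    """Return (href, link text) pairs found in the section titled `heading`."""
    root = ET.fromstring(xml_text)
    # Select the first <div> that has an <h3> child whose text is the heading.
    section = root.find(f".//div[h3='{heading}']")
    if section is None:
        return []
    return [(a.get("href"), "".join(a.itertext()).strip())
            for a in section.iterfind(".//a") if a.get("href")]

# Hypothetical sample page, only to make the sketch runnable.
sample = (
    '<html><body>'
    '<div><h3>Recent Posts</h3><ul><li><a href="/post1.aspx">Post</a></li></ul></div>'
    '<div><h3>Archives</h3><ul>'
    '<li><a href="/archive/2005/01.aspx">January 2005 (12)</a></li>'
    '<li><a href="/archive/2004/12.aspx">December 2004 (8)</a></li>'
    '</ul></div>'
    '</body></html>'
)
links = archive_links(sample)
```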

For each of the monthly links, it then similarly gets the HTML and uses an XML query to find the hyperlink and HTML for each individual blog entry, which is then inserted into a table.
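The loop over monthly pages might be sketched in Python as follows, with SQLite standing in for the original table. Everything about the page layout is an assumption: each entry is taken to be a `div` with class `post` whose `h2` holds the entry's hyperlink. The `fetch_xml` argument is a hypothetical callable that returns a monthly page already converted to well-formed XML.

```python
import sqlite3
import xml.etree.ElementTree as ET

def store_entries(conn, month_urls, fetch_xml):
    """Fetch each monthly archive page and insert one row per blog entry."""
    conn.execute("CREATE TABLE IF NOT EXISTS blog "
                 "(url TEXT PRIMARY KEY, title TEXT, html TEXT)")
    for month_url in month_urls:
        root = ET.fromstring(fetch_xml(month_url))
        for post in root.iterfind(".//div[@class='post']"):
            link = post.find("./h2/a")
            if link is None or not link.get("href"):
                continue  # skip blocks without an entry hyperlink
            title = "".join(link.itertext()).strip()
            markup = ET.tostring(post, encoding="unicode")
            # The primary key on url makes a repeated run skip known entries.
            conn.execute("INSERT OR IGNORE INTO blog VALUES (?, ?, ?)",
                         (link.get("href"), title, markup))
    conn.commit()
```

Passing the fetch routine in as an argument keeps the parsing and storage logic testable without a network connection.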

Finally, the entire table is scanned to convert the HTML entry to plaintext using MSWord again, this time to just get the raw text.

The plaintext is useful for searching: often the HTML has tags and script code that you may not want to search.
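To show why the plaintext copy helps, here is a Python sketch that strips tags and drops script and style content using the standard `html.parser` module. It is a stand-in for the MSWord conversion described above, not the same technique, and the names `TextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space separates text from adjacent tags; it can also
    # split a word that is broken up by inline markup.
    return " ".join(" ".join(parser.parts).split())
```

Searching the result for a word such as "alert" then matches only entries that mention it in their text, not every page that happens to include a script.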

The code is written specifically for blogs hosted at blogs.msdn.com or https://weblogs.asp.net. Beware that there is minimal error checking. For other blogs, you may have to modify the XML queries. For example, they may not have "Archives", or the archives may not be organized by month.

The code is available here.


Comments

  • Anonymous
    January 12, 2005
    The limit of 15 only applies to v0.91 of the RSS spec, according to http://blogs.law.harvard.edu/tech/rss#comments. The current version has no limit, so it's probably the individual web sites that are limiting it (or attempting to be compatible with the old version).
  • Anonymous
    January 19, 2005
    Other info: I added an important notice to the post about Google terms and rumours about sites being removed from the index, please check it out and decide for yourself :)
  • Anonymous
    May 25, 2006
    Today’s sample shows how to create a web crawler in the background. This crawler starts with a web page,...
  • Anonymous
    June 02, 2006
    David
  • Anonymous
    July 01, 2006
    L