Webcrawl a blog to retrieve all entries locally: RSS on steroids

05/25/2006

Today’s sample shows how to create a web crawler in the background. This crawler starts with a web page, looks for all links on that page, and follows all those links. The links are filtered to my blog, but generalizing the code to search the entire web or some other site is trivial (if you have enough disk space<g>). (VB.Net version to appear soon on this blog.)

I was doing a search on my blog for “ancestors” via the Search box on the sidebar on the left, and there were no results. Strange, I thought, so I used MSN search for my site:

https://search.msn.com/results.aspx?FORM=TOOLBR&q=ancestors+site%3Ahttp%3A%2F%2Fblogs.msdn.com%2FCalvin_Hsia%2F

That search succeeded: it came up with the expected blog entry.

This incident reminded me of the fact that I’ve done a lot of work to create my blog, but I depend on a 3rd party to maintain it. There are hundreds of code samples, with links to references. If the blog server were to disappear for some reason, so would all my content. I wanted to retrieve all my blog content into a local table. Then I can manipulate it any way I want.

In particular, suppose I want to read my entire blog. I would have to do a lot of manual clicking to get to the month/day of each post, and I might still miss something because I’m crawling manually. That’s pretty cumbersome. A local copy also makes the whole blog available while offline, with updates picked up whenever I’m connected.

So I wrote a code sample below that crawls my blog, looking for all the blog posts, and shows them in a form which has search capability. Because it’s all local, searching and navigating from post to post is extremely fast. The entry is displayed in a web control, so the page looks just like it would online and the hyperlinks are all live.

You can start a web crawl by pushing the Crawl button. You can interrupt the web crawl by typing ‘Q’ (<esc> will cancel the automation of the IE SaveAs dialog). The next time the crawl runs, it will resume where it left off. Crawling acts as if you were subscribed to my blog via RSS: once you have all current content, crawling again later will just add any new content. The saved content is the entire blog entry web page, including any comments. As an exercise, readers are encouraged to make the web crawling execute on a background thread!
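The resume behavior described above comes down to keeping the list of links, and a followed flag for each, in a persistent table. Here is a minimal sketch of that idea in Python with SQLite, not the FoxPro table this post uses. The table name `links`, its columns, and the helper function names are all assumptions made for illustration.

```python
import sqlite3


def open_table(path=":memory:"):
    """Open (or create) the link table. `links` and its columns are hypothetical names."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "url TEXT PRIMARY KEY, followed INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def add_link(db, url):
    """Insert a link only if it is not already in the table."""
    db.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))
    db.commit()


def next_unfollowed(db):
    """Return one link that has not been followed yet, or None when the crawl is done."""
    row = db.execute(
        "SELECT url FROM links WHERE followed = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_followed(db, url):
    """Record that a page has been loaded and parsed."""
    db.execute("UPDATE links SET followed = 1 WHERE url = ?", (url,))
    db.commit()
```

Because every change is committed as it happens, an interrupted crawl can reopen the same file and continue with whatever `next_unfollowed` returns. Links added later for new posts show up as unfollowed rows in the same way.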

A crawl starts at the main page https://blogs.msdn.com/Calvin_Hsia, which shows any new content and has links on the side bar for any other posts. The page is loaded and then parsed for any links. Any links pointing to my blog are inserted into a table if they’re not there already. Then the table is scanned for any unfollowed links and the process repeats. If a page is a leaf node (currently any link with 8 forward slashes), then the Publication date is parsed, and the file is saved in the MHT field in the table. The link parsing was a little complicated due to some comment-spam reduction measures and some broken links when the blog host server switched software.
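The crawl loop just described can be sketched in a few lines. This is a rough Python illustration, not the FoxPro code of this post: a dictionary stands in for the table, and `get_links` and `save_entry` are hypothetical callables supplied by the caller (one returns the absolute links found on a page, the other stores a leaf page). The leaf test counts forward slashes, since an entry URL such as https://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx contains 8 of them.

```python
BLOG_PREFIX = "https://blogs.msdn.com/calvin_hsia"  # compared case-insensitively


def is_leaf(url):
    """Treat a URL with 8 forward slashes as a single blog entry (the post's heuristic)."""
    return url.count("/") == 8


def crawl(start_url, get_links, save_entry, table=None):
    """Follow blog links until the table holds no unfollowed link.

    get_links(url)  -> iterable of absolute URLs found on that page (hypothetical)
    save_entry(url) -> called once for each leaf page (hypothetical)
    table maps url -> followed flag and stands in for the local table.
    """
    if table is None:
        table = {}
    table.setdefault(start_url, False)
    while True:
        # Scan the table for links that have not been followed yet.
        pending = [url for url, followed in table.items() if not followed]
        if not pending:
            return table
        for url in pending:
            for link in get_links(url):
                # Keep only links into the blog that are not in the table already.
                if link.lower().startswith(BLOG_PREFIX) and link not in table:
                    table[link] = False
            if is_leaf(url):
                save_entry(url)
            table[url] = True
```

Passing in a table left over from an earlier run makes the loop pick up where that run stopped, since only rows still marked unfollowed are visited.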

You will probably have to modify the code if you want to do the same for other blogs. For example, some blogs may have the Publication date in a different place. Others may have archive links elsewhere or in a different format.
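As one example of where the date might live on another blog: archive URLs like the ones here carry year/month/day in the path, so a crawler could read the date from the link itself. The Python sketch below shows that alternative. It is an assumption for illustration and not how the code in this post finds the Publication date; the function name and pattern are hypothetical.

```python
import re
from datetime import date

# Matches paths such as /archive/2004/06/28/ (pattern is an assumption for illustration).
ARCHIVE_DATE = re.compile(r"/archive/(\d{4})/(\d{2})/(\d{2})/")


def date_from_url(url):
    """Return the date embedded in an archive URL, or None if there is none."""
    match = ARCHIVE_DATE.search(url)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that do not form a real calendar date, e.g. /archive/2004/13/40/
        return None
```

A blog with a different URL scheme would need a different pattern, or would have to fall back to parsing the page content.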

I experimented with using HTTPGet:

* Fetch one blog entry with WinHTTP, save it to a temp file, and show it in IE
cTempFile=ADDBS(GETENV("TEMP"))+SYS(3)+".htm"
LOCAL oHTTP as "winhttp.winhttprequest.5.1"
LOCAL cHTML
oHTTP=NEWOBJECT("winhttp.winhttprequest.5.1")
oHTTP.Open("GET","https://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx",.f.)	&& .f. = synchronous request
oHTTP.Send()
STRTOFILE(oHTTP.ResponseText,cTempFile)	&& only the raw HTML is saved
oIE=CREATEOBJECT("InternetExplorer.Application")
oIE.Visible=1
oIE.Navigate(cTempFile)

But the content looked pretty bad, because of the CSS references, pictures, etc.

Being able to automate IE was helpful, but how do you parse the HTML for the links to each blog entry? I thought about using an XSLT, but that was fairly complex. Instead, I used the IE Document model, IHTMLDocument, to search through the HTML nodes for links.
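For readers without IE automation at hand, the same link search can be approximated with an ordinary HTML parser. The Python sketch below uses the standard library's `html.parser` as a stand-in for the IE Document model; it is not the code of this post. The class and function names are hypothetical. It resolves relative links against the page URL, drops fragments such as `#comments`, and keeps only links under a given blog prefix.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL without its fragment."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def blog_links(html, base_url, prefix):
    """Return the distinct links on a page that point into the blog, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    found = []
    for link in collector.links:
        if link.lower().startswith(prefix.lower()) and link not in found:
            found.append(link)
    return found
```

The prefix comparison is case-insensitive because the blog address appears both as Calvin_Hsia and calvin_hsia in its links.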

IE has a feature that saves a web page to a single file: Web Archive, single file (*.mht), from the File->SaveAs menu option. So I used Windows Script Host to automate this feature.

Making the code run in a background thread is trivial: just use the ThreadClass from here.

Comments

Anonymous
May 26, 2006
Calvin,

That's really beautiful code... Excellent pointers for diverse applications deployed with Visual FoxPro. Thanks for commenting/posting it!!
Anonymous
May 30, 2006
I've updated the VFP MT example based on this (at http://codegallery.gotdotnet.com/SednaY) to be sort of like .NET:
Example Use:
* t=CREATEOBJECT('testserver.thread')
* t.start(5,"do c:MTmyVFPMyThreadFunc WITH p2")
* && start method params:(1)#threads,(2)VFP code to MT,(3)Silent mode
* ?t.check && returns .T. if completed
* t=null && cleanup
Simple, fast, efficient - it is VFP!!
Anonymous
June 06, 2006
I wanted to update a couple zip files of the VB version of my Blog Crawler (to be posted soon) with the...
Anonymous
June 12, 2006
This is the VB.Net 2005 version of the Blog Crawler. It’s based on the Foxpro version, but it uses SQL...
Anonymous
June 14, 2006
The EventHandler function allows you to connect some code to an object’s event interface. For example,...
Anonymous
July 05, 2006
Sometimes you run a program and you don’t want it to show any dialogs or User Interface at all. For example,...
Anonymous
July 11, 2006
Calvin has written a blog crawler with both VFP and VB.NET versions that allows you to back up your own...
Anonymous
January 31, 2007
I've updated this VFP Web Crawler to more closely match the VB.Net version. Check it out at: http://www.codeplex.com/vfpwebcrawler All source is included...
Anonymous
August 17, 2007
SQLExpress is free and comes with Visual Studio, but the sample Northwind database isn’t included. You
Anonymous
August 24, 2007
I updated a version of this code to include an easy to use VFP project and -ability to specify number of threads -better switching between blogs -debug option to make crawling visible See VFPWebCrawler 2.0 at: http://www.codeplex.com/VFPWebcrawler
Anonymous
November 10, 2007
I spent a few hours at a local company called 2Bot ( http://www.2bot.com/ ) which makes a 3-D printer
Anonymous
May 15, 2008
I received a question: Simply, is there a way of interrupting a vfp sql query once it has started short
