Webcrawl a blog to retrieve all entries locally: RSS on steroids
Today’s sample shows how to create a web crawler that runs in the background. The crawler starts with a web page, looks for all links on that page, and follows those links. The links are filtered to my blog, but generalizing the code to crawl the entire web or some other site is trivial (if you have enough disk space <g>). (A VB.Net version will appear soon on this blog.)
I was doing a search on my blog for “ancestors” via the Search box on the sidebar on the left, and there were no results. Strange, I thought, so I used MSN search for my site:
That search succeeded: it came up with the expected blog entry.
This incident reminded me that, although I’ve done a lot of work to create my blog, I depend on a third party to maintain it. There are hundreds of code samples, with links to references. If the blog server were to disappear for some reason, so would all my content. I wanted to retrieve all my blog content into a local table, where I can manipulate it any way I want.
In particular, suppose I want to read my entire blog. I would have to do a lot of manual clicking to get to the month/day of each post, and I might miss something because I’m crawling by hand. That’s pretty cumbersome. With a local copy, I can also read the entire blog while offline and update it when connected.
So I wrote a code sample below that crawls my blog, looking for all the blog posts, and shows them in a form which has search capability. Because it’s all local, searching and navigating from post to post is extremely fast. The entry is displayed in a web control, so the page looks just like it would online and the hyperlinks are all live.
You can start a web crawl by pushing the Crawl button. You can interrupt the web crawl by typing ‘Q’ (<esc> will cancel the automation of the IE SaveAs dialog). The next time the crawl runs, it will resume where it left off. Crawling acts as if you were subscribed to my blog via RSS. Once you have all current content, Crawling again later will just add any new content. The saved content is the entire blog entry web page, including any comments. As an exercise, readers are encouraged to make the web crawling execute on a background thread!
A crawl starts at the main page https://blogs.msdn.com/Calvin_Hsia, which shows any new content and has sidebar links to all other posts. The page is loaded and then parsed for links. Any links pointing to my blog are inserted into a table if they’re not there already. Then the table is scanned for any unfollowed links, and the process repeats. If a page is a leaf node (currently, any link containing 8 forward slashes), then the publication date is parsed and the page is saved in the MHT field of the table. The link parsing was a little complicated due to some comment-spam reduction measures and some broken links from when the blog host server switched software.
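The crawl loop just described (scan the link table for unfollowed links, fetch each one, insert any newly discovered links, and repeat until nothing is left unfollowed) can be sketched in a few lines. Here is a minimal Python illustration; the `fetch_links` stub and the tiny in-memory site graph are made-up stand-ins for loading a real page and parsing its anchors:

```python
# Sketch of the crawl loop: a link table with a "followed" flag is
# rescanned until no unfollowed links remain. SITE is a hypothetical
# in-memory site graph standing in for real page fetches.

SITE = {
    "main": ["a", "b"],
    "a": ["b", "c"],
    "b": ["main"],
    "c": [],
}

def fetch_links(url):
    # Stand-in for: load the page, parse it, return the links it contains.
    return SITE[url]

def crawl(start):
    table = {start: False}            # link -> followed flag
    done = False
    while not done:
        done = True
        for link in list(table):      # scan the table for unfollowed links
            if not table[link]:
                table[link] = True    # mark followed before fetching
                done = False
                for new in fetch_links(link):
                    table.setdefault(new, False)  # insert if not already there
    return sorted(table)

print(crawl("main"))   # every reachable page is visited exactly once
```

Because the "followed" flag lives in the table, interrupting and re-running the loop resumes where it left off, which is exactly how the VFP version below restarts a crawl.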
You will probably have to modify the code if you want to do the same for other blogs. For example, some blogs may have the Publication date in a different place. Others may have archive links elsewhere or in a different format.
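On this blog, the publication date appears as a string like "Friday, April 01, 2005 11:30 AM by Calvin_Hsia" (the format the ToDateTime routine below parses by hand). As a rough illustration of the same parse in Python, where strptime replaces the manual month-table arithmetic:

```python
# Sketch: parse the blog's publication-date string. The sample string
# matches the format shown in the VFP ToDateTime routine below.
from datetime import datetime

def to_datetime(text):
    # Strip the trailing " by Author" part, then parse the timestamp.
    stamp = text.split(" by ")[0]
    return datetime.strptime(stamp, "%A, %B %d, %Y %I:%M %p")

dt = to_datetime("Friday, April 01, 2005 11:30 AM by Calvin_Hsia")
print(dt)   # 2005-04-01 11:30:00
```

Other blogs would need a different format string here, just as they would need different parsing in the VFP code.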
I experimented with retrieving the raw HTML via an HTTP GET using WinHTTP:
cTempFile=ADDBS(GETENV("TEMP"))+SYS(3)+".htm"
LOCAL oHTTP as "winhttp.winhttprequest.5.1"
LOCAL cHTML
oHTTP=NEWOBJECT("winhttp.winhttprequest.5.1")
oHTTP.Open("GET","https://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx",.f.)
oHTTP.Send()
STRTOFILE(oHTTP.ResponseText,cTempFile)
oIE=CREATEOBJECT("InternetExplorer.Application")
oIE.Visible=1
oIE.Navigate(cTempFile)
But the content looked pretty bad, because of the CSS references, pictures, etc.
Being able to automate IE was helpful, but how do you parse the HTML for the links to each blog entry? I thought about using an XSLT, but that was fairly complex. Instead I used the IE document object model (IHTMLDocument) to search through the HTML nodes for links.
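The same link extraction can be sketched without the IE DOM. Here is a Python illustration using the standard library's html.parser, applying the same filters the ProcessNodes routine below uses (links must point at the blog, contain "archive/2", and have no "#" fragment); the sample HTML is made up for the example:

```python
# Sketch: extract <a href> links from HTML and keep only those that pass
# the blog-entry filters, similar to what ProcessNodes does via the IE DOM.
from html.parser import HTMLParser

BLOG = "https://blogs.msdn.com/calvin_hsia/"

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Same normalization as the VFP code: lowercase, un-escape "_".
            href = href.lower().replace("%5f", "_")
            if href.startswith(BLOG) and "#" not in href and "archive/2" in href:
                self.links.append(href)

html = '''
<a href="https://blogs.msdn.com/calvin_hsia/archive/2005/08/09/449347.aspx">post</a>
<a href="https://example.com/other">elsewhere</a>
<a href="https://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx#feedback">comments</a>
'''
p = LinkCollector()
p.feed(html)
print(p.links)   # only the first link survives the filters
```

The "#" test drops same-page anchors such as comment-feedback links, and the "archive/2" test keeps only dated post URLs, mirroring the checks in the VFP listing.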
IE has a feature that saves a web page to a single file: Web Archive, single file (*.mht), via the File->SaveAs menu option. So I used Windows Scripting Host to automate this feature.
Making the code run in a background thread is trivial: just use the ThreadClass from here.
See also:
Write your own RSS News/Blog aggregator in <100 lines of code
Use a simple XSLT to read the RSS feed from a blog
Generating VBScript to read a blog
CLEAR ALL
CLEAR
#define WAIT_TIMEOUT 258
#define ERROR_ALREADY_EXISTS 183
#define WM_USER 0x400
SET EXCLUSIVE OFF
SET SAFETY OFF
SET ASSERTS ON
PUBLIC oBlogForm as Form
oBlogForm=CREATEOBJECT("BlogForm","blogs.msdn.com/Calvin_Hsia")
oBlogForm.Visible=1
DEFINE CLASS BlogForm AS Form
Height=_screen.Height-80
Width = 900
AllowOutput=0
left=170
cBlogUrl=""
oThreadMgr=null
ADD OBJECT txtSearch as textbox WITH width=200
ADD OBJECT cmdSearch as CommandButton WITH left=210,caption="\<Search"
ADD OBJECT cmdCrawl as CommandButton WITH left=310,caption="\<Crawl"
ADD OBJECT cmdQuit as CommandButton WITH left=410,caption="\<Quit"
ADD OBJECT oGrid as Grid WITH ;
width = thisform.Width,;
top=20,;
ReadOnly=1,;
Anchor=15
ADD OBJECT oWeb as cWeb WITH ;
top=230,;
height=thisform.Height-250,;
width = thisform.Width,;
Anchor=15
ADD OBJECT lblStatus as label WITH top = thisform.Height-18,width = thisform.Width,anchor=4,caption=""
PROCEDURE Init(cUrl as String)
this.cBlogUrl=cUrl
IF !FILE("blogs.dbf")
CREATE table Blogs(title c(250),pubdate t,link c(100),followed i, Stored t,mht m)
INDEX on link TAG link
INDEX on pubdate TAG pubdate DESCENDING
INSERT INTO Blogs (link) VALUES (cUrl) && jump start the table with a link
INSERT INTO blogs (link) VALUES ('https://blogs.msdn.com/vsdata/archive/2004/03/18/92346.aspx') && early blogs
INSERT INTO blogs (link) VALUES ('https://blogs.msdn.com/vsdata/archive/2004/03/31/105159.aspx')
INSERT INTO blogs (link) VALUES ('https://blogs.msdn.com/vsdata/archive/2004/04/05/107986.aspx')
INSERT INTO blogs (link) VALUES ('https://blogs.msdn.com/vsdata/archive/2004/05/12/130612.aspx')
INSERT INTO blogs (link) VALUES ('https://blogs.msdn.com/vsdata/archive/2004/06/16/157451.aspx')
ENDIF
USE blogs SHARED && reopen shared
this.RequeryData()
this.RefrGrid
PROCEDURE RequeryData
LOCAL cTxt, cWhere
cTxt=ALLTRIM(thisform.txtSearch.value)
cWhere= "!EMPTY(mht)"
IF LEN(cTxt)>0
cWhere=cWhere+" and ATC(cTxt, mht)>0"
ENDIF
SELECT * FROM blogs WHERE &cWhere ORDER BY pubdate DESC INTO CURSOR Result
thisform.lblStatus.caption="# records ="+TRANSFORM(_tally)
WITH this.oGrid
.RecordSource= "Result"
.Column1.FontSize=14
.Column1.Width=this.Width-120
.RowHeight=25
ENDWITH
thisform.refrGrid
PROCEDURE RefrGrid
cFilename=ADDBS(GETENV("temp"))+SYS(3)+".mht"
STRTOFILE(mht,cFilename)
thisform.oWeb.Navigate(cFilename)
PROCEDURE oGrid.AfterRowColChange(nColIndex as Integer)
IF this.rowcolChange=1 && row changed
thisform.RefrGrid
ENDIF
PROCEDURE cmdQuit.Click
thisform.Release
PROCEDURE cmdCrawl.Click
thisform.txtSearch.value=""
fBackgroundThread=.t. && if you want to run on background thread
IF this.Caption = "\<Crawl"
thisform.lblStatus.caption= "Blog crawl start"
CreateCrawlProc()
IF fBackgroundThread
this.Caption="Stop \<Crawl"
*Get ThreadManager from https://blogs.msdn.com/calvin_hsia/archive/2006/05/23/605465.aspx
thisform.oThreadMgr=NEWOBJECT("ThreadManager","threads.prg")
thisform.oThreadMgr.CreateThread("MyThreadFunc",thisform.cBlogUrl,"oBlogForm.CrawlDone")
thisform.lblStatus.caption= "Background Crawl Thread Created"
ELSE
LOCAL oBlogCrawl
oBlogCrawl=NEWOBJECT("BlogCrawl","MyThreadFunc.prg","",thisform.cBlogUrl) && the class def resides in MyThreadFunc.prg
thisform.CrawlDone
ENDIF
ELSE
this.Caption="\<Crawl"
IF fBackgroundThread AND TYPE("thisform.oThreadMgr")="O"
thisform.lblStatus.caption= "Attempting thread stop"
thisform.oThreadMgr.SendMsgToStopThreads()
ENDIF
ENDIF
PROCEDURE CrawlDone
thisform.oThreadMgr=null
thisform.cmdCrawl.caption="\<Crawl"
thisform.lblStatus.caption= "Crawl done"
this.RequeryData()
PROCEDURE cmdSearch.Click
thisform.RequeryData
PROCEDURE destroy
IF USED("result")
USE IN result
ENDIF
SELECT Blogs
SET MESSAGE TO
SET FILTER TO
SET ORDER TO LINK && LINK
ENDDEFINE
DEFINE CLASS cweb as olecontrol
oleclass="shell.explorer.2"
PROCEDURE refreshxxx
NODEFAULT
PROCEDURE TitleChange(cText as String)
thisform.caption=cText
PROCEDURE Navigatecomplete2(pDisp AS VARIANT, URL AS VARIANT) AS VOID
IF url=GETENV("TEMP")
ERASE (url)
ENDIF
ENDDEFINE
PROCEDURE CreateCrawlProc as String && Create the Thread proc, which includes the crawling class
TEXT TO cstrVFPCode TEXTMERGE NOSHOW && generate the task to run: MyThreadFunc
**************************************************
**************************************************
PROCEDURE MyThreadFunc(p2) && p2 is the 2nd param to MyDoCmd
TRY
DECLARE integer GetCurrentThreadId in WIN32API
DECLARE integer PostMessage IN WIN32API integer hWnd, integer nMsg, integer wParam, integer lParam
cParm=SUBSTR(p2,AT(",",p2)+1)
hWnd=INT(VAL(p2))
oBlogCrawl=CREATEOBJECT("BlogCrawl",cParm)
CATCH TO oEx
DECLARE integer MessageBoxA IN WIN32API integer,string,string,integer
MESSAGEBOXA(0,oEx.details+" "+oEx.message,TRANSFORM(oex.lineno),0)
ENDTRY
PostMessage(hWnd, WM_USER, 0, GetCurrentThreadId())
DEFINE CLASS BlogCrawl as session
oWeb=0
oWSH=0
fStopCrawl=.f.
hEvent=0
cMonths="January   February  March     April     May       June      July      August    September October   November  December  " && each month padded to 10 chars for the ToDateTime arithmetic
cCurrentLink=""
PROCEDURE init(cBlogUrl)
LOCAL fDone,nRec,nStart
nStart=SECONDS()
DECLARE integer CreateEvent IN WIN32API integer lpEventAttributes, integer bManualReset, integer bInitialState, string lpName
DECLARE integer CloseHandle IN WIN32API integer
DECLARE integer WaitForSingleObject IN WIN32API integer hHandle, integer dwMilliseconds
DECLARE integer GetLastError IN WIN32API
this.hEvent = CreateEvent(0,0,0,"VFPAbortThreadEvent") && Get the existing event
IF this.hEvent = 0
THROW "Creating event error:"+TRANSFORM(GetLastError())
ENDIF
?"Start Crawl"
DECLARE integer GetWindowText IN WIN32API integer, string @, integer
DECLARE integer Sleep IN WIN32API integer
this.oWeb=CREATEOBJECT("InternetExplorer.Application")
this.oWeb.visible=1
this.oweb.top=0
this.oweb.left=0
this.oweb.width=500
this.oWSH=CREATEOBJECT("Wscript.Shell")
USE blogs ORDER 1
REPLACE link WITH cBlogUrl, followed WITH 0 && set flag to indicate this page needs to be retrieved and crawled
this.fStopCrawl=.f.
fDone = .f.
DO WHILE !fDone AND NOT this.fStopCrawl
fDone=.t.
GO TOP
SCAN WHILE NOT this.fStopCrawl
nRec=RECNO()
IF followed = 0
REPLACE followed WITH 1
this.BlogCrawl(ALLTRIM(link))
IF this.fStopCrawl
GO nRec
REPLACE followed WITH 0 && restore flag
ENDIF
fDone = .f.
ENDIF
ENDSCAN
ENDDO
?"Done Crawl",SECONDS()-nStart
PROCEDURE BlogCrawl(cUrl)
LOCAL fGotUrl,cTitle
fGotUrl = .f.
DO WHILE !fGotUrl && loop until we've got the target url in IE with no Error
this.oweb.navigate2(cUrl)
DO WHILE this.oweb.ReadyState!=4
?"Loading "+cUrl
Sleep(1000) && yield processor
IF this.IsThreadAborted()
this.fStopCrawl=.t.
?"Aborting Crawl"
RETURN
ENDIF
ENDDO
cTitle=SPACE(250)
nLen=GetWindowText(this.oWeb.HWND,@cTitle,LEN(cTitle))
cTitle=LEFT(cTitle,nLen)
IF EMPTY(cTitle) OR UPPER(ALLTRIM(cTitle))="ERROR" OR ("http"$LOWER(cTitle) AND "400"$cTitle)
?"Error retrieving ",cUrl," Retrying"
ELSE
fGotUrl = .t.
ENDIF
ENDDO
this.cCurrentLink=cUrl
IF OCCURS("/",cUrl)=8 &&https://blogs.msdn.com/calvin_hsia/archive/2005/08/09/449347.aspx
cMht=this.SaveAsMHT(cTitle) && save the page before we parse
IF this.fStopCrawl
RETURN .f.
ENDIF
REPLACE title WITH STRTRAN(STRTRAN(cTitle," - Microsoft Internet Explorer",""),"Calvin Hsia's WebLog : ",""),;
mht WITH cMht,Stored WITH DATETIME()
IF EMPTY(title) && for some reason, the page wasn't retrieved
REPLACE followed WITH 0
ENDIF
ENDIF
?"Parsing HTML"
this.ProcessNodes(this.oWeb.Document,0) && Recur through html nodes to find links
?"Done Parsing HTML"
PROCEDURE ProcessNodes(oNode,nLev) && recursive routine to look through HTML
LOCAL i,j,dt,cClass,oC,cLink
IF this.IsThreadAborted() OR nLev > 30 && limit recursion levels
RETURN
ENDIF
WITH oNode
DO CASE
CASE LOWER(.NodeName)="div" && look for pub date
IF OCCURS("/",this.cCurrentLink)=8 && # of forward slashes in a blog leaf entry link
oC=.Attributes.GetnamedItem("class")
IF !ISNULL(oC) AND !EMPTY(oC.Value)
cClass=oC.Value
IF cClass="postfoot" OR cClass = "posthead"
cText=ALLTRIM(STRTRAN(.innerText,"Published",""))
IF !EMPTY(cText)
dt=this.ToDateTime(cText)
IF SEEK(this.cCurrentLink,"blogs")
REPLACE pubdate WITH dt
ELSE
ASSERT .f.
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
CASE .nodeName="A"
cLink=LOWER(STRTRAN(.Attributes("href").value,"%5f","_"))
IF ATC("https://blogs.msdn.com/calvin_hsia/",cLink)>0
IF ATC("#",cLink)=0 AND ATC("archive/2",cLink)>0
*https://blogs.msdn.com/calvin\_hsia/archive/2004/10/11/\<a%20rel=
IF "<"$cLink && comment spam prevention:
*https://blogs.msdn.com/calvin\_hsia/archive/2004/10/11/240992.aspx
*<a rel="nofollow" target="_new" href="<a rel="nofollow" target="_new" href="https://www.53dy.com">https://www.53dy.com</a>
ELSE
*https://blogs.msdn.com/calvin\_hsia/archive/2004/11/16/visual%20foxpro
IF "%"$cLink
*https://blogs.msdn.com/calvin\_hsia/archive/2004/11/16/258422.aspx
* broken link: host updated software for category links
*<A title="Visual Foxpro" href="https://blogs.msdn.com/calvin_hsia/archive/2004/11/16/Visual%20Foxpro">Visual Foxpro</A>
* SET STEP ON
ELSE
IF !SEEK(cLink,"Blogs")
INSERT INTO Blogs (link) VALUES (cLink)
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
ENDCASE
FOR i = 0 TO .childNodes.length-1
this.ProcessNodes(.childNodes(i),nLev+1)
ENDFOR
ENDWITH
PROCEDURE ToDateTime(cText as String) as Datetime
*Friday, April 01, 2005 11:30 AM by Calvin_Hsia
LOCAL dt as Datetime
ASSERT GETWORDNUM(cText,6)$"AM PM"
nHr = INT(VAL(GETWORDNUM(cText,5)))
IF GETWORDNUM(cText,6)="PM" AND nhr < 12
nHr=nHr+12
ENDIF
dt=CTOT(GETWORDNUM(ctext,4) + "/" +; && Year
TRANSFORM(INT(1+(AT(GETWORDNUM(cText,2),this.cMonths)-1)/10)) + "/" +; && month
TRANSFORM(VAL(GETWORDNUM(cText,3))) + "T" +; && day of month
TRANSFORM(nHr)+":"+; && hour
TRANSFORM(VAL(SUBSTR(GETWORDNUM(cText,5),4)))) && minute
ASSERT !EMPTY(dt)
RETURN dt
PROCEDURE SaveAsMHT(cTitle as String) as String
fRetry = .t.
DO WHILE fRetry
fRetry = .f.
WITH this.oWSH
.AppActivate(cTitle) && bring IE to the foreground
TEMPFILEMHT= "c:\t.mht" && temp file can be constant
ERASE (TEMPFILEMHT)
.SendKeys("%fa"+TEMPFILEMHT+"{tab}w%s") && Alt-F (File Menu) S (Save As) type Web Archive Alt-S
nTries=5
DO WHILE !FILE(TEMPFILEMHT) && wait til IE saves the file
Sleep(5000)
nTries=nTries-1
IF nTries=0
fRetry=.t.
EXIT
ENDIF
IF this.IsThreadAborted()
this.fStopCrawl=.t.
?"Aborting crawl"
RETURN ""
ENDIF
ENDDO
sleep(100)
ENDWITH
ENDDO
RETURN FILETOSTR(TEMPFILEMHT)
RETURN
PROCEDURE IsThreadAborted as Boolean
IF WaitForSingleObject(this.hEvent,0) = WAIT_TIMEOUT
RETURN .f.
ENDIF
RETURN .t.
PROCEDURE destroy
this.oWeb.visible=.f.
this.oWeb=0
CloseHandle(this.hEvent)
ENDDEFINE
**************************************************
**************************************************
ENDTEXT
STRTOFILE(cstrVFPCode,"MyThreadFunc.prg")
COMPILE MyThreadFunc.prg
*SELECT PADR(TRANSFORM(YEAR(pubdate)),4)+"/"+PADL(MONTH(pubdate),2,"0") as mon,COUNT(*) FROM blogs WHERE !EMPTY(mht) GROUP BY mon ORDER BY mon DESC INTO CURSOR result
RETURN
Comments
Anonymous
May 26, 2006
Calvin,
That's really beautiful code... Excellent pointers for diverse applications deployed with Visual FoxPro. Thanks for commenting/posting it!!

Anonymous
May 30, 2006
I've updated the VFP MT example based on this (at http://codegallery.gotdotnet.com/SednaY) to be sort of like .NET:
Example Use:
* t=CREATEOBJECT('testserver.thread')
* t.start(5,"do c:MTmyVFPMyThreadFunc WITH p2")
* && start method params:(1)#threads,(2)VFP code to MT,(3)Silent mode
* ?t.check && returns .T. if completed
* t=null && cleanup
Simple, fast, efficient - it is VFP!!
Anonymous
January 31, 2007
I've updated this VFP Web Crawler to more closely match the VB.Net version. Check it out at: http://www.codeplex.com/vfpwebcrawler All source is included...
Anonymous
August 24, 2007
I updated a version of this code to include an easy-to-use VFP project, the ability to specify the number of threads, better switching between blogs, and a debug option to make crawling visible. See VFPWebCrawler 2.0 at: http://www.codeplex.com/VFPWebcrawler