Querying HTML with LINQ to XML
We often need to parse HTML for data. Sure, in a perfect world everything would have a nice service or API wrapped around it, but as we all know this is not always the case. Many times we're left with parsing files or "screen scraping" to get the data we need from other applications. Sure, this is brittle, but sometimes it's the best we can do. And sometimes you're just trying to get the data once, so "good enough" really is good enough.
I was faced with that challenge myself this week. Yes, even here not all systems expose services, or if they do, finding the documentation or the person to consult would take longer than writing a simple program. ;-) At the core, all I needed to do was query a couple of pieces of data from a bunch of web pages. This seemed like the perfect opportunity to use LINQ to XML because the pages were pretty well-formed HTML. However, there were a couple of tricks to figure out, mainly because LINQ to XML doesn't support HTML entities. It only supports character entities and the built-in XML entities (&lt; &gt; &quot; &amp; &apos;).
Working with simple HTML in an XElement is very straightforward, as long as it's well-formed and doesn't contain any HTML entity references:
Dim html = <html>
               <head>
                   <title>
                       Test Page
                   </title>
               </head>
               <body>
                   <a id="link1" href="https://mydownloads1.com">This is a link 1</a>
                   <a id="link2" href="https://mydownloads2.com">This is a link 2</a>
                   <a id="link3" href="https://mydownloads3.com">This is a link 3</a>
                   <a id="link4" href="https://mydownloads4.com">This is a link 4</a>
               </body>
           </html>
Dim links = From link In html...<a>
For Each link In links
    Console.WriteLine(link.@href)
Next
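If the XML axis shorthand is unfamiliar, it may help to see roughly what it maps to in plain LINQ to XML method calls. The following is only a sketch; the module name and the trimmed-down page are made up for illustration.

```vb
Imports System
Imports System.Xml.Linq

Module AxisPropertySketch
    Sub Main()
        ' Trimmed-down, hypothetical version of the test page
        Dim html = <html>
                       <body>
                           <a id="link1" href="https://mydownloads1.com">This is a link 1</a>
                           <a id="link2" href="https://mydownloads2.com">This is a link 2</a>
                       </body>
                   </html>

        ' html...<a> is shorthand for asking for all descendant <a> elements
        For Each link In html.Descendants("a")
            ' link.@href reads the href attribute's value; the explicit form
            ' below throws if the attribute is missing, while link.@href
            ' returns Nothing in that case
            Console.WriteLine(link.Attribute("href").Value)
        Next
    End Sub
End Module
```

Both forms walk the same tree; the axis properties are just shorter to type and read.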
But as we all know, HTML almost always contains entity references all over the place (like &nbsp; for the HTML non-breaking space), and those will not load into an XElement. If you end up with any querystring parameters in your hrefs, you get the same problem when you try to load the HTML into the XElement, because the & that separates the parameters is read as the start of an entity reference. Additionally, if you paste such a literal into the VB editor, it automatically tries to interpret the text after the & as an entity and places a semicolon into the querystring where you don't want it.
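To see the failure concretely, here is a minimal sketch. The module and the LoadsAsXml helper are hypothetical names used only for this illustration; the markup strings are small stand-ins, not the real page.

```vb
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module EntityFailureSketch
    ' Hypothetical helper: True if the markup loads as XML,
    ' False if the XML parser rejects it.
    Function LoadsAsXml(ByVal markup As String) As Boolean
        Try
            XElement.Parse(markup)
            Return True
        Catch ex As XmlException
            Console.WriteLine("Rejected: " & ex.Message)
            Return False
        End Try
    End Function

    Sub Main()
        ' &nbsp; is an HTML entity that plain XML does not declare
        Console.WriteLine(LoadsAsXml("<td>&nbsp;</td>"))

        ' A bare & in a querystring looks like an unfinished entity reference
        Console.WriteLine(LoadsAsXml("<a href=""page.aspx?Product=Cool&Id=12345"">x</a>"))

        ' With the ampersand escaped as &amp; the same link loads
        Console.WriteLine(LoadsAsXml("<a href=""page.aspx?Product=Cool&amp;Id=12345"">x</a>"))
    End Sub
End Module
```

The first two calls should report an XmlException and return False, and the last one should return True.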
So to fix this you need to remove all the unsupported HTML entity references and replace the & characters with &amp;. Luckily, the pages I was loading were not that complicated and only contained &nbsp; and the problematic querystrings. This is an example of the page I was trying to load:
<html xmlns="https://www.w3.org/1999/xhtml">
<head>
<title>
Sample Page
</title>
<link href="css/page.css" rel="StyleSheet"/>
</head>
<body >
<!--begin form -->
<form name="form1" method="post" action="page.aspx?Product=Cool&Id=12345" id="form1">
<!--begin main table -->
<table class="tblMain" cellspacing="0" cellpadding="0">
<!--Properties -->
<tr>
<td class="tdHead">Properties</td>
</tr>
<tr>
<td class="tdGrid">
<div>
<table class="grid" cellspacing="0" cellpadding="3"
border="1" id="dgPage" style="border-collapse:collapse;">
<tr class="grid_row">
<td class="grid_item" style="font-weight:bold;width:100px;">ID</td>
<td class="grid_item" style="width:480px;">12345</td>
</tr>
<tr class="grid_row">
<td class="grid_item" style="font-weight:bold;width:100px;">Published</td>
<td class="grid_item" style="width:480px;">05/04/2007</td>
</tr>
</table>
</div>
</td>
</tr>
<!--Details -->
<tr>
<td id="tdHeadDetails" class="tdHead">Statistics</td>
</tr>
<tr>
<td class="tdGrid">
<div>
<table class="grid" cellspacing="0" cellpadding="3" rules="all" border="1"
id="dgDetails" style="border-collapse:collapse;">
<tr class="grid_header">
<th scope="col">Rating :</th>
<th scope="col">Raters :</th>
<th scope="col">Pageviews :</th>
<th scope="col">Printed :</th>
<th scope="col">Saved :</th>
<th scope="col">Emailed :</th>
<th scope="col">Linked :</th>
<th scope="col"></th>
</tr>
<tr class="grid_row">
<td class="grid_item" style="width:60px;">5.00</td>
<td class="grid_item" style="width:60px;">100</td>
<td class="grid_item" style="width:80px;">1000000</td>
<td class="grid_item" style="width:60px;">150</td>
<td class="grid_item" style="width:60px;">1000</td>
<td class="grid_item" style="width:60px;">100</td>
<td class="grid_item" style="width:280px;">40</td>
<td class="grid_item">
<a href="https://www.somewhere.com/default.aspx?ID=12345&Name=Beth" target="_blank">View</a>
</td>
</tr>
</table>
</div>
</td>
</tr>
</table>
</form>
</body>
</html>
So here's what I did to load this programmatically and fix up the HTML. Also notice that I need to add an Imports statement to import the default XML namespace that is declared in the HTML document; otherwise our query later will not return any results.
Imports <xmlns="https://www.w3.org/1999/xhtml">
Imports System.Net
Imports System.IO
Public Class SimpleScreenScrape
    Function GetHtmlPage(ByVal strURL As String) As String
        Try
            Dim strResult As String
            Dim objRequest As WebRequest = HttpWebRequest.Create(strURL)
            objRequest.UseDefaultCredentials = True
            'Using disposes the response and the reader, so no explicit Close is needed
            Using objResponse As WebResponse = objRequest.GetResponse()
                Using sr As New StreamReader(objResponse.GetResponseStream())
                    strResult = sr.ReadToEnd()
                End Using
            End Using
            'Replace HTML entity references so that we can load into XElement
            strResult = Replace(strResult, "&nbsp;", "")
            strResult = Replace(strResult, "&", "&amp;")
            Return strResult
        Catch ex As Exception
            'An empty string will fail to load in QueryData, which reports the error
            Return ""
        End Try
    End Function
    Sub QueryData()
        Dim html As XElement
        Try
            Dim p = GetHtmlPage("https://www.somewhere.com/default.aspx")
            Using sr As New StringReader(p)
                html = XElement.Load(sr)
            End Using
        Catch ex As Exception
            MsgBox("Page could not be loaded.")
            Exit Sub
        End Try
        .
        . 'Now we can write the queries...
        .
Now for the fun part, the actual querying! Now that the document is loaded into the XElement the querying of it becomes a snap. I needed to grab the publish date, and then all the statistics from the page. This is easily done with a couple LINQ to XML queries, one query for each of the HTML tables where the data is located:
'I'm using FirstOrDefault here because I know my page
' only has one of these tables
Dim stats = (From stat In html...<table> _
             Where stat.@id = "dgDetails" _
             Select fields = stat.<tr>.<th>, values = stat.<tr>.<td>).FirstOrDefault()
'Same here. FirstOrDefault because there's only one "Published"
' html row (<tr>) on the page that I'm looking for.
Dim lastPublished = (From prop In html...<tr> _
                     Where prop.<td>.Value = "Published" _
                     Select prop.<td>(1).Value).FirstOrDefault()
Console.WriteLine(lastPublished)
For i = 0 To stats.fields.Count - 1
    Console.WriteLine(stats.fields(i).Value & " = " & stats.values(i).Value)
Next
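The namespace Imports matters because every element on the page lives in the namespace declared on the html element, so an unqualified name matches nothing. The sketch below shows the same effect with explicit method calls. The module name and the tiny page string are assumptions for illustration; the namespace URI is copied from the sample page, and whatever URI is used has to match the page's declaration exactly.

```vb
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module NamespaceSketch
    Sub Main()
        ' Tiny stand-in page that declares the same default namespace
        ' as the sample page
        Dim markup = "<html xmlns=""https://www.w3.org/1999/xhtml""><body>" & _
                     "<table id=""dgDetails""><tr><td>5.00</td></tr></table>" & _
                     "</body></html>"
        Dim page = XElement.Parse(markup)

        ' Unqualified name: finds nothing, because the elements are
        ' in the page's default namespace
        Console.WriteLine(page.Descendants("table").Count())

        ' Name qualified with the root element's namespace: finds the table
        Dim ns As XNamespace = page.Name.Namespace
        Console.WriteLine(page.Descendants(ns + "table").Count())
    End Sub
End Module
```

Reading the namespace from the root element, as done here, is one way to avoid hard-coding the URI when the axis-property syntax isn't required.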
And that's it. For this simple utility this is good enough for me and took me about 15 minutes to program using LINQ. The trick to loading the HTML document into an XElement is to remove all the unsupported HTML entity references first.
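One caveat with a blanket replace of & with &amp; is that a page which already escapes some of its ampersands would end up double-escaped. A possible refinement, sketched below, escapes only the bare ampersands. This is a swapped-in technique using a regular expression, not what the code above does; CleanForXml is a hypothetical helper name, and replacing &nbsp; with the numeric reference &#160; instead of deleting it is also an assumption of this sketch.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports System.Xml.Linq

Module EntityCleanupSketch
    ' Hypothetical helper, not part of the original utility
    Function CleanForXml(ByVal html As String) As String
        ' Swap the HTML-only &nbsp; for its numeric form, which XML accepts
        Dim result = html.Replace("&nbsp;", "&#160;")
        ' Escape only ampersands that do not already begin one of the five
        ' built-in XML entities or a numeric character reference
        Return Regex.Replace(result, _
            "&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)", "&amp;")
    End Function

    Sub Main()
        ' Mix of &nbsp;, a bare & and an already-escaped &amp;
        Dim raw = "<p>a&nbsp;b <a href=""x.aspx?A=1&B=2&amp;C=3"">go</a></p>"
        Dim cleaned = CleanForXml(raw)
        Console.WriteLine(cleaned)

        ' The cleaned markup now loads, and the href reads back unescaped
        Dim p = XElement.Parse(cleaned)
        Console.WriteLine(p.Element("a").Attribute("href").Value)
    End Sub
End Module
```

Other HTML-only entities (for example &copy;) are not translated by this sketch; their leading & gets escaped, so they survive as literal text rather than breaking the load.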
Enjoy!
Comments
Anonymous
April 26, 2008
Beth Massi demonstrates more of the power of LINQ to XML with another type of "data" - HTML
Anonymous
April 27, 2008
all I can say is... OOOO.... NICE! < insert wide eyed, blinky faced emoticon here >
Anonymous
April 27, 2008
wow!! that is really neat! Too bad that most developers are still working like they were in the 90's... :( I saw that you're requesting a file located at "http://www.somewhere.com/default.aspx". My question is, using LINQ can I make crossdomain requests?
Anonymous
April 28, 2008
Beth, I've been a reader of your blog for quite a while now and this post fits EXACTLY what I'm doing at the moment. I'm a hobby programmer and being such, I'm using the VB Express 2008. So for reports, I've been using HTML pages (created in code with necessary values inserted) and I was wondering how I was going to be able to get HTML type reports working with XML literals and LINQ. My early attempts ended up much of the same way, with errors and other warnings and what not. Now I now that what I was thinking of doing was correct. Thanks so much! JB
Anonymous
April 28, 2008
I was querying HTML using LINQ and XML literals and found something very useful: HTML2XHTML.dll http://msdn2.microsoft.com/en-us/library/bb251006.aspx This baby (from Microsoft) will take malformed HTML and produce very nice XHTML! Along with Beth's other techniques, most web pages will be well formed XML! It's very simple on top of it!
Anonymous
April 29, 2008
Beth Massi has done something I know some of you C4F readers have been asking for.  Beth developed
Anonymous
May 02, 2008
Another great alternative is to use HtmlAgilityPack - http://www.codeplex.com/htmlagilitypack
Anonymous
May 12, 2008
Looks really good, could we just not use XSLT though?
Anonymous
May 17, 2008
How would you translate Dim lastPublished = (From prop In html...<tr> _ Where prop.<td>.Value = "Published" _ Select prop.<td>(1).Value).FirstOrDefault() in C#?
Anonymous
June 04, 2008
hello Beth I want to put error messageboxes when making linq to xml file. for examle I want error message when a data is null or empty How can I do? thanks I tried this name=<%= If(Not company.name = "", company.name, messagebox.show("there is no name")) %> but dont work
Anonymous
May 10, 2009
Hello, I'm trying with this code to read web page but still have error with ISO-8859-15 !? I suppose it's only works with perfect pages ?? :-(
Anonymous
May 11, 2009
Hi Paraglider, You may want to try looking at these utilities people suggested for converting malformed HTML into XHTML: http://msdn.microsoft.com/en-us/library/bb251006.aspx http://www.codeplex.com/htmlagilitypack HTH, -B
Anonymous
May 17, 2014
Hi Beth, I'm facing with a issue. Not pretty sure why my query is not returning any element. It goes like this Dim divElement = (From e In html.<div> Where e.@id = "divTest" Select e.Name).FirstOrDefault() What am I doing wrong? Regards