Better HTML parsing and validation with HtmlAgilityPack

12/10/2006

Let's face it: sometimes the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class just doesn't cut it when you're writing custom extraction and validation code. HtmlDocument was originally designed as an internal class to very efficiently parse URLs for dependent requests (such as images) out of HTML response bodies. Before VS 2005 RTM, we made HtmlDocument part of the public WebTestFramework API, but scheduling and resource constraints prevented us from adding more general-purpose DOM features like InnerHtml, InnerText, and GetElementById. You could always parse the HTML string yourself, but fortunately there's a better option: HtmlAgilityPack.

HtmlAgilityPack is an open-source project on CodePlex. It provides standard DOM APIs and XPath navigation -- even when the HTML is not well-formed!

Here's a sample web test that uses the HtmlAgilityPack.HtmlDocument instead of the one in WebTestFramework. It simply validates that Microsoft's home page lists Windows as the first item in the navigation sidebar. Download HtmlAgilityPack and add a reference to it from your test project to try out this coded web test.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualStudio.TestTools.WebTesting;
using HtmlAgilityPack;

public class WebTest1Coded : WebTest
{
    public override IEnumerator<WebTestRequest> GetRequestEnumerator()
    {
        WebTestRequest request1 = new WebTestRequest("https://www.microsoft.com/");
        request1.ValidateResponse += new EventHandler<ValidationEventArgs>(request1_ValidateResponse);
        yield return request1;
    }

    void request1_ValidateResponse(object sender, ValidationEventArgs e)
    {
        //load the response body string as an HtmlAgilityPack.HtmlDocument
        //(fully qualified, because WebTestFramework also defines an HtmlDocument)
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(e.Response.BodyString);

        //locate the "Nav" element; GetElementbyId returns null if no element has that id
        HtmlNode navNode = doc.GetElementbyId("Nav");

        //pick the first <li> element under Nav (".//li" is relative to navNode)
        HtmlNode firstNavItemNode = navNode == null ? null : navNode.SelectSingleNode(".//li");

        //validate the first list item in the Nav element says "Windows";
        //a missing Nav element or list item fails validation instead of throwing
        e.IsValid = firstNavItemNode != null && firstNavItemNode.InnerText == "Windows";
    }
}

Updated: Fixed the XPath query thanks to Oleg's comment. Also fixed the indentation of the code.
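
The XPath fix is easy to see in isolation. The following is only a minimal sketch: the markup in SampleHtml and the helper FirstMatchText are made up for illustration and are not part of the original web test. It uses the same HtmlAgilityPack calls as the test above (LoadHtml, GetElementbyId, SelectSingleNode, InnerText) and lets you run different XPath expressions from the same starting node.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical markup (an assumption for this sketch):
    // another list appears in the document before the "Nav" list.
    public const string SampleHtml =
        "<html><body>" +
        "<ul id=\"Header\"><li>Sign in</li></ul>" +
        "<ul id=\"Nav\"><li>Windows</li><li>Office</li></ul>" +
        "</body></html>";

    // Evaluates the XPath starting from the element with the given id and
    // returns the text of the first match, or null if the element or the
    // match is missing.
    public static string FirstMatchText(string html, string id, string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode start = doc.GetElementbyId(id);
        if (start == null)
        {
            return null;
        }

        HtmlNode match = start.SelectSingleNode(xpath);
        return match == null ? null : match.InnerText;
    }
}
```

Calling FirstMatchText(XPathScopeDemo.SampleHtml, "Nav", "//li") returns "Sign in", because "//li" is an absolute path and matches the first <li> in the whole document. With ".//li" or "descendant::li" the search is scoped to the Nav element, and the result is "Windows".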

Comments

Anonymous
December 11, 2006
Now, this is cool if you do a lot of HTML parsing! You can tell I was drawn to it by the word "Agile".
Anonymous
December 11, 2006
What's wrong with SgmlReader?
Anonymous
December 11, 2006
Josh, your sample is broken. //li is absolute XPath selection. So navNode.SelectSingleNode("//li") returns first <li> in the document, not under navNode. If you need to select <li> descendant of navNode you need navNode.SelectSingleNode(".//li") or navNode.SelectSingleNode("descendant::li");
Anonymous
December 12, 2006
Thanks Oleg, I thought something wasn't right with that XPath, but it worked so I left it alone :) I'll update the code. I haven't used SgmlReader myself, but I've read multiple posts saying HtmlAgilityPack works much better for malformed HTML. Josh
Anonymous
October 21, 2008
Let's face it: sometimes, when you're writing custom extraction and validation rules, the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class just doesn't cut it. HtmlDoc...
