.NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML...

Article
06/04/2003

!! Update 06/08/18 !! Html Agility Pack has a new home on CodePlex! Available here. CodePlex is great :)

!! Update 05/05/05 !! Visual Studio 2005 Beta2 version is available here

!! Update 05/23/05 !! This blog will be discontinued. A new blog where comments will be available has been created here.

Here is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is an assembly that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).

Sample applications:
* Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, you name it.
* Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
* Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.

There is no dependency on anything other than .NET's XPath implementation. There is no dependency on Internet Explorer's DLLs, Tidy or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool.

For example, here is how you would fix all hrefs in an HTML file:

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");
 foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
 {
     HtmlAttribute att = link["href"];
     // FixLink stands for your own link-fixing routine
     att.Value = FixLink(att);
 }
 doc.Save("file.htm");
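With more recent releases of the library, the same idea might look like the sketch below. It assumes the HtmlAgilityPack NuGet package, where the root is exposed as DocumentNode and attributes are read and written with GetAttributeValue and SetAttributeValue. FixLink is a hypothetical helper that simply resolves each link against an example base address, and the class and method names are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class FixLinksSketch
{
    // Hypothetical helper: resolves a link against an example base address.
    public static string FixLink(string href)
    {
        return new Uri(new Uri("http://example.com/"), href).ToString();
    }

    public static HtmlDocument FixAll(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", "");
                link.SetAttributeValue("href", FixLink(href));
            }
        }
        return doc;
    }

    public static void Main()
    {
        // Deliberately sloppy input: the <p> is never closed.
        HtmlDocument doc = FixAll("<p><a href=\"page.htm\">one</a> <a name=\"x\">two</a>");
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

The null check matters because an HTML file without any matching link would otherwise fail in the foreach.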

You can download it here (link updated 12/12/04), full source code and doc included!

Comments

Anonymous
June 04, 2003
Thanks! Definitely going into my toolkit!
Anonymous
June 19, 2003
I have run across an issue with HtmlAgilityPack. I am trying to scrape a site that has some HTML added to the end of the document by the ISP that is hosting the site.

It is something like this:

<HTML>
...
</HTML>

<script>
...
</script>

HtmlAgilityPack will parse this and then wrap the whole thing in a <span> to give the document a single root, which is the <span> node rather than the <HTML> node.

Is there an option to either 1) ignore the extra markup, or 2) force the extra markup into the <HTML> node?
Anonymous
June 19, 2003
I think it does so because you are setting OptionOutputAsXml to True. In XML, you need a root node without siblings. HtmlAgilityPack creates this fake root node to build valid XML. Just don't use this OptionOutputAsXml.

Does this answer your question / solve your problem?
Anonymous
June 20, 2003
Yes, I have OptionOutputAsXml set to true. I am trying to produce an XHTML file so that I can apply an XSLT to it and get an RSS feed. I guess I could make my XSLT expect the root node to be a <span> instead of an <html>, but it just seems wrong. I was looking for alternatives.

And I have the source code, so I can tackle it myself, but I just wanted to see if there was already a workaround.
Anonymous
June 20, 2003
You do not need to produce an XHTML file to apply an XSLT to the document (and you should not). The HtmlDocument class supports IXPathNavigable natively for this kind of purpose, so you can just do:

HtmlDocument doc = ...
XslTransform xslt = new XslTransform();
xslt.Load("myXslt.xsl");
xslt.Transform(doc, null, writer);
Anonymous
December 09, 2003
The comment has been removed
Anonymous
February 12, 2004
There's another, the SgmlReader, which is a more structured approach to taming the HTML beast for scraping processes.

http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC
Anonymous
February 13, 2004
Absolutely, but as you say, it uses a more structured approach, and thus modifies "real world" html, which I think is a big problem for many scenarios.

Do this test:
1) go to www.microsoft.com, do a view source and save the file as mshome.htm (don't bother with images, .js and all satellite files)

2) run commandlinesgmlreader.exe mshome.htm mshome2.htm

3) open an IE on mshome.htm and another on mshome2.htm and you will see they are not rendered the same (fonts, tables, etc...)

HtmlAgilityPack does not change original html, even if it's malformed.
Simon.
Anonymous
March 09, 2004
Awesome, the SgmlReader was great, but this is even better! Way to code up the right tool!
Anonymous
March 14, 2004
I'm curious what the difference is between Html Agility Pack and MSHTML. I'm assuming that the agility pack was written to fix the problems in MSHTML. Is this true? If not, then what does the agility pack have to offer that MSHTML doesn't?
Anonymous
March 14, 2004
They are quite different libraries, not really comparable in my opinion.

MSHTML is a COM dll, not a .NET assembly (although you can interop with it), with everything that implies in terms of deployment.
MSHTML has many, many dependencies on other DLLs, while Html Agility Pack has absolutely none (in either technical terms or standard ISO terms). MSHTML is client-side oriented and has a lot to do with UI, and is therefore not suited (at all) for server-side operations. It is also somewhat strict about HTML code, while Html Agility Pack is really not. This is very useful when you're talking about real world HTML (read: buggy HTML).

Html Agility Pack's purpose is less ambitious: it basically just parses an HTML fragment (file or stream), builds a DOM out of it and allows you to modify it and save it back. It has, however, a killer feature that MSHTML does not have: support for XPath and XSL transforms on plain old buggy malformed HTML code...

Hope this clarifies.
Anonymous
March 15, 2004
This is just great! It serves my purpose.

Thanks a lot.

-Sudhir
Anonymous
March 20, 2004
What a wonderful tool! Thanks a lot!
Anonymous
March 25, 2004
The comment has been removed
Anonymous
March 25, 2004
Wow, hard to believe it took so long for somebody to write and give away such an awesome tool.
Thanks!
Anonymous
March 25, 2004
The Html Agility Pack allows you to use XSLT on the HTML documents it loads. Note, however, that it does not rely on the XHTML format at all. HTML documents do not need to conform to anything but HTML "as we know it in the real world" :-)

So, yes, I believe you can use the method described at http://www10.org/cdrom/papers/102/ to determine dynamic hyperlinks.
Anonymous
March 26, 2004
The comment has been removed
Anonymous
March 26, 2004
The comment has been removed
Anonymous
March 26, 2004
Sorry to keep bothering you Simon, thanks for your help.

This is the first time I've used XPath to navigate anything and I'm using it to navigate HtmlNodes using the HtmlNode.SelectNodes() function.

I'm having a problem with the current context, for example. I've created and filled an HtmlDocument which contains forms. I then obtain an HtmlNodeCollection of the form nodes, then for each form node I attempt to obtain a collection of input nodes that are descendants of that form node:

HtmlNodeCollection forms = doc.DocumentNode.SelectNodes("//form");

foreach( HtmlNode formNode in forms )
{
    HtmlNodeCollection inputControls = formNode.SelectNodes(".//input");

    foreach( HtmlNode inputControl in inputControls )
    {
        ...
    }
}

The XPath expression ".//input" should return an HtmlNodeCollection containing any input nodes within the form (the '.' specifying the current context, or the current selected node - from what I understand). But I always get back null.

If I change the expression to "//input" (which should return all input nodes, beginning the search from the root node of the document), it returns all of the input nodes found in the document (which is correct).

However, I specifically need just the input nodes within the current form node.

What am I doing wrong?

I've been testing this against https://recruitmax.alltel.com/recruitmax/candidates/jobopps.cfm which happens to have 2 forms.

Thanks again!
Anonymous
March 29, 2004
Hi Crumpy, you really are the "out of luck" guy :-) Let me explain why. By default, the <form> element gets special treatment in Html Agility Pack: it can overlap. This means you can have HTML like <form><b></form></b>, and Html Agility Pack will not report any error and will save it just like that. But it is more a trick than anything else, because the <form> node in the DOM does not contain any nodes: it is declared as empty, and the </form> is stored as a text node with a value of "</form>"... This is why you find nothing inside the <form> element.

You can change the parsing behavior of the Html Agility Pack using the HtmlNode static property called ElementsFlags: just add the following code before you parse your text:

HtmlNode.ElementsFlags.Remove("form");

and you should see the <input> elements inside the <form> elements, just like you thought. Note, however, that <form> elements will not be able to overlap any more if you do this. Without adding this code, you could also write a more complex XPath expression to find inputs as children of form siblings.

Simon.
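
Put together, the workaround could look like the following sketch. It assumes the HtmlAgilityPack NuGet package, and the class name, method name and sample markup are made up for illustration. Because ElementsFlags is static, the change affects every document parsed afterwards in the same process.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class FormInputsSketch
{
    // Returns the number of <input> descendants per form, keyed by the form's id.
    public static Dictionary<string, int> InputCounts(string html)
    {
        // Static, process-wide setting: it must be changed before parsing.
        HtmlNode.ElementsFlags.Remove("form");

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        Dictionary<string, int> counts = new Dictionary<string, int>();
        HtmlNodeCollection forms = doc.DocumentNode.SelectNodes("//form");
        if (forms == null)
            return counts;

        foreach (HtmlNode form in forms)
        {
            // "." anchors the query at the current form node.
            HtmlNodeCollection inputs = form.SelectNodes(".//input");
            counts[form.GetAttributeValue("id", "")] = inputs == null ? 0 : inputs.Count;
        }
        return counts;
    }

    public static void Main()
    {
        string html =
            "<form id='search'><input name='q'></form>" +
            "<form id='login'><input name='user'><input name='pass'></form>";

        foreach (KeyValuePair<string, int> pair in InputCounts(html))
            Console.WriteLine(pair.Key + ": " + pair.Value + " input(s)");
    }
}
```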
Anonymous
March 31, 2004
I ran the HtmlToRss example and got the error "File was not found at cache path...". The cache does not contain the required file. How do I deal with this? Which file should be there, and where should its name be specified?
Anonymous
March 31, 2004
In html2rss.cs, you find this:

// set the following to true, if you don't want to use the Internet at all and if you are sure something is available in the cache (for testing purposes for example).
hw.CacheOnly = true;

It means we really look for a file in the cache directory. If it's not there, an exception is thrown.

Just set CacheOnly to false (at least the 1st time you run html2rss.exe) and recompile.

Simon.
Anonymous
April 07, 2004
Hi Simon, there is a slight mistake in the HtmlNode constructor, where the "form" tag gets HtmlElementFlag.Empty. So instead of my inputs being children of the form, they are parsed as its siblings. The fix is easy, of course: just change <code>ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);</code> to <code>ElementsFlags.Add("form", HtmlElementFlag.CanOverlap);</code>
Anonymous
April 08, 2004
Hey Simon,

I'm wondering if you've thought about creating a slimmed down version of this toolkit. Maybe making the dom forward only, or being able to conditionally turn off some of the internal variables like _line and _lineposition.

Anyway, just a thought for future improvements to this great toolkit.
-Charlie
Anonymous
April 10, 2004
The comment has been removed
Anonymous
April 14, 2004
In the chm, the description is:
Gets or Sets the text between the start and end tags of the object.

The declaration on that page is:
public virtual string InnerText {get;}

The observed behaviour is as per declaration.

Why is InnerText not settable?
Anonymous
April 14, 2004
The comment has been removed
Anonymous
April 14, 2004
This is a sample code to remove comments:

static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load("filewithcomments.htm");
    doc.Save(Console.Out); // show before
    RemoveComments(doc.DocumentNode);
    doc.Save(Console.Out); // show after
}

static void RemoveComments(HtmlNode node)
{
    if (node.NodeType == HtmlNodeType.Comment)
    {
        node.ParentNode.RemoveChild(node);
        return;
    }
    if (!node.HasChildNodes)
        return;
    // Walk the children backwards by index: removing a node while
    // enumerating ChildNodes with foreach would invalidate the enumerator.
    for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
    {
        RemoveComments(node.ChildNodes[i]);
    }
}
Anonymous
April 14, 2004
You cannot set InnerText, by design, because it's computed; the doc is wrong, as you noticed.

You can set InnerHtml.

Simon.
Anonymous
April 14, 2004
I seem to have answered my own question!
Here is source, in case anyone wants it.
Thanks
-Sam

Dim myNodes As HtmlAgilityPack.HtmlNodeCollection = myDoc.DocumentNode.SelectNodes("//comment()")
Dim node As HtmlAgilityPack.HtmlNode

For Each node In myNodes
    Console.Write(node.NodeType)
    If node.NodeType = HtmlAgilityPack.HtmlNodeType.Comment Then
        node.ParentNode.RemoveChild(node)
    End If
Next
Anonymous
April 15, 2004
Hey Simon,

Just thought I'd pass on a tweak I made in case you or anyone else thought it was a useful mod.

I added the following to the HtmlNode class so as I'm doing whatever to the nodes I find, I can optionally hang any object off the nodes for re-use later.

-Mark
-------------------

internal object _externalobject = null;

/// <summary>
/// Gets or Sets the external object associated with the node.
/// </summary>
public object ExternalObject {
    get {
        return _externalobject;
    }
    set {
        _externalobject = value;
    }
}
Anonymous
April 19, 2004
It calls HtmlEncode on Html text, thus encoding twice, producing output like
&amp;nbsp;
Anonymous
June 10, 2004
Hi...

I just started playing with HtmlAgility today and I noticed a couple of odd things - most significant was with the results of some xpath queries.

I was using the xpath query "//base/@href" (i.e. intending to select an attribute value from the <base> tag if found). What I got back was an odd HtmlNodeNavigator that had LocalName set to "href" and Name set to "base" (i.e. kind of an odd mashing of the parent node with the attribute node). When I get .Current, I get the parent <base> node.

I don't know how easy it would be, but perhaps HtmlAttribute could be recoded to derive from HtmlNode? Seems like it would be easier to emulate xml behavior if they were interchangeable...

Thanks
-mark
Anonymous
June 11, 2004
Hi Mark. You are absolutely right. This is a design error: you cannot use attributes in path selection. You can still use them in filters though, like //base[@href]. Fixing this would require changing the HtmlNodeNavigator.cs file ... and I have no time to fix it right now :-)
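
In practice, that means selecting the element with the attribute in a filter and then reading the value from the node. A minimal sketch, assuming the HtmlAgilityPack NuGet package (the class name, method name and sample markup are made up for illustration):

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class BaseHrefSketch
{
    // Returns the href of the first <base href="..."> element, or null if there is none.
    public static string GetBaseHref(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element, using the attribute only as a filter...
        HtmlNode baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
        if (baseNode == null)
            return null;

        // ...then read the attribute value from the selected node.
        return baseNode.GetAttributeValue("href", "");
    }

    public static void Main()
    {
        string html = "<html><head><base href=\"http://example.com/docs/\"></head><body></body></html>";
        Console.WriteLine(GetBaseHref(html));
    }
}
```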
Anonymous
June 17, 2004
Simon, thanks for this great utility.

I haven't seen a way to POST data to a site and create a document, am I missing something?

Also is that a typo in the download link or is it deliberate?

Ian
Anonymous
June 18, 2004
If you are talking about the HtmlWeb class, you can pass a method (POST or anything) to the LoadUrl. You can also hook the HttpRequest that will be used if you connect to the PreRequest event. You can tweak the method (or anything else) there as well.
Simon.
Anonymous
June 18, 2004
Ahh I see the light! I saw the method arguments but not the event handlers. Thank you. Ian.
Anonymous
June 21, 2004
Hey Simon,
I ran into an issue with html comments today. I'm trying to insert an html comment in the document and it is requiring me to put the "<!--" and "-->" wrappers on the value I set for the HtmlCommentNode. Debugging through the code, it appears that the nodes that are generated by the parse routine incorrectly include those wrapper tags in the value of the node...causing node.OuterHtml to return
" -->" and node.InnerHtml to return "".

-Mark
Anonymous
June 24, 2004
Hi Mark.
Can you show a sample of your code?
Simon.
Anonymous
June 29, 2004
Based on the examples for removing tags with Html Agility Pack, how can I remove tags like:

<o:p>

I am trying: RemoveTag(doc, "o\:p"); but it throws "System.Xml.XPath.XPathException:"

private static void RemoveTag(HtmlDocument doc, string tags)
{
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//" + tags);
    if (nodes == null)
        return;
    foreach(HtmlNode node in nodes){
        if (node.ParentNode != null)
            node.ParentNode.RemoveChild(node);
    }
}
Anonymous
June 30, 2004
Hi.
Unfortunately, the support for namespaces is limited in the Html Agility Pack. It does not really know what a namespace is and understands names (prefix ':' localname) as a whole. I agree this is quite confusing :-) but most of the time, you can work around it. In your case, this is how you would do it.

HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//*[name() ='o:p']");
if (coll != null) // SelectNodes returns null when nothing matches
{
    foreach(HtmlNode node in coll)
    {
        node.ParentNode.RemoveChild(node);
    }
}
Simon.
Anonymous
July 07, 2004
My parsed file ends up having attributes like nowrap set to nowrap="" or checked set to checked="". Is there something I'm missing?

Thanks a lot, and this is an awesome tool.
Anonymous
July 12, 2004
The comment has been removed
Anonymous
July 13, 2004
The comment has been removed
Anonymous
July 17, 2004
Hi,

It seems your sample at the beginning doesn't work anymore:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href"])
{
HtmlAttribute att = link["href"];
att.Value = FixLink(att);
}
doc.Save("file.htm");

1. DocumentElement has been replaced by DocumentNode.
2. HtmlNode is not indexable anymore (link["href"] won't work).
3. I tried HtmlNode.GetAttributeValue() and HtmlNode.SetAttributeValue(), but after saving the document with doc.Save() there weren't any changes.

Here is my Code i used:

HtmlDocument doc = hw.Load ("f1.htm");
HtmlNode hn = doc.DocumentNode.SelectSingleNode ("//body");
hn.SetAttributeValue ("new","value");
doc.Save ( "f2.htm");

Please correct me if I'm wrong

greetings

Markus
Anonymous
July 19, 2004
Hi Markus, you're absolutely right. The sample (which was meant for illustration purpose only) is wrong, and it has always been. You're the first one to really try it I suppose :-)

The samples in the .zip file are hopefully ok, though.

Simon.
Anonymous
August 04, 2004
Very nifty tool. Thanks!
Anonymous
August 10, 2004
Hello...
I'm getting an error when loading the solution. It's missing:
..\HtmlDomView\HtmlDomView.csproj
and
Samples\GetBinaryRemainder\GetBinaryRemainder.csproj

It appears they are not in the zip file :O(
Please help
Anonymous
August 17, 2005
Processing loosely-defined text must rank as one of
the worst kinds of programming tasks. HTML and CSV parsing
are about as much fun as cleaning the toilet in a bus
station: who knows what you're going to find.
Anonymous
October 05, 2006
I've seen this around before, and this post was from June 2003, but it is worth mentioning again!
Anonymous
January 11, 2008
PingBack from http://msdn.blogsforu.com/msdn/?p=3654
Anonymous
March 17, 2008
PingBack from http://blogrssblog.info/simon-mouriers-weblog-net-html-agility-pack-how-to-use-malformed/
Anonymous
December 08, 2008
PingBack from http://alexandersarchive.wordpress.com/2008/12/08/html-agility-pack/
Anonymous
January 22, 2009
PingBack from http://www.hilpers.fr/931621-supression-des-balise-html-expression
Anonymous
May 05, 2009
Avoid (403) Forbidden errors when using HttpWebRequest I had an error when tried to open the page http
Anonymous
May 29, 2009
PingBack from http://paidsurveyshub.info/story.php?title=simon-mourier-s-weblog-net-html-agility-pack-how-to-use-malformed
Anonymous
June 08, 2009
PingBack from http://cellulitecreamsite.info/story.php?id=4212
Anonymous
June 20, 2009
Reposted from: http://www.cnblogs.com/dragon/archive/2005/06/15/174946.html Sample download. A friend asked about the following problem, which requires implementing these features: 1.
