Creating Text Summaries from HTML

07/12/2008

On several projects, I have needed to convert large HTML blobs into short text summaries that can be displayed in a list. For example, in SharePoint I often need to display lists of Publishing Page content, and I want to summarize some of the HTML columns.

This blog post provides code and describes the process for converting HTML to a text summary.

The Process

There are three basic steps to the process:

Strip out all HTML markup (tags, comments and CDATA sections)
Normalize the whitespace (reduce multiple spaces, tabs and line feeds into single spaces)
Truncate and clean up the result, and append an ellipsis (if necessary)

Regular Expressions

To remove the HTML markup, I use one large regular expression. To keep my sanity, I store several smaller patterns in separate strings and concatenate them together.

It is obvious that we want to remove normal HTML tags and comments, but it is less obvious that we want to remove CDATA sections. CDATA sections are not that common, and if we were to include their content we would need to HTML-encode it; it is much easier to simply remove them.

The first three patterns below represent the "contents" of tags (the stuff in between the "<" and ">"). The fourth pattern concatenates the results inside the opening/closing brackets.

    string TagContentsRegexPattern = @"(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*";
    string CommentContentsRegexPattern = @"\!\-\-.*?\-\-";
    string CDataContentsRegexPattern = @"\!\[CDATA\[.*?\]\]";
    string HtmlTagCommentOrCDataRegexPattern = @"\<(?:" + CommentContentsRegexPattern + "|" + CDataContentsRegexPattern + "|" + TagContentsRegexPattern + @")\>";

The final combined regular expression for identifying HTML tags is below:

\<(?:\!\-\-.*?\-\-|\!\[CDATA\[.*?\]\]|(?:[^\>\"\']*(?:\"[^\"]*\"|\'[^\']*\')?)*)\>

In the final code, you will find a method called StripTags that replaces these tags with an empty string.
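To see what the combined pattern removes, here is a minimal sketch that applies it to a small HTML fragment. The class name StripTagsDemo and the sample input are only illustrations (they are not part of the final code); the pattern itself is the combined expression shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // The combined pattern from the post, written as one verbatim string.
    private static readonly Regex TagRegex = new Regex(
        @"\<(?:\!\-\-.*?\-\-|\!\[CDATA\[.*?\]\]|(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*)\>",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    // Replaces every tag, comment and CDATA section with an empty string.
    public static string Strip(string html)
    {
        return html == null ? null : TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // Note the ">" inside the quoted attribute value: the tag pattern skips over it.
        string html = "<p class=\"a>b\">Hello <!-- note --><b>world</b></p><![CDATA[x < y]]>";
        Console.WriteLine(Strip(html)); // prints "Hello world"
    }
}
```

The quoted-string alternatives inside the tag pattern are what keep a ">" in an attribute value from ending the tag early.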

It was difficult to choose whether to replace tags with a space or a zero-length string. I ultimately chose a zero-length string, which introduces the risk of incorrectly concatenating two words (for example, if two <p> tags had no whitespace between them). In this case, I felt it was a better choice to occasionally combine two words than to introduce extra whitespace. An improvement to the code might be to detect certain tags such as <p> and <td> and always convert those to spaces.
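The improvement mentioned above could look something like the following sketch. It is not part of the final code: the class name, the method name and the list of tags treated as word boundaries are all my assumptions, and the simple block-tag pattern does not handle a ">" inside a quoted attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockTagSpacingDemo
{
    // Hypothetical list of tags that should become a space instead of vanishing.
    private static readonly Regex BlockTagRegex = new Regex(
        @"</?(?:p|div|br|li|tr|td|th|h[1-6])\b[^>]*>",
        RegexOptions.IgnoreCase);

    // The combined tag/comment/CDATA pattern from the post.
    private static readonly Regex AnyTagRegex = new Regex(
        @"\<(?:\!\-\-.*?\-\-|\!\[CDATA\[.*?\]\]|(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*)\>",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    public static string StripTagsWithSpacing(string html)
    {
        if (html == null) return null;
        // First turn block-level tags into spaces, then drop everything else.
        string spaced = BlockTagRegex.Replace(html, " ");
        return AnyTagRegex.Replace(spaced, string.Empty);
    }

    public static void Main()
    {
        // Inline tags such as <b> still disappear without adding a space.
        Console.WriteLine("[" + StripTagsWithSpacing("<p>One</p><p>Two <b>bold</b>ly</p>") + "]");
        // prints "[ One  Two boldly ]"
    }
}
```

The extra spaces this produces are harmless as long as the whitespace is normalized afterwards.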

Normalizing Whitespace

The NormalizeWhitespace method is responsible for converting sequences of whitespace (including space, tabs and linefeeds) into a single space. The string is also effectively trimmed so all whitespace at the start or end of the string is removed.
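For comparison, roughly the same behavior can be sketched in a couple of lines with a regular expression. This is an illustrative alternative rather than the NormalizeWhitespace implementation from the final code (which walks the string by hand), and the class and method names are hypothetical. The regex class \s and char.IsWhiteSpace cover nearly, but possibly not exactly, the same set of characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // One or more consecutive whitespace characters.
    private static readonly Regex WhitespaceRun = new Regex(@"\s+");

    public static string Normalize(string s)
    {
        if (s == null) return null;
        // Collapse each run to a single space, then trim both ends.
        return WhitespaceRun.Replace(s, " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine("[" + Normalize("  Hello \t\r\n  world  ") + "]"); // prints "[Hello world]"
    }
}
```

The regular expression version is shorter but allocates an intermediate string; the hand-written loop avoids that and returns the original string untouched when there is nothing to change.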

Truncate and Cleanup

Once we have removed the tags and normalized the whitespace, it's time to truncate the results.

If life were simple, we would simply truncate the string at a particular length; unfortunately, it's a bit more complicated. To do a "great job", we perform the following steps:

If the result is longer than our maximum length, we truncate.
Next, we need to determine whether we accidentally broke an HTML entity. For example, if we truncated "&amp;" in the middle and ended up with "&am", the output would be corrupted. To detect this, we look for the last "&" and the last ";" before the cut; if the ";" does not come after the "&", we truncate before the "&" instead.
Next, we look for the last space and truncate there. That way, we can avoid splitting a word in the middle.
Finally, we append the ellipsis if needed.
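The steps above can be sketched in isolation as follows. This is a trimmed-down illustration, assuming the input has already had its tags removed and its whitespace normalized; the TruncateDemo class is hypothetical and always appends "..." when it truncates.

```csharp
using System;

public static class TruncateDemo
{
    public static string Truncate(string text, int maxLength)
    {
        if (text == null) return null;
        if (maxLength <= 0) return string.Empty;
        if (text.Length <= maxLength) return text; // Step 1: nothing to cut

        int cut = maxLength;

        // Step 2: if the last '&' before the cut has no ';' after it,
        // the cut would split an entity, so back up to the '&'.
        int lastAmp = text.LastIndexOf('&', cut - 1);
        if (lastAmp != -1 && text.LastIndexOf(';', cut - 1) < lastAmp)
            cut = lastAmp;

        // Step 3: back up to the previous space so no word is split.
        if (cut > 0 && text[cut] != ' ')
        {
            int lastSpace = text.LastIndexOf(' ', cut);
            if (lastSpace > 0) cut = lastSpace;
        }

        // Step 4: append the ellipsis.
        return text.Substring(0, cut) + "...";
    }

    public static void Main()
    {
        string text = "Fish &amp; chips are tasty";
        Console.WriteLine(Truncate(text, 8));  // prints "Fish..." (cut fell inside "&amp;")
        Console.WriteLine(Truncate(text, 14)); // prints "Fish &amp;..." (cut fell inside "chips")
    }
}
```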

But wait... we didn't decode the HTML entities!?!?!?!

That's correct. The assumption for this method is that we ultimately want to rewrite the result into an HTML stream; therefore, we can leave the entities as they are. Do not run the results of these methods through a function that HTML-encodes; otherwise, your output will be double-encoded!
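To make the double-encoding problem concrete, here is a tiny sketch. It uses System.Net.WebUtility, which is available in newer versions of .NET; that choice is my assumption, and older ASP.NET code would more likely call HttpUtility.HtmlEncode from System.Web.

```csharp
using System;
using System.Net;

public static class DoubleEncodingDemo
{
    // Encoding text that already contains entities encodes the "&" again.
    public static string EncodeAgain(string alreadyEncoded)
    {
        return WebUtility.HtmlEncode(alreadyEncoded);
    }

    public static void Main()
    {
        string summary = "Fish &amp; chips"; // what the summary methods return
        Console.WriteLine(EncodeAgain(summary)); // prints "Fish &amp;amp; chips"
    }
}
```

A browser would render the double-encoded string as "Fish &amp; chips" instead of "Fish & chips".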

The Final Code

Our final code is listed below:

    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace Core.Web
    {
        public class HtmlToText
        {
            //
            // Html Tag Regex Patterns
            //
            public static readonly string TagContentsRegexPattern = @"(?:[^\>\""\']*(?:\""[^\""]*\""|\'[^\']*\')?)*";
            public static readonly string CommentContentsRegexPattern = @"\!\-\-.*?\-\-";
            public static readonly string CDataContentsRegexPattern = @"\!\[CDATA\[.*?\]\]";
            public static readonly string HtmlTagCommentOrCDataRegexPattern =
                @"\<(?:" + CommentContentsRegexPattern + "|" + CDataContentsRegexPattern + "|" + TagContentsRegexPattern + @")\>";

            public static readonly Regex FindTagRegex = new Regex(
                HtmlTagCommentOrCDataRegexPattern,
                RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.Compiled | RegexOptions.ExplicitCapture);

            public static string CreateHtmlSummary(string s, int maximumLength, bool appendEllipse)
            {
                string result;
                if (s == null)
                    result = null;
                else if (s.Length == 0 || maximumLength <= 0)
                    result = "";
                else
                {
                    // Remove Tags...
                    result = StripTags(s);

                    // Normalize Whitespace...
                    result = NormalizeWhitespace(result);

                    if (result.Length > maximumLength)
                    {
                        int truncateLen = maximumLength;

                        //
                        // Find the last position of the "&" and ";".
                        // If the last ";" is not after the last "&"
                        // then we have split an Entity and need to truncate
                        // before the "&"...
                        //
                        int lastAmpersandPosition = result.LastIndexOf('&', truncateLen - 1);
                        if (lastAmpersandPosition != -1)
                        {
                            int lastSemicolonPosition = result.LastIndexOf(';', truncateLen - 1);
                            if (lastSemicolonPosition < lastAmpersandPosition)
                                truncateLen = lastAmpersandPosition;
                        }

                        // Locate the last space and truncate there so we don't
                        // split words...
                        if (truncateLen > 0 && result[truncateLen] != ' ')
                        {
                            int spacePosition = result.LastIndexOf(' ', truncateLen);
                            if (spacePosition > 0)
                                truncateLen = spacePosition;
                        }

                        result = result.Substring(0, truncateLen);

                        // Append ellipsis, if needed...
                        if (appendEllipse)
                            result += "...";
                    }
                }
                return result;
            }

            public static string NormalizeWhitespace(string s)
            {
                string result;
                if (s == null)
                    result = null;
                else if (s.Length == 0)
                    result = "";
                else
                {
                    int startPos = 0;

                    // Trim initial whitespace
                    while (startPos < s.Length && char.IsWhiteSpace(s[startPos]))
                    {
                        startPos++;
                    }

                    if (startPos == s.Length)
                        result = "";
                    else
                    {
                        // Skip over the first word
                        int firstNonWhitespaceCharacter = startPos;
                        while (startPos < s.Length && !char.IsWhiteSpace(s[startPos]))
                        {
                            startPos++;
                        }

                        if (startPos == s.Length)
                        {
                            // No whitespace after the first word; nothing to collapse
                            if (firstNonWhitespaceCharacter == 0)
                                result = s;
                            else
                                result = s.Substring(firstNonWhitespaceCharacter);
                        }
                        else
                        {
                            // s[startPos] is whitespace, so a separator is pending
                            bool haveSeenWhitespace = true;
                            char c;
                            StringBuilder sb = new StringBuilder(s.Length - firstNonWhitespaceCharacter);
                            sb.Append(s, firstNonWhitespaceCharacter, startPos - firstNonWhitespaceCharacter);

                            for (int i = startPos + 1; i < s.Length; i++)
                            {
                                c = s[i];
                                if (char.IsWhiteSpace(c))
                                {
                                    // Remember the whitespace but never copy it, so runs
                                    // collapse and trailing whitespace is dropped
                                    haveSeenWhitespace = true;
                                }
                                else
                                {
                                    if (haveSeenWhitespace)
                                    {
                                        sb.Append(' ');
                                        haveSeenWhitespace = false;
                                    }
                                    sb.Append(c);
                                }
                            }
                            result = sb.ToString();
                        }
                    }
                }
                return result;
            }

            public static string StripTags(string s)
            {
                if (s == null)
                    return null;
                else
                    return FindTagRegex.Replace(s, string.Empty);
            }

            public static string StripTagsAndNormalize(string s)
            {
                return NormalizeWhitespace(StripTags(s));
            }
        }
    }

Possible Enhancements

This code could be enhanced by:

Allowing the consumer to provide the text that should be appended during truncation (instead of "...").
Allowing the consumer to provide a parameter that indicates whether the results should be HTML-encoded or pure text.
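Both enhancements could be sketched roughly like this. Everything here is hypothetical: the class, the Finish method and its parameters are not part of the code above, and decoding with System.Net.WebUtility assumes a newer version of .NET. The input is assumed to be tag-free, entity-encoded text that has already been cut to length.

```csharp
using System;
using System.Net;

public static class SummaryOptionsDemo
{
    // suffix: caller-supplied replacement for "..." (appended only after truncation).
    // asPlainText: when true, entities are decoded so the result is pure text.
    public static string Finish(string truncatedHtmlText, bool wasTruncated, string suffix, bool asPlainText)
    {
        if (truncatedHtmlText == null) return null;

        string result = asPlainText
            ? WebUtility.HtmlDecode(truncatedHtmlText)
            : truncatedHtmlText;

        if (wasTruncated && !string.IsNullOrEmpty(suffix))
            result += suffix;

        return result;
    }

    public static void Main()
    {
        Console.WriteLine(Finish("Fish &amp; chips", true, " [more]", true));   // prints "Fish & chips [more]"
        Console.WriteLine(Finish("Fish &amp; chips", true, "&hellip;", false)); // prints "Fish &amp; chips&hellip;"
    }
}
```

Decoding after truncation keeps the split-entity check simple, at the cost that the length limit is measured on the encoded text and not on what the reader sees.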

Drop me a line if you find this code helpful!
