Regex 101 Discussion I5 - Remove unapproved HTML tags from a string

01/27/2006

When accepting HTML input from a user, allow the following tags:

<b> </b> <i> </i> <u> </u> <a href=...> </a>

and remove any others.

******

My first comment is that you should be very careful when you do this sort of thing, because if the user can slip script past your filter, that script will run in the browser of anyone who views the page. Which is bad. Some attacks exploit HTML escape characters, so what gets through may not look like "<script>".

So be forewarned, and forearmed. Or lower-backed, it doesn't really matter which.

My approach is going to match all HTML tags, and then guard against the ones that I don't want to match. So, I start with:

<.*?>

as my initial match. I'll then refine it so that it won't match the good tags, so I can use replace on the bad ones. I'll start by not matching <b> and </b>:

< # opening <
(?! # negative lookahead
(b|/b) # b or /b
)
.*? # match between <>
>

What does that mean? Well, negative lookahead means "try to match the pattern at this point. If you do, the match fails". It doesn't eat any of the characters when it does this. In this case, it will try to match "b" or "/b" inside of the <>, and if it can, it will fail. If it can't, it will succeed.
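
As a minimal sketch of how this behaves, here is the same pattern run through Python's `re` module in verbose mode. Python is an assumption made for illustration (the post does not name a regex engine), and `BAD_TAG` and `strip_tags` are hypothetical names.

```python
import re

# The pattern from the text, written in verbose (free-spacing) mode.
BAD_TAG = re.compile(
    r"""
    <            # opening <
    (?!          # negative lookahead: fail here if the next text is...
      (b|/b)     #   b or /b
    )
    .*?          # match between <>
    >
    """,
    re.VERBOSE,
)


def strip_tags(html: str) -> str:
    """Delete every tag that BAD_TAG matches; text between tags is kept."""
    return BAD_TAG.sub("", html)


print(strip_tags("<p>Hello <b>bold</b> world</p>"))
# Hello <b>bold</b> world
```

The lookahead consumes nothing: when it succeeds (no "b" or "/b" follows the "<"), matching resumes right after the "<" and `.*?` picks up the tag body.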

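It's very much like the ^ and $ anchors: it matches a position rather than any characters, and the match can continue only if a specific condition holds at that position (here, that the pattern inside the lookahead does not match). There are both positive and negative variants of lookahead and lookbehind.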
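
The four lookaround variants can be compared side by side. This is a small sketch using Python's `re` module (an assumed engine; the sample string and variable names are made up for illustration). Note that Python requires lookbehind patterns to have a fixed width.

```python
import re

text = "<em>hi</strong>"

# Positive lookahead (?=...): letters only if a ">" comes right after them.
followed_by_gt = re.findall(r"[a-z]+(?=>)", text)

# Negative lookahead (?!...): "<" plus a name, only if "<" is not followed by "/".
opening = re.findall(r"<(?!/)[a-z]+", text)

# Positive lookbehind (?<=...): letters only if "</" comes right before them.
closing_names = re.findall(r"(?<=</)[a-z]+", text)

# Negative lookbehind (?<!...): letters not preceded by "<", "/", or a letter.
plain_text = re.findall(r"(?<![</a-z])[a-z]+", text)

print(followed_by_gt)  # ['em', 'strong']
print(opening)         # ['<em']
print(closing_names)   # ['strong']
print(plain_text)      # ['hi']
```

In every case the characters inspected by the lookaround are absent from the result, which is what "doesn't eat any of the characters" means in practice.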

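Adding the other tags is pretty simple:

< # opening <
(?! # negative lookahead
(b|/b| # b or /b
i|/i| # i or /i
u|/u| # u or /u
a\s+href.+?|/a) # a href= or /a
)
.*? # match between <>
>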

Doing the right thing with the string inside the "<a href=...>" is left as an exercise to the reader.

Comments

Anonymous
January 27, 2006
... so <a href="about:blank" onclick="while(true) { print(); }">gotcha</a> still goes through... or is that part of the exercise left to the reader? :)
Anonymous
January 27, 2006
Yes, that is part of the exercise left to the reader.
Anonymous
January 27, 2006
Sorry, but your regex doesn't work.

It correctly does not match <b>, <i>, etc.
But it incorrectly does not match <img> and <iframe> and <ul> too...

Here's a fix.

< # opening <
(?! # negative lookahead
(
b|/b| # b or /b
i|/i| # i or /i
u|/u| # u or /u
a\s+href.+?|/a # a href= or /a
)
# FIX
> # let's be strict
# END FIX
)
.*? # match between <>
>
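
To see the difference the extra ">" makes, here is a hedged sketch in Python (an assumed engine; `ALLOWED`, `ORIGINAL`, `FIXED`, and the sample string are illustrative names and data, not taken from the post):

```python
import re

# The allowed alternatives, shared by both patterns.
ALLOWED = r"b|/b|i|/i|u|/u|a\s+href.+?|/a"

# Original: protects any tag whose text merely starts with an allowed name.
ORIGINAL = re.compile(rf"<(?!({ALLOWED})).*?>")

# Fixed: the allowed alternative must be followed immediately by ">".
FIXED = re.compile(rf"<(?!({ALLOWED})>).*?>")

sample = '<b>ok</b><img src="x.png"><ul><li>item</li></ul><a href="/home">home</a>'

print(ORIGINAL.sub("", sample))
# <b>ok</b><img src="x.png"><ul>item</ul><a href="/home">home</a>

print(FIXED.sub("", sample))
# <b>ok</b>item<a href="/home">home</a>
```

With the original pattern, <img> and <ul> survive because "i" and "u" match at the start of the tag name. The fixed pattern still lets extra attributes through on <a href=...>, such as the onclick example in the earlier comment, because `.+?` accepts anything up to the ">".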
Anonymous
January 27, 2006
<img> is not as safe as it sounds because of <img dynsrc="..."> which is an implicit weakened <object> tag (shudder)

http://msdn.microsoft.com/workshop/author/dhtml/reference/properties/dynsrc.asp
Anonymous
January 27, 2006
The comment has been removed
Anonymous
January 28, 2006
The comment has been removed
Anonymous
January 28, 2006
I see a bug in my code already... the
)? # end of values
lines should be
))? # end of values. What I wrote won't even parse.

There may be other bugs... I still haven't tested it.
Anonymous
January 28, 2006
The comment has been removed
Anonymous
January 30, 2006
Eric, you ignored my answer. The much simpler and stricter: (</?(?:u|i|b|a\s+href="[^">]*"|(?<=/)a)>)|</?[^>]*>

A simple replace with "$1" strips out all HTML tags except those that were exactly specified.
Anonymous
January 30, 2006
I notice kbiel's construct
Replace("(a)|b", "$1")

and eric's construct
Replace("(?!a)b", "")

are very similar.
Anonymous
February 01, 2006
The comment has been removed
Anonymous
February 01, 2006
Indeed... but only because he forgot to include the > in the (a) part. These two regexps are equivalent:

(?!x)y style: (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>
(x)|y style: (<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)|<.*?>

except that Eric's relies on negative lookahead and yours relies on $1.

TIMTOWTDI, I guess :)
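
A quick way to check the two styles against each other is to run both over the same input. The sketch below uses Python's `re` as an assumed engine. A replacement function stands in for "$1" so that an unmatched group becomes an empty string, and all names and the sample string are illustrative.

```python
import re

ALLOWED = r"b|/b|i|/i|u|/u|a href=.*?|/a"

# (?!x)y style: match any tag unless it is exactly an allowed one, replace with "".
LOOKAHEAD = re.compile(rf"(?!<(?:{ALLOWED})>)<.*?>")

# (x)|y style: capture allowed tags in group 1, match everything else uncaptured.
CAPTURE = re.compile(rf"(<(?:{ALLOWED})>)|<.*?>")


def strip_with_lookahead(html: str) -> str:
    return LOOKAHEAD.sub("", html)


def strip_with_capture(html: str) -> str:
    # Equivalent of replacing with "$1": keep group 1 if it matched, else nothing.
    return CAPTURE.sub(lambda m: m.group(1) or "", html)


sample = '<p><b>bold</b> and <i>it</i><script>x</script><a href="/y">y</a></p>'

print(strip_with_lookahead(sample))
# <b>bold</b> and <i>it</i>x<a href="/y">y</a>
print(strip_with_capture(sample))
# <b>bold</b> and <i>it</i>x<a href="/y">y</a>
```

Agreement on one sample is not a proof of equivalence, but it shows the mechanics: the lookahead version never matches an allowed tag, while the capture version matches it and puts it straight back.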
Anonymous
February 01, 2006
Uh...Maurits, while you are right that the two styles you presented are equivalent, what you wrote for the negative look-ahead style is not what Eric constructed. He allowed fall-through because he placed his negative look-ahead within the tag markers (<>), while you moved the assertion outside of the tag markers and then included the tag markers in the assertion to make the match.

Eric: <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>
Maurits: (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>
Anonymous
February 01, 2006
I know, I fixed it ;)

Eric (original, with bug:) <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>
Maurits (fixed v1, ugly but works:) <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)>).*?>
Maurits (fixed v2, pretty:) (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>

I think my fix v1 is ugly because the <'s and >'s don't balance. Petty, I know, but it's who I am.
