Regex 101 Discussion I5 - Remove unapproved HTML tags from a string

When accepting HTML input from a user, allow the following tags:

<b>

</b>

<a href=…>

</a>

<i>

</i>

<u>

</u>

and remove any others.

******

My first comment is that you should be very careful when you do this sort of thing, because if a user can slip script past your filter, that script will run in the browser of anyone who views the page. Which is bad. Some attacks use HTML character escapes, so what gets through may not look like "<script>" at all.

So be forewarned, and forearmed. Or lower-backed, it doesn't really matter which.

My approach is going to match all HTML tags, and then guard against the ones that I don't want to match. So, I start with:

<.*?>

as my initial match. I'll then refine it so that it won't match the good tags, which lets me use replace on the bad ones. I'll start by not matching <b> and </b>:

< # opening <
(?! # negative lookahead
(b|/b) # b or /b
)
.*? # match between <>
>

What does that mean? Well, negative lookahead means "try to match the pattern at this point. If you do, the match fails". It doesn't eat any of the characters when it does this. In this case, it will try to match "b" or "/b" inside of the <>, and if it can, it will fail. If it can't, it will succeed.

It's very much like the ^ and $ anchors: it is a zero-width test, and the match can only continue if the condition holds (for negative lookahead, the condition is that the pattern does not match at that point). There are both positive and negative variants of lookahead and lookbehind.
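To make the lookahead behavior concrete, here is a minimal sketch in Python. Python is an assumption (the post does not name a language), and `NOT_BOLD_TAG` and `strip_tags` are hypothetical names. Python's `re.VERBOSE` flag allows the same commented layout used above.

```python
import re

# The pattern from the post, compiled with re.VERBOSE so that
# whitespace and # comments inside the pattern are ignored.
# Only <b> and </b> are guarded at this stage.
NOT_BOLD_TAG = re.compile(
    r"""
    <           # opening <
    (?!         # negative lookahead: fail here if what follows is...
      (b|/b)    # ...b or /b
    )
    .*?         # match between <>
    >
    """,
    re.VERBOSE,
)


def strip_tags(html: str) -> str:
    """Delete every tag the pattern matches (hypothetical helper)."""
    return NOT_BOLD_TAG.sub("", html)


sample = "<b>bold</b> and <script>alert(1)</script>"
print(strip_tags(sample))  # <b>bold</b> and alert(1)

# The lookahead consumes nothing, so the match still covers the whole tag.
print(NOT_BOLD_TAG.search("<b>x</b><em>").group(0))  # <em>
```

The second print shows the "doesn't eat any characters" point: the lookahead is checked right after the `<`, and the rest of the pattern then matches from that same position.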

Adding the other tags is pretty simple:

< # opening <
(?! # negative lookahead
(
b|/b| # b or /b
i|/i| # i or /i
u|/u| # u or /u
a\s+href.+?|/a # a href= or /a
)
)
.*? # match between <>
>
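As a rough check, the full pattern can be dropped into a replace call. This is a sketch in Python (an assumption, as are the names `BAD_TAG` and `remove_unapproved`), not a vetted sanitizer.

```python
import re

# The full pattern from the post, compiled in verbose mode.
BAD_TAG = re.compile(
    r"""
    <                   # opening <
    (?!                 # negative lookahead
      (
        b|/b|           # b or /b
        i|/i|           # i or /i
        u|/u|           # u or /u
        a\s+href.+?|/a  # a href= or /a
      )
    )
    .*?                 # match between <>
    >
    """,
    re.VERBOSE,
)


def remove_unapproved(html: str) -> str:
    """Replace every matched (unapproved) tag with nothing."""
    return BAD_TAG.sub("", html)


sample = '<p><a href="x.html">link</a> <u>under</u><font size=7>big</font></p>'
print(remove_unapproved(sample))
# <a href="x.html">link</a> <u>under</u>big
```

Only the tags are removed; the text between them ("big" here) stays in place.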

Doing the right thing with the string inside the "<a href=...>" is left as an exercise to the reader.

Comments

  • Anonymous
    January 27, 2006
    ... so <a href="about:blank" onclick="while(true) { print(); }">gotcha</a> still goes through... or is that part of the exercise left to the reader? :)

  • Anonymous
    January 27, 2006
    Yes, that is part of the exercise left to the reader.

  • Anonymous
    January 27, 2006
    Sorry, but your regex doesn't work.

    It correctly does not match <i>, <u>, etc.
    But it incorrectly does not match <img> and <iframe> and <ul> too...

    Here's a fix.

    < # opening <
    (?! # negative lookahead
    (
    b|/b| # b or /b
    i|/i| # i or /i
    u|/u| # u or /u
    a\s+href.+?|/a # a href= or /a
    )
    # FIX
    > # let's be strict
    # END FIX
    )
    .*? # match between <>
    >
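The difference the extra > makes can be checked with a short sketch. Python and the names `LOOSE` and `STRICT` are assumptions for illustration; the two patterns are the original and the fix from the comment above, written on one line.

```python
import re

# Original: the lookahead only checks how the tag name *starts*,
# so "i" also guards <img> and <iframe>, and "u" guards <ul>.
LOOSE = re.compile(r"<(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>")

# Fix: the lookahead must also see the closing >, so only the
# exact tags are guarded.
STRICT = re.compile(r"<(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)>).*?>")

sample = "<i>ok</i><img src=x><ul><li>item</li></ul><iframe></iframe>"

print(LOOSE.sub("", sample))
# <i>ok</i><img src=x><ul>item</ul><iframe></iframe>

print(STRICT.sub("", sample))
# <i>ok</i>item
```

With the loose pattern only `<li>` and `</li>` are removed; with the strict one everything except `<i>` and `</i>` goes.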

  • Anonymous
    January 27, 2006
    <img> is not as safe as it sounds because of <img dynsrc="..."> which is an implicit weakened <object> tag (shudder)

    http://msdn.microsoft.com/workshop/author/dhtml/reference/properties/dynsrc.asp

  • Anonymous
    January 27, 2006
    The comment has been removed

  • Anonymous
    January 28, 2006
    The comment has been removed

  • Anonymous
    January 28, 2006
    I see a bug in my code already... the
    )? # end of values
    lines should be
    ))? # end of values. What I wrote won't even parse.

    There may be other bugs... I still haven't tested it.

  • Anonymous
    January 28, 2006
    The comment has been removed

  • Anonymous
    January 30, 2006
    Eric, you ignored my answer. The much simpler and stricter: (</?(?:u|i|b|a\s+href="[^">]*"|(?<=/)a)>)|</?[^>]*>

    A simple replace with "$1" strips out all HTML tags except those that were exactly specified.

  • Anonymous
    January 30, 2006
    I notice kbiel's construct
    Replace("(a)|b", "$1")

    and eric's construct
    Replace("(?!a)b", "")

    are very similar.





  • Anonymous
    February 01, 2006
    The comment has been removed

  • Anonymous
    February 01, 2006
    Indeed... but only because he forgot to include the > in the (a) part. These two regexps are equivalent:

    (?!x)y style: (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>
    (x)|y style: (<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)|<.*?>

    except that Eric's relies on negative lookahead and yours relies on $1.

    TIMTOWTDI, I guess :)
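A small sketch of the two styles side by side, assuming Python (where "$1" is written as group 1 of the match object) and the hypothetical names `LOOKAHEAD_STYLE` and `CAPTURE_STYLE`. It only shows that both give the same result on these samples; it is not a proof of equivalence.

```python
import re

# (?!x)y style: refuse to match approved tags, delete whatever matches.
LOOKAHEAD_STYLE = re.compile(r"(?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>")

# (x)|y style: match every tag, but capture approved ones in group 1
# and put them back during the replace.
CAPTURE_STYLE = re.compile(r"(<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)|<.*?>")


def strip_with_lookahead(html: str) -> str:
    return LOOKAHEAD_STYLE.sub("", html)


def strip_with_capture(html: str) -> str:
    # group(1) is None when an unapproved tag matched the second branch.
    return CAPTURE_STYLE.sub(lambda m: m.group(1) or "", html)


sample = '<b>x</b><img src=y><a href="z">w</a><ul>'
print(strip_with_lookahead(sample))  # <b>x</b><a href="z">w</a>
print(strip_with_capture(sample))    # <b>x</b><a href="z">w</a>
```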

  • Anonymous
    February 01, 2006
    Uh... Maurits, while you are right that the two styles you presented are equivalent, what you wrote for the negative look-ahead style is not what Eric constructed. He allowed fall-through because he placed his negative look-ahead within the tag markers (<>), while you moved the assertion outside of the tag markers and then included the tag markers in the assertion to make the match.

    Eric: <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>
    Maurits: (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>

  • Anonymous
    February 01, 2006
    I know, I fixed it ;)


    Eric (original, with bug): <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>
    Maurits (fixed v1, ugly but works): <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)>).*?>
    Maurits (fixed v2, pretty): (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>

    I think my fix v1 is ugly because the <'s and >'s don't balance. Petty, I know, but it's who I am.
