Regex 101 Answer S5 - Strip out any non-letter, non-digit characters

11/23/2005

Remove any characters that are not alphanumeric.

Answer:

To remove these characters, we will first need to match them. We know that to match all alphanumeric characters, we could write:

[a-zA-Z0-9]

To match all characters except these, we can negate the character class:

[^a-zA-Z0-9]

It's then simple to use Regex.Replace():

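string data = ...;

// Match any single character that is not an ASCII letter or digit.
Regex regex = new Regex("[^a-zA-Z0-9]");

// Call Replace on the instance; the static Regex.Replace would also need the pattern.
data = regex.Replace(data, "");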
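
For reference, here is a minimal, self-contained sketch of the same steps as a small console program. The class name StripDemo, the helper name StripNonAlphanumeric, and the sample input are assumptions made for illustration; they are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    // Matches any single character that is not an ASCII letter or digit.
    private static readonly Regex NonAlphanumeric = new Regex("[^a-zA-Z0-9]");

    // Hypothetical helper: returns the input with every matching character removed.
    public static string StripNonAlphanumeric(string data)
    {
        return NonAlphanumeric.Replace(data, "");
    }

    public static void Main()
    {
        // Spaces and punctuation are dropped; prints "HelloWorld123".
        Console.WriteLine(StripNonAlphanumeric("Hello, World! 123"));
    }
}
```

Keeping the Regex in a static field means the pattern is parsed once rather than on every call, which is a design choice of this sketch and not something the answer requires.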

Another way of doing this would be to use the pattern:

[^a-z0-9]

and then create the regex using RegexOptions.IgnoreCase, which is the .NET option for case-insensitive matching.
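
The case-insensitive variant might look like the sketch below. In .NET the option is spelled RegexOptions.IgnoreCase; the class name CaseInsensitiveDemo, the helper name Strip, and the sample input are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseInsensitiveDemo
{
    // Only lowercase letters are listed; IgnoreCase makes the class
    // treat uppercase ASCII letters as members too.
    private static readonly Regex NonAlphanumeric =
        new Regex("[^a-z0-9]", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every character the negated class matches.
    public static string Strip(string data)
    {
        return NonAlphanumeric.Replace(data, "");
    }

    public static void Main()
    {
        // Uppercase letters survive even though the pattern lists only a-z.
        Console.WriteLine(Strip("Hello, World! 123"));
    }
}
```

For plain ASCII input this should behave like the explicit [^a-zA-Z0-9] pattern; which form reads better is largely a matter of taste.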

Note: I've seen a few comments referring to Unicode and international characters. I haven't delved into that because I don't want to complicate the discussion, and, frankly, Unicode scares me. If you want the details, you can find them in the docs. For example, you can find out that \W is not simply the opposite of [a-zA-Z0-9_] but is defined in terms of Unicode categories; the current .NET documentation gives it as [^\p{L}\p{Mn}\p{Nd}\p{Pc}].
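
For readers who do need international text, one possible Unicode-aware variant is sketched below. It swaps the ASCII ranges for the Unicode categories \p{L} (letters) and \p{Nd} (decimal digits), which is a different pattern from the one in the answer above. The class name UnicodeStripDemo, the helper name Strip, and the sample input are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeStripDemo
{
    // \p{L} is any Unicode letter and \p{Nd} is any Unicode decimal digit;
    // the negated class matches everything else.
    private static readonly Regex NonLetterOrDigit = new Regex(@"[^\p{L}\p{Nd}]");

    // Hypothetical helper: keeps letters and digits from any script.
    public static string Strip(string data)
    {
        return NonLetterOrDigit.Replace(data, "");
    }

    public static void Main()
    {
        // \u00E9 is a precomposed e-acute, \u0663 is ARABIC-INDIC DIGIT THREE.
        // Both are kept; the comma, spaces, and exclamation mark are removed.
        Console.WriteLine(Strip("Caf\u00E9, \u0663 ok!"));
    }
}
```

One caveat with this sketch: accents stored as separate combining marks (category Mn) are not letters, so they would be stripped unless \p{Mn} is added to the class.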

Comments

Anonymous
November 23, 2005
Just curious, couldn't you condense the pattern further to this:

[^a-z\d]

Since \d is the same as 0-9?
Anonymous
November 23, 2005
Nick,

Yes, you could, though strictly \d is not equal to [0-9] but to [\p{Nd}], its Unicode equivalent. That is probably fine in most cases.
Anonymous
November 23, 2005
Do you know a regular expression to remove html tags from a string?
Anonymous
November 23, 2005
MaherJ, what I usually do is create a CDO Message object, set the .HTMLBody property to the HTML string, and read the text equivalent from .TextBody.
Anonymous
November 23, 2005
Of course, if you do not use Unicode categories, then many characters (digits from other scripts, for example) are not included. :-)
