Regex 101 Answer S5 - Strip out any non-letter, non-digit characters

Remove any characters that are not alphanumeric.

Answer:

To remove these characters, we will first need to match them. We know that to match all alphanumeric characters, we could write:

[a-zA-Z0-9]

To match all characters except these, we can negate the character class:

[^a-zA-Z0-9]

It's then simple to use Regex.Replace():

string data = ...;

Regex regex = new Regex("[^a-zA-Z0-9]");

data = regex.Replace(data, "");
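
Wrapped up as something you can compile, the idea might look like the sketch below. The class name AlphanumericFilter and the helper StripNonAlphanumeric are made up for illustration; only the pattern and the call to Replace come from the discussion above.

```csharp
using System.Text.RegularExpressions;

public static class AlphanumericFilter
{
    // Matches any single character that is not an ASCII letter or digit.
    private static readonly Regex NonAlphanumeric = new Regex("[^a-zA-Z0-9]");

    // Hypothetical helper: returns the input with every matched character removed.
    public static string StripNonAlphanumeric(string input)
    {
        // Instance Replace(input, replacement): each match is replaced with "".
        return NonAlphanumeric.Replace(input, "");
    }
}
```

With this in place, AlphanumericFilter.StripNonAlphanumeric("Hello, World! 123") should come back as HelloWorld123.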

Another way of doing this would be to use the pattern:

[^a-z0-9]

and then create the regex using RegexOptions.IgnoreCase.
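
Here is a rough sketch of the case-insensitive variant, using the .NET flag RegexOptions.IgnoreCase. The class and method names are placeholders of mine. I have also added RegexOptions.CultureInvariant, which is not part of the answer above; it is there on the assumption that you want the same result regardless of the current culture's casing rules.

```csharp
using System.Text.RegularExpressions;

public static class CaseInsensitiveFilter
{
    // Lowercase-only pattern; IgnoreCase makes a-z cover A-Z as well.
    // CultureInvariant is an added assumption, not part of the original answer.
    private static readonly Regex NonAlphanumeric = new Regex(
        "[^a-z0-9]",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: removes everything the pattern matches.
    public static string Strip(string input)
    {
        return NonAlphanumeric.Replace(input, "");
    }
}
```

For plain ASCII input this should behave the same as the [^a-zA-Z0-9] version, so CaseInsensitiveFilter.Strip("ABC def-42") gives ABCdef42.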

Note: I've seen a few comments referring to Unicode and international characters. I haven't delved into that because I don't want to complicate the discussion, and, frankly, Unicode scares me. If you want the details, you can find them in the docs. For example, you can find out that \W is really equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
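
If you want to see the difference for yourself, a small comparison along these lines may help. It is only a sketch: the class, the method names, and the sample string are invented here, and the behaviour described assumes the default (non-ECMAScript) .NET regex options.

```csharp
using System.Text.RegularExpressions;

public static class UnicodeComparison
{
    // ASCII-only approach from this post: keeps a-z, A-Z and 0-9, nothing else.
    public static string StripAsciiOnly(string input)
    {
        return Regex.Replace(input, "[^a-zA-Z0-9]", "");
    }

    // Unicode-aware approach: \W matches anything that is not a word character,
    // so accented letters and the underscore (category Pc) are kept.
    public static string StripNonWordChars(string input)
    {
        return Regex.Replace(input, @"\W", "");
    }
}
```

Given the input "Caf\u00E9_42!" (an e with an acute accent, then an underscore), StripAsciiOnly returns Caf42, while StripNonWordChars keeps both the accented letter and the underscore and drops only the exclamation mark.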

Comments

  • Anonymous
    November 23, 2005
    Just curious, couldn't you condense the pattern further to this:

    [^a-z\d]

    Since \d is the same as 0-9?
  • Anonymous
    November 23, 2005
    Nick,

    Yes, you could, though strictly \d is not equal to [0-9] but [\p{Nd}], the Unicode equivalent. Probably fine in most cases.
  • Anonymous
    November 23, 2005
    Do you know a regular expression to remove html tags from a string?
  • Anonymous
    November 23, 2005
    MaherJ, what I usually do is create a CDO Message object, set the .HTMLBody property to the HTML string, and read the text equivalent from .TextBody.
  • Anonymous
    November 23, 2005
    Of course if you do not use Unicode categories then there are (e.g.) many digits not being included. :-)