Udostępnij za pośrednictwem


Regular Expressions From Scratch, Part Five: The Regular Expression Language

Now things start to get really weird.

Definition 9: Take any alphabet S. The regular expression alphabet of S is S plus a bunch of extra symbols; it's S ∪ {(, ), *, ∪, Ø} (I assume that none of those symbols are already in S.)

I'm doing something that I said earlier that I would try not to do. I'm using symbols in an alphabet that I also use in expressions that talk about strings in that alphabet. (Of course, I also said that this would all fall apart when we got to regular expressions, and sure enough, it did. Foreshadowing: the sign of a quality blog.)

Again, keep a careful eye on when I'm using fixed-width blue, because those are the "meaningless" symbols, not expression notation.

Definition 10: Take any alphabet S. The regular expression language R of an alphabet S is a language formed from strings of the regular expression language of S, and is defined as follows:

  • Ø is in R.
  • Every member of S is in R
  • If u and w are strings in R then (uw) is in R.
  • Similarly, (u∪w) is in R.
  • Similarly, u* is in R.
  • Nothing else is in R

An example might help. Suppose that S = {a, b}. The regular expression alphabet of S is {a, b, (, ), *, ∪, Ø}. The regular expression language of S is R = {Ø, a, b, (ØØ), (Øa), (Øb), (aØ), (aa), (ab), (bØ), (ba), (bb), (Ø∪Ø), (Ø∪a), (Ø∪b), (a∪Ø), (a∪a), (a∪b), (b∪Ø), (b∪a), (b∪b), Ø*, a*, b*, …}

We've defined an alphabet, we've defined a language -- a language which looks suspiciously like the expressions we've been using to talk about languages. Next time we'll do something insanely clever to bridge the gap between strings in the regular expression language and the languages which these strings would define if they were interpreted as expressions.

Comments

  • Anonymous
    December 01, 2005
    Just wanted to let you know how much I am enjoying this series!
  • Anonymous
    December 01, 2005
    The comment has been removed
  • Anonymous
    December 01, 2005
    The empty set character doesn't render properly in Lucida Console.
  • Anonymous
    December 02, 2005
    I'm really enjoying this series too. Probably my favorite series yet on your blog.
  • Anonymous
    January 08, 2006
    In definition 10, I think the phrase "the regular expression language of S" should be "the regular expression alphabet of S".


  • Anonymous
    January 08, 2006
    Again about definition 10, should we add one more entry to the list as follows:

    If u is a string in R then (u) is in R.
  • Anonymous
    January 24, 2006
    That extra definition would be redundant, and could be removed. (u) === u so (u)* === u* and so on for all of the operation symbols.
  • Anonymous
    January 24, 2006
    In response to your other post, Shawn, definition 10 is correct in calling it the "regular expression language" since definition 4 says that a language is a subset of S* (which matches the notation used) and alphabets must be finite according to definition 3 (i think, maybe definition 2) while R is not.
  • Anonymous
    February 21, 2006
    The comment has been removed
  • Anonymous
    March 03, 2006
    When it comes to validating input regular expression becomes a very important part of your security plan. ...