Поделиться через


Regular Expressions From Scratch, Part Three: Concatenation

You probably intuitively understood concatenation already, but let me define it anyway.

Definition 5: The concatenation of two strings w and u (over the same alphabet) makes a string consisting of the sequence of every element in w followed by every element in u. We write concatenations the same way as before: all run together. So if we have S = { a, b, c} and two strings in S* called w and u, where w=abc and u=cccaaabbb then then wu=abccccaaabbb.

Be very clear on this point: w and u are variables which represent string values. wu is an expression which represents concatenating the value of u onto the end of the value of w. Similarly, waaaa would be the result of concatenating aaaa ont w, so waaaa = abcaaaa.

Recall that I'm using the variable e to represent the empty string. You can concatenate the empty string onto any old string, and the result is unchanged. So we = w, for any string w.

We can concatenate any finite number of strings together with impunity. If v=ab then wuvevaeaea = abccccaaabbbababaaa.

We cannot concatenate an infinite number of non-empty strings together to form a string, because strings are by definition finite sequences.

Note that with this definition we can say that the set S* is the set of all finite concatenations of any members of S, plus the empty string.

That pretty much takes care of making strings out of other strings. We're going to need to make new languages from old languages as well. Can we meaningfully concatenate two languages together? Sure, why not?

Definition 6: Two languages L and K over the same alphabet may be concatenated together form a new language. Concatenated languages are notated as you'd expect: by writing one after the other. If u is in L, and v is in K, then uv is in LK.

If one of those languages is the empty language - the language with no strings, not even the empty string - then the concatenation is also the empty language.

For example, suppose

S = {a, b}
L = {e, a, aa, aaa, aaaa, …}
K = {e, b, bb, bbb, bbbb, …}

then

LK = {w in S* such that w has any number of as followed by any number of bs}

That's a little irksome. We need a more compact notation for "any number of as followed by any number of bs". Fortunately we already have such a notation -- we have the Kleene Closure over alphabets. We'll just create two one-character alphabets, star them to form languages, and concatenate the languages:

LK = {a}*{b}*

Let's make that notation a little more compact. From now on when I say

X = a*

what I mean is

X = {a}*

That is, I'm going to omit the set braces, because you can figure them out from context.

We can write that last example much more compactly:

S = {a, b}
L = a*
K = b*
LK = a*b*

Perhaps you can see that we are in fact heading towards regular expressions.

Next time: The Kleene Closure applies to languages too.

Comments

  • Anonymous
    November 25, 2005
    I noticed that no one had commented on the last RegEx post, or this one, and felt that I should speak up so you know at least one person is reading.

    I'm interested in seeing more of this, and I'm sure plenty of others are, too. I like seeing experienced developers put together in-depth articles/tutorials/whatever-you-want-to-call-them.
  • Anonymous
    November 25, 2005
    I've been reading your posts but am wondering if you actually explained what the Kleene Closure is?
  • Anonymous
    November 25, 2005
    Sembol,

    The Kleene closure was definition 3, from last week:

    http://blogs.msdn.com/ericlippert/archive/2005/11/18/493482.aspx
  • Anonymous
    January 04, 2006
    Good writing. So, based on what you say about the empty string, is it true that

    e* = e

    ???
  • Anonymous
    January 04, 2006
    Yes, this would be an example of our more compact syntax. Clearly {e}* = {e} U {ee} U {eee} ... = {e}, so {e}* = {e}, so e*=e.
  • Anonymous
    January 08, 2006
    The comment has been removed
  • Anonymous
    February 09, 2006
    The comment has been removed
  • Anonymous
    February 21, 2006
    The comment has been removed
  • Anonymous
    March 03, 2006
    When it comes to validating input regular expression becomes a very important part of your security plan. ...