Character Classes

Article
11/03/2006

A character class represents a set of characters that can match an input string. Combine literal characters, escape characters, and character classes to form a regular expression pattern.

Character classes define sets of characters. Some character classes are equivalent to one or more Unicode general category values or Unicode blocks. A Unicode general category defines the broad classification of a character; that is, whether the character is a type of letter, decimal digit, separator, mathematical symbol, punctuation, and so on. For example, the Lu general category represents "Letter, Uppercase" and the Sm category represents "Symbol, Math". For more information, see Supported Unicode General Categories.
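As a rough illustration of general categories, the following sketch looks up the category abbreviation of a few characters. It uses Python's standard `unicodedata` module rather than the .NET Framework, so it shows only the Unicode classification itself, not any regular expression behavior; the sample characters are arbitrary choices.

```python
import unicodedata

# unicodedata.category returns the two-letter Unicode general category
# abbreviation for a single character.
samples = ["A", "a", "7", "+", "!", " "]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

Here "A" reports Lu (Letter, Uppercase) and "+" reports Sm (Symbol, Math), matching the two examples given above.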

A Unicode block is a named range of Unicode code points. The .NET Framework provides a set of named blocks derived from the Unicode block names. For example, the .NET Framework provides the IsBasicLatin named block, which corresponds to the Basic Latin Unicode block and contains characters ranging from U+0000 through U+007F. For more information, see Supported Named Blocks.
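Because a block is only a range of code points, membership can be modeled as a simple range check. The following Python sketch does this for the Basic Latin block; the name `in_basic_latin` is hypothetical and is not part of any .NET or Python API.

```python
# Basic Latin covers U+0000 through U+007F, inclusive.
BASIC_LATIN = range(0x0000, 0x0080)

def in_basic_latin(ch: str) -> bool:
    """Return True if the single character ch lies in the Basic Latin block."""
    return ord(ch) in BASIC_LATIN

print(in_basic_latin("A"))       # True
print(in_basic_latin("\u00e9"))  # False: U+00E9 is in Latin-1 Supplement
```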

The .NET Framework supports character class subtraction expressions, which enable you to define a set of characters as the result of excluding one character class from another. For more information, see Character Class Subtraction.

Character Class Syntax

The following table summarizes the character classes and their syntax.

Character class	Description
[character_group]	(Positive character group.) Matches any character in the specified character group. The character group consists of one or more literal characters, escape characters, character ranges, or character classes that are concatenated. For example, to specify all vowels, use `[aeiou]`. To specify all punctuation and decimal digit characters, use `[\p{P}\d]`.
[^character_group]	(Negative character group.) Matches any character not in the specified character group. The character group consists of one or more literal characters, escape characters, character ranges, or character classes that are concatenated. The leading caret character (^) is mandatory and indicates that the character group is a negative character group instead of a positive character group. For example, to specify all characters except vowels, use `[^aeiou]`. To specify all characters except punctuation and decimal digit characters, use `[^\p{P}\d]`.
[firstCharacter-lastCharacter]	(Character range.) Matches any character in a range of characters. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points. Two or more character ranges can be concatenated. For example, to specify the range of decimal digits from '0' through '9', the range of lowercase letters from 'a' through 'f', and the range of uppercase letters from 'A' through 'F', use `[0-9a-fA-F]`.
.	(The period character.) Matches any character except \n. If modified by the Singleline option, a period character matches any character. For more information, see Regular Expression Options. Note that a period character in a positive or negative character group (a period within square brackets) is treated as a literal period character, not a character class.
\p{name}	Matches any character in the Unicode general category or named block specified by name (for example, Ll, Nd, Z, IsGreek, and IsBoxDrawing).
\P{name}	Matches any character not in the Unicode general category or named block specified by name.
\w	Matches any word character. Equivalent to the Unicode general categories `[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}]`. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to `[a-zA-Z_0-9]`.
\W	Matches any nonword character. Equivalent to the Unicode general categories `[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}]`. If ECMAScript-compliant behavior is specified with the ECMAScript option, \W is equivalent to `[^a-zA-Z_0-9]`.
\s	Matches any white-space character. Equivalent to the escape sequences and Unicode general categories `[\f\n\r\t\v\x85\p{Z}]`. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to `[ \f\n\r\t\v]`.
\S	Matches any non-white-space character. Equivalent to the escape sequences and Unicode general categories `[^\f\n\r\t\v\x85\p{Z}]`. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to `[^ \f\n\r\t\v]`.
\d	Matches any decimal digit. Equivalent to `\p{Nd}` by default. If ECMAScript-compliant behavior is specified with the ECMAScript option, \d is equivalent to `[0-9]`.
\D	Matches any nondigit character. Equivalent to `\P{Nd}` by default. If ECMAScript-compliant behavior is specified with the ECMAScript option, \D is equivalent to `[^0-9]`.
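
Several rows of the table can be tried out with constructs that most regular expression engines share. The following sketch uses Python's `re` module, not the .NET Framework, so it is only an approximation: `re.DOTALL` stands in for the Singleline option, `re.ASCII` is loosely comparable to the ECMAScript option, and `\p{name}` is omitted because Python's `re` does not support it. The input strings are arbitrary.

```python
import re

# Positive group, negative group, and concatenated ranges.
vowels = re.findall(r"[aeiou]", "zebra")
non_vowels = re.findall(r"[^aeiou]", "zebra")
hex_runs = re.findall(r"[0-9a-fA-F]+", "1aF9 xyz")

# The period skips \n unless DOTALL (Python's counterpart of Singleline) is set.
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"a.b", "a\nb", flags=re.DOTALL)

# Inside square brackets the period is a literal character.
literal_dot = re.findall(r"[.]", "a.b")

# \d is Unicode-aware by default; re.ASCII narrows it to [0-9].
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd
digit_unicode = re.fullmatch(r"\d", arabic_three) is not None
digit_ascii = re.fullmatch(r"\d", arabic_three, flags=re.ASCII) is not None

print(vowels, non_vowels, hex_runs)
print(dot_default, dot_all, literal_dot)
print(digit_unicode, digit_ascii)
```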

Supported Unicode General Categories

Unicode defines the general categories and descriptions listed in the following table. For more information, see the "UCD File Format" and "General Category Values" subtopics at the Unicode Character Database.

Category	Description
Lu	Letter, Uppercase
Ll	Letter, Lowercase
Lt	Letter, Titlecase
Lm	Letter, Modifier
Lo	Letter, Other
Mn	Mark, Nonspacing
Mc	Mark, Spacing Combining
Me	Mark, Enclosing
Nd	Number, Decimal Digit
Nl	Number, Letter
No	Number, Other
Pc	Punctuation, Connector
Pd	Punctuation, Dash
Ps	Punctuation, Open
Pe	Punctuation, Close
Pi	Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf	Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po	Punctuation, Other
Sm	Symbol, Math
Sc	Symbol, Currency
Sk	Symbol, Modifier
So	Symbol, Other
Zs	Separator, Space
Zl	Separator, Line
Zp	Separator, Paragraph
Cc	Other, Control
Cf	Other, Format
Cs	Other, Surrogate
Co	Other, Private Use
Cn	Other, Not Assigned (no characters have this property)

The .NET Framework provides additional categories that represent a set of Unicode character categories, as shown in the following table.

Category	Represents
C	(All control characters) Cc, Cf, Cs, Co, and Cn.
L	(All letters) Lu, Ll, Lt, Lm, and Lo.
M	(All diacritic marks) Mn, Mc, and Me.
N	(All numbers) Nd, Nl, and No.
P	(All punctuation) Pc, Pd, Ps, Pe, Pi, Pf, and Po.
S	(All symbols) Sm, Sc, Sk, and So.
Z	(All separators) Zs, Zl, and Zp.
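
Each one-letter category in the preceding table is the union of the two-letter categories that begin with that letter. A minimal Python sketch of that relationship follows; it relies on the standard `unicodedata` module, and the helper name `in_major_category` is hypothetical.

```python
import unicodedata

def in_major_category(ch: str, major: str) -> bool:
    """Return True if ch's general category starts with the letter major."""
    return unicodedata.category(ch).startswith(major)

# Collect every character that falls under P (Pc, Pd, Ps, Pe, Pi, Pf, or Po).
text = "Hi, (you)!"
punctuation = [ch for ch in text if in_major_category(ch, "P")]
print(punctuation)
```

In this input the comma and exclamation mark are Po, the opening parenthesis is Ps, and the closing parenthesis is Pe, so all four are collected under P.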

Supported Named Blocks

The .NET Framework provides the named blocks listed in the following table. The set of supported named blocks is based on Unicode 4.0 and Perl 5.6.

Code point range	Block name
0000 - 007F	IsBasicLatin
0080 - 00FF	IsLatin-1Supplement
0100 - 017F	IsLatinExtended-A
0180 - 024F	IsLatinExtended-B
0250 - 02AF	IsIPAExtensions
02B0 - 02FF	IsSpacingModifierLetters
0300 - 036F	IsCombiningDiacriticalMarks
0370 - 03FF	IsGreek -or- IsGreekandCoptic
0400 - 04FF	IsCyrillic
0500 - 052F	IsCyrillicSupplement
0530 - 058F	IsArmenian
0590 - 05FF	IsHebrew
0600 - 06FF	IsArabic
0700 - 074F	IsSyriac
0780 - 07BF	IsThaana
0900 - 097F	IsDevanagari
0980 - 09FF	IsBengali
0A00 - 0A7F	IsGurmukhi
0A80 - 0AFF	IsGujarati
0B00 - 0B7F	IsOriya
0B80 - 0BFF	IsTamil
0C00 - 0C7F	IsTelugu
0C80 - 0CFF	IsKannada
0D00 - 0D7F	IsMalayalam
0D80 - 0DFF	IsSinhala
0E00 - 0E7F	IsThai
0E80 - 0EFF	IsLao
0F00 - 0FFF	IsTibetan
1000 - 109F	IsMyanmar
10A0 - 10FF	IsGeorgian
1100 - 11FF	IsHangulJamo
1200 - 137F	IsEthiopic
13A0 - 13FF	IsCherokee
1400 - 167F	IsUnifiedCanadianAboriginalSyllabics
1680 - 169F	IsOgham
16A0 - 16FF	IsRunic
1700 - 171F	IsTagalog
1720 - 173F	IsHanunoo
1740 - 175F	IsBuhid
1760 - 177F	IsTagbanwa
1780 - 17FF	IsKhmer
1800 - 18AF	IsMongolian
1900 - 194F	IsLimbu
1950 - 197F	IsTaiLe
19E0 - 19FF	IsKhmerSymbols
1D00 - 1D7F	IsPhoneticExtensions
1E00 - 1EFF	IsLatinExtendedAdditional
1F00 - 1FFF	IsGreekExtended
2000 - 206F	IsGeneralPunctuation
2070 - 209F	IsSuperscriptsandSubscripts
20A0 - 20CF	IsCurrencySymbols
20D0 - 20FF	IsCombiningDiacriticalMarksforSymbols -or- IsCombiningMarksforSymbols
2100 - 214F	IsLetterlikeSymbols
2150 - 218F	IsNumberForms
2190 - 21FF	IsArrows
2200 - 22FF	IsMathematicalOperators
2300 - 23FF	IsMiscellaneousTechnical
2400 - 243F	IsControlPictures
2440 - 245F	IsOpticalCharacterRecognition
2460 - 24FF	IsEnclosedAlphanumerics
2500 - 257F	IsBoxDrawing
2580 - 259F	IsBlockElements
25A0 - 25FF	IsGeometricShapes
2600 - 26FF	IsMiscellaneousSymbols
2700 - 27BF	IsDingbats
27C0 - 27EF	IsMiscellaneousMathematicalSymbols-A
27F0 - 27FF	IsSupplementalArrows-A
2800 - 28FF	IsBraillePatterns
2900 - 297F	IsSupplementalArrows-B
2980 - 29FF	IsMiscellaneousMathematicalSymbols-B
2A00 - 2AFF	IsSupplementalMathematicalOperators
2B00 - 2BFF	IsMiscellaneousSymbolsandArrows
2E80 - 2EFF	IsCJKRadicalsSupplement
2F00 - 2FDF	IsKangxiRadicals
2FF0 - 2FFF	IsIdeographicDescriptionCharacters
3000 - 303F	IsCJKSymbolsandPunctuation
3040 - 309F	IsHiragana
30A0 - 30FF	IsKatakana
3100 - 312F	IsBopomofo
3130 - 318F	IsHangulCompatibilityJamo
3190 - 319F	IsKanbun
31A0 - 31BF	IsBopomofoExtended
31F0 - 31FF	IsKatakanaPhoneticExtensions
3200 - 32FF	IsEnclosedCJKLettersandMonths
3300 - 33FF	IsCJKCompatibility
3400 - 4DBF	IsCJKUnifiedIdeographsExtensionA
4DC0 - 4DFF	IsYijingHexagramSymbols
4E00 - 9FFF	IsCJKUnifiedIdeographs
A000 - A48F	IsYiSyllables
A490 - A4CF	IsYiRadicals
AC00 - D7AF	IsHangulSyllables
D800 - DB7F	IsHighSurrogates
DB80 - DBFF	IsHighPrivateUseSurrogates
DC00 - DFFF	IsLowSurrogates
E000 - F8FF	IsPrivateUse -or- IsPrivateUseArea
F900 - FAFF	IsCJKCompatibilityIdeographs
FB00 - FB4F	IsAlphabeticPresentationForms
FB50 - FDFF	IsArabicPresentationForms-A
FE00 - FE0F	IsVariationSelectors
FE20 - FE2F	IsCombiningHalfMarks
FE30 - FE4F	IsCJKCompatibilityForms
FE50 - FE6F	IsSmallFormVariants
FE70 - FEFF	IsArabicPresentationForms-B
FF00 - FFEF	IsHalfwidthandFullwidthForms
FFF0 - FFFF	IsSpecials

Character Class Subtraction

A character class defines a set of characters. Character class subtraction yields a set of characters that is the result of excluding the characters in one character class from another character class.

A character class subtraction expression has the following form:

[base_group-[excluded_group]]

The square brackets ([]) and hyphen (-) are mandatory. The base_group is a positive or negative character group as described in the Character Class Syntax table. The excluded_group component is another positive or negative character group, or another character class subtraction expression (that is, you can nest character class subtraction expressions).

For example, suppose you have a base group that consists of the character range from 'a' through 'z'. To define the set of characters that consists of the base group except for the character 'm', use [a-z-[m]]. To define the set of characters that consists of the base group except for the set of characters 'd', 'j', and 'p', use [a-z-[djp]]. To define the set of characters that consists of the base group except for the character range from 'm' through 'p', use [a-z-[m-p]].
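
The three expressions above can be modeled as set differences. Python's `re` module does not implement the .NET subtraction syntax, so the sketch below uses plain sets for the model and, as a substitute technique, a negative lookahead to emulate `[a-z-[m-p]]` in a pattern. The variable names are illustrative only.

```python
import re
import string

base = set(string.ascii_lowercase)   # models [a-z]

# Set-difference model of the three expressions in the text.
minus_m = base - set("m")            # [a-z-[m]]
minus_djp = base - set("djp")        # [a-z-[djp]]
minus_m_to_p = base - set("mnop")    # [a-z-[m-p]]

# Emulation in Python's re: the negative lookahead rejects the excluded
# characters before the base class consumes one character.
emulated = re.compile(r"(?![m-p])[a-z]")
matched = set(emulated.findall(string.ascii_lowercase))

print(matched == minus_m_to_p)
```

The lookahead form matches the same characters as the set model, which is why the comparison holds for every lowercase letter.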

Consider the nested character class subtraction expression, [a-z-[d-w-[m-o]]]. The expression is evaluated from the innermost character range outward. First, the character range from 'm' through 'o' is subtracted from the character range 'd' through 'w', which yields the set of characters from 'd' through 'l' and 'p' through 'w'. That set is then subtracted from the character range from 'a' through 'z', which yields the set of characters, [abcmnoxyz].
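
The inside-out evaluation can be checked with ordinary set arithmetic. This Python sketch models only the set semantics of the nested expression; `char_range` is a hypothetical helper, not a .NET or Python library function.

```python
def char_range(first: str, last: str) -> set:
    """Return the set of characters from first through last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

# Innermost subtraction first: [d-w-[m-o]] leaves d-l and p-w.
inner = char_range("d", "w") - char_range("m", "o")

# Then subtract that result from [a-z].
result = char_range("a", "z") - inner

print("".join(sorted(result)))  # abcmnoxyz
```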

You can use any character class with character class subtraction. To define the set of characters that consists of all Unicode characters from \u0000 through \uFFFF except white-space characters (\s), the characters in the punctuation general category (\p{P}), the characters in the IsGreek named block (\p{IsGreek}), and the Unicode NEXT LINE control character (\x85), use [\u0000-\uFFFF-[\s\p{P}\p{IsGreek}\x85]].

Choose character classes for a character class subtraction expression that will yield useful results. Avoid an expression that yields an empty set of characters, which cannot match anything, or an expression that is equivalent to the original base group. For example, the empty set is the result of the expression [\p{IsBasicLatin}-[\x00-\x7F]], which subtracts every character in the IsBasicLatin named block from that same block. Similarly, the original base group is the result of the expression [a-z-[0-9]]. This is because the base group, which is the character range of letters from 'a' through 'z', does not contain any characters in the excluded group, which is the character range of decimal digits from '0' through '9'.
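
One way to catch both degenerate cases before writing a pattern is to compare the sets directly. The following Python sketch is a hypothetical sanity check, not a feature of the .NET Framework; the labels "empty", "no-op", and "useful" are made up for illustration.

```python
def char_range(first: str, last: str) -> set:
    """Return the set of characters from first through last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

def classify_subtraction(base: set, excluded: set) -> str:
    """Label a subtraction as 'empty', 'no-op', or 'useful'."""
    result = base - excluded
    if not result:
        return "empty"    # nothing is left to match
    if result == base:
        return "no-op"    # the excluded group removed nothing
    return "useful"

basic_latin = char_range("\x00", "\x7f")   # models \p{IsBasicLatin}
letters = char_range("a", "z")

print(classify_subtraction(basic_latin, char_range("\x00", "\x7f")))  # empty
print(classify_subtraction(letters, char_range("0", "9")))            # no-op
print(classify_subtraction(letters, char_range("m", "p")))            # useful
```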

Note that XML Schema regular expressions provide similar support for character class subtraction.
