System.Text.Rune struct
This article provides supplementary remarks to the reference documentation for this API.
A Rune instance represents a Unicode scalar value, which means any code point excluding the surrogate range (U+D800..U+DFFF). The type's constructors and conversion operators validate the input, so consumers can call the APIs assuming that the underlying Rune instance is well-formed.
If you aren't familiar with the terms Unicode scalar value, code point, surrogate range, and well-formed, see Introduction to character encoding in .NET.
When to use the Rune type
Consider using the Rune type if your code:
- Calls APIs that require Unicode scalar values
- Explicitly handles surrogate pairs
APIs that require Unicode scalar values
If your code iterates through the char instances in a string or a ReadOnlySpan<char>, some of the char methods won't work correctly on char instances that are in the surrogate range. For example, the following APIs require a scalar value char to work correctly:
- Char.GetNumericValue
- Char.GetUnicodeCategory
- Char.IsDigit
- Char.IsLetter
- Char.IsLetterOrDigit
- Char.IsLower
- Char.IsNumber
- Char.IsPunctuation
- Char.IsSymbol
- Char.IsUpper
The following example shows code that won't work correctly if any of the char instances are surrogate code points:
// THE FOLLOWING METHOD SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
int CountLettersBadExample(string s)
{
int letterCount = 0;
foreach (char ch in s)
{
if (char.IsLetter(ch))
{ letterCount++; }
}
return letterCount;
}
// THE FOLLOWING METHOD SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
let countLettersBadExample (s: string) =
    let mutable letterCount = 0
    for ch in s do
        if Char.IsLetter ch then
            letterCount <- letterCount + 1
    letterCount
Here's equivalent code that works with a ReadOnlySpan<char>:
// THE FOLLOWING METHOD SHOWS INCORRECT CODE.
// DO NOT DO THIS IN A PRODUCTION APPLICATION.
static int CountLettersBadExample(ReadOnlySpan<char> span)
{
int letterCount = 0;
foreach (char ch in span)
{
if (char.IsLetter(ch))
{ letterCount++; }
}
return letterCount;
}
The preceding code works correctly with some languages such as English:
CountLettersBadExample("Hello")
// Returns 5
But it won't work correctly for languages outside the Basic Multilingual Plane, such as Osage:
CountLettersBadExample("𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟")
// Returns 0
The reason this method returns incorrect results for Osage text is that the char instances for Osage letters are surrogate code points. No single surrogate code point has enough information to determine whether it's a letter.
If you change this code to use Rune instead of char, the method works correctly with code points outside the Basic Multilingual Plane:
int CountLetters(string s)
{
int letterCount = 0;
foreach (Rune rune in s.EnumerateRunes())
{
if (Rune.IsLetter(rune))
{ letterCount++; }
}
return letterCount;
}
let countLetters (s: string) =
    let mutable letterCount = 0
    for rune in s.EnumerateRunes() do
        if Rune.IsLetter rune then
            letterCount <- letterCount + 1
    letterCount
Here's equivalent code that works with a ReadOnlySpan<char>:
static int CountLetters(ReadOnlySpan<char> span)
{
int letterCount = 0;
foreach (Rune rune in span.EnumerateRunes())
{
if (Rune.IsLetter(rune))
{ letterCount++; }
}
return letterCount;
}
The preceding code counts Osage letters correctly:
CountLetters("𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟")
// Returns 8
Code that explicitly handles surrogate pairs
Consider using the Rune type if your code calls APIs that explicitly operate on surrogate code points, such as the following methods:
- Char.IsSurrogate
- Char.IsSurrogatePair
- Char.IsHighSurrogate
- Char.IsLowSurrogate
- Char.ConvertFromUtf32
- Char.ConvertToUtf32
For example, the following method has special logic to deal with surrogate char pairs:
static void ProcessStringUseChar(string s)
{
Console.WriteLine("Using char");
for (int i = 0; i < s.Length; i++)
{
if (!char.IsSurrogate(s[i]))
{
Console.WriteLine($"Code point: {(int)(s[i])}");
}
else if (i + 1 < s.Length && char.IsSurrogatePair(s[i], s[i + 1]))
{
int codePoint = char.ConvertToUtf32(s[i], s[i + 1]);
Console.WriteLine($"Code point: {codePoint}");
i++; // so that when the loop iterates it's actually +2
}
else
{
throw new Exception("String was not well-formed UTF-16.");
}
}
}
Such code is simpler if it uses Rune, as in the following example:
static void ProcessStringUseRune(string s)
{
Console.WriteLine("Using Rune");
for (int i = 0; i < s.Length;)
{
if (!Rune.TryGetRuneAt(s, i, out Rune rune))
{
throw new Exception("String was not well-formed UTF-16.");
}
Console.WriteLine($"Code point: {rune.Value}");
i += rune.Utf16SequenceLength; // increment the iterator by the number of chars in this Rune
}
}
When not to use Rune
You don't need to use the Rune type if your code:
- Looks for exact char matches
- Splits a string on a known char value
Using the Rune type may return incorrect results if your code:
- Counts the number of display characters in a string
Look for exact char matches
The following code iterates through a string looking for specific characters, returning the index of the first match. There's no need to change this code to use Rune, as the code is looking for characters that are represented by a single char.
int GetIndexOfFirstAToZ(string s)
{
for (int i = 0; i < s.Length; i++)
{
char thisChar = s[i];
if ('A' <= thisChar && thisChar <= 'Z')
{
return i; // found a match
}
}
return -1; // didn't find 'A' - 'Z' in the input string
}
Split a string on a known char
It's common to call string.Split and use delimiters such as ' ' (space) or ',' (comma), as in the following example:
string inputString = "🐂, 🐄, 🐆";
string[] splitOnSpace = inputString.Split(' ');
string[] splitOnComma = inputString.Split(',');
There's no need to use Rune here, because the code is looking for characters that are represented by a single char.
Count the number of display characters in a string
The number of Rune instances in a string might not match the number of user-perceived characters shown when displaying the string. Since Rune instances represent Unicode scalar values, components that follow the Unicode text segmentation guidelines can use Rune as a building block for counting display characters.
The StringInfo type can be used to count display characters, but it doesn't count correctly in all scenarios for .NET implementations other than .NET 5+.
For more information, see Grapheme clusters.
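For example, the following sketch contrasts the Rune count with the StringInfo text-element count for a combining-character sequence (the string literal is chosen for illustration; StringInfo.LengthInTextElements counts grapheme clusters):

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// two Unicode scalar values, but one user-perceived character.
string decomposed = "e\u0301";

Console.WriteLine(decomposed.EnumerateRunes().Count());             // 2 runes
Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 1 text element
```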
How to instantiate a Rune
There are several ways to get a Rune instance. You can use a constructor to create a Rune directly from:
- A code point:
Rune a = new Rune(0x0061);  // LATIN SMALL LETTER A
Rune b = new Rune(0x10421); // DESERET CAPITAL LETTER ER
- A single char:
Rune c = new Rune('a');
- A surrogate char pair:
Rune d = new Rune('\ud83d', '\udd2e'); // U+1F52E CRYSTAL BALL
All of the constructors throw an ArgumentException if the input doesn't represent a valid Unicode scalar value. There are Rune.TryCreate methods available for callers who don't want exceptions to be thrown on failure.
Rune instances can also be read from existing input sequences. For instance, given a ReadOnlySpan<char> that represents UTF-16 data, the Rune.DecodeFromUtf16 method returns the first Rune instance at the beginning of the input span. The Rune.DecodeFromUtf8 method operates similarly, accepting a ReadOnlySpan<byte> parameter that represents UTF-8 data. There are equivalent methods to read from the end of the span instead of the beginning.
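For example, the following sketch decodes the first scalar value from UTF-16 and from UTF-8 input (the byte values shown are the UTF-8 encoding of U+1F52E):

```csharp
using System;
using System.Buffers;
using System.Text;

ReadOnlySpan<char> utf16 = "\ud83d\udd2e rest"; // starts with U+1F52E CRYSTAL BALL
OperationStatus status = Rune.DecodeFromUtf16(utf16, out Rune fromUtf16, out int charsConsumed);
Console.WriteLine($"{status}, U+{fromUtf16.Value:X4}, {charsConsumed} chars"); // Done, U+1F52E, 2 chars

ReadOnlySpan<byte> utf8 = new byte[] { 0xF0, 0x9F, 0x94, 0xAE }; // UTF-8 for U+1F52E
Rune.DecodeFromUtf8(utf8, out Rune fromUtf8, out int bytesConsumed);
Console.WriteLine($"U+{fromUtf8.Value:X4}, {bytesConsumed} bytes"); // U+1F52E, 4 bytes
```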
Query properties of a Rune
To get the integer code point value of a Rune instance, use the Rune.Value property.
Rune rune = new Rune('\ud83d', '\udd2e'); // U+1F52E CRYSTAL BALL
int codePoint = rune.Value; // = 128302 decimal (= 0x1F52E)
Many of the static APIs available on the char type are also available on the Rune type. For instance, Rune.IsWhiteSpace and Rune.GetUnicodeCategory are the equivalents of the Char.IsWhiteSpace and Char.GetUnicodeCategory methods. The Rune methods correctly handle surrogate pairs.
The following example code takes a ReadOnlySpan<char> as input and trims from both the start and the end of the span every Rune that isn't a letter or a digit.
static ReadOnlySpan<char> TrimNonLettersAndNonDigits(ReadOnlySpan<char> span)
{
// First, trim from the front.
// If any Rune can't be decoded
// (return value is anything other than "Done"),
// or if the Rune is a letter or digit,
// stop trimming from the front and
// instead work from the end.
while (Rune.DecodeFromUtf16(span, out Rune rune, out int charsConsumed) == OperationStatus.Done)
{
if (Rune.IsLetterOrDigit(rune))
{ break; }
span = span[charsConsumed..];
}
// Next, trim from the end.
// If any Rune can't be decoded,
// or if the Rune is a letter or digit,
// break from the loop, and we're finished.
while (Rune.DecodeLastFromUtf16(span, out Rune rune, out int charsConsumed) == OperationStatus.Done)
{
if (Rune.IsLetterOrDigit(rune))
{ break; }
span = span[..^charsConsumed];
}
return span;
}
There are some API differences between char and Rune. For example:
- There is no Rune equivalent to Char.IsSurrogate(Char), since Rune instances by definition can never be surrogate code points.
- The Rune.GetUnicodeCategory method doesn't always return the same result as Char.GetUnicodeCategory, but it does return the same value as CharUnicodeInfo.GetUnicodeCategory. For more information, see the Remarks on Char.GetUnicodeCategory.
Convert a Rune to UTF-8 or UTF-16
Since a Rune is a Unicode scalar value, it can be converted to UTF-8, UTF-16, or UTF-32 encoding. The Rune type has built-in support for conversion to UTF-8 and UTF-16.
The Rune.EncodeToUtf16 method converts a Rune instance to char instances. To query the number of char instances that would result from converting a Rune instance to UTF-16, use the Rune.Utf16SequenceLength property. Similar methods exist for UTF-8 conversion.
The following example converts a Rune instance to a char array. The code assumes you have a Rune instance in the rune variable:
char[] chars = new char[rune.Utf16SequenceLength];
int numCharsWritten = rune.EncodeToUtf16(chars);
Since a string is a sequence of UTF-16 chars, the following example also converts a Rune instance to UTF-16:
string theString = rune.ToString();
The following example converts a Rune instance to a UTF-8 byte array:
byte[] bytes = new byte[rune.Utf8SequenceLength];
int numBytesWritten = rune.EncodeToUtf8(bytes);
The Rune.EncodeToUtf16 and Rune.EncodeToUtf8 methods return the actual number of elements written. They throw an exception if the destination buffer is too short to contain the result. There are non-throwing TryEncodeToUtf8 and TryEncodeToUtf16 methods as well for callers who want to avoid exceptions.
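For example, a minimal sketch of the non-throwing encode pattern (the scalar value is illustrative):

```csharp
using System;
using System.Text;

Rune rune = new Rune(0x1F52E); // U+1F52E CRYSTAL BALL

// Succeeds when the destination buffer is large enough.
Span<char> buffer = new char[rune.Utf16SequenceLength];
if (rune.TryEncodeToUtf16(buffer, out int charsWritten))
{
    Console.WriteLine(charsWritten); // 2
}

// Returns false (rather than throwing) when the destination is too short.
Span<char> tooSmall = new char[1];
Console.WriteLine(rune.TryEncodeToUtf16(tooSmall, out _)); // False
```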
Rune in .NET vs. other languages
The term "rune" is not defined in the Unicode Standard. The term dates back to the creation of UTF-8. Rob Pike and Ken Thompson were looking for a term to describe what would eventually become known as a code point. They settled on the term "rune", and Rob Pike's later influence over the Go programming language helped popularize the term.
However, the .NET Rune type is not the equivalent of the Go rune type. In Go, the rune type is an alias for int32. A Go rune is intended to represent a Unicode code point, but it can be any 32-bit value, including surrogate code points and values that are not legal Unicode code points.
For similar types in other programming languages, see Rust's primitive char type or Swift's Unicode.Scalar type, both of which represent Unicode scalar values. They provide functionality similar to .NET's Rune type, and they disallow instantiation of values that are not legal Unicode scalar values.