Normalization

Article
08/25/2010

Some Unicode characters have multiple equivalent binary representations, each consisting of a set of combining and/or composite Unicode characters. The existence of multiple representations for a single character complicates searching, sorting, matching, and other operations.

The Unicode standard defines a process called normalization that returns one binary representation when given any of the equivalent binary representations of a character. Normalization can be performed with several algorithms, called normalization forms, that obey different rules. The .NET Framework currently supports Unicode normalization forms C, D, KC, and KD.
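As a rough illustration of how the four forms differ, the sketch below normalizes one short string under each form and prints the resulting UTF-16 code units. The class name FormComparison, the helper Hex, and the sample string are assumptions made for this illustration only; they are not part of the .NET Framework.

```csharp
using System;
using System.Text;

static class FormComparison
{
    // Hypothetical helper: lists each UTF-16 code unit of s as four hex digits.
    public static string Hex(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // U+00E9 (LATIN SMALL LETTER E WITH ACUTE) has a canonical decomposition.
        // U+FB01 (LATIN SMALL LIGATURE FI) has only a compatibility decomposition.
        string s = "\u00E9\uFB01";

        Console.WriteLine("Form C:  " + Hex(s.Normalize(NormalizationForm.FormC)));   // 00E9 FB01
        Console.WriteLine("Form D:  " + Hex(s.Normalize(NormalizationForm.FormD)));   // 0065 0301 FB01
        Console.WriteLine("Form KC: " + Hex(s.Normalize(NormalizationForm.FormKC)));  // 00E9 0066 0069
        Console.WriteLine("Form KD: " + Hex(s.Normalize(NormalizationForm.FormKD)));  // 0065 0301 0066 0069
    }
}
```

In this sketch, forms C and D leave the ligature untouched because they apply only canonical mappings, while forms KC and KD also replace it with the two letters it stands for.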

Note:
Two strings normalized to the same normalization form can be compared using ordinal comparison, that is, a character-by-character binary comparison.
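The following minimal sketch shows why this matters: two strings that render as the same character compare as unequal under ordinal comparison until both are brought to the same normalization form. The class name OrdinalComparison and the helper EqualAfterNormalization are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Text;

static class OrdinalComparison
{
    // Hypothetical helper: normalizes both strings to the same form,
    // then compares them code unit by code unit.
    public static bool EqualAfterNormalization(string a, string b, NormalizationForm form)
    {
        return String.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal);
    }

    static void Main()
    {
        string composed = "\u00E9";     // LATIN SMALL LETTER E WITH ACUTE
        string decomposed = "e\u0301";  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

        // The raw strings hold different code units, so ordinal comparison fails.
        Console.WriteLine(String.Equals(composed, decomposed, StringComparison.Ordinal));  // False

        // After normalizing both to the same form, ordinal comparison succeeds.
        Console.WriteLine(EqualAfterNormalization(composed, decomposed, NormalizationForm.FormC));  // True
        Console.WriteLine(EqualAfterNormalization(composed, decomposed, NormalizationForm.FormD));  // True
    }
}
```

Which form to use is a design choice; the only requirement for the comparison to be meaningful is that both operands use the same one.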

For more information about the normalization forms supported by the .NET Framework, see NormalizationForm. For more information about normalization, character decompositions, and equivalence, see Unicode Standard Annex #15, "Unicode Normalization Forms," on the Unicode home page.

Normalizing a String

The application should call the String.Normalize method of a String object to return a new string that is normalized to the default form, normalization form C. Alternatively, the application can call String.Normalize with a NormalizationForm value to return a new string that is normalized specifically to normalization form C, D, KC, or KD.

Testing Whether a String Is Normalized

The application can call the String.IsNormalized method of a String object to determine whether the object's string value is normalized to normalization form C. Alternatively, the application can call String.IsNormalized with a NormalizationForm value to determine whether the string value is normalized specifically to normalization form C, D, KC, or KD.

Example

The following code example demonstrates the IsNormalized and Normalize methods. The example tests whether an original string is in any of the four normalization forms, creates a version of the original string in each normalization form, tests whether each normalized string is in the intended form, and then displays the hexadecimal code point of each character in each normalized string.

' This example demonstrates the String.Normalize method
'                       and the String.IsNormalized method
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Class Sample
   Public Shared Sub Main()
      ' Character c; combining characters acute and cedilla; character 3/4
      Dim s1 = New [String](New Char() {ChrW(&H0063), ChrW(&H0301), ChrW(&H0327), ChrW(&H00BE)})
      Dim s2 As String = Nothing
      Dim divider = New [String]("-"c, 80)
      divider = [String].Concat(Environment.NewLine, divider, Environment.NewLine)

      Try
         Show("s1", s1)
         Console.WriteLine()
         Console.WriteLine("U+0063 = LATIN SMALL LETTER C")
         Console.WriteLine("U+0301 = COMBINING ACUTE ACCENT")
         Console.WriteLine("U+0327 = COMBINING CEDILLA")
         Console.WriteLine("U+00BE = VULGAR FRACTION THREE QUARTERS")

         Console.WriteLine(divider)

         Console.WriteLine("A1) Is s1 normalized to the default form (Form C)?: {0}", s1.IsNormalized())
         Console.WriteLine("A2) Is s1 normalized to Form C?:  {0}", s1.IsNormalized(NormalizationForm.FormC))
         Console.WriteLine("A3) Is s1 normalized to Form D?:  {0}", s1.IsNormalized(NormalizationForm.FormD))
         Console.WriteLine("A4) Is s1 normalized to Form KC?: {0}", s1.IsNormalized(NormalizationForm.FormKC))
         Console.WriteLine("A5) Is s1 normalized to Form KD?: {0}", s1.IsNormalized(NormalizationForm.FormKD))

         Console.WriteLine(divider)

         Console.WriteLine("Set string s2 to each normalized form of string s1.")
         Console.WriteLine()
         Console.WriteLine("U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE")
         Console.WriteLine("U+0033 = DIGIT THREE")
         Console.WriteLine("U+2044 = FRACTION SLASH")
         Console.WriteLine("U+0034 = DIGIT FOUR")
         Console.WriteLine(divider)

         s2 = s1.Normalize()
         Console.Write("B1) Is s2 normalized to the default form (Form C)?: ")
         Console.WriteLine(s2.IsNormalized())
         Show("s2", s2)
         Console.WriteLine()

         s2 = s1.Normalize(NormalizationForm.FormC)
         Console.Write("B2) Is s2 normalized to Form C?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormC))
         Show("s2", s2)
         Console.WriteLine()

         s2 = s1.Normalize(NormalizationForm.FormD)
         Console.Write("B3) Is s2 normalized to Form D?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormD))
         Show("s2", s2)
         Console.WriteLine()

         s2 = s1.Normalize(NormalizationForm.FormKC)
         Console.Write("B4) Is s2 normalized to Form KC?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKC))
         Show("s2", s2)
         Console.WriteLine()

         s2 = s1.Normalize(NormalizationForm.FormKD)
         Console.Write("B5) Is s2 normalized to Form KD?: ")
         Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKD))
         Show("s2", s2)
         Console.WriteLine()

      Catch e As Exception
         Console.WriteLine(e.Message)
      End Try
   End Sub 'Main

   Private Shared Sub Show(title As String, s As String)
      Console.Write("Characters in string {0} = ", title)
      Dim x As Char
      For Each x In  s.ToCharArray()
         Console.Write("{0:X4} ", AscW(x))
      Next x
      Console.WriteLine()
   End Sub 'Show
End Class 'Sample
'
'This example produces the following results:
'
'Characters in string s1 = 0063 0301 0327 00BE
'
'U+0063 = LATIN SMALL LETTER C
'U+0301 = COMBINING ACUTE ACCENT
'U+0327 = COMBINING CEDILLA
'U+00BE = VULGAR FRACTION THREE QUARTERS
'
'--------------------------------------------------------------------------------
'
'A1) Is s1 normalized to the default form (Form C)?: False
'A2) Is s1 normalized to Form C?:  False
'A3) Is s1 normalized to Form D?:  False
'A4) Is s1 normalized to Form KC?: False
'A5) Is s1 normalized to Form KD?: False
'
'--------------------------------------------------------------------------------
'
'Set string s2 to each normalized form of string s1.
'
'U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
'U+0033 = DIGIT THREE
'U+2044 = FRACTION SLASH
'U+0034 = DIGIT FOUR
'
'--------------------------------------------------------------------------------
'
'B1) Is s2 normalized to the default form (Form C)?: True
'Characters in string s2 = 1E09 00BE
'
'B2) Is s2 normalized to Form C?: True
'Characters in string s2 = 1E09 00BE
'
'B3) Is s2 normalized to Form D?: True
'Characters in string s2 = 0063 0327 0301 00BE
'
'B4) Is s2 normalized to Form KC?: True
'Characters in string s2 = 1E09 0033 2044 0034
'
'B5) Is s2 normalized to Form KD?: True
'Characters in string s2 = 0063 0327 0301 0033 2044 0034
'

// This example demonstrates the String.Normalize method
//                       and the String.IsNormalized method

using System;
using System.Text;

class Sample 
{
    public static void Main() 
    {
// Character c; combining characters acute and cedilla; character 3/4
    string s1 = new String( new char[] {'\u0063', '\u0301', '\u0327', '\u00BE'});
    string s2 = null;
    string divider = new String('-', 80);
    divider = String.Concat(Environment.NewLine, divider, Environment.NewLine);

    try 
    {
    Show("s1", s1);
    Console.WriteLine();
    Console.WriteLine("U+0063 = LATIN SMALL LETTER C");
    Console.WriteLine("U+0301 = COMBINING ACUTE ACCENT");
    Console.WriteLine("U+0327 = COMBINING CEDILLA");
    Console.WriteLine("U+00BE = VULGAR FRACTION THREE QUARTERS");
    Console.WriteLine(divider);

    Console.WriteLine("A1) Is s1 normalized to the default form (Form C)?: {0}", 
                                 s1.IsNormalized());
    Console.WriteLine("A2) Is s1 normalized to Form C?:  {0}", 
                                 s1.IsNormalized(NormalizationForm.FormC));
    Console.WriteLine("A3) Is s1 normalized to Form D?:  {0}", 
                                 s1.IsNormalized(NormalizationForm.FormD));
    Console.WriteLine("A4) Is s1 normalized to Form KC?: {0}", 
                                 s1.IsNormalized(NormalizationForm.FormKC));
    Console.WriteLine("A5) Is s1 normalized to Form KD?: {0}", 
                                 s1.IsNormalized(NormalizationForm.FormKD));

    Console.WriteLine(divider);

    Console.WriteLine("Set string s2 to each normalized form of string s1.");
    Console.WriteLine();
    Console.WriteLine("U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE");
    Console.WriteLine("U+0033 = DIGIT THREE");
    Console.WriteLine("U+2044 = FRACTION SLASH");
    Console.WriteLine("U+0034 = DIGIT FOUR");
    Console.WriteLine(divider);

    s2 = s1.Normalize();
    Console.Write("B1) Is s2 normalized to the default form (Form C)?: ");
    Console.WriteLine(s2.IsNormalized());
    Show("s2", s2);
    Console.WriteLine();

    s2 = s1.Normalize(NormalizationForm.FormC);
    Console.Write("B2) Is s2 normalized to Form C?: ");
    Console.WriteLine(s2.IsNormalized(NormalizationForm.FormC));
    Show("s2", s2);
    Console.WriteLine();

    s2 = s1.Normalize(NormalizationForm.FormD);
    Console.Write("B3) Is s2 normalized to Form D?: ");
    Console.WriteLine(s2.IsNormalized(NormalizationForm.FormD));
    Show("s2", s2);
    Console.WriteLine();

    s2 = s1.Normalize(NormalizationForm.FormKC);
    Console.Write("B4) Is s2 normalized to Form KC?: ");
    Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKC));
    Show("s2", s2);
    Console.WriteLine();

    s2 = s1.Normalize(NormalizationForm.FormKD);
    Console.Write("B5) Is s2 normalized to Form KD?: ");
    Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKD));
    Show("s2", s2);
    Console.WriteLine();
    }

    catch (Exception e) 
        {
        Console.WriteLine(e.Message);
        }
    }

    private static void Show(string title, string s)
    {
    Console.Write("Characters in string {0} = ", title);
    // Widen each UTF-16 code unit to int so that it formats as four hexadecimal digits
    foreach(char x in s)
        {
        Console.Write("{0:X4} ", (int)x);
        }
    Console.WriteLine();
    }
}
/*
This example produces the following results:

Characters in string s1 = 0063 0301 0327 00BE

U+0063 = LATIN SMALL LETTER C
U+0301 = COMBINING ACUTE ACCENT
U+0327 = COMBINING CEDILLA
U+00BE = VULGAR FRACTION THREE QUARTERS

--------------------------------------------------------------------------------

A1) Is s1 normalized to the default form (Form C)?: False
A2) Is s1 normalized to Form C?:  False
A3) Is s1 normalized to Form D?:  False
A4) Is s1 normalized to Form KC?: False
A5) Is s1 normalized to Form KD?: False

--------------------------------------------------------------------------------

Set string s2 to each normalized form of string s1.

U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
U+0033 = DIGIT THREE
U+2044 = FRACTION SLASH
U+0034 = DIGIT FOUR

--------------------------------------------------------------------------------

B1) Is s2 normalized to the default form (Form C)?: True
Characters in string s2 = 1E09 00BE

B2) Is s2 normalized to Form C?: True
Characters in string s2 = 1E09 00BE

B3) Is s2 normalized to Form D?: True
Characters in string s2 = 0063 0327 0301 00BE

B4) Is s2 normalized to Form KC?: True
Characters in string s2 = 1E09 0033 2044 0034

B5) Is s2 normalized to Form KD?: True
Characters in string s2 = 0063 0327 0301 0033 2044 0034

*/

// This example demonstrates the String.Normalize method
//                       and the String.IsNormalized method
using namespace System;
using namespace System::Text;
void Show( String^ title, String^ s )
{
   Console::Write( "Characters in string {0} = ", title );
   System::Collections::IEnumerator^ myEnum = s->ToCharArray()->GetEnumerator();
   while ( myEnum->MoveNext() )
   {

      // Unbox the current element, then widen it so that it formats as a hexadecimal number
      Char x = safe_cast<Char>( myEnum->Current );
      Console::Write( "{0:X4} ", Convert::ToInt32( x ) );
   }

   Console::WriteLine();
}

int main()
{

   // Character c; combining characters acute and cedilla; character 3/4
   array<Char>^temp0 = {L'c',L'\u0301',L'\u0327',L'\u00BE'};
   String^ s1 = gcnew String( temp0 );
   String^ s2 = nullptr;
   String^ divider = gcnew String( '-',80 );
   divider = String::Concat( Environment::NewLine, divider, Environment::NewLine );
   try
   {
      Show( "s1", s1 );
      Console::WriteLine();
      Console::WriteLine( "U+0063 = LATIN SMALL LETTER C" );
      Console::WriteLine( "U+0301 = COMBINING ACUTE ACCENT" );
      Console::WriteLine( "U+0327 = COMBINING CEDILLA" );
      Console::WriteLine( "U+00BE = VULGAR FRACTION THREE QUARTERS" );
      Console::WriteLine( divider );
      Console::WriteLine( "A1) Is s1 normalized to the default form (Form C)?: {0}", s1->IsNormalized() );
      Console::WriteLine( "A2) Is s1 normalized to Form C?:  {0}", s1->IsNormalized( NormalizationForm::FormC ) );
      Console::WriteLine( "A3) Is s1 normalized to Form D?:  {0}", s1->IsNormalized( NormalizationForm::FormD ) );
      Console::WriteLine( "A4) Is s1 normalized to Form KC?: {0}", s1->IsNormalized( NormalizationForm::FormKC ) );
      Console::WriteLine( "A5) Is s1 normalized to Form KD?: {0}", s1->IsNormalized( NormalizationForm::FormKD ) );
      Console::WriteLine( divider );
      Console::WriteLine( "Set string s2 to each normalized form of string s1." );
      Console::WriteLine();
      Console::WriteLine( "U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE" );
      Console::WriteLine( "U+0033 = DIGIT THREE" );
      Console::WriteLine( "U+2044 = FRACTION SLASH" );
      Console::WriteLine( "U+0034 = DIGIT FOUR" );
      Console::WriteLine( divider );
      s2 = s1->Normalize();
      Console::Write( "B1) Is s2 normalized to the default form (Form C)?: " );
      Console::WriteLine( s2->IsNormalized() );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormC );
      Console::Write( "B2) Is s2 normalized to Form C?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormC ) );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormD );
      Console::Write( "B3) Is s2 normalized to Form D?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormD ) );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormKC );
      Console::Write( "B4) Is s2 normalized to Form KC?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormKC ) );
      Show( "s2", s2 );
      Console::WriteLine();
      s2 = s1->Normalize( NormalizationForm::FormKD );
      Console::Write( "B5) Is s2 normalized to Form KD?: " );
      Console::WriteLine( s2->IsNormalized( NormalizationForm::FormKD ) );
      Show( "s2", s2 );
      Console::WriteLine();
   }
   catch ( Exception^ e ) 
   {
      Console::WriteLine( e->Message );
   }

}

/*
This example produces the following results:

Characters in string s1 = 0063 0301 0327 00BE

U+0063 = LATIN SMALL LETTER C
U+0301 = COMBINING ACUTE ACCENT
U+0327 = COMBINING CEDILLA
U+00BE = VULGAR FRACTION THREE QUARTERS

--------------------------------------------------------------------------------

A1) Is s1 normalized to the default form (Form C)?: False
A2) Is s1 normalized to Form C?:  False
A3) Is s1 normalized to Form D?:  False
A4) Is s1 normalized to Form KC?: False
A5) Is s1 normalized to Form KD?: False

--------------------------------------------------------------------------------

Set string s2 to each normalized form of string s1.

U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
U+0033 = DIGIT THREE
U+2044 = FRACTION SLASH
U+0034 = DIGIT FOUR

--------------------------------------------------------------------------------

B1) Is s2 normalized to the default form (Form C)?: True
Characters in string s2 = 1E09 00BE

B2) Is s2 normalized to Form C?: True
Characters in string s2 = 1E09 00BE

B3) Is s2 normalized to Form D?: True
Characters in string s2 = 0063 0327 0301 00BE

B4) Is s2 normalized to Form KC?: True
Characters in string s2 = 1E09 0033 2044 0034

B5) Is s2 normalized to Form KD?: True
Characters in string s2 = 0063 0327 0301 0033 2044 0034

*/

// This example demonstrates the String.Normalize method
//                       and the String.IsNormalized method
import System.*;
import System.Text.*;

class Sample
{
    public static void main(String[] args)
    {
        // Character c; combining characters acute and cedilla; character 3/4
        String s1 = new String(new char[] { '\u0063', '\u0301', '\u0327', 
            '\u00BE' });
        String s2 = null;
        String divider = new String('-', 80);
        divider = String.Concat(Environment.get_NewLine(), divider, 
            Environment.get_NewLine());

        try {
            Show("s1", s1);
            Console.WriteLine();
            Console.WriteLine("U+0063 = LATIN SMALL LETTER C");
            Console.WriteLine("U+0301 = COMBINING ACUTE ACCENT");
            Console.WriteLine("U+0327 = COMBINING CEDILLA");
            Console.WriteLine("U+00BE = VULGAR FRACTION THREE QUARTERS");
            Console.WriteLine(divider);

            Console.WriteLine("A1) Is s1 normalized to the default form " 
                + "(Form C)?: {0}", System.Convert.ToString(s1.IsNormalized()));
            Console.WriteLine("A2) Is s1 normalized to Form C?:  {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormC)));
            Console.WriteLine("A3) Is s1 normalized to Form D?:  {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormD)));
            Console.WriteLine("A4) Is s1 normalized to Form KC?: {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormKC)));
            Console.WriteLine("A5) Is s1 normalized to Form KD?: {0}", 
                System.Convert.ToString(s1.
                IsNormalized(NormalizationForm.FormKD)));

            Console.WriteLine(divider);

            Console.WriteLine("Set string s2 to each normalized form of " 
                + "string s1.");
            Console.WriteLine();
            Console.WriteLine("U+1E09 = LATIN SMALL LETTER C WITH CEDILLA " 
                + "AND ACUTE");
            Console.WriteLine("U+0033 = DIGIT THREE");
            Console.WriteLine("U+2044 = FRACTION SLASH");
            Console.WriteLine("U+0034 = DIGIT FOUR");
            Console.WriteLine(divider);

            s2 = s1.Normalize();
            Console.Write("B1) Is s2 normalized to the default form " 
                + "(Form C)?: ");
            Console.WriteLine(s2.IsNormalized());
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormC);
            Console.Write("B2) Is s2 normalized to Form C?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormC));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormD);
            Console.Write("B3) Is s2 normalized to Form D?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormD));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormKC);
            Console.Write("B4) Is s2 normalized to Form KC?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKC));
            Show("s2", s2);
            Console.WriteLine();

            s2 = s1.Normalize(NormalizationForm.FormKD);
            Console.Write("B5) Is s2 normalized to Form KD?: ");
            Console.WriteLine(s2.IsNormalized(NormalizationForm.FormKD));
            Show("s2", s2);
            Console.WriteLine();
        }
        catch (System.Exception e) {
            Console.WriteLine(e.get_Message());
        }
    } //main

    private static void Show(String title, String s)
    {
        Console.Write("Characters in string {0} = ", title);
        char myCharArray[] = s.ToCharArray();
        for (int iCtr = 0; iCtr < myCharArray.length; iCtr++) {
            char c = myCharArray[iCtr];
            Console.Write(((System.Int32)c).ToString("X4") + " ");
        }
        Console.WriteLine();
    } //Show
} //Sample
/*
This example produces the following results:

Characters in string s1 = 0063 0301 0327 00BE

U+0063 = LATIN SMALL LETTER C
U+0301 = COMBINING ACUTE ACCENT
U+0327 = COMBINING CEDILLA
U+00BE = VULGAR FRACTION THREE QUARTERS

--------------------------------------------------------------------------------

A1) Is s1 normalized to the default form (Form C)?: False
A2) Is s1 normalized to Form C?:  False
A3) Is s1 normalized to Form D?:  False
A4) Is s1 normalized to Form KC?: False
A5) Is s1 normalized to Form KD?: False

--------------------------------------------------------------------------------

Set string s2 to each normalized form of string s1.

U+1E09 = LATIN SMALL LETTER C WITH CEDILLA AND ACUTE
U+0033 = DIGIT THREE
U+2044 = FRACTION SLASH
U+0034 = DIGIT FOUR

--------------------------------------------------------------------------------

B1) Is s2 normalized to the default form (Form C)?: True
Characters in string s2 = 1E09 00BE

B2) Is s2 normalized to Form C?: True
Characters in string s2 = 1E09 00BE

B3) Is s2 normalized to Form D?: True
Characters in string s2 = 0063 0327 0301 00BE

B4) Is s2 normalized to Form KC?: True
Characters in string s2 = 1E09 0033 2044 0034

B5) Is s2 normalized to Form KD?: True
Characters in string s2 = 0063 0327 0301 0033 2044 0034

*/

See Also

Concepts

Normalization and Sorting
