Tokenizer Class
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Provides an abstraction for tokenizers, enabling the encoding of text into tokens and the decoding of token IDs back into text.
public abstract class Tokenizer
public class Tokenizer
type Tokenizer = class
Public MustInherit Class Tokenizer
Public Class Tokenizer
- Inheritance
Object → Tokenizer
- Derived
Constructors
Tokenizer() | Initializes a new instance of the Tokenizer class. |
Tokenizer(Model, PreTokenizer, Normalizer) | Creates a new Tokenizer object. |
Properties
Decoder | Gets or sets the Decoder in use by the Tokenizer. |
Model | Gets the Model in use by the Tokenizer. |
Normalizer | Gets the Normalizer in use by the Tokenizer. |
PreTokenizer | Gets the PreTokenizer used by the Tokenizer. |
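The four properties above suggest a staged pipeline: normalize the text, split it into pieces, map pieces to IDs with the model, and decode IDs back to text. The following is a minimal sketch of that idea, assuming a toy vocabulary and whitespace splitting. `ToyPipeline` and all of its members are hypothetical names invented for illustration; they are not part of this library, and the real stages may behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy pipeline; none of these types are the library's own.
var vocab = new Dictionary<string, int> { ["hello"] = 0, ["world"] = 1, ["[UNK]"] = 2 };
var pipeline = new ToyPipeline(vocab);

int[] ids = pipeline.Encode("Hello   WORLD again");
Console.WriteLine(string.Join(", ", ids));   // prints "0, 1, 2"
Console.WriteLine(pipeline.Decode(ids));     // prints "hello world [UNK]"

sealed class ToyPipeline
{
    private readonly Dictionary<string, int> _vocab;
    private readonly Dictionary<int, string> _reverse;

    public ToyPipeline(Dictionary<string, int> vocab)
    {
        _vocab = vocab;
        _reverse = vocab.ToDictionary(kv => kv.Value, kv => kv.Key);
    }

    // Normalizer stage (assumed): canonicalize the raw text, here by lowercasing.
    public static string Normalize(string text) => text.ToLowerInvariant();

    // Pre-tokenizer stage (assumed): split normalized text into word-level pieces.
    public static string[] PreTokenize(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

    // Model stage (assumed): map each piece to an ID, falling back to [UNK].
    public int MapToId(string piece) =>
        _vocab.TryGetValue(piece, out int id) ? id : _vocab["[UNK]"];

    // Runs the three encoding stages in order.
    public int[] Encode(string text) =>
        PreTokenize(Normalize(text)).Select(MapToId).ToArray();

    // Decoder stage (assumed): turn IDs back into text by joining tokens with spaces.
    public string Decode(IEnumerable<int> ids) =>
        string.Join(" ", ids.Select(id => _reverse[id]));
}
```

Note that the round trip is lossy in this sketch: casing, repeated spaces, and out-of-vocabulary words are not recovered by `Decode`.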
Methods
CountTokens(ReadOnlySpan<Char>, Boolean, Boolean) | Gets the number of tokens that the input text will be encoded into. |
CountTokens(String, Boolean, Boolean) | Gets the number of tokens that the input text will be encoded into. |
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings) | Gets the number of tokens that the input text will be encoded into. |
Decode(IEnumerable<Int32>, Boolean) | Decodes the given IDs back to a String. |
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32) | Decodes the given IDs back to text and stores the result in the provided Span<Char> buffer. |
Decode(IEnumerable<Int32>) | Decodes the given IDs back to a String. |
Decode(Int32, Boolean) | Decodes the ID to its mapped token. |
Encode(String) | Encodes input text to an object that holds the token list, the token IDs, and the token offset mapping. |
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean) | Encodes input text to token IDs. |
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) | Encodes input text to token IDs, up to a maximum number of tokens. |
EncodeToIds(String, Boolean, Boolean) | Encodes input text to token IDs. |
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean) | Encodes input text to token IDs, up to a maximum number of tokens. |
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings) | Encodes input text to token IDs. |
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean) | Encodes input text to a list of EncodedTokens. |
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings) | Encodes input text to a list of EncodedTokens. |
EncodeToTokens(String, String, Boolean, Boolean) | Encodes input text to a list of EncodedTokens. |
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) | Finds the index of the maximum encoding capacity without surpassing the token limit. |
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean) | Finds the index of the maximum encoding capacity without surpassing the token limit. |
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32) | Finds the index of the maximum encoding capacity without surpassing the token limit. |
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) | Finds the index of the maximum encoding capacity, counted from the end of the text, without surpassing the token limit. |
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean) | Finds the index of the maximum encoding capacity, counted from the end of the text, without surpassing the token limit. |
TrainFromFiles(Trainer, ReportProgress, String[]) | Trains the tokenizer model using input files. |
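To make the token-counting and index-by-token-count descriptions above concrete, here is a minimal sketch of the idea, assuming a toy tokenizer that treats each space-separated word as one token. `ToyTokenizer` and its method signatures are hypothetical and simplified for illustration; they only borrow the method names from the table and do not reproduce the library's parameters or behavior.

```csharp
using System;
using System.Collections.Generic;

string text = "the quick brown fox jumps";

Console.WriteLine(ToyTokenizer.CountTokens(text));                  // prints "5"

// Longest prefix that fits in a 3-token budget.
int cut = ToyTokenizer.GetIndexByTokenCount(text, 3, out int used);
Console.WriteLine($"{cut} {used} '{text[..cut]}'");                 // prints "15 3 'the quick brown'"

// Longest suffix that fits in a 2-token budget.
int start = ToyTokenizer.GetIndexByTokenCountFromEnd(text, 2, out int usedFromEnd);
Console.WriteLine($"{start} {usedFromEnd} '{text[start..]}'");      // prints "16 2 'fox jumps'"

static class ToyTokenizer
{
    // Splits on spaces and records each token's (offset, length) in the text.
    public static List<(int Offset, int Length)> Tokenize(string text)
    {
        var spans = new List<(int Offset, int Length)>();
        int i = 0;
        while (i < text.Length)
        {
            while (i < text.Length && text[i] == ' ') i++;   // skip separators
            int tokenStart = i;
            while (i < text.Length && text[i] != ' ') i++;   // consume one token
            if (i > tokenStart) spans.Add((tokenStart, i - tokenStart));
        }
        return spans;
    }

    public static int CountTokens(string text) => Tokenize(text).Count;

    // Returns the index just past the last token that fits in the budget,
    // counting tokens from the start of the text.
    public static int GetIndexByTokenCount(string text, int maxTokenCount, out int tokenCount)
    {
        var spans = Tokenize(text);
        tokenCount = Math.Clamp(maxTokenCount, 0, spans.Count);
        if (tokenCount == 0) return 0;
        var last = spans[tokenCount - 1];
        return last.Offset + last.Length;
    }

    // Returns the start index of the first token that fits in the budget,
    // counting tokens from the end of the text.
    public static int GetIndexByTokenCountFromEnd(string text, int maxTokenCount, out int tokenCount)
    {
        var spans = Tokenize(text);
        tokenCount = Math.Clamp(maxTokenCount, 0, spans.Count);
        if (tokenCount == 0) return text.Length;
        return spans[spans.Count - tokenCount].Offset;
    }
}
```

A typical use of such an index is trimming a prompt to a token budget: keep `text[..cut]` to preserve the beginning, or `text[start..]` to preserve the most recent content.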