Tokenizer Class

Definition

Provides an abstraction for tokenizers, enabling the encoding of text into tokens and the decoding of token IDs back into text.

public abstract class Tokenizer
public class Tokenizer
type Tokenizer = class
Public MustInherit Class Tokenizer
Public Class Tokenizer
Inheritance
Tokenizer

Constructors

Tokenizer()

Initializes a new instance of the Tokenizer class.

Tokenizer(Model, PreTokenizer, Normalizer)

Initializes a new instance of the Tokenizer class with the specified Model, PreTokenizer, and Normalizer.

Properties

Decoder

Gets or sets the Decoder in use by the Tokenizer.

Model

Gets the Model in use by the Tokenizer.

Normalizer

Gets the Normalizer in use by the Tokenizer.

PreTokenizer

Gets the PreTokenizer used by the Tokenizer.

Methods

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)

Gets the number of tokens that the input text will be encoded into.

CountTokens(String, Boolean, Boolean)

Gets the number of tokens that the input text will be encoded into.

CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Gets the number of tokens that the input text will be encoded into.

Decode(IEnumerable<Int32>, Boolean)

Decodes the given IDs back to a String.

Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)

Decodes the given IDs back to text and stores the result in the destination span.

Decode(IEnumerable<Int32>)

Decodes the given IDs back to a String.

Decode(Int32, Boolean)

Decodes the ID to its mapped token.

Encode(String)

Encodes input text to an object that holds the token list, the token IDs, and the token offset mapping.

EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)

Encodes input text to token IDs.

EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token IDs, up to a maximum number of tokens.

EncodeToIds(String, Boolean, Boolean)

Encodes input text to token IDs.

EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token IDs, up to a maximum number of tokens.

EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to token IDs.

EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to a list of EncodedTokens.

EncodeToTokens(String, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Finds the index of the maximum encoding capacity without surpassing the token limit.

GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)

Finds the index of the maximum encoding capacity without surpassing the token limit.

GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)

Finds the index of the maximum encoding capacity without surpassing the token limit.

GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Finds the index of the maximum encoding capacity, counted from the end of the text, without surpassing the token limit.

GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)

Finds the index of the maximum encoding capacity, counted from the end of the text, without surpassing the token limit.

TrainFromFiles(Trainer, ReportProgress, String[])

Trains the tokenizer model using input files.
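
The method summaries above are terse, so the following is a minimal sketch of what encoding, counting, decoding, and index-by-token-count could mean in practice. It does not use the Tokenizer class itself: ToyWordTokenizer, its whitespace-splitting rule, and its simplified method signatures are all hypothetical and exist only for illustration. The real overloads listed above take additional parameters (normalization and pre-tokenization options, or an EncodeSettings value) that this sketch omits. It also assumes that the index returned by an index-by-token-count operation is the position immediately after the last character that fits within the token limit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy tokenizer, NOT the library's Tokenizer class.
// Assumption: every whitespace-separated word is one token, and IDs are
// handed out in order of first appearance.
public sealed class ToyWordTokenizer
{
    private readonly Dictionary<string, int> _idByWord = new Dictionary<string, int>();
    private readonly List<string> _wordById = new List<string>();

    // Splits text into words and records the (start offset, length) of each.
    private static List<(int Start, int Length)> Split(string text)
    {
        var spans = new List<(int Start, int Length)>();
        int i = 0;
        while (i < text.Length)
        {
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            int start = i;
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
            if (i > start) spans.Add((start, i - start));
        }
        return spans;
    }

    // Encodes text to token IDs, growing the vocabulary as new words appear.
    public IReadOnlyList<int> EncodeToIds(string text)
    {
        var ids = new List<int>();
        foreach (var (start, length) in Split(text))
        {
            string word = text.Substring(start, length);
            if (!_idByWord.TryGetValue(word, out int id))
            {
                id = _wordById.Count;
                _idByWord[word] = id;
                _wordById.Add(word);
            }
            ids.Add(id);
        }
        return ids;
    }

    // Counts tokens without building the ID list.
    public int CountTokens(string text) => Split(text).Count;

    // Maps IDs back to words and joins them with single spaces.
    public string Decode(IEnumerable<int> ids) =>
        string.Join(" ", ids.Select(id => _wordById[id]));

    // Returns the index just past the last character that fits in
    // maxTokenCount tokens; tokenCount reports how many tokens were used.
    public int GetIndexByTokenCount(string text, int maxTokenCount, out int tokenCount)
    {
        var spans = Split(text);
        tokenCount = Math.Min(maxTokenCount, spans.Count);
        if (tokenCount <= 0)
        {
            tokenCount = 0;
            return 0;
        }
        var last = spans[tokenCount - 1];
        return last.Start + last.Length;
    }
}

public static class Program
{
    public static void Main()
    {
        var tokenizer = new ToyWordTokenizer();
        string text = "the quick brown fox";

        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
        Console.WriteLine(string.Join(",", ids));        // 0,1,2,3
        Console.WriteLine(tokenizer.CountTokens(text));  // 4
        Console.WriteLine(tokenizer.Decode(ids));        // the quick brown fox

        // Truncate the text so that it encodes to at most two tokens.
        int index = tokenizer.GetIndexByTokenCount(text, 2, out int used);
        Console.WriteLine(text.Substring(0, index));     // the quick
        Console.WriteLine(used);                         // 2
    }
}
```

The truncation step shows why an index is more useful than a bare count: the caller can slice the original text at that position and be sure the remaining prefix stays within the token budget. A from-the-end variant would mirror this by returning the start index of the suffix that fits.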
