PreTokenizer Class

Definition

Base class for all pre-tokenizer classes. A PreTokenizer performs the pre-segmentation step, which splits the input text into smaller pieces before tokenization.

public abstract class PreTokenizer
type PreTokenizer = class
Public MustInherit Class PreTokenizer
Inheritance
PreTokenizer

Constructors

PreTokenizer()

Methods

CreateWhiteSpace(IReadOnlyDictionary<String,Int32>)

Creates a new instance of the PreTokenizer class that splits the text at whitespace.

CreateWordOrNonWord(IReadOnlyDictionary<String,Int32>)

Creates a new instance of the PreTokenizer class that splits the text at word and non-word boundaries. A word is a sequence of alphabetic, numeric, and underscore characters.

CreateWordOrPunctuation(IReadOnlyDictionary<String,Int32>)

Creates a new instance of the PreTokenizer class that splits the text at whitespace or punctuation characters.

PreTokenize(ReadOnlySpan<Char>)

Get the offsets and lengths of the tokens relative to the original string.

PreTokenize(String)

Get the offsets and lengths of the tokens relative to the text.
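
To illustrate what "offsets and lengths relative to the text" means, here is a minimal, self-contained sketch that uses only the standard .NET regular expression classes. It is not the library's implementation: the helper name SplitOnWhiteSpace and the pattern `\w+|[^\w\s]+` are assumptions chosen for illustration, and the actual return type of PreTokenize may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PreSegmentationSketch
{
    // Hypothetical helper, not part of the library: yields one (offset, length)
    // pair for each run of word characters and each run of characters that are
    // neither word characters nor whitespace. Whitespace itself is skipped.
    public static IEnumerable<(int Offset, int Length)> SplitOnWhiteSpace(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            yield break;
        }

        // Assumed pattern for illustration only.
        foreach (Match match in Regex.Matches(text, @"\w+|[^\w\s]+"))
        {
            yield return (match.Index, match.Length);
        }
    }

    public static void Main()
    {
        string text = "Hello, world!";

        foreach ((int offset, int length) in SplitOnWhiteSpace(text))
        {
            // Each pair points back into the original text, so the piece
            // can be recovered without copying it during segmentation.
            Console.WriteLine($"{offset}\t{length}\t{text.Substring(offset, length)}");
        }
    }
}
```

For the input "Hello, world!" the sketch yields the pairs (0, 5), (5, 1), (7, 5), and (12, 1), which correspond to "Hello", ",", "world", and "!". Returning positions instead of substrings lets later stages map every token back to its place in the original text.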
