NGramFeaturizer Class

Text transforms that can be performed on data before training a model.

Inheritance
nimbusml.internal.core.feature_extraction.text._ngramfeaturizer.NGramFeaturizer
nimbusml.base_transform.BaseTransform
sklearn.base.TransformerMixin
NGramFeaturizer

Constructor

NGramFeaturizer(language='English', stop_words_remover=None, text_case='Lower', keep_diacritics=False, keep_punctuations=True, keep_numbers=True, output_tokens_column_name=None, dictionary=None, word_feature_extractor={'Name': 'NGram', 'Settings': {'NgramLength': 1, 'SkipLength': 0, 'AllLengths': True, 'MaxNumTerms': [10000000], 'Weighting': 'Tf'}}, char_feature_extractor={'Name': 'NGram', 'Settings': {'NgramLength': 3, 'SkipLength': 0, 'AllLengths': False, 'MaxNumTerms': [10000000], 'Weighting': 'Tf'}}, vector_normalizer='L2', columns=None, **params)

Parameters

Name Description
columns

a dictionary of key-value pairs, where key is the output column name and value is a list of input column names.

  • Only one key-value pair is allowed.

  • Input column type: string.

  • Output column type: Vector Type.

The << operator can be used to set this value (see Column Operator)

For example

  • NGramFeaturizer(columns={'features': ['age', 'parity', 'induced']})

  • NGramFeaturizer() << {'features': ['age', 'parity', 'induced']}

For more details see Columns.

language

Specifies the language used in the data set. The following values are supported:

  • "AutoDetect": for automatic language detection.

  • "English"

  • "French"

  • "German"

  • "Dutch"

  • "Italian"

  • "Spanish"

  • "Japanese"

The default value is "English".

stop_words_remover

Specifies the stopwords remover to use. There are three options supported:

  • None: No stopwords remover is used.

  • PredefinedStopWordsRemover : A precompiled, language-specific list of stop words is used, which includes the most common words from Microsoft Office.

  • CustomStopWordsRemover : A user-defined list of stopwords. It accepts the following option: stopword.

The default value is None.

text_case

Text casing using the rules of the invariant culture. Takes the following values:

  • "Lower"

  • "Upper"

  • "None"

The default value is "Lower".

keep_diacritics

False to remove diacritical marks; True to retain diacritical marks. The default value is False.

keep_punctuations

False to remove punctuation; True to retain punctuation. The default value is True.

keep_numbers

False to remove numbers; True to retain numbers. The default value is True.
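The effect of text_case, keep_diacritics, keep_punctuations and keep_numbers can be illustrated with a small plain-Python sketch. This is not nimbusml code: the function normalize_text is hypothetical, and the character-level filtering it uses is an assumption, so the transform's actual rules (for example, how it treats tokens that mix letters and digits) may differ.

```python
import string
import unicodedata


def normalize_text(text, text_case="Lower", keep_diacritics=False,
                   keep_punctuations=True, keep_numbers=True):
    """Hypothetical stand-in for the normalization options; not nimbusml code."""
    if text_case == "Lower":
        text = text.lower()
    elif text_case == "Upper":
        text = text.upper()
    if not keep_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not keep_punctuations:
        # Assumption: only ASCII punctuation is stripped in this sketch.
        text = "".join(ch for ch in text if ch not in string.punctuation)
    if not keep_numbers:
        # Assumption: digits are removed character by character.
        text = "".join(ch for ch in text if not ch.isdigit())
    return text


print(normalize_text("Café No. 42!"))  # cafe no. 42!
```

With the defaults taken from the constructor signature, only casing and diacritics change; punctuation and digits are retained.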

output_tokens_column_name

Column containing the transformed text tokens.

dictionary

A dictionary of whitelisted terms which accepts the following options:

  • Term: An optional character vector of terms or categories.

  • DropUnknowns: Drop items that are not in the term list.

  • Sort: Specifies how to order items when vectorized. Two orderings are supported:

    • "Occurrence": items appear in the order encountered.

    • "Value": items are sorted according to their default comparison. For example, text sorting will be case sensitive (e.g., 'A' then 'Z' then 'a').

The default value is None. Note that the stopwords list takes precedence over the dictionary whitelist as the stopwords are removed before the dictionary terms are whitelisted.
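The two Sort orderings and the precedence of stopwords over the whitelist can be sketched in plain Python. The helper build_vocabulary and the sample token lists are hypothetical and only mimic the behavior described above; they do not call nimbusml.

```python
def build_vocabulary(tokens, sort="Occurrence"):
    """Collect distinct tokens in the requested order (illustrative only)."""
    seen = list(dict.fromkeys(tokens))  # distinct tokens, first-seen order
    return sorted(seen) if sort == "Value" else seen


tokens = ["a", "Z", "A", "a"]
print(build_vocabulary(tokens, sort="Occurrence"))  # ['a', 'Z', 'A']
print(build_vocabulary(tokens, sort="Value"))       # ['A', 'Z', 'a']

# Stopwords are removed first, so a whitelisted stopword never survives.
stopwords = {"the"}
whitelist = {"the", "cat"}
text_tokens = ["the", "cat", "sat"]
after_stopwords = [t for t in text_tokens if t not in stopwords]
kept = [t for t in after_stopwords if t in whitelist]
print(kept)  # ['cat']
```

Python's default string comparison is by code point, which reproduces the case-sensitive 'A', 'Z', 'a' ordering mentioned for "Value".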

word_feature_extractor

Specifies the word feature extraction arguments. There are two different feature extraction mechanisms:

  • Ngram: Count-based feature extraction.

  • NgramHash: Hashing-based feature extraction.

The default value is Ngram with NgramLength 1 and AllLengths True, as shown in the constructor signature.

char_feature_extractor

Specifies the char feature extraction arguments. There are two different feature extraction mechanisms:

  • Ngram: Count-based feature extraction.

  • NgramHash: Hashing-based feature extraction.

The default value is Ngram with NgramLength 3 and AllLengths False, as shown in the constructor signature.
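As a rough illustration of what the NgramLength and AllLengths settings control, the sketch below enumerates contiguous word and character n-grams in plain Python. The function ngrams is hypothetical, skip-grams (SkipLength greater than 0) are not covered, and the interpretation of AllLengths as "every length from 1 up to NgramLength" is an assumption.

```python
def ngrams(items, ngram_length, all_lengths=False):
    """Return contiguous n-grams; with all_lengths, include every length from 1 up."""
    lengths = range(1, ngram_length + 1) if all_lengths else [ngram_length]
    return [tuple(items[i:i + n])
            for n in lengths
            for i in range(len(items) - n + 1)]


# Word n-grams up to length 2.
words = "the cat sat".split()
print(ngrams(words, 2, all_lengths=True))
# [('the',), ('cat',), ('sat',), ('the', 'cat'), ('cat', 'sat')]

# Character trigrams only.
print(["".join(g) for g in ngrams("cats", 3)])
# ['cat', 'ats']
```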

vector_normalizer

Normalize vectors (rows) individually by rescaling them to unit norm. Takes one of the following values:

  • "None"

  • "L2"

  • "L1"

  • "LInf"

The default value is "L2".
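The normalization options correspond to the usual L1, L2 and L-infinity vector norms. A minimal sketch in plain Python, assuming each row is divided by its norm and that all-zero rows are left unchanged (the helper normalize_vector is hypothetical, not part of nimbusml):

```python
import math


def normalize_vector(values, norm="L2"):
    """Rescale one row to unit norm; all-zero rows are returned unchanged."""
    if norm == "None":
        return list(values)
    if norm == "L1":
        scale = sum(abs(v) for v in values)
    elif norm == "L2":
        scale = math.sqrt(sum(v * v for v in values))
    elif norm == "LInf":
        scale = max((abs(v) for v in values), default=0.0)
    else:
        raise ValueError("unknown norm: " + norm)
    return [v / scale for v in values] if scale else list(values)


row = [3.0, 4.0]
print(normalize_vector(row, "L2"))    # [0.6, 0.8]
print(normalize_vector(row, "L1"))    # entries sum to 1
print(normalize_vector(row, "LInf"))  # [0.75, 1.0]
```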

params

Additional arguments sent to compute engine.

Examples


   ###############################################################################
   # NGramFeaturizer
   from nimbusml import FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.text import NGramFeaturizer
   from nimbusml.feature_extraction.text.extractor import Ngram

   # data input (as a FileDataStream)
   path = get_dataset("wiki_detox_train").as_filepath()

   data = FileDataStream.read_csv(path, sep='\t')
   print(data.head())
   #   Sentiment                                      SentimentText
   # 0          1  ==RUDE== Dude, you are rude upload that carl p...
   # 1          1  == OK! ==  IM GOING TO VANDALIZE WILD ONES WIK...
   # 2          1  Stop trolling, zapatancas, calling me a liar m...
   # 3          1  ==You're cool==  You seem like a really cool g...
   # 4          1  ::::: Why are you threatening me? I'm not bein...

   # transform usage
   xf = NGramFeaturizer(
       word_feature_extractor=Ngram(),
       columns={
           'features': ['SentimentText']})

   # fit and transform
   features = xf.fit_transform(data)

   # print features
   print(features.head())
   #   Sentiment   ...         features.douchiest  features.award.
   # 0          1  ...                        0.0              0.0
   # 1          1  ...                        0.0              0.0
   # 2          1  ...                        0.0              0.0
   # 3          1  ...                        0.0              0.0
   # 4          1  ...                        0.0              0.0

Remarks

The NGramFeaturizer transform produces a matrix of token n-gram and skip-gram counts for a given corpus of text. There are two ways it can do this:

  • build a dictionary of n-grams and use the id in the dictionary as the index in the bag;

  • hash each n-gram and use the hash value as the index in the bag.

The purpose of hashing is to convert variable-length text documents into equal-length numeric feature vectors, to support dimensionality reduction and to make the lookup of feature weights faster.
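The two indexing strategies can be contrasted with a short plain-Python sketch. The functions dictionary_bag and hashed_bag are hypothetical, and CRC32 is used only as a convenient deterministic stand-in; it is not claimed to be the hash function the transform uses.

```python
import zlib


def dictionary_bag(tokens, vocabulary):
    """Count tokens, using each term's position in the vocabulary as its slot."""
    index = {term: i for i, term in enumerate(vocabulary)}
    bag = [0] * len(vocabulary)
    for tok in tokens:
        if tok in index:
            bag[index[tok]] += 1
    return bag


def hashed_bag(tokens, num_slots=16):
    """Count tokens, using a hash of the token modulo num_slots as its slot."""
    bag = [0] * num_slots
    for tok in tokens:
        # CRC32 is an illustrative choice of hash, not the library's.
        bag[zlib.crc32(tok.encode("utf-8")) % num_slots] += 1
    return bag


tokens = "the cat sat on the mat".split()
print(dictionary_bag(tokens, ["cat", "mat", "on", "sat", "the"]))  # [1, 1, 1, 1, 2]
print(hashed_bag(tokens))  # always 16 slots; distinct tokens may collide
```

The dictionary variant needs a pass over the corpus to build the vocabulary, while the hashed variant has a fixed output length regardless of the corpus, at the cost of possible collisions.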

The text transform is applied to text input columns. It offers language detection, tokenization, stopword removal, text normalization and feature generation. It supports the following languages by default: English, French, German, Dutch, Italian, Spanish and Japanese.

The n-grams are represented as count vectors, with vector slots corresponding either to n-grams (created using Ngram) or to their hashes (created using NgramHash). Embedding n-grams in a vector space allows their contents to be compared efficiently. The slot values in the vector can be weighted by the following factors:

  • term frequency - The number of occurrences of the slot in the text

  • inverse document frequency - A ratio (the logarithm of inverse relative slot frequency) that measures the information a slot provides by determining how common or rare it is across the entire text.

  • term frequency-inverse document frequency - The product of the term frequency and the inverse document frequency.
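The three weighting factors can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function weight_documents is hypothetical, the inverse document frequency is taken as log(N / df) with no smoothing, and the names "Idf" and "TfIdf" are assumed by analogy with the "Tf" value in the constructor signature.

```python
import math
from collections import Counter


def weight_documents(docs, weighting="Tf"):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed idf formula: log of the inverse relative document frequency.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        if weighting == "Tf":
            weighted.append(dict(tf))
        elif weighting == "Idf":
            weighted.append({term: idf[term] for term in tf})
        elif weighting == "TfIdf":
            weighted.append({term: count * idf[term] for term, count in tf.items()})
        else:
            raise ValueError("unknown weighting: " + weighting)
    return weighted


docs = [["cat", "sat"], ["cat", "cat", "ran"]]
print(weight_documents(docs, "Tf"))
print(weight_documents(docs, "TfIdf"))
```

In this example "cat" occurs in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing under "Idf" or "TfIdf" weighting.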

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name Description
deep
Default value: False