Ngram Class

Reference

Extracts n-grams from text and converts them to vectors using a dictionary.

Inheritance: nimbusml.internal.core.feature_extraction.text.extractor._ngram.Ngram

Ngram

Constructor

Ngram(ngram_length=1, skip_length=0, all_lengths=True, max_num_terms=[10000000], weighting='Tf', **params)

Parameters

Name	Description
ngram_length	Ngram length.
skip_length	Maximum number of tokens to skip when constructing an n-gram.
all_lengths	Whether to include all n-gram lengths up to ngram_length or only ngram_length.
max_num_terms	Maximum number of n-grams to store in the dictionary.
weighting	The weighting criteria applied to slot values (default 'Tf'; see Remarks).
params	Additional arguments sent to compute engine.
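
As a rough illustration of how ngram_length and all_lengths interact, the sketch below extracts contiguous n-grams from a token list in plain Python. The helper extract_ngrams is hypothetical and not part of nimbusml; it ignores skip_length and makes no claim about the order in which the library itself emits n-grams.

```python
def extract_ngrams(tokens, ngram_length=1, all_lengths=True):
    """Return contiguous n-grams as tuples of tokens (illustrative only)."""
    # With all_lengths=True, collect every length from 1 up to ngram_length;
    # otherwise collect only n-grams of exactly ngram_length tokens.
    lengths = range(1, ngram_length + 1) if all_lengths else [ngram_length]
    ngrams = []
    for n in lengths:
        for start in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[start:start + n]))
    return ngrams


tokens = "you are rude".split()
all_up_to_two = extract_ngrams(tokens, ngram_length=2, all_lengths=True)
only_two = extract_ngrams(tokens, ngram_length=2, all_lengths=False)
print(all_up_to_two)
print(only_two)
```

With all_lengths=True the result contains both unigrams and bigrams; with all_lengths=False it contains bigrams only.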

Examples


   ###############################################################################
   # NGramFeaturizer
   from nimbusml import FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.text import NGramFeaturizer
   from nimbusml.feature_extraction.text.extractor import Ngram

   # data input (as a FileDataStream)
   path = get_dataset("wiki_detox_train").as_filepath()

   data = FileDataStream.read_csv(path, sep='\t')
   print(data.head())
   #   Sentiment                                      SentimentText
   # 0          1  ==RUDE== Dude, you are rude upload that carl p...
   # 1          1  == OK! ==  IM GOING TO VANDALIZE WILD ONES WIK...
   # 2          1  Stop trolling, zapatancas, calling me a liar m...
   # 3          1  ==You're cool==  You seem like a really cool g...
   # 4          1  ::::: Why are you threatening me? I'm not bein...

   # transform usage
   xf = NGramFeaturizer(
       word_feature_extractor=Ngram(),
       columns={
           'features': ['SentimentText']})

   # fit and transform
   features = xf.fit_transform(data)

   # print features
   print(features.head())
   #   Sentiment   ...         features.douchiest  features.award.
   # 0          1  ...                        0.0              0.0
   # 1          1  ...                        0.0              0.0
   # 2          1  ...                        0.0              0.0
   # 3          1  ...                        0.0              0.0
   # 4          1  ...                        0.0              0.0

Remarks

The NGramFeaturizer transform produces a bag of counts of sequences of consecutive words, called n-grams, from a given corpus of text. There are two ways it can do this:

build a dictionary of n-grams and use the id in the dictionary as the index in the bag;

hash each n-gram and use the hash value as the index in the bag.

This class provides the text extractor that implements the first approach. In NGramFeaturizer, specify which text extractor to use through the word_feature_extractor argument.
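
The difference between the two indexing strategies can be sketched in plain Python. The functions dictionary_bag and hashed_bag are hypothetical helpers written for this illustration, and the use of zlib.crc32 as the hash is an assumption; the library's own hashing and dictionary construction may differ.

```python
import zlib


def dictionary_bag(ngrams, dictionary):
    """Count n-grams, indexing each by its id in a growing dictionary."""
    bag = {}
    for gram in ngrams:
        # A new n-gram receives the next free id; a known one reuses its id.
        index = dictionary.setdefault(gram, len(dictionary))
        bag[index] = bag.get(index, 0) + 1
    return bag


def hashed_bag(ngrams, num_slots):
    """Count n-grams, indexing each by a hash value reduced to num_slots."""
    bag = {}
    for gram in ngrams:
        # No dictionary is stored, but distinct n-grams may share a slot.
        index = zlib.crc32(gram.encode("utf-8")) % num_slots
        bag[index] = bag.get(index, 0) + 1
    return bag


ngrams = ["you", "are", "rude", "you"]
dictionary = {}
dict_counts = dictionary_bag(ngrams, dictionary)
hash_counts = hashed_bag(ngrams, num_slots=16)
print(dictionary)
print(dict_counts)
print(hash_counts)
```

The dictionary approach keeps an exact mapping from n-gram to slot at the cost of storing the dictionary, which is what max_num_terms bounds. The hashing approach has a fixed memory footprint but can merge unrelated n-grams into one slot.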

The n-grams are represented as count vectors, with vector slots corresponding to n-grams. Embedding n-grams in a vector space allows their contents to be compared efficiently. The slot values in the vector can be weighted by the following factors:

term frequency - The number of occurrences of the slot in the text.

inverse document frequency - A ratio (the logarithm of inverse relative slot frequency) that measures the information a slot provides by determining how common or rare it is across the entire text.

term frequency-inverse document frequency - The product of term frequency and inverse document frequency.
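
The three weighting factors can be illustrated with a small self-contained calculation. This is a minimal sketch using the textbook definitions (raw counts and an unsmoothed natural logarithm); the exact formulas applied by the compute engine, such as any smoothing or normalization, are not stated here and may differ. The function names tf, idf and tf_idf are hypothetical.

```python
import math


def tf(term, document):
    """Term frequency: number of occurrences of the term in one document."""
    return document.count(term)


def idf(term, corpus):
    """Inverse document frequency: log of (documents / documents with term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, document, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)


corpus = [
    ["you", "are", "rude"],
    ["you", "seem", "cool"],
    ["stop", "trolling"],
    ["you", "are", "cool", "cool"],
]

print(idf("you", corpus))       # common term, low weight
print(idf("trolling", corpus))  # rare term, high weight
print(tf_idf("cool", corpus[3], corpus))
```

A term that appears in most documents, such as "you", receives a small inverse document frequency, while a term confined to one document, such as "trolling", receives a large one.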

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name	Description
deep	Default value: False
