Partilhar via


PredefinedStopWordsRemover Class

Remover with predefined list of stop words.

Inheritance
nimbusml.internal.core.feature_extraction.text.stopwords._predefinedstopwordsremover.PredefinedStopWordsRemover
PredefinedStopWordsRemover

Constructor

PredefinedStopWordsRemover(**params)

Parameters

Name Description
params

Additional arguments sent to compute engine.

Examples


   ###############################################################################
   # NGramFeaturizer
   from nimbusml import FileDataStream
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.text import NGramFeaturizer
   from nimbusml.feature_extraction.text.extractor import Ngram

   # data input (as a FileDataStream)
   path = get_dataset("wiki_detox_train").as_filepath()

   data = FileDataStream.read_csv(path, sep='\t')
   print(data.head())
   #   Sentiment                                      SentimentText
   # 0          1  ==RUDE== Dude, you are rude upload that carl p...
   # 1          1  == OK! ==  IM GOING TO VANDALIZE WILD ONES WIK...
   # 2          1  Stop trolling, zapatancas, calling me a liar m...
   # 3          1  ==You're cool==  You seem like a really cool g...
   # 4          1  ::::: Why are you threatening me? I'm not bein...

   # transform usage
   xf = NGramFeaturizer(
       word_feature_extractor=Ngram(),
       columns={
           'features': ['SentimentText']})

   # fit and transform
   features = xf.fit_transform(data)

   # print features
   print(features.head())
   #   Sentiment   ...         features.douchiest  features.award.
   # 0          1  ...                        0.0              0.0
   # 1          1  ...                        0.0              0.0
   # 2          1  ...                        0.0              0.0
   # 3          1  ...                        0.0              0.0
   # 4          1  ...                        0.0              0.0

Remarks

The NGramFeaturizer transform produces a bag of counts of sequences of consecutive words from a given corpus of text. It also offers stopwords removing. A precompiled language-specific lists of stop words is used in this class that includes the most common words from Microsoft Office.

Methods

get_params

Get the parameters for this operator.

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

Name Description
deep
Default value: False