Article 09/16/2022
Data transformations are used to:
Prepare data for model training.
Apply an imported model in TensorFlow or ONNX format.
Post-process data after it has been passed through a model.
The transformations in this guide return classes that implement the IEstimator interface. Data transformations can be chained together. Each transformation both expects and produces data of specific types and formats, which are specified in the linked reference documentation.
Some data transformations require training data to calculate their parameters. For example: the NormalizeMeanVariance transformer calculates the mean and variance of the training data during the Fit() operation, and uses those parameters in the Transform() operation.
Other data transformations don't require training data. For example: the ConvertToGrayscale transformation can perform the Transform() operation without having seen any training data during the Fit() operation.
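The difference between trainable and stateless transformations, and the way they chain, can be illustrated with a small language-neutral sketch. It is written in Python and is not ML.NET code: the class names (MinMaxEstimator, AbsEstimator) and the helper functions are hypothetical, and the min-max formula is one common convention that may differ from what the library computes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinMaxTransformer:
    """Holds the parameters learned during fit and applies them in transform."""
    lo: float
    hi: float

    def transform(self, values: List[float]) -> List[float]:
        span = self.hi - self.lo
        if span == 0:
            # Degenerate training data: every value was identical.
            return [0.0 for _ in values]
        return [(v - self.lo) / span for v in values]


class MinMaxEstimator:
    """Trainable: fit() has to see data to compute the minimum and maximum."""

    def fit(self, training: List[float]) -> MinMaxTransformer:
        return MinMaxTransformer(lo=min(training), hi=max(training))


class AbsTransformer:
    """Stateless: needs no learned parameters."""

    def transform(self, values: List[float]) -> List[float]:
        return [abs(v) for v in values]


class AbsEstimator:
    """fit() ignores the training data and returns a ready transformer."""

    def fit(self, training: List[float]) -> AbsTransformer:
        return AbsTransformer()


def fit_chain(estimators, training):
    """Fit each estimator on the output of the previous fitted step."""
    fitted = []
    data = list(training)
    for estimator in estimators:
        transformer = estimator.fit(data)
        data = transformer.transform(data)
        fitted.append(transformer)
    return fitted


def transform_chain(fitted, values):
    """Apply already-fitted transformers in order."""
    data = list(values)
    for transformer in fitted:
        data = transformer.transform(data)
    return data
```

Fitting the chain on training data and then calling transform_chain on new rows reuses the minimum and maximum seen during training, which is why values outside the training range can fall outside 0 to 1.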
Column mapping and grouping
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| Concatenate | Concatenate one or more input columns into a new output column | Yes |
| CopyColumns | Copy and rename one or more input columns | Yes |
| DropColumns | Drop one or more input columns | Yes |
| SelectColumns | Select one or more columns to keep from the input data | Yes |
Normalization and scaling
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| NormalizeMeanVariance | Subtract the mean (of the training data) and divide by the variance (of the training data) | Yes |
| NormalizeLogMeanVariance | Normalize based on the logarithm of the training data | Yes |
| NormalizeLpNorm | Scale input vectors by their lp-norm, where p is 1, 2 or infinity. Defaults to the l2 (Euclidean distance) norm | Yes |
| NormalizeGlobalContrast | Scale each value in a row by subtracting the mean of the row data, dividing by either the standard deviation or l2-norm (of the row data), and multiplying by a configurable scale factor (default 2) | Yes |
| NormalizeBinning | Assign the input value to a bin index and divide by the number of bins to produce a float value between 0 and 1. The bin boundaries are calculated to evenly distribute the training data across bins | Yes |
| NormalizeSupervisedBinning | Assign the input value to a bin based on its correlation with the label column | Yes |
| NormalizeMinMax | Scale the input by the difference between the minimum and maximum values in the training data | Yes |
| NormalizeRobustScaling | Scale each value using statistics that are robust to outliers: center the data around 0 and scale it according to the quantile range | Yes |
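As an illustration of the lp-norm scaling described for NormalizeLpNorm, here is a minimal sketch in Python. It is not ML.NET code; the function names are hypothetical, and returning an all-zero vector unchanged is a choice made for this sketch.

```python
import math
from typing import List


def lp_norm(vec: List[float], p: float) -> float:
    """Return the l1, l2, or l-infinity norm of a vector."""
    if not vec:
        return 0.0
    if p == 1:
        return sum(abs(x) for x in vec)
    if p == 2:
        return math.sqrt(sum(x * x for x in vec))
    if p == math.inf:
        return max(abs(x) for x in vec)
    raise ValueError("p must be 1, 2, or math.inf")


def normalize_lp(vec: List[float], p: float = 2) -> List[float]:
    """Scale a vector so that its lp-norm is 1 (default: l2)."""
    norm = lp_norm(vec, p)
    if norm == 0:
        # Nothing to scale; return a copy rather than divide by zero.
        return list(vec)
    return [x / norm for x in vec]
```

Unlike the mean, min-max, or binning normalizers, this scaling works row by row and needs no parameters from the training data.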
Conversions between data types
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ConvertType | Convert the type of an input column to a new type | Yes |
| MapValue | Map values to keys (categories) based on the supplied dictionary of mappings | No |
| MapValueToKey | Map values to keys (categories) by creating the mapping from the input data | Yes |
| MapKeyToValue | Convert keys back to their original values | Yes |
| MapKeyToVector | Convert keys back to vectors of original values | Yes |
| MapKeyToBinaryVector | Convert keys back to a binary vector of original values | No |
| Hash | Hash the value in the input column | Yes |
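The round trip between values and keys can be sketched as follows. This is an illustrative Python sketch, not ML.NET code: the function names are hypothetical, and numbering keys from 1 with 0 reserved for unseen values is an assumption made for this example.

```python
from typing import Dict, Hashable, List, Optional


def fit_value_to_key(values: List[Hashable]) -> Dict[Hashable, int]:
    """Build the mapping from the input data, in order of first occurrence."""
    mapping: Dict[Hashable, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # keys start at 1 in this sketch
    return mapping


def map_value_to_key(values: List[Hashable], mapping: Dict[Hashable, int]) -> List[int]:
    """Replace each value with its key; unseen values become 0."""
    return [mapping.get(value, 0) for value in values]


def map_key_to_value(keys: List[int], mapping: Dict[Hashable, int]) -> List[Optional[Hashable]]:
    """Convert keys back to the original values; unknown keys become None."""
    inverse = {key: value for value, key in mapping.items()}
    return [inverse.get(key) for key in keys]
```

A mapping built this way depends on the data seen during fitting, whereas a MapValue-style transform would take the dictionary as a supplied argument instead.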
Text transformations
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| FeaturizeText | Transform a text column into a float array of normalized ngram and char-gram counts | No |
| TokenizeIntoWords | Split one or more text columns into individual words | Yes |
| TokenizeIntoCharactersAsKeys | Split one or more text columns into individual characters | Yes |
| NormalizeText | Change case, remove diacritical marks, punctuation marks, and numbers | Yes |
| ProduceNgrams | Transform text column into a bag of counts of ngrams (sequences of consecutive words) | Yes |
| ProduceWordBags | Transform text column into a vector of ngram counts (a bag of words) | Yes |
| ProduceHashedNgrams | Transform text column into a vector of hashed ngram counts | No |
| ProduceHashedWordBags | Transform text column into a bag of hashed ngram counts | Yes |
| RemoveDefaultStopWords | Remove default stop words for the specified language from input columns | Yes |
| RemoveStopWords | Remove specified stop words from input columns | Yes |
| LatentDirichletAllocation | Transform a document (represented as a vector of floats) into a vector of floats over a set of topics | Yes |
| ApplyWordEmbedding | Convert vectors of text tokens into sentence vectors using a pretrained model | Yes |
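To make "a bag of counts of ngrams" concrete, the following Python sketch tokenizes text on whitespace and counts n-grams of one fixed length. It is not ML.NET code; the function names are hypothetical, and the library's tokenization rules and defaults (for example, whether shorter n-grams are also counted) may differ.

```python
from collections import Counter
from typing import List, Tuple


def tokenize_into_words(text: str) -> List[str]:
    """Split text into words on whitespace (a simplification)."""
    return text.split()


def produce_ngrams(tokens: List[str], n: int = 2) -> Counter:
    """Count every run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    grams: List[Tuple[str, ...]] = [
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    ]
    return Counter(grams)
```

Calling produce_ngrams(tokenize_into_words("the cat sat on the cat")) counts the bigram ("the", "cat") twice and the other three bigrams once each.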
Categorical data transformations
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| OneHotEncoding | Convert one or more text columns into one-hot encoded vectors | Yes |
| OneHotHashEncoding | Convert one or more text columns into hash-based one-hot encoded vectors | No |
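The two encodings differ in where the slot index comes from: a vocabulary learned from the data, or a hash of the value. A minimal Python sketch, not ML.NET code, with hypothetical function names; zlib.crc32 stands in for whatever hash function the library uses.

```python
import zlib
from typing import List


def fit_one_hot(values: List[str]) -> List[str]:
    """Learn the vocabulary in order of first occurrence."""
    vocabulary: List[str] = []
    for value in values:
        if value not in vocabulary:
            vocabulary.append(value)
    return vocabulary


def one_hot(value: str, vocabulary: List[str]) -> List[int]:
    """One slot per known category; unseen values give an all-zero vector."""
    return [1 if value == known else 0 for known in vocabulary]


def one_hot_hash(value: str, num_slots: int) -> List[int]:
    """Pick the slot by hashing, so no vocabulary has to be learned."""
    index = zlib.crc32(value.encode("utf-8")) % num_slots
    return [1 if i == index else 0 for i in range(num_slots)]
```

The hashed variant has a fixed output width and handles values never seen in training, at the cost of possible collisions between distinct categories.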
Time series data transformations
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| DetectAnomalyBySrCnn | Detect anomalies in the input time series data using the Spectral Residual (SR) algorithm | No |
| DetectChangePointBySsa | Detect change points in time series data using singular spectrum analysis (SSA) | No |
| DetectIidChangePoint | Detect change points in independent and identically distributed (IID) time series data using adaptive kernel density estimations and martingale scores | No |
| ForecastBySsa | Forecast time series data using singular spectrum analysis (SSA) | No |
| DetectSpikeBySsa | Detect spikes in time series data using singular spectrum analysis (SSA) | No |
| DetectIidSpike | Detect spikes in independent and identically distributed (IID) time series data using adaptive kernel density estimations and martingale scores | No |
| DetectEntireAnomalyBySrCnn | Detect anomalies for the entire input data using the SRCNN algorithm | No |
| DetectSeasonality | Detect seasonality using Fourier analysis | No |
| LocalizeRootCause | Localize the root cause from time series input using a decision tree algorithm | No |
| LocalizeRootCauses | Localize root causes from time series input | No |
Missing values
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| IndicateMissingValues | Create a new boolean output column, the value of which is true when the value in the input column is missing | Yes |
| ReplaceMissingValues | Create a new output column, the value of which is set to a default value if the value is missing from the input column, and the input value otherwise | Yes |
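The two missing-value transformations can be sketched in a few lines of Python. This is not ML.NET code: the function names are hypothetical, and treating NaN as the marker for a missing float is an assumption of this sketch.

```python
import math
from typing import List


def indicate_missing_values(values: List[float]) -> List[bool]:
    """True wherever the input value is missing (NaN in this sketch)."""
    return [math.isnan(v) for v in values]


def replace_missing_values(values: List[float], default: float = 0.0) -> List[float]:
    """Substitute the default for missing values; keep the rest unchanged."""
    return [default if math.isnan(v) else v for v in values]
```

Producing the indicator column before replacing the missing values lets a model still see which rows were incomplete.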
Feature selection
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ApproximatedKernelMap | Map each input vector onto a lower dimensional feature space, where inner products approximate a kernel function, so that the features can be used as inputs to the linear algorithms | No |
| ProjectToPrincipalComponents | Reduce the dimensions of the input feature vector by applying the Principal Component Analysis algorithm | |
Calibration transformations
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| Platt(String, String, String) | Transform a binary classifier raw score into a class probability using logistic regression with parameters estimated using the training data | Yes |
| Platt(Double, Double, String) | Transform a binary classifier raw score into a class probability using logistic regression with fixed parameters | Yes |
| Naive | Transform a binary classifier raw score into a class probability by assigning scores to bins, and calculating the probability based on the distribution among the bins | Yes |
| Isotonic | Transform a binary classifier raw score into a class probability by assigning scores to bins, where the position of boundaries and the size of bins are estimated using the training data | No |
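For the fixed-parameter Platt calibrator, the mapping from raw score to probability is a logistic function of the score. The Python sketch below is not ML.NET code and assumes Platt's original parameterization, 1 / (1 + exp(slope * score + offset)); the function name, the parameter names, and the sign convention are assumptions to check against the library reference.

```python
import math


def platt_probability(score: float, slope: float, offset: float) -> float:
    """Map a raw classifier score to a probability with fixed parameters.

    Assumes the form 1 / (1 + exp(slope * score + offset)), so a negative
    slope makes higher scores more probable.
    """
    z = slope * score + offset
    if z > 700:
        # exp(z) would overflow a float; the probability is effectively 0.
        return 0.0
    return 1.0 / (1.0 + math.exp(z))
```

With slope = -1 and offset = 0, a score of 0 maps to a probability of 0.5, and larger scores map to probabilities closer to 1. The trained variant would estimate slope and offset from labeled scores instead of taking them as arguments.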
Deep learning transformations
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| ApplyOnnxModel | Transform the input data with an imported ONNX model | No |
| LoadTensorFlowModel | Transform the input data with an imported TensorFlow model | No |
Custom transformations
| Transform | Definition | ONNX Exportable |
| --- | --- | --- |
| FilterByCustomPredicate | Drop rows where a specified predicate returns true | No |
| FilterByStatefulCustomPredicate | Drop rows where a specified predicate returns true, but allow for a specified state | No |
| CustomMapping | Transform existing columns onto new ones with a user-defined mapping | No |
| Expression | Apply an expression to transform columns into new ones | No |