microsoftml.categorical_hash: Hashes and converts a text column into categories

Article
01/02/2025

Usage

microsoftml.categorical_hash(cols: [str, dict, list],
    hash_bits: int = 16, seed: int = 314489979,
    ordered: bool = True, invert_hash: int = 0,
    output_kind: ['Bag', 'Ind', 'Key', 'Bin'] = 'Bag', **kargs)

Description

Categorical hash transform that can be performed on data before training a model.

Details

categorical_hash converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag. If the input column is a vector, a single indicator bag is returned for it. categorical_hash does not currently support handling factor data.
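The idea of "hashing the value and using the hash as an index in the bag" can be sketched in plain Python. This is an illustration only: `hash_slot` and `indicator_bag` are hypothetical helpers, and `zlib.crc32` is a stand-in, since this page does not say which hash function the library uses.

```python
import zlib

def hash_slot(value: str, hash_bits: int = 16) -> int:
    """Map a category string to a slot index in [0, 2**hash_bits)."""
    # crc32 is only a stand-in; the library's real hash function may differ.
    return zlib.crc32(value.encode("utf-8")) & ((1 << hash_bits) - 1)

def indicator_bag(values, hash_bits: int = 16) -> dict:
    """Return a sparse bag {slot: count} for one value or a vector of values."""
    if isinstance(values, str):
        values = [values]
    bag = {}
    for value in values:
        slot = hash_slot(value, hash_bits)
        bag[slot] = bag.get(slot, 0) + 1
    return bag

# A single category sets exactly one slot.
single = indicator_bag("I hate it")
# A vector of categories is folded into one bag.
vector = indicator_bag(["red", "green", "red"], hash_bits=4)
print(single)
print(vector)
```

Because slots are assigned by a hash rather than by a learned dictionary, two different categories can land in the same slot; a larger `hash_bits` makes that less likely.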

Arguments

cols

A character string or list of variable names to transform. If dict, the keys represent the names of new variables to be created.
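The three accepted shapes of `cols` can be pictured as a mapping from output variable name to input variable name. The helper below, `normalize_cols`, is hypothetical and not part of microsoftml. The dict case follows the description above; the assumption that a plain string or list keeps the original variable name is mine and is not confirmed on this page.

```python
def normalize_cols(cols):
    """Return {output_name: input_name} for a str, list, or dict spec."""
    if isinstance(cols, str):
        # Assumption: a single name is transformed under the same name.
        return {cols: cols}
    if isinstance(cols, dict):
        # Keys are the new variable names, values are the source variables.
        return dict(cols)
    # Assumption: each listed name is transformed under the same name.
    return {name: name for name in cols}

print(normalize_cols("review"))
print(normalize_cols(["review", "title"]))
print(normalize_cols(dict(reviewCat="review")))
```

The dict form is the one used in the example on this page, where `reviewCat` is the new variable built from `review`.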

hash_bits

An integer specifying the number of bits to hash into. Must be between 1 and 30, inclusive. The default value is 16.
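Hashing into `hash_bits` bits gives 2**hash_bits possible slots. The short sketch below makes that arithmetic explicit; `slot_count` is a hypothetical helper. With the default of 16 bits there are 65536 slots, which is consistent with the `num vars: 65537` line in the example log on this page if one extra variable is the bias term (that reading is an inference, not something the page states).

```python
def slot_count(hash_bits: int) -> int:
    """Number of hash slots available for a given bit width."""
    if not 1 <= hash_bits <= 30:
        raise ValueError("hash_bits must be between 1 and 30, inclusive")
    return 1 << hash_bits

for bits in (1, 8, 16, 30):
    print(bits, slot_count(bits))
```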

seed

An integer specifying the hashing seed. The default value is 314489979.
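A seed changes which slot a value lands in while keeping the mapping deterministic for a fixed seed. The sketch below shows that behavior with `zlib.crc32`, which accepts a starting value; both the function `seeded_slot` and the choice of hash are assumptions made for illustration.

```python
import zlib

DEFAULT_SEED = 314489979  # the default value documented for this argument

def seeded_slot(value: str, seed: int = DEFAULT_SEED, hash_bits: int = 16) -> int:
    """Hash a value with a seed and keep the low hash_bits bits."""
    # zlib.crc32 takes a starting value, used here as a stand-in seed.
    full_hash = zlib.crc32(value.encode("utf-8"), seed)
    return full_hash & ((1 << hash_bits) - 1)

first = seeded_slot("I hate it")
again = seeded_slot("I hate it")
other = seeded_slot("I hate it", seed=1)
print(first == again)  # same seed, same slot
print(first, other)    # a different seed usually gives a different slot
```

Keeping the seed fixed between training and scoring matters, because the model's weights are tied to slot positions.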

ordered

True to include the position of each term in the hash. Otherwise, False. The default value is True.
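One way to picture the effect of `ordered` is that the position of a term becomes part of the key that gets hashed. The sketch below is an assumption about the mechanism, not the library's implementation; `hash_keys` and `term_slots` are hypothetical helpers and `zlib.crc32` is a stand-in hash.

```python
import zlib

def hash_keys(terms, ordered=True):
    """Build the strings that get hashed, optionally tagged with position."""
    if ordered:
        return [f"{position}:{term}" for position, term in enumerate(terms)]
    return list(terms)

def term_slots(terms, ordered=True, hash_bits=16):
    """Hash each key into a slot index."""
    mask = (1 << hash_bits) - 1
    return [zlib.crc32(key.encode("utf-8")) & mask
            for key in hash_keys(terms, ordered)]

a = ["love", "it"]
b = ["it", "love"]

# Without positions, both vectors hash the same set of keys.
print(sorted(hash_keys(a, ordered=False)) == sorted(hash_keys(b, ordered=False)))
# With positions, the keys differ, so the slots generally differ too.
print(sorted(hash_keys(a, ordered=True)) == sorted(hash_keys(b, ordered=True)))
print(term_slots(a), term_slots(b))
```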

invert_hash

An integer specifying the limit on the number of keys that can be used to generate the slot name. 0 means no hash inversion; -1 means no limit. While a value of zero gives better performance, a nonzero value is needed to get meaningful coefficient names. The default value is 0.
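Hashing is one-way, so readable slot names require remembering which original keys produced each slot. The following sketch shows one possible reading of the limit, applied per slot; `build_slot_names` is a hypothetical helper and `zlib.crc32` is a stand-in hash.

```python
import zlib
from collections import defaultdict

def build_slot_names(values, invert_hash=0, hash_bits=16):
    """Return {slot: [original keys]}, keeping at most invert_hash keys per slot.

    invert_hash == 0 keeps nothing; invert_hash == -1 keeps every key.
    """
    if invert_hash == 0:
        return {}
    mask = (1 << hash_bits) - 1
    names = defaultdict(list)
    for value in values:
        slot = zlib.crc32(value.encode("utf-8")) & mask
        keys = names[slot]
        if value in keys:
            continue  # already recorded for this slot
        if invert_hash == -1 or len(keys) < invert_hash:
            keys.append(value)
    return dict(names)

values = ["a", "b", "c"]
# hash_bits=1 gives only two slots, so at least two keys must share a slot.
print(build_slot_names(values, invert_hash=0, hash_bits=1))
print(build_slot_names(values, invert_hash=-1, hash_bits=1))
print(build_slot_names(values, invert_hash=1, hash_bits=1))
```

The extra bookkeeping is the reason a value of zero is cheaper: nothing has to be stored alongside the hashed features.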

output_kind

A character string that specifies the kind of output.

"Bag": Outputs a multi-set vector. If the input column is a vector of categories, the output contains one vector, where the value in each slot is the number of occurrences of the category in the input vector. If the input column contains a single category, the indicator vector and the bag vector are equivalent.
"Ind": Outputs an indicator vector. The input column is a vector of categories, and the output contains one indicator vector per slot in the input column.
"Key": Outputs an index. The output is an integer ID (between 1 and the number of categories in the dictionary) of the category.
"Bin": Outputs a vector which is the binary representation of the category.

The default value is "Bag".
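The four output kinds can be compared side by side on a tiny input. The sketch below works on 1-based category IDs that are assumed to be already known; `encode` is a hypothetical helper, and the bit width and bit order used for "Bin" are assumptions, since the page does not specify the exact layout.

```python
def encode(category_ids, num_categories, output_kind="Bag"):
    """Encode 1-based category IDs in the style of each output kind."""
    if output_kind == "Key":
        # One integer ID per input slot.
        return list(category_ids)
    if output_kind == "Ind":
        # One indicator vector per input slot.
        vectors = []
        for key in category_ids:
            vector = [0] * num_categories
            vector[key - 1] = 1
            vectors.append(vector)
        return vectors
    if output_kind == "Bag":
        # A single vector of occurrence counts.
        bag = [0] * num_categories
        for key in category_ids:
            bag[key - 1] += 1
        return bag
    if output_kind == "Bin":
        # Assumed layout: fixed-width binary digits, most significant first.
        width = max(1, num_categories.bit_length())
        return [[int(bit) for bit in format(key, f"0{width}b")]
                for key in category_ids]
    raise ValueError(f"unknown output_kind: {output_kind!r}")

ids = [1, 3, 1]
for kind in ("Bag", "Ind", "Key", "Bin"):
    print(kind, encode(ids, 4, kind))
```

For a single category, the indicator vector and the bag coincide, as the description of "Bag" says: `encode([2], 4, "Ind")[0]` and `encode([2], 4, "Bag")` are the same list.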

kargs

Additional arguments sent to the compute engine.

Returns

An object defining the transform.

See also

categorical

Example

'''
Example on rx_logistic_regression and categorical_hash.
'''
import numpy
import pandas
from microsoftml import rx_logistic_regression, categorical_hash, rx_predict
from microsoftml.datasets.datasets import get_dataset

movie_reviews = get_dataset("movie_reviews")

train_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Do not like it", "Really like it",
        "I hate it", "I like it a lot", "I kind of hate it", "I do like it",
        "I really hate it", "It is very good", "I hate it a bunch", "I love it a bunch",
        "I hate it", "I like it very much", "I hate it very much.",
        "I really do love it", "I really do hate it", "Love it!", "Hate it!",
        "I love it", "I hate it", "I love it", "I hate it", "I love it"],
    like=[True, False, True, False, True, False, True, False, True, False,
        True, False, True, False, True, False, True, False, True, False, True,
        False, True, False, True]))
        
test_reviews = pandas.DataFrame(data=dict(
    review=[
        "This is great", "I hate it", "Love it", "Really like it", "I hate it",
        "I like it a lot", "I love it", "I do like it", "I really hate it", "I love it"]))


# Use a categorical hash transform.
out_model = rx_logistic_regression("like ~ reviewCat",
                data=train_reviews,
                ml_transforms=[categorical_hash(cols=dict(reviewCat="review"))])
                
# The learned weights are similar to those obtained with the categorical transform.
print(out_model.coef_)

# Use the model to score.
source_out_df = rx_predict(out_model, data=test_reviews, extra_vars_to_write=["review"])
print(source_out_df.head())

Output:

Not adding a normalizer.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 25, Read Time: 0, Transform Time: 0
Beginning processing data.
LBFGS multi-threading will attempt to load dataset into memory. In case of out-of-memory issues, turn off multi-threading by setting trainThreads to 1.
Warning: Too few instances to use 4 threads, decreasing to 1 thread(s)
Beginning optimization
num vars: 65537
improvement criterion: Mean Improvement
L1 regularization selected 3 of 65537 weights.
Not training a calibrator because it is not needed.
Elapsed time: 00:00:00.1209392
Elapsed time: 00:00:00.0190134
OrderedDict([('(Bias)', 0.2132447361946106), ('f1783', -0.7939924597740173), ('f38537', 0.1968022584915161)])
Beginning processing data.
Rows Read: 10, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0284223
Finished writing 10 rows.
Writing completed.
           review PredictedLabel     Score  Probability
0   This is great           True  0.213245     0.553110
1       I hate it          False -0.580748     0.358761
2         Love it           True  0.213245     0.553110
3  Really like it           True  0.213245     0.553110
4       I hate it          False -0.580748     0.358761
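The scores and probabilities in the table can be reproduced by hand from the printed coefficients. The sketch below assumes that the probability is the logistic sigmoid of the score and that "I hate it" activates only the `f1783` slot; both are inferences from the numbers in the log, not statements made by the page.

```python
import math

# Coefficients copied from the printed OrderedDict in the log.
bias = 0.2132447361946106
w_f1783 = -0.7939924597740173

def sigmoid(score: float) -> float:
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))

# Reviews that hit no selected slot get the bias alone.
score_bias_only = bias
# "I hate it" appears to add the f1783 weight to the bias.
score_hate = bias + w_f1783

print(round(score_bias_only, 6), round(sigmoid(score_bias_only), 6))
print(round(score_hate, 6), round(sigmoid(score_hate), 6))
```

Under these assumptions, the computed values agree with the Score and Probability columns to the precision shown. L1 regularization kept only 3 of the 65537 weights, which would explain why several different reviews share the bias-only score.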
