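Tutorial: Create, evaluate, and score a text classification model

Article
01/28/2025

This tutorial presents an end-to-end example of a Synapse Data Science workflow for a text classification model in Microsoft Fabric. The scenario uses word2vec and logistic regression on Spark to determine the genre of a book in the British Library book dataset, based solely on the book's title.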

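This tutorial covers these steps:

Install custom libraries
Load the data
Understand and process the data with exploratory data analysis
Train a machine learning model with word2vec and logistic regression, and track experiments with MLflow and the Fabric autologging feature
Load the machine learning model for scoring and predictions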

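Prerequisites

Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher at the bottom left of the home page to switch to Fabric.

If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.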

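Follow along in a notebook

You can choose one of these options to follow along in a notebook:

Open and run the built-in notebook.
Upload your notebook from GitHub.

Open the built-in notebook

The sample Title genre classification notebook accompanies this tutorial.

To open the sample notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials.
Make sure to attach a lakehouse to the notebook before you start running code.

Import the notebook from GitHub

The AIsample - Title Genre Classification.ipynb notebook accompanies this tutorial.

To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials to import the notebook to your workspace.
If you'd rather copy and paste the code from this page, you can create a new notebook.
Make sure to attach a lakehouse to the notebook before you start running code.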

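Step 1: Install custom libraries

For machine learning model development or ad-hoc data analysis, you might need to quickly install a custom library for your Apache Spark session. You have two options to install libraries.

To install a library in your current notebook only, use the inline installation capabilities (%pip or %conda) of your notebook.
Alternatively, you can create a Fabric environment, install libraries from public sources or upload custom libraries to it, and then your workspace admin can attach the environment as the default for the workspace. All the libraries in the environment then become available to any notebook and Spark job definition in the workspace. For more information about environments, see Create, configure, and use an environment in Microsoft Fabric.

For the classification model, use the wordcloud library to represent word frequency in text, where the size of a word represents its frequency. For this tutorial, use %pip install to install wordcloud in your notebook.

Note

The PySpark kernel restarts after %pip install runs. Install the libraries you need before you run any other cells.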

# Install wordcloud for text visualization by using pip
%pip install wordcloud

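Step 2: Load the data

The dataset contains metadata about books from the British Library that were digitized through a collaboration between the library and Microsoft. The metadata includes classification information that indicates whether a book is fiction or nonfiction. With this dataset, the goal is to train a classification model that determines the genre of a book based only on its title.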

BL record ID	Type of resource	Name	Dates associated with name	Type of name	Role	All names	Title	Variant titles	Series title	Number within series	Country of publication	Place of publication	Publisher	Date of publication	Edition	Physical description	Dewey classification	BL shelfmark	Topics	Genre	Languages	Notes	BL record ID for physical resource	classification_id	user_id	created_at	subject_ids	annotator_date_pub	annotator_normalised_date_pub	annotator_edition_statement	annotator_genre	annotator_FAST_genre_terms	annotator_FAST_subject_terms	annotator_comments	annotator_main_language	annotator_other_languages_summaries	annotator_summaries_language	annotator_translation	annotator_original_language	annotator_publisher	annotator_place_pub	annotator_country	annotator_title	Link to digitised book	annotated
014602826	Monograph	Yearsley, Ann	1753-1806	person		More, Hannah, 1745-1833 [person]; Yearsley, Ann, 1753-1806 [person]	Poems on several occasions [With a prefatory letter by Hannah More.]				England	London		1786	Fourth edition MANUSCRIPT note			Digital Store 11644.d.32			English		003996603																						False
014602830	Monograph	A, T.		person		Oldham, John, 1653-1683 [person]; A, T. [person]	A Satyr against Vertue. (A poem: supposed to be spoken by a Town-Hector [By John Oldham. The preface signed: T. A.])				England	London		1679		15 pages (4°)		Digital Store 11602.ee.10. (2.)			English		000001143																						False

Define the following parameters so that you can apply this notebook to different datasets:

IS_CUSTOM_DATA = False  # If True, the user must manually upload the dataset
DATA_FOLDER = "Files/title-genre-classification"
DATA_FILE = "blbooksgenre.csv"

# Data schema
TEXT_COL = "Title"
LABEL_COL = "annotator_genre"
LABELS = ["Fiction", "Non-fiction"]

EXPERIMENT_NAME = "sample-aisample-textclassification"  # MLflow experiment name

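Download the dataset and upload to the lakehouse

This code downloads a publicly available version of the dataset, and then stores it in a Fabric lakehouse.

Important

Add a lakehouse to the notebook before you run it. Otherwise, the code raises an error.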

if not IS_CUSTOM_DATA:
    # Download demo data files into the lakehouse, if they don't exist
    import os, requests

    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/Title_Genre_Classification"
    fname = "blbooksgenre.csv"
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        # Add a lakehouse, if no default lakehouse was added to the notebook
        # A new notebook won't link to any lakehouse by default
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    if not os.path.exists(f"{download_path}/{fname}"):
        r = requests.get(f"{remote_url}/{fname}", timeout=30)
        r.raise_for_status()  # Stop here instead of saving an HTTP error page as the CSV
        with open(f"{download_path}/{fname}", "wb") as f:
            f.write(r.content)
    print("Downloaded demo data files into lakehouse.")
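
The download cell above only fetches the file when it isn't already in the lakehouse. The following is a minimal, standard-library sketch of that "write only if missing" pattern. The helper name ensure_file and the fake_fetch function are hypothetical, used for illustration in place of the real HTTP request and lakehouse path.

```python
import os
import tempfile


def ensure_file(path, fetch):
    """Write the bytes returned by fetch() to path, unless path already exists.

    Returns True if the file was written, False if it was already there.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(fetch())
    return True


# Hypothetical stand-in for the HTTP download; it counts how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return b"Title,annotator_genre\n"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "raw", "blbooksgenre.csv")
    first = ensure_file(target, fake_fetch)   # file is missing, so it is written
    second = ensure_file(target, fake_fetch)  # file exists, so fetch is skipped

print(first, second, len(calls))
```

Because the second call finds the file on disk, rerunning the cell doesn't repeat the download.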

Import required libraries

Before any processing, you need to import the required libraries, including the libraries for Spark and SynapseML:

import numpy as np
from itertools import chain

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, StringIndexer, Tokenizer, Word2Vec
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator,
)

from synapse.ml.stages import ClassBalancer
from synapse.ml.train import ComputeModelStatistics

import mlflow

Define hyperparameters

Define some hyperparameters for model training.

Important

Modify these hyperparameters only if you understand each parameter.

# Hyperparameters 
word2vec_size = 128  # The length of the vector for each word
min_word_count = 3  # The minimum number of times that a word must appear to be considered
max_iter = 10  # The maximum number of training iterations
k_folds = 3  # The number of folds for cross-validation

Start recording the time needed to run this notebook:

# Record the notebook running time
import time

ts = time.time()

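Set up MLflow experiment tracking

Autologging extends the MLflow logging capabilities. When you train a machine learning model, autologging automatically captures the input parameter values and output metrics of that model, and then logs this information to the workspace. In the workspace, you can access and visualize the information with the MLflow APIs, or with the corresponding experiment. For more information about autologging, see Autologging in Microsoft Fabric.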

# Set up MLflow for experiment tracking

mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True)  # Disable MLflow autologging

To disable Microsoft Fabric autologging in a notebook session, call mlflow.autolog() and set disable=True, as the preceding code does.

Read raw data from the lakehouse

raw_df = spark.read.csv(f"{DATA_FOLDER}/raw/{DATA_FILE}", header=True, inferSchema=True)

Step 3: Perform exploratory data analysis

Explore the dataset with the display command to view high-level statistics for the dataset and to show the chart views:

display(raw_df.limit(20))

Prepare the data

Clean the data: keep only the rows whose label is one of the values in LABELS, and remove duplicate titles:

df = (
    raw_df.select([TEXT_COL, LABEL_COL])
    .where(F.col(LABEL_COL).isin(LABELS))
    .dropDuplicates([TEXT_COL])
    .cache()
)

display(df.limit(20))
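
To make the cleaning step concrete, here is a plain-Python sketch of the same two operations (label filter, then de-duplication by title) on a few toy rows. The rows and the helper name clean are made up for illustration. Unlike this sketch, which keeps the first occurrence, Spark's dropDuplicates doesn't guarantee which of the duplicate rows survives.

```python
LABELS = ["Fiction", "Non-fiction"]

# Toy rows of (title, genre); these values are illustrative, not from the dataset.
rows = [
    ("Poems on several occasions", "Non-fiction"),
    ("A tale of the sea", "Fiction"),
    ("A tale of the sea", "Fiction"),  # duplicate title
    ("Untitled pamphlet", None),       # no usable label
]


def clean(rows, labels):
    """Keep rows with a known label, and keep only the first row per title."""
    seen_titles = set()
    kept = []
    for title, genre in rows:
        if genre not in labels:
            continue  # same role as .where(F.col(LABEL_COL).isin(LABELS))
        if title in seen_titles:
            continue  # same role as .dropDuplicates([TEXT_COL])
        seen_titles.add(title)
        kept.append((title, genre))
    return kept


cleaned = clean(rows, LABELS)
print(cleaned)
```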

Apply class balancing to address any bias:

# Create a ClassBalancer instance, and set the input column to LABEL_COL
cb = ClassBalancer().setInputCol(LABEL_COL)

# Fit the ClassBalancer instance to the input DataFrame, and transform the DataFrame
df = cb.fit(df).transform(df)

# Display the first 20 rows of the transformed DataFrame
display(df.limit(20))
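
The balancing step adds a weight column so that rows from the rarer class count more during training. The sketch below assumes an inverse-frequency scheme (the largest class count divided by each class count). That formula is an assumption made for illustration; check the SynapseML documentation for the exact weights that ClassBalancer produces.

```python
from collections import Counter


def balance_weights(labels):
    """Return a weight per class: largest class count / this class's count."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


# Toy label column: six nonfiction rows and two fiction rows.
labels = ["Non-fiction"] * 6 + ["Fiction"] * 2
weights = balance_weights(labels)
print(weights)
```

Under this scheme, the majority class gets weight 1.0 and the minority class gets a proportionally larger weight, so both classes contribute the same total weight to the loss.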

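Split the paragraphs and sentences into smaller units to tokenize the dataset. This makes it easier to assign meaning. Then remove the stop words to improve performance. Stop word removal removes words that commonly occur across all documents in the corpus. It is one of the most commonly used preprocessing steps in natural language processing (NLP) applications.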

# Text transformer
tokenizer = Tokenizer(inputCol=TEXT_COL, outputCol="tokens")
stopwords_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")

# Build the pipeline
pipeline = Pipeline(stages=[tokenizer, stopwords_remover])

token_df = pipeline.fit(df).transform(df)

display(token_df.limit(20))
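
As a rough illustration of what the two transformers do to a single title, the following sketch lowercases the text, splits it on whitespace, and drops stop words. The STOP_WORDS set here is a tiny hypothetical list; Spark's StopWordsRemover ships with a much longer default list.

```python
# Tiny hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "by", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in stop_words]


title = "A Satyr against Vertue"
tokens = tokenize(title)
filtered_tokens = remove_stop_words(tokens)
print(tokens)
print(filtered_tokens)
```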

Display a word cloud for each class. A word cloud is a visually prominent presentation of the keywords that appear frequently in text data. It's effective because rendering the keywords as a cloud-like color picture captures the main text data at a glance. Learn more about wordcloud.

# WordCloud
for label in LABELS:
    tokens = (
        token_df.where(F.col(LABEL_COL) == label)
        .select(F.explode("filtered_tokens").alias("token"))
        .where(F.col("token").rlike(r"^\w+$"))
    )

    top50_tokens = (
        tokens.groupBy("token").count().orderBy(F.desc("count")).limit(50).collect()
    )

    # Generate a wordcloud image
    wordcloud = WordCloud(
        scale=10,
        background_color="white",
        random_state=42,  # Make sure the output is always the same for the same input
    ).generate_from_frequencies(dict(top50_tokens))

    # Display the generated image by using matplotlib
    plt.figure(figsize=(10, 10))
    plt.title(label, fontsize=20)
    plt.axis("off")
    plt.imshow(wordcloud, interpolation="bilinear")
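
The word cloud is drawn from a token-to-count mapping. This standard-library sketch shows how such a mapping can be built: keep only tokens made of word characters, count them, and take the most frequent ones. The token lists are made up for illustration.

```python
import re
from collections import Counter

# Toy filtered tokens for three titles (illustrative values).
token_lists = [
    ["poems", "several", "occasions"],
    ["poems", "satyr", "(a"],
    ["satyr", "poems"],
]

# Same idea as the rlike filter in the Spark code: whole-token word characters only.
word_pattern = re.compile(r"^\w+$")

counts = Counter(
    token
    for tokens in token_lists
    for token in tokens
    if word_pattern.match(token)
)

# A dict like this is the kind of input that generate_from_frequencies expects.
top2 = dict(counts.most_common(2))
print(top2)
```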

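Finally, use word2vec to vectorize the text. The word2vec technique creates a vector representation of each word in the text. Words that are used in similar contexts, or that have semantic relationships, are captured effectively through their closeness in the vector space. This closeness indicates that similar words have similar word vectors.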

# Label transformer
label_indexer = StringIndexer(inputCol=LABEL_COL, outputCol="labelIdx")
vectorizer = Word2Vec(
    vectorSize=word2vec_size,
    minCount=min_word_count,
    inputCol="filtered_tokens",
    outputCol="features",
)

# Build the pipeline
pipeline = Pipeline(stages=[label_indexer, vectorizer])
vec_df = (
    pipeline.fit(token_df)
    .transform(token_df)
    .select([TEXT_COL, LABEL_COL, "features", "labelIdx", "weight"])
)

display(vec_df.limit(20))
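
To show how word vectors become one feature vector per title, the sketch below averages toy two-dimensional word vectors and compares words with cosine similarity. The vectors are invented for illustration (a trained model learns them from the corpus), and skipping words that have no vector is a simplifying assumption of this sketch.

```python
import math

# Toy 2-dimensional word vectors; real ones are learned and have word2vec_size dimensions.
word_vectors = {
    "poems": [1.0, 0.0],
    "verses": [0.8, 0.2],
    "voyage": [0.0, 1.0],
}


def document_vector(tokens, vectors, size):
    """Average the vectors of the tokens that have one (assumption: skip the rest)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(v[i] for v in known) / len(known) for i in range(size)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


features = document_vector(["poems", "voyage"], word_vectors, size=2)
similar = cosine(word_vectors["poems"], word_vectors["verses"])
dissimilar = cosine(word_vectors["poems"], word_vectors["voyage"])
print(features, similar, dissimilar)
```

In this toy space, "poems" sits much closer to "verses" than to "voyage", which is the kind of closeness the paragraph above describes.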

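Step 4: Train and evaluate the model

With the data in place, define the model. In this section, you train a logistic regression model to classify the vectorized text.

Prepare training and test datasets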

# Split the dataset into training and testing
(train_df, test_df) = vec_df.randomSplit((0.8, 0.2), seed=42)
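
A seeded random split assigns each row to the training or test set independently, so the 80/20 proportions are approximate and the same seed reproduces the same assignment. The following plain-Python sketch illustrates that idea; it isn't Spark's implementation, and the helper name random_split is hypothetical.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows, 0.8, seed=42)
train_again, test_again = random_split(rows, 0.8, seed=42)

print(len(train), len(test))
```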

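Track machine learning experiments

A machine learning experiment is the primary unit of organization and control for all related machine learning runs. A run corresponds to a single execution of model code.

Machine learning experiment tracking manages all the experiments and their components, such as parameters, metrics, models, and other artifacts. Tracking lets you organize all the required components of a specific machine learning experiment. It also lets you easily reproduce past results from saved experiments. Learn more about machine learning experiments in Microsoft Fabric.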

# Build the logistic regression classifier
lr = (
    LogisticRegression()
    .setMaxIter(max_iter)
    .setFeaturesCol("features")
    .setLabelCol("labelIdx")
    .setWeightCol("weight")
)

Tune hyperparameters

Build a grid of parameters to search over the hyperparameters. Then build a cross-validator estimator, which produces a CrossValidator model when it's fitted:

# Build a grid search to select the best values for the training parameters
param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.03, 0.1])
    .addGrid(lr.elasticNetParam, [0.0, 0.1])
    .build()
)

if len(LABELS) > 2:
    evaluator_cls = MulticlassClassificationEvaluator
    evaluator_metrics = ["f1", "accuracy"]
else:
    evaluator_cls = BinaryClassificationEvaluator
    evaluator_metrics = ["areaUnderROC", "areaUnderPR"]
evaluator = evaluator_cls(labelCol="labelIdx", weightCol="weight")

# Build a cross-validator estimator
crossval = CrossValidator(
    estimator=lr,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=k_folds,
    collectSubModels=True,
)
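
The size of the search follows from the grid and the fold count. This sketch expands the same two-by-two grid with itertools.product and counts the fits done during cross-validation. It is a counting illustration only, not a reimplementation of ParamGridBuilder or CrossValidator.

```python
from itertools import product

# Same candidate values as the grid above.
grid = {"regParam": [0.03, 0.1], "elasticNetParam": [0.0, 0.1]}
k_folds = 3

# Every combination of one value per hyperparameter.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

# One model is fitted per (fold, parameter combination) pair during the search.
fits_during_search = len(param_maps) * k_folds

for params in param_maps:
    print(params)
print(fits_during_search)
```

With collectSubModels=True, the fitted cross-validator keeps these per-fold models, which is why the training code later iterates over four sub-models from the first fold.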

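Evaluate the model

You can evaluate the models on the test dataset to compare them. A well-trained model should show high performance on the relevant metrics when run against the validation and test datasets.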

def evaluate(model, df):
    log_metric = {}
    prediction = model.transform(df)
    for metric in evaluator_metrics:
        value = evaluator.evaluate(prediction, {evaluator.metricName: metric})
        log_metric[metric] = value
        print(f"{metric}: {value:.4f}")
    return prediction, log_metric

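Track experiments by using MLflow

Start the training and evaluation process. Use MLflow to track all experiments, and to log the parameters, metrics, and models. All of this information is logged under the experiment name in the workspace.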

with mlflow.start_run(run_name="lr"):
    models = crossval.fit(train_df)
    best_metrics = {k: 0 for k in evaluator_metrics}
    best_index = 0
    for idx, model in enumerate(models.subModels[0]):
        with mlflow.start_run(nested=True, run_name=f"lr_{idx}") as run:
            print("\nEvaluating on test data:")
            print(f"subModel No. {idx + 1}")
            prediction, log_metric = evaluate(model, test_df)

            if log_metric[evaluator_metrics[0]] > best_metrics[evaluator_metrics[0]]:
                best_metrics = log_metric
                best_index = idx

            print("log model")
            mlflow.spark.log_model(
                model,
                f"{EXPERIMENT_NAME}-lrmodel",
                registered_model_name=f"{EXPERIMENT_NAME}-lrmodel",
                dfs_tmpdir="Files/spark",
            )

            print("log metrics")
            mlflow.log_metrics(log_metric)

            print("log parameters")
            mlflow.log_params(
                {
                    "word2vec_size": word2vec_size,
                    "min_word_count": min_word_count,
                    "max_iter": max_iter,
                    "k_folds": k_folds,
                    "DATA_FILE": DATA_FILE,
                }
            )

    # Log the best model and its relevant metrics and parameters to the parent run
    mlflow.spark.log_model(
        models.subModels[0][best_index],
        f"{EXPERIMENT_NAME}-lrmodel",
        registered_model_name=f"{EXPERIMENT_NAME}-lrmodel",
        dfs_tmpdir="Files/spark",
    )
    mlflow.log_metrics(best_metrics)
    mlflow.log_params(
        {
            "word2vec_size": word2vec_size,
            "min_word_count": min_word_count,
            "max_iter": max_iter,
            "k_folds": k_folds,
            "DATA_FILE": DATA_FILE,
        }
    )
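
The loop above picks the best sub-model by comparing only the first entry in evaluator_metrics. The sketch below isolates that selection rule. The metric values are invented for illustration and are not results from this tutorial.

```python
def pick_best(runs, primary_metric):
    """Return the index and metrics of the run with the highest primary metric."""
    best_index, best_metrics = 0, {primary_metric: 0}
    for idx, metrics in enumerate(runs):
        if metrics[primary_metric] > best_metrics[primary_metric]:
            best_index, best_metrics = idx, metrics
    return best_index, best_metrics


# Made-up metrics for three sub-models.
runs = [
    {"areaUnderROC": 0.71, "areaUnderPR": 0.69},
    {"areaUnderROC": 0.75, "areaUnderPR": 0.70},
    {"areaUnderROC": 0.73, "areaUnderPR": 0.74},
]

best_index, best_metrics = pick_best(runs, "areaUnderROC")
print(best_index, best_metrics)
```

In this example the third run has the best areaUnderPR, but the second run wins because only areaUnderROC is compared. If the secondary metric matters more for your data, reorder evaluator_metrics accordingly.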

To view your experiments:

Select your workspace in the left navigation pane.
Find and select the experiment name, in this case sample-aisample-textclassification.

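Step 5: Score and save prediction results

Microsoft Fabric lets users operationalize machine learning models with the scalable PREDICT function. This function supports batch scoring (or batch inferencing) in any compute engine. You can create batch predictions straight from a notebook or from the item page for a particular model. For more information about PREDICT and how to use it in Fabric, see Machine learning model scoring with PREDICT in Microsoft Fabric.

From the preceding evaluation results, model 1 has the largest values for both the Area Under the Precision-Recall Curve (AUPRC) and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Therefore, use model 1 for prediction.

The AUC-ROC measure is widely used to assess the performance of binary classifiers. However, it is sometimes more appropriate to evaluate a classifier based on AUPRC. The ROC curve visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR). The precision-recall curve combines precision (positive predictive value, or PPV) and recall (true positive rate, or TPR) in a single visualization.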
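
The quantities named above can be computed by hand on a small example. The sketch below computes AUC-ROC as the fraction of positive-negative pairs that the classifier ranks correctly, and precision and recall at a single threshold. The labels, scores, and the 0.5 threshold are made up for illustration, and this isn't how the Spark evaluator is implemented.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold):
    """Precision (PPV) and recall (TPR) when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up true labels and predicted probabilities for six titles.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
precision, recall = precision_recall(labels, scores, threshold=0.5)
print(auc, precision, recall)
```

Sweeping the threshold and plotting recall against precision traces the precision-recall curve, and the area under that curve is the AUPRC.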

# Load the best model
model_uri = f"models:/{EXPERIMENT_NAME}-lrmodel/1"
loaded_model = mlflow.spark.load_model(model_uri, dfs_tmpdir="Files/spark")

# Verify the loaded model
batch_predictions = loaded_model.transform(test_df)
batch_predictions.show(5)

# Save the batch predictions to the lakehouse
batch_predictions.write.format("delta").mode("overwrite").save(
    f"{DATA_FOLDER}/predictions/batch_predictions"
)

# Determine the entire runtime
print(f"Full run cost {int(time.time() - ts)} seconds.")
