Multivariate anomaly detection with Isolation Forest

Article
01/27/2024

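This article shows how to use SynapseML on Apache Spark for multivariate anomaly detection. Multivariate anomaly detection finds anomalies across many variables or time series while taking into account the correlations and dependencies between them. In this scenario, we use SynapseML to train an Isolation Forest model, then use the trained model to infer multivariate anomalies in a dataset of synthetic measurements from three IoT sensors.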

To learn more about the Isolation Forest model, see the original paper by Liu et al.
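As a rough illustration of the idea behind the model, the paper scores a point by how quickly random splits isolate it: a short average path length across the trees means a more anomalous point. The sketch below computes that score from an already-known mean path length. It follows the formula in Liu et al., but the function names are illustrative and it is not SynapseML's implementation.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant used in the harmonic approximation


def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n - 1) is approximately ln(n - 1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 suggest an anomaly."""
    if n < 2:
        raise ValueError("n must be at least 2 to normalize the path length")
    return 2.0 ** (-mean_path_length / average_path_length(n))


# With a sub-sample size of 256, a point isolated after 2 splits on average
# scores higher than one that needs 12 splits.
short_path_score = anomaly_score(2.0, 256)
long_path_score = anomaly_score(12.0, 256)
```

A point whose mean path length equals c(n) scores exactly 0.5, which is why scores well above 0.5 are the ones worth inspecting.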

Prerequisites

Attach your notebook to a lakehouse. On the left side, select Add to add an existing lakehouse or create a new one.

Library imports

from IPython import get_ipython
from IPython.terminal.interactiveshell import TerminalInteractiveShell
import uuid
import mlflow

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.ml import Pipeline

from synapse.ml.isolationforest import *

from synapse.ml.explainers import *

%matplotlib inline

from pyspark.sql import SparkSession

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()

from synapse.ml.core.platform import *

if running_on_synapse():
    shell = TerminalInteractiveShell.instance()
    shell.define_macro("foo", """a,b=10,20""")

Input data

# Table inputs
timestampColumn = "timestamp"  # str: the name of the timestamp column in the table
inputCols = [
    "sensor_1",
    "sensor_2",
    "sensor_3",
]  # list(str): the names of the input variables

# Training and inference windows (start and end times):
trainingStartTime = (
    "2022-02-24T06:00:00Z"  # datetime: datetime for when to start the training
)
trainingEndTime = (
    "2022-03-08T23:55:00Z"  # datetime: datetime for when to end the training
)
inferenceStartTime = (
    "2022-03-09T09:30:00Z"  # datetime: datetime for when to start the inference
)
inferenceEndTime = (
    "2022-03-20T23:55:00Z"  # datetime: datetime for when to end the inference
)

# Isolation Forest parameters
contamination = 0.021
num_estimators = 100
max_samples = 256
max_features = 1.0
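To make the contamination parameter concrete: it is the expected fraction of outliers, and the model picks a score threshold so that roughly that fraction of the training points is flagged. The pure-Python sketch below assumes the threshold is simply the score of the k-th highest-scoring point, with k = contamination × number of points. The function name and the synthetic scores are illustrative only, and SynapseML may compute the quantile approximately rather than exactly.

```python
def score_threshold(scores, contamination):
    """Return the score at or above which about `contamination` of the points fall."""
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must be strictly between 0 and 1")
    ordered = sorted(scores)
    # Number of points expected to be flagged, at least one.
    n_outliers = max(1, round(contamination * len(ordered)))
    return ordered[-n_outliers]


# Synthetic, evenly spaced scores standing in for outlierScore values (hypothetical data).
example_scores = [i / 1000 for i in range(1000)]
threshold = score_threshold(example_scores, 0.021)
flagged = [s for s in example_scores if s >= threshold]
```

With 1,000 distinct scores and a contamination of 0.021, the 21 highest-scoring points end up at or above the threshold.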

Read data

df = (
    spark.read.format("csv")
    .option("header", "true")
    .load(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/generated_sample_mvad_data.csv"
    )
)

Cast columns to appropriate data types

df = (
    df.orderBy(timestampColumn)
    .withColumn("timestamp", F.date_format(timestampColumn, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    .withColumn("sensor_1", F.col("sensor_1").cast(DoubleType()))
    .withColumn("sensor_2", F.col("sensor_2").cast(DoubleType()))
    .withColumn("sensor_3", F.col("sensor_3").cast(DoubleType()))
    .drop("_c5")
)

display(df)

Training data preparation

# filter to data with timestamps within the training window
df_train = df.filter(
    (F.col(timestampColumn) >= trainingStartTime)
    & (F.col(timestampColumn) <= trainingEndTime)
)
display(df_train)
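The filter compares the timestamp column against string literals. Because the cast step formats every timestamp as a fixed-width `yyyy-MM-dd'T'HH:mm:ss'Z'` string, string ordering and chronological ordering agree. The sketch below checks this in plain Python for the training window; the four sample timestamps are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # Python equivalent of yyyy-MM-dd'T'HH:mm:ss'Z'


def parse_utc(timestamp: str) -> datetime:
    """Parse a fixed-width UTC timestamp string into an aware datetime."""
    return datetime.strptime(timestamp, ISO_FORMAT).replace(tzinfo=timezone.utc)


training_start = "2022-02-24T06:00:00Z"
training_end = "2022-03-08T23:55:00Z"

# Hypothetical timestamps just outside and exactly on the window edges.
samples = [
    "2022-02-24T05:55:00Z",
    "2022-02-24T06:00:00Z",
    "2022-03-08T23:55:00Z",
    "2022-03-09T00:00:00Z",
]

# Comparison on the raw strings, as the Spark filter does.
in_window_by_string = [training_start <= ts <= training_end for ts in samples]

# Comparison on parsed datetimes, for reference.
in_window_by_datetime = [
    parse_utc(training_start) <= parse_utc(ts) <= parse_utc(training_end)
    for ts in samples
]
```

Both comparisons select the same rows, and both window edges are inclusive. This shortcut would stop holding if timestamps carried different time zone offsets or variable-width fields.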

Test data preparation

# filter to data with timestamps within the inference window
df_test = df.filter(
    (F.col(timestampColumn) >= inferenceStartTime)
    & (F.col(timestampColumn) <= inferenceEndTime)
)
display(df_test)

Train an Isolation Forest model

isolationForest = (
    IsolationForest()
    .setNumEstimators(num_estimators)
    .setBootstrap(False)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(1)
)

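Next, we create an ML pipeline to train the Isolation Forest model. We also demonstrate how to create an MLflow experiment and register the trained model.

MLflow model registration is required only if you need to access the trained model later. To train the model and perform inference in the same notebook, the model object `model` is sufficient.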

va = VectorAssembler(inputCols=inputCols, outputCol="features")
pipeline = Pipeline(stages=[va, isolationForest])
model = pipeline.fit(df_train)
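The code above fits the pipeline without logging anything to MLflow. If the model needs to be reused later, the registration step could look roughly like the sketch below. It uses the `mlflow.spark` flavor; the helper names, the default artifact path, and any experiment or model names passed in are placeholders rather than values from this article, and the behavior depends on the MLflow version available in your workspace.

```python
def model_uri(model_name: str, version: int) -> str:
    """Build a Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{model_name}/{version}"


def fit_and_register(pipeline, df_train, experiment_name, model_name,
                     artifact_path="isolation-forest"):
    """Fit the pipeline inside an MLflow run and register the fitted model."""
    import mlflow.spark  # imported lazily; requires MLflow and PySpark at call time

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        fitted = pipeline.fit(df_train)
        # Logs the Spark PipelineModel and registers it under model_name.
        mlflow.spark.log_model(fitted, artifact_path, registered_model_name=model_name)
    return fitted


def load_registered(model_name: str, version: int):
    """Load a previously registered Spark model for inference."""
    import mlflow.spark

    return mlflow.spark.load_model(model_uri(model_name, version))
```

In a later session, calling `load_registered` with the same model name and a version number would return a pipeline model that can be used with `transform` in the same way as `model`.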

Perform inference

The trained pipeline is still available in this notebook as `model`, so apply it directly to the test data.

df_test_pred = model.transform(df_test)
display(df_test_pred)

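Pre-built anomaly detectors

Azure AI Anomaly Detector

Anomaly status of the latest point: builds a model from the preceding points and determines whether the latest point is anomalous (Scala, Python)
Find anomalies: builds a model from the entire series and finds anomalies in the series (Scala, Python)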
