Python tutorial: Deploy a model to categorize customers with SQL machine learning

07/12/2024

Applies to: SQL Server 2017 (14.x) and later versions, Azure SQL Managed Instance

In part four of this four-part tutorial series, you deploy a clustering model, developed in Python, into a database. Depending on your platform, you use SQL Server Machine Learning Services, Big Data Clusters, or Azure SQL Managed Instance Machine Learning Services.

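Because new customers keep registering, clustering has to be performed regularly, so you need to be able to call the Python script from any app. To do that, you can deploy the Python script in the database by putting it inside a SQL stored procedure. Because the model runs in the database, it can easily be trained against the data stored there.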
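Calling the procedure from application code could look like the following minimal Python sketch. It assumes a DB-API 2.0 cursor (for example, one from pyodbc, which this tutorial doesn't prescribe) and uses the procedure name created later in this article. `fetch_customer_clusters` and `StubCursor` are hypothetical names, and the stub with its made-up rows exists only so the sketch runs without a server.

```python
# Sketch: call the clustering stored procedure from application code.
# Assumption: the caller supplies a DB-API 2.0 cursor (for example, from pyodbc).

PROCEDURE_CALL = "EXEC [dbo].[py_generate_customer_return_clusters];"


def fetch_customer_clusters(cursor):
    """Run the procedure and return one dict per customer row."""
    cursor.execute(PROCEDURE_CALL)
    # Item 0 of each cursor.description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


class StubCursor:
    """Hypothetical stand-in so the sketch runs without a SQL Server connection."""

    description = [("Customer",), ("orderRatio",), ("itemsRatio",),
                   ("monetaryRatio",), ("frequency",), ("cluster",)]

    def __init__(self):
        self.executed = []

    def execute(self, sql):
        # Record the statement instead of sending it to a server.
        self.executed.append(sql)

    def fetchall(self):
        # Made-up rows, for illustration only.
        return [(1, 0.0, 0.0, 0.0, 0.0, 0.0), (2, 0.5, 0.25, 0.3, 1.0, 3.0)]


stub = StubCursor()
rows = fetch_customer_clusters(stub)
```

With a real connection, you would pass `connection.cursor()` instead of the stub.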

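In this section, you move the Python code you just wrote onto the server and deploy the clustering.

In this article, you learn how to:

Create a stored procedure that generates the model
Perform clustering on the server
Use the clustering information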

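In part one, you installed the prerequisites and restored the sample database.

In part two, you learned how to prepare the data from a database to perform clustering.

In part three, you learned how to create and train a K-Means clustering model in Python.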

Prerequisites

Part four of this tutorial series assumes you have fulfilled the prerequisites of part one and completed the steps in parts two and three.

Create a stored procedure that generates the model

Run the following T-SQL script to create the stored procedure. The procedure re-creates the steps you developed in parts two and three of this tutorial series:

Classify customers based on their purchase and return history
Generate four clusters of customers using a K-Means algorithm

USE [tpcxbb_1gb]
GO

DROP procedure IF EXISTS [dbo].[py_generate_customer_return_clusters];
GO

CREATE procedure [dbo].[py_generate_customer_return_clusters]
AS

BEGIN
    DECLARE

-- Input query to generate the purchase history & return metrics
     @input_query NVARCHAR(MAX) = N'
SELECT
  ss_customer_sk AS customer,
  CAST( (ROUND(COALESCE(returns_count / NULLIF(1.0*orders_count, 0), 0), 7) ) AS FLOAT) AS orderRatio,
  CAST( (ROUND(COALESCE(returns_items / NULLIF(1.0*orders_items, 0), 0), 7) ) AS FLOAT) AS itemsRatio,
  CAST( (ROUND(COALESCE(returns_money / NULLIF(1.0*orders_money, 0), 0), 7) ) AS FLOAT) AS monetaryRatio,
  CAST( (COALESCE(returns_count, 0)) AS FLOAT) AS frequency
FROM
  (
    SELECT
      ss_customer_sk,
      -- return order ratio
      COUNT(distinct(ss_ticket_number)) AS orders_count,
      -- return ss_item_sk ratio
      COUNT(ss_item_sk) AS orders_items,
      -- return monetary amount ratio
      SUM( ss_net_paid ) AS orders_money
    FROM store_sales s
    GROUP BY ss_customer_sk
  ) orders
  LEFT OUTER JOIN
  (
    SELECT
      sr_customer_sk,
      -- return order ratio
      count(distinct(sr_ticket_number)) as returns_count,
      -- return ss_item_sk ratio
      COUNT(sr_item_sk) as returns_items,
      -- return monetary amount ratio
      SUM( sr_return_amt ) AS returns_money
    FROM store_returns
    GROUP BY sr_customer_sk
  ) returned ON ss_customer_sk=sr_customer_sk
 '

EXEC sp_execute_external_script
      @language = N'Python'
    , @script = N'

import pandas as pd
from sklearn.cluster import KMeans

#get data from input query
customer_data = my_input_data

#We concluded in part three of the tutorial that 4 would be a good number of clusters
n_clusters = 4

#Perform clustering
est = KMeans(n_clusters=n_clusters, random_state=111).fit(customer_data[["orderRatio","itemsRatio","monetaryRatio","frequency"]])
clusters = est.labels_
customer_data["cluster"] = clusters

OutputDataSet = customer_data
'
    , @input_data_1 = @input_query
    , @input_data_1_name = N'my_input_data'
             with result sets (("Customer" int, "orderRatio" float,"itemsRatio" float,"monetaryRatio" float,"frequency" float,"cluster" float));
END;
GO
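To see what each ratio expression in the input query computes, here is a rough Python equivalent of `ROUND(COALESCE(returns / NULLIF(1.0 * orders, 0), 0), 7)`. It is an illustration only, not part of the deployed procedure: `return_ratio` is a hypothetical helper, `None` stands in for SQL `NULL`, and Python's `round` may differ from T-SQL `ROUND` in the last digit for some values.

```python
def return_ratio(returned, ordered):
    """Approximate ROUND(COALESCE(returned / NULLIF(1.0 * ordered, 0), 0), 7).

    None plays the role of SQL NULL.
    """
    # NULLIF(1.0 * ordered, 0): a zero denominator becomes NULL,
    # which avoids a divide-by-zero error.
    denominator = None if ordered in (None, 0) else 1.0 * ordered

    # Arithmetic involving NULL yields NULL.
    if returned is None or denominator is None:
        ratio = None
    else:
        ratio = returned / denominator

    # COALESCE(..., 0): NULL falls back to 0, then round to 7 decimal places.
    return round(0.0 if ratio is None else ratio, 7)


# A customer with no rows in store_returns gets NULL from the LEFT OUTER JOIN.
no_returns = return_ratio(None, 5)
# A customer who returned 2 of 4 orders.
half_returned = return_ratio(2, 4)
```

The `NULLIF` and `COALESCE` pair means that customers without returns, or with a zero denominator, end up with a ratio of 0 instead of `NULL`, so every row reaches the clustering step with numeric values.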

Perform clustering

Now that you've created the stored procedure, run the following script to perform clustering with it.

--Create a table to store the predictions in

DROP TABLE IF EXISTS [dbo].[py_customer_clusters];
GO

CREATE TABLE [dbo].[py_customer_clusters] (
    [Customer] [bigint] NULL
  , [OrderRatio] [float] NULL
  , [itemsRatio] [float] NULL
  , [monetaryRatio] [float] NULL
  , [frequency] [float] NULL
  , [cluster] [int] NULL
    ) ON [PRIMARY]
GO

--Execute the clustering and insert results into table
INSERT INTO py_customer_clusters
EXEC [dbo].[py_generate_customer_return_clusters];

-- Select contents of the table to verify it works
SELECT * FROM py_customer_clusters;
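As an additional sanity check, you could count how many customers landed in each cluster once the rows have been fetched into an application. The sketch below assumes rows arrive as tuples in the table's column order; `cluster_sizes` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def cluster_sizes(rows, cluster_index=5):
    """Count customers per cluster label.

    Assumes each row is a tuple in py_customer_clusters column order,
    so the cluster label is the sixth value (index 5).
    """
    return Counter(
        int(row[cluster_index])
        for row in rows
        if row[cluster_index] is not None  # the column is nullable
    )


# Made-up rows, for illustration only.
sample_rows = [
    (1, 0.0, 0.0, 0.0, 0.0, 0),
    (2, 0.5, 0.25, 0.3, 1.0, 3),
    (3, 0.0, 0.0, 0.0, 0.0, 0),
]
sizes = cluster_sizes(sample_rows)
```

With real data, you would expect at most four distinct labels, 0 through 3, because the procedure asks K-Means for four clusters.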

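Use the clustering information

Because you stored the clustering procedure in the database, it can perform clustering efficiently against customer data stored in the same database. You can run the procedure whenever your customer data is updated and use the updated clustering information.

Suppose you want to send a promotional email to customers in cluster 0, the inactive group (for a description of the four clusters, see part three of this tutorial). The following code selects the email addresses of customers in cluster 0.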

USE [tpcxbb_1gb]
--Get email addresses of customers in cluster 0 for a promotion campaign
SELECT customer.[c_email_address], customer.c_customer_sk
  FROM dbo.customer
  JOIN
  [dbo].[py_customer_clusters] as c
  ON c.Customer = customer.c_customer_sk
  WHERE c.cluster = 0

You can change the c.cluster value to return email addresses for customers in other clusters.
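If an application needs to pick the cluster at run time, one option is to pass it as a query parameter rather than editing the SQL text. The sketch below assumes a DB-API cursor that uses the `?` placeholder style (as pyodbc does); `emails_for_cluster` and `StubCursor` are hypothetical, and the stub data is invented.

```python
# Same query as in the tutorial, with the cluster value as a parameter.
CLUSTER_EMAIL_QUERY = """
SELECT customer.[c_email_address], customer.c_customer_sk
  FROM dbo.customer
  JOIN [dbo].[py_customer_clusters] AS c
    ON c.Customer = customer.c_customer_sk
 WHERE c.cluster = ?
"""

# The procedure builds four clusters, which K-Means labels 0 through 3.
VALID_CLUSTERS = range(4)


def emails_for_cluster(cursor, cluster):
    """Return the email addresses of all customers in the given cluster."""
    if cluster not in VALID_CLUSTERS:
        raise ValueError(f"cluster must be one of 0-3, got {cluster!r}")
    cursor.execute(CLUSTER_EMAIL_QUERY, (cluster,))
    # The email address is the first selected column.
    return [row[0] for row in cursor.fetchall()]


class StubCursor:
    """Hypothetical stand-in so the sketch runs without a SQL Server connection."""

    def __init__(self, rows):
        self._rows = rows
        self.last_call = None

    def execute(self, sql, params=()):
        # Record the statement and parameters instead of sending them.
        self.last_call = (sql, tuple(params))

    def fetchall(self):
        return self._rows


stub = StubCursor([("a@example.com", 1), ("b@example.com", 7)])
emails = emails_for_cluster(stub, 0)
```

Passing the value as a parameter keeps the SQL text constant, and the range check rejects labels the model can't produce.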

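Clean up resources

When you're finished with this tutorial, you can delete the tpcxbb_1gb database.

Next steps

In part four of this tutorial series, you completed these steps:

Create a stored procedure that generates the model
Perform clustering on the server
Use the clustering information

For more information on using Python in SQL machine learning, see: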
