OJ Sales Simulated

Article
10/16/2024

This dataset is derived from the Dominick's OJ dataset and includes extra simulated data, so that thousands of models can be trained simultaneously on Azure Machine Learning.

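Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.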

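The data contains weekly orange juice sales over 121 weeks. It covers 3,991 stores with three orange juice brands per store, so that 11,973 models can be trained (one per store and brand).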
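The model count follows from pairing every store with every brand. The following is a minimal sketch of that arithmetic, using the store count stated above and the three brand names that appear in the sample values and preview on this page. The store IDs are hypothetical placeholders used only to enumerate the pairs.

```python
from itertools import product

N_STORES = 3991  # number of stores stated above
BRANDS = ["dominicks", "minute.maid", "tropicana"]  # brand names seen on this page

# Hypothetical store IDs; the real IDs in the dataset are not consecutive.
store_ids = range(1, N_STORES + 1)

# One time series, and therefore one model, per (store, brand) pair.
series_keys = list(product(store_ids, BRANDS))

print(len(series_keys))  # 11973
```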

View the original dataset description or download the dataset.

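Columns

Name	Data type	Values (sample)	Description
Advert	int	1	Indicates whether the orange juice was advertised during that week. 0: no advertisements; 1: advertisements
Brand	string	dominicks tropicana	Brand of orange juice
Price	double	2.6 2.09	Price of orange juice (in USD)
Quantity	int	10939 11638	Quantity of orange juice sold that week
Revenue	double	38438.4 36036.0	Revenue from orange juice sales that week (in USD)
Store	int	2658 1396	Number identifying the store where the orange juice was sold
WeekStarting	timestamp	1990-08-09 00:00:00 1992-02-20 00:00:00	Date indicating which week the sales relate to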

Preview

WeekStarting	Store	Brand	Quantity	Advert	Price	Revenue
10/1/1992 12:00:00 AM	3571	minute.maid	13247	1	2.42	32057.74
10/1/1992 12:00:00 AM	2999	minute.maid	18461	1	2.69	49660.09
10/1/1992 12:00:00 AM	1198	minute.maid	13222	1	2.64	34906.08
10/1/1992 12:00:00 AM	3916	minute.maid	12923	1	2.45	31661.35
10/1/1992 12:00:00 AM	1688	minute.maid	9380	1	2.46	23074.8
10/1/1992 12:00:00 AM	1040	minute.maid	18841	1	2.31	43522.71
10/1/1992 12:00:00 AM	1938	minute.maid	14202	1	2.19	31102.38
10/1/1992 12:00:00 AM	2405	minute.maid	16326	1	2.05	33468.3
10/1/1992 12:00:00 AM	1972	minute.maid	16380	1	2.12	34725.6
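As a quick sanity check on the schema, the sketch below parses three of the preview rows into typed values and compares Revenue with Quantity times Price. It uses the column names WeekStarting, Store, Brand, Quantity, Advert, Price, and Revenue. In these preview rows the two values agree to the cent; whether that holds for every row in the dataset is an assumption, not something this page states.

```python
import csv
import io
import math
from datetime import datetime

# Three rows copied from the preview table (tab-separated).
PREVIEW = (
    "WeekStarting\tStore\tBrand\tQuantity\tAdvert\tPrice\tRevenue\n"
    "10/1/1992 12:00:00 AM\t3571\tminute.maid\t13247\t1\t2.42\t32057.74\n"
    "10/1/1992 12:00:00 AM\t2999\tminute.maid\t18461\t1\t2.69\t49660.09\n"
    "10/1/1992 12:00:00 AM\t1688\tminute.maid\t9380\t1\t2.46\t23074.8\n"
)


def parse_row(row):
    """Convert one row of strings to the data types listed in the Columns table."""
    return {
        "WeekStarting": datetime.strptime(row["WeekStarting"], "%m/%d/%Y %I:%M:%S %p"),
        "Store": int(row["Store"]),
        "Brand": row["Brand"],
        "Quantity": int(row["Quantity"]),
        "Advert": int(row["Advert"]),
        "Price": float(row["Price"]),
        "Revenue": float(row["Revenue"]),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(PREVIEW), delimiter="\t")]

# Compare with a one-cent tolerance to absorb floating-point rounding.
revenue_matches = [
    math.isclose(r["Quantity"] * r["Price"], r["Revenue"], abs_tol=0.01) for r in rows
]
print(revenue_matches)
```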

Data access

Azure Notebooks

azureml-opendatasets

from azureml.core.workspace import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

from azureml.opendatasets import OjSalesSimulated

Read data from Azure Open Datasets

# Create a Data Directory in local path
import os

oj_sales_path = "oj_sales_data"

if not os.path.exists(oj_sales_path):
    os.mkdir(oj_sales_path)

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# or pull a subset of the data
oj_sales_files = OjSalesSimulated.get_file_dataset(num_files=10)

oj_sales_files.download(oj_sales_path, overwrite=True)
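Once the files are on disk, they can be inspected without any Azure dependency. The sketch below walks a folder and reads every CSV file it finds using only the standard library. The helper name `read_sales_rows`, the assumption that each file begins with a header row, and the demo file (its name and date format) are illustrative stand-ins rather than details confirmed by this page.

```python
import csv
import tempfile
from pathlib import Path


def read_sales_rows(folder):
    """Yield one dict per data row from every .csv file found under folder."""
    for csv_path in sorted(Path(folder).rglob("*.csv")):
        with open(csv_path, newline="") as f:
            # Assumption: each file starts with a header row naming its columns.
            for row in csv.DictReader(f):
                row["source_file"] = csv_path.name
                yield row


# Stand-in for the download folder: one tiny CSV built from preview values on this page.
# The file name is hypothetical; real file names may differ.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "sample_store_brand.csv").write_text(
    "WeekStarting,Store,Brand,Quantity,Advert,Price,Revenue\n"
    "1992-10-01,3571,minute.maid,13247,1,2.42,32057.74\n"
    "1992-10-01,2999,minute.maid,18461,1,2.69,49660.09\n"
)

rows = list(read_sales_rows(demo_dir))
total_quantity = sum(int(r["Quantity"]) for r in rows)
print(len(rows), total_quantity)
```

With the download step above, calling `read_sales_rows(oj_sales_path)` would iterate over the downloaded files in the same way.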

Upload the individual datasets to Blob Storage

We upload the data to Blob Storage and then create the FileDataset from this folder of CSV files.

target_path = 'oj_sales_data'

datastore.upload(src_dir = oj_sales_path,
                target_path = target_path,
                overwrite = True, 
                show_progress = True)

Create the file dataset

To create the FileDataset, we need to define the path of the data.

from azureml.core.dataset import Dataset

ds_name = 'oj_data'
path_on_datastore = datastore.path(target_path + '/')

input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

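Register the file dataset to the workspace

We register the dataset to the workspace so that it can be used as an input to a pipeline for forecasting.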

registered_ds = input_ds.register(ws, ds_name, create_new_version=True)
named_ds = registered_ds.as_named_input(ds_name)

Azure Databricks

azureml-opendatasets

# This is a package in preview.
# You need to pip install azureml-opendatasets in Databricks cluster. https://learn.microsoft.com/azure/data-explorer/connect-from-databricks#install-the-python-library-on-your-azure-databricks-cluster
# Download or mount the OJ Sales raw files as Azure Machine Learning file datasets.
# This works only for Linux based compute. See https://learn.microsoft.com/azure/machine-learning/service/how-to-create-register-datasets to learn more about datasets.

from azureml.opendatasets import OjSalesSimulated

ojss_file = OjSalesSimulated.get_file_dataset()
ojss_file

ojss_file.to_path()

# Download files to local storage
import os
import tempfile

mount_point = tempfile.mkdtemp()
ojss_file.download(mount_point, overwrite=True)

# Mount files. Useful when the training job will run on a remote compute.
import gzip
import os
import struct
import sys
import tempfile

import numpy as np
import pandas as pd

# Load a compressed OJ Sales Simulated .gz file and return its contents as a pandas DataFrame
def load_data(filename, label=False):
    with gzip.open(filename) as gz:
        gz.read(4)  # skip the first 4 bytes of the header
        n_items = struct.unpack('>I', gz.read(4))[0]  # big-endian unsigned item count
        if not label:
            n_rows = struct.unpack('>I', gz.read(4))[0]
            n_cols = struct.unpack('>I', gz.read(4))[0]
            # One unsigned byte per value; flatten each item into a single row
            res = np.frombuffer(gz.read(n_items * n_rows * n_cols), dtype=np.uint8)
            res = res.reshape(n_items, n_rows * n_cols)
        else:
            # One unsigned byte per item
            res = np.frombuffer(gz.read(n_items), dtype=np.uint8)
            res = res.reshape(n_items, 1)
    return pd.DataFrame(res)

mount_point = tempfile.mkdtemp()
print(mount_point)
print(os.path.exists(mount_point))
print(os.listdir(mount_point))

# Mounting is supported only on Linux-based compute
if sys.platform == 'linux':
    print("start mounting....")
    with ojss_file.mount(mount_point):
        print(os.listdir(mount_point))
        train_images_df = load_data(os.path.join(mount_point, 'train-tabular-oj-ubyte.gz'))
        print(train_images_df.info())
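The `load_data` helper above reads a specific binary layout: four leading bytes that it skips, then big-endian unsigned 32-bit counts (items, rows, columns), then one unsigned byte per value. The sketch below writes a tiny gzip file in that layout and reads it back with the standard library, to make the expected format concrete. The helper names `write_sample` and `read_sample` and the file name are hypothetical, and this page does not confirm that the OJ Sales files themselves use this layout.

```python
import gzip
import os
import struct
import tempfile


def write_sample(path, items):
    """Write equally shaped 2-D grids of byte values in the layout load_data reads."""
    n_items, n_rows, n_cols = len(items), len(items[0]), len(items[0][0])
    with gzip.open(path, "wb") as gz:
        gz.write(struct.pack(">I", 0))  # 4 leading bytes that load_data skips
        gz.write(struct.pack(">III", n_items, n_rows, n_cols))  # big-endian counts
        for item in items:
            for row in item:
                gz.write(bytes(row))  # one unsigned byte per value


def read_sample(path):
    """Return (n_items, n_rows, n_cols) and the raw payload bytes."""
    with gzip.open(path, "rb") as gz:
        gz.read(4)  # skip the leading 4 bytes
        header = struct.unpack(">III", gz.read(12))
        payload = gz.read(header[0] * header[1] * header[2])
    return header, payload


sample_path = os.path.join(tempfile.mkdtemp(), "sample-ubyte.gz")
write_sample(sample_path, [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
header, payload = read_sample(sample_path)
print(header, list(payload))
```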

Next steps

View the rest of the datasets in the Open Datasets catalog.
