DataOperationsCatalog.TrainTestSplit Method

Definition

Namespace: Microsoft.ML

Assembly: Microsoft.ML.Data.dll

Package: Microsoft.ML v4.0.1 (also v1.0.0, v1.1.0, v1.2.0, v1.3.1, v1.4.0, v1.5.5, v1.6.0, v1.7.0, v2.0.1, v3.0.1, v5.0.0-preview.1.25125.4)

Source: DataOperationsCatalog.cs

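Important

Some information relates to prerelease product that may be substantially modified before it is released. Microsoft makes no warranties, express or implied, with respect to the information provided here.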

Splits the dataset into a train set and a test set according to the given fraction. Respects samplingKeyColumnName if provided.

public Microsoft.ML.DataOperationsCatalog.TrainTestData TrainTestSplit(Microsoft.ML.IDataView data, double testFraction = 0.1, string samplingKeyColumnName = default, int? seed = default);

member this.TrainTestSplit : Microsoft.ML.IDataView * double * string * Nullable<int> -> Microsoft.ML.DataOperationsCatalog.TrainTestData

Public Function TrainTestSplit (data As IDataView, Optional testFraction As Double = 0.1, Optional samplingKeyColumnName As String = Nothing, Optional seed As Nullable(Of Integer) = Nothing) As DataOperationsCatalog.TrainTestData

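Parameters

data: IDataView

The dataset to split.

testFraction: Double

The fraction of the data to go into the test set.

samplingKeyColumnName: String

Name of a column to use for grouping rows. If two examples share the same value of samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train set to the test set. Note that when performing a ranking experiment, samplingKeyColumnName must be the GroupId column. If null, no row grouping is performed.

seed: Nullable<Int32>

Seed for the random number generator used to select rows for the train-test split.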
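
The grouping guarantee and the role of the seed described above can be checked with a small sketch. This is an illustration under assumptions, not part of the official sample: the `Visit` type, the `UserId` column, the class name `SamplingKeySketch`, and the synthetic data are all hypothetical. It uses a string key column, splits the same data twice with the same seed, and inspects which keys land in each subset. Because whole groups move together, the realized test share may differ from the requested testFraction.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class SamplingKeySketch
{
    // Hypothetical row type for this sketch; not part of ML.NET.
    public class Visit
    {
        public string UserId { get; set; }
        public float Value { get; set; }
    }

    // Splits 100 synthetic rows (10 users, 10 rows each) and returns the
    // distinct user ids found in each subset plus the total row count.
    public static (string[] TrainUsers, string[] TestUsers, int RowCount) Split(int seed)
    {
        var mlContext = new MLContext();

        var rows = Enumerable.Range(0, 100)
            .Select(i => new Visit { UserId = "user" + (i % 10), Value = i })
            .ToList();
        var data = mlContext.Data.LoadFromEnumerable(rows);

        // Rows sharing a UserId are kept together in one subset.
        var split = mlContext.Data.TrainTestSplit(
            data,
            testFraction: 0.3,
            samplingKeyColumnName: nameof(Visit.UserId),
            seed: seed);

        var train = mlContext.Data
            .CreateEnumerable<Visit>(split.TrainSet, reuseRowObject: false)
            .ToList();
        var test = mlContext.Data
            .CreateEnumerable<Visit>(split.TestSet, reuseRowObject: false)
            .ToList();

        return (
            train.Select(r => r.UserId).Distinct().OrderBy(u => u).ToArray(),
            test.Select(r => r.UserId).Distinct().OrderBy(u => u).ToArray(),
            train.Count + test.Count);
    }

    public static void Main()
    {
        var first = Split(seed: 1);
        var second = Split(seed: 1);

        // Check that no user id appears in both subsets.
        bool overlap = first.TrainUsers.Intersect(first.TestUsers).Any();
        Console.WriteLine($"Users shared by train and test: {overlap}");

        // Check that the same seed reproduces the same partition.
        bool same = first.TestUsers.SequenceEqual(second.TestUsers);
        Console.WriteLine($"Same test users with the same seed: {same}");

        // Every input row should end up in exactly one subset.
        Console.WriteLine($"Rows across both subsets: {first.RowCount}");
    }
}
```

A string key is used here only to keep the illustration simple; for ranking tasks, the column passed as samplingKeyColumnName would be the GroupId column, as noted above.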

Returns

DataOperationsCatalog.TrainTestData

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;

namespace Samples.Dynamic
{
    /// <summary>
    /// Sample class showing how to use TrainTestSplit.
    /// </summary>
    public static class TrainTestSplit
    {
        public static void Example()
        {
            // Create the MLContext, the starting point for all ML.NET operations.
            var mlContext = new MLContext();

            // Generate some data points.
            var examples = GenerateRandomDataPoints(10);

            // Convert the examples list to an IDataView object, which is consumable
            // by ML.NET API.
            var dataview = mlContext.Data.LoadFromEnumerable(examples);

            // Leave out 10% of the dataset for testing. For some types of problems,
            // for example ranking or anomaly detection, we must ensure that rows
            // with the same value in a particular column all land in the same
            // split. So below, we specify the Group column as the column
            // containing the sampling keys. Notice how keeping the rows with the
            // same value in the Group column overrides the testFraction definition.
            var split = mlContext.Data
                .TrainTestSplit(dataview, testFraction: 0.1,
                samplingKeyColumnName: "Group");

            var trainSet = mlContext.Data
                .CreateEnumerable<DataPoint>(split.TrainSet, reuseRowObject: false);

            var testSet = mlContext.Data
                .CreateEnumerable<DataPoint>(split.TestSet, reuseRowObject: false);

            PrintPreviewRows(trainSet, testSet);

            //  The data in the Train split.
            //  [Group, 1], [Features, 0.8173254]
            //  [Group, 1], [Features, 0.5581612]
            //  [Group, 1], [Features, 0.5588848]
            //  [Group, 1], [Features, 0.4421779]
            //  [Group, 1], [Features, 0.2737045]

            //  The data in the Test split.
            //  [Group, 0], [Features, 0.7262433]
            //  [Group, 0], [Features, 0.7680227]
            //  [Group, 0], [Features, 0.2060332]
            //  [Group, 0], [Features, 0.9060271]
            //  [Group, 0], [Features, 0.9775497]

            // Example of a split without specifying a sampling key column.
            split = mlContext.Data.TrainTestSplit(dataview, testFraction: 0.2);
            trainSet = mlContext.Data
                .CreateEnumerable<DataPoint>(split.TrainSet, reuseRowObject: false);

            testSet = mlContext.Data
                .CreateEnumerable<DataPoint>(split.TestSet, reuseRowObject: false);

            PrintPreviewRows(trainSet, testSet);

            // The data in the Train split.
            // [Group, 0], [Features, 0.7262433]
            // [Group, 1], [Features, 0.8173254]
            // [Group, 0], [Features, 0.7680227]
            // [Group, 1], [Features, 0.5581612]
            // [Group, 0], [Features, 0.2060332]
            // [Group, 1], [Features, 0.4421779]
            // [Group, 0], [Features, 0.9775497]
            // [Group, 1], [Features, 0.2737045]

            // The data in the Test split.
            // [Group, 1], [Features, 0.5588848]
            // [Group, 0], [Features, 0.9060271]

        }

        private static IEnumerable<DataPoint> GenerateRandomDataPoints(int count,
            int seed = 0)

        {
            var random = new Random(seed);
            for (int i = 0; i < count; i++)
            {
                yield return new DataPoint
                {
                    Group = i % 2,

                    // Create a random feature value.
                    Features = (float)random.NextDouble()
                };
            }
        }

        // Example with group and feature columns. A data set is a collection of
        // such examples.
        private class DataPoint
        {
            public float Group { get; set; }

            public float Features { get; set; }
        }

        // print helper
        private static void PrintPreviewRows(IEnumerable<DataPoint> trainSet,
            IEnumerable<DataPoint> testSet)

        {

            Console.WriteLine("The data in the Train split.");
            foreach (var row in trainSet)
                Console.WriteLine($"[Group, {row.Group}], [Features, {row.Features}]");

            Console.WriteLine("\nThe data in the Test split.");
            foreach (var row in testSet)
                Console.WriteLine($"[Group, {row.Group}], [Features, {row.Features}]");
        }
    }
}
