DataOperationsCatalog.ShuffleRows Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Shuffle the rows of input
.
public Microsoft.ML.IDataView ShuffleRows (Microsoft.ML.IDataView input, int? seed = default, int shufflePoolSize = 1000, bool shuffleSource = true);
member this.ShuffleRows : Microsoft.ML.IDataView * Nullable<int> * int * bool -> Microsoft.ML.IDataView
Public Function ShuffleRows (input As IDataView, Optional seed As Nullable(Of Integer) = Nothing, Optional shufflePoolSize As Integer = 1000, Optional shuffleSource As Boolean = true) As IDataView
Parameters
- input
- IDataView
The input data.
The random seed. If unspecified, the random state will be instead derived from the MLContext.
- shufflePoolSize
- Int32
The number of rows to hold in the pool. Setting this to 1 will turn off pool shuffling and
ShuffleRows(IDataView, Nullable<Int32>, Int32, Boolean) will only perform a shuffle by reading input
in a random order.
- shuffleSource
- Boolean
If false
, the transform will not attempt to read input
in a random order and only use
pooling to shuffle. This parameter has no effect if the CanShuffle property of input
is false
.
Returns
Examples
using System;
using System.Collections.Generic;
using Microsoft.ML;
namespace Samples.Dynamic
{
public static class ShuffleRows
{
// Sample class showing how to shuffle rows in
// IDataView.
public static void Example()
{
// Create a new context for ML.NET operations. It can be used for
// exception tracking and logging, as a catalog of available operations
// and as the source of randomness.
var mlContext = new MLContext();
// Get a small dataset as an IEnumerable.
var enumerableOfData = GetSampleTemperatureData(5);
var data = mlContext.Data.LoadFromEnumerable(enumerableOfData);
// Before we apply a filter, examine all the records in the dataset.
Console.WriteLine($"Date\tTemperature");
foreach (var row in enumerableOfData)
{
Console.WriteLine($"{row.Date.ToString("d")}" +
$"\t{row.Temperature}");
}
Console.WriteLine();
// Expected output:
// Date Temperature
// 1/2/2012 36
// 1/3/2012 36
// 1/4/2012 34
// 1/5/2012 35
// 1/6/2012 35
// Shuffle the dataset.
var shuffledData = mlContext.Data.ShuffleRows(data, seed: 123);
// Look at the shuffled data and observe that the rows are in a
// randomized order.
var enumerable = mlContext.Data
.CreateEnumerable<SampleTemperatureData>(shuffledData,
reuseRowObject: true);
Console.WriteLine($"Date\tTemperature");
foreach (var row in enumerable)
{
Console.WriteLine($"{row.Date.ToString("d")}" +
$"\t{row.Temperature}");
}
// Expected output:
// Date Temperature
// 1/4/2012 34
// 1/2/2012 36
// 1/5/2012 35
// 1/3/2012 36
// 1/6/2012 35
}
private class SampleTemperatureData
{
public DateTime Date { get; set; }
public float Temperature { get; set; }
}
/// <summary>
/// Get a fake temperature dataset.
/// </summary>
/// <param name="exampleCount">The number of examples to return.</param>
/// <returns>An enumerable of <see cref="SampleTemperatureData"/>.</returns>
private static IEnumerable<SampleTemperatureData> GetSampleTemperatureData(
int exampleCount)
{
var rng = new Random(1234321);
var date = new DateTime(2012, 1, 1);
float temperature = 39.0f;
for (int i = 0; i < exampleCount; i++)
{
date = date.AddDays(1);
temperature += rng.Next(-5, 5);
yield return new SampleTemperatureData
{
Date = date,
Temperature =
temperature
};
}
}
}
}
Remarks
ShuffleRows(IDataView, Nullable<Int32>, Int32, Boolean) will shuffle the rows of any input IDataView using a streaming approach. In order to not load the entire dataset in memory, a pool of shufflePoolSize
rows will be used to randomly select rows to output. The pool is constructed from the first shufflePoolSize
rows in input
. Rows will then be randomly yielded from the pool and replaced with the next row from input
until all the rows have been yielded, resulting in a new IDataView of the same size as input
but with the rows in a randomized order. If the CanShuffle property of input
is true, then it will also be read into the pool in a random order, offering two sources of randomness.