Inspect intermediate data during processing
Learn how to inspect intermediate data during loading, processing, and model training steps in ML.NET. Intermediate data is the output of each stage in the machine learning pipeline.
Consider the following housing data:
HousingData[] housingData = new HousingData[]
{
new HousingData
{
Size = 600f,
HistoricalPrices = new float[] { 100000f ,125000f ,122000f },
CurrentPrice = 170000f
},
new HousingData
{
Size = 1000f,
HistoricalPrices = new float[] { 200000f, 250000f, 230000f },
CurrentPrice = 225000f
},
new HousingData
{
Size = 1000f,
HistoricalPrices = new float[] { 126000f, 130000f, 200000f },
CurrentPrice = 195000f
},
new HousingData
{
Size = 850f,
HistoricalPrices = new float[] { 150000f,175000f,210000f },
CurrentPrice = 205000f
},
new HousingData
{
Size = 900f,
HistoricalPrices = new float[] { 155000f, 190000f, 220000f },
CurrentPrice = 210000f
},
new HousingData
{
Size = 550f,
HistoricalPrices = new float[] { 99000f, 98000f, 130000f },
CurrentPrice = 180000f
}
};
In ML.NET, you can inspect intermediate data like this that's loaded into an IDataView
in various ways, as described in the following sections.
Convert IDataView to IEnumerable
One of the quickest ways to inspect an IDataView
is to convert it to an IEnumerable
. To do this conversion, use the CreateEnumerable
method.
To optimize performance, set reuseRowObject
to true
. Doing so lazily populates the same object with the data of the current row as it's being evaluated as opposed to creating a new object for each row in the dataset.
// Create an IEnumerable of HousingData objects from IDataView
IEnumerable<HousingData> housingDataEnumerable =
mlContext.Data.CreateEnumerable<HousingData>(data, reuseRowObject: true);
// Iterate over each row
foreach (HousingData row in housingDataEnumerable)
{
// Do something (print out Size property) with current Housing Data object being evaluated
Console.WriteLine(row.Size);
}
Access specific indices with IEnumerable
If you only need access to a portion of the data or specific indices, use CreateEnumerable
and set the reuseRowObject
parameter value to false
so a new object is created for each of the requested rows in the dataset. Then, convert the IEnumerable
to an array or list.
Warning
Converting the result of CreateEnumerable
to an array or list loads all the requested IDataView
rows into memory, which might affect performance.
Once the collection has been created, you can perform operations on the data. The following code snippet takes the first three rows in the dataset and calculates the average current price.
// Create an Array of HousingData objects from IDataView
HousingData[] housingDataArray =
mlContext.Data.CreateEnumerable<HousingData>(data, reuseRowObject: false)
.Take(3)
.ToArray();
// Calculate Average CurrentPrice of First Three Elements
HousingData firstRow = housingDataArray[0];
HousingData secondRow = housingDataArray[1];
HousingData thirdRow = housingDataArray[2];
float averageCurrentPrice = (firstRow.CurrentPrice + secondRow.CurrentPrice + thirdRow.CurrentPrice) / 3;
Inspect values in a single column
At any point in the model building process, values in a single column of an IDataView
can be accessed using the GetColumn
method. The GetColumn
method returns all of the values in a single column as an IEnumerable
.
IEnumerable<float> sizeColumn = data.GetColumn<float>("Size").ToList();
Inspect IDataView values one row at a time
IDataView
is lazily evaluated. To iterate over the rows of an IDataView
without converting to an IEnumerable
as demonstrated in previous sections of this document, create a DataViewRowCursor
by using the GetRowCursor
method and passing in the DataViewSchema of your IDataView
as a parameter. Then, to iterate over rows, use the MoveNext
cursor method along with ValueGetter
delegates to extract the respective values from each of the columns.
Important
For performance purposes, vectors in ML.NET use VBuffer
instead of native collection types (that is, Vector
and float[]
).
// Get DataViewSchema of IDataView
DataViewSchema columns = data.Schema;
// Create DataViewCursor
using (DataViewRowCursor cursor = data.GetRowCursor(columns))
{
// Define variables where extracted values will be stored to
float size = default;
VBuffer<float> historicalPrices = default;
float currentPrice = default;
// Define delegates for extracting values from columns
ValueGetter<float> sizeDelegate = cursor.GetGetter<float>(columns[0]);
ValueGetter<VBuffer<float>> historicalPriceDelegate = cursor.GetGetter<VBuffer<float>>(columns[1]);
ValueGetter<float> currentPriceDelegate = cursor.GetGetter<float>(columns[2]);
// Iterate over each row
while (cursor.MoveNext())
{
//Get values from respective columns
sizeDelegate.Invoke(ref size);
historicalPriceDelegate.Invoke(ref historicalPrices);
currentPriceDelegate.Invoke(ref currentPrice);
}
}
Preview result of preprocessing or training on a subset of the data
Warning
Do not use Preview
in production code because it is intended for debugging and may reduce performance.
The model building process is experimental and iterative. To preview what data would look like after preprocessing or training a machine learning model on a subset of the data, use the Preview
method, which returns a DataDebuggerPreview
. The result is an object with ColumnView
and RowView
properties that are both an IEnumerable
and contain the values in a particular column or row. Specify the number of rows to apply the transformation to with the maxRows
parameter.
The result of inspecting an IDataView
looks similar to the following image: