Reading Data as Spark DataFrame using mltable

Abdelkhalek Hamdi 40 Reputation points
2025-02-04T09:34:46.4233333+00:00

The Microsoft documentation for mltable mentions support for reading data as a Spark DataFrame, but specific examples or references are hard to find.

Docs: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-mltable?view=azureml-api-2&tabs=cli#:~:text=Azure%20Machine%20Learning%20supports%20a%20Table%20type%20(mltable).%20This%20allows%20for%20the%20creation%20of%20a%20blueprint%20that%20defines%20how%20to%20load%20data%20files%20into%20memory%20as%20a%20Pandas%20or%20Spark%20data%20frame.%20In%20this%20article%20you%20learn%3A

Has anyone successfully implemented this?

Additionally, is it possible to use a Spark Serverless job to read data directly from an mltable YAML file?

Azure Machine Learning

Accepted answer
  1. Vikram Singh 2,540 Reputation points Microsoft Employee
    2025-02-05T06:34:13.45+00:00

    Hi Abdelkhalek Hamdi,

    Thanks for troubleshooting this.

    The error you're encountering occurs because the MLTable object has no method called to_spark_dataframe. As a workaround, you can convert the MLTable to a Pandas DataFrame and then create a Spark DataFrame from it. Note that this route loads the whole dataset into driver memory first, so it suits data that fits on a single node. Here's an example:

    from mltable import load
    from pyspark.sql import SparkSession
    
    # Initialize (or reuse) a Spark session
    spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
    
    # Load the MLTable definition; no Workspace object is needed for a local path
    path = "./mltable-test/"  # Folder that contains your MLTable YAML file
    tbl = load(path)
    
    # Convert MLTable to Pandas DataFrame (materializes the data in memory)
    pandas_df = tbl.to_pandas_dataframe()
    
    # Convert Pandas DataFrame to Spark DataFrame
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.show()
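
For anyone unsure what the path in the example should point at: it is a folder holding a file named MLTable next to (or referencing) the data. The following is a minimal sketch that builds such a folder with only the standard library. The helper name write_sample_mltable, the file data.csv, and the sample rows are made up for illustration; the YAML keys follow the read_delimited layout shown in the mltable docs, but check them against your mltable version.

```python
from pathlib import Path
import tempfile

# Minimal MLTable definition: one CSV file, parsed as comma-delimited text.
# Assumption: these keys match the MLTable schema for your installed version.
MLTABLE_YAML = """\
type: mltable
paths:
  - file: ./data.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: utf8
      header: all_files_same_headers
"""


def write_sample_mltable(folder: Path) -> Path:
    """Create `folder` with a tiny CSV and an MLTable file (hypothetical helper)."""
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "data.csv").write_text("id,name\n1,alice\n2,bob\n", encoding="utf-8")
    (folder / "MLTable").write_text(MLTABLE_YAML, encoding="utf-8")
    return folder


# Build the sample folder in a temporary location and list its contents.
sample_dir = write_sample_mltable(Path(tempfile.mkdtemp()) / "mltable-test")
print(sorted(p.name for p in sample_dir.iterdir()))
```

Passing str(sample_dir) as the path to load(...) should then give the example a small table to work with, assuming mltable is installed.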
    

    Regarding the installation issue, you are right: the package to install is mltable, not azureml-mltable. The correct command is:

    pip install mltable pyspark
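
To confirm what actually got installed in the active environment, a small check like the one below may help. It is only a sketch using the standard library's importlib.metadata; the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages the answer relies on.
for name in ("mltable", "pyspark"):
    print(name, installed_version(name) or "not installed")
```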
    

    I hope this resolves the version error you encountered.

    If you have any further questions or need additional assistance, feel free to ask!

    Thanks.

