Save Apache Spark DataFrames as TFRecord files

This article shows you how to use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files and load TFRecord files with TensorFlow.

The TFRecord file format is a simple record-oriented binary format for ML training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.

Use the spark-tensorflow-connector library

You can use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files.

spark-tensorflow-connector is a library within the TensorFlow ecosystem that enables conversion between Spark DataFrames and TFRecords (a popular format for storing data for TensorFlow). With spark-tensorflow-connector, you can use Spark DataFrame APIs to read TFRecord files into DataFrames and write DataFrames as TFRecord files.
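As a minimal sketch of that round trip, the helpers below wrap the DataFrame writer and reader with the connector's "tfrecords" format and its "recordType" option. The function names and the output path are hypothetical, and the sketch assumes the connector is already available on the cluster.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; pyspark is not imported at runtime here
    from pyspark.sql import DataFrame, SparkSession


def write_tfrecords(df: "DataFrame", path: str) -> None:
    """Write each row of `df` as one tf.train.Example record under `path`."""
    (
        df.write.format("tfrecords")      # data source provided by the connector
        .option("recordType", "Example")  # one tf.train.Example per row
        .mode("overwrite")                # replace any existing output at `path`
        .save(path)
    )


def read_tfrecords(spark: "SparkSession", path: str) -> "DataFrame":
    """Read the TFRecord files under `path` back into a DataFrame."""
    return (
        spark.read.format("tfrecords")
        .option("recordType", "Example")
        .load(path)
    )
```

With an active SparkSession named spark and a DataFrame df, you might call write_tfrecords(df, "/tmp/example-tfrecords") and then read_tfrecords(spark, "/tmp/example-tfrecords") to check the result. The path here is only a placeholder.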

Note

The spark-tensorflow-connector library is included in Databricks Runtime for Machine Learning. To use spark-tensorflow-connector on Databricks Runtime (see Databricks Runtime release notes versions and compatibility), you need to install the library from Maven. See Maven or Spark package for details.

Example: Load data from TFRecord files with TensorFlow

The example notebook demonstrates how to save data from Apache Spark DataFrames to TFRecord files and load TFRecord files for ML training.

You can load the TFRecord files using the tf.data.TFRecordDataset class. See Reading a TFRecord file from TensorFlow for details.
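The following sketch shows one way to build such an input pipeline with tf.data.TFRecordDataset. The feature names in FEATURE_SPEC ("id" and "score") and the load_tfrecords helper are hypothetical; replace them with the columns you actually saved. It assumes a TensorFlow 2 release recent enough to provide tf.data.AUTOTUNE.

```python
import tensorflow as tf

# Hypothetical schema: adjust the names and dtypes to match the saved columns.
FEATURE_SPEC = {
    "id": tf.io.FixedLenFeature([], tf.int64),
    "score": tf.io.FixedLenFeature([], tf.float32),
}


def load_tfrecords(file_pattern: str, batch_size: int = 32) -> tf.data.Dataset:
    """Stream parsed, batched examples from the files matching `file_pattern`."""
    files = tf.io.gfile.glob(file_pattern)
    if not files:
        raise FileNotFoundError(f"No TFRecord files match {file_pattern!r}")
    dataset = tf.data.TFRecordDataset(files)
    # Decode each serialized tf.train.Example into a dict of tensors.
    dataset = dataset.map(
        lambda record: tf.io.parse_single_example(record, FEATURE_SPEC),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Spark writes a directory of part files alongside marker files such as _SUCCESS, so a pattern that matches only the part files (for example, one ending in part-*) is usually a safer input than the bare directory.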

Prepare image data for Distributed DL notebook
