Saving Spark Distributed Data Frame (DDF) To PowerBI

Artigo
03/09/2016

The sample Jupyter Scala notebook described in this blog can be downloaded from https://github.com/hdinsight/spark-jupyter-notebooks/blob/master/Scala/SparkDataFrameToPowerBI.ipynb.
Spark PowerBI connector source code is available at https://github.com/hdinsight/spark-powerbi-connector.

Data visualization is often the most important part of data processing as it can surface up data patterns and trends in data that cannot be otherwise easily perceptible by humans. PowerBI (https://powerbi.microsoft.com/en-us/) is a powerful data visualization platform offered by Microsoft. As PowerBI turned out to be a natural choice for data visualization we developed a Spark to PowerBI connector (https://github.com/hdinsight/spark-powerbi-connector) which implements a PowerBI REST client and extension methods on Spark DataFrame, RDD and DStream in the same way described in a previous blog https://blogs.msdn.microsoft.com/azuredatalake/2016/03/01/extending-spark-with-extension-methods-in-scala-fun-with-implicits/. These extension methods can be called in a Spark application in codes that run either in the driver or the executors.

This blog describes one of the ways of using the connector for pushing Spark DataFrame to PowerBI as part of a Spark interactive or batch job through an example Jupyter notebook in Scala which can be run on an HDInsight cluster. To learn about how to deploy Azure HDInsight Linux Spark cluster and launch Jupyter notebook refer to the Azure article at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/. PowerBI has always been able to pull data from various data sources including those supported by Spark (like Hive table, Azure blob, SQL server) through appropriate drivers but the Spark to PowerBI connector enables pushing data from Spark directly to PowerBI. Though this example shows pushing an entire DataFrame directly to PowerBI, it is expected that users should use their judgment on the amount of data to be pushed through the DataFrame. Be aware of the limitations of PowerBI REST APIs as detailed in https://msdn.microsoft.com/en-us/library/dn950053.aspx. An ideal use case will be displaying some metrics or aggregates of a job at certain intervals or stages as PowerBI Dashboard.

In this example we will visualize a DataFrame on PowerBI that displays the relation between the radius and area of circles. To start with this example we need to configure Jupyter to use two additional JARs and place them in a known folder in the default container of the default storage account of the HDInsight cluster:

adal4j.jar available at https://mvnrepository.com/artifact/com.microsoft.azure/adal4j
spark- powerbi-connector_2.10-0.6.0.jar (compiled with Spark 1.6.1) available at https://github.com/hdinsight/spark-powerbi-connector. Follow the README for Maven and SBT co-ordinates.

If required, source codes from Github repositories can be cloned, built and packaged into appropriate JAR artifacts through IntelliJ. Refer to the appropriate section of the Azure article at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-eventhub-streaming/ for detailed instructions.

Define a case class Circle that will hold the data which will be used to create the DataFrame for this example. In practice the DataFrame data source can be anything that is supported by Spark.

Generate a list of Circle objects and create a DataFrame out of it.

Declare a user defined function with radius as the input parameter to compute the area of a circle.

Add an additional column “area” to hold the areas of the circles.

Create a new DataFrame by selecting only the columns from the last DataFrame which we want to show on the PowerBI dashboard. In this example columns “radius” and “area” are selected.

Enter the PowerBI Client ID, PowerBI Account Username and PowerBI Account Password and declare a PowerBIAuthentication object. The PowerBIAuthentication class is defined in the spark-powerbi-connector.jar under com.microsoft.spark.powerbi.authetication.

Initialize the PowerBIAuthentication object, declare PowerBI table with column names and datatypes and create (or get) the PowerBI dataset containing the table. This step runs in the driver and actually goes to PowerBI and creates (or gets) the dataset and the underlying table(s). The column order of the PowerBI table should match that of the DataFrame which it is storing.

Simply call the toPowerBI method on the DataFrame with PowerBI dataset details received when creating (or getting) the dataset in the previous step, the PowerBI table name and the PowerBIauthentication object.

Verify that the data is actually saved and can be displayed on the PowerBI dashboard.

For further reading into how DataFrame has been extended to support saving it to PowerBI, following is the code behind. Each partition of the DataFrame is grouped into 1000 records and serialized into a POST request of multiple rows to PowerBI table in JSON format. Since DataFrame and PowerBI table both maintain column order and PowerBI table and DataFrame column orders should match, no name matching is done between columns of DataFrame and PowerBI table.

[Contributed by Arijit Tarafdar]

Compartilhar via

Saving Spark Distributed Data Frame (DDF) To PowerBI

Recursos adicionais