How to allow Spark to access Microsoft SQL Server

 

Today we will look at configuring Spark to access Microsoft SQL Server through JDBC. On HDInsight the Microsoft SQL Server JDBC jar is already installed; on Linux the path is /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar. If you need more information or want to download the driver, start with Microsoft SQL Server JDBC.

Spark needs to know the path to the sqljdbc4.jar. There are multiple ways to add the path to Spark's classpath. Spark has two runtime environment properties that can do this: spark.driver.extraClassPath and spark.executor.extraClassPath. To review all the available properties, see Spark's Configuration - Spark 1.4.1 Documentation.

If you use spark-shell or spark-submit you can pass these properties with --conf. I like to add the properties to Spark's default configuration file at /etc/spark/conf/spark-defaults.conf, as shown below. A third option is to include the sqljdbc4.jar in your assembly jar. This same technique works for any other jars your Spark application might need. Whichever technique you choose, Spark needs to know where to find the sqljdbc4.jar for both the driver application and the executors.
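For reference, here is roughly what the first two approaches look like, using the HDInsight driver path mentioned above (adjust the path for your cluster and version):

# In /etc/spark/conf/spark-defaults.conf
spark.driver.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar
spark.executor.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar

# Or passed on the command line to spark-shell or spark-submit
spark-shell --conf spark.driver.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar --conf spark.executor.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar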

You can verify the properties are set by checking the Spark Properties section of the Environment tab in the Spark UI.
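You can also check from the shell itself. A quick sketch, assuming you are in spark-shell where sc is already defined (note that getConf.get throws an exception if the property is not set):

// Confirm the classpath properties were picked up
sc.getConf.get("spark.driver.extraClassPath")
sc.getConf.get("spark.executor.extraClassPath")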

 

Spark's API is very dynamic and changes are made with each new release, especially around JDBC. If you are going to use Spark with JDBC, I would suggest reviewing the API documentation for the version of Spark you are using (for example, the Spark 1.4.1 API) to make sure the methods are still valid and behave the same way. Depending on the release, there are a few places to look for methods involving JDBC, including SQLContext, DataFrame, and JdbcRDD. Also notice that some methods are marked experimental and/or deprecated. Make sure you test your code.

Some issues to consider are:

  • Make sure the firewall is open on port 1433.
  • If you are using Azure SQL Database, tables require a primary key. Some of the methods create the table for you, but Spark's code does not create the primary key, so the table creation fails. A possible workaround is to create the table with its primary key ahead of time and have Spark only insert the data; see the sketch after this list.
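A minimal T-SQL sketch of that workaround, assuming the t2 table and the mykey/myvalue columns used in the snippets below (the column types are illustrative):

-- Run this against the database first so the table has a primary key
CREATE TABLE t2 (
    mykey   VARCHAR(32) NOT NULL PRIMARY KEY,
    myvalue VARCHAR(128)
);

With the table already in place, pass false for the overwrite argument of insertIntoJDBC so Spark appends to the existing table instead of trying to create it.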

 

Here are some code snippets. A DataFrame is used to create the table t2 and insert data. The SQLContext is used to load the data from the t2 table into a DataFrame. I added spark.driver.extraClassPath and spark.executor.extraClassPath to my spark-defaults.conf file.

//Spark 1.4.1

//Insert data from a DataFrame
//(in spark-shell, sc and sqlContext are already defined and the implicits needed for toDF are imported)

case class Conf(mykey: String, myvalue: String)

val data = sc.parallelize(Seq(Conf("1", "Delaware"), Conf("2", "Virginia"), Conf("3", "Maryland"), Conf("4", "South Carolina")))

val df = data.toDF()

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"

val table = "t2"

//overwrite = true: the table is (re)created before the rows are inserted
df.insertIntoJDBC(url, table, true)

//Load from the database using the SQLContext

val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"

val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

//partitionColumn, lowerBound, upperBound and numPartitions control how the read is split across tasks
val tbl = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "t2",
  "partitionColumn" -> "mykey",
  "lowerBound" -> "0",
  "upperBound" -> "100",
  "numPartitions" -> "1"))

tbl.show()
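Note that even in 1.4.1, insertIntoJDBC and sqlContext.load are already marked deprecated in favor of the DataFrameReader/DataFrameWriter API. A rough equivalent using that API, reusing the df and url values from above (a sketch; check the API docs for your version):

import java.util.Properties

val props = new Properties()
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Overwrite mode drops and recreates the table, like insertIntoJDBC(url, table, true)
df.write.mode("overwrite").jdbc(url, "t2", props)

// Partitioned read, equivalent to the sqlContext.load call above
val tbl2 = sqlContext.read.jdbc(url, "t2", "mykey", 0L, 100L, 1, props)
tbl2.show()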

 

 

If you run a Microsoft SQL Server Profiler trace while running the spark-shell, you can see the table being created, the data being inserted, and then the data being read.

 

HDInsight and Spark make a great platform for processing and analyzing your data, but often data resides in a relational database system like Microsoft SQL Server. Allowing Spark to read from and write to Microsoft SQL Server lets you create a richer pipeline.

 

Hope this helps,

 

Bill

Comments

  • Anonymous
    January 09, 2017
    Hi, is there any update on how to load data into Azure SQL using Spark?

  • Anonymous
    November 25, 2015
    Is this faster than BCPing the data out of SQL Server and copying it onto HDFS? My use case: I need to transfer 10 GB of data from SQL Server to HDFS. Veera


  • Anonymous
    December 08, 2015
    Hi Bill, I am trying to append data using the Spark code below, but I am hitting a "table already exists" exception. Is this a bug in Spark? I am using Spark 1.5.
    val sourcedfmode = sourcedf.write.mode("append")
    sourcedfmode.jdbc(TargetDBinfo.url, TargetDBinfo.table, targetprops)
    The error: There is already an object named 'customer_spark' in the database. Thanks, Sri


  • Anonymous
    October 19, 2016
    Spark 1.6.1, Hortonworks HDP 2.4, spark-submit in yarn-client mode. I tried: 1. Placing sqljdbc4.jar under /usr/hdp/2.4.2.0-258 (find . -name sqljdbc* shows ./hadoop/lib/sqljdbc4.jar, ./spark/lib/sqljdbc4.jar, ./hadoop-yarn/lib/sqljdbc4.jar, ./hadoop-mapreduce/lib/sqljdbc4.jar). 2. Placing it in the JDK at /usr/jdk64/jdk1.8.0_60 (./lib/sqljdbc4.jar, ./jre/lib/sqljdbc4.jar). 3. Setting SPARK_CLASSPATH in /etc/spark/spark-env.sh. 4. Adding the key and value for driver.extra, etc. to /etc/spark/spark-default.conf. No luck.

  • Anonymous
    March 20, 2017
    How can we achieve the same thing using SparkSession in Spark 2.0 or later?