How to allow Spark to access Microsoft SQL Server
Today we will look at configuring Spark to access Microsoft SQL Server through JDBC. On HDInsight the Microsoft SQL Server JDBC jar is already installed; on Linux the path is /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar. If you need more information or want to download the driver, you can start here: Microsoft SQL Server JDBC.
Spark needs to know the path to sqljdbc4.jar, and there are multiple ways to add it to Spark's classpath. Spark has two runtime configuration properties for this: spark.driver.extraClassPath and spark.executor.extraClassPath. To review all the available properties, see Configuration - Spark 1.4.1 Documentation.
If you use spark-shell or spark-submit you can pass these properties with --conf. I like to add the properties to Spark's default configuration file at /etc/spark/conf/spark-defaults.conf. A third option is to include sqljdbc4.jar in your assembly jar. The same technique works for other jars that your Spark application might need. Whichever technique you choose, Spark needs to know where to find sqljdbc4.jar for both the driver application and the executors.
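For example, using the HDInsight driver path above, the two entries in /etc/spark/conf/spark-defaults.conf would look like this:

spark.driver.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar
spark.executor.extraClassPath /usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar

Or, equivalently, passed on the command line:

spark-shell --conf spark.driver.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar --conf spark.executor.extraClassPath=/usr/hdp/2.2.7.1-10/hive/lib/sqljdbc4.jar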
You can verify the properties are set by checking the Spark Properties section of the Environment tab in the Spark web UI.
Spark's API is very dynamic and changes are made with each new release, especially around JDBC. If you are going to use Spark with JDBC, I suggest reviewing the API documentation for the version of Spark you are using (Spark 1.4.1 API) to make sure the methods are still valid and behave the same way. Depending on the release there are a few places to look for JDBC-related methods, including SQLContext, DataFrame, and JdbcRDD. Also note that some methods are marked experimental and/or deprecated. Make sure you test your code.
Some issues to consider are:
- Make sure the firewall allows traffic on port 1433, the default SQL Server port.
- If using Microsoft Azure SQL Database, tables require a primary key. Some of the Spark methods create the table for you, but Spark's code does not create a primary key, so the table creation fails. One workaround is to create the table yourself first and let Spark append to it, as in the sketch after this list.
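Here is a minimal sketch of that workaround: pre-create the table with a primary key using plain JDBC, then append from Spark. The column types are illustrative assumptions, and df is the DataFrame built in the snippets below.

//Pre-create the table with a primary key via plain JDBC
import java.sql.DriverManager
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val conn = DriverManager.getConnection(url)
try {
  conn.createStatement().executeUpdate(
    "CREATE TABLE t2 (mykey VARCHAR(32) NOT NULL PRIMARY KEY, myvalue VARCHAR(255))")
} finally {
  conn.close()
}
//overwrite = false appends to the existing table instead of re-creating it
df.insertIntoJDBC(url, "t2", false)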
Here are some code snippets. A DataFrame is used to create the table t2 and insert data. The SQLContext is then used to load the data from the t2 table into a DataFrame. I added spark.driver.extraClassPath and spark.executor.extraClassPath to my spark-defaults.conf file.
//Spark 1.4.1
//Insert data from a DataFrame
//(in spark-shell, sqlContext.implicits._ is already imported, which provides toDF)
case class Conf(mykey: String, myvalue: String)
val data = sc.parallelize(Seq(Conf("1", "Delaware"), Conf("2", "Virginia"), Conf("3", "Maryland"), Conf("4", "South Carolina")))
val df = data.toDF()
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val table = "t2"
//true = overwrite: drop and re-create the table if it already exists
df.insertIntoJDBC(url, table, true)
//Load from the database using SQLContext
//partitionColumn, lowerBound, upperBound, and numPartitions control how the read is split across tasks
val url = "jdbc:sqlserver://wcarroll3:1433;database=mydb;user=ReportUser;password=ReportUser"
val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
val tbl = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "t2",
  "partitionColumn" -> "mykey",
  "lowerBound" -> "0",
  "upperBound" -> "100",
  "numPartitions" -> "1"))
tbl.show()
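Note that insertIntoJDBC and sqlContext.load are already marked deprecated in 1.4.1. As a sketch, the equivalent with the newer DataFrameReader/DataFrameWriter API, reusing url, table, and df from above (the empty Properties works here because the credentials are embedded in the URL):

//Same write and read using the non-deprecated 1.4 API
val props = new java.util.Properties()
df.write.mode("overwrite").jdbc(url, table, props)
val tbl2 = sqlContext.read.jdbc(url, table, props)
tbl2.show()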
If you run a Microsoft SQL Server Profiler trace while running the spark-shell, you can see the table being created, the data being inserted, and then the data being read.
HDInsight and Spark make a great platform for processing and analyzing your data, but often data resides in a relational database such as Microsoft SQL Server. Allowing Spark to read and write data from Microsoft SQL Server lets you create a richer pipeline.
Hope this helps,
Bill
Comments
Anonymous
January 09, 2017
Hi, is there any update on how to load data into Azure SQL using Spark?
Anonymous
November 25, 2015
Is this faster than BCPing out the data from SQL Server and copying it onto HDFS? My use case: I need to transfer 10 GB of data from SQL Server to HDFS. Veera

Anonymous
December 08, 2015
Hi Bill, I am trying to append data using the Spark code below, and I am hitting a "table already exists" exception. Is this a bug in Spark? I am using Spark 1.5.
val sourcedfmode = sourcedf.write.mode("append")
sourcedfmode.jdbc(TargetDBinfo.url, TargetDBinfo.table, targetprops)
There is already an object named 'customer_spark' in the database. Thanks, Sri

Anonymous
October 19, 2016
Spark 1.6.1, Hortonworks HDP 2.4, spark-submit, yarn-client mode. I tried:
1. Placing sqljdbc4.jar under /usr/hdp/2.4.2.0-258 (find . -name sqljdbc shows ./hadoop/lib/sqljdbc4.jar, ./spark/lib/sqljdbc4.jar, ./hadoop-yarn/lib/sqljdbc4.jar, ./hadoop-mapreduce/lib/sqljdbc4.jar).
2. Placing it in the JDK at /usr/jdk64/jdk1.8.0_60 (find . -name sql* shows ./lib/sqljdbc4.jar, ./jre/lib/sqljdbc4.jar).
3. Setting SPARK_CLASSPATH in /etc/spark/spark-env.sh.
4. Adding the key and value for driver.extra, etc., to /etc/spark/spark-default.conf.
No luck.

Anonymous
March 20, 2017
How can we achieve the same using SparkSession in Spark 2.0 or later?