แชร์ผ่าน


Databricks Runtime 7.x migration guide (EoS)

Note

Support for this Databricks Runtime version has ended. For the end-of-support date, see End-of-support history. For all supported Databricks Runtime versions, see Databricks Runtime release notes versions and compatibility.

This guide provides guidance to help you migrate your Azure Databricks workloads from Databricks Runtime 6.x, built on Apache Spark 2.4, to Databricks Runtime 7.3 LTS (EoS), both built on Spark 3.0.

This guide lists the Spark 3.0 behavior changes that might require you to update Azure Databricks workloads. Some of those changes include complete removal of Python 2 support, the upgrade to Scala 2.12, full support for JDK 11, and the switch from the Gregorian to the Proleptic calendar for dates and timestamps.

This guide is a companion to the Databricks Runtime 7.3 LTS (EoS) migration guide.

New features and improvements available on Databricks Runtime 7.x

For a list of new features, improvements, and library upgrades included in Databricks Runtime 7.3 LTS, see the release notes for each Databricks Runtime version above the one you are migrating from. Supported Databricks Runtime 7.x versions include:

Post-release maintenance updates are listed in Maintenance updates for Databricks Runtime (archived).

Databricks Runtime 7.3 LTS system environment

  • Operating System: Ubuntu 18.04.5 LTS
  • Java:
    • 7.3 LTS: Zulu 8.48.0.53-CA-linux64 (build 1.8.0_265-b11)
  • Scala: 2.12.10
  • Python: 3.7.5
  • R: 3.6.3 (2020-02-29)
  • Delta Lake 0.7.0

Major Apache Spark 3.0 behavior changes

The following behavior changes from Spark 2.4 to Spark 3.0 might require you to update Azure Databricks workloads when you migrate from Databricks Runtime 6.x to Databricks Runtime 7.x.

Note

This article provides a list of the important Spark behavior changes for you to consider when you migrate to Databricks Runtime 7.x.

Core

  • In Spark 3.0, the deprecated accumulator v1 is removed.
  • Event log file will be written as UTF-8 encoding, and Spark History Server will replay event log files as UTF-8 encoding. Previously Spark wrote the event log file as default charset of driver JVM process, so Spark History Server of Spark 2.x is needed to read the old event log files in case of incompatible encoding.
  • A new protocol for fetching shuffle blocks is used. It is recommended that external shuffle services be upgraded when running Spark 3.0 apps. You can still use old external shuffle services by setting the configuration spark.shuffle.useOldFetchProtocol to true. Otherwise, Spark may run into errors with messages like IllegalArgumentException: Unexpected message type: <number>.

PySpark

  • In Spark 3.0, Column.getItem is fixed such that it does not call Column.apply. Consequently, if Column is used as an argument to getItem, the indexing operator should be used. For example, map_col.getItem(col('id')) should be replaced with map_col[col('id')].
  • As of Spark 3.0, Row field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered. To enable sorted fields by default, as in Spark 2.4, set the environment variable PYSPARK_ROW_FIELD_SORTING_ENABLED to true for both executors and driver. This environment variable must be consistent on all executors and driver. Otherwise, it may cause failures or incorrect answers. For Python versions lower than 3.6, the field names are sorted alphabetically as the only option.
  • Deprecated Python 2 support (SPARK-27884).

Structured Streaming

  • In Spark 3.0, Structured Streaming forces the source schema into nullable when file-based datasources such as text, json, csv, parquet and orc are used via spark.readStream(...). Previously, it respected the nullability in source schema; however, it caused issues tricky to debug with NPE. To restore the previous behavior, set spark.sql.streaming.fileSource.schema.forceNullable to false.
  • Spark 3.0 fixes the correctness issue on Stream-stream outer join, which changes the schema of state. See SPARK-26154 for more details. If you start your query from checkpoint constructed from Spark 2.x which uses stream-stream outer join, Spark 3.0 fails the query. To recalculate outputs, discard the checkpoint and replay previous inputs.
  • In Spark 3.0, the deprecated class org.apache.spark.sql.streaming.ProcessingTime has been removed. Use org.apache.spark.sql.streaming.Trigger.ProcessingTime instead. Likewise, org.apache.spark.sql.execution.streaming.continuous.ContinuousTrigger has been removed in favor of Trigger.Continuous, and org.apache.spark.sql.execution.streaming.OneTimeTrigger has been hidden in favor of Trigger.Once. See SPARK-28199.

SQL, Datasets, and DataFrame

  • In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per ANSI SQL standard. Certain unreasonable type conversions such as converting string to int and double to boolean are disallowed. A runtime exception will be thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and earlier, type conversions during table insertion are allowed as long as they are valid Cast. When inserting an out-of-range value to a integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of byte type, the result is 1. The behavior is controlled by the option spark.sql.storeAssignmentPolicy, with a default value as “ANSI”. Setting the option to “Legacy” restores the previous behavior.
  • In Spark 3.0, when casting string value to integral types (tinyint, smallint, int and bigint), datetime types (date, timestamp and interval) and boolean type, the leading and trailing whitespaces (<= ACSII 32) are trimmed before being converted to these type values, for example cast(' 1\t' as int) returns 1, cast(' 1\t' as boolean) returns true, cast('2019-10-10\t as date) returns the date value 2019-10-10. In Spark version 2.4 and earlier, while casting string to integrals and booleans, it will not trim the whitespaces from both ends, the foregoing results will be null, while to datetimes, only the trailing spaces (= ASCII 32) will be removed. See https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html.
  • In Spark 3.0, the deprecated methods SQLContext.createExternalTable and SparkSession.createExternalTable have been removed in favor of their replacement, createTable.
  • In Spark 3.0, configuration spark.sql.crossJoin.enabled becomes internal configuration, and is true by default, so by default Spark won’t raise an exception on SQL with implicit cross joins.
  • In Spark 3.0, we reversed argument order of the trim function from TRIM(trimStr, str) to TRIM(str, trimStr) to be compatible with other databases.
  • In Spark version 2.4 and earlier, SQL queries such as FROM <table> or FROM <table> UNION ALL FROM <table> are supported by accident. In hive-style FROM <table> SELECT <expr>, the SELECT clause is not negligible. Neither Hive nor Presto support this syntax. Therefore we will treat these queries as invalid since Spark 3.0.
  • Since Spark 3.0, the Dataset and DataFrame API unionAll is not deprecated any more. It is an alias for union.
  • In Spark version 2.4 and earlier, the parser of JSON data source treats empty strings as null for some data types such as IntegerType. For FloatType and DoubleType, it fails on empty strings and throws exceptions. Since Spark 3.0, we disallow empty strings and will throw exceptions for data types except for StringType and BinaryType.
  • Since Spark 3.0, the from_json functions support two modes - PERMISSIVE and FAILFAST. The modes can be set via the mode option. The default mode became PERMISSIVE. In previous versions, behavior of from_json did not conform to either PERMISSIVE or FAILFAST, especially in processing of malformed JSON records. For example, the JSON string {"a" 1} with the schema a INT is converted to null by previous versions but Spark 3.0 converts it to Row(null).

DDL Statements

  • In Spark 3.0, CREATE TABLE without a specific provider uses the value of spark.sql.sources.default as its provider. In Spark version 2.4 and below, it was Hive. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
  • In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per ANSI SQL standard. Certain unreasonable type conversions such as converting string to int and double to boolean are disallowed. A runtime exception is thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are valid Cast. When inserting an out-of-range value to a integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of byte type, the result is 1. The behavior is controlled by the option spark.sql.storeAssignmentPolicy, with a default value as “ANSI”. Setting the option as “Legacy” restores the previous behavior.
  • In Spark 3.0, SHOW CREATE TABLE always returns Spark DDL, even when the given table is a Hive SerDe table. For generating Hive DDL, use SHOW CREATE TABLE AS SERDE command instead.
  • In Spark 3.0, column of CHAR type is not allowed in non-Hive-Serde tables, and CREATE/ALTER TABLE commands will fail if CHAR type is detected. Please use STRING type instead. In Spark version 2.4 and below, CHAR type is treated as STRING type and the length parameter is simply ignored.

UDFs and Built-in Functions

  • In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and returns 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
  • In Spark version 2.4 and below, you can create a map with duplicated keys via built-in functions like CreateMap, StringToMap, etc. The behavior of map with duplicated keys is undefined, for example, map look up respects the duplicated key appears first, Dataset.collect only keeps the duplicated key appears last, MapKeys returns duplicated keys, etc. In Spark 3.0, Spark throws RuntimeException when duplicated keys are found. You can set spark.sql.mapKeyDedupPolicy to LAST_WIN to deduplicate map keys with last wins policy. Users may still read map values with duplicated keys from data sources which do not enforce it (for example, Parquet), the behavior is undefined.

Data Sources

  • In Spark version 2.4 and below, partition column value is converted as null if it can’t be cast to a corresponding user provided schema. In 3.0, partition column value is validated with a user provided schema. An exception is thrown if the validation fails. You can disable such validation by setting spark.sql.sources.validatePartitionColumns to false.
  • In Spark version 2.4 and below, the parser of JSON data source treats empty strings as null for some data types such as IntegerType. For FloatType, DoubleType, DateType and TimestampType, it fails on empty strings and throws exceptions. Spark 3.0 disallows empty strings and will throw an exception for data types except for StringType and BinaryType. The previous behavior of allowing an empty string can be restored by setting spark.sql.legacy.json.allowEmptyString.enabled to true.
  • In Spark 3.0, if files or subdirectories disappear during recursive directory listing (that is, they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless spark.sql.files.ignoreMissingFiles is true (default false). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during REFRESH TABLE), not during query execution: the net change is that spark.sql.files.ignoreMissingFiles is now obeyed during table file listing and query planning, not only at query execution time.
  • In Spark version 2.4 and below, CSV datasource converts a malformed CSV string to a row with all nulls in the PERMISSIVE mode. In Spark 3.0, the returned row can contain non-null fields if some of CSV column values were parsed and converted to desired types successfully.
  • In Spark 3.0, parquet logical type TIMESTAMP_MICROS is used by default while saving TIMESTAMP columns. In Spark version 2.4 and below, TIMESTAMP columns are saved as INT96 in parquet files. Note that some SQL systems such as Hive 1.x and Impala 2.x can only read INT96 timestamps. You can set spark.sql.parquet.outputTimestampType as INT96 to restore the previous behavior and keep interoperability.
  • In Spark 3.0, when Avro files are written with user provided schema, the fields are matched by field names between catalyst schema and Avro schema instead of positions.

Query Engine

  • In Spark 3.0, Dataset query fails if it contains ambiguous column reference that is caused by self join. A typical example: val df1 = ...; val df2 = df1.filter(...);, then df1.join(df2, df1("a") > df2("a")) returns an empty result which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self joined, and df1("a") is exactly the same as df2("a") in Spark. To restore the behavior before Spark 3.0, you can set spark.sql.analyzer.failAmbiguousSelfJoin to false.
  • In Spark 3.0, numbers written in scientific notation (for example, 1E2) are parsed as Double. In Spark version 2.4 and below, they’re parsed as Decimal. To restore the pre-Spark 3.0 behavior, you can set spark.sql.legacy.exponentLiteralAsDecimal.enabled to true.
  • In Spark 3.0, configuration spark.sql.crossJoin.enabled becomes an internal configuration and is true by default. By default Spark won’t raise exceptions on SQL with implicit cross joins.
  • In Spark version 2.4 and below, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys, and join keys. In Spark 3.0, this bug is fixed. For example, Seq(-0.0, 0.0).toDF("d").groupBy("d").count() returns [(0.0, 2)] in Spark 3.0, and [(0.0, 1), (-0.0, 1)] in Spark 2.4 and below.
  • In Spark 3.0, TIMESTAMP literals are converted to strings using the SQL config spark.sql.session.timeZone. In Spark version 2.4 and below, the conversion uses the default time zone of the Java virtual machine.
  • In Spark 3.0, Spark casts String to Date/Timestamp in binary comparisons with dates/timestamps. The previous behavior of casting Date/Timestamp to String can be restored by setting spark.sql.legacy.typeCoercion.datetimeToString.enabled to true.
  • In Spark version 2.4 and below, invalid time zone ids are silently ignored and replaced by GMT time zone, for example, in the from_utc_timestamp function. In Spark 3.0, such time zone ids are rejected, and Spark throws java.time.DateTimeException.
  • In Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and so on. Spark 3.0 uses Java 8 API classes from the java.time packages that are based on ISO chronology. In Spark version 2.4 and below, those operations are performed using the hybrid calendar (Julian + Gregorian). The changes impact the results for dates before October 15, 1582 (Gregorian) and affect the following Spark 3.0 API:
    • Parsing/formatting of timestamp/date strings. This effects on CSV/JSON datasources and on the unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date, to_timestamp functions when patterns specified by users is used for parsing and formatting. In Spark 3.0, we define our own pattern strings in sql-ref-datetime-pattern.md, which is implemented via java.time.format.DateTimeFormatter under the hood. The new implementation performs strict checking of its input. For example, the 2015-07-22 10:00:00 timestamp cannot be parse if pattern is yyyy-MM-dd because the parser does not consume whole input. Another example is the 31/01/2015 00:00 input cannot be parsed by the dd/MM/yyyy hh:mm pattern because hh presupposes hours in the range 1-12. In Spark version 2.4 and below, java.text.SimpleDateFormat is used for timestamp/date string conversions, and the supported patterns are described in simpleDateFormat. The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY.
    • The weekofyear, weekday, dayofweek, date_trunc, from_utc_timestamp, to_utc_timestamp, and unix_timestamp functions use java.time API for calculating week number of year, day number of week as well for conversion from/to TimestampType values in UTC time zone.
    • The JDBC options lowerBound and upperBound are converted to TimestampType/DateType values in the same way as casting strings to TimestampType/DateType values. The conversion is based on Proleptic Gregorian calendar, and time zone defined by the SQL config spark.sql.session.timeZone. In Spark version 2.4 and below, the conversion is based on the hybrid calendar (Julian + Gregorian) and on default system time zone.
    • Formatting TIMESTAMP and DATE literals.
    • Creating typed TIMESTAMP and DATE literals from strings. In Spark 3.0, string conversion to typed TIMESTAMP/DATE literals is performed via casting to TIMESTAMP/DATE values. For example, TIMESTAMP '2019-12-23 12:59:30' is semantically equal to CAST('2019-12-23 12:59:30' AS TIMESTAMP). When the input string does not contain information about time zone, the time zone from the SQL config spark.sql.session.timeZone is used in that case. In Spark version 2.4 and below, the conversion is based on JVM system time zone. The different sources of the default time zone may change the behavior of typed TIMESTAMP and DATE literals.

Apache Hive

  • In Spark 3.0, we upgraded the built-in Hive version from 1.2 to 2.3 which brings the following impacts:
    • You may need to set spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars according to the version of the Hive metastore you want to connect to. For example: set spark.sql.hive.metastore.version to 1.2.1 and spark.sql.hive.metastore.jars to maven if your Hive metastore version is 1.2.1.
    • You need to migrate your custom SerDes to Hive 2.3 or build your own Spark with hive-1.2 profile. See HIVE-15167 for more details.
    • The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using TRANSFORM operator in SQL for script transformation, which depends on hive’s behavior. In Hive 1.2, the string representation omits trailing zeroes. But in Hive 2.3, it is always padded to 18 digits with trailing zeroes if necessary.
    • In Databricks Runtime 7.x, when reading a Hive SerDe table, by default Spark disallows reading files under a subdirectory that is not a table partition. To enable it, set the configuration spark.databricks.io.hive.scanNonpartitionedDirectory.enabled as true. This does not affect Spark native table readers and file readers.

MLlib

  • OneHotEncoder, which is deprecated in 2.3, is removed in 3.0 and OneHotEncoderEstimator is now renamed to OneHotEncoder.
  • org.apache.spark.ml.image.ImageSchema.readImages, which is deprecated in 2.3, is removed in 3.0. Use spark.read.format('image') instead.
  • org.apache.spark.mllib.clustering.KMeans.train with param Int runs, which is deprecated in 2.1, is removed in 3.0. Use train method without runs instead.
  • org.apache.spark.mllib.classification.LogisticRegressionWithSGD, which is deprecated in 2.0, is removed in 3.0, use org.apache.spark.ml.classification.LogisticRegression or spark.mllib.classification.LogisticRegressionWithLBFGS instead.
  • org.apache.spark.mllib.feature.ChiSqSelectorModel.isSorted, which is deprecated in 2.1, is removed in 3.0, is not intended for subclasses to use.
  • org.apache.spark.mllib.regression.RidgeRegressionWithSGD, which is deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.regression.LinearRegression with elasticNetParam = 0.0. Note the default regParam is 0.01 for RidgeRegressionWithSGD, but is 0.0 for LinearRegression.
  • org.apache.spark.mllib.regression.LassoWithSGD, which is deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.regression.LinearRegression with elasticNetParam = 1.0. Note the default regParam is 0.01 for LassoWithSGD, but is 0.0 for LinearRegression.
  • org.apache.spark.mllib.regression.LinearRegressionWithSGD, which is deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.regression.LinearRegression or LBFGS instead.
  • org.apache.spark.mllib.clustering.KMeans.getRuns and setRuns, which are deprecated in 2.1, are removed in 3.0, and have had no effect since Spark 2.0.0.
  • org.apache.spark.ml.LinearSVCModel.setWeightCol, which is deprecated in 2.4, is removed in 3.0, and is not intended for users.
  • In 3.0, org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel extends MultilayerPerceptronParams to expose the training params. As a result, layers in MultilayerPerceptronClassificationModel has been changed from Array[Int] to IntArrayParam. You should use MultilayerPerceptronClassificationModel.getLayers instead of MultilayerPerceptronClassificationModel.layers to retrieve the size of layers.
  • org.apache.spark.ml.classification.GBTClassifier.numTrees, which is deprecated in 2.4.5, is removed in 3.0. Use getNumTrees instead.
  • org.apache.spark.ml.clustering.KMeansModel.computeCost, which is deprecated in 2.4, is removed in 3.0, use ClusteringEvaluator instead.
  • The member variable precision in org.apache.spark.mllib.evaluation.MulticlassMetrics, which is deprecated in 2.0, is removed in 3.0. Use accuracy instead.
  • The member variable recall in org.apache.spark.mllib.evaluation.MulticlassMetrics, which is deprecated in 2.0, is removed in 3.0. Use accuracy instead.
  • The member variable fMeasure in org.apache.spark.mllib.evaluation.MulticlassMetrics, which is deprecated in 2.0, is removed in 3.0. Use accuracy instead.
  • org.apache.spark.ml.util.GeneralMLWriter.context, which is deprecated in 2.0, is removed in 3.0. Use session instead.
  • org.apache.spark.ml.util.MLWriter.context, which is deprecated in 2.0, is removed in 3.0. Use session instead.
  • org.apache.spark.ml.util.MLReader.context, which is deprecated in 2.0, is removed in 3.0. Use session instead.
  • abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]] is changed to abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]] in 3.0.
  • In Spark 3.0, a multiclass logistic regression in Pyspark will now (correctly) return LogisticRegressionSummary, not the subclass BinaryLogisticRegressionSummary. The additional methods exposed by BinaryLogisticRegressionSummary would not work in this case anyway. (SPARK-31681)
  • In Spark 3.0, pyspark.ml.param.shared.Has* mixins do not provide any set*(self, value) setter methods anymore, use the respective self.set(self.*, value) instead. See SPARK-29093 for details. (SPARK-29093)

Other behavior changes

  • The upgrade to Scala 2.12 involves the following changes:

    • Package cell serialization is handled differently. The following example illustrates the behavior change and how to handle it.

      Running foo.bar.MyObjectInPackageCell.run() as defined in the following package cell will trigger the error java.lang.NoClassDefFoundError: Could not initialize class foo.bar.MyObjectInPackageCell$

      package foo.bar
      
      case class MyIntStruct(int: Int)
      
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.Column
      
      object MyObjectInPackageCell extends Serializable {
      
        // Because SparkSession cannot be created in Spark executors,
        // the following line triggers the error
        // Could not initialize class foo.bar.MyObjectInPackageCell$
        val spark = SparkSession.builder.getOrCreate()
      
        def foo: Int => Option[MyIntStruct] = (x: Int) => Some(MyIntStruct(100))
      
        val theUDF = udf(foo)
      
        val df = {
          val myUDFInstance = theUDF(col("id"))
          spark.range(0, 1, 1, 1).withColumn("u", myUDFInstance)
        }
      
        def run(): Unit = {
          df.collect().foreach(println)
        }
      }
      

      To work around this error, you can wrap MyObjectInPackageCell inside a serializable class.

    • Certain cases using DataStreamWriter.foreachBatch will require a source code update. This change is due to the fact that Scala 2.12 has automatic conversion from lambda expressions to SAM types and can cause ambiguity.

      For example, the following Scala code can’t compile:

      streams
        .writeStream
        .foreachBatch { (df, id) => myFunc(df, id) }
      

      To fix the compilation error, change foreachBatch { (df, id) => myFunc(df, id) } to foreachBatch(myFunc _) or use the Java API explicitly: foreachBatch(new VoidFunction2 ...).

  • Because the Apache Hive version used for handling Hive user-defined functions and Hive SerDes is upgraded to 2.3, two changes are required:

    • Hive’s SerDe interface is replaced by an abstract class AbstractSerDe. For any custom Hive SerDe implementation, migrating to AbstractSerDe is required.
    • Setting spark.sql.hive.metastore.jars to builtin means that the Hive 2.3 metastore client will be used to access metastores for Databricks Runtime 7.x. If you need to access Hive 1.2 based external metastores, set spark.sql.hive.metastore.jars to the folder that contains Hive 1.2 jars.

Deprecations and removals

  • Data skipping index was deprecated in Databricks Runtime 4.3 and removed in Databricks Runtime 7.x. We recommend that you use Delta tables instead, which offer improved data skipping capabilities.
  • In Databricks Runtime 7.x, the underlying version of Apache Spark uses Scala 2.12. Since libraries compiled against Scala 2.11 can disable Databricks Runtime 7.x clusters in unexpected ways, clusters running Databricks Runtime 7.x do not install libraries configured to be installed on all clusters. The cluster Libraries tab shows a status Skipped and a deprecation message that explains the changes in library handling. However, if you have a cluster that was created on an earlier version of Databricks Runtime before Azure Databricks platform version 3.20 was released to your workspace, and you now edit that cluster to use Databricks Runtime 7.x, any libraries that were configured to be installed on all clusters will be installed on that cluster. In this case, any incompatible JARs in the installed libraries can cause the cluster to be disabled. The workaround is either to clone the cluster or to create a new cluster.

Known issues

  • Parsing day of year using pattern letter ‘D’ returns the wrong result if the year field is missing. This can happen in SQL functions like to_timestamp which parses datetime string to datetime values using a pattern string. (SPARK-31939)
  • Join/Window/Aggregate inside subqueries may lead to wrong results if the keys have values -0.0 and 0.0. (SPARK-31958)
  • A window query may fail with ambiguous self-join error unexpectedly. (SPARK-31956)
  • Streaming queries with dropDuplicates operator may not be able to restart with the checkpoint written by Spark 2.x. (SPARK-31990)