Comparing SparkR and sparklyr

Article
12/27/2024

Important

SparkR in Databricks is deprecated in Databricks Runtime 16.0 and above.

Two APIs are available for R users of Apache Spark: SparkR and sparklyr. Databricks recommends sparklyr because SparkR is deprecated. To help you migrate your code, this article compares the two APIs.

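API origins

SparkR is built by the Spark community and developers from Databricks. Because of this, SparkR closely follows the Spark Scala classes and DataFrame API.

sparklyr started at RStudio and has since been donated to the Linux Foundation. sparklyr is tightly integrated into the tidyverse, both in its programming style and through API interoperability with dplyr.

Both SparkR and sparklyr are highly capable of working with big data in R. Over the past few years, their feature sets have been moving closer to parity.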

API differences

The following code example shows how to use SparkR and sparklyr from an Azure Databricks notebook to read a CSV file from the sample datasets into Spark.

# #############################################################################
# SparkR usage

# Note: To load SparkR into a Databricks notebook, run the following:

# library(SparkR)

# You can then remove "SparkR::" from the following function call.
# #############################################################################

# Use SparkR to read the airlines dataset from 2008.
airlinesDF <- SparkR::read.df(path        = "/databricks-datasets/asa/airlines/2008.csv",
                              source      = "csv",
                              inferSchema = "true",
                              header      = "true")

# Print the loaded dataset's class name.
cat("Class of SparkR object: ", class(airlinesDF), "\n")

# Output:
#
# Class of SparkR object: SparkDataFrame

# #############################################################################
# sparklyr usage

# Note: To install, load, and connect with sparklyr in a Databricks notebook,
# run the following:

# install.packages("sparklyr")
# library(sparklyr)
# sc <- sparklyr::spark_connect(method = "databricks")

# If you run "library(sparklyr)", you can then remove "sparklyr::" from the
# preceding "spark_connect" and from the following function call.
# #############################################################################

# Use sparklyr to read the airlines dataset from 2007.
airlines_sdf <- sparklyr::spark_read_csv(sc   = sc,
                                         name = "airlines",
                                         path = "/databricks-datasets/asa/airlines/2007.csv")

# Print the loaded dataset's class name.
cat("Class of sparklyr object: ", class(airlines_sdf))

# Output:
#
# Class of sparklyr object: tbl_spark tbl_sql tbl_lazy tbl

However, if you try to run a sparklyr function on a SparkDataFrame object from SparkR, or a SparkR function on a tbl_spark object from sparklyr, it will not work, as shown in the following code example.

# Try to call a sparklyr function on a SparkR SparkDataFrame object. It will not work.
sparklyr::sdf_pivot(airlinesDF, DepDelay ~ UniqueCarrier)

# Output:
#
# Error : Unable to retrieve a Spark DataFrame from object of class SparkDataFrame

# Now try to call a SparkR function on a sparklyr tbl_spark object. It also will not work.
SparkR::arrange(airlines_sdf, "DepDelay")

# Output:
#
# Error in (function (classes, fdef, mtable) :
#   unable to find an inherited method for function ‘arrange’ for signature ‘"tbl_spark", "character"’

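This is because sparklyr translates dplyr functions such as arrange into a SQL query plan that SparkSQL then executes. SparkR works differently: it provides its own functions for SparkSQL tables and Spark DataFrames. Because of these differences, Databricks does not recommend combining the SparkR and sparklyr APIs in the same script, notebook, or job.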
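
Each object does work with its own API. As a minimal sketch, assuming the airlinesDF and airlines_sdf objects from the earlier examples still exist in the notebook session, the following sorts each object with the matching function and then uses dplyr::show_query to print the SQL that sparklyr generates for the dplyr call. The variable names sorted_DF and sorted_sdf are illustrative and do not come from the original examples.

```r
# Assumes airlinesDF (SparkR) and airlines_sdf (sparklyr) were created
# by the earlier examples in the same notebook session.

# SparkR: arrange is a method defined for SparkDataFrame objects.
sorted_DF <- SparkR::arrange(airlinesDF, "DepDelay")

# sparklyr: dplyr::arrange is recorded lazily on the tbl_spark object
# and translated to SQL when the result is needed.
sorted_sdf <- dplyr::arrange(airlines_sdf, DepDelay)

# Print the SQL that the dplyr call was translated into.
dplyr::show_query(sorted_sdf)
```

Calling the functions with an explicit namespace prefix, as above, avoids attaching both packages at once, since SparkR and dplyr export several functions with the same names.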

API interoperability

In rare cases where you cannot avoid combining the SparkR and sparklyr APIs, you can use SparkSQL as a kind of bridge. For example, in this article's first example, sparklyr loaded the airlines dataset from 2007 into a table named airlines. You can use the SparkR sql function to query this table, for example:

top10delaysDF <- SparkR::sql("SELECT
                               UniqueCarrier,
                               DepDelay,
                               Origin
                             FROM
                               airlines
                             WHERE
                               DepDelay NOT LIKE 'NA'
                             ORDER BY DepDelay
                             DESC LIMIT 10")

# Print the class name of the query result.
cat("Class of top10delaysDF: ", class(top10delaysDF), "\n\n")

# Show the query result.
cat("Top 10 airline delays for 2007:\n\n")
head(top10delaysDF, 10)

# Output:
#
# Class of top10delaysDF: SparkDataFrame
#
# Top 10 airline delays for 2007:
#
#   UniqueCarrier DepDelay Origin
# 1            AA      999    RNO
# 2            NW      999    EWR
# 3            AA      999    PHL
# 4            MQ      998    RST
# 5            9E      997    SWF
# 6            AA      996    DFW
# 7            NW      996    DEN
# 8            MQ      995    IND
# 9            MQ      994    SJT
# 10           AA      993    MSY
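
The bridge can also be crossed in the other direction. The following is a sketch only, assuming that SparkR and sparklyr share the same Spark session in the notebook (as the preceding example implies) and that airlinesDF and sc from the earlier examples still exist. The view name airlines_2008 and the variable airlines_2008_sdf are illustrative.

```r
# Assumes airlinesDF (SparkR) and sc (sparklyr connection) were created
# by the earlier examples in the same notebook session.

# Register the SparkR SparkDataFrame as a temporary view so that it can
# be referenced by name from SQL.
SparkR::createOrReplaceTempView(airlinesDF, "airlines_2008")

# Reference the view from sparklyr as a lazy dplyr table.
airlines_2008_sdf <- dplyr::tbl(sc, "airlines_2008")

# Print the class name of the resulting object.
cat("Class of airlines_2008_sdf: ", class(airlines_2008_sdf), "\n")
```

Because the temporary view is scoped to the Spark session, this approach only works while both APIs are attached to that same session.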

For additional examples, see Work with DataFrames and tables in R.
