revoscalepy package

Article
07/15/2019

The revoscalepy module is a collection of portable, scalable and distributable Python functions used for importing, transforming, and analyzing data at scale. You can use it for descriptive statistics, generalized linear models, logistic regression, classification and regression trees, and decision forests.

Functions run on the revoscalepy interpreter, built on open-source Python, engineered to leverage the multithreaded and multinode architecture of the host platform.

Package details	Information
Current version:	9.4
Built on:	Anaconda 4.2 distribution of Python 3.5
Package distribution:	Machine Learning Server 9.x; SQL Server 2017 Machine Learning Services; SQL Server 2017 Machine Learning Server (Standalone)

How to use revoscalepy

The revoscalepy module is found in Machine Learning Server or SQL Server Machine Learning when you add Python to your installation. You get the full collection of proprietary packages plus a Python distribution with its modules and interpreter.

You can use any Python IDE to write Python script calling functions in revoscalepy, but the script must run on a computer having our proprietary modules. For a review of common tasks, see How to use revoscalepy with Spark.

Run it locally

This is the default. The revoscalepy library runs locally on all platforms. On a standalone Linux or Windows system, data and operations are local to the machine. On Spark, a local compute context means that data and operations are local to the current execution environment (typically, an edge node).
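As a rough sketch of the local case, the snippet below sets the compute context explicitly to RxLocalSeq and calls rx_summary on an in-memory data frame. The sample values, the column names, and the helpers build_formula and summarize_locally are illustrative assumptions, not part of the package. The pandas and revoscalepy imports are deferred so that the snippet only reports a message on a machine where they are not installed.

```python
def build_formula(columns):
    """Build a one-sided formula such as '~ height + weight' (hypothetical helper)."""
    return "~ " + " + ".join(columns)


def summarize_locally(rows, columns):
    """Summarize the given columns of in-memory data in a local compute context."""
    # Deferred imports: both packages are expected on a Machine Learning Server install.
    import pandas as pd
    from revoscalepy import RxLocalSeq, rx_set_compute_context, rx_summary

    # RxLocalSeq is already the default; setting it here makes the choice explicit.
    rx_set_compute_context(RxLocalSeq())
    df = pd.DataFrame(rows, columns=columns)
    return rx_summary(build_formula(columns), data=df)


# Made-up sample data, for illustration only.
sample_rows = [(1.62, 58.0), (1.75, 72.5), (1.80, 81.0)]
sample_columns = ["height", "weight"]

try:
    print(summarize_locally(sample_rows, sample_columns))
except ImportError:
    print("revoscalepy (or pandas) is not installed on this machine")
```

Because nothing in this sketch names a remote target, the data and the computation stay on the machine that runs the script.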

Run in a remote compute context

In a remote compute context, a script running on a local Machine Learning Server shifts execution to a remote Machine Learning Server on Spark or SQL Server. For example, a script running on Windows might shift execution to a Spark cluster to process data there.

On Spark, set the compute context to RxSpark cluster and give the cluster name. In this context, if you call a function that can run in parallel, the task is distributed across data nodes in the cluster, where the operation is co-located with the data.

On SQL Server, set the compute context to RxInSqlServer. There are two primary use cases for remote compute context:

Call Python functions in T-SQL script or stored procedures running on SQL Server.
Call revoscalepy functions in Python script executing in a SQL Server compute context. In your script, you can set a compute context to shift execution of revoscalepy operations to a remote SQL Server instance that has the revoscalepy interpreter.
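The second use case might look roughly like the sketch below, which builds an ODBC connection string and passes it to RxInSqlServer before calling rx_set_compute_context. The server name, database name, driver name, and the helpers make_connection_string and switch_to_sql_server are hypothetical placeholders; substitute the values for your own instance. The function that touches SQL Server is defined but not called here, because it needs a reachable instance with Python integration installed.

```python
def make_connection_string(server, database, driver="SQL Server"):
    """Assemble a Windows-authentication ODBC connection string (hypothetical helper)."""
    return (
        "Driver={driver};Server={server};Database={database};"
        "Trusted_Connection=yes;"
    ).format(driver=driver, server=server, database=database)


def switch_to_sql_server(connection_string):
    """Shift later revoscalepy calls to a remote SQL Server instance."""
    # Deferred import so the helper above works without revoscalepy installed.
    from revoscalepy import RxInSqlServer, rx_set_compute_context

    sql_context = RxInSqlServer(connection_string=connection_string, num_tasks=1)
    rx_set_compute_context(sql_context)
    return sql_context


# Placeholder names; replace with a real server and database.
connection_string = make_connection_string("myserver", "mydb")
print(connection_string)
```

After calling switch_to_sql_server(connection_string), functions that support the RxInSqlServer context run on the server. Passing RxLocalSeq() to rx_set_compute_context returns execution to the local machine.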

Functions by category

The library includes data transformation and manipulation, visualization, predictions, and statistical analysis functions. It also includes functions for controlling jobs, serializing data, and performing common utility tasks.

This section lists the functions by category to give you an idea of how each one is used. The table of contents lists functions in alphabetical order.

Note

Some function names begin with rx_ and others with Rx. The Rx function name prefix is used for class constructors for data sources and compute contexts.

1-Compute context functions

Function	Description
RxInSqlServer	Creates a compute context for running revoscalepy analyses inside a remote Microsoft SQL Server.
RxLocalSeq	This is the default, but you can call it to switch back to a local compute context if your script runs in multiple contexts. Computations using rx_exec will be processed sequentially.
rx_get_compute_context	Returns the current compute context.
rx_set_compute_context	Change the compute context to a different one.
RxSpark	Creates a compute context for running revoscalepy analyses in a remote Spark cluster.
rx_get_pyspark_connection	Gets a connection to a PySpark data set, in support of revoscalepy and PySpark interoperability.
rx_spark_connect	Creates a persistent Spark Connection.
rx_spark_disconnect	Closes the connection.

2-Data source functions

Data sources are used by microsoftml functions as well as revoscalepy.

Function	Compute Context	Description
RxDataSource	All	Base class for all revoscalepy data sources.
RxHdfsFileSystem	Local, RxSpark	Data source is accessed through HDFS instead of Linux.
RxNativeFileSystem	Local, RxSpark	Data source is accessed through Linux instead of HDFS.
RxHiveData	Local, RxSpark	Generates a data source object from a Hive data file.
RxTextData	Local, RxSpark	Generates a data source object from a text data file.
RxXdfData	All	Generates a data source object from an XDF data source.
RxOdbcData	All	Generates a data source object from an ODBC data source.
RxOrcData	Local, RxSpark	Generates a data source object from an Orc data file.
RxParquetData	Local, RxSpark	Generates a data source object from a Parquet data file.
RxSparkData	Local, RxSpark	Generates a data source object from a Spark data source.
RxSparkDataFrame	Local, RxSpark	Generates a data source object from a Spark data frame.
rx_get_partitions	Local, RxSpark	Get partitions of a partitioned Xdf data source.
rx_partition	Local, RxSpark	Partition input data sources by key values and save the results to a partitioned .xdf on disk.
rx_spark_cache_data	Local, RxSpark	Generates a data source object from cached data.
rx_spark_list_data	Local, RxSpark	Generates a data source object from a list.
rx_spark_remove_data	Local, RxSpark	Deletes the Spark cached data source object.
RxSqlServerData	Local, RxInSqlServer	Generates a data source object from a SQL table or query.

3-Data manipulation (ETL) functions

Function	Compute Context	Description
rx_import	All	Import data into an .xdf file or data frame.
rx_data_step	All	Transform data from an input data set to an output data set.

4-Analytic functions

Function	Compute Context	Description
rx_exec_by	Local, RxSpark	Execute an arbitrary function in parallel on multiple data nodes.
rx_summary	All	Produce univariate summaries of objects in revoscalepy.
rx_lin_mod	All	Fit linear models on small or large data.
rx_logit	All	Use rx_logit to fit logistic regression models for small or large data.
rx_dtree	All	Fit classification and regression trees on an ‘.xdf’ file or data frame for small or large data using parallel external memory algorithm.
rx_dforest	All	Fit classification and regression decision forests on an ‘.xdf’ file or data frame for small or large data using parallel external memory algorithm.
rx_btrees	All	Fit stochastic gradient boosted decision trees on an ‘.xdf’ file or data frame for small or large data using parallel external memory algorithm.
rx_predict_default	All	Compute predicted values and residuals using rx_lin_mod and rx_logit objects.
rx_predict_rx_dforest	All	Calculate predicted or fitted values for a data set from an rx_dforest or rx_btrees object.
rx_predict_rx_dtree	All	Calculate predicted or fitted values for a data set from an rx_dtree object.
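To illustrate how a fitting function and a prediction function pair up, here is a minimal sketch that fits a linear model with rx_lin_mod and scores new rows. It assumes the package's generic rx_predict entry point, which is not listed in the table above, and it assumes the inputs are pandas data frames. The helpers make_formula and fit_and_predict and the column names are hypothetical.

```python
def make_formula(response, predictors):
    """Build a two-sided formula such as 'y ~ x1 + x2' (hypothetical helper)."""
    return "{} ~ {}".format(response, " + ".join(predictors))


def fit_and_predict(train_df, score_df, response, predictors):
    """Fit a linear model on train_df and return predictions for score_df."""
    # Deferred import so make_formula can be used without revoscalepy installed.
    from revoscalepy import rx_lin_mod, rx_predict

    model = rx_lin_mod(make_formula(response, predictors), data=train_df)
    # rx_predict is assumed here to accept the fitted model and a data frame.
    return rx_predict(model, data=score_df)


# Illustrative column names only.
print(make_formula("price", ["area", "rooms"]))
```

The same pattern should carry over to the other model types: swap rx_lin_mod for rx_logit, rx_dtree, rx_dforest, or rx_btrees and keep the formula-plus-data call shape.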

5-Job functions

In an RxSpark context, job management is built in. You only need job functions if you want to manually control the YARN queue.

Function	Compute Context	Description
rx_exec	All	Allows distributed execution of a function in parallel across nodes (computers) or cores of a “compute context” such as a cluster.
rx_cancel_job	All	Removes all job-related artifacts from the distributed computing resources, including any job results.
rx_cleanup_jobs	All	Removes the artifacts for a specific job.
RxRemoteJob class	All	Closes the remote job, purging all associated job-related data.
RxRemoteJobStatus	All	Represents the execution status of a remote Python job.
rx_get_job_info	All	Contains complete information on the job’s compute context as well as other information needed by the distributed computing resources.
rx_get_job_output	All	Returns console output for the nodes participating in a distributed computing job.
rx_get_job_results	All	Returns results of the run or a message stating why results are not available.
rx_get_job_status	All	Obtain distributed computing processing status for the specified job.
rx_get_jobs	All	Returns a list of job objects associated with the given compute context and matching the specified parameters.
rx_wait_for_job	All	Block on an existing distributed job until completion, effectively turning a non-blocking job into a blocking job.

6-Serialization functions

Function	Compute Context	Description
rx_serialize_model	All	Serialize a given Python model.
rx_read_object	All	Retrieves an ODBC data source object.
rx_read_xdf	All	Read data from an .xdf file into a data frame.
rx_write_object	All	Stores an ODBC data source object.
rx_delete_object	All	Deletes an object from the ODBC data source.
rx_list_keys	All	Enumerates all keys or versions for a given key, depending on the parameters.

7-Utility functions

Function	Compute Context	Description
RxOptions	All	Specify and retrieve options needed for revoscalepy computations.
rx_get_info	All	Get basic information about a revoscalepy data source or data frame.
rx_get_var_info	All	Get variable information for a revoscalepy data source or data frame, including variable names, descriptions, and value labels.
rx_get_var_names	All	Read the variable names for a data source or data frame.
rx_set_var_info	All	Set the variable information for an .xdf file, including variable names, descriptions, and value labels, or set attributes for variables in a data frame.
RxMissingValues	All	Provides missing values for various `NumPy` data types which you can use to mark missing values in a sequence of data in `ndarray`.
rx_privacy_control	All	Opt out of usage data collection.
rx_hadoop_command	Local, RxSpark	Execute arbitrary Hadoop commands and perform standard file operations in Hadoop.
rx_hadoop_copy_from_local	Local, RxSpark	Wraps the Hadoop `fs -copyFromLocal` command.
rx_hadoop_copy_to_local	Local, RxSpark	Wraps the Hadoop `fs -copyToLocal` command.
rx_hadoop_copy	Local, RxSpark	Wraps the Hadoop `fs -cp` command.
rx_hadoop_file_exists	Local, RxSpark	Wraps the Hadoop `fs -test -e` command.
rx_hadoop_list_files	Local, RxSpark	Wraps the Hadoop `fs -ls` or `fs -lsr` command.
rx_hadoop_make_dir	Local, RxSpark	Wraps the Hadoop `fs -mkdir -p` command.
rx_hadoop_move	Local, RxSpark	Wraps the Hadoop `fs -mv` command.
rx_hadoop_remove_dir	Local, RxSpark	Wraps the Hadoop `fs -rm -r` or `fs -rm -r -skipTrash` command.
rx_hadoop_remove	Local, RxSpark	Wraps the Hadoop `fs -rm` or `fs -rm -skipTrash` command.

Next steps

For Machine Learning Server, try a quickstart as an introduction to revoscalepy:

revoscalepy and PySpark interoperability

For SQL Server, add both Python modules to your computer by running setup:

Set up Python Machine Learning Services.

Follow these SQL Server tutorials for hands-on experience:
