rx_import

Usage

revoscalepy.rx_import(input_data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    pandas.core.frame.DataFrame, str], output_file=None,
    vars_to_keep: list = None, vars_to_drop: list = None,
    row_selection: str = None, transforms: dict = None,
    transform_objects: dict = None, transform_function: callable = None,
    transform_variables: dict = None,
    transform_packages: dict = None, append: str = None,
    overwrite: bool = False, number_rows: int = None,
    strings_as_factors: bool = None, column_classes: dict = None,
    column_info: dict = None, rows_per_read: int = None,
    type: str = None, max_rows_by_columns: int = None,
    report_progress: int = None, verbose: int = None,
    xdf_compression_level: int = None,
    create_composite_set: bool = None,
    blocks_per_composite_file: int = None)

Description

Import data and store it as an .xdf file on disk or in memory as a data frame object.

Arguments

input_data

A character string with the path for the data to import (delimited, fixed format, ODBC, or XDF). Alternatively, a data source object representing the input data source can be specified. If a Spark compute context is being used, this argument may also be an RxHiveData, RxOrcData, RxParquetData or RxSparkDataFrame object or a Spark data frame object from pyspark.sql.DataFrame.

output_file

A character string representing the output ‘.xdf’ file or an RxXdfData object. If None, a data frame will be returned in memory. If a Spark compute context is being used, this argument may also be an RxHiveData, RxOrcData, RxParquetData or RxSparkDataFrame object.

vars_to_keep

List of strings of variable names to include when reading from the input data file. If None, argument is ignored. Cannot be used with vars_to_drop. Not supported for ODBC or fixed format text files.

vars_to_drop

List of strings of variable names to exclude when reading from the input data file. If None, argument is ignored. Cannot be used with vars_to_keep. Not supported for ODBC or fixed format text files.

row_selection

None. Not currently supported, reserved for future use.

transforms

None. Not currently supported, reserved for future use.

transform_objects

A dictionary of variables besides the data that are used in the transform function. See rx_data_step for examples.

transform_function

Name of the function that will be used to modify the data. The variables used in the transformation function must be specified in transform_objects. See rx_data_step for examples.

transform_variables

List of strings of the column names needed for the transform function.
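
As a rough illustration of how transform_function and transform_objects fit together, the sketch below defines a transform and exercises it locally. It assumes, following the pattern in the rx_data_step examples, that the function receives the data chunk as its first argument and the transform_objects entries as keyword arguments. The function name, the new column name and the sample ages are made up for illustration.

```python
def add_over_10yr(data, limit, new_col_name):
    """Add a boolean column flagging rows whose Age exceeds `limit` months."""
    # Column-style access works for a pandas DataFrame and for a dict of lists.
    data[new_col_name] = [age > limit for age in data["Age"]]
    return data


# Extra values the function needs, passed by name through transform_objects.
transform_objects = {"limit": 120, "new_col_name": "Over10Yr"}

# Local dry run on a plain dict of columns; revoscalepy is not needed here.
chunk = {"Age": [71, 158, 128, 2]}
result = add_over_10yr(chunk, **transform_objects)
print(result["Over10Yr"])  # [False, True, True, False]
```

With revoscalepy installed, the same pieces would presumably be passed as transform_function=add_over_10yr and transform_objects=transform_objects in the rx_import call.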

transform_packages

None. Not currently supported, reserved for future use.

append

Either “none” to create a new ‘.xdf’ file or “rows” to append rows to an existing ‘.xdf’ file. If output_file exists and append is “none”, the overwrite argument must be set to True. Ignored if a data frame is returned.

overwrite

Bool value. If True, the existing output_file will be overwritten. Ignored if a data frame is returned.
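
The interaction between append and overwrite can be summarised in a small decision helper. This is only a sketch of the rule described above, not revoscalepy code: check_output_policy is a hypothetical name, and the exception types are arbitrary choices.

```python
def check_output_policy(output_exists, append="none", overwrite=False):
    """Return the action implied by append/overwrite, or raise if the combination is rejected."""
    if append not in ("none", "rows"):
        raise ValueError("append must be 'none' or 'rows'")
    if append == "rows":
        # Rows are added to an existing file; the text does not say what
        # happens when the file is missing, so that case is not modelled.
        return "append"
    if output_exists and not overwrite:
        raise FileExistsError("output_file exists; set overwrite=True or append='rows'")
    return "overwrite" if output_exists else "create"


print(check_output_policy(output_exists=False))                 # create
print(check_output_policy(output_exists=True, overwrite=True))  # overwrite
print(check_output_policy(output_exists=True, append="rows"))   # append
```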

number_rows

Integer value specifying the maximum number of rows to import. If set to -1, all rows will be imported.

strings_as_factors

Bool value indicating whether or not to automatically convert strings to factors on import. This can be overridden by specifying “character” in column_classes and column_info. If True, the factor levels will be coded in the order encountered. Since this factor level ordering is row dependent, the preferred method for handling factor columns is to use column_info with specified “levels”.
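
To see why level ordering is row dependent, the following plain-Python sketch mimics coding levels in the order encountered and compares it with a fixed list of levels. It is an illustration only and does not call revoscalepy; the helper names and sample values are invented, and the 0-based codes are a convention of this sketch.

```python
def levels_in_order_encountered(values):
    """Return the distinct values in order of first appearance."""
    return list(dict.fromkeys(values))


def encode(values, levels):
    """Map each value to its 0-based index in `levels`; unmatched values become None."""
    index = {level: code for code, level in enumerate(levels)}
    return [index.get(v) for v in values]


rows_a = ["low", "high", "medium", "low"]
rows_b = ["medium", "low", "high", "low"]  # same data, different row order

# Levels derived from the data depend on which row comes first.
print(levels_in_order_encountered(rows_a))  # ['low', 'high', 'medium']
print(levels_in_order_encountered(rows_b))  # ['medium', 'low', 'high']

# With explicit levels, a given value always gets the same code.
fixed_levels = ["low", "medium", "high"]
print(encode(rows_a, fixed_levels))  # [0, 2, 1, 0]
print(encode(rows_b, fixed_levels))  # [1, 0, 2, 0]
```

Supplying explicit levels through column_info plays the role of fixed_levels here: the coding no longer depends on row order.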

column_classes

Dictionary mapping column names to strings specifying the column types to use when converting the data. The dictionary keys identify which column should be converted to which type.

Allowable column types are:

“bool”: Stored as uchar.

“integer”: Stored as int32.

“float32”: The default for floating point data for ‘.xdf’ files.

“numeric”: Stored as float64, as in R.

“character”: Stored as string.

“factor”: Stored as uint32.

“ordered”: Ordered factor stored as uint32. Ordered factors are treated the same as factors in RevoScaleR analysis functions.

“int16”: Alternative to integer for smaller storage space.

“uint16”: Alternative to unsigned integer for smaller storage space.

“Date”: Stored as Date, i.e. float64. Not supported for import types “textFast”, “fixedFast”, or “odbcFast”.

“POSIXct”: Stored as POSIXct, i.e. float64. Not supported for import types “textFast”, “fixedFast”, or “odbcFast”.

Note that for “factor” and “ordered” types, the levels will be coded in the order encountered. Since this factor level ordering is row dependent, the preferred method for handling factor columns is to use column_info with specified “levels”.

Note that equivalent types share the same bullet in the list above; for some types we allow both ‘R-friendly’ type names, as well as our own, more specific type names for ‘.xdf’ data.

Note also that specifying the column as a “factor” type is currently equivalent to “string” - for the moment, if you wish to import a column as factor data you must use the column_info argument, documented below.

column_info

List of named variable information lists. Each variable information list contains one or more of the named elements given below. When importing fixed format data, either column_info or an ‘.sts’ schema file should be supplied. For fixed format text files, only the variables specified will be imported. For all text types, the information supplied for column_info overrides that supplied for column_classes. Currently available properties for a column information list are:

type: Character string specifying the data type for the column. See
    column_classes argument description for the available types. If the
    type is not specified for fixed format data, it will be read as
    character data.

newName: Character string specifying a new name for the variable.

description: Character string specifying a description for the
    variable.

levels: List of strings containing the levels when type =
    “factor”. If the levels property is not provided, factor levels
    will be determined by the values in the source column. If levels
    are provided, any value that does not match a provided level will
    be converted to a missing value.

newLevels: New or replacement levels specified for a column of type
    “factor”. It must be used in conjunction with the levels argument.
    After reading in the original data, the labels for each level will
    be replaced with the newLevels.

low: The minimum data value in the variable (used in computations
    using the F() function.)

high: The maximum data value in the variable (used in computations
    using the F() function.)

start: The left-most position, in bytes, for the column of a fixed
    format file. When all elements of column_info have start,
    the text file is designated as a fixed format file. When none of
    the elements have it, the text file is designated as a delimited
    file. Specification of start must always be accompanied by
    specification of width.

width: The number of characters in a fixed-width character column
    or the column of a fixed format file. If width is specified for a
    character column, it will be imported as a fixed-width character
    variable. Any characters beyond the fixed width will be ignored.
    Specification of width is required for all columns of a fixed
    format file.

decimalPlaces: The number of decimal places.
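
The properties above might be combined as in the following sketch. It assumes column_info is passed as a dictionary keyed by column name, as the type annotation in the signature suggests. The column names, the new name and the byte positions are illustrative, and check_start_width is a hypothetical helper that only encodes the start/width rules stated above (it is not part of revoscalepy).

```python
# Delimited input: declare a factor with explicit levels and rename a column.
column_info = {
    "Kyphosis": {"type": "factor", "levels": ["absent", "present"]},
    "Age": {"type": "integer", "newName": "AgeMonths",
            "description": "Age in months"},
}

# Fixed format input: every column carries both start and width.
fixed_width_info = {
    "id": {"type": "integer", "start": 1, "width": 5},
    "name": {"type": "character", "start": 6, "width": 20},
}


def check_start_width(info):
    """Classify a column_info dict as 'fixed' or 'delimited' using the start/width rules."""
    has_start = ["start" in props for props in info.values()]
    if any(has_start) and not all(has_start):
        # The text only covers "all" and "none"; rejecting a mix is this sketch's choice.
        raise ValueError("either all columns or no columns should specify start")
    for name, props in info.items():
        if "start" in props and "width" not in props:
            raise ValueError(f"column {name!r} specifies start without width")
    return "fixed" if has_start and all(has_start) else "delimited"


print(check_start_width(column_info))       # delimited
print(check_start_width(fixed_width_info))  # fixed
```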

rows_per_read

Number of rows to read at a time.

type

Character string specifying the file type of input_data. This is ignored if input_data is a data source. Possible values are:

“auto”: File type is automatically detected by looking at file extensions and argument values.

“textFast”: Delimited text import using faster, more limited import mode. By default variables containing the values True and False or T and F will be created as bool variables.

“text”: Delimited text import using enhanced, slower import mode. This allows for importing Date and POSIXct data types, handling the delimiter character inside a quoted string, and specifying decimal character and thousands separator. (See RxTextData.)

“fixedFast”: Fixed format text import using faster, more limited import mode. You must specify a ‘.sts’ format file or column_info specifications with start and width for each variable.

“fixed”: Fixed format text import using enhanced, slower import mode. This allows for importing Date and POSIXct data types and specifying decimal character and thousands separator. You must specify a ‘.sts’ format file or column_info specifications with start and width for each variable.

“odbcFast”: ODBC import using faster, more limited import mode.

“odbc”: ODBC import using slower, enhanced import on Windows. (See RxOdbcData.)

max_rows_by_columns

The maximum size of a data frame that will be read in if output_file is set to None, measured by the number of rows times the number of columns. If the number of rows times the number of columns being imported exceeds this, a warning will be reported and a smaller number of rows will be read in than requested. If max_rows_by_columns is set to be too large, you may experience problems from loading a huge data frame into memory.
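
The limit is a simple product of rows and columns, which the sketch below makes concrete. The helper name and the numbers are arbitrary illustrations, not defaults taken from revoscalepy.

```python
def exceeds_cell_limit(n_rows, n_cols, max_rows_by_columns):
    """Return True if a data frame of this shape has more cells than the limit allows."""
    return n_rows * n_cols > max_rows_by_columns


limit = 3_000_000  # illustrative value only

print(exceeds_cell_limit(1_000_000, 30, limit))  # True: 30,000,000 cells
print(exceeds_cell_limit(81, 4, limit))          # False: 324 cells
```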

report_progress

Integer value with options:

0: No progress is reported.

1: The number of processed rows is printed and updated.

2: Rows processed and timings are reported.

3: Rows processed and all timings are reported.

verbose

Integer value. If 0, no additional output is printed. If 1, information on the import type is printed if type is set to auto.

xdf_compression_level

Integer in the range of -1 to 9. The higher the value, the greater the compression, resulting in smaller files but a longer time to create them. If xdf_compression_level is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.

create_composite_set

Bool value or None. If True, a composite set of files will be created instead of a single ‘.xdf’ file. A directory will be created whose name is the same as the ‘.xdf’ file that would otherwise be created, but with no extension. Subdirectories ‘data’ and ‘metadata’ will be created. In the ‘data’ subdirectory, the data will be split across a set of ‘.xdfd’ files (see blocks_per_composite_file below for determining how many blocks of data will be in each file). In the ‘metadata’ subdirectory there is a single ‘.xdfm’ file, which contains the metadata for all of the ‘.xdfd’ files in the ‘data’ subdirectory.

blocks_per_composite_file

Integer value. If create_composite_set=True, this will be the number of blocks put into each ‘.xdfd’ file in the composite set. If the output_file is an RxXdfData object, set the value for blocks_per_composite_file there instead.
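
The directory layout described for composite sets can be derived from the output file name, as in this sketch. composite_set_layout is a hypothetical helper and the path is made up. The names of the individual ‘.xdfd’ and ‘.xdfm’ files are not specified above, so they are left out.

```python
from pathlib import PurePosixPath


def composite_set_layout(xdf_path):
    """Return the directories a composite set would use for the given '.xdf' path."""
    base = PurePosixPath(xdf_path).with_suffix("")  # same name, no extension
    return {
        "root": str(base),
        "data": str(base / "data"),          # holds the '.xdfd' files
        "metadata": str(base / "metadata"),  # holds the single '.xdfm' file
    }


layout = composite_set_layout("output/airline.xdf")
print(layout["root"])      # output/airline
print(layout["data"])      # output/airline/data
print(layout["metadata"])  # output/airline/metadata
```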

kwargs

Additional arguments to be passed directly to the underlying data source objects to be imported.

Returns

If an output_file is not specified, an output data frame is returned. If an output_file is specified, an RxXdfData data source is returned that can be used in subsequent revoscalepy analysis.

Example

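import os
from revoscalepy import rx_import, RxOptions, RxXdfData

# Locate the sample data shipped with revoscalepy.
sample_data_path = RxOptions.get_option("sampleDataDir")
ds = RxXdfData(os.path.join(sample_data_path, "kyphosis.xdf"))

# No output_file is given, so the data is returned as an in-memory data frame.
kyphosis = rx_import(input_data=ds)
kyphosis.head()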
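
A variant of the example that writes to disk instead of returning a data frame might look like the sketch below. It uses only arguments listed in the signature above. The wrapper function and the output path are hypothetical, and revoscalepy is imported inside the function so the definition loads on machines where the package is not installed.

```python
import os


def import_kyphosis_to_xdf(output_path):
    """Copy the kyphosis sample data into a new '.xdf' file at output_path."""
    # Imported lazily: revoscalepy is only needed when the function is called.
    from revoscalepy import rx_import, RxOptions, RxXdfData

    sample_data_path = RxOptions.get_option("sampleDataDir")
    ds = RxXdfData(os.path.join(sample_data_path, "kyphosis.xdf"))

    # overwrite=True replaces output_path if it already exists.
    return rx_import(input_data=ds, output_file=output_path, overwrite=True)
```

Calling import_kyphosis_to_xdf("kyphosis_copy.xdf") should, per the Returns section, give back an RxXdfData data source rather than a data frame.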