How to Make Python Easier for the R User: revoscalepy

Python for the R User

I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.

Often, this means that R coders develop a workflow in R but must then redesign and recode it in Python for their production systems. If the coder is lucky, this is easy: the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, it is often more challenging, because the production system might demand that the entire end-to-end workflow be built exclusively in Python. That can be tough because some aspects of statistical model building are more intuitive in R than in Python.

Python has many strengths, such as its robust data structures (like Dictionaries), compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, for which the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap for when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.

Why revoscalepy?

Revoscalepy is the Python implementation of the R library RevoScaleR.

The methods in ‘revoscalepy’ are the same, and more importantly, the way the R user can view data is the same. The reason this is so important is that for an R programmer, being able to understand the data shape and structure is one of the challenges with getting used to Python.

In Python, data types are different, preprocessing the data is different, and the criteria for feeding the processed dataset into a model are different.

To understand how revoscalepy eases the transition from R to Python, the following section will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practices for native Python users.

Revoscalepy Versus Sklearn

Dependencies of revoscalepy

Revoscalepy works on Python 3.5, and can be downloaded as a part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages.

Data Import

The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages. Pandas has some R roots in that it has its own implementation of DataFrames as well as methods that resemble R’s exploratory methods.

import revoscalepy as rp
import numpy as np
import pandas as pd
import sklearn as sk
# Load the Kaggle Titanic data and preview the first rows
titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()
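For R users, a few pandas calls cover most of what dim(), str(), summary(), and table() provide. The sketch below uses a small made-up DataFrame as a stand-in for the Titanic data (the values are illustrative, not taken from the real file), and the R equivalents in the comments are approximate.

```python
import pandas as pd

# Small stand-in for the Titanic data; the values are made up for illustration.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

print(df.shape)                  # roughly dim(df) in R
print(df.dtypes)                 # column types, part of what str(df) reports
print(df.describe())             # numeric summaries, similar to summary(df)
print(df.isna().sum())           # missing values per column, like colSums(is.na(df))
print(df["Sex"].value_counts())  # frequency table, like table(df$Sex)

n_rows, n_cols = df.shape
missing_age = int(df["Age"].isna().sum())
```

Checking dtypes and missing values early is useful here, because both decide how much preprocessing the models in the next sections need.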

Preprocessing

sklearn

One of the challenges as an R user with using sklearn is that the decision tree model for sklearn can only handle the numeric datatype. Pandas has a categorical type that looks like factors in R, but sklearn’s decision tree does not integrate with this. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn’s decision tree understands.

The side effect of having to one-hot encode variables is that if the dataset contains high cardinality features, it can be memory intensive and computationally expensive because each category becomes its own binary column. While implementing one-hot encoding itself is not a difficult transformation in Python and provides good results, it is still an extra step for an R programmer to have to manually implement. The following chunk of code detaches the categorical columns, label and one-hot encodes them, and then reattaches the encoded columns to the rest of the dataset.

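# Import the submodules explicitly; 'import sklearn' alone does not guarantee they are loaded
from sklearn import preprocessing, tree
le = preprocessing.LabelEncoder()
# Detach the string columns and drop the free-text ones
x = titanic_data.select_dtypes(include=[object])
x = x.drop(['Name', 'Ticket', 'Cabin'], axis = 1)
# Treat Pclass as a categorical column as well
x = pd.concat([x, titanic_data['Pclass']], axis = 1)
x['Pclass'] = x['Pclass'].astype('object')
x = x.fillna('Missing')
# Label encode each column, then one-hot encode the integer codes
x_cats = x.apply(le.fit_transform)
enc = preprocessing.OneHotEncoder()
enc.fit(x_cats)
onehotlabels = enc.transform(x_cats).toarray()
# Reattach the encoded columns to the numeric columns
encoded_titanic_data = pd.concat(
     [titanic_data.select_dtypes(include=[np.number]),
      pd.DataFrame(onehotlabels)], axis = 1)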

At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if a decision tree is visualized, it will be difficult to understand without going through the extra step of renaming these columns. There are techniques in Python to help with this, but it is still an extra step that must be considered.
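One technique that keeps semantic names is pandas' own get_dummies, which prefixes each indicator column with the original column name. The following is a minimal sketch on a made-up three-column frame (not the real Titanic file); it is an alternative to the encoder-based approach, not a drop-in replacement for every workflow.

```python
import pandas as pd

# Made-up sample with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

cat_cols = ["Sex", "Embarked"]

# Fill missing categories with an explicit label, as in the post's workflow.
filled = df[cat_cols].fillna("Missing")

# Each category becomes its own indicator column named <column>_<category>.
dummies = pd.get_dummies(filled)

# Reattach the indicator columns to the remaining (numeric) columns.
encoded = pd.concat([df.drop(columns=cat_cols), dummies], axis=1)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

The column count still grows with the number of categories, so the memory concern for high cardinality features applies here too; only the naming problem goes away.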

revoscalepy

Unlike sklearn, revoscalepy reads pandas’ ‘category’ type like factors in R. This section of code finds the string columns in the DataFrame and converts them to ‘category’. Pandas categories are unordered by default, which corresponds to unordered factors in R. Older pandas versions accepted an ‘ordered’ keyword in astype() to make this explicit, but newer releases have removed it, so it is omitted here.

titanic_data_object_types = titanic_data.select_dtypes(include = ['object'])
titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
for column in titanic_data_object_types_columns:
     titanic_data[column] = titanic_data[column].astype('category')
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category')

The dataset is now ready to be fed into the revoscalepy model.
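To see why the ‘category’ type feels familiar to an R user, it helps to inspect one directly. This is a small pandas-only sketch on made-up values; it shows the pandas side of the comparison and makes no claim about how revoscalepy consumes the type internally.

```python
import pandas as pd

# Made-up values standing in for the Sex column.
sex = pd.Series(["male", "female", "female", "male"]).astype("category")

levels = list(sex.cat.categories)  # like levels(f) in R
codes = sex.cat.codes.tolist()     # integer codes; zero-based, whereas as.integer(f) in R is one-based
is_ordered = sex.cat.ordered       # False by default, comparable to an unordered factor

print(levels, codes, is_ordered)
```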

Training Models

sklearn

One difference between implementing a model in R and in sklearn in Python is that sklearn does not use formulas.

Formulas are important and useful for modeling because they provide a consistent framework to develop models with varying degrees of complexity. With formulas, users can easily apply different types of variable cases, such as ‘+’ for separate independent variables, ‘:’ for interaction terms, and ‘*’ to include both the variables and their interaction terms, along with many other convenient calculations. Within a formula, users can implement mathematical calculations, create factors, and include more complex entities like third-order interactions. In addition, with formulas, programmers can build extraordinarily complex models such as mixed-effects models, which are next to impossible to build without them. In Python, there are packages such as ‘statsmodels’ which have more intuitive ways to build certain statistical models. However, statsmodels has a limited selection of models, and does not include tree-based models.

With sklearn, model.fit expects the independent and dependent terms to be columns from the DataFrame. Interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:

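model = tree.DecisionTreeClassifier(max_depth = 50)
x = encoded_titanic_data.drop(['Survived'], axis = 1)
# The one-hot columns are named by integers; recent sklearn versions expect string column names
x.columns = x.columns.astype(str)
x = x.fillna(-1)
y = encoded_titanic_data['Survived']
model = model.fit(x, y)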
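The manual interaction step mentioned above can be sketched in plain pandas. The example below uses made-up values and illustrative column names; it builds the columns that an R formula such as ‘~ Age * Fare’ would generate, plus a numeric-by-categorical interaction. The categorical part uses full dummy coding, which is an assumption for simplicity: R would typically drop a reference level depending on the other terms in the formula.

```python
import pandas as pd

# Made-up sample; the column names mirror the Titanic data for readability.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# R's '~ Age * Fare' expands to Age + Fare + Age:Fare.
features = df[["Age", "Fare"]].copy()
features["Age:Fare"] = df["Age"] * df["Fare"]

# A numeric-by-categorical interaction: one product column per category.
# (Full dummy coding here; R may drop a reference level.)
sex_dummies = pd.get_dummies(df["Sex"], prefix="Sex").astype(float)
for col in sex_dummies.columns:
    features["Age:" + col] = df["Age"] * sex_dummies[col]

print(features)
```

The resulting frame can be passed to model.fit like any other set of columns; the cost is that every interaction has to be named and maintained by hand.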

revoscalepy

Revoscalepy brings back formulas. Granted, users cannot view the formula the same way as they can in R, because formulas are strings in Python. However, porting code from R to Python is an easy transition because formulas are read the same way in the revoscalepy functions as in the model-fit functions in R. The code below fits the decision tree in revoscalepy:

form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth = 50)

The resulting object, titanic_data_tree, is the same structural object that rxDTree() would create in R. rxDTree() integrates with the rpart library in R, meaning rx_dtree() indirectly also has this integration if the user wants to export the object back into R for further exploratory analysis around this decision tree.

Conclusion

From the workflow, it should be clear how revoscalepy can help with transliteration between R and Python. Sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage of revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the ‘Pythonic way’ of implementing their R code. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments of the function are similar to their R equivalents. Looking at the big picture, revoscalepy is one way to ease Python for the R user, and future posts will cover more ways to transition between R and Python.

Comments

  • Anonymous
    November 21, 2017
    Great initiative by Microsoft for developing the revoscalepy package. Thanks!