Using Microsoft R Server on a Single Machine for Experiments With 600M Taxi Rides
Re-posted from R-bloggers.
The New York City taxi dataset is one of the largest publicly available datasets, with information about 1.1 billion NYC taxi rides. It has been explored and visualized in a number of blog posts using a variety of techniques and technologies (e.g., PostgreSQL, Elasticsearch). A recent blog post showed how to build ML models over one year's worth of this dataset using Microsoft R Server (MRS) running on a 4-node Hadoop cluster.
In a new blog post, Microsoft Data Scientist Dmitry Pechyoni shows how to build a binary classification model that predicts whether a passenger will pay a tip. Dmitry used Microsoft R Server (MRS) to drive the entire process of building and evaluating machine learning models over hundreds of millions of examples on a single commodity machine. His end-to-end process, which includes downloading and cleaning 4 years' worth of data, takes about 12 hours.
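To make the idea concrete, here is a minimal Python sketch of training a binary "will tip" classifier by streaming rows in fixed-size chunks, so that only one chunk is held in memory at a time. It is not Dmitry's code and does not use the MRS API. The data is synthetic, and the features (paid_by_card, distance), the chunk size, and the learning rate are all assumptions made up for illustration.

```python
import math
import random


def make_rides(n, seed=0):
    """Yield synthetic (features, tipped) rows.

    Hypothetical stand-in for real ride records: the tipping rule below
    is invented for this sketch and is not taken from the taxi dataset.
    """
    rng = random.Random(seed)
    for _ in range(n):
        paid_by_card = rng.random() < 0.5
        distance = rng.uniform(0.5, 10.0)
        # Assumed rule: card payers usually tip, cash payers rarely do.
        p_tip = 0.9 if paid_by_card else 0.1
        tipped = 1 if rng.random() < p_tip else 0
        # Features: bias term, card flag, distance scaled to roughly [0, 1].
        yield [1.0, float(paid_by_card), distance / 10.0], tipped


def chunks(rows, size):
    """Group an iterator of rows into lists of at most `size` rows."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Return the estimated probability that the ride ends with a tip."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(rows, n_features, chunk_size=1000, lr=0.1):
    """Logistic regression fitted by stochastic gradient descent,
    consuming the data one chunk at a time."""
    weights = [0.0] * n_features
    for chunk in chunks(rows, chunk_size):
        for x, y in chunk:
            error = y - predict(weights, x)
            for j in range(n_features):
                weights[j] += lr * error * x[j]
    return weights


weights = train(make_rides(20000), n_features=3)
card_ride = [1.0, 1.0, 0.5]  # paid by card, mid-range distance
cash_ride = [1.0, 0.0, 0.5]  # paid in cash, same distance
print(predict(weights, card_ride) > 0.5, predict(weights, cash_ride) > 0.5)
```

Because the generator never materializes the full dataset, memory use is bounded by the chunk size rather than by the number of rows, which is the general property that makes single-machine training over very large datasets feasible.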
This example uses the Data Science Virtual Machine in the Azure cloud. The VM runs Windows and has a Standard A4 configuration: 8 cores and 14 GB of memory. It comes with a 126 GB hard disk, which is not enough to store the NYC taxi dataset, so Dmitry attached a 1 TB hard disk to the VM to store the dataset locally.
In the future, the team may develop models with larger feature sets, which could predict tips even more accurately before the actual payment is made.
ML Blog Team