Exporting large data using Microsoft R (IDE: RTVS)

Introduction

Very often in our projects we need to export a huge amount of data (several GB), and the conventional solution, write.csv, can test anyone’s patience with the time it demands.

In this post, we will learn by doing. We make use of a package that is not widely known but serves the purpose really well.

 

Package Feather

In the words of the Revolution Analytics blog, the feather package is:
“A collaboration of Wes McKinney and Hadley Wickham, to create a standard data file format that can be used for data exchange by and between R, Python and any other software that implements its open-source format.”
When we export data with feather, it is stored in a binary file, which makes it less bulky (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file). There is no need to convert back and forth between numbers and text, which speeds up both reading and writing. Additionally, feather is a column-oriented file format, which matches R’s internal representation of data.
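
The size argument above can be checked with base R alone. The following is a minimal sketch, not part of the feather package: it only shows how many bytes one 10-digit integer occupies in binary form versus as text. The variable names are illustrative.

```r
# Compare the storage cost of one 10-digit integer as binary vs. as text.
value <- 1234567890L                           # fits in R's 32-bit integer type

binary_bytes <- writeBin(value, raw())         # bytes as they would sit in a binary file
text_bytes   <- charToRaw(as.character(value)) # ASCII bytes as they would sit in a CSV file

length(binary_bytes)  # 4
length(text_bytes)    # 10

# Reading the binary form back requires no text parsing.
restored_value <- readBin(binary_bytes, what = "integer")
```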

Code

With the primary goal of reducing export time in R, I created a random dataset of 25,000,000 rows and 3 columns and exported it with two methods, write_feather and write.csv, to compare the time each takes to write the data in binary (feather) or CSV format.

Here’s the sample code I used:

####################################

# Install the packages once; skip these lines if they are already installed.
install.packages("data.table")
install.packages("stringi")
install.packages("feather")

library(feather)
library(data.table)
library(stringi)

num <- 10000      # number of distinct random strings
size <- 25000000  # total number of rows
path0 <- "D:\\dataset101.feather"
path1 <- "D:\\dataset102.csv"

####### Generating Random Dataset ##########
# col1: random 10-character strings, col2: integer group id, col3: random normal values
dataset <- data.table(col1 = rep(stri_rand_strings(num, 10), size / num),
                      col2 = rep(1:(size / num), each = num),
                      col3 = rnorm(size))

####### Comparing Methods to Export #########

#1 Using 'feather'
print(system.time(write_feather(dataset, path0)))

#2 Using 'write.csv'
print(system.time(write.csv(dataset, path1)))

#####################################

Output (times in seconds):
#1 Using 'feather'
   user  system elapsed
   1.86    1.21    6.11

#2 Using 'write.csv'
   user  system elapsed
 437.80    6.64  452.89
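
The benchmark above measures only the export. As a rough sketch of the import side, the snippet below writes a much smaller data frame, reads it back with read_feather, and checks that the values survive the round trip. The row count, column contents, and temporary file paths are assumptions chosen so the example runs quickly; timings will differ from the figures above.

```r
# Scaled-down round trip: write with feather, read back, compare.
library(feather)

set.seed(1)
n <- 100000  # assumption: far smaller than the 25,000,000 rows used above
small <- data.frame(
  col1 = sample(letters, n, replace = TRUE),
  col2 = seq_len(n),
  col3 = rnorm(n),
  stringsAsFactors = FALSE
)

feather_path <- tempfile(fileext = ".feather")  # hypothetical temp locations
csv_path     <- tempfile(fileext = ".csv")

feather_write <- system.time(write_feather(small, feather_path))[["elapsed"]]
csv_write     <- system.time(write.csv(small, csv_path))[["elapsed"]]

# read_feather returns a tibble; convert it to a plain data frame to compare.
feather_read <- system.time(
  restored <- as.data.frame(read_feather(feather_path))
)[["elapsed"]]

round_trip_ok <- isTRUE(all.equal(small, restored, check.attributes = FALSE))

print(c(feather_write = feather_write, csv_write = csv_write, feather_read = feather_read))
print(round_trip_ok)
```

Because feather stores numbers in binary, col3 should come back bit-for-bit, whereas a CSV round trip depends on how many digits are written.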

 

Conclusion
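In this test, feather exported the 25,000,000-row dataset in about 6 seconds of elapsed time, against about 453 seconds for write.csv, roughly 74 times faster. This makes the feather package one of the most efficient options for exporting large datasets from R, and the same binary format can be read back into R or Python.

In the next post we will look at a few other options for the same task and compare them with feather.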

The package bigmemory also works well with R, but it comes with a limitation: it can import or export datasets of only one data type. It was designed to work on matrices, and a matrix in R can hold only one type of data.
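
The single-type restriction of matrices can be seen in base R, without bigmemory. This is a small illustrative sketch with made-up column names: a data frame keeps one type per column, but converting it to a matrix coerces every value to a common type.

```r
# A data frame can mix column types.
mixed <- data.frame(
  id    = 1:3,
  label = c("a", "b", "c"),
  score = c(0.5, 1.5, 2.5),
  stringsAsFactors = FALSE
)
column_types <- sapply(mixed, class)  # integer, character, numeric

# A matrix cannot: with a character column present, everything becomes character.
as_matrix <- as.matrix(mixed)
typeof(as_matrix)     # "character"

# With only numeric columns, the matrix stays numeric.
numeric_only <- as.matrix(mixed[, c("id", "score")])
typeof(numeric_only)  # "double"
```

So a dataset like the one used in this post, with a string column next to numeric columns, would first have to be reduced to a single type before it fits in a matrix.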

For more information on types of data structures in R, please refer to this link.

Blog Author
Prashant Babber,
ASSC Consultant, Data Insights, MACH,
IGD

Source: https://blog.revolutionanalytics.com/2016/05/feather-package.html

Comments

  • Anonymous
    February 17, 2017
    Very useful! Thanks for this blog
  • Anonymous
    February 19, 2017
    Excellent reference on large data set handling...thanks for sharing
  • Anonymous
    February 20, 2017
    This is a nice article to take care of the data management. Can we extend this to take care of much higher data flow?
  • Anonymous
    February 20, 2017
    You may want to take a look at this comparison: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/