Exporting large data using Microsoft R (IDE: RTVS)
Introduction
Very often in our projects we need to export huge amounts of data (several GB), and the conventional solution, write.csv, can test anyone’s patience with the time it takes.
In this blog, we will learn by doing. We make use of a package that is not very popular, but serves the purpose really well.
Package Feather
In the words of the Revolution Analytics blog, the feather package is
“A collaboration of Wes McKinney and Hadley Wickham, to create a standard data file format that can be used for data exchange by and between R, Python and any other software that implements its open-source format.”
When we export data with feather, it is stored in a binary file, which makes it less bulky (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file). There is no need to convert back and forth between numbers and text, which makes reading and writing faster. Additionally, feather is a column-oriented file format, which matches R’s internal representation of data.
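The size difference between binary and text storage can be checked in base R. The following is a minimal sketch, not part of the original benchmark; the variable names are illustrative, and the value is chosen so that it fits in R's 32-bit integer type.

```r
# A 10-digit value that still fits in R's 32-bit integer type
x <- 1234567890L

# Size as raw binary: writeBin() with a raw() target returns the bytes written
binary_bytes <- length(writeBin(x, raw()))

# Size as text: one ASCII character per digit
text_chars <- nchar(as.character(x))

cat("binary:", binary_bytes, "bytes; text:", text_chars, "characters\n")
```

Here the integer occupies 4 bytes in binary form and 10 characters as text, before any separators or line endings that a CSV file would add.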
Code
To see how much export time can be saved in R, I created a random dataset of 25,000,000 rows and 3 columns and exported it with two methods, write_feather (binary feather file) and write.csv (CSV file), to compare the time each one takes.
Here’s the sample code I used:
####################################
# Install the required packages (only needed once)
install.packages("data.table")
install.packages("stringi")
install.packages("feather")
library(feather)
library(data.table)
library(stringi)
num <- 10000      # number of distinct random strings
size <- 25000000  # total number of rows
path0 <- "D:\\dataset101.feather"
path1 <- "D:\\dataset102.csv"
#######Generating Random DataSet##########
# col1: random 10-character strings, col2: integer group id, col3: random doubles
dataset <- data.table(col1 = rep(stri_rand_strings(num, 10), size / num),
                      col2 = rep(1:(size / num), each = num),
                      col3 = rnorm(size))
#######Comparing Methods to Export#########
#1 Using 'feather'
print(system.time(write_feather(dataset, path0)))
#2 Using 'write.csv'
print(system.time(write.csv(dataset, path1)))
#####################################
Output:
#1 Using 'FEATHER'
user system elapsed
1.86 1.21 6.11
#2 Using 'write.csv'
user system elapsed
437.80 6.64 452.89
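Before committing to a 25,000,000-row run, the comparison can be tried on a smaller scale. The sketch below is only an illustration under a few assumptions: the helper name time_export, the 100,000-row size, and the use of temporary files are my own choices rather than part of the benchmark above, and the feather step runs only if the package is installed.

```r
# Illustrative helper (hypothetical name): time a single export call in seconds
time_export <- function(writer, data, path) {
  system.time(writer(data, path))[["elapsed"]]
}

set.seed(1)
n <- 100000L
small <- data.frame(
  col1 = sample(letters, n, replace = TRUE),
  col2 = rep(seq_len(n / 100L), each = 100L),
  col3 = rnorm(n),
  stringsAsFactors = FALSE
)

# Export to CSV with base R
csv_path <- tempfile(fileext = ".csv")
csv_time <- time_export(function(d, p) write.csv(d, p, row.names = FALSE),
                        small, csv_path)
cat("write.csv elapsed:", csv_time, "seconds\n")

# Export to feather, then read the file back to confirm the row count
if (requireNamespace("feather", quietly = TRUE)) {
  feather_path <- tempfile(fileext = ".feather")
  feather_time <- time_export(feather::write_feather, small, feather_path)
  restored <- feather::read_feather(feather_path)
  cat("write_feather elapsed:", feather_time, "seconds\n")
  cat("rows read back:", nrow(restored), "\n")
}
```

Timings at this size will be much smaller than the figures above and will vary by machine, so they are best treated as a sanity check rather than a benchmark.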
Conclusion
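In this test, feather exported the 25,000,000-row dataset in about 6 seconds, against roughly 453 seconds for write.csv. That makes the feather package one of the most efficient ways to export large datasets from R, and the resulting file can be read back with read_feather.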
In the next blog we will look at a few other options to do the same, and compare them with Package Feather.
The package bigmemory also works well with R, but it comes with a limitation: it can import and export datasets of only one data type. It is designed to work on matrices, and a matrix in R can hold only one type of data.
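The single-type constraint on matrices can be seen in base R. This is a small sketch with made-up values; it does not call bigmemory itself and only illustrates why mixed-type tables do not fit a matrix.

```r
# Combining an integer column and a character column into a matrix
# coerces everything to one common type (character)
m <- cbind(id = 1:3, name = c("a", "b", "c"))
print(typeof(m))

# A data frame keeps a separate type for each column
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
col_types <- sapply(df, class)
print(col_types)
```

In the matrix, the numeric ids are stored as strings, whereas the data frame preserves the integer column alongside the character column.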
For more information on types of data structures in R, please refer to this link.
Blog Author
Prashant Babber,
ASSC Consultant, Data Insights, MACH,
IGD
Source: https://blog.revolutionanalytics.com/2016/05/feather-package.html
Comments
- Anonymous
February 17, 2017
Very useful! Thanks for this blog
- Anonymous
February 19, 2017
Excellent reference on large data set handling...thanks for sharing
- Anonymous
February 20, 2017
This is a nice article to take care of the data management. Can we extend this to take care of much higher data flow?
- Anonymous
February 20, 2017
You may want to take a look at this comparison: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/