Saturday, March 21, 2009

Reading Large Datasets in R: filehash

In theory, the package 'filehash' lets R handle a large dataset by keeping it on the hard disk instead of loading it into RAM. I tested the package with a 1 GB Stata-format dataset, and it did not work well. Anyway, here is how to use it:

(1) Install 'filehash'

> install.packages('filehash')
> library(filehash)

(2) Set an environment for the large dataset you'd like to use

> dumpDF(read.csv("largedata.csv"), dbName="dbname")
> envname <- db2env(db="dbname")
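
To see what step (2) produces, here is a minimal sketch on a small made-up data frame. The names toy, db_path and toy_env are my own placeholders, and the database is written to a temporary file. dumpDF() stores each column under its own key, so the environment returned by db2env() should contain one object per column. Keep in mind that read.csv() in the step above still reads the whole file into memory before dumpDF() writes it to disk.

```r
library(filehash)  # assumes the package is already installed

# Hypothetical toy data standing in for largedata.csv
toy <- data.frame(x = 1:10, y = (1:10) * 2 + 0.5)

# Write the data frame to a disk-backed database in a temporary location
db_path <- tempfile("toydb")
dumpDF(toy, dbName = db_path)

# Attach the database as an environment; values are read from disk on access
toy_env <- db2env(db = db_path)

# One key per column of the data frame
cols <- sort(ls(toy_env))
print(cols)
```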

(3) Analyze with the environment

> with(envname, lm(y~x))

* envname and dbname can be any names you like.
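
As a rough check of step (3), the sketch below fits the same regression twice on simulated data: once through the filehash environment and once on the ordinary in-memory data frame. The data and the names sim, sim_env, fit_disk and fit_mem are assumptions for illustration only. If everything works, the two sets of coefficients should agree.

```r
library(filehash)  # assumes the package is already installed

# Simulated data (hypothetical): y depends linearly on x plus noise
set.seed(1)
sim <- data.frame(x = rnorm(100))
sim$y <- 1 + 2 * sim$x + rnorm(100)

# Store the data frame on disk and expose it as an environment
db_path <- tempfile("simdb")
dumpDF(sim, dbName = db_path)
sim_env <- db2env(db = db_path)

# Fit through the disk-backed environment, as in step (3)
fit_disk <- with(sim_env, lm(y ~ x))

# Fit on the in-memory data frame for comparison
fit_mem <- lm(y ~ x, data = sim)

print(coef(fit_disk))
print(all.equal(coef(fit_disk), coef(fit_mem)))
```

This only confirms that the results match on a small example; it says nothing about how the approach behaves on a dataset that does not fit in RAM.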

See also: the filehash manual; how-to by Yu-Sung Su
