Sunday, February 19, 2012

Reading huge files into R

SAS is much touted for its ability to read in huge datasets, and rightly so. That ability comes at a cost, however: because SAS works with files on disk rather than loading them into memory (as Stata and R do), it can be slower on smaller datasets.

If you don't want to learn or buy SAS but you have some large files you need to cut down to size (rather like the gorilla in the corner), R has several packages that can help. In particular, sqldf (read.csv.sql) and ff (read.csv.ffdf) both have functions for reading in large CSV files. More advice is available here: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
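If you would rather not install anything, the same "cut it down to size" idea can be roughed out in base R by reading the CSV a chunk at a time and keeping only the rows you care about. This is just a sketch, not how sqldf or ff work internally; the function name filter_csv_in_chunks and its arguments keep and chunk_size are made up for illustration, and it assumes a plain comma-separated file with a header row.

```r
# Read a large CSV in fixed-size chunks, keeping only rows that pass a filter.
# Base R only. `filter_csv_in_chunks`, `keep`, and `chunk_size` are
# hypothetical names chosen for this sketch.
filter_csv_in_chunks <- function(path, keep, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names.
  header_line <- readLines(con, n = 1)
  col_names <- scan(text = header_line, what = "", sep = ",", quiet = TRUE)

  pieces <- list()
  repeat {
    # Because the connection is already open, each call continues where the
    # previous one stopped. read.csv raises an error once no lines remain,
    # which is treated here as end of file (this also hides other read errors).
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size,
               col.names = col_names, stringsAsFactors = FALSE),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0) break

    # `keep` takes a data frame and returns a logical vector, one per row.
    pieces[[length(pieces) + 1]] <- chunk[keep(chunk), , drop = FALSE]

    # A short chunk means the end of the file has been reached.
    if (nrow(chunk) < chunk_size) break
  }

  # Returns NULL if the file had no data rows.
  do.call(rbind, pieces)
}
```

For example, filter_csv_in_chunks("big.csv", function(d) d$year == 2011) would return only the 2011 rows, assuming a (hypothetical) file big.csv with a year column. Column types are guessed separately for each chunk, so it is worth checking the combined result before relying on it.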

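If you're a Stata person, you can often get by reading the .dta file in chunks within a loop: add a range such as "in 1/1000" to the use command (e.g. "use in 1/1000 using myfile.dta", where myfile.dta stands in for your own file) to read in only the 1st through 1000th observations, then move the range along on each pass.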

1 comment:

  1. I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks.
