Monday, February 20, 2012

Stata - loops for big data

A user requested the following:
"I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks."

There are several ways to read sets of data files into Stata using a loop.  Following on Ari's example (see previous post), let's say you have a file with a million observations, which is too large for Stata, and you want to read in a thousand observations at a time, do some stuff to each chunk to make it smaller, then append the smaller data sets together to create your final single analytic file.  Here is one way:

*** Loop will start at 1000, then increment by 1000
***     until it gets to one million
forvalues high = 1000(1000)1000000 {
    local low = `high' - 999               // first observation of this chunk
    use in `low'/`high' using datafile.dta, clear

    * <insert code to cut down size of file>

    *** Now add this chunk to the temporary file
    if `high' == 1000 {
        save temp, replace         // only first time through the loop
    }
    else {
        append using temp          // bring in the chunks saved so far
        save temp, replace
    }
}

*** After the last pass, memory holds every reduced chunk
save finalfile.dta, replace
erase temp.dta
*** You can also use a tempfile
***  and avoid the extra erase statement
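
The tempfile variant mentioned in the comments above might look like the following minimal sketch. The toy dataset (10,000 observations with the hypothetical variables id and x), the macro names big and running, and the keep step are all assumptions made so the sketch runs on its own; swap in your real file and your real reduction code. Run it as a single do-file, since tempfiles and local macros disappear when the do-file ends.

```stata
* Assumed toy data so the sketch is self-contained (not the real file)
clear
set obs 10000
generate id = _n
generate x = mod(id, 2)
tempfile big
save "`big'"

* Temporary file that accumulates the reduced chunks;
* Stata deletes it automatically when the do-file finishes
tempfile running

forvalues high = 1000(1000)10000 {
    local low = `high' - 999
    use in `low'/`high' using "`big'", clear

    keep if x == 0              // placeholder for the real reduction step

    if `high' > 1000 {
        append using "`running'"    // add the chunks kept so far
    }
    save "`running'", replace
}

save finalfile.dta, replace     // no erase needed for the tempfile
```

Each pass appends the earlier chunks after the current one, so the final file is not in the original order; sort on an identifier afterwards if the order matters.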

Another way is to use the 'if' qualifier.  Let's say you have a large dataset but only want to look at the females in it:

use if gender=="female" using datafile.dta

You could also put this into a loop to get certain cuts of the data.  Again, using the gender example:

local sex male female
foreach s of local sex {
    use if gender == "`s'" using datafile.dta, clear
    ** create two data files
    ** male_newfile.dta and then female_newfile.dta
    save `s'_newfile.dta, replace
}

 Back to my bananas...

primary data primate


Sunday, February 19, 2012

Reading huge files into R

SAS is much touted for its ability to read in huge datasets, and rightly so. However, that ability comes at a cost: SAS works with files on disk rather than in memory (Stata and R hold the data in memory), so for smaller datasets it is potentially slower.

If you don't want to learn/buy SAS but you have some large files you need to cut down to size (rather like the gorilla in the corner), R has several packages which can help. In particular, sqldf and ff both have methods to read in large CSV files. More advice is available here:

If you're a Stata person, you can often get by reading the .dta file in chunks within a loop by adding an "in" range to the use command, e.g. "use in 1/1000 using datafile.dta" if you want to read in the 1st through 1000th observations.
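
As a rough sketch of that idea, the observation count can be read from the file header with describe using, so the loop bounds need not be hard-coded. The toy file, the variable names id and x, the macro names, and the chunk size of 1000 are assumptions for illustration only.

```stata
* Assumed toy data so the sketch is self-contained (not a real file)
clear
set obs 2500
generate id = _n
generate x = 2 * id
tempfile big
save "`big'"

* Read the number of observations from the file header without loading it
quietly describe using "`big'"
local total = r(N)

local chunk = 1000
local nchunks = 0
forvalues low = 1(`chunk')`total' {
    * The last chunk may be shorter than the others
    local high = min(`low' + `chunk' - 1, `total')
    * Load only the variable id for observations low through high
    use id in `low'/`high' using "`big'", clear
    local ++nchunks
    display "chunk `nchunks': observations `low' to `high' (" _N " rows)"
}
```

Listing variables before the "in" range, as with id here, loads only those columns, which cuts memory use further when the file is wide.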