Monday, February 20, 2012

Stata - loops for big data

A user requested the following:
"I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks."

There are a number of ways to read sets of data files into Stata using a loop.  Following on Ari's example (see previous post), let's say you have a file with a million lines, which is too large for Stata, and you want to read in a thousand lines at a time, do some stuff to each chunk to make it smaller, then append the smaller data sets together to create your final single analytic file.  Here is one way:

*** Loop starts at 1000, then increments by 1000
***     until it reaches one million
forvalues high = 1000(1000)1000000 {
    local low = `high' - 999               // first row of this chunk
    use datafile.dta in `low'/`high', clear

    * <insert code to cut down size of file>

    *** Now build up the temporary file
    if `high' == 1000 {
        save temp, replace                 // only first time through the loop
    }
    else {
        append using temp                  // add the chunks saved so far
        save temp, replace
    }
}

save finalfile.dta, replace
erase temp.dta                             // erase needs the file extension
*** You can also use a tempfile
***  and avoid the extra erase statement
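
The tempfile variant mentioned in the comment above might look roughly like the sketch below. It is not the original code: the file names demo_big.dta and demo_final.dta, the macro names, and the "keep every tenth row" rule are placeholders made up for illustration, and the demo data are generated on the spot so the block runs by itself as one do-file. It also uses `describe using` to read the observation count from the file, so the loop bounds are not hard-coded and a final partial chunk is handled.

```stata
* Build a small demo file so the sketch runs on its own
* (demo_big.dta is a made-up name, standing in for datafile.dta)
clear
set obs 10500
generate long id = _n
save demo_big.dta, replace

* Read the observation count from the file without loading the data
quietly describe using demo_big.dta
local total = r(N)
local chunk = 1000

tempfile building                     // removed automatically when the do-file ends

forvalues low = 1(`chunk')`total' {
    * Last chunk may be shorter than the others
    local high = min(`low' + `chunk' - 1, `total')
    use demo_big.dta in `low'/`high', clear

    * Placeholder reduction step: keep every tenth row
    keep if mod(id, 10) == 0

    if `low' > 1 {
        append using `building'       // add the chunks saved so far
    }
    save `building', replace
}

sort id
save demo_final.dta, replace
```

Because the intermediate file is a tempfile, there is nothing to erase afterwards; only demo_final.dta is left on disk (plus the demo input).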


Another way is to use the 'if' qualifier.  Let's say you have a large database but only want to look at females in that dataset:

use datafile.dta if gender=="female"

You could also put this into a loop to get certain cuts of the data.  Again using the gender example:

local sex male female
foreach s of local sex {
    use datafile.dta if gender == "`s'", clear
    ** creates two data files:
    ** male_newfile.dta and then female_newfile.dta
    save `s'_newfile.dta, replace
}
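
`use` also accepts a variable list in front of `if`, so each cut can load only the columns it needs as well as only the rows. A minimal sketch, assuming made-up names (demo_people.dta, id, age, income) and demo data generated on the spot; only `gender` comes from the example above:

```stata
* Build a tiny demo file (all names except gender are hypothetical)
clear
set obs 6
generate long id = _n
generate str6 gender = cond(mod(_n, 2) == 0, "female", "male")
generate age = 20 + _n
generate income = 1000 * _n
save demo_people.dta, replace

foreach s in male female {
    * Syntax: use varlist if exp using filename
    use id gender age if gender == "`s'" using demo_people.dta, clear
    save `s'_demo.dta, replace        // income was not loaded, so it is not saved
}
```

This leaves male_demo.dta and female_demo.dta on disk, each holding just the three listed variables for one gender.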

 Back to my bananas...

Sincerely,
primary data primate

 

4 comments:

  1. Note that there is a user-written `touch` command for Stata that you can use with `capture` to avoid the if statements the first time you run a loop. Basically, `touch` creates a blank file (optionally with the variables you need in it) so that the subsequent `append` works even if the file didn't previously exist. The `capture` makes sure that when `touch` fails after the first iteration (because the file already exists), the error just gets ignored. This is also a great example of how solving your own problems can solve other people's, as a certain simian wrote `touch` about 5 years ago and still gets e-mail from people who find it useful.

  2. Great stuff Data Monkey. Thanks for the follow-up post.

  3. I still like your way better though. A blog should be appreciated for its overall beauty

  4. Users of this technique should be aware that the use statement reads the entire .dta file, even if instructed to store only a subset of observations. So to keep run times reasonable, it is best to keep the number of iterations in the loop as small as possible while still staying within the available memory.
