Monday, February 20, 2012

STATA - loops for big-data

A user requested the following:
"I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks."

There are a number of ways to read sets of data files into Stata using a loop. Following on Ari's example (see previous post), let's say you have a file with a million lines, which is too large for Stata. You want to read in a thousand lines at a time, do some stuff to each chunk to make it smaller, then append the smaller data sets together to create your final single analytic file. Here is one way:

*** Loop will start at 1000, then increment by 1000
***     until it gets to one million
forvalues high = 1000(1000)1000000 {
    local low = `high' - 999               // first row of this chunk
    use in `low'/`high' using datafile.dta, clear

    * <insert code to cut down size of file>

    *** Now create temporary file
    if `high' == 1000 {
        save temp, replace         // only first time through the loop
    }
    else {
        append using temp          // add the chunks saved so far
        save temp, replace
    }
}

save finalfile.dta, replace
erase temp.dta
 *** You can also use a tempfile
***  and avoid the extra erase statement
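
The tempfile variant mentioned in the comment above might look roughly like this. It is a sketch, not tested against a real million-row file: `datafile.dta` and `finalfile.dta` are the example names from this post, and the macro name `building` is made up for illustration. Because `tempfile` stores the file name in a local macro, the whole block has to be run in one go from a do-file.

```stata
* Sketch only: the macro name "building" is hypothetical
tempfile building          // Stata removes this file when the do-file ends

forvalues high = 1000(1000)1000000 {
    local low = `high' - 999
    use in `low'/`high' using datafile.dta, clear

    * <insert code to cut down size of file>

    * From the second chunk onward, add everything saved so far
    if `high' > 1000 {
        append using "`building'"
    }
    save "`building'", replace
}

save finalfile.dta, replace
```

Since the temporary file is cleaned up automatically, there is no `erase` at the end, and nothing named temp.dta is left behind in the working directory if the loop stops partway through.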

Another way is to use the 'if' qualifier. Let's say you have a large dataset but only want to look at the females in it:

use if gender=="female" using datafile.dta

You could also put this into a loop to get certain cuts of the data. Again using the gender example:

local sex male female
foreach s of local sex {
    use if gender == "`s'" using datafile.dta, clear
    ** create two data files
    ** male_newfile.dta and then female_newfile.dta
    save `s'_newfile.dta, replace
}

 Back to my bananas...

primary data primate



  1. Note that Stata has a `touch` command that you can use with `capture` to avoid the if statements the first time you run a loop. Basically, `touch` creates a blank file (optionally with the variables you need in it) so that the subsequent `append` works even if the file did not previously exist. The `capture` makes sure that when `touch` fails after the first iteration (because the file already exists), the error just gets ignored. This is also a great example of how solving your own problems can solve other people's, as a certain simian wrote `touch` about 5 years ago and still gets e-mail from people who find it useful.

  2. Great stuff Data Monkey. Thanks for the follow-up post.

  3. I still like your way better though. A blog should be appreciated for its overall beauty