Monday, January 31, 2011

STATA: Open partial files

A handy trick for any situation in which you can't open a file because it's too big for memory (or you just want a massive file to load faster so you can get a sense of what the data's like before you begin analysis in earnest): combine -use- with an if or in qualifier, as in:

use hugefile.dta in 1/50000, clear

or

use hugefile.dta if sex == 1, clear
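
If you only need a few variables, the -using- form of -use- lets you pick those too, and -describe using- shows what's in a file without loading it. A minimal sketch, with the built-in auto dataset saved to a tempfile standing in for the huge file (the tempfile name is just for illustration):

```stata
* Stand-in for a huge file: save the auto data to a temporary file
sysuse auto, clear
tempfile big
save "`big'"

* Look at the variable list without loading any data
describe using "`big'"

* Load three variables and only the first 10 observations
use make price mpg in 1/10 using "`big'", clear
list
```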

Sunday, January 30, 2011

Tab completion

Let's say your hands are aching from too much typing in of variables. What to do? Get a keyboard tray and learn proper ergonomics, of course.

But what if you just want to reduce the amount of variable typing you do for reasons of laziness...err...efficiency? Well, you can type the part of the variable name that's unique and then hit Tab. Stata and R (and many other programming environments) will fill in the rest for you.

Suppose you have variables named:
AriWillGetRejectedFromHisFavoriteSchools
MaraWillGetInEverywhere
MarkWhosMark
DumplingsAndData

If you want the first variable, just type "A" and hit TAB.
But if you want the second, you have to type "Mara" and hit TAB, because until you type the fourth letter Stata can't tell whether you mean MaraWillGetInEverywhere or MarkWhosMark.

Saturday, January 29, 2011

STATA: Matrices

Have a loop that runs through a bunch of different, say, years of data sets, but at the end you only want to store a few summary values for each year? It can be a pain keeping track if the dataset is too large to load all the years into memory and do something with -by-. Plus there are things that are hard to store except as individual values. There's no collapse command for correlation, for instance.

What to do....

Well, other languages (especially R) handle this much more elegantly. But there are some work-arounds in Stata. Basically, we'll create an empty matrix with the number of rows for the number of loops we're going to run and the number of columns for the different types of things we want to store (in this case four correlations for each year).

Alternatively, you could use systematically-named locals (e.g. `cor`yr''), but those get moderately ugly when you get past a few variables, 'cause you still have to output them to a dataset.

Here goes:

local startyr = 1988
local endyr = 2001

local numyears = `endyr' - `startyr' + 1
matrix correlations = J(`numyears',4,.)
matrix colnames correlations = CVD_LC CVD_NONLC CVD_INJINT INJINT_INJACC

forvalues year = `startyr'/`endyr' {
local yearnum = `year' - `startyr' + 1
use "`year'data", clear

corr drateCVD drateLC
matrix correlations[`yearnum',1] = r(rho)
corr drateCVD drateNONLC
matrix correlations[`yearnum',2] = r(rho)
corr drateCVD drateINJINT
matrix correlations[`yearnum',3] = r(rho)
corr drateINJINT drateINJACC
matrix correlations[`yearnum',4] = r(rho)
}

drop _all
svmat correlations, names(col)

exit
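
Another route that avoids the matrix entirely is -postfile-, which writes one row of results per loop iteration straight to a dataset. This is a swapped-in technique, not the approach above, and the sketch below makes things up for the sake of being runnable: the auto dataset split by foreign stands in for the yearly files, and the names memhold, results, group and rho_price_mpg are placeholders.

```stata
* Sketch only: groups of the auto dataset stand in for yearly data files
tempname memhold
tempfile results
postfile `memhold' group rho_price_mpg using "`results'"

forvalues g = 0/1 {
    sysuse auto, clear
    keep if foreign == `g'
    corr price mpg
    * Write one row per iteration: the group and its correlation
    post `memhold' (`g') (r(rho))
}

postclose `memhold'
use "`results'", clear
list
```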

Friday, January 28, 2011

STATA: egen basics

-egen- has all sorts of cool things for you to play with. In particular, whenever you're thinking about doing something that spans multiple columns or rows, -egen- is usually the preferred solution. It's especially useful in combination with the -by:- prefix.

For instance:
* Want to sum across rows? egen poptotal = rowtotal(pop1-pop10) (the function was called rsum() in older versions of Stata)
* Want to figure out how many apples are in each household (assuming each row is a person and the apples variable contains the number of apples they own)?
bysort householdID: egen applestotal = total(apples)
egen tag = tag(householdID)
keep if tag == 1
drop tag
keep householdID applestotal
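
To see the apples example run end to end, here's a small sketch with made-up data (two households, five people; the numbers are invented purely for illustration):

```stata
clear
input householdID apples
1 2
1 3
2 0
2 5
2 1
end

* Household total, repeated on every person's row
bysort householdID: egen applestotal = total(apples)

* Keep one row per household
egen tag = tag(householdID)
keep if tag == 1
drop tag
keep householdID applestotal
list
```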

Thursday, January 27, 2011

STATA: Assert

Do you have trouble sleeping at night? Do you have a massive proliferation of -if- statements designed to check that all is well in the world? Have I got a product for you! And what a price, only $9.95. Please address all checks to: Me.

Err, right. -assert- is a simple command. Give it a logical statement (like you would an if option), and it will make your program fail if it's not true. Easy error checking. Now you can sleep.
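
A quick sketch of what that looks like, using the built-in auto dataset. The scalar name assert_rc is just a placeholder; the -capture- on the last assert is only there so the example doesn't halt, since rep78 has missing values in that dataset.

```stata
sysuse auto, clear

* These hold for the auto data, so the do file carries on silently
assert price > 0
assert !missing(mpg)

* This one is false for some rows (rep78 has missing values).
* Without -capture- the do file would stop here with an error.
capture assert !missing(rep78)
scalar assert_rc = _rc
display "return code from the failed assert: " assert_rc
```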

STATA: Formatting display numbers

-format-

Example:
clear
set obs 2
gen x = 1.1234567
gen y = 2
l
format * %09.3f
l
format * %9.3f
l
format * %9.3g
l

Just to clarify, the * in the format command is a varlist (the * means "all" in pretty much any language). You could give it x or y instead. See -help varlist- for more fun with varlists.

And to further clarify, the 0 before the rest of the format string (as in the zero in %09.3f) makes Stata pad with leading zeroes. Why would you ever want this? Well, say you had a state/county code for Autauga County, Alabama (these are known as FIPS codes). That's state 01, county 001. So the combined code is 01001. Now say you actually wanted it to export as 01001 instead of 1001, as it would if it were numeric....

clear
set obs 1
gen fips = 01001
l
tostring fips, replace format(%05.0f)
l

Wednesday, January 26, 2011

STATA: Mass renaming variables

Have a bunch of variables with the same beginning of their name, but you want them to be named something else? E.g.
pop1961
pop1962
pop1963

You could -reshape long-, then -rename-, then -reshape wide-, but that's ugly, takes forever, will generate missing values if you don't have all the years, doesn't work for things not years, etc. etc.

Instead, try:
-renpfix-

E.g. -renpfix pop mom-

Now you've got:
mom1961
mom1962
mom1963
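
If you're on Stata 12 or later, the built-in -rename- can do the same job with wildcards. A minimal sketch with three made-up variables:

```stata
clear
set obs 1
gen pop1961 = 1
gen pop1962 = 2
gen pop1963 = 3

* Swap the pop prefix for mom on every matching variable
rename pop* mom*
describe
```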

Tuesday, January 25, 2011

STATA: Tokenizing locals

There's a command called -tokenize-. Some people use it a lot, some people only a little. You could do everything it does with regular expressions if you reeeally wanted to, but it makes the whole process a bit easier. It works like this:
* First you run -tokenize- on a string you want to break up into pieces. E.g.:
local Races "1 2 4 5"
tokenize `Races'
* Now every word (i.e. something separated by a space in that string) is stored in a series of numbered macros `1' `2' `3' and so on. Try it:
di "`1' `2' `3' `4' `5'"
* But what use is that, you ask? Ah, well you can use them one at a time:
while "`*'" != "" {
local ifrace "`ifrace' | race == `1'"
macro shift
}
keep if `ifrace'
* What's this funny `*' thing? And what the heck is -macro shift-? We'll start with the latter. -macro shift- does exactly what it sounds like: it "shifts" the entire stack of tokenized locals over by one. So the one that used to be `2' will now be `1', and so forth all the way down the line. The one that used to be `1' vanishes into the ether. The `*' local contains all the remaining tokens that haven't been shifted off into the ether yet. So that loop will essentially keep looping over all the words/terms in `Races' until they're exhausted, then be done. Within the loop, you can do whatever you want with the contents.
* Note that to actually make that loop produce a working if statement, you'll need to remove the first |. You could do that either by putting the first -local ifrace...- and -macro shift- outside of the loop, or you could use a regular expression to remove the first | once the local is created.
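
Here's a sketch of the first fix mentioned in that note (handling the first token outside the loop). It assumes numeric race codes, which are made up for illustration, and it tests `1' rather than `*' in the loop condition, which amounts to the same check for tokens that are never empty:

```stata
* Hypothetical numeric race codes
local Races "1 2 4 5"
tokenize `Races'

* Handle the first token outside the loop so there is no leading |
local ifrace "race == `1'"
macro shift

while "`1'" != "" {
    local ifrace "`ifrace' | race == `1'"
    macro shift
}

display "`ifrace'"
```

The resulting local can then go straight into -keep if `ifrace'- on a dataset that has a numeric race variable.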

Monday, January 24, 2011

STATA: Appending in loops

Say you have a loop and you want to add the results of the previous iteration on the end of one data file.

local PBFs "David1 David2 David3"
tempfile PBFdata
foreach PBF of local PBFs {
clear
set obs 5
gen currentPBF = "`PBF'"
append using `PBFdata'
save `PBFdata', replace
}

This is all well and good, except that code above doesn't work. It fails the first time you run the loop, because there's no `PBFdata' file at that point, only the local pointing to an empty location.

What to do? You've got some options:

-if word("`PBFs'",1) != "`PBF'" append using `PBFdata'-
// This works because it checks whether this is the first time you're running the loop or not, but who wants to type all that? Still, that word function is pretty cool, huh?

-cap append using `PBFdata'-
//This works, but if the append fails for some other reason, your program will keep on going and things can go badly wrong.

-touch `PBFdata'-
//This is what I do. I like it so much, that I had to write the -touch- command just to make it work. Note you have to put it outside the loop (I usually put it at the very top of my do file), otherwise you'll overwrite your file each time!

To install touch, type -findit touch- to locate it in the user-contributed repositories.
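
If you'd rather not install anything, one built-in way to get the same effect is to save an empty dataset to the tempfile before the loop using the emptyok option of -save-. This is a sketch of that alternative, not of how -touch- itself works:

```stata
local PBFs "David1 David2 David3"
tempfile PBFdata

* Create an empty file up front so the first append has something to open
clear
save "`PBFdata'", emptyok

foreach PBF of local PBFs {
    clear
    set obs 5
    gen currentPBF = "`PBF'"
    append using "`PBFdata'"
    save "`PBFdata'", replace
}

* All three batches of 5 observations are now stacked together
count
```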

Sunday, January 23, 2011

STATA: Regular expressions

A regular expression allows you to do a moderately fancy search (and replace if you want). So say you wanted to replace all the "Dennis"s in a variable with "Awesome"s, but only if they're at the end of the line. You could try:
-replace PBFnamevar = regexr(PBFnamevar,"Dennis$","Awesome")-
You could also replace any character, or just capitals, or just digits...there are lots of possibilities:
http://www.stata.com/support/faqs/data/regex.html

You can also use it for locals:
-local strata = regexr("agecat","age","")-

Or -if- commands:
if regexm("`strata'","age") {
}

On a related note (although not actually regular expressions), say that you've got a string variable that consists of a bunch of what should be separate variables, only lumped all into one, separated by a semicolon (e.g. a row might look like "1;15.2;89;hi;21"). Try -split-:
-split textvar, gen(newtextvars) parse(";")-

I should note that Stata's regular expressions are wimpy compared to what other languages support. R supports Perl regular expressions, which can do so many things it's scary.
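
To see the "Dennis" replacement from above in action, here's a small sketch with three made-up values. Only the rows where "Dennis" is the very last thing in the string should change:

```stata
clear
input str20 PBFnamevar
"Dennis"
"Dennis Smith"
"Hi Dennis"
end

* $ anchors the match to the end of the string
replace PBFnamevar = regexr(PBFnamevar, "Dennis$", "Awesome")
list
```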

Saturday, January 22, 2011

STATA: Locals in global names

> i have a series of globals with the names: strata_pop1 strata_pop2
> strata_pop3, etc. all the way up to strata_pop21
>
> i'm trying to reference the values stored in each with a loop like this:
> forvalues i = 1/21 {
> display $strata_pop`i'
> }
>
> the problem is, stata seems to display the global $strata_pop first
> (which has nothing stored in it), and then the value of the local `i'
> so all i get is the values of the local `i' spit back at me. is there
> a way to use a loop to reference the values stored in each of the
> strata_pop globals?

The solution is simple: Enclose the global in curly braces, like this:
display ${strata_pop`i'}
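
A tiny sketch you can paste in to try it, with three globals instead of 21 and made-up values:

```stata
* Set up a few numbered globals with invented values
forvalues i = 1/3 {
    global strata_pop`i' = `i' * 100
}

* The braces make Stata expand `i' first, then look up the full global name
forvalues i = 1/3 {
    display ${strata_pop`i'}
}
```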

Friday, January 21, 2011

STATA: Stata resources

All,

Lifted from Marginal Revolution (http://tinyurl.com/5vvzbc4) but reproduced fully here. And I totally agree on the Baum book: buy it now, it is awesome (go to Marginal Revolution for the links).

Stata Resources

Here are some Stata resources that I have found useful. Statistics with Stata by Hamilton is good for beginners although it is overpriced. For the basics I like German Rodriguez's free Stata tutorial best, good material can also be found at UCLA's Stata starter kit and UNC's Stata Tutorial; two page Stata is good for getting started quickly.

Christopher Baum's book An Introduction to Modern Econometrics using Stata is excellent and worth the price. The world is indebted to Baum for a number of Stata programs such as NBERCycles which shades in NBER recession dates on time series graphs--this was a big help in producing graphs for our textbooks!--so buy Baum's book and support a public good.

I have found it hugely useful to peruse the proceedings of Stata meetings where you can find professional guides to using Stata to do advanced econometrics. For example, here is Austin Nichols on Regression Discontinuity and related methods, Robert Guitierrez on Recent Developments in Multilevel Modeling, Colin Cameron on Panel Data Methods and David Drukker on Dynamic Panel Models.

I found A Visual Guide to Stata Graphics very useful and then I lent it to someone who never returned it. I suppose they found it very useful as well. I haven't bought another copy, since it is fairly easy to edit graphs in the newer versions of Stata. You can probably get by with this online guide.

German Rodriguez, mentioned earlier, has an attractively presented class on generalized linear models with lots of material. The LSE has a PhD class on Stata, here are the class notes: Introduction to Stata and Advanced Stata Topics.

Creating a map in Stata is painful since there are a host of incompatible file formats that have to be converted (I spent several hours yesterday working to convert a dBase IV to dBase III file just so I could convert the latter to dta). Still, when it works, it works well. Friedrich Huebler has some of the details.

The reshape command is often critical but difficult, here is a good guide.

Here are many more sources of links: Stata resources, Stata Links, Resources for Learning Stata, and Gabriel Rossman's blog Code and Culture.

Slash confusion

Windows uses backwards slashes to mark off directories (e.g. "c:\temp\PBFSRULE.dta"). UNIX uses forwards slashes (e.g. "c:/temp/PBFSRULE.dta"). Stata accepts either on Windows. However, since the back slash is also used as an escape character (e.g. if you want the ` that starts a local to actually appear as a ` instead of starting a local, you can type \` ), it is not a bad idea to get in the habit of using forward slashes.

That prevents problems with something like this:
local outfile "PBFSRULE.dta"
save "c:\temp\`outfile'"

Besides, if you switch between UNIX and Windows environments, your code will be usable in both environments this way.

For a little history:
http://blogs.msdn.com/larryosterman/archive/2005/06/24/432386.aspx
And a more lyrical interpretation:
http://backslashconspiracy.org/article.php?id=1

Try these in Stata:
local filename "blah.dta"
local directory "c:/temp"
di " `directory'\`filename' "
di "Oops. It doesn't work because the backslash is escaping the character that follows it instead of allowing Stata to interpret it as usual."
di " `directory'/`filename' "
di "Forward slashes don't have this problem."
di " `directory'\\`filename' "
di "This time the first backslash escapes the second one, preventing it from having its usual function of escaping the local quote mark \` "
di "Crazy huh?"

Thursday, January 20, 2011

STATA: Do file header

Recommended code to put at the top of every do file:

clear
cap log close
set mem __m // Where __ is the memory size you want, obviously
set more off
pause on

STATA: Window management

Have you moved all your windows within Stata to the wrong place and can't figure out how to get them back? Try -window manage prefs default- .

Wednesday, January 19, 2011

STATA: Debugging with trace

Having trouble figuring out which pesky line of code is causing a problem? Type -set trace on- and run your do file again. You'll get more ridiculously verbose output than you can shake a stick at, which will take twice as long to run but let you know in excruciating detail exactly what Stata was thinking. Just remember to -set trace off- when you're done.

Tuesday, January 18, 2011

STATA: Profile.do

Each time Stata loads, it runs the commands in the profile.do file located in the Stata installation directory (typically "c:\program files\stata11" on a Windows machine). This provides a handy way to configure your Stata environment just the way you like it. For instance, if you often find that 10MB of memory isn't enough, that you want a global pointing to a particular location*, and that your custom ADO files are in another location, your profile.do might look like:
set mem 50m
global xdrive "c:/xdrive"
adopath ++ "$xdrive/projects/ado"

* This is particularly useful when you routinely do your work on several computers with different directory structures.

Monday, January 17, 2011

STATA: Nifty commands (-expand- and -set obs-)

One of the common ways of getting things done in Stata is to add observations to the end of the dataset, then modify them in some way. The -expand- command makes this easy, by adding replicates of the observations in memory on to the end, after which you can modify them. You will likely want to save the current number of observations to a local so you know which are the new copies: - local originalN = _N -

Want to duplicate your dataset?
-expand 2-
Want to triplicate your dataset?
-expand 3-
Want to duplicate only the observations from year 1999?
-expand 2 if year==1999-

If you want to create blank rows at the end of a dataset, use -set obs- instead:
local numobs = _N + 1
set obs `numobs'
replace x = 10 in l
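
Putting the -expand- pieces together, here's a sketch using the built-in auto dataset. The variable name copy is just a placeholder for flagging the new rows:

```stata
sysuse auto, clear

* Remember how many observations we started with
local originalN = _N

* Duplicate everything; the copies are added at the end
expand 2

* Flag the new copies so they can be modified separately
gen byte copy = _n > `originalN'
tab copy
```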

Thursday, January 13, 2011

STATA: Executing code over multiple dataset with different variables

Panel datasets often come as yearly data files, which may have variables that are dropped or added over time. This is especially true as the panel gets longer. There are times when you just want to automatically go over each yearly file and execute some code, but if you try to execute code on a variable that doesn't exist, Stata will halt the execution of your do file. You could use a bunch of if statements, but here's an easy solution:

capture confirm variable varname
if _rc == 0 {
code block
}

The -capture- command allows a command to be executed even if it creates an error (in this case the command is -confirm variable varname-). If there is an error, -capture- puts the return code in _rc but allows the do file to keep running.

So in this case, if the variable doesn't exist, _rc will be 111; if the variable does exist, _rc is set to 0. The code block is therefore executed only when the variable actually exists; if it doesn't, the do file skips that step and keeps going.
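
Here's a sketch of the pattern with the built-in auto dataset standing in for one of the yearly files. The variable name nosuchvar and the scalar missing_rc are made up for the example:

```stata
sysuse auto, clear

* price exists, so _rc is 0 and the block runs
capture confirm variable price
if _rc == 0 {
    summarize price
}

* nosuchvar does not exist, so the block is skipped and the do file continues
capture confirm variable nosuchvar
scalar missing_rc = _rc
if _rc == 0 {
    summarize nosuchvar
}
display "return code for the missing variable: " missing_rc
```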