Monday, March 7, 2011

STATA: To the Power of _n and _N, filling in missing data

I'm posting this based on a question I got from one of the other students, and it is a common enough of an issue that I thought it would be worthwhile posting a solution.

STATA has a number of built in variables that you can use in pretty powerful ways. Two key ones are _n and _N where _n is the observation number and _N is the total number of observations in your data. One way to use these is to have stata look "up" or "down" your data.

For example, many times you will have data in the following format

id group name
1 1 "Mickey"
2 1 ""
3 1 ""
4 2 "Davy"
5 2 ""
6 3 "Peter"
7 4 "Michael"
8 4 ""
9 4 ""

But you want your data to look like this

id group name
1 1 "Mickey"
2 1 "Mickey"
3 1 "Mickey"
4 2 "Davy"
5 2 "Davy"
6 3 "Peter"
7 4 "Michael"
8 4 "Michael"
9 4 "Michael"

A very simple solution is:

gsort group -name
replace name = name[_n-1] if name=="" & _n !=1

STATA will then go through the data, in the order it is sorted*, and pull the string value for the previous observation [_n-1] and put it in the current observation if it meets the conditions noted (i.e. it isn't the first observation and the current observation has a missing value in the name variable)

* Important note: For string variables you need to specify gsort group -name. The "-" makes sure that the missing values are below the non-missing. For numeric variables, the opposite is required, namely gsort group num_var because STATA handles missing numeric values as very large numbers.

Also, if your data has been tset (set to a time series database) you can use tsfill. Ah but that is for a later post. I need a banana...good monkey...

Happy Coding!!!

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Thank you very much for this handy trick!

    ReplyDelete