The Data Monkey: March 2011

Thursday, March 10, 2011

R: Drop factor levels in a dataset

R has factors, which are very cool (and somewhat analogous to labeled values in Stata). Unfortunately, the full list of levels sticks around even after you remove data, so a level can remain with no observations left in it.

# Create some fake data
x <- as.factor(sample(head(colors()),100,replace=TRUE))
levels(x)
x <- x[x!="aliceblue"]
levels(x) # still the same levels
table(x) # even though one level has 0 entries!

The solution is simple: run factor() again:
x <- factor(x)
levels(x)

If you need to do this on many factors at once (as is the case with a data.frame containing several columns of factors), use drop.levels() from the gdata package:
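library(gdata) # provides drop.levels()
x <- x[x!="antiquewhite1"] # leaves another unused level behind
df <- data.frame(a=x,b=x,c=x)
df <- drop.levels(df) # drops unused levels in every factor column
sapply(df, levels) # the unused level is gone from all three columns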
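
If you would rather not load a package, base R has had its own droplevels() since version 2.12.0, and it works on a whole data.frame as well as on a single factor. Here is a minimal sketch of the same idea. The data is made up and deterministic (no sample()) so the level counts are predictable, and the names x and df are only for illustration:

```r
# Base-R sketch: drop unused factor levels without the gdata package.
# Made-up data: each of the first six color names repeated three times.
x <- factor(rep(head(colors()), times = 3))
x <- x[x != "aliceblue"]             # remove every "aliceblue" observation

df <- data.frame(a = x, b = x, c = x)
sapply(df, nlevels)                  # each column still carries 6 levels

df <- droplevels(df)                 # drop unused levels in every factor column
sapply(df, nlevels)                  # now 5 levels per column
```

Non-factor columns are left alone, so this should be safe to run on a mixed data.frame.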

Now I'm going to quit monkeying around and get to sleep.

Wednesday, March 9, 2011

STATA: Useful tidbits and are you there?

Hey all,

A couple of things: if you find this useful, please comment or "follow us" on the blog. Questions? Leave them in the comments, or just email me or Ari and we can post them.

Useful tidbit?
Two super important user-written commands for STATA that you may not be aware of, but that will make your life A LOT EASIER:

outreg2

and

logout

outreg2: exports your regressions to journal-ready tables in text, Excel, LaTeX, or other formats. It has a lot of options for controlling formatting, adding significance stars, setting the number of decimal places, etc. It can also append multiple models to the same output file.

logout: This nice little utility lets you send almost anything that appears in the STATA results window to a file, such as tables of summary statistics, cross-tabs, etc.

How do you add them to your local copy of STATA? Just type

findit outreg2
findit logout

Then just download the .ado and .hlp files and you are all set. I give them my highest rating, five bananas, so download them now!

Monday, March 7, 2011

STATA: To the Power of _n and _N, filling in missing data

I'm posting this based on a question I got from one of the other students. It is a common enough issue that I thought it would be worthwhile to post a solution.

STATA has a number of built-in variables that you can use in pretty powerful ways. Two key ones are _n and _N, where _n is the number of the current observation and _N is the total number of observations in your data. One way to use these is to have STATA look "up" or "down" your data.

For example, many times you will have data in the following format:

id group name
1 1 "Mickey"
2 1 ""
3 1 ""
4 2 "Davy"
5 2 ""
6 3 "Peter"
7 4 "Michael"
8 4 ""
9 4 ""

But you want your data to look like this:

id group name
1 1 "Mickey"
2 1 "Mickey"
3 1 "Mickey"
4 2 "Davy"
5 2 "Davy"
6 3 "Peter"
7 4 "Michael"
8 4 "Michael"
9 4 "Michael"

A very simple solution is:

gsort group -name
replace name = name[_n-1] if name=="" & _n !=1

STATA will then go through the data, in the order it is sorted*, pull the string value from the previous observation [_n-1], and put it in the current observation if it meets the conditions noted (i.e. it isn't the first observation and the current observation has a missing value in the name variable).

* Important note: For string variables you need to specify gsort group -name. The "-" sorts name in descending order, which makes sure that the missing (empty) values end up below the non-missing ones within each group. For numeric variables, the opposite is required, namely gsort group num_var, because STATA treats missing numeric values as very large numbers, so they already sort last in ascending order.
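
For the R users in the troop, roughly the same fill-down can be sketched in base R. This is an illustration under assumptions, not a translation of the STATA code above: the data frame monkees and the helper fill_down() are made-up names, blanks are assumed to be empty strings (not NA), and the rows are assumed to already be in the order shown in the table, so no sorting is done.

```r
# Made-up example data, mirroring the table in this post
monkees <- data.frame(
  id    = 1:9,
  group = c(1, 1, 1, 2, 2, 3, 4, 4, 4),
  name  = c("Mickey", "", "", "Davy", "", "Peter", "Michael", "", ""),
  stringsAsFactors = FALSE
)

# fill_down() is a hypothetical helper: carry the last non-empty value forward.
# Assumes blanks are "" (not NA). Blanks that come before the first
# non-empty value in a group have nothing to copy and become NA.
fill_down <- function(v) {
  nonempty <- v != ""
  idx <- cumsum(nonempty)   # position of the most recent non-empty value
  idx[idx == 0] <- NA       # nothing to carry forward yet
  v[nonempty][idx]
}

# Apply the helper within each group so names never leak across groups
monkees$name <- ave(monkees$name, monkees$group, FUN = fill_down)
monkees
```

Working within each group (via ave()) is a deliberate choice here: a blank at the top of a group should not pick up the name from the group before it.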

Also, if your data has been tsset (declared as time-series data), you can use tsfill. Ah, but that is for a later post. I need a banana...good monkey...

Happy Coding!!!

Thursday, March 3, 2011

R: Spatial statistics tutorials

I've done more than just monkey around with spatial statistics and map-making, and for that R is one of the best platforms out there (if not the single best). Now there's a promising new tutorial to make some of the analysis a little easier to work out. It looks like a big help for people just getting started exploring spatial data.

Tuesday, March 1, 2011

R: Excel spreadsheet manipulation

Sure, statistical packages are much cooler than Excel for data work, but sometimes other monkeys just like doing things in Excel. And primates are social creatures, so you have to collaborate with them. What to do?

There's a nifty new R package called XLConnect that looks like it will manipulate Excel files nicely.
