tag:blogger.com,1999:blog-21824946749058800522024-03-13T18:53:28.436-04:00The Data MonkeyWe want to share code and tips to make data programming much easier. Most of us work primarily in STATA but SAS, R and other languages might be used sporadically.The Data Monkeyhttp://www.blogger.com/profile/14266318327765112012noreply@blogger.comBlogger64125tag:blogger.com,1999:blog-2182494674905880052.post-57013994235162238012013-04-26T08:54:00.001-04:002013-04-26T08:54:38.682-04:00Noam Ross has a great overview of making code go faster in R <a href="http://www.noamross.net/blog/2013/4/25/faster-talk.html">here</a>, although a lot of the ideas (such as pre-allocation) apply to every language. <a href="http://stackoverflow.com/a/8474941/636656">My own tips</a> are not that dissimilar--a little less complete but more specific.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-11074385820070751382013-03-14T09:43:00.000-04:002013-03-14T09:43:06.364-04:00Apply-style commands in RHere's a quick table of what I think are the most useful apply-style commands in R:
<table>
<tr><td>Function</td><td>Input</td><td>Output</td><td>Best for</td></tr>
<tr><td>apply</td><td>Rectangular</td><td>Rectangular or vector</td><td>Applying function to rows or columns</td></tr>
<tr><td>lapply</td><td>Anything</td><td>List</td><td>Non-trivial operations on almost any data type</td></tr>
<tr><td>sapply</td><td>Anything</td><td>Simplified (if possible) or list</td><td>Same as lapply, but with simplified output</td></tr>
<tr><td>plyr::ddply</td><td>data.frame</td><td>data.frame</td><td>Applying function to groupings defined by variables</td></tr>
</table>
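If it helps to see the same patterns outside R, here is a rough Python sketch of the first three rows of the table (all data and names below are made up for illustration; Python has no direct ddply, so the grouping is done by hand with a dict):

```python
from collections import defaultdict

# apply(m, 1, sum): apply a function over the rows of a rectangle
m = [[1, 2, 3], [4, 5, 6]]
row_sums = [sum(row) for row in m]            # -> [6, 15]

# lapply/sapply: apply a function to each element of a collection
words = ["a", "bb", "ccc"]
lens = [len(w) for w in words]                # -> [1, 2, 3]

# plyr::ddply: apply a function to groupings defined by a variable
records = [("foreign", 22), ("domestic", 52), ("foreign", 3)]
group_totals = defaultdict(int)
for origin, n in records:
    group_totals[origin] += n                 # sum within each group
```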
For alternatives to plyr, see <a href="http://stackoverflow.com/questions/11562656/averaging-column-values-for-specific-sections-of-data-corresponding-to-other-col/11562850#11562850">this post</a> on StackOverflow.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com4tag:blogger.com,1999:blog-2182494674905880052.post-55525212154724310372013-02-02T17:55:00.000-05:002013-02-02T17:55:23.780-05:00R scripts for analyzing survey dataAnother site pops up with open code for analyzing public survey data:
http://www.asdfree.com/
It will be interesting to see whether this gets used by the general public--given the growing trend of data journalism and so forth--versus academics. It is a useful resource for both.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-31976524786536635462012-11-04T06:08:00.000-05:002012-11-04T06:08:13.393-05:00SAS-b-gon?There have been some <a href="http://biostatmatt.com/archives/2256">improvements</a> in the way that R reads the arcane file format that is a native SAS file. Hopefully soon I will never have to use SAS again, and my species can remain homo sapiens as opposed to screech monkey.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-64985112138648043452012-10-16T07:53:00.002-04:002012-10-16T07:53:27.012-04:00NHIS with RHere are handy <a href="https://github.com/ajdamico/usgsd/tree/master/National%20Health%20Interview%20Survey">code snippets</a> and <a href="http://usgsd.blogspot.com/2012/10/analyzing-national-health-interview.html">explanations</a> to get you running on the NHIS.
The same site has R code for the <a href="http://usgsd.blogspot.com/search/label/current%20population%20survey%20%28cps%29">CPS</a> and <a href="http://usgsd.blogspot.com/search/label/area%20resource%20file%20%28arf%29">ARF</a>.
Please be sure to <a href="mailto:ajdamico@gmail.com">thank him</a> if you make use of the code.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-66614227051998175942012-10-15T08:15:00.003-04:002012-10-15T08:15:57.272-04:00Spreadsheet mayhemI just chanced across a <a href="http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html">classic anti-spreadsheet screed</a> (no monkey, not screech, screed!).
See also the European Spreadsheet Risks Interest Group (!)'s list of <a href="http://www.eusprig.org/quotes.htm">quotable quotes</a> and very expensive mistakes directly <a href="http://www.eusprig.org/horror-stories.htm">due to the use of spreadsheets</a>.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com2tag:blogger.com,1999:blog-2182494674905880052.post-64067893770561588492012-07-10T18:03:00.000-04:002012-07-10T18:03:10.073-04:00This is *huge*: SAScii packagehttp://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html
Q: How do you make a hairless primate?
Answer 1: Take a hairy primate, wait a few million years and see if Darwin was right.
Answer 2: Make them work in SAS and watch them pull all their hair out.
Unfortunately many public datasets are released as ASCII files with only SAS code to read them in, name all the variables properly, etc.
Now there's a new kid on the block, the SAScii package for R, which will read in the SAS script, parse it, and deliver you an R file instead. Since R has fabulous import/export abilities (via the foreign package), this means even if you are a Stata user you can take advantage.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com5tag:blogger.com,1999:blog-2182494674905880052.post-37716349598884356432012-05-15T11:11:00.000-04:002012-05-15T11:11:01.935-04:00Cleaning data: Removing unwanted characters from large filesHi all,<br />
<br />
On occasion I have to pull in data from poorly formatted sources, such as Excel, Access, or text/comma/pipe-delimited files. Many times I have problems with single quotes, double quotes, carriage returns, or line feeds. Usually I can strip these things out using Notepad. However, I had an ugly problem with a large pipe-delimited file created from an Oracle database: there were hard returns in a memo field, and this caused numerous problems for SAS and STATA when reading in the file. The memo field should be a single variable, but SAS and STATA "see" the carriage return and start a new observation. Nothing worked (infile, infix, proc import, Notepad, etc.).<br />
<br />
After some hard work I found a cool free hex editor (<a href="http://www.hhdsoftware.com/free-hex-editor">http://www.hhdsoftware.com/free-hex-editor</a>) that lets me view the file's raw ASCII, hex, decimal, octal, float, double, and/or binary encoding (see: <a href="http://www.asciitable.com/">http://www.asciitable.com/</a>). This means I could do a global replace on the hard returns, since the hex editor is agnostic to formatting and shows you everything in the file; nothing is hidden. So in this case I opened the file in octal view and replaced all carriage returns (oct: 015) and line feeds (oct: 012) with spaces (oct: 040). And it is super efficient. The other neat thing is that you can look for all sorts of patterns in the data, which makes string searches really easy.<br />
<br />
So if you ever have trouble reading in a raw data file or have some complicated string variables, you may be able to use these free hex editors to help things along.<br />
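If you'd rather script the fix than point-and-click in a hex editor, the same repair can be sketched in Python (hypothetical file layout: pipe-delimited with a known number of fields per record; a record split by an embedded hard return has too few pipes, so we glue it back together):

```python
def rejoin_records(lines, n_fields=4):
    """Rejoin records that were split by hard returns in a memo field.
    Assumes a pipe-delimited file with n_fields fields per record
    (n_fields=4 is a made-up example)."""
    n_pipes = n_fields - 1
    records, buf = [], ""
    for line in lines:
        line = line.rstrip("\r\n")
        # Accumulate lines until we have a full record's worth of pipes
        buf = buf + " " + line if buf else line
        if buf.count("|") >= n_pipes:
            records.append(buf)
            buf = ""
    if buf:                  # trailing partial record, if any
        records.append(buf)
    return records
```

The point, as with the hex editor, is that the record structure (the pipe count), not the line breaks, tells you where a record really ends.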
<br />
Best,<br />PDP - primary data primateAnonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com4tag:blogger.com,1999:blog-2182494674905880052.post-10091198702114103622012-03-22T07:56:00.000-04:002012-03-24T12:55:02.086-04:00STATA access to World Bank dataTalk about bananas! The World Bank has just published a new version of the wbopendata module that gives STATA users direct access to a lot of their data! More information here:<br />
<div>
<br /></div>
<div>
http://data.worldbank.org/news/accessing-world-bank-open-data-in-stata</div>
<div>
<br /></div>
<div>
According to their website:</div>
<div>
1,000 new indicators for a total of 5,300 time series</div>
<div>
Access to the metadata including indicator definitions and other supporting documentation</div>
<div>
Links to maps from within STATA</div>
<div>
<br /></div>
<div>
And it couldn't be easier to get access, just type:</div>
<div>
<br /></div>
<div>
ssc install wbopendata<br />
<br />
The help file gives you all the details</div>
<div>
<br /></div>Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-50925990655082009672012-03-16T23:13:00.002-04:002012-03-16T23:14:57.939-04:00Dates and times in RNothing looks funnier than a patchy simian. That's why we sighed a great sigh of relief when we spotted this article on the lubridate package in R. It saves a great deal of hair pulling.<br /><br />http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-80012106828687170592012-02-20T07:41:00.002-05:002012-02-20T07:45:53.519-05:00STATA - loops for big-dataA user requested the following:<br />
"I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks."<br />
<br />
There are a number of ways to read sets of data files into STATA using a loop. Following on Ari's example (see previous post), let's say you have a file with a million lines that is too large for STATA, and you want to read in a thousand lines at a time, do some work to make each chunk smaller, then append the smaller data sets together to create your final analytic file. Here is one way:<br />
<br />
*** Loop will start at 1000, then increment by 1000<br />
*** until it gets to one million<br />
forvalues high = 1000(1000)1000000 {<br />
local low = `high' - 999 // lower bound for this chunk<br />
use <i>datafile</i>.<i>dta</i> in `low'/`high', clear<br />
<br />
<<i>insert code to cut down size of file></i> <br />
<br />
*** Now create temporary file <br />
if `high' == 1000 {<br />
save temp, replace //only first time through the loop<br />
}<br />
else {<br />
append using temp <br />
save temp, replace<br />
}<br />
}<br />
<br />
save <i>finalfile.dta</i>, replace<br />
erase temp.dta <br />
*** You can also use a tempfile<br />
*** and avoid the extra erase statement <br />
<br />
<br />
Another way is to use the 'if' qualifier. Let's say you have a large dataset but only want to look at females:<br />
<br />
use <i>datafile.dta</i> if gender=="female"<br />
<br />
You could also put this into a loop to get certain cuts of data; again, the gender example:<br />
<br />
local sex male female<br />
foreach s of local sex {<br />
use <i>datafile.dta</i> if gender == "`s'", clear<br />
** create two data files<br />
** male_newfile.dta and then female_newfile.dta<br />
save `s'_newfile.dta, replace <br />
}<br />
<br />
Back to my bananas...<br />
<br />
Sincerely,<br />
primary data primate<br />
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com4tag:blogger.com,1999:blog-2182494674905880052.post-40852844912984249782012-02-19T22:09:00.003-05:002012-02-19T22:20:49.695-05:00Reading huge files into RSAS is much touted for its ability to read in huge datasets, and rightly so. However, that ability comes at a cost: for smaller datasets, since files remain on the disk rather than in memory (as is the case with Stata and R), it is potentially less fast.<br /><br />If you don't want to learn/buy SAS but you have some large files you need to cut down to size (rather like the gorilla in the corner), R has several packages which can help. In particular, sqldf and ff both have methods to read in large CSV files. More advice is available here: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r<br /><br />If you're a Stata person, you can often get by reading the .dta file in chunks within a loop by adding e.g. "in 1/1000" afterwards if you want to read the 1st through 1000th observation in.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-48141826395821642222012-01-27T10:12:00.000-05:002012-01-27T10:13:09.123-05:00iMacros - webscrapingIf you ever need to download a lot of data, use iMacros (http://wiki.imacros.net/Main_Page)<br />
<br />
Recently I wanted to download a large public data set for multiple years (NHANES), but this would have required a lot of manual downloading. For example, the 2007-2008 NHANES wave has 113 individual files, and I wanted all the files from 1999-2010, so close to a thousand different files.<br />
<br />
In order to do this I found a free browser automation tool, iMacros, which can automate anything that you do in a browser.<br />
<br />
The other nice thing is that it can read in data from a .csv file to update what it has to do. So I just cut and pasted the names of the data files, wrote eleven lines of code, and off the program went, resulting in a rich set of repeated cross sections of NHANES with close to 7000 different variables. Here's the code:<br />
<br />
<br />
VERSION BUILD=7401110 RECORDER=FX<br />
TAB T=1<br />
TAB CLOSEALLOTHERS<br />
URL GOTO=ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/<br />
<br />
SET !TIMEOUT 500<br />
SET !DATASOURCE c:\nhanes_names.csv<br />
SET !DATASOURCE_COLUMNS 1<br />
SET !DATASOURCE_LINE {{!LOOP}}<br />
ONDOWNLOAD FOLDER=* FILE={{!COL1}} WAIT=YES<br />
TAG POS=1 TYPE=A ATTR=TXT:{{!COL1}} CONTENT={{!COL1}}<br />
<br />
Play around with the tutorials, but it is a really easy tool with minimal upfront cost and huge potential returns.Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-49746865662669629862011-09-19T17:22:00.000-04:002011-09-19T17:23:58.430-04:00STATA: Capturing information in labelsBeen quiet for a while; here is a tip on using and getting variable and value labels<br />
<br />
Labeling variables and values is useful for understanding what your underlying data represents. This is great while you are
in the STATA environment. But many times you may want to use this information in more dynamic ways.<br />
<br />
For example, let's say that I am using the auto dataset and I want to output a simple table of frequencies that looks like this:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Car type Freq. </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Domestic 52 </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Foreign 22 </span><br />
<br />
In order to do this I have to get the information stored in my variable labels and value label, so follow along:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">clear </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">set more off </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">sysuse auto </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">label list </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tempname my_table
file open `my_table' using ///</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">"c:\my_table.xls", write replace </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** This is where I get the variable label ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** in long hand ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** local var_name : variable label foreign ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">local var_name : var l foreign </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** Now that this is in a local **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** I can use it anywhere **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** so let's write it to our file ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">file write `my_table' ("`var_name'") _tab ("Freq.") _n </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** Now lets get our frequencies **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** and value labels. First get the ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** name of the label value ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">local nm_label : val l foreign </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> forvalues x=0(1)1 { </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> quietly sum foreign if foreign == `x' </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ** Now to get the label values **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ** for 0 "Domestic" and 1 "Foreign" ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ** in the value label origin ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> local val_name : label `nm_label' `x' </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> file write `my_table' ///</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ("`val_name'") _tab (r(N)) _n </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">} </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> file close `my_table' </span><br />
<br />
<br />
<br />
This is a powerful way to export data in a meaningful fashion and can save you a lot of time. Recall that after the sum, there are a number
of values that we can recover. Type <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">return list</span>; if you need other descriptive statistics, use the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">detail</span> option for the
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">sum</span> command. Also, you can get post-regression estimates through <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">ereturn list</span> after you run a regression. If you
are familiar with using matrices in STATA, then you can get all of your coefficients, etc.<br />
<br />
More on that later<br />
<br />
Happy coding monkeys...Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-76907218683024144782011-06-24T11:48:00.000-04:002011-09-19T16:36:22.148-04:00STATA: Geographic data - cool new commandsI emailed this to most Wharton PhD health care students, but thought this was worthy posting here for others. There are two new commands in stata that allow you to link with google maps and turn addresses into latitudes and longitudes as well as calculate distances and travel times. <br />
<br />
First type:<br />
<br />
findit geocode<br />
<br />
And install the two commands, geocode and traveltime.<br />
<br />
What do these two commands do?<br />
<br />
First, geocode can take addresses in many sorts of formats and return the latitude and longitude based on Google Geocoding. Because it uses Google, the matches can be pretty good, there is flexibility in the address formats, and geocode can also return a geoscore, which gives you an estimate of the accuracy of the match.<br />
<br />
Once you have the latitudes and longitudes you can use the traveltime command to find the distance between points AND the travel time. What is really cool is that it can be driving, walking, or public transport time.<br />
<br />
These are probably really useful for a lot of hospital based studies, and other things. Either way check out the help documentation to learn more.<br />
<br />
All the best!!! - Hat tip to Mike Harhay who put me onto this.Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com6tag:blogger.com,1999:blog-2182494674905880052.post-61453296701669719442011-06-23T10:16:00.004-04:002011-06-23T10:23:31.027-04:00Two cool new packages for RLet's say you have some data stored in a primate-tive format like paper. But you'd like to get it into something a little more evolved. A new R package called <a href="http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Poisot.pdf">digitize </a>lets you do just that. Click a few points to calibrate the axis, and all your new shiny scatterplot points will be stored as real digital data. Not a tool you'll use often, but invaluable when you need it.<br /><br />If you've ever monkeyed around with ArcGIS, you'll know that it produces pretty maps. Unfortunately its interface is terrible, it crashes frequently, and it's not very easy to automate. R, on the other hand, does not crash and is easy to automate, but its maps are pretty ugly. Enter rworldmap, a package which produces pretty world maps like this:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.oga-lab.net/RGM_results/rworldmap/rworldmap-package/rworldmap-package_001_big.png"><img style="cursor:pointer; cursor:hand;width: 500px; height: 400px;" src="http://www.oga-lab.net/RGM_results/rworldmap/rworldmap-package/rworldmap-package_001_big.png" border="0" alt="" /></a><br /><br />It's a marked improvement. 
<a href="http://journal.r-project.org/archive/2011-1/RJournal_2011-1_South.pdf">Read more </a>in the R Journal.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-21261116372803519082011-06-09T16:37:00.000-04:002011-06-09T16:39:03.010-04:00We are now a part of R-bloggers.com<a href="http://www.R-bloggers.com">R-bloggers</a> is a site that aggregates many of the best R blogs on the internet. We're glad they've allowed our R-related posts to be aggregated there. If you mainly write in R, it's worth checking them out.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-56124200577273393122011-06-09T15:18:00.004-04:002011-06-09T15:31:14.821-04:00R: Speeding things upR is many things, but it's not exactly speedy like a <a href="http://ngm.nationalgeographic.com/ngm/0402/resources_cre.html">Patas Monkey</a>. In fact, while it is much faster than many other solutions, R is notably slower than Stata (even inspiring talks that it should be <a href="http://www.r-bloggers.com/%E2%80%9Csimply-start-over-and-build-something-better%E2%80%9D/">rewritten from scratch</a>!).<br /><br />Fortunately, Radford Neal has been hard at work speeding R up, and has released some <a href="http://radfordneal.wordpress.com/2011/06/09/new-patches-to-speed-up-r-2-13-0/">new patches </a>to play with if you find it too slow. 
You can also try <a href="http://dirk.eddelbuettel.com/code/rcpp.html">writing key sections in C++</a>, or using <a href="http://www.revolutionanalytics.com/products/enterprise-performance.php">Revolution Analytics' offerings</a> (free for academics).<br /><br />For extreme speed needs, however, R can't be beat, as it has long offered <a href="http://cran.r-project.org/web/packages/gputools/index.html">graphics-card based extreme parallelism </a>that commercial solutions are only <a href="http://www.mathworks.com/discovery/matlab-gpu.html">beginning to match</a>.<br /><br />Of course, for more prosaic needs, focusing on vectorizing key operations can solve speed troubles. And it's worth noting that the $1,000+ per copy that Stata costs can buy an awful lot of extra processing power to throw at the problem.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-5440486901646079622011-04-18T15:36:00.000-04:002011-04-18T15:36:18.313-04:00SAS: Design of experiments - Marketing researchAll,<br />
<br />
There have been some requests for SAS tips, so I'll post a couple of useful things over the next few weeks. SAS has a lot of functions that STATA doesn't have, or that are hard to do in STATA. For example, doing maps with data is quite easy, like displaying immunization rates by country on a world map (more on this later). <br />
<br />
For this post, I just wanted to point people to an excellent resource if you ever have to design an experiment. <br />
<br />
http://support.sas.com/techsup/technote/mr2010.pdf<br />
<br />
This was put together by Warren Kuhfeld and is an excellent guide on how to design discrete choice and conjoint studies using SAS, along with a number of other marketing-based analyses. These obviously come out of the marketing area, but the techniques are being increasingly adapted to the health care field to elicit patient or provider preferences. I found it quite useful in a discrete choice experiment I will be testing on physicians dealing with smoking cessation.<br />
<br />
Best,<br />
Monkey out...Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-46061030932203831732011-04-13T16:50:00.000-04:002011-09-19T16:36:22.133-04:00STATA: file write, or a way to export almost anythingThis is a bit of a repost, but it is so useful that I thought it would be useful to people.<br />
<br />
Ever want to get a formatted table of summary statitics exported directly from Stata? Outreg2 does a great job with exporting regression results, but what about variable means, variances, or other summary statitics. A great way to do this is with <i>file write</i>. This is a great command and provides you with a lot of control. Its simple:<br />
<br />
sysuse auto<br />
file open myfile using "C:/mytable.txt", write replace<br />
file write myfile "Table of descriptive stats" _n _n<br />
file write myfile _tab "Mean" _tab "5th pct" _tab "95th pct" _n<br />
quietly sum price, detail<br />
file write myfile "Price" _tab %7.2f (r(mean)) _tab %7.2f (r(p5)) ///<br />
_tab %7.2f (r(p95)) _n<br />
file close myfile<br />
<br />
Here is what just happened. We first open a file with the handle "myfile" that is associated with a text file "mytable.txt". Then I write a header on the first line. The <i>_n</i> sends a hard return, so I sent two hard returns after the header. Then I write my column headers, separated by tabs (<i>_tab</i>). Then I write my formatted summary statistics (<i>%7.2f</i>), again separated by tabs. Note: you can write anything that is shown in return list or ereturn list, so it is pretty flexible. Finally, I close the file. I have created a tab-delimited text file that we can open in Excel or elsewhere.<br />
<br />
When you combine this with loops and lists of variables that you can store in a local macro, it makes exporting standard tables very easy and automated. See my <a href="http://thedatamonkey.blogspot.com/2010/02/stata-code-for-easy-descriptive-table.html">February 2010 </a>post for a more complicated example.<br />
<br />
Happy coding...Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-35788120246779968932011-03-10T00:38:00.002-05:002011-03-10T00:47:54.581-05:00R: Drop factor levels in a datasetR has factors, which are very cool (and somewhat analogous to labeled levels in Stata). Unfortunately, the factor list sticks around even if you remove some data such that no examples of a particular level still exist<br /><br /># Create some fake data<br />x <- as.factor(sample(head(colors()),100,replace=TRUE))<br />levels(x)<br />x <- x[x!="aliceblue"]<br />levels(x) # still the same levels<br />table(x) # even though one level has 0 entries!<br /><br />The solution is simple: run factor() again:<br />x <- factor(x)<br />levels(x)<br /><br />If you need to do this on many factors at once (as is the case with a data.frame containing several columns of factors), use drop.levels() from the gdata package:<br />x <- x[x!="antiquewhite1"]<br />df <- data.frame(a=x,b=x,c=x)<br />df <- drop.levels(df)<br /><br />Now I'm going to quit monkeying around and get to sleep.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-84447469502562654632011-03-09T11:19:00.000-05:002011-09-19T16:36:22.156-04:00STATA: Useful tidbits and are you there?Hey all,<br />
<br />
A couple of things: if you find this useful, please comment or "follow us" on the blog. Questions? Leave them in the comments or post them (or just email me or Ari and we can post):<br />
<br />
Useful tidbit?<br />
Two super important user-written commands for STATA that you may not be aware of but that will make your life A LOT EASIER:<br />
<br />
<br />
<i>outreg2</i><br />
<br />
and <br />
<br />
<i>logout</i><br />
<br />
<br />
<i>outreg2</i>: exports your regressions to journal ready tables in text, excel, latex, or other formats. It has a lot of options such as controlling formatting, adding in stars, number of decimal places, etc. It can also append multiple models to the same output file.<br />
<br />
<i>logout</i>: This nice little utility also allows you to output almost anything that appears on the STATA window to a file like tables of summary statistics, cross-tabs, etc.<br />
<br />
How do you add them to your local copy of STATA? Just type <br />
<br />
<i>findit outreg2</i> and <i>findit logout</i><br />
<br />
Then just download the .ado and .hlp files and you are all set. I give them my highest rating, five bananas, so download them now!Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-71266376887735990732011-03-07T12:44:00.002-05:002011-09-19T16:36:22.141-04:00STATA: To the Power of _n and _N, filling in missing dataI'm posting this based on a question I got from one of the other students, and it is a common enough of an issue that I thought it would be worthwhile posting a solution.<br />
<br />
STATA has a number of built-in variables that you can use in pretty powerful ways. Two key ones are _n and _N, where _n is the observation number and _N is the total number of observations in your data. One way to use these is to have STATA look "up" or "down" your data.<br />
<br />
For example, many times you will have data in the following format<br />
<br />
id group name<br />
1 1 "Mickey"<br />
2 1 ""<br />
3 1 ""<br />
4 2 "Davy"<br />
5 2 ""<br />
6 3 "Peter"<br />
7 4 "Michael"<br />
8 4 ""<br />
9 4 ""<br />
<br />
But you want your data to look like this<br />
<br />
id group name<br />
1 1 "Mickey"<br />
2 1 "Mickey"<br />
3 1 "Mickey"<br />
4 2 "Davy"<br />
5 2 "Davy"<br />
6 3 "Peter"<br />
7 4 "Michael"<br />
8 4 "Michael"<br />
9 4 "Michael"<br />
<br />
A very simple solution is:<br />
<br />
gsort group -name<br />
replace name = name[_n-1] if name=="" & _n !=1<br />
<br />
STATA will then go through the data, in the order it is sorted*, and pull the string value from the previous observation [_n-1] into the current observation if it meets the conditions noted (i.e. it isn't the first observation and the current observation has a missing value in the name variable)<br />
<br />
* <b>Important note</b>: For string variables you need to specify <i>gsort group -name</i>. The "-" makes sure that the missing values are below the non-missing. For numeric variables, the <i><b>opposite</b> </i>is required, namely <i>gsort group num_var </i>because STATA handles missing numeric values as very large numbers. <br />
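For the monkeys working outside STATA, the same fill-down logic can be sketched in Python (made-up data; one pass, remembering the last non-missing name seen in each group):

```python
def fill_down(rows):
    """rows: list of (id, group, name) tuples, sorted so that the
    non-missing name comes first within each group."""
    last_name = {}            # last non-missing name seen, per group
    filled = []
    for rid, grp, name in rows:
        if name:
            last_name[grp] = name
        filled.append((rid, grp, last_name.get(grp, "")))
    return filled
```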
<br />
Also, if your data has been tsset (declared as a time series) you can use tsfill. Ah, but that is for a later post. I need a banana...good monkey...<br />
<br />
Happy Coding!!!Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com2