Friday, April 26, 2013

Noam Ross has a great overview of making code go faster in R here, although a lot of the ideas (such as pre-allocation) apply to every language. My own tips are not that dissimilar--a little less complete but more specific.
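As a quick illustration of the pre-allocation point, here is a minimal sketch (the vector name and length are arbitrary): growing an object inside a loop forces R to copy it on every iteration, while filling a pre-allocated vector, or vectorising outright, avoids that.

n <- 1e5

# Slow: the vector is copied and re-allocated on every iteration
grow <- c()
for (i in 1:n) grow <- c(grow, i^2)

# Faster: allocate the full length once, then fill in place
fill <- numeric(n)
for (i in 1:n) fill[i] <- i^2

# Fastest here: a vectorised expression avoids the explicit loop entirely
vec <- (1:n)^2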

Thursday, March 14, 2013

Apply-style commands in R

Here's a quick table of what I think are the most useful apply-style commands in R:
Function       Input         Output                            Best for
apply          Rectangular   Rectangular or vector             Applying function to rows or columns
lapply         Anything      List                              Non-trivial operations on almost any data type
sapply         Anything      Simplified (if possible) or list  Same as lapply, but with simplified output
plyr::ddply    data.frame    data.frame                        Applying function to groupings defined by variables
For alternatives to plyr, see this post on StackOverflow.
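A minimal sketch of each of the four, using the built-in mtcars data (plyr is assumed to be installed):

library(plyr)

# apply: work over the rows (1) or columns (2) of a rectangular object
apply(mtcars, 2, mean)                          # column means

# lapply: returns a list with one element per input element
lapply(mtcars[, c("mpg", "hp")], summary)

# sapply: same call as lapply, but simplified to a vector or matrix where possible
sapply(mtcars[, c("mpg", "hp")], mean)

# plyr::ddply: split a data.frame by a grouping variable, apply, recombine
ddply(mtcars, "cyl", summarise, mean_mpg = mean(mpg))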

Saturday, February 2, 2013

R scripts for analyzing survey data

Another site pops up with open code for analyzing public survey data: http://www.asdfree.com/ It will be interesting to see whether this gets used by the general public--given the growing trend of data journalism and so forth--versus academics. It is a useful resource for both.

Sunday, November 4, 2012

SAS-b-gon?

There have been some improvements in the way that R reads the arcane file format that is a native SAS file. Hopefully soon I will never have to use SAS again, and my species can remain homo sapiens as opposed to screech monkey.
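The package isn't named above, but one option along these lines is the sas7bdat package, which reads native .sas7bdat files straight into a data frame without needing a SAS installation. A minimal sketch (the file name is a placeholder):

# install.packages("sas7bdat")
library(sas7bdat)

# Read a native SAS dataset directly into R ("mydata.sas7bdat" is hypothetical)
dat <- read.sas7bdat("mydata.sas7bdat")
str(dat)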

Tuesday, October 16, 2012

NHIS with R

Here are handy code snippets and explanations to get you running on the NHIS. The same site has R code for the CPS and ARF. Please be sure to thank him if you make use of the code.

Monday, October 15, 2012

Spreadsheet mayhem

I just chanced across a classic anti-spreadsheet screed (no monkey, not screech, screed!). See also the European Spreadsheet Risks Interest Group (!)'s list of quotable quotes and very expensive mistakes directly due to the use of spreadsheets.

Tuesday, July 10, 2012

This is *huge*: SAScii package

http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html

Q: How do you make a hairless primate? Answer 1: Take a hairy primate, wait a few million years, and see if Darwin was right. Answer 2: Make them work in SAS and watch them pull all their hair out. Unfortunately, many public datasets are released as ASCII files with only SAS code to read them in, name all the variables properly, and so on. Now there's a new kid on the block, the SAScii package for R, which reads the SAS script, parses it, and uses it to pull the ASCII data into an R data frame. Since R has fabulous import/export abilities (via the foreign package), even Stata users can take advantage.
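A minimal sketch of the workflow (the file names are placeholders; see the linked post for worked examples):

# install.packages("SAScii")
library(SAScii)
library(foreign)

# Read the fixed-width ASCII file using the layout described in the SAS script
# ("data.txt" and "readin.sas" are hypothetical file names)
dat <- read.SAScii("data.txt", "readin.sas")

# Hand the result to Stata users via the foreign package
write.dta(dat, "data.dta")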

Tuesday, May 15, 2012

Cleaning data: Removing unwanted characters from large files

Hi all,

On occasion I have to pull in data from poorly formatted sources, such as Excel, Access, or text/comma/pipe-delimited files. Many times I have problems with single quotes, double quotes, carriage returns, or line feeds. Usually I can strip these things out using Notepad. However, I had an ugly problem with a large pipe-delimited file that was created from an Oracle database. There were hard returns in a memo field, and this caused numerous problems for SAS and STATA trying to read the file. The memo field should be a single variable, but SAS and STATA "see" the carriage return and start a new observation, causing multiple problems. Nothing worked (infile, infix, proc import, Notepad, etc.).

After some hard work I found a cool free hex editor (http://www.hhdsoftware.com/free-hex-editor) that lets me look at the file's raw ASCII, hex, decimal, octal, float, double, and/or binary encoding (see: http://www.asciitable.com/). What this means is that I could do a global replace on any of the hard returns, since the hex editor is agnostic to formatting and shows you everything in the file; nothing is hidden. So in this case I viewed the file in octal and replaced all carriage returns (oct: 015) and line feeds (oct: 012) with spaces (oct: 040). And it is super efficient. The other neat thing is that you can look for all sorts of patterns in the data, which makes string searches really easy.

So if you ever have trouble reading in a raw data file or have some complicated string variables, you may be able to use these free hex editors to help things along.
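If you would rather script the same byte-level fix, here is a minimal sketch in R (assuming the problem characters are bare carriage returns; the file names are placeholders):

# Read the file as raw bytes, so nothing is hidden or reinterpreted
infile <- "rawdata.txt"                       # hypothetical file name
bytes  <- readBin(infile, what = "raw", n = file.info(infile)$size)

# Replace carriage returns (hex 0D, oct 015) with spaces (hex 20, oct 040);
# the same line with 0x0A would handle line feeds
bytes[bytes == as.raw(0x0D)] <- as.raw(0x20)

writeBin(bytes, "rawdata_clean.txt")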

Best,
PDP - primary data primate

Thursday, March 22, 2012

STATA access to World Bank data

Talk about bananas! The World Bank has just published a new version of the wbopendata module, which gives STATA users direct access to a lot of their data! More information here:

http://data.worldbank.org/news/accessing-world-bank-open-data-in-stata

According to their website:
     1,000 new indicators for a total of 5,300 time series
     Access to the metadata including indicator definitions and other supporting documentation
     Links to maps from within STATA

And it couldn't be easier to get access; just type:

ssc install wbopendata

The help file gives you all the details.

Friday, March 16, 2012

Dates and times in R

Nothing looks funnier than a patchy simian. That's why we sighed a great sigh of relief when we spotted this article on the lubridate package in R. It saves a great deal of hair pulling.

http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
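A small taste, assuming lubridate is installed (the dates are arbitrary):

library(lubridate)

# Parse dates in whatever order the components arrive
d1 <- ymd("2012-03-16")
d2 <- mdy("03/16/2012")

# Pull out pieces and do arithmetic without fighting format strings
wday(d1, label = TRUE)      # day of the week
d1 + days(30)               # simple date arithmetic
d1 %m+% months(1)           # add a month without rolling past month ends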

Monday, February 20, 2012

STATA - loops for big-data

A user requested the following:
"I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks."

There are a number of ways to read data into Stata in chunks using a loop. Following on Ari's example (see the previous post), let's say you have a file with a million observations that is too large for Stata, and you want to read in a thousand observations at a time, do some work to cut each chunk down, then append the smaller chunks together to create your final analytic file. Here is one way:

*** Loop starts at 1,000 and increments by 1,000
*** until it reaches one million
forvalues high = 1000(1000)1000000 {
    local low = `high' - 999                    // lower bound of this chunk
    use in `low'/`high' using datafile.dta, clear

    * <insert code to cut down size of file>

    *** Now build up the temporary file
    if `high' == 1000 {
        save temp, replace                      // only the first time through the loop
    }
    else {
        append using temp
        save temp, replace
    }
}

save finalfile.dta, replace
erase temp.dta

*** You can also use a tempfile
*** and avoid the extra erase statement


Another way is to use an 'if' qualifier. Let's say you have a large dataset but only want to look at females:

use if gender=="female" using datafile.dta, clear

You could also put this into a loop to get certain cuts of the data, again using the gender example:

local sex male female
foreach s of local sex {
    use if gender == "`s'" using datafile.dta, clear
    ** creates two data files:
    ** male_newfile.dta and then female_newfile.dta
    save `s'_newfile.dta, replace
}

 Back to my bananas...

Sincerely,
primary data primate

 

Sunday, February 19, 2012

Reading huge files into R

SAS is much touted for its ability to read in huge datasets, and rightly so. However, that ability comes at a cost: because SAS keeps files on disk rather than holding them in memory (as Stata and R do), it can be slower for smaller datasets.

If you don't want to learn/buy SAS but you have some large files you need to cut down to size (rather like the gorilla in the corner), R has several packages which can help. In particular, sqldf and ff both have methods to read in large CSV files. More advice is available here: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
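A minimal sketch of each approach (the file and column names are placeholders, and both packages need to be installed):

library(sqldf)
library(ff)

# sqldf: filter a large CSV with SQL while reading, so only the subset reaches memory
# ("big.csv" and the age cutoff are hypothetical)
small <- read.csv.sql("big.csv", sql = "select * from file where age > 65")

# ff: read the whole CSV in chunks into an on-disk ff object
big <- read.csv.ffdf(file = "big.csv")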

If you're a Stata person, you can often get by reading the .dta file in chunks within a loop using an in qualifier on the use command, e.g. use in 1/1000 using myfile.dta reads the 1st through 1000th observations.

Friday, January 27, 2012

iMacros - webscraping

If you ever need to download a lot of data, use iMacros (http://wiki.imacros.net/Main_Page)

Recently I wanted to download a large public data set covering multiple years (NHANES), but this would have required a lot of manual downloading. For example, the 2007-2008 NHANES wave has 113 individual files, and I wanted all the files from 1999-2010, so close to a thousand different files.

To do this I found iMacros, a free browser automation tool that can automate anything you do in a browser.

The other nice thing is that it can read in data from a .csv file to update what it has to do. So I just cut and pasted the names of the data files, wrote eleven lines of code, and off the program went, resulting in a rich set of repeated cross sections of NHANES with close to 7,000 different variables. Here's the code:


VERSION BUILD=7401110 RECORDER=FX
TAB T=1
TAB CLOSEALLOTHERS
' Go to the NHANES FTP directory for the wave being downloaded
URL GOTO=ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/

SET !TIMEOUT 500
' The .csv file holds one data-file name per row; the current loop
' iteration ({{!LOOP}}) picks which row to use
SET !DATASOURCE c:\nhanes_names.csv
SET !DATASOURCE_COLUMNS 1
SET !DATASOURCE_LINE {{!LOOP}}
' Save the next download under the name from column 1, then click the
' link whose text matches that name
ONDOWNLOAD FOLDER=* FILE={{!COL1}} WAIT=YES
TAG POS=1 TYPE=A ATTR=TXT:{{!COL1}} CONTENT={{!COL1}}

Play around with the tutorials; it is a really easy tool with minimal upfront cost and huge potential returns.

Tuesday, January 17, 2012

Download them *ALL*

Have a bunch of tasty, tasty bananas (er, files) you want to download? But they're stuck on a webpage?

Get the Firefox web browser and add the DownThemAll extension. It lets you download all of the links of a particular type on a page. Super handy.

Wednesday, November 2, 2011

Health economics datasets

"This Compendium of Health Datasets for Economists (ICoHDE) provides the largest collection of English specific datasets available for researchers interested in the field of economics of health and health care."

http://www.herc.ox.ac.uk/icohde