Friday, January 27, 2012

iMacros - webscraping

If you ever need to download a lot of data, try iMacros (http://wiki.imacros.net/Main_Page).

Recently I wanted to download a large public data set (NHANES) for multiple years, but this would have required a lot of manual downloading.  For example, the 2007-2008 NHANES wave has 113 individual files, and I wanted all the files from 1999-2010, so close to a thousand different files.

To do this I found iMacros, a free browser automation tool that can automate anything you do in a browser.

The other nice thing is that it can read data from a .csv file to update what it has to do.  So I just copied and pasted the names of the data files into a .csv file, wrote eleven lines of code, and off the program went, resulting in a rich set of repeated cross sections of NHANES with close to 7000 different variables.  Here's the code:


VERSION BUILD=7401110 RECORDER=FX
TAB T=1
TAB CLOSEALLOTHERS
URL GOTO=ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/

SET !TIMEOUT 500
SET !DATASOURCE c:\nhanes_names.csv
SET !DATASOURCE_COLUMNS 1
SET !DATASOURCE_LINE {{!LOOP}}
ONDOWNLOAD FOLDER=* FILE={{!COL1}} WAIT=YES
TAG POS=1 TYPE=A ATTR=TXT:{{!COL1}} CONTENT={{!COL1}}

Play around with the tutorials; it is a really easy tool with minimal upfront cost but huge potential returns.
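
For anyone who would rather script it, the loop the macro performs (read one file name per row from the .csv, then fetch that file from the directory) can be roughly sketched in Python. This is only a sketch under assumptions, not what iMacros does internally: the helper names read_names, build_urls and download_all are made up for illustration, the .csv is assumed to hold one file name in its first column, and the FTP address is simply the one from the macro above, which may no longer be reachable.

```python
import csv
import io
import os
from urllib.request import urlretrieve

# Directory taken from the macro above; it may no longer be online.
BASE_URL = "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/"


def read_names(csv_text):
    """Return the first column of every non-empty row of the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


def build_urls(base_url, names):
    """Append each file name to the directory URL."""
    return [base_url.rstrip("/") + "/" + name for name in names]


def download_all(urls, folder):
    """Fetch each URL into folder, keeping the original file name."""
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        target = os.path.join(folder, url.rsplit("/", 1)[-1])
        urlretrieve(url, target)
```

To use it, read the list of names (for example with open(r"c:\nhanes_names.csv").read()), pass the result through read_names and build_urls, and hand the URLs to download_all along with a folder name of your choice.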

Tuesday, January 17, 2012

Download them *ALL*

Have a bunch of tasty, tasty bananas (er, files) you want to download? But they're stuck on a webpage?

Get the Firefox web browser and add the DownThemAll extension. It lets you download all of the links of a particular type on a page. Super handy.
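
If you are curious what "all of the links of a particular type" amounts to, here is a minimal Python sketch of the idea. It is not how DownThemAll itself is implemented; the class LinkCollector and the function links_of_type are hypothetical names, and it assumes you already have the page's HTML as a string and only care about matching links by file extension.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_of_type(html, page_url, extension):
    """Return absolute URLs of links whose href ends with extension."""
    parser = LinkCollector()
    parser.feed(html)
    suffix = extension.lower()
    return [
        urljoin(page_url, href)  # resolve relative links against the page
        for href in parser.links
        if href.lower().endswith(suffix)
    ]
```

Calling links_of_type(page_html, page_url, ".pdf") gives back the list of PDF links on the page, which could then be downloaded one by one.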