tag:blogger.com,1999:blog-21824946749058800522024-03-13T18:53:28.436-04:00The Data MonkeyWe want to share code and tips to make data programming much easier. Most of us work primarily in STATA but SAS, R and other languages might be used sporadically.The Data Monkeyhttp://www.blogger.com/profile/14266318327765112012noreply@blogger.comBlogger64125tag:blogger.com,1999:blog-2182494674905880052.post-57013994235162238012013-04-26T08:54:00.001-04:002013-04-26T08:54:38.682-04:00Noam Ross has a great overview of making code go faster in R <a href="http://www.noamross.net/blog/2013/4/25/faster-talk.html">here</a>, although a lot of the ideas (such as pre-allocation) apply to every language. <a href="http://stackoverflow.com/a/8474941/636656">My own tips</a> are not that dissimilar--a little less complete but more specific.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-11074385820070751382013-03-14T09:43:00.000-04:002013-03-14T09:43:06.364-04:00Apply-style commands in RHere's a quick table of what I think are the most useful apply-style commands in R:
<table>
<tr><td>Function</td><td>Input</td><td>Output</td><td>Best for</td></tr>
<tr><td>apply</td><td>Rectangular</td><td>Rectangular or vector</td><td>Applying function to rows or columns</td></tr>
<tr><td>lapply</td><td>Anything</td><td>List</td><td>Non-trivial operations on almost any data type</td></tr>
<tr><td>sapply</td><td>Anything</td><td>Simplified (if possible) or list</td><td>Same as lapply, but with simplified output</td></tr>
<tr><td>plyr::ddply</td><td>data.frame</td><td>data.frame</td><td>Applying function to groupings defined by variables</td></tr>
</table>
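If it helps to see the same patterns outside R, here is a rough Python sketch of the first three rows of the table (all data and names below are made up for illustration; Python has no direct ddply, so the grouping is done by hand with a dict):

```python
from collections import defaultdict

# apply(m, 1, sum): apply a function over the rows of a rectangle
m = [[1, 2, 3], [4, 5, 6]]
row_sums = [sum(row) for row in m]            # -> [6, 15]

# lapply/sapply: apply a function to each element of a collection
words = ["a", "bb", "ccc"]
lens = [len(w) for w in words]                # -> [1, 2, 3]

# plyr::ddply: apply a function to groupings defined by a variable
records = [("foreign", 22), ("domestic", 52), ("foreign", 3)]
group_totals = defaultdict(int)
for origin, n in records:
    group_totals[origin] += n                 # sum within each group
```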
For alternatives to plyr, see <a href="http://stackoverflow.com/questions/11562656/averaging-column-values-for-specific-sections-of-data-corresponding-to-other-col/11562850#11562850">this post</a> on StackOverflow.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com4tag:blogger.com,1999:blog-2182494674905880052.post-55525212154724310372013-02-02T17:55:00.000-05:002013-02-02T17:55:23.780-05:00R scripts for analyzing survey dataAnother site pops up with open code for analyzing public survey data:
http://www.asdfree.com/
It will be interesting to see whether this gets used by the general public--given the growing trend of data journalism and so forth--versus academics. It is a useful resource for both.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-31976524786536635462012-11-04T06:08:00.000-05:002012-11-04T06:08:13.393-05:00SAS-b-gon?There have been some <a href="http://biostatmatt.com/archives/2256">improvements</a> in the way that R reads the arcane file format that is a native SAS file. Hopefully soon I will never have to use SAS again, and my species can remain homo sapiens as opposed to screech monkey.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-64985112138648043452012-10-16T07:53:00.002-04:002012-10-16T07:53:27.012-04:00NHIS with RHere are handy <a href="https://github.com/ajdamico/usgsd/tree/master/National%20Health%20Interview%20Survey">code snippets</a> and <a href="http://usgsd.blogspot.com/2012/10/analyzing-national-health-interview.html">explanations</a> to get you running on the NHIS.
The same site has R code for the <a href="http://usgsd.blogspot.com/search/label/current%20population%20survey%20%28cps%29">CPS</a> and <a href="http://usgsd.blogspot.com/search/label/area%20resource%20file%20%28arf%29">ARF</a>.
Please be sure to <a href="mailto:ajdamico@gmail.com">thank him</a> if you make use of the code.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-66614227051998175942012-10-15T08:15:00.003-04:002012-10-15T08:15:57.272-04:00Spreadsheet mayhemI just chanced across a <a href="http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html">classic anti-spreadsheet screed</a> (no monkey, not screech, screed!).
See also the European Spreadsheet Risks Interest Group (!)'s list of <a href="http://www.eusprig.org/quotes.htm">quotable quotes</a> and very expensive mistakes directly <a href="http://www.eusprig.org/horror-stories.htm">due to the use of spreadsheets</a>.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com2tag:blogger.com,1999:blog-2182494674905880052.post-64067893770561588492012-07-10T18:03:00.000-04:002012-07-10T18:03:10.073-04:00This is *huge*: SAScii packagehttp://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html
Q: How do you make a hairless primate?
Answer 1: Take a hairy primate, wait a few million years and see if Darwin was right.
Answer 2: Make them work in SAS and watch them pull all their hair out.
Unfortunately many public datasets are released as ASCII files with only SAS code to read them in, name all the variables properly, etc.
Now there's a new kid on the block, the SAScii package for R, which will read in the SAS script, parse it, and deliver you an R file instead. Since R has fabulous import/export abilities (via the foreign package), this means even if you are a Stata user you can take advantage.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com5tag:blogger.com,1999:blog-2182494674905880052.post-37716349598884356432012-05-15T11:11:00.000-04:002012-05-15T11:11:01.935-04:00Cleaning data: Removing unwanted characters from large filesHi all,<br />
<br />
On occasion I have to pull in data from poorly formatted sources, such as Excel, Access, or text/comma/pipe-delimited files. Many times I have problems with single quotes, double quotes, carriage returns, or line feeds. Usually I can strip these things out using Notepad. However, I had an ugly problem with a large pipe-delimited file created from an Oracle database: there were hard returns in a memo field, and this caused numerous problems for SAS and STATA when reading in the file. The memo field should be a single variable, but SAS and STATA "see" the carriage return and start a new observation. Nothing worked (infile, infix, proc import, Notepad, etc.).<br />
<br />
After some hard work I found a cool free hex editor (<a href="http://www.hhdsoftware.com/free-hex-editor">http://www.hhdsoftware.com/free-hex-editor</a>) that lets me view the file's raw ASCII, hex, decimal, octal, float, double, and/or binary encoding (see: <a href="http://www.asciitable.com/">http://www.asciitable.com/</a>). This means I could do a global replace on the hard returns, since the hex editor is agnostic to formatting and shows you everything in the file; nothing is hidden. So in this case I opened the file in octal view and replaced all carriage returns (oct: 015) and line feeds (oct: 012) with spaces (oct: 040). And it is super efficient. The other neat thing is that you can look for all sorts of patterns in the data, which makes string searches really easy.<br />
<br />
So if you ever have trouble reading in a raw data file or have some complicated string variables, you may be able to use these free hex editors to help things along.<br />
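If you'd rather script the fix than point-and-click in a hex editor, the same repair can be sketched in Python (hypothetical file layout: pipe-delimited with a known number of fields per record; a record split by an embedded hard return has too few pipes, so we glue it back together):

```python
def rejoin_records(lines, n_fields=4):
    """Rejoin records that were split by hard returns in a memo field.
    Assumes a pipe-delimited file with n_fields fields per record
    (n_fields=4 is a made-up example)."""
    n_pipes = n_fields - 1
    records, buf = [], ""
    for line in lines:
        line = line.rstrip("\r\n")
        # Accumulate lines until we have a full record's worth of pipes
        buf = buf + " " + line if buf else line
        if buf.count("|") >= n_pipes:
            records.append(buf)
            buf = ""
    if buf:                  # trailing partial record, if any
        records.append(buf)
    return records
```

The point, as with the hex editor, is that the record structure (the pipe count), not the line breaks, tells you where a record really ends.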
<br />
Best,<br />PDP - primary data primateAnonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com4tag:blogger.com,1999:blog-2182494674905880052.post-10091198702114103622012-03-22T07:56:00.000-04:002012-03-24T12:55:02.086-04:00STATA access to World Bank dataTalk about bananas! The World Bank has just published a new version of the wbopendata module that gives STATA users direct access to a lot of their data! More information here:<br />
<div>
<br /></div>
<div>
http://data.worldbank.org/news/accessing-world-bank-open-data-in-stata</div>
<div>
<br /></div>
<div>
According to their website:</div>
<div>
1,000 new indicators for a total of 5,300 time series</div>
<div>
Access to the metadata including indicator definitions and other supporting documentation</div>
<div>
Links to maps from within STATA</div>
<div>
<br /></div>
<div>
And it couldn't be easier to get access, just type:</div>
<div>
<br /></div>
<div>
ssc install wbopendata<br />
<br />
The help file gives you all the details</div>
<div>
<br /></div>Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-50925990655082009672012-03-16T23:13:00.002-04:002012-03-16T23:14:57.939-04:00Dates and times in RNothing looks funnier than a patchy simian. That's why we sighed a great sigh of relief when we spotted this article on the lubridate package in R. It saves a great deal of hair pulling.<br /><br />http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-80012106828687170592012-02-20T07:41:00.002-05:002012-02-20T07:45:53.519-05:00STATA - loops for big-dataA user requested the following:<br />
"I've never read a Stata dta in a loop before. Can you give an example of how this would work? Maybe a use-case as well? Thanks."<br />
<br />
There are a number of ways to read sets of data files into STATA using a loop. Following on Ari's example (see previous post), let's say you have a file with a million lines that is too large for STATA, and you want to read in a thousand lines at a time, do some work to make each chunk smaller, then append the smaller data sets together to create your final analytic file. Here is one way:<br />
<br />
*** Loop will start at 1000, then increment by 1000<br />
*** until it gets to one million<br />
forvalues high = 1000(1000)1000000 {<br />
local low = `high' - 999 // lower bound for this chunk<br />
use <i>datafile</i>.<i>dta</i> in `low'/`high', clear<br />
<br />
<<i>insert code to cut down size of file></i> <br />
<br />
*** Now create temporary file <br />
if `high' == 1000 {<br />
save temp, replace //only first time through the loop<br />
}<br />
else {<br />
append using temp <br />
save temp, replace<br />
}<br />
}<br />
<br />
save <i>finalfile.dta</i>, replace<br />
erase temp.dta <br />
*** You can also use a tempfile<br />
*** and avoid the extra erase statement <br />
<br />
<br />
Another way is to use the 'if' qualifier. Let's say you have a large dataset but only want to look at females:<br />
<br />
use <i>datafile.dta</i> if gender=="female"<br />
<br />
You could also put this into a loop to get certain cuts of data; again, the gender example:<br />
<br />
local sex male female<br />
foreach s of local sex {<br />
use <i>datafile.dta</i> if gender == "`s'", clear<br />
** create two data files<br />
** male_newfile.dta and then female_newfile.dta<br />
save `s'_newfile.dta, replace <br />
}<br />
<br />
Back to my bananas...<br />
<br />
Sincerely,<br />
primary data primate<br />
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com4tag:blogger.com,1999:blog-2182494674905880052.post-40852844912984249782012-02-19T22:09:00.003-05:002012-02-19T22:20:49.695-05:00Reading huge files into RSAS is much touted for its ability to read in huge datasets, and rightly so. However, that ability comes at a cost: for smaller datasets, since files remain on the disk rather than in memory (as is the case with Stata and R), it is potentially less fast.<br /><br />If you don't want to learn/buy SAS but you have some large files you need to cut down to size (rather like the gorilla in the corner), R has several packages which can help. In particular, sqldf and ff both have methods to read in large CSV files. More advice is available here: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r<br /><br />If you're a Stata person, you can often get by reading the .dta file in chunks within a loop by adding e.g. "in 1/1000" afterwards if you want to read the 1st through 1000th observation in.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com1tag:blogger.com,1999:blog-2182494674905880052.post-48141826395821642222012-01-27T10:12:00.000-05:002012-01-27T10:13:09.123-05:00iMacros - webscrapingIf you ever need to download a lot of data, use iMacros (http://wiki.imacros.net/Main_Page)<br />
<br />
Recently I wanted to download a large public data set for multiple years (NHANES), but this would have required a lot of manual downloading. For example, the 2007-2008 NHANES wave has 113 individual files, and I wanted all the files from 1999-2010, so close to a thousand different files.<br />
<br />
In order to do this I found a free browser automation tool, iMacros, which can automate anything that you do in a browser.<br />
<br />
The other nice thing is that it can read in data from a .csv file to update what it has to do. So I just cut and pasted the names of the data files, wrote eleven lines of code, and off the program went, resulting in a rich set of repeated cross sections of NHANES with close to 7000 different variables. Here's the code:<br />
<br />
<br />
VERSION BUILD=7401110 RECORDER=FX<br />
TAB T=1<br />
TAB CLOSEALLOTHERS<br />
URL GOTO=ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/<br />
<br />
SET !TIMEOUT 500<br />
SET !DATASOURCE c:\nhanes_names.csv<br />
SET !DATASOURCE_COLUMNS 1<br />
SET !DATASOURCE_LINE {{!LOOP}}<br />
ONDOWNLOAD FOLDER=* FILE={{!COL1}} WAIT=YES<br />
TAG POS=1 TYPE=A ATTR=TXT:{{!COL1}} CONTENT={{!COL1}}<br />
<br />
Play around with the tutorials, but it is a really easy tool with minimal upfront cost and huge potential returns.Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-49746865662669629862011-09-19T17:22:00.000-04:002011-09-19T17:23:58.430-04:00STATA: Capturing information in labelsBeen quiet for a while; here is a tip on using and getting variable and value labels<br />
<br />
Labeling variables and values is useful for understanding what your underlying data represents. This is great while you are
in the STATA environment. But many times you may want to use this information in more dynamic ways.<br />
<br />
For example, let's say that I am using the auto dataset and I want to output a simple table of frequencies that looks like this:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Car type Freq. </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Domestic 52 </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Foreign 22 </span><br />
<br />
In order to do this I have to get the information stored in my variable labels and value label, so follow along:<br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">clear </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">set more off </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">sysuse auto </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">label list </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">tempname my_table
file open `my_table' using ///</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">"c:\my_table.xls", write replace </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** This is where I get the variable label ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** in long hand ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** local var_name : variable label foreign ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">local var_name : var l foreign </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** Now that this is in a local **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** I can use it anywhere **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** so let's write it to our file ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">file write `my_table' ("`var_name'") _tab ("Freq.") _n </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** Now lets get our frequencies **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** and value labels. First get the ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">** name of the label value ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">local nm_label : val l foreign </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> forvalues x=0(1)1 { </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> quietly sum foreign if foreign == `x' </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ** Now to get the label values **</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ** for 0 "Domestic" and 1 "Foreign" ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ** in the value label origin ** </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> local val_name : label `nm_label' `x' </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> file write `my_table' ///</span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> ("`val_name'") _tab (r(N)) _n </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">} </span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> file close `my_table' </span><br />
<br />
<br />
<br />
This is a powerful way to export data in a meaningful fashion and can save you a lot of time. Recall that after the sum, there are a number
of values that we can recover. Type <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">return list</span>; if you need other descriptive statistics, use the <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">detail</span> option for the
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">sum</span> command. Also, you can get post-regression estimates through <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">ereturn list</span> after you run a regression. If you
are familiar with using matrices in STATA, then you can get all of your coefficients, etc.<br />
<br />
More on that later<br />
<br />
Happy coding monkeys...Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-76907218683024144782011-06-24T11:48:00.000-04:002011-09-19T16:36:22.148-04:00STATA: Geographic data - cool new commandsI emailed this to most Wharton PhD health care students, but thought this was worthy posting here for others. There are two new commands in stata that allow you to link with google maps and turn addresses into latitudes and longitudes as well as calculate distances and travel times. <br />
<br />
First type:<br />
<br />
findit geocode<br />
<br />
And install the two commands, geocode and traveltime.<br />
<br />
What do these two commands do?<br />
<br />
First, geocode can take addresses in many sorts of formats and return the latitude and longitude based on Google Geocoding. Because it uses Google, the matches can be pretty good, there is flexibility in the address formats, and geocode can also return a geoscore, which gives you an estimate of the accuracy of the match.<br />
<br />
Once you have the latitudes and longitudes you can use the traveltime command to find the distance between points AND the travel time. What is really cool is that it can be driving, walking, or public transport time.<br />
<br />
These are probably really useful for a lot of hospital based studies, and other things. Either way check out the help documentation to learn more.<br />
<br />
All the best!!! - Hat tip to Mike Harhay who put me onto this.Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com6tag:blogger.com,1999:blog-2182494674905880052.post-61453296701669719442011-06-23T10:16:00.004-04:002011-06-23T10:23:31.027-04:00Two cool new packages for RLet's say you have some data stored in a primate-tive format like paper. But you'd like to get it into something a little more evolved. A new R package called <a href="http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Poisot.pdf">digitize </a>lets you do just that. Click a few points to calibrate the axis, and all your new shiny scatterplot points will be stored as real digital data. Not a tool you'll use often, but invaluable when you need it.<br /><br />If you've ever monkeyed around with ArcGIS, you'll know that it produces pretty maps. Unfortunately its interface is terrible, it crashes frequently, and it's not very easy to automate. R, on the other hand, does not crash and is easy to automate, but its maps are pretty ugly. Enter rworldmap, a package which produces pretty world maps like this:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.oga-lab.net/RGM_results/rworldmap/rworldmap-package/rworldmap-package_001_big.png"><img style="cursor:pointer; cursor:hand;width: 500px; height: 400px;" src="http://www.oga-lab.net/RGM_results/rworldmap/rworldmap-package/rworldmap-package_001_big.png" border="0" alt="" /></a><br /><br />It's a marked improvement. 
<a href="http://journal.r-project.org/archive/2011-1/RJournal_2011-1_South.pdf">Read more </a>in the R Journal.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-21261116372803519082011-06-09T16:37:00.000-04:002011-06-09T16:39:03.010-04:00We are now a part of R-bloggers.com<a href="http://www.R-bloggers.com">R-bloggers</a> is a site that aggregates many of the best R blogs on the internet. We're glad they've allowed our R-related posts to be aggregated there. If you mainly write in R, it's worth checking them out.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-56124200577273393122011-06-09T15:18:00.004-04:002011-06-09T15:31:14.821-04:00R: Speeding things upR is many things, but it's not exactly speedy like a <a href="http://ngm.nationalgeographic.com/ngm/0402/resources_cre.html">Patas Monkey</a>. In fact, while it is much faster than many other solutions, R is notably slower than Stata (even inspiring talks that it should be <a href="http://www.r-bloggers.com/%E2%80%9Csimply-start-over-and-build-something-better%E2%80%9D/">rewritten from scratch</a>!).<br /><br />Fortunately, Radford Neal has been hard at work speeding R up, and has released some <a href="http://radfordneal.wordpress.com/2011/06/09/new-patches-to-speed-up-r-2-13-0/">new patches </a>to play with if you find it too slow. 
You can also try <a href="http://dirk.eddelbuettel.com/code/rcpp.html">writing key sections in C++</a>, or using <a href="http://www.revolutionanalytics.com/products/enterprise-performance.php">Revolution Analytics' offerings</a> (free for academics).<br /><br />For extreme speed needs, however, R can't be beat, as it has long offered <a href="http://cran.r-project.org/web/packages/gputools/index.html">graphics-card based extreme parallelism </a>that commercial solutions are only <a href="http://www.mathworks.com/discovery/matlab-gpu.html">beginning to match</a>.<br /><br />Of course, for more prosaic needs, focusing on vectorizing key operations can solve speed troubles. And it's worth noting that the $1,000+ per copy that Stata costs can buy an awful lot of extra processing power to throw at the problem.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-5440486901646079622011-04-18T15:36:00.000-04:002011-04-18T15:36:18.313-04:00SAS: Design of experiments - Marketing researchAll,<br />
<br />
There have been some requests for SAS tips, so I'll post a couple of useful things over the next few weeks. SAS has a lot of functions that STATA doesn't have, or that are hard to do in STATA. For example, doing maps with data is quite easy, like displaying immunization rates by country on a world map (more on this later). <br />
<br />
For this post, I just wanted to point people to an excellent resource if you ever have to design an experiment. <br />
<br />
http://support.sas.com/techsup/technote/mr2010.pdf<br />
<br />
This was put together by Warren Kuhfeld and is an excellent guide on how to design discrete choice and conjoint studies using SAS, along with a number of other marketing-based analyses. These obviously come out of the marketing area, but the techniques are being increasingly adapted to the health care field to elicit patient or provider preferences. I found it quite useful in a discrete choice experiment I will be testing on physicians dealing with smoking cessation.<br />
<br />
Best,<br />
Monkey out...Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com0tag:blogger.com,1999:blog-2182494674905880052.post-46061030932203831732011-04-13T16:50:00.000-04:002011-09-19T16:36:22.133-04:00STATA: file write, or a way to export almost anythingThis is a bit of a repost, but it is so useful that I thought it would be useful to people.<br />
<br />
Ever want to get a formatted table of summary statitics exported directly from Stata? Outreg2 does a great job with exporting regression results, but what about variable means, variances, or other summary statitics. A great way to do this is with <i>file write</i>. This is a great command and provides you with a lot of control. Its simple:<br />
<br />
sysuse auto<br />
file open myfile using "C:/mytable.txt", write replace<br />
file write myfile "Table of descriptive stats" _n _n<br />
file write myfile _tab "Mean" _tab "5th pct" _tab "95th pct" _n<br />
quietly sum price, detail<br />
file write myfile "Price" _tab %7.2f (r(mean)) _tab %7.2f (r(p5)) ///<br />
_tab %7.2f (r(p95)) _n<br />
file close myfile<br />
<br />
Here is what just happened. We first open a file with the handle "myfile" that is associated with a text file "mytable.txt". Then I write a header on the first line. The <i>_n</i> sends a hard return, so I sent two hard returns after the header. Then I write my column headers, separated by tabs (<i>_tab</i>). Then I write my formatted summary statistics (<i>%7.2f</i>), again separated by tabs. Note: you can write anything that is shown in return list or ereturn list, so it is pretty flexible. Finally, I close the file. I have created a tab-delimited text file that we can open in Excel or elsewhere.<br />
<br />
When you combine this with loops and lists of variables that you can store in a local macro, it makes exporting standard tables very easy and automated. See my <a href="http://thedatamonkey.blogspot.com/2010/02/stata-code-for-easy-descriptive-table.html">February 2010 </a>post for a more complicated example.<br />
<br />
Happy coding...Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-35788120246779968932011-03-10T00:38:00.002-05:002011-03-10T00:47:54.581-05:00R: Drop factor levels in a datasetR has factors, which are very cool (and somewhat analogous to labeled levels in Stata). Unfortunately, the factor list sticks around even if you remove some data such that no examples of a particular level still exist<br /><br /># Create some fake data<br />x <- as.factor(sample(head(colors()),100,replace=TRUE))<br />levels(x)<br />x <- x[x!="aliceblue"]<br />levels(x) # still the same levels<br />table(x) # even though one level has 0 entries!<br /><br />The solution is simple: run factor() again:<br />x <- factor(x)<br />levels(x)<br /><br />If you need to do this on many factors at once (as is the case with a data.frame containing several columns of factors), use drop.levels() from the gdata package:<br />x <- x[x!="antiquewhite1"]<br />df <- data.frame(a=x,b=x,c=x)<br />df <- drop.levels(df)<br /><br />Now I'm going to quit monkeying around and get to sleep.Ari F.http://www.blogger.com/profile/15354427423133432379noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-84447469502562654632011-03-09T11:19:00.000-05:002011-09-19T16:36:22.156-04:00STATA: Useful tidbits and are you there?Hey all,<br />
<br />
A couple of things: if you find this useful, please comment or "follow us" on the blog. Questions? Leave them in the comments or post them (or just email me or Ari and we can post):<br />
<br />
Useful tidbit?<br />
Two super important user-written commands for STATA that you may not be aware of but that will make your life A LOT EASIER:<br />
<br />
<br />
<i>outreg2</i><br />
<br />
and <br />
<br />
<i>logout</i><br />
<br />
<br />
<i>outreg2</i>: exports your regressions to journal ready tables in text, excel, latex, or other formats. It has a lot of options such as controlling formatting, adding in stars, number of decimal places, etc. It can also append multiple models to the same output file.<br />
<br />
<i>logout</i>: This nice little utility also allows you to output almost anything that appears on the STATA window to a file like tables of summary statistics, cross-tabs, etc.<br />
<br />
How do you add them to your local copy of STATA? Just type <br />
<br />
<i>findit outreg2</i> and <i>findit logout</i><br />
<br />
Then just download the .ado and .hlp files and you are all set. I give them my highest rating, five bananas, so download them now!Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com3tag:blogger.com,1999:blog-2182494674905880052.post-71266376887735990732011-03-07T12:44:00.002-05:002011-09-19T16:36:22.141-04:00STATA: To the Power of _n and _N, filling in missing dataI'm posting this based on a question I got from one of the other students, and it is a common enough of an issue that I thought it would be worthwhile posting a solution.<br />
<br />
STATA has a number of built-in variables that you can use in pretty powerful ways. Two key ones are _n and _N, where _n is the observation number and _N is the total number of observations in your data. One way to use these is to have STATA look "up" or "down" your data.<br />
<br />
For example, many times you will have data in the following format<br />
<br />
id group name<br />
1 1 "Mickey"<br />
2 1 ""<br />
3 1 ""<br />
4 2 "Davy"<br />
5 2 ""<br />
6 3 "Peter"<br />
7 4 "Michael"<br />
8 4 ""<br />
9 4 ""<br />
<br />
But you want your data to look like this<br />
<br />
id group name<br />
1 1 "Mickey"<br />
2 1 "Mickey"<br />
3 1 "Mickey"<br />
4 2 "Davy"<br />
5 2 "Davy"<br />
6 3 "Peter"<br />
7 4 "Michael"<br />
8 4 "Michael"<br />
9 4 "Michael"<br />
<br />
A very simple solution is:<br />
<br />
gsort group -name<br />
replace name = name[_n-1] if name=="" & _n !=1<br />
<br />
STATA will then go through the data, in the order it is sorted*, and pull the string value from the previous observation [_n-1] into the current observation if it meets the conditions noted (i.e. it isn't the first observation and the current observation has a missing value in the name variable)<br />
<br />
* <b>Important note</b>: For string variables you need to specify <i>gsort group -name</i>. The "-" makes sure that the missing values are below the non-missing. For numeric variables, the <i><b>opposite</b> </i>is required, namely <i>gsort group num_var </i>because STATA handles missing numeric values as very large numbers. <br />
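For the monkeys working outside STATA, the same fill-down logic can be sketched in Python (made-up data; one pass, remembering the last non-missing name seen in each group):

```python
def fill_down(rows):
    """rows: list of (id, group, name) tuples, sorted so that the
    non-missing name comes first within each group."""
    last_name = {}            # last non-missing name seen, per group
    filled = []
    for rid, grp, name in rows:
        if name:
            last_name[grp] = name
        filled.append((rid, grp, last_name.get(grp, "")))
    return filled
```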
<br />
Also, if your data has been tsset (declared as a time series) you can use tsfill. Ah, but that is for a later post. I need a banana...good monkey...<br />
<br />
Happy Coding!!!Anonymoushttp://www.blogger.com/profile/07654900760380202548noreply@blogger.com2