The Data Monkey: 2010

Thursday, August 5, 2010

STATA & SAS: Life without STATRANSFER

Let's pretend for a second: The evil software overlord, let's call him Dr. Mulcahy because that sounds sinister, won't release a copy of STATransfer to you. Even as you plead "But great and wonderful Dr. Mulcahy, I really need to get my data from SAS to STATA!!!"

"No!" he bellows in his fractious Greek way.

You stumble off humiliated and ashamed, but luckily you run into the wondrous and helpful data monkey, let's call him Dr. McElligott because that is a nice sounding name, and explain your problem.

The wonderful data monkey explains, ah that is easy, here is how:

SAS to STATA:

save data as a SAS XPORT file, then to import into stata use "fdause filename"

or more directly

PROC EXPORT DATA= dirname.file_to_export
OUTFILE= "C:\get_dissertation_done\filename.dta"
DBMS=STATA REPLACE;
RUN;

From STATA to SAS:
In stata use "fdasave filename" then you can import into SAS

Additionally, you can save files as csv files and import in each for

STATA: use "insheet" and "outsheet"

SAS:
proc export data = filename;
outfile = 'pathname\filename.csv';
dbms = csv;
replace;
run;

or you can use the menu driven sas export facility

"Thank you oh wonderful datamonkey, you have saved my dissertation! Let's get a beer!" you exclaim.

"I'll get you someday datamonkey" gurgles the evil Dr. Mulchay as he slouches back to his cavelike cubicle.

-The End-

Friday, May 28, 2010

STATA: Did you know stata has all the ICD9 codes?

Yes that's right fellow dissertators, need to know what a specific ICD9 code is, or what is within a range of icd9 codes? You can just type 'icd9 lookup' For example:

icd9 lookup 740*

Returns all ICD9 codes that begin with 740
4 matches found:
740 anencephalus/simil anom*
740.0 anencephalus
740.1 craniorachischisis
740.2 iniencephaly

Interested in pneumonia, but don't know the relevant codes, type

icd9 search "pneumonia"

47 matches found:
003.22 salmonella pneumonia
011.6 tuberculous pneumonia*
011.60 tb pneumonia-unspec
011.61 tb pneumonia-no exam
011.62 tb pneumonia-exam unkn
011.63 tb pneumonia-micro dx
011.64 tb pneumonia-cult dx
011.65 tb pneumonia-histo dx
011.66 tb pneumonia-oth test
041.3 klebsiella pneumoniae
055.1 postmeasles pneumonia...

There are a number of other ICD9 tools (like dealing with those pesky dots), just type 'help icd9'

Thursday, May 13, 2010

STATA: Adding a dot to a graph

I recently had the need to put a single dot of a different color on a graph, and noted that stata doesn't have an option to allow you to place a single dot on a graph. You can put a line using xline and yline options (see below). I came up with the following work around:

sysuse auto
sum mpg
local avgmpg : display r(mean)
gen avgmpg = `avgmpg'

sum price
local avgprice : display r(mean)
gen avgprice = `avgprice'

twoway (scatter mpg price) ///
(scatter avgmpg price if _n==1, mcolor(red) legend(off) ///
text(`avgmpg' `avgprice' "average"))

What I did was create the average price and mpg as varabiles, then drew my primary scatter plot and then overlaid a single dot. The restriction if _n==1 tells stata to only use observation number one. This prevents me from having _N (total number of observations in the data set) dots all right on top of each other, which will use up more resources. The text option just allows me to place text at a given location.

If I wanted to intersecting lines at the averages, I would have done

twoway (scatter mpg price, xline(`avgprice') yline(`avgmpg'))

Happy Coding!

Wednesday, May 12, 2010

STATA: Power calculations

There may be instances, while planning your study, that you need to determine what sample size will give you sufficient power to detect an effect. Formal power calculations are many times used in clinical trials, but this can also be used for econometric studies. For instance, I recently had to do a series of power calculations to determine if we needed to buy the Medicare 100% file, or if we had enough observations in the 5% file. Additionally, if you are at the start of a research project it may be helpful to know if the data that you have on hand is sufficient to detect your hypothesized effect.

In order to do a power calculation you will need four items:
alpha: 0.05 (probability of choosing the alternate hypothesis when the null is true)
power: 0.80 (this is 1 - beta, i.e. the probability of choosing the alternative hypothesis when true)
effect size: how large and effect do you hypothesize, the difference in effect between your treatment and control groups
variance

Once you have these you can use powercal in stata, which is an add on so you will have to download it, type:

search powercal

in stata

Wednesday, May 5, 2010

Two Way Graphs in STATA

From Rob Lieberthal:

If you are using the graph twoway command to create graphs in Stata will render the graph in the same order as your command. For example:

graph twoway (scatter mpg car_weight) (lfitci mpg car_weight)

Will produce a scatter plot of mpg as a function of car_weight, then overlay the fitted line with 95% confidence interval on top of the scatter plot. The confidence interval will cover up most of your scatter plot points. If you want Stata to show the points over the confidence interval, switch the order of the scatter and lfitci commands like so:

graph twoway (lfitci mpg car_weight) (scatter mpg car_weight)

Monday, April 5, 2010

STATA: Working with large databases

STATA is limited in that it has to have all observations in active memory. This can cause problems especially for really large files like Medicare, HCUP, which will typically exceed the computers ability to set mem. However, many times you don't need a lot of the information that is in these large files.

In order to read in only the information that you need you can use an 'if' clause with your 'use' statemenat and direct stata to only pull in the variables and observations that meet certain conditions.

use charges dx1 age admit if dx1=='714' using neds_2006_core.dta, clear

This example will only pull in people with a primary diagnosis of 714 (Rhematoid arthritis) and the variables 'charges', 'dx1', 'age', and 'admit'. This many times will obviate the need of going to SAS to cut the data first. Obviously using a * in the var list will pull in all variables.

You can get even more complicated by using logical operators as well.

STATA: Getting variable information into a local macro

All,

Many times one may want to get information from a variable and use it dynamically in your code. There is an easy and powerful way to do this.

Let's say that I wish to have a loop based on the max of a variable you can do

sum var1
local numbr : display r(max)
forvalues i = 1(1)`numbr' {
....
}

Anything that you can "display" you can get into a macro. To see a list of what you can recover type

return list

after you do the sum or sum, detail and

ereturn list

after a regression

Thursday, March 4, 2010

SAS: I know who to do it in STATA, but how do I do it in SAS?

http://faculty.fuqua.duke.edu/home/blanc004/data_programming/sas_to_stata/sas_to_stata.html

Wednesday, February 3, 2010

SAS: code for many comorbid conditions by ICD9 (can be adapted)

*The following code has the ICD9 codes for many common coditions;
*NOTE: ICD9 codes have to be "dot" free;
*Code was originally intended to loop over a number of different years in the Medicare file;
*Note DX(Z) is to search over all diagnosis (primary and all secondary);

*CHD;
if substr(DX(Z),1,3) in ("410", "411", "412", "414") then dx_&type.&yr._chd = 1;
else if substr(DX(Z),1,4) in ("4130", "4139", "4292") then dx_&type.&yr._chd = 1;

*Diabetes;
if substr(DX(Z),1,3) in ("250") then dx_&type.&yr._diabet = 1;
else if substr(DX(Z), 1, 4) in ("3572") then dx_&type.&yr._diabet = 1;
else if substr (DX(Z), 1, 5) in ("36200", "36201", "36641") then dx_&type.&yr._diabet= 1;

*Osteoarthitis;
if substr(DX(Z),1,3) in ("715") then dx_&type.&yr._ostoarth = 1;
*Other arthritis;
if substr(DX(Z),1,3) in ('716') then dx_&type.&yr._otharth=1;
*Arthritis;
if substr(DX(Z),1,3)in ('714', '715', '716') then dx_&type.&yr._arth=1;

*Chronic Heart Failure;
if substr(DX(Z),1,3) in ("416", "417", "425", "428") then dx_&type.&yr._chf = 1;
else if substr(dx(Z),1,4) in ("4150", "4290", "4291") then dx_&type.&yr._chf = 1;
else if substr (DX(Z),1,5) in ("40201", "40211", "40291", "40401", "40403", "40411", "40413", "40491", "40493") then dx_&type.&yr._chf = 1;

*Hypertension;
if substr(dx(z),1,3) in ('401', '402', '403', '404', '405') then dx_&type.&yr._hypten = 1;

*Hyperlipidemia;
if substr(dx(Z),1,4) in ("2720", "2721", "2722", "2723", "2724") then dx_&type.&yr._hyplip = 1;

*acute myocardial infarction;
if substr(dx(z),1,3) in ("410") then dx_&type.&yr._ami = 1;

*other acute and subacture forms of ishemic heart disease;
if substr(dx(Z),1,3) in ("411") then dx_&type.&yr._othrAHD = 1;

*Old myocardial infaction;
if substr(dx(Z),1,3) in ("412") then dx_&type.&yr._oldMI = 1;

*Angina;
if substr(dx(z),1,3) in ("413") then dx_&type.&yr._angina = 1;

*other forms of chronic ishemic heart disease;
if substr(dx(z),1,3) in ("414") then dx_&type.&yr._othrCHD = 1;

*Atherosclerosis;
if substr(dx(z),1,3) in ("440") then dx_&type.&yr._athro = 1;
else if substr(dx(z),1,4) in ("4292", "4370", "4371") then dx_&type.&yr._athro = 1;

*Cerebrovascular disease;
if substr(dx(Z),1,3) in ('430', '431','436') then dx_&type.&yr._cerbvas = 1;
else if substr(dx(z),1,4) in ('4320','4321','4329') then dx_&type.&yr._cerbvas = 1;
else if substr(dx(z),1,5) in ('43301','43311','43321', '43331', '43381', '43391', '43401','43411','43491') then dx_&type.&yr._cerbvas = 1;

*Stroke;
if substr(dx(z),1,3) in ("433", "435", "436") then dx_&type.&yr._stroke = 1;
else if substr(dx(z),1,4) in ("4349", "4340") then dx_&type.&yr._stroke =1;

*Peripheral Vascular Disease;
if substr(dx(z),1,3) in ("441", "443") then dx_&type.&yr._prvasdis=1;

*Alcohol or drug dependence;
if substr(dx(z),1,3) in ("303", "304", "305") then dx_&type.&yr._drug = 1;
else if substr(dx(z),1,4) in ("V113") then dx_&type.&yr._drug = 1;

*Cancer;
if substr(dx(z),1,2) in ("14", "15", "16", "17", "18", "19") then dx_&type.&yr._cancer = 1;
else if substr(dx(z),1,3) in ("200", "201", "202", "203", "204",
"205", "206", "207", "208", "230", "231", "232", "233", "234") then dx_&type.&yr._cancer = 1;

*Benign neoplasms;
if substr(dx(z),1,3) in ('209', '235', '236', '237', '238', '239') then dx_&type.&yr._neoplas = 1;
else if substr(dx(z),1,2) in ('21', '22') then dx_&type.&yr._neoplas = 1;

*Rhematoid arthritis;
if substr(dx(z),1,3) in ("714") then dx_&type.&yr._ra = 1;

*End state renal disease;
if substr(dx(z),1,3) in ('584', '585', '586') then dx_&type.&yr._esrd = 1;

*Obesity;
if substr(dx(z),1,4) in ("2780") then dx_&type.&yr._obese = 1;

*Mood disorders;
if substr(dx(z),1,3) in ('296', '300') then dx_&type.&yr._mood = 1;
else if substr(dx(z),1,4) in ('3062') then dx_&type.&yr._mood = 1;

*Depression;
if substr(DX(Z),1,4) in ('2962','2963', '3004', '311') then dx_&type.&yr._deprss = 1;

*Alzhimers;
if substr(dx(z),1,4)='3310' then dx_&type.&yr._alzhm=1;

*Dementia;
if substr(dx(z),1,4) in ('3310', '3311', '3314', '3318', '0461','4380') then dx_&type.&yr._dmntia=1;

*Osteoporosis;
if substr(dx(z),1,4) in ("7330") then dx_&type.&yr._osteo = 1;

*COPD;
if substr(dx(z),1,3) in ("490", "491", "492", "494", "496") then dx_&type.&yr._copd = 1;
else if substr(dx(z), 1, 4) in ("4932") then dx_&type.&yr._copd = 1;

*Asthma;
if substr(dx(z),1,3) in ("493") then dx_&type.&yr._asthma = 1;

*Mental illness;
if "290" <= dx(z) <= "319" then dx_&type.&yr._mental = 1;

end;
run;

STATA code for easy descriptive table export

*This code creates two separate tables for summary statistics for different variables
*In my code I was getting average immunization rates by vaccine for different *development groups and regions

*You may change which summary statitics are pulled in the local display options

local vac dtp hepb hib polio pneumo rota yf measles
local vaclabels "DTP3" "Hepatitis B" "HIB" "Polio" "Pneumo. Conj." "Rotavirus" "Yellow fever" "Measles"

tempname vacyear_dev
tempname vacyear_geo
file open `vacyear_dev' using "Prelim analysis\dsc_vactoadd_`dsource'.txt", write replace
file write `vacyear_dev' ("Table: Number of years to add a vaccine to the entire national schedule by development status") _n _n
file write `vacyear_dev' ("Group") _tab ("Mean") _tab ("Median") _tab ("s.d.") _tab ("Min") _tab ("Max") _tab ("Count") _tab ("Prcnt add") _n

file open `vacyear_geo' using "Prelim analysis\dsc_vactoadd_geo1.txt", write replace
file write `vacyear_geo' ("Table: Number of years to add a vaccine to the entire national schedule by geography") _n _n
file write `vacyear_geo' ("Group") _tab ("Mean") _tab ("Median") _tab ("s.d.") _tab ("Min") _tab ("Max") _tab ("Count") _tab ("Prcnt add") _n

local i 1
foreach name of local vac {
local vacname : word `i' of "`vaclabels'"
file write `vacyear_dev' ("`vacname'") _n
file write `vacyear_geo'("`vacname'") _n

foreach group in "High income" "Upper-middle income" "Lower-middle income" "Lower income" {
* foreach group in "Developed economy" "Economy in transition" "Developing" "Least developed" {

quietly sum yrsto_`name' if devstatus== "`group'", detail
local vmean : display %-9.1f r(mean)
local vmed : display %-9.0f r(p50)
local vsd : display %-9.1f r(sd)
local vmin : display %-9.1f r(min)
local vmax : display %-9.1f r(max)
local vn : display %-9.0f r(N)
quietly sum max_grp if devstatus== "`group'"
local vpct : display %-9.3f `vn'/r(N)
file write `vacyear_dev' ("`group'") _tab (`vmean') _tab (`vmed') _tab (`vsd') _tab (`vmin') _tab (`vmax') _tab (`vn') _tab (`vpct') _n
}
if "`name'" != "pneumo" {
quietly kwallis yrsto_`name', by(devstatus)
local vchi : display %-9.4f r(chi2)
local vdf : display %-9.0f r(df)
local vprob : display %-9.5f chi2tail(`vdf', `vchi')
file write `vacyear_dev' ("Kruskall-wallis test") _tab ("chi2 = ") (`vchi') (" with ") (`vdf') (" df") _tab _tab _tab ("probaility = ") (`vprob') _n _n
}
foreach group in "AFRO" "AMRO" "EMRO" "EURO" "WPRO" "SEARO" {
quietly sum yrsto_`name' if WHO_region == "`group'", detail
local vmean : display %-9.1f r(mean)
local vmed : display %-9.0f r(p50)
local vsd : display %-9.1f r(sd)
local vmin : display %-9.1f r(min)
local vmax : display %-9.1f r(max)
local vn : display %-9.0f r(N)
quietly sum max_grp if WHO_region == "`group'"
local vpct : display %-9.3f `vn'/r(N)
file write `vacyear_geo' ("`group'") _tab (`vmean') _tab (`vmed') _tab (`vsd') _tab (`vmin') _tab (`vmax') _tab (`vn') _tab (`vpct') _n
}
quietly kwallis yrsto_`name', by(WHO_region)
local vchi : display %-9.4f r(chi2)
local vdf : display %-9.0f r(df)
local vprob : display %-9.5f chi2tail(`vdf', `vchi')
file write `vacyear_geo' ("Kruskall-wallis test") _tab ("chi2 = ") (`vchi') (" with ") (`vdf') (" df") _tab _tab _tab ("probaility = ") (`vprob') _n _n

local i = `i' + 1
}
file close `vacyear_dev'
file close `vacyear_geo'

Macro to combine multiple sheets of EXCEL data in stata

Sub Combine()
Dim J As Integer

On Error Resume Next
Sheets(1).Select
Worksheets.Add ' add a sheet in first place
Sheets(1).Name = "Combined"

' copy headings
Sheets(2).Activate
Range("A1").EntireRow.Select
Selection.Copy Destination:=Sheets(1).Range("A1")

' work through sheets
For J = 2 To Sheets.Count ' from sheet 2 to last sheet
Sheets(J).Activate ' make the sheet active
Range("A1").Select
Selection.CurrentRegion.Select ' select all cells in this sheets

' select all lines except title
Selection.Offset(1, 0).Resize(Selection.Rows.Count - 1).Select

' copy cells selected in the new sheet on last line
Selection.Copy Destination:=Sheets(1).Range("A65536").End(xlUp)(2)
Next
End Sub

The Data Monkey