Tuesday, May 15, 2012

Cleaning data: Removing unwanted characters from large files

Hi all,

On occasion I have to pull in data from poorly formatted sources, such as Excel, Access, or text/comma/pipe-delimited files. And many times I have problems with single quotes, double quotes, carriage returns, or line feeds. Usually I can strip these things out using Notepad. However, I had an ugly problem with a large pipe-delimited file that was created from an Oracle database. The problem was that there were hard returns in a memo field, and this caused numerous problems for SAS and Stata trying to read in the file. The memo field should be a single variable, but SAS and Stata "see" the carriage return and treat it as the start of a new observation, causing multiple problems. Nothing worked (infile, infix, proc import, Notepad, etc.).
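
To make the problem concrete, here is a toy pipe-delimited file (the data are made up):

    id|memo
    1|customer called about
    billing issue
    2|shipment delayed

Record 1's memo contains a hard return, so "billing issue" lands on its own line, and SAS and Stata parse it as a second, malformed observation instead of the tail of record 1's memo.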

After some hard work I found a cool free hex editor (http://www.hhdsoftware.com/free-hex-editor) that lets me view a file's raw bytes as ASCII, hex, decimal, octal, float, double, and/or binary (see: http://www.asciitable.com/). What this means is that I could do a global replace on the hard returns, because the hex editor is agnostic to formatting and shows you everything in the file; nothing is hidden. So in this case I opened the file in octal view and replaced all carriage returns (oct: 015) and line feeds (oct: 012) with spaces (oct: 040). And it is super efficient. The other neat thing is that you can search for all sorts of byte patterns in the data, which makes string searches really easy.
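
If you would rather script the same fix, here is a minimal Python sketch of the identical byte-level substitution (the file names are made up). One caveat: this replaces every carriage return and line feed, including legitimate end-of-record breaks, so it only makes sense when those can be restored afterward or, as in my file, the records are terminated some other way.

    # Read the file as raw bytes so no hidden characters are missed
    with open("oracle_export.txt", "rb") as f:
        raw = f.read()

    # Replace every carriage return (0x0D) and line feed (0x0A) with a
    # space (0x20), byte for byte, like the hex editor's global replace
    cleaned = raw.replace(b"\r", b" ").replace(b"\n", b" ")

    # Write the cleaned bytes out to a new file
    with open("oracle_export_clean.txt", "wb") as f:
        f.write(cleaned)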

So if you ever have trouble reading in a raw data file or have some complicated string variables, you may be able to use one of these free hex editors to help things along.

Best,
PDP - primary data primate