The Data Monkey: Cleaning data: Removing unwanted characters from large files

Hi all,

On occasion I have to pull in data from poorly formatted sources, such as excel, access or text/comma/pipe deliminated files, etc. And many times I have problems with single quotes, double quotes, carriage returns, or line feeds. Usually I can strip these things out using notepad. However, I had an ugly problem with a large pipe deliminated file which was created from an oracle database. The problem was that there were hard returns in a memo field and this caused numerous problems for SAS and STATA trying to read in the file. The memo field should be a single variable, but SAS and STATA "see" the carriage return and put it as a new observation, causing multiple problems. Nothing worked (infile, infix, proc import, notepad ect.).

After some hard work I found a cool free text editor (http://www.hhdsoftware.com/free-hex-editor) that lets me look at the files raw ascii, hex, decimal, float, double and/or binary encoding of the file (see: http://www.asciitable.com/). What this means is that I could do a global replace on any of the hard returns as the hexeditor is agnostic to formatting and shows you everything in the file, nothing is hidden. So in this case I opened the file in octadecimal and replace all carriage returns (oct: 014) and line feeds (oct: 012) with spaces (oct: 040). And it is super efficient. The other neat thing is that you can look for all sorts of patterns etc. in the data so makes string searches really really easy.

So if you ever have trouble reading in a raw data file or have some complicated sting variables, you may be able to use these free hex editors to help things along.

Best,
PDP - primary data primate

4 comments:

cnukausJuly 9, 2012 at 6:05 PM
if you are talking Windows notepad (probably not), then there are so many free hex editing tools to use for removing LF/CR
DimitriyJuly 30, 2012 at 9:37 PM
With Stata, the combo of hexdump and filefilter usually does the trick.
ChrisJuly 22, 2013 at 6:59 AM
Thanks. That's the answer I was looking for.
Swag znaczenieAugust 17, 2013 at 10:38 AM
I just came here to sya SWAG!

The Data Monkey

Tuesday, May 15, 2012

Cleaning data: Removing unwanted characters from large files

4 comments:

Followers

Blog Archive

Contributors