Thursday, April 16, 2009

Parsing US Census Data

The goal is to extract the state abbreviation, state FIPS code, place FIPS code, city name, latitude, and longitude from places2k.txt.
http://www.census.gov/geo/www/gazetteer/gazette.html

Here is the layout of places2k.txt:

* Columns 1-2: United States Postal Service State Abbreviation
* Columns 3-4: State Federal Information Processing Standard (FIPS) code
* Columns 5-9: Place FIPS Code
* Columns 10-73: Name
...
* Columns 144-153: Latitude (decimal degrees). First character is blank or "-", denoting North or South latitude respectively
* Columns 154-164: Longitude (decimal degrees). First character is blank or "-", denoting East or West longitude respectively

luan@luan-desktop:~/app/software/census_data$ head -5 places2k.txt
AL0100124Abbeville city 2987 1353 40301945 120383 15.560669 0.046480 31.566367 -85.251300
AL0100460Adamsville city 4965 2042 50779330 14126 19.606010 0.005454 33.590411 -86.949166
AL0100484Addison town 723 339 9101325 0 3.514041 0.000000 34.200042 -87.177851
AL0100676Akron town 521 239 1436797 0 0.554750 0.000000 32.876425 -87.740978
AL0100820Alabaster city 22619 8594 53023800 141711 20.472605 0.054715 33.231162 -86.823829

luan@luan-desktop:~/app/software/census_data$ cut -c -2,3-4,5-9,10-73,144-153,154-164 --output-delimiter="|" places2k.txt | head -5
AL|01|00124|Abbeville city | 31.566367| -85.251300
AL|01|00460|Adamsville city | 33.590411| -86.949166
AL|01|00484|Addison town | 34.200042| -87.177851
AL|01|00676|Akron town | 32.876425| -87.740978
AL|01|00820|Alabaster city | 33.231162| -86.823829

You could write a C or Java program to parse this data, but that would take longer than a one-line shell command.
If the cut command does not meet your needs, look at the awk programming language before jumping into C or Java.
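
As a rough sketch of what the awk route might look like, assuming the column positions listed in the layout above: the function below pulls out the same six fields with substr and also trims the padding around the name and coordinates, which cut cannot do on its own. The names parse_places and trim are made up for this illustration.

```shell
# Hypothetical helper: reads fixed-width places2k.txt lines on stdin and
# prints state|state FIPS|place FIPS|name|lat|lon with padding removed.
parse_places() {
  awk '
    # Strip leading and trailing blanks from a field.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    {
      state = substr($0, 1, 2)            # columns 1-2
      fips  = substr($0, 3, 2)            # columns 3-4
      place = substr($0, 5, 5)            # columns 5-9
      name  = trim(substr($0, 10, 64))    # columns 10-73
      lat   = trim(substr($0, 144, 10))   # columns 144-153
      lon   = trim(substr($0, 154, 11))   # columns 154-164
      print state "|" fips "|" place "|" name "|" lat "|" lon
    }'
}
```

It could be run as `parse_places < places2k.txt | head -5`. Because awk has the whole line in $0, it is easy to extend this with filtering (for example, only rows for one state) without reaching for C or Java.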
