Data journalists: Audit a csv file without ever opening it

Xray Specs

photobunny (flickr)

I’ve been spending some quality time with csvkit, a utility library assembled by the indefatigable Chris Groskopf.

Whenever I get a new dataset, I do a quick data audit to see what’s included and what kind of shape it’s in. I learned to do this work in Excel and Access, but I’m trying to bust out of that proprietary penitentiary called Microsoft Office. Life is so much more fun on the outside.

I’ve created a shell script that uses csvkit commands to peek inside a csv file without ever opening it.

The script sends the results of the data audit to a text file with three headings:

  • Column names: This is huge! I don’t have to boot Bill Gates to get this info, it is right there for me before I’ve ever opened the file.
  • The first ten rows of the first five columns: It’s a little arbitrary, but it will give you a feel for what the data looks like. Are first, middle and last names crammed into one column or broken up? What about city and state?
  • Column stats: A utility called csvstat generates a summary of each column, including number of unique values, if there are any nulls, and row counts. If there are numbers in the column, you’ll see the smallest and largest numbers along with mean and median. Amazing.

What you need to do

Follow Groskopf’s instructions for installing csvkit (time commitment: roughly 30 seconds, if you are comfortable at the command line — if you are not, I have just the web tutorial for you!)

Now create the shell script. I call mine audit.sh:

#!/bin/bash
 
usage () { echo "${0##*/} inputfile outputfile"; exit 1; }
 
(($#==2)) || usage
 
INPUTFILE="$1"
OUTPUTFILE="$2"
 
cat <$OUTPUTFILE
$(date "+Generated on %m/%d/%y at %H:%M:%S")
 
DATA AUDIT: $1
 
------------
COLUMN NAMES
------------
 
$(csvcut -n $INPUTFILE)
 
---------------------------------------
FIRST TEN ROWS OF FIRST FIVE COLUMNS
---------------------------------------
 
$(csvcut -c 1,2,3,4,5 $INPUTFILE | head -n 10)
 
------------
COLUMN STATS
------------
 
$(csvcut $INPUTFILE | csvstat )
 
---END AUDIT
EOF
 
echo "Audited!"
 
Don't forget to make the script executable (I'm new to this stuff and I <em>always</em> forget):
 
$ chmod +x audit.sh
 
<em>Now run the script!</em> Pick a csv file and type:
 
$ ./audit.sh filename.csv DataAudit.txt

You should end up with a file called DataAudit.txt that looks something like this. You can name the output file whatever you want, just replace DataAudit.txt when calling the script.

Got a better way to do this? I’d love to hear about it!

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">