Recent posts

Protect your data before you mess with it: Another shell script for data journalists

Pastel padlocks

Meanest Indian (Flickr)

Padlocks in pastel on an Indian street = pretty protection

Data journalism is all about workflow. I have been writing shell scripts to manage my workflow (and as an excuse to learn some programming). First, I wrote a script to create a standard folder structure for all story projects. Next, I wrote a script to help with the initial data audit of a CSV file. Now I’ve got a script that helps with securing the original copies of my data.

This is a big deal. I never work with original data files. As soon as I receive the data, I copy it and protect the original files so they cannot be overwritten.

Why? Because the second I open that data, I’m going to start messing with it — and messing with data sometimes means messing it up. It is so easy to do. It is why I also keep a methodical data journal, which I will talk about a little later.

I made a script that protects the data for me (and helps with the journaling). I call it pushdata.sh and it is a beautiful thing.

So what does it do? It is pretty simple. I have a directory, called DataInbox, where all new data ends up. In that directory is another, called NewData, and it is sort of the launch pad for this script I’ve written. Any new data I want to protect and start working with goes in that directory. It does not matter if it is a single file or a bunch of nested folders filled with files.

Here is what happens when I run the script on whatever is in NewData (I’m using a CSV file called banklist.csv, a list I keep of failed banks in the United States).

I want to end up with a directory called FailedBanks that contains the protected data and the data it is safe for me to mess with. So I go to the command line and get inside the DataInbox directory, where I run this command:

bash pushdata.sh FailedBanks

And we’re rolling… Whatever was in NewData will disappear. The directory I just created, FailedBanks, now lives in a directory I call DataFarm — this is where all of the data I’m working with (or have worked with) lives. You’ll need to create a directory called DataFarm for this script to work.

The new directory contains two subdirectories. One is called Data and the other is called ProtectedOrig. I do my work with whatever is in the Data directory. The ProtectedOrig directory is read only, as is every subdirectory and file inside of it. You can mess that data up, but you can’t save that mess in place of the protected original.

Inside the Data directory, you’ll also find a file called FailedBanks_DataJournal.txt, which you should have open whenever the data itself is open and you are messing with it. The script generates a file creation date at the top, a note about how the new directory was created, and three headings to guide my data journaling, where I record my data cleaning and manipulation steps.
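
If you prefer to add journal entries from the command line, a small helper can timestamp each one. This is only a sketch: journal_note is a made-up function, not part of pushdata.sh, and the path in the usage comment assumes the FailedBanks example.

```shell
# Hypothetical helper: append a timestamped note to a data journal file.
# Usage: journal_note Data/FailedBanks_DataJournal.txt "Removed duplicate rows"
journal_note() {
    local journal="$1"   # path to the journal file
    shift                # everything after the path is the note text
    printf '%s  %s\n' "$(date "+%m/%d/%y %H:%M:%S")" "$*" >> "$journal"
}
```

Each call adds one line to the end of the file, so the generated headings stay untouched.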

Here’s the script:

#!/bin/bash
 
# Stop if no folder name was given.
if [[ -z "${1}" ]]; then
    echo "FolderName Required" >&2
    exit 1
fi
 
# YOU is a placeholder for your own username.
newBaseDir="/Users/YOU/DataFarm/$1"
/bin/mkdir -p "$newBaseDir"/{ProtectedOrig,Data}
 
echo -n "$(date "+Generated on %m/%d/%y at %H:%M:%S")
 
The $1 folder structure was created using the pushdata.sh script.
 
-----------------
Data Introduction
-----------------
 
------------------------
Data Audit/Manipulation
------------------------
 
------------
Data Queries
------------
" > "$newBaseDir/Data/${1}_DataJournal.txt"
 
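# Copy the new data so there is a working copy and an untouched original.
ditto NewData/ NewDataCopy
# Move the original and the copy into the new project folder.
mv NewData/ "$newBaseDir/ProtectedOrig/NewData"
mv NewDataCopy/ "$newBaseDir/Data/NewDataCopy"
# Unpack both so the files sit directly inside Data and ProtectedOrig.
mv "$newBaseDir"/Data/NewDataCopy/* "$newBaseDir/Data/"
rm -r "$newBaseDir/Data/NewDataCopy"
mv "$newBaseDir"/ProtectedOrig/NewData/* "$newBaseDir/ProtectedOrig/"
rm -r "$newBaseDir/ProtectedOrig/NewData"
# Lock the originals (macOS): set the user-immutable flag on everything.
chflags -R uchg "$newBaseDir/ProtectedOrig/"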
mkdir NewData
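
The script leans on ditto and chflags, which are macOS commands. For anyone on Linux, here is a minimal sketch of the same copy-then-protect idea using cp and chmod instead. It is not the script shown here: protect_copy and its two arguments are invented for illustration, it leaves the source folder in place rather than moving it, and removing write permission with chmod is a looser kind of protection than the uchg flag.

```shell
# Hypothetical helper, not part of pushdata.sh: copy a directory of new data
# into <dest>/Data and <dest>/ProtectedOrig, then make ProtectedOrig read-only.
protect_copy() {
    local src="$1"    # directory holding the new data (e.g. NewData)
    local dest="$2"   # project directory to create (e.g. DataFarm/FailedBanks)

    if [ ! -d "$src" ]; then
        echo "protect_copy: $src is not a directory" >&2
        return 1
    fi

    mkdir -p "$dest/Data" "$dest/ProtectedOrig" || return 1

    # "src/." copies the contents of src, nested folders included.
    cp -R "$src/." "$dest/Data/" || return 1
    cp -R "$src/." "$dest/ProtectedOrig/" || return 1

    # Remove write permission from the protected copy, files and folders alike.
    chmod -R a-w "$dest/ProtectedOrig"
}
```

Something like protect_copy NewData /path/to/DataFarm/FailedBanks would then stand in for the ditto, mv and chflags steps.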

The script I actually run does two additional things: it sends a backup of the new directory to an Amazon S3 bucket for cloud backup, and creates a new Basecamp message for my colleagues to let them know that new data has arrived and to open up a conversation about what should be done with it. I’ll post about both of these actions separately.

I will remind you that I am a beginner, which is why this script has a different style from the others, though it accomplishes similar tasks. I am glad to hear from the more experienced out there about style and usage and all of that good stuff.

Related development:
My copy of “UNIX in a nutshell” arrived today! Onward!

Data journalists: Audit a csv file without ever opening it

Xray Specs

photobunny (flickr)

I’ve been spending some quality time with csvkit, a utility library assembled by the indefatigable Chris Groskopf.

Whenever I get a new dataset, I do a quick data audit to see what’s included and what kind of shape it’s in. I learned to do this work in Excel and Access, but I’m trying to bust out of that proprietary penitentiary called Microsoft Office. Life is so much more fun on the outside.

I’ve created a shell script that uses csvkit commands to peek inside a csv file without ever opening it.

The script sends the results of the data audit to a text file with three headings:

  • Column names: This is huge! I don’t have to boot Bill Gates to get this info; it is right there for me before I’ve ever opened the file.
  • The first ten rows of the first five columns: It’s a little arbitrary, but it will give you a feel for what the data looks like. Are first, middle and last names crammed into one column or broken up? What about city and state?
  • Column stats: A utility called csvstat generates a summary of each column, including number of unique values, if there are any nulls, and row counts. If there are numbers in the column, you’ll see the smallest and largest numbers along with mean and median. Amazing.
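
To get a feel for what the first heading reports, the column names can also be pulled with nothing but standard Unix tools. The sketch below is not how csvkit does it and is much cruder: it assumes a simple header row with no quoted commas, and list_columns is a name invented here.

```shell
# Naive peek at a CSV header row, numbering each column name.
# Assumes plain comma-separated names with no quoted commas.
list_columns() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | awk '{ printf "%d: %s\n", NR, $0 }'
}
```

For anything messier than a plain header, csvkit is the safer choice because it actually parses the CSV.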

What you need to do

Follow Groskopf’s instructions for installing csvkit (time commitment: roughly 30 seconds, if you are comfortable at the command line — if you are not, I have just the web tutorial for you!)

Now create the shell script. I call mine audit.sh:

#!/bin/bash
 
usage () { echo "${0##*/} inputfile outputfile"; exit 1; }
 
(($#==2)) || usage
 
INPUTFILE="$1"
OUTPUTFILE="$2"
 
cat <<EOF >"$OUTPUTFILE"
$(date "+Generated on %m/%d/%y at %H:%M:%S")
 
DATA AUDIT: $1
 
------------
COLUMN NAMES
------------
 
$(csvcut -n $INPUTFILE)
 
---------------------------------------
FIRST TEN ROWS OF FIRST FIVE COLUMNS
---------------------------------------
 
$(csvcut -c 1,2,3,4,5 $INPUTFILE | head -n 10)
 
------------
COLUMN STATS
------------
 
$(csvcut $INPUTFILE | csvstat )
 
---END AUDIT
EOF
 
echo "Audited!"
 
Don’t forget to make the script executable (I’m new to this stuff and I always forget):
 
$ chmod +x audit.sh
 
Now run the script! Pick a csv file and type:
 
$ ./audit.sh filename.csv DataAudit.txt

You should end up with a file called DataAudit.txt that looks something like this. You can name the output file whatever you want, just replace DataAudit.txt when calling the script.
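
One thing worth guarding against: if csvkit is not installed or the input file name is mistyped, the script would likely still write the headings but leave the sections under them empty. A hedged sketch of a pre-flight check follows. The function names require_commands and require_readable are invented for this example and are not part of audit.sh.

```shell
# Hypothetical pre-flight checks; not part of the original audit.sh.

# Succeeds only if every named command can be found on the PATH.
require_commands() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing command: $cmd" >&2
            return 1
        fi
    done
}

# Succeeds only if the given file exists and can be read.
require_readable() {
    if [ ! -r "$1" ]; then
        echo "Cannot read file: $1" >&2
        return 1
    fi
}

# Possible use near the top of audit.sh, after INPUTFILE is set:
#   require_commands csvcut csvstat || exit 1
#   require_readable "$INPUTFILE" || exit 1
```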

Got a better way to do this? I’d love to hear about it!

Fellow reporters: Keep your story files organized with this handy shell script

This week I created a new system for organizing my story files. The problem? It’s nine folders large. If I had to create nine folders for every story I started, this system would die an early death. But it’s a good system. So I went to work trying to automate it.

Because I am a novice command-line wizard I was able to construct a very long command that created the nine folders, but there were a few things I couldn’t hack together, so I turned to the elder wizards. First, let me explain the system.

 

It’s all pretty self-explanatory. The last two are probably the exception. PubMaterial is for final copy, links to the article or post online, and screenshots of the published work. RefMaterial is for anything that is not reporting notes or a data file (reports, articles pulled from the internet, scanned documents).

Story tree

There’s one other thing that you don’t see up there. I work with lots of data. Often I have the data before I have a story. Often I create stories around a spreadsheet or database file I’ve acquired. The first thing I do when I get a new data file or collection of files is create a text document called DataJournal that is always organized into the same four headings (so why not automate that, too?)

I wanted to be able to create this document when I created the folders. I had no idea how to do that, so I turned to Stack Overflow. In fewer than 30 minutes I had three solutions, each one a bit more powerful than the last. You can read the thread here, or you can just read on.

The shell script that changed everything, or something

So here’s how this idea evolved through the good people who offered their help over at Stack Overflow: Now I can create the folder tree right where I want it, and with the DataJournal file included, with a single short command. What’s more, I can name the folder in the command, which keeps me from creating and being stuck with a ton of folders called “NewStory” because I’m too lazy to rename them.

Here’s how it’s done…

1) Open a blank text file and paste this into it:

#!/bin/bash
 
if [[ -z "${1}" ]]; then
    echo "FolderName Required" >&2
    exit 1
fi
 
/bin/mkdir -p ~/Desktop/$1/{Copy,Data,Notes,PubMaterial,RefMaterial,Media/{Audio,Images,Video}}
 
echo -n "---Data Folder Setup
 
---Data Introduction
 
---Data Audit/Manipulation
 
---Data Queries" > ~/Desktop/$1/Data/DataJournal.txt

2) Save the file in your home folder (or wherever you please) as “create-story.sh”

3) Find your way to the command-line and make the file executable:

chmod +x create-story.sh

4) Now make the magic happen:

bash create-story.sh StoryName

That’s where you name your story folder, right there where it says “StoryName.”

This gives you my DataJournal file, which you may not want. Just stick with everything before the “echo” command if you don’t need the file. Or mess around a bit and put your own custom text file in one of the folders.

5) Set the path to whatever works for you. I actually send mine to a parent “Stories” folder on Dropbox, so I can get at this stuff anywhere. Just remember to change the path in both places it appears in the shell script.
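
Since the path shows up twice in the script, one option is to hold it in a variable so there is only one place to change. This is a sketch of that idea, not the script from Stack Overflow: STORY_BASE and create_story are names made up here, and it uses a plain loop instead of brace expansion.

```shell
# Sketch: keep the destination in one variable so the path is edited once.
# STORY_BASE is a made-up variable; it falls back to ~/Desktop if unset.
create_story() {
    local base="${STORY_BASE:-$HOME/Desktop}"
    local name="$1"
    if [ -z "$name" ]; then
        echo "FolderName Required" >&2
        return 1
    fi

    local root="$base/$name"
    local dir
    # mkdir -p also creates the Media parent, giving nine folders in all.
    for dir in Copy Data Notes PubMaterial RefMaterial \
               Media/Audio Media/Images Media/Video; do
        mkdir -p "$root/$dir" || return 1
    done

    # Write the four journal headings, each followed by a blank line.
    printf '%s\n\n' "---Data Folder Setup" "---Data Introduction" \
        "---Data Audit/Manipulation" "---Data Queries" \
        > "$root/Data/DataJournal.txt"
}
```

Setting STORY_BASE to a Dropbox folder before calling create_story StoryName would send the whole tree there.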

I’m still playing with this. What I’d like to be able to do now is build on the script and make it create a generic Google Doc draft file and put a symlink in the drafts folder (messing with GoogleCL right now to try and make this happen). Onward!

Command line wizardry

The artisan face

pennstatelive (Flickr)

You know those woodworkers who use the foot-powered lathes? Yeah, like that guy right there. That’s kind of how I’ve always thought of people who use the command line to make their computer do things.

It seemed a bit like artisan computing, and now I am that artisan.*

I’m not making beautiful table legs yet. Mostly I can just pump the lever with the right rhythm and make a face like I mean it.

That was enough, however, to get me past the dozen or so high hurdles I met getting started with Python.

I couldn’t have done it without the help of Addison Berry over at Lullabot. Her Command Line Basics videos are fantastic. She’s a great teacher. It’s kind of like having Terry Gross teach you computer programming. Now that I read that it sounds weird. But seriously, she’s great. You should totally get started now.

* Having learned command line basics, I see the weaknesses in the analogy, but I’m sticking with it!