Skinny Notebook: Project Argo Edition

Project Argo

I couldn’t stand it any longer: I had to try the suite of blogger-friendly WordPress themes and plugins developed by National Public Radio’s Project Argo. Skinny Notebook was already a WordPress blog, so it was an easy transition. So much has been written about Argo already, and I really want to get back to playing. So here are a few links if you want to see about giving it a spin…

Protect your data before you mess with it: Another shell script for data journalists

Pastel padlocks

Meanest Indian (Flickr)

Padlocks in pastel on an Indian street = pretty protection

Data journalism is all about workflow. I have been writing shell scripts to manage my workflow (and as an excuse to learn some programming). First, I wrote a script to create a standard folder structure for all story projects. Next, I wrote a script to help with the initial data audit of a CSV file. Now I’ve got a script that helps with securing the original copies of my data.

This is a big deal. I never work with original data files. As soon as I receive the data, I copy it and protect the original files so they cannot be overwritten.

Why? Because the second I open that data, I’m going to start messing with it — and messing with data sometimes means messing it up. It is so easy to do. It is why I also keep a methodical data journal, which I will talk about a little later.

I made a script that protects the data for me (and helps with the journaling). I call it pushdata.sh and it is a beautiful thing.

So what does it do? It is pretty simple. I have a directory, called DataInbox, where all new data ends up. In that directory is another, called NewData, and it is sort of the launch pad for this script I’ve written. Any new data I want to protect and start working with goes in that directory. It does not matter if it is a single file or a bunch of nested folders filled with files.

Here is what happens when I run the script on whatever is in NewData (I’m using a CSV file called banklist.csv, a list I keep of failed banks in the United States).

I want to end up with a directory called FailedBanks that contains the protected data and the data it is safe for me to mess with. So I go to the command line and get inside the DataInbox directory, where I run this command:

bash pushdata.sh FailedBanks

And we’re rolling… Whatever was in NewData will disappear. The directory I just created, FailedBanks, now lives in a directory I call DataFarm — this is where all of the data I’m working with (or have worked with) lives. You’ll need to create a directory called DataFarm for this script to work.

The new directory contains two subdirectories. One is called Data and the other is called ProtectedOrig. I do my work with whatever is in the Data directory. The ProtectedOrig directory is read only, as is every subdirectory and file inside of it. You can mess that data up, but you can’t save that mess in place of the protected original.

Inside the Data directory, you’ll also find a file called FailedBanks_DataJournal.txt, which you should have open whenever the data itself is open and you are messing with it. The script generates a file creation date at the top, a note about how the new directory was created, and three headings to guide my data journaling (where I record my data cleaning and manipulation steps).

Here’s the script:

#!/bin/bash

# Require a name for the new data directory.
if [[ -z "${1}" ]]; then
    echo "FolderName Required" >&2
    exit 1
fi

# Replace YOU with your username.
newBaseDir="/Users/YOU/DataFarm/$1"
/bin/mkdir -p "$newBaseDir"/{ProtectedOrig,Data}
 
echo -n "$(date "+Generated on %m/%d/%y at %H:%M:%S")
 
The $1 folder structure was created using the pushdata.sh script.
 
-----------------
Data Introduction
-----------------
 
------------------------
Data Audit/Manipulation
------------------------
 
------------
Data Queries
------------ 
 
" > "$newBaseDir/Data/${1}_DataJournal.txt"
 
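# Copy the incoming data, then file the original and the working copy separately.
ditto NewData/ NewDataCopy
mv NewData/ "$newBaseDir/ProtectedOrig/NewData"
mv NewDataCopy/ "$newBaseDir/Data/NewDataCopy"
# Move the contents up one level and remove the empty holding directories.
mv "$newBaseDir"/Data/NewDataCopy/* "$newBaseDir/Data/"
rm -r "$newBaseDir/Data/NewDataCopy"
mv "$newBaseDir"/ProtectedOrig/NewData/* "$newBaseDir/ProtectedOrig/"
rm -r "$newBaseDir/ProtectedOrig/NewData"
# Lock the originals with the user-immutable flag, then reset the inbox.
chflags -R uchg "$newBaseDir/ProtectedOrig/"
mkdir NewData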
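
The chflags line relies on a file flag that is not available everywhere. On a system without chflags, a rough stand-in is to strip write permission from the protected copy. This is only a sketch: protect_dir is a made-up helper name, not part of pushdata.sh, and the protection is weaker than the immutable flag because the owner can restore write permission with chmod.

```shell
#!/bin/bash

# protect_dir is a hypothetical helper, not part of the original pushdata.sh.
# It removes write permission for everyone on a directory tree.
protect_dir() {
    local target="$1"
    if [ ! -d "$target" ]; then
        echo "protect_dir: $target is not a directory" >&2
        return 1
    fi
    chmod -R a-w "$target"
}

# Demo on a throwaway directory so nothing real is touched.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/ProtectedOrig"
echo "Bank Name,City" > "$demo_dir/ProtectedOrig/banklist.csv"
protect_dir "$demo_dir/ProtectedOrig"
ls -l "$demo_dir/ProtectedOrig"
```

The listing should show the file without any write bits. To clean up the demo directory, restore write permission first (chmod -R u+w) and then delete it.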

The script I actually run does two additional things: it sends a backup of the new directory to an Amazon S3 bucket for cloud backup, and creates a new Basecamp message for my colleagues to let them know that new data has arrived and to open up a conversation about what should be done with it. I’ll post about both of these actions separately.

I will remind you that I am a beginner, which is why this script has a different style from the others, though it accomplishes similar tasks. I am glad to hear from the more experienced out there about style and usage and all of that good stuff.

Related development:
My copy of “UNIX in a nutshell” arrived today! Onward!

Data journalists: Audit a csv file without ever opening it

Xray Specs

photobunny (flickr)

I’ve been spending some quality time with csvkit, a utility library assembled by the indefatigable Chris Groskopf.

Whenever I get a new dataset, I do a quick data audit to see what’s included and what kind of shape it’s in. I learned to do this work in Excel and Access, but I’m trying to bust out of that proprietary penitentiary called Microsoft Office. Life is so much more fun on the outside.

I’ve created a shell script that uses csvkit commands to peek inside a csv file without ever opening it.

The script sends the results of the data audit to a text file with three headings:

  • Column names: This is huge! I don’t have to boot Bill Gates to get this info; it is right there for me before I’ve ever opened the file.
  • The first ten rows of the first five columns: It’s a little arbitrary, but it will give you a feel for what the data looks like. Are first, middle and last names crammed into one column or broken up? What about city and state?
  • Column stats: A utility called csvstat generates a summary of each column, including number of unique values, if there are any nulls, and row counts. If there are numbers in the column, you’ll see the smallest and largest numbers along with mean and median. Amazing.
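
For a feel of what a column-name listing looks like, here is a rough stand-in that needs only standard Unix tools. It is a naive sketch and not how csvkit works: it splits the header on every comma, so it will miscount if a column name contains a quoted comma. The function name peek_columns and the sample rows are made up for illustration.

```shell
#!/bin/bash

# peek_columns is a hypothetical helper: it numbers the fields of the header row.
# Naive on purpose: it splits on every comma and ignores CSV quoting rules.
peek_columns() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n' | awk '{ printf "%3d: %s\n", NR, $0 }'
}

# Build a tiny sample file so the sketch runs on its own.
sample="$(mktemp)"
printf 'Bank Name,City,State\nExample Bank,Springfield,IL\n' > "$sample"
peek_columns "$sample"
```

For real data, csvkit is the safer choice because it understands quoting.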

What you need to do

Follow Groskopf’s instructions for installing csvkit (time commitment: roughly 30 seconds, if you are comfortable at the command line — if you are not, I have just the web tutorial for you!)

Now create the shell script. I call mine audit.sh:

#!/bin/bash
 
usage () { echo "${0##*/} inputfile outputfile"; exit 1; }
 
(($#==2)) || usage
 
INPUTFILE="$1"
OUTPUTFILE="$2"
 
cat <<EOF >"$OUTPUTFILE"
$(date "+Generated on %m/%d/%y at %H:%M:%S")
 
DATA AUDIT: $1
 
------------
COLUMN NAMES
------------
 
$(csvcut -n "$INPUTFILE")
 
---------------------------------------
FIRST TEN ROWS OF FIRST FIVE COLUMNS
---------------------------------------
 
$(csvcut -c 1,2,3,4,5 "$INPUTFILE" | head -n 10)
 
------------
COLUMN STATS
------------
 
$(csvcut "$INPUTFILE" | csvstat)
 
---END AUDIT
EOF
 
echo "Audited!"
 
Don't forget to make the script executable (I'm new to this stuff and I always forget):

$ chmod +x audit.sh

Now run the script! Pick a csv file and type:
 
$ ./audit.sh filename.csv DataAudit.txt

You should end up with a file called DataAudit.txt that looks something like this. You can name the output file whatever you want, just replace DataAudit.txt when calling the script.
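
One safeguard worth considering: if csvkit is not installed or the input file is missing, the script would likely still write a report, just with empty sections. Below is a minimal sketch of a pre-flight check that could sit near the top of audit.sh. The helper names require_cmd and require_file are hypothetical; command -v is the standard shell way to ask whether a program is on the PATH.

```shell
#!/bin/bash

# Hypothetical helpers, not part of the original audit.sh.

# Succeeds only if the named program can be found on the PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "Missing required command: $1" >&2
        return 1
    }
}

# Succeeds only if the path is a regular, readable file.
require_file() {
    [ -f "$1" ] && [ -r "$1" ] || {
        echo "Cannot read input file: $1" >&2
        return 1
    }
}

# In audit.sh this might be used as:
#   require_cmd csvcut && require_cmd csvstat && require_file "$INPUTFILE" || exit 1
```

Failing early like this keeps a half-empty DataAudit.txt from being mistaken for a real audit.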

Got a better way to do this? I’d love to hear about it!

Fellow reporters: Keep your story files organized with this handy shell script

This week I created a new system for organizing my story files. The problem? It’s nine folders large. If I had to create nine folders for every story I started, this system would die an early death. But it’s a good system. So I went to work trying to automate it.

Because I am a novice command-line wizard I was able to construct a very long command that created the nine folders, but there were a few things I couldn’t hack together, so I turned to the elder wizards. First, let me explain the system.

 

It’s all pretty self-explanatory. The last two are probably the exception. PubMaterial is for final copy, links to the article or post online, and screenshots of the published work. RefMaterial is for anything that is not reporting notes or a data file (reports, articles pulled from the internet, scanned documents).

Story tree

There’s one other thing that you don’t see up there. I work with lots of data. Often I have the data before I have a story. Often I create stories around a spreadsheet or database file I’ve acquired. The first thing I do when I get a new data file or collection of files is create a text document called DataJournal that is always organized into the same four headings (so why not automate that, too?)

I wanted to be able to create this document when I created the folders. I had no idea how to do that, so I turned to Stack Overflow. In fewer than 30 minutes I had three solutions, each one a bit more powerful than the last. You can read the thread here, or you can just read on.

The shell script that changed everything, or something

So here’s how this idea evolved through the good people who offered their help over at Stack Overflow: Now I can create the folder tree right where I want it, and with the DataJournal file included, with a single short command. What’s more, I can name the folder in the command, which keeps me from creating and being stuck with a ton of folders called “NewStory” because I’m too lazy to rename them.

Here’s how it’s done…

1) Open a blank text file and paste this into it:

#!/bin/bash
 
# Require a name for the story folder.
if [[ -z "${1}" ]]; then
    echo "FolderName Required" >&2
    exit 1
fi

/bin/mkdir -p ~/Desktop/"$1"/{Copy,Data,Notes,PubMaterial,RefMaterial,Media/{Audio,Images,Video}}
 
echo -n "---Data Folder Setup
 
---Data Introduction
 
---Data Audit/Manipulation
 
---Data Queries" > ~/Desktop/"$1"/Data/DataJournal.txt

2) Save the file in your home folder (or wherever you please) as “create-story.sh”

3) Find your way to the command-line and make the file executable:

chmod +x create-story.sh

4) Now make the magic happen:

bash create-story.sh StoryName

That’s where you name your story folder, right there where it says “StoryName.”

This gives you my DataJournal file, which you may not want. Just stick with everything before the “echo” command if you don’t need the file. Or mess around a bit and put your own custom text file in one of the folders.

5) Set the path to whatever works for you. I actually send mine to a parent “Stories” folder on Dropbox, so I can get at this stuff anywhere. Just remember to change the path in both places it appears in the shell script.
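
One way to avoid editing the path in two places is to keep it in a single variable. Here is a sketch under that assumption; create_story and STORY_BASE are illustrative names that are not in the original script, and the folders are listed one by one instead of with brace expansion so the loop is easy to edit.

```shell
#!/bin/bash

# create_story is a hypothetical rework of create-story.sh.
# Usage: create_story BASE_DIR STORY_NAME
create_story() {
    local base="$1" name="$2"
    if [ -z "$base" ] || [ -z "$name" ]; then
        echo "usage: create_story BASE_DIR STORY_NAME" >&2
        return 1
    fi
    local story="$base/$name"
    local sub
    # Same nine folders as the original mkdir command.
    for sub in Copy Data Notes PubMaterial RefMaterial \
               Media/Audio Media/Images Media/Video; do
        mkdir -p "$story/$sub"
    done
    # Write the four journal headings, each followed by a blank line.
    printf '%s\n\n' "---Data Folder Setup" "---Data Introduction" \
        "---Data Audit/Manipulation" "---Data Queries" > "$story/Data/DataJournal.txt"
}

# The base path now lives in one place; change it here only.
STORY_BASE="${STORY_BASE:-$HOME/Desktop}"
# Example call: create_story "$STORY_BASE" StoryName
```

Pointing STORY_BASE at a Dropbox folder would then be a one-line change.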

I’m still playing with this. What I’d like to be able to do now is build on the script and make it create a generic Google Doc draft file and put a symlink in the drafts folder (messing with GoogleCL right now to try and make this happen). Onward!

Command line wizardry

The artisan face

pennstatelive (Flickr)

You know those woodworkers who use the foot-powered lathes? Yeah, like that guy right there. That’s kind of how I’ve always thought of people who use the command line to make their computer do things.

It seemed a bit like artisan computing, and now I am that artisan.*

I’m not making beautiful table legs yet. Mostly I can just pump the lever with the right rhythm and make a face like I mean it.

That was enough, however, to get me past the dozen or so high hurdles I met getting started with Python.

I couldn’t have done it without the help of Addison Berry over at Lullabot. Her Command Line Basics videos are fantastic. She’s a great teacher. It’s kind of like having Terry Gross teach you computer programming. Now that I read that it sounds weird. But seriously, she’s great. You should totally get started now.

* Having learned command line basics, I see the weaknesses in the analogy, but I’m sticking with it!

I forgot to mention JavaScript and SQL

JavaScript noob

It’s not just Python I’m creeping into. I’m also working my way through The JavaScript Pocket Guide by Lenny Burdette. I learned a little bit about JavaScript working on this. Mostly I learned that I really needed to learn JavaScript.

One more thing: I’m getting comfortable with Structured Query Language too. My intro to SQL came at a six-day boot camp run by the National Institute for Computer-Assisted Reporting.

That’s the toolbox for the moment: Python, JavaScript, and SQL (with nuts and bolts stuff like HTML and CSS already in there). Wish me luck. Be nice. Here we go.

The emotional landscape of learning to think Python

Hello big bad world

A quick note about how I’m going about learning Python.

1) I joined a Google Group called PythonJournos. They got started months before I did, but I’m going through the messages in order and feeling a part of the group, though I have yet to post a single message. Lurk life!

2) To follow along with the group, I purchased the epic Learning Python (4th Edition) by Mark Lutz.

3) I dusted off a printed copy of Allen B. Downey’s Think Python: How to Think Like a Computer Scientist because, well, I need a lot of help thinking like a computer scientist. It’s a fantastic book.

4) Occasionally I will bring beer and snacks to a friend who thinks like a computer scientist really well, in exchange for his patient wisdom.

I’m feeling some momentum and it’s a great feeling. There is also that occasional feeling of momentum blocked—or what feels like momentum blocked (it’s really just learning, and learning = momentum, right?).

I turned to Think Python after three chapters of Learning Python and I’m so glad I did. I never gave much thought to the emotional experience of programming, though I feel its effects constantly, whether it’s my stomach squeezed tight like a fist (when something is going wrong) or my foot excitedly tap-tap-tapping (the elation of things going right).

Downey addresses the emotional landscape beautifully in Chapter 1:

Programming, and especially debugging, sometimes brings out strong emotions. If you are struggling with a difficult bug, you might feel angry, despondent or embarrassed.

There is evidence that people naturally respond to computers as if they were people. When they work well, we think of them as teammates, and when they are obstinate or rude, we respond to them the same way we respond to rude, obstinate people.

Preparing for these reactions might help you deal with them. One approach is to think of the computer as an employee with certain strengths, like speed and precision, and particular weaknesses, like lack of empathy and inability to grasp the big picture.

Your job is to be a good manager: find ways to take advantage of the strengths and mitigate the weaknesses. And find ways to use your emotions to engage with the problem, without letting your reactions interfere with your ability to work effectively.

The manager/employee analogy is a little unsatisfying (loaded as it is for anybody who has ever been on either side of this often toxic relationship in the real world), but the message is crystal clear and I’m grateful for his cue to consider this element of learning to program.

Onward!

UPDATE (May 2, 2012): It’s been more than a year since I wrote this post. I never went deep with the Python group, and my learning ended up being far more diverse than I had expected, and did not go in a straight line. I’ll write more about this eventually, but felt compelled to chime in on younger me here.