Protect your data before you mess with it: Another shell script for data journalists

[Photo: Pastel padlocks. Credit: Meanest Indian (Flickr)]

Padlocks in pastel on an Indian street = pretty protection

Data journalism is all about workflow. I have been writing shell scripts to manage my workflow (and as an excuse to learn some programming). First, I wrote a script to create a standard folder structure for all story projects. Next, I wrote a script to help with the initial data audit of a CSV file. Now I’ve got a script that helps with securing the original copies of my data.

This is a big deal. I never work with original data files. As soon as I receive the data, I copy it and protect the original files so they cannot be overwritten.

Why? Because the second I open that data, I’m going to start messing with it — and messing with data sometimes means messing it up. It is so easy to do. It is why I also keep a methodical data journal, which I will talk about a little later.

I made a script that protects the data for me (and helps with the journaling). I call it pushdata.sh and it is a beautiful thing.

So what does it do? It is pretty simple. I have a directory, called DataInbox, where all new data ends up. In that directory is another, called NewData, and it is sort of the launch pad for this script I’ve written. Any new data I want to protect and start working with goes in that directory. It does not matter if it is a single file or a bunch of nested folders filled with files.

Here is what happens when I run the script on whatever is in NewData (I’m using a CSV file called banklist.csv, a list I keep of failed banks in the United States).

I want to end up with a directory called FailedBanks that contains the protected data and the data it is safe for me to mess with. So I go to the command line and get inside the DataInbox directory, where I run this command:

bash pushdata.sh FailedBanks

And we’re rolling… Whatever was in NewData gets moved out, leaving an empty NewData behind for the next batch. The directory I just created, FailedBanks, now lives in a directory I call DataFarm, which is where all of the data I’m working with (or have worked with) lives. For this script to work, you’ll need to edit the newBaseDir path in the script so it points at your own DataFarm directory.
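Setting up the directory layout described so far could look something like the sketch below. The BASE variable is an assumption for illustration only (the script itself hard-codes a path under /Users/YOU), and it defaults to a scratch directory so nothing real gets touched.

```shell
#!/bin/bash
# Sketch: create the directories the post describes.
# BASE is a hypothetical variable, not part of pushdata.sh; it defaults to a
# throwaway directory so the example is safe to run anywhere.
BASE="${BASE:-$(mktemp -d)}"

mkdir -p "$BASE/DataInbox/NewData"   # launch pad for incoming data
mkdir -p "$BASE/DataFarm"            # where finished project folders end up

echo "Created layout under $BASE"
```

Run it with BASE set to your home directory if you want the real thing rather than a scratch copy.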

The new directory contains two subdirectories. One is called Data and the other is called ProtectedOrig. I do my work with whatever is in the Data directory. The ProtectedOrig directory is read only, as is every subdirectory and file inside of it. You can mess that data up, but you can’t save that mess in place of the protected original.
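The chflags command the script relies on is specific to macOS and the BSDs. On Linux, a rough stand-in is to strip write permission with chmod, as in the sketch below. This is an assumption on my part, not what pushdata.sh does, and it is weaker than the uchg flag because the file's owner can restore write access at any time. The directory and file contents are made up for the demonstration.

```shell
#!/bin/bash
# Sketch: make a directory tree read-only with chmod (a stand-in for
# `chflags -R uchg`, which pushdata.sh uses on macOS).
demo="$(mktemp -d)"                       # throwaway directory for the demo
mkdir -p "$demo/ProtectedOrig"
echo "bank,city" > "$demo/ProtectedOrig/banklist.csv"   # placeholder content

chmod -R a-w "$demo/ProtectedOrig"        # remove write permission for everyone

# Show the permission bits; no "w" should appear for the protected file.
perms="$(ls -l "$demo/ProtectedOrig/banklist.csv" | cut -c1-10)"
echo "banklist.csv permissions: $perms"
```

On macOS, the lock set by the script can be lifted later by running chflags -R nouchg on the same directory.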

Inside the Data directory, you’ll also find a file called FailedBanks_DataJournal.txt, which you should have open whenever the data itself is open and you are messing with it. The script generates a file creation date at the top, a note about how the new directory was created, and three headings to guide my data journaling, where I record my data cleaning and manipulation steps.
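The journal template can also be written with a here-document, which some people find easier to read than a long quoted string. The sketch below is one way to do that; write_journal is a hypothetical helper name, not something in pushdata.sh, and it writes to a scratch directory.

```shell
#!/bin/bash
# Sketch: write the journal template with a here-document.
# write_journal is a hypothetical helper, not part of pushdata.sh.
write_journal() {
  local name="$1" dir="$2"
  cat > "$dir/${name}_DataJournal.txt" <<EOF
$(date "+Generated on %m/%d/%y at %H:%M:%S")

The $name folder structure was created using the pushdata.sh script.

-----------------
Data Introduction
-----------------

------------------------
Data Audit/Manipulation
------------------------

------------
Data Queries
------------

EOF
}

out="$(mktemp -d)"                        # throwaway directory for the demo
write_journal FailedBanks "$out"
cat "$out/FailedBanks_DataJournal.txt"    # print the generated journal
```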

Here’s the script:

#!/bin/bash

# Require the name of the new project directory as the first argument.
if [[ -z "${1}" ]]; then
  echo "FolderName Required" >&2
  exit 1
fi

# YOU is a placeholder; point this at your own DataFarm directory.
newBaseDir="/Users/YOU/DataFarm/$1"
/bin/mkdir -p "$newBaseDir"/{ProtectedOrig,Data}

# Start the data journal with a timestamp, a note and three headings.
echo -n "$(date "+Generated on %m/%d/%y at %H:%M:%S")

The $1 folder structure was created using the pushdata.sh script.

-----------------
Data Introduction
-----------------

------------------------
Data Audit/Manipulation
------------------------

------------
Data Queries
------------

" > "$newBaseDir/Data/${1}_DataJournal.txt"

# Copy the incoming data, then file the original and the working copy.
ditto NewData/ NewDataCopy
mv NewData/ "$newBaseDir/ProtectedOrig/NewData"
mv NewDataCopy/ "$newBaseDir/Data/NewDataCopy"
mv "$newBaseDir"/Data/NewDataCopy/* "$newBaseDir/Data/"
rm -r "$newBaseDir/Data/NewDataCopy"
mv "$newBaseDir"/ProtectedOrig/NewData/* "$newBaseDir/ProtectedOrig/"
rm -r "$newBaseDir/ProtectedOrig/NewData"

# Lock the originals (macOS user-immutable flag), then reset the inbox.
chflags -R uchg "$newBaseDir/ProtectedOrig/"
mkdir NewData
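One thing the script does not do is check its starting conditions. A minimal sketch of such a check, assuming you would want to stop when NewData is empty or the target project folder already exists, might look like this. The preflight function and its arguments are hypothetical additions, not part of the original script.

```shell
#!/bin/bash
# Sketch: refuse to run when there is nothing to push or the target exists.
# preflight is a hypothetical helper; pushdata.sh has no such check.
preflight() {
  local inbox="$1" target="$2"
  if [ ! -d "$inbox" ] || [ -z "$(ls -A "$inbox")" ]; then
    echo "Nothing to push: $inbox is missing or empty" >&2
    return 1
  fi
  if [ -e "$target" ]; then
    echo "Refusing to overwrite existing $target" >&2
    return 1
  fi
  return 0
}

# Demonstration in a throwaway directory.
work="$(mktemp -d)"
mkdir -p "$work/NewData"
preflight "$work/NewData" "$work/DataFarm/FailedBanks" || echo "stopped: inbox is empty"

touch "$work/NewData/banklist.csv"
preflight "$work/NewData" "$work/DataFarm/FailedBanks" && echo "ok to push"
```

In the real script, a call like this would sit right after the argument check, before any directories are created.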

The script I actually run does two additional things: it sends a copy of the new directory to an Amazon S3 bucket for cloud backup, and it creates a new Basecamp message for my colleagues to let them know that new data has arrived and to open up a conversation about what should be done with it. I’ll post about both of these actions separately.

I will remind you that I am a beginner, which is why this script has a different style from the others, though it accomplishes similar tasks. I am glad to hear from the more experienced out there about style and usage and all of that good stuff.

Related development:
My copy of “UNIX in a nutshell” arrived today! Onward!