Saturday, February 11, 2017

Exploring Data From the Linux Command Line

A few days ago, we saw the first signs that perhaps the worst of an unusually cold and wet winter might be ending: a temperature over 60°F!

A neighbor commented that it had been a long time since the last one, and I was curious as to exactly how long. For reasons of my own, I keep data files on what's recorded at the nearest weather station with what I consider to be fairly reliable data. So it only took a couple of minutes of exploratory hacking around at a shell prompt to get my answer. Here's what I did, and the result I got.

grep '^161[0-2]' 1606010000-1612311600 | awk '{print $1" "$5}' | grep -E '6.{3}$' | tail -n1
1611191600 61.0

It seems longer, but the last day of ≥ 60.0°F temperature was 2016-11-19, and as a side-effect we also get the time of that day's last ≥ 60.0°F reading: 1600 (4PM for those of you who don't use 24-hour time). We could get rid of that side-effect; side-effects are usually a Bad Thing in code. But in this case the source is obvious (as will be shown below), and the effect is entirely beneficial: it extracts another piece of information from our data at zero computational cost. Exploratory code for the win.

Before I get into what our pipeline is doing, a note about the file. These are raw data - fields are separated only by whitespace. Lines begin with the date and time encoded as YYMMDDTTTT. Hence the meaning of the first field of the result seen above, and of the file name 1606010000-1612311600, which reflects the start and stop dates and times of the data it contains. That can be a useful convention: in this case it immediately reveals that the data are incomplete. The station failed to record after 1600 on New Year's Eve.
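As an aside, if you ever want one of those stamps in a friendlier form, GNU date plus a little bash substring slicing will translate it (d here is just an illustrative variable holding the first field of our result):

d=1611191600
date -d "20${d:0:2}-${d:2:2}-${d:4:2} ${d:6:2}:${d:8:2}"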

Additionally, we can use wc (the word-count program) in line-count mode to see that we are starting with a file containing 5541 lines (records, though there is a 3-line header, which I won't bother to filter out).
wc -l 1606010000-1612311600
5541 1606010000-1612311600
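If that header ever did need to go, tail can drop it; 'tail -n +4' prints everything from line 4 onward, so this should report 5538:

tail -n +4 1606010000-1612311600 | wc -l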

1- grep '^161[0-2]' 1606010000-1612311600, in which grep (a pattern-matching tool) supplies all lines (records) from our file that begin (specified via ^) with 161 followed by a digit from 0 to 2. I was only interested in months 10-12 of 2016 (and 2016 data are all that is in this file), because I knew the last date of ≥ 60.0°F would be in there somewhere. We now have only records from our period of interest. If we ended here, our output would be
1610010000  24.10    3.0  163.0   53.0   51.0   80.0   12.9  204.0    6.0    0.0
...
1612311600  48.70    2.0   99.0   35.0   35.0   87.0   13.3  173.0    8.0   47.0

I'm using the ellipses in place of 2608 lines of output; wc -l shows 2610, as demonstrated below. We've already filtered out just over half of our data. Now we pipe (the | character) those lines to awk.
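For the record, that count comes from dropping the rest of the pipeline and counting the first stage's output:

grep '^161[0-2]' 1606010000-1612311600 | wc -l
2610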

2- awk '{print $1" "$5}', where we instruct awk (a pattern scanning and processing language, of which more later) to print only that first datetime field, a space, then field 5, which contains the temperature, of each line of input it receives. Now we're down to only the fields of interest within our period of interest.  Had we stopped here, our output would still be 2610 lines, but only 2 fields out of 11, formatted as
YYMMDDTTTT NN.N.

This 2nd stage of our filter removed about 2/3 of its incoming data. I'm just guesstimating by looking at line lengths here, but you can get accurate numbers using wc again, before and after this stage; specify -c instead of -l to count bytes instead of lines. I'll skip the demonstration. Now we send that on to grep again, but specifying different options.
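(For the curious, those before-and-after byte counts would come from something like this pair of commands; I haven't included the numbers.)

grep '^161[0-2]' 1606010000-1612311600 | wc -c
grep '^161[0-2]' 1606010000-1612311600 | awk '{print $1" "$5}' | wc -c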

3- grep -E '6.{3}$', where the -E (Extended regular expressions) option enables the {} syntax, which lets us specify how many instances of a character we want to match. The dot can be read as 'any one character', so .{3} matches exactly three characters, no more and no fewer. The trailing '$' matches the end of line -- the counterpart of the '^' we used in the first grep. The net effect is that only lines ending in a '6' followed by any 3 characters survive. Given the NN.N format of our temperature field, that filters out everything except 6N.N, and wc -l now shows only 222 of those short lines left, of 2610. Having filtered out all but a twelfth or so of the data coming into this stage, we now filter down to one line - our answer.
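That 222, for the record, comes from counting instead of taking the last line:

grep '^161[0-2]' 1606010000-1612311600 | awk '{print $1" "$5}' | grep -E '6.{3}$' | wc -l
222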

4- tail -n1, which returns only the last n lines; here we specify n=1. Because the data are in increasing date/time order (as can be seen in the output of our first filter), this gives us our last datetime, and answers our question with greater accuracy than we had thought to ask for.
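As an aside, the same ordering means that swapping tail for head would answer the mirror-image question, the first ≥ 60.0°F reading of the quarter:

grep '^161[0-2]' 1606010000-1612311600 | awk '{print $1" "$5}' | grep -E '6.{3}$' | head -n1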

If we needed the date and nothing but the date, we could modify our usage of awk. GNU awk has some very interesting capabilities, such as floating-point math, true multidimensional arrays, etc. This entire task could have been done in awk, but I wanted to show more of the shell tools and pipelines, not just Cool Things We Can Do With GNU awk. [1]
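Just as a sketch of that idea (this is not what I actually ran), the whole job collapses into one awk program that prints only the date; given the data above it should print 2016-11-19:

awk '$1 ~ /^161[0-2]/ && $5 >= 60 { last = $1 }
     END { printf "20%s-%s-%s\n", substr(last, 1, 2), substr(last, 3, 2), substr(last, 5, 2) }' 1606010000-1612311600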

The Shell Will Probably Always Belong in Your Toolbox

I often use far more sophisticated tools when I want to take a long, hard look at data. But file formats vary, data may be missing, and so on. As a rule of thumb, you can expect to spend half of your total analysis time just seeing what's there and cleaning it up. For much of that work the shell is a great tool, and it's very common to spend a bit of time exploring at the command line. In a broad view, command-line tools can help you determine quickly whether a particular data source contains anything of interest at all, and if so, how much, how it's formatted, etc. And finally, the commands can be saved as part of a shell script and used over an arbitrary number of similar data files, as sketched below.
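A minimal sketch of such a script, assuming files with the same layout as the one above (the name last-warm.sh is just for illustration):

#!/bin/sh
# last-warm.sh - print the last Oct-Dec 2016 record at or above 60.0 F
# from each data file named on the command line. The '^161[0-2]' prefix
# is specific to Q4 of 2016; adjust it for other periods.
for f in "$@"; do
    last=$(grep '^161[0-2]' "$f" | awk '{print $1" "$5}' | grep -E '6.{3}$' | tail -n1)
    printf '%s: %s\n' "$f" "$last"
done

Run it as: sh last-warm.sh 1606010000-1612311600 (or any number of similar files).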

To a point, anyway. Shells are slow (particularly bash). Of course there are tools to quantify that as well, and timing your work on a subset of the data can give you an idea of when you are going to have to reach for something else. 'time' is built into bash (as a shell keyword), and any Unix or Linux will also have a 'time' binary somewhere on your search path if the appropriate package is installed. On this machine it's /usr/bin/time, packaged as 'time'. Of the other tools used here, wc and tail live in the 'coreutils' package, while grep and awk ship in packages of their own; all of them are part of any base install. Which says something about how useful these tools are. If you aren't using them, you quite literally are not using the core of the Linux/Unix toolset.
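For instance, bash's time keyword can wrap the entire pipeline, and the external binary can report far more detail (the -v flag below is GNU time's verbose mode; exact numbers will vary by machine, so none are shown):

time grep '^161[0-2]' 1606010000-1612311600 | awk '{print $1" "$5}' | grep -E '6.{3}$' | tail -n1

/usr/bin/time -v wc -l 1606010000-1612311600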

That is probably a mistake. There is a lot of data out there, stored as textual files of moderate size.

My Ulterior Motive for This

I wanted a post such that:
  1. I could advocate the command line to people who seem to inappropriately default to spreadsheets, which are nothing more than another tool in the box. That box should contain several tools. Consider unstructured data. Or consider binary data formats, which are an intractable problem for both shells and spreadsheets.
  2. It had absolutely nothing to do with security work, because people are going to be justifiably sensitive about exactly whose security data I might be using as an example. But everybody talks about the weather.
If anyone wants to play with the data, it's available at:
https://drive.google.com/open?id=0B0XLFi22OXDpR3h0UUQ1cmNWbkk
Note to self: find another home for this sort of thing. Google Drive can't even preview a text file.
Note to all: this is not a promise to keep it there for any significant period of time. If I need the space for other things (like client-related things), that file is very, very gone. I recently VVG'ed most of what was in /pub.

[1] I do have one idea for something I'll do with awk one of these days. Because who doesn't like univariate summary statistics combined with 4,000-year-old Babylonian math, and using NIST-certified results to validate (or invalidate, as the case may be) our code?
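For what it's worth, here is a very rough sketch of the shape of that idea (emphatically not the eventual post; the variance formula, iteration count, and stdin plumbing are all my own choices): mean and sample standard deviation of the temperature column, with the square root computed by the Babylonian method of repeatedly averaging a guess with the value divided by that guess.

awk '/^16/ {print $5}' 1606010000-1612311600 | awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
        mean = sum / n
        var  = (sumsq - n * mean * mean) / (n - 1)   # sample variance
        x = (var > 1) ? var : 1                      # initial guess
        for (i = 0; i < 40; i++)                     # Babylonian square root:
            x = (x + var / x) / 2                    # average x with var/x
        printf "n=%d  mean=%g  sd=%g\n", n, mean, x
    }'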
