Showing posts with label lifehacking. Show all posts

Saturday, February 11, 2017

Exploring Data From the Linux Command Line

A few days ago, we saw the first signs that perhaps the worst of an unusually cold and wet winter might be ending: a temperature over 60°F!

A neighbor commented that it had been a long time since the last one, and I was curious as to exactly how long. For reasons of my own, I keep data files on what's recorded at the nearest weather station with what I consider to be fairly reliable data. So it only took a couple of minutes of exploratory hacking around at a shell prompt to get my answer. Here’s what I did, and the result I got.

grep ^161[0-2] 1606010000-1612311600 | awk '{print $1" "$5}' | grep -E 6.{3}$ | tail -n1
1611191600 61.0

It seems longer, but the last day of ≥ 60.0°F temperature was 2016-11-19, and as a side-effect we also get the last time of the last day: 1600 (4PM for those of you who don't use 24-hour time). We could get rid of that side-effect; they are usually a Bad Thing in code. But in this case the source is obvious (as will be shown below), and entirely beneficial. It extracts another piece of information from our data at zero computation cost. Exploratory code for the win.

Before I get into what our pipeline is doing, a note about the file. These are raw data - fields are separated only by whitespace. Lines begin with date and time encoded as YYMMDDTTTT. That explains the first field of the result seen above, and the file name 1606010000-1612311600, which reflects the start and stop dates and times of the data. That can be a useful convention: in this case it immediately reveals that the data are incomplete. The station failed to record after 1600 on New Year's Eve.

Additionally, we can use the wordcount program in linecount mode to see that we are starting with a file containing 5541 lines (records, though there is a 3-line header, which I won't bother to filter out).
wc -l 1606010000-1612311600
5541 1606010000-1612311600

1- grep ^161[0-2] 1606010000-1612311600, in which grep (a pattern-matching tool) is supplying all lines (records) from our file that begin (specified via ^) with 161, if the next digit is 0-2. I was only interested in months 10-12 of 2016 (and 2016 data are all that is in this file), because I knew the last date of ≥ 60.0°F would be in there somewhere. We now have only records from our period of interest. If we ended here, our output would be 
1610010000  24.10    3.0  163.0   53.0   51.0   80.0   12.9  204.0    6.0    0.0
...
1612311600  48.70    2.0   99.0   35.0   35.0   87.0   13.3  173.0    8.0   47.0

I'm using the ellipses in place of 2608 lines of output. wc -l shows 2610. We've filtered out a little more than half of our data. Now we pipe (the | character) those lines to awk.

2- awk '{print $1" "$5}', where we instruct awk (a pattern scanning and processing language, of which more later) to print only that first datetime field, a space, then field 5, which contains the temperature, of each line of input it receives. Now we're down to only the fields of interest within our period of interest.  Had we stopped here, our output would still be 2610 lines, but only 2 fields out of 11, formatted as
YYMMDDTTTT NN.N.

This 2nd stage of our filter removed about 2/3 of its incoming data. I'm just guesstimating by looking at line lengths here, but you can get accurate numbers using wc again, before and after this stage. Specify -c instead of -l to count bytes instead of lines. I'll skip the demonstration. Now we send that on to grep again, but specifying different options.
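For anyone who wants to see that byte count in action, here is a minimal sketch. The first and last records are copied from the sample output above and the middle one is invented, so the numbers say nothing about the real file; only the before-and-after technique carries over.

```shell
# Three sample records in the post's layout. The middle record is made up.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1610010000  24.10    3.0  163.0   53.0   51.0   80.0   12.9  204.0    6.0    0.0
1611191600  30.05    4.0  180.0   61.0   50.0   70.0   13.1  190.0    7.0    0.0
1612311600  48.70    2.0   99.0   35.0   35.0   87.0   13.3  173.0    8.0   47.0
EOF

# Bytes going into the awk stage. Reading from stdin keeps the file name out
# of wc's output; $(( )) strips any padding some wc versions add.
bytes_in=$(( $(wc -c < "$sample") ))

# Bytes coming out of the awk stage.
bytes_out=$(( $(awk '{print $1" "$5}' "$sample" | wc -c) ))

echo "bytes in: $bytes_in, bytes out: $bytes_out"
rm -f "$sample"
```

Dividing one number by the other gives the real reduction for whatever file you point it at.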

3- grep -E 6.{3}$ contains the -E (Extended) option, which enables the {} syntax so that we can specify how many instances of a character we want to match. The preceding dot can be read as 'any one character', so .{3} means exactly three characters of any kind. The trailing '$' matches the end of line -- the opposite of the '^' we used in our first call to grep. The net effect is that only lines ending in a '6' followed by any 3 single characters will survive. Given our NN.N format for the temperature field, we filter out anything except 6N.N, and wc -l now shows only 222 of those short lines left, of 2610. Having filtered out all but 1/12 or so of the data coming into this stage, we now filter down to one line - our answer.

4- tail -n1, which returns only the last n lines, and we specify n=1. Because the data are in increasing time/date order (as can be seen in the output of our first filter) this gives us our last datetime, and answers our question, with greater accuracy than we had thought to ask.
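The '$' in step 3 is doing real work. As a quick sketch, take the 53.0°F reading from the first record shown earlier, reduced to the two fields awk passes along. Without the anchor, 6.{3} finds a match inside the timestamp itself, so a reading in the 50s would slip through:

```shell
# Two-field record as it leaves the awk stage: 53.0°F, so it should NOT pass.
line='1610010000 53.0'

# grep -c prints the number of matching lines. "|| true" only keeps grep's
# non-zero "no match" exit status from stopping a script run with set -e.
unanchored=$(printf '%s\n' "$line" | grep -cE '6.{3}' || true)
anchored=$(printf '%s\n' "$line" | grep -cE '6.{3}$' || true)

echo "unanchored matches: $unanchored, anchored matches: $anchored"
```

The quotes around the pattern are a habit rather than a requirement here; they keep the shell from ever trying to interpret the pattern itself.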

If we needed the date and nothing but the date, we could modify our usage of awk, which is a pattern scanning and processing language. GNU awk has some very interesting capabilities, such as floating point math, true multidimensional arrays, etc. This entire task could have been done in awk, but I wanted to show more of the shell tools, and pipelines, not just 'Cool Things We Can Do With GNU awk'. [1]
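For the curious, here is one way the awk-only version might look. This is a sketch, not what I ran: the sample records are partly made up, and it swaps the 6N.N pattern match for a numeric comparison ($5 >= 60), which would also catch readings of 70 and above. The second command uses substr to keep only the YYMMDD part of the first field.

```shell
# Sample records in the post's layout. Only the 2nd and 5th come from the
# post; the others are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1606151400  29.90    5.0  200.0   75.0   55.0   50.0   13.0  210.0    9.0    0.0
1610010000  24.10    3.0  163.0   53.0   51.0   80.0   12.9  204.0    6.0    0.0
1611191500  30.05    4.0  180.0   62.0   50.0   70.0   13.1  190.0    7.0    0.0
1611191600  30.05    4.0  180.0   61.0   50.0   70.0   13.1  190.0    7.0    0.0
1612311600  48.70    2.0   99.0   35.0   35.0   87.0   13.3  173.0    8.0   47.0
EOF

# Months 10-12 only, temperature (field 5) at least 60: remember the most
# recent hit and print it once the input is exhausted.
last_warm=$(awk '$1 ~ /^161[0-2]/ && $5 >= 60 { last = $1 " " $5 }
                 END { if (last != "") print last }' "$sample")

# Same filter, but keep only the date: the first 6 characters of field 1.
last_date=$(awk '$1 ~ /^161[0-2]/ && $5 >= 60 { last = substr($1, 1, 6) }
                 END { if (last != "") print last }' "$sample")

echo "last reading: $last_warm"
echo "last date:    $last_date"
rm -f "$sample"
```

The June record is there to show the month filter at work: it is the warmest line in the sample, and it is ignored.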

The Shell Will Probably Always Belong in Your Toolbox

I often use far more sophisticated tools when I want to take a long hard look at data. But, file formats vary, data may be missing, etc. As a rule of thumb, you can expect to spend half of the total time spent analyzing data just seeing what's there, and cleaning it up. For much of that work, the shell is a great tool, and it's actually very common to spend a bit of time using the command line to explore. In a broad view, command-line tools can help you determine,  quickly, whether a particular data source contains anything of interest at all, and if so, how much, how it's formatted, etc. And finally, the commands can be saved as part of a shell script, and used over an arbitrary number of similar data files. 
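As a rough sketch of that last point, the pipeline can be wrapped in a function and run over several files. The function name, the station file names, and most of the sample records are hypothetical; only the pipeline itself is the one from this post.

```shell
# Hypothetical helper: last_sixties PREFIX FILE...
# For each file, print the last record whose timestamp starts with PREFIX
# and whose temperature (field 5) reads 6N.N.
last_sixties() {
    prefix=$1
    shift
    for f in "$@"; do
        hit=$(grep "^$prefix" "$f" | awk '{print $1" "$5}' | grep -E '6.{3}$' | tail -n1)
        echo "$(basename "$f"): ${hit:-no match}"
    done
}

# Two small made-up data files to run it against.
workdir=$(mktemp -d)
cat > "$workdir/station-a" <<'EOF'
1610010000  24.10    3.0  163.0   53.0   51.0   80.0   12.9  204.0    6.0    0.0
1611191600  30.05    4.0  180.0   61.0   50.0   70.0   13.1  190.0    7.0    0.0
1612311600  48.70    2.0   99.0   35.0   35.0   87.0   13.3  173.0    8.0   47.0
EOF
cat > "$workdir/station-b" <<'EOF'
1610010000  24.10    3.0  163.0   48.0   45.0   80.0   12.9  204.0    6.0    0.0
1612311600  48.70    2.0   99.0   35.0   35.0   87.0   13.3  173.0    8.0   47.0
EOF

report=$(last_sixties '161[0-2]' "$workdir/station-a" "$workdir/station-b")
echo "$report"
rm -rf "$workdir"
```

Passing the prefix as an argument is a design choice of the sketch: it lets the same function cover a different period without editing the pipeline.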

To a point, anyway. Shells are slow (particularly bash). Though of course there are tools to quantify that as well, and timing work on a subset of the data can give you an idea of when you are going to have to use something else. 'time' is available as a built-in if you are using the bash shell, and any Unix or Linux will also have a 'time' binary somewhere on your search path if the appropriate package is installed. On this machine it's /usr/bin/time, packaged as 'time'. Of the other tools used here, wc and tail are in the 'coreutils' package, while grep and awk ship in packages of their own. Which says something about how useful these tools are. If you aren't using them, you quite literally are not using the core of the Linux/Unix tools.

That is probably a mistake. There is a lot of data out there, stored as textual files of moderate size.

My Ulterior Motive for This

I wanted a post that:
  1. Let me advocate the command line to people who seem to inappropriately default to spreadsheets, which are nothing more than another tool in the box. That box should contain several tools. Consider unstructured data. Or consider binary data formats, which are an intractable problem for both shells and spreadsheets.
  2. Had absolutely nothing to do with security work. Because people are going to be justifiably sensitive about exactly whose security data I might be using as an example. But everybody talks about the weather.
If anyone wants to play with the data, it's available at:
https://drive.google.com/open?id=0B0XLFi22OXDpR3h0UUQ1cmNWbkk
Note to self: find another home for this sort of thing. Google Drive can't even preview a text file.
Note to all: this is not a promise to keep it there for any significant period of time. If I need the space for other things (like client-related things), that file is very, very gone. I recently VVG'ed most of what was in /pub.

[1] I do have one idea for something I'll do with awk one of these days. Because who doesn't like univariate summary statistics combined with 4000 year old Babylonian math, and using NIST-certified results to validate (or invalidate, as the case may be) our code?

Tuesday, September 20, 2016

Greater Yellowlegs

This August, I didn't find nearly the number of species of birds that I did in 2015. This month is also a lot slower. Unlike some birders, for whom the list length is everything (insert obvious crude comparison here), I'm fine with that. Species counts are just another tool I use to try to understand what is going on, on my patch. Counts happen to be a powerful tool, if used well, but it's about understanding, not competition.

Here are a couple of Greater Yellowlegs (a sort of large sandpiper) that I don't see enough of. Work can be a pressure cooker environment, the recent news reports are usually depressing, etc. Being able to walk out the back gate, go down to the river and see a couple of neat birds and fall color, reflected in late-summer low river levels, is a welcome break.

Well. That either matters to you, or it doesn't. If not, I hope you have some other means of coping.

Greater Yellowlegs, Willamette River, Linn Co., OR, 2016-09-04



Friday, August 26, 2016

Does work really expand to fill all available hours?

That might be a perception issue. In a second effort (this week) to free up more time, I just invested half an hour to run an optimization experiment. Amazingly successful and I'll probably save 4-5 hours per week, for a month or more. Huge win, to be sure. Counting both efforts, I get 6-7 hours back.

The thing is, I didn't start really looking for optimizations until I passed a pain threshold. I expect that is pretty typical behavior for us all, and that really sucks for me, on a couple of levels.

First off is professional. Always optimizing stuff is part of the gig.

Second is just personal embarrassment, because missing a forehead-slappingly easy test for bias is, well, personally embarrassing.

That bit of folk wisdom, that work expands to fill all available hours? Like much folk wisdom, not buying it. This was just the most recent iteration of the problem. I think it's much more about pain thresholds, and when we finally realize we can't fit that next Desired Thing into the schedule. Only then do we scurry off and find fixes for the problem.

Perhaps this is a LifeHacking thing. Hard to tell: trying to follow whatever fashion is currently playing out on the Internet is usually an expertise in futility.

But I plainly need to lower my pain threshold, and optimize sooner.

Monday, May 2, 2016

Taking a Break with a Bald Eagle

After an 0-dark-thirty start to Monday, I was ready for a break by mid-morning. Grab a fresh cup of Productivity Fluid, and out onto the deck. That deck is on the second story, and faces a large Black Walnut, the slope down to the river bank, etc. It's a bit of a habit to grab binoculars and a camera on the way out. This morning, the idea was to do a quick bird count, for submission to eBird. Because it's the height of Spring migration, and weird things happen.

And indeed they did. About 40 feet away, in that Black Walnut, were Wood Ducks. Which actually have clawed feet, perch in trees, etc. Hence the name. Photos didn't work out too well. These birds are usually shy, with good reason: there are a lot of hunters on the river during the season. Typically, I see them from a couple of hundred feet away, as they are headed elsewhere. Bummer about the photos, though. Drakes are almost cartoonishly colorful. But the drake had a branch between us, and it was obvious that if I moved much, they were going to spook.

And ... they did.

A few minutes later, an immature Bald Eagle flew into the same tree, and pretty obviously was not worried about me at all. There was a lot of ray-catching and preening involved. Here's the bird pausing from a bit of luxurious back-preening to make sure the silly human isn't doing anything, well, silly.


And of course, I had to take the obligatory head photo. 


That bird hung out for at least two hours. Seemingly just enjoying the morning. I, unfortunately, had to get back to the salt mines. An early start already looks like it will extend into a late night.

How's your day going?

Wednesday, April 27, 2016

Green-hued Purple Finch

April is drawing to a close. It's been a great (for small values of great) month on the birding front. I added half a dozen or so species to the April all-years list at my local patch. Which is currently wedged at 99, and seems likely to finish that way. So close, and yet so meh. I'm chalking that up to a somewhat early migration for much of the year so far. Which seems to be ending.

This blog is very much not the place for accounts of the truly rare. Odd, I can occasionally do. My patch is a bit different, in that many local birders see White-crowned Sparrow, while I see White-throated, etc. Another difference lies in Purple Finch. Which, for whatever reason, seem more common here than what is typically seen along the Benton/Linn Co. (Oregon) border. Common enough that I get to see PUFI (the 4-letter banding code for Purple Finch; see the canonical USGS reference for the whole thing) in unusual plumage.

This group is a bit prone to weirdness. I have photos of House and Purple Finch in hues that might be best described as golden, rather than red/rose/purple, and I've seen references to that being a function of diet. But green is a bit off-the-wall, in my experience. Here is the only green-hued Purple Finch I've ever seen, and that was on 2016-04-01. April Fools Day. No way was I going to post that the day I saw her.



But perhaps not so outlandish as all that. A Web search found one reference, the Purple Finch entry in John J. Audubon’s Birds of America, which seems to indicate that this hue can be common, at least toward the eastern US. OTOH, that was a long time ago, far from my patch in Oregon's Willamette Valley, and Audubon was, well, a bit dubious in some respects.

What do modern field guides have to say? In alphanumeric order, looking for any reference to 'green' I found the following.

  • National Geographic Field Guide to the Birds of North America: no mention.
  • Peterson Field Guide to Western Birds, 2nd edition: no mention.
  • Peterson Field Guide to Western Birds, 3rd edition: no mention.
  • Sibley Guide to the Birds, 1st edition: "Pacific females are washed greenish above..."
  • Sibley Guide to the Birds, 2nd edition: "Pacific females are washed greenish above..."
  • Stokes Field Guide to the Birds of North America: no mention.

Does this, in any way, constitute a recommendation for a field guide? Well, no. Aberrant golden hues are, in my limited experience as a patch birder, far more common amongst House and Purple Finch. I've seen dozens of golden-hued birds of each species, and exactly one greenish finch. Yet golden birds get no mention at all.

Does that mean that I regard popular field guides as equally wrong? Well, no. Tremendous effort was expended by very talented people in creating these guides. A lot of financial risk was assumed by all parties -- including publishers. Personally, I doubt that the vagaries of plumage variations can ever be adequately described in a field guide. Not least because human languages cannot adequately describe color. Ask a fly fisher what 'dun' refers to.

I confess that Sibley is my favorite, but this is not an example of why.












Sunday, March 20, 2016

Workstation Wallpaper, Courtesy of ESO

This is from The European Southern Observatory, specifically the VISTA Magellanic Cloud Survey view of the Tarantula Nebula. For the last several years, an edited version of it has been the wallpaper on my main workstation, which is always named feynman. I'm just a bit strange that way; the wallpaper on my phone is an image from the Hubble Ultra Deep Field.


Before that, wallpapers were slideshows, but that is a historical, and also somewhat biographical, note for another post, along with that workstation name, which rolls from machine to machine as technology improves.

I could leap from this to some ranty post about why various sorts of hamburger, skate boards, etc., are not awesome. That, too, should be in another post.

I find myself rather sad this morning. Last night I was looking for a bit of imagery related to the UA Mirror Lab. I've mentioned them over at G+. They are currently producing optics for another large-telescope effort, the Giant Magellan Telescope. No link there, because I just found them trying to do some highly obnoxious Web tracking, involving HTML canvas. That might be another post, but one more suited to my security blog. Some days, the sadness just piles up.

On a brighter note, I found what I was looking for (and much more) by stumbling across, then doing a topic search on, a great blog. GMT4 Unveil, about casting the 4th GMT mirror. Here's the time-lapse video, on YouTube, but you should really read the blog post, and follow the link from there. Casting an 8.4 meter mirror, a process which takes months even at what is unquestionably the finest large-scale optical fabrication facility in the world, is nontrivial.

So, why so sad?

Because in the course of checking out that great Ketelsens blog, I found a couple of things. One was a mention of Bob Goff. That name rang a bell -- he was a friend of a friend, years ago. Now dead at an early age, as is the friend (Larry Forrest, founder of since-sold Glass Mountain Optics) who used to mention him. Larry died unexpectedly, also before his time. He and Sharleen Forrest, his wife of many years, are two of the finest people I have ever known. Notice my use of the present tense for Larry. That is intentional. In ways that seem important to me, Larry is very much alive; he's just impossible to contact. Which sucks, but there it is.

In addition, the Ketelsens Blog mentions yet another person who has passed far too early, Dave Harvey. They both worked at Steward Observatory, but Harvey was on the software development, as opposed to the optical fabrication, side of things. He apparently regarded himself as more of a photographer. This guy, I can only wish I had had some connection to. Yes, I would be even more sad right now, but I still feel that that was a connection I missed out on, to my loss. 

When I was a child, I messed around with telescopes. No surprise, right? I was mainly interested in the structure of our galaxy, probably because some of the more spectacular sights were accessible to my small telescopes -- the classic 6-inch f8 Newtonian reflector I owned at the time was very mainstream. And even then, what we were learning about something like the structure of the galaxy, or establishing the distance scale of the universe, set the standard for what might be regarded as awesome. An awesome hamburger? Yeah, right. Awesome was a reserved word for me before I even hit my teens, and this is not subject to change. Period.

So here's this Dave Harvey guy, also a telescope and software guy, who is using small but drool-worthy gear to take some fantastic imagery. Such as the Rho Ophiuchus region (Go there. Really. He did it with a 5-inch astrograph.) of the Milky Way, the subject of another long-term wallpaper on feynman. He was also a general-purpose photographer who was deeply knowledgeable about his craft, and who put in a lot of effort to get things right.

I loves me some birds. Always have. But there are a lot of bird blogs out there by people who do a lot of photography (with far better/heavier/bulkier gear than I am willing to carry) who seem more interested in being cool. I have no firm idea of how birding might be seen as cool, seemingly by the same people who might judge that a hamburger could be in any way awesome, but I digress. More importantly, they don't seem to get things like the time-honored wisdom of 90% of all images needing a border. Or they simply can't be bothered.

In the few images I have ever posted, anywhere, I have tried to follow that rule. But sometimes I got in a rush, or simply lazy, and didn't. I suspect that it would not even have occurred to Dave Harvey to do such a thing. That, my friends, is attention to detail, and dedication to your craft. So Dave Harvey is still around, too, if one appreciates the debt we owe to all of those who teach and inspire.

There is no deep wisdom, now becoming obvious, in all this. At some level we all realize that associating ourselves with smart, dedicated people, who are also willing and able to teach something of whatever it is that they do, is a useful guideline. But it was one of those weird moments, which we probably all have, when many seemingly-disparate things all became connected. 


So why even write this?


Two reasons.

  1. The off chance that it might be my subconscious getting impatient and shouty. Which does happen to me. For instance, I have had great results from "sleeping on" problems.
  2. Spreading the word about outstanding work by others is always A Good Thing.