
In my field, and in science generally, raw data is never ever in a usable form.

This is one of the many things Hollywood persistently gets wrong: the / / omni-computer-geek opens up the file, stares at a bunch of or , and says, “Ah hah! If I the to reverse the on the , I can the sequence to the of the ! Oh, and make dinosaurs if you want, but that’s extra.”

Bonus points if the screen projects on said scientist’s face and reflects from the inevitable chunky-framed glasses. Scribbling equations backward on a transparent whiteboard may also be involved.

Scientists, as I have said many times before and no doubt will need to say many times again, are people. We’re pretty good with numbers, yes, as a rule. But what we’re good at doing with those numbers is not reading and understanding them. It’s using them as the raw materials for a product that makes sense to the human brain. Words, pictures, and a MUCH SMALLER number of numbers are our goal. Also continued funding, which is about the kind of numbers everyone understands.

Before we process the numbers, we need to “preprocess” them. There are several intermediate steps between the really raw data and the cover story for next week’s issue of Nature. Preprocessing is where we turn the glowing symbols projected onto our faces into something that kinda-sorta makes sense. It’s still not really readable, but people looking at it, who know what they’re looking at, can tell what it represents.

Usually this is in the form of one or more tables: for a familiar example, think of an Excel workbook with several large spreadsheets. (In reality, storing data in Excel is a terrible idea, but I’ll stick with that metaphor.) Nobody’s going to read and digest everything in the workbook. But you can look at the headers and a few of the values and at least have an idea where to start. Preprocessing gets you to that point.

For most types of data, preprocessing is fairly standardized. You don’t have to write your own code: someone else has already done that work for you. Just pick a package, run the raw data through it, and glance at the output to make sure nothing went horribly wrong. Now you’re ready to write the code only you can write, to discover the Secrets of Life Itself. Now is the time for SCIENCE.

Or Nature. Or The Journal Of Obscure Subfield Ten People In The World Know Exists. Or a tech report. You know, whatever.

Careful readers will have noticed the word “fairly” above. In fact there are multiple algorithms to choose from, and multiple packages implementing those algorithms, written at 3:00 AM by an exhausted postdoc who really just wanted to check the cultures one last time and grab the remaining half a chicken salad sandwich from the break room fridge and go home and crawl into bed for a few hours’ sleep before dragging ass back in tomorrow. Shower optional.

Other exhausted postdocs and their harassed PIs, who get somewhat more sleep and a somewhat finer grade of chicken salad but are much more worried about upcoming funding application deadlines, may or may not bother to write down which package they use to preprocess their data. Or what specific parameters they tuned. Or if they even know how they’re supposed to use the damned thing: there’s a really good chance they just ran the data through on the default settings, got something that looked reasonable, and called it a day.

Amazingly, most of the time this doesn’t really matter. Data has a life of its own. The bigger the data set gets, and these days nearly all data are “big data,” the more likely it is that any reasonable method will produce similar results. Good thing too, otherwise science (and Science) would grind to a screeching, shuddering, smoking halt.

Sometimes it matters a lot. Careful scientists check, just in case. I try to be one of those, and when I’m not, my coworkers pick up the slack. Luckily for me, for most of my career I’ve found myself in the company of those who live up to that standard, and I can mostly convince myself I do the same. Another item on Hollywood’s long list of sins: science is not a solo enterprise. In fact it’s deeply social, which is one of several reasons why the stereotype of scientists as loners is a load of crap. But I digress.

In case you’re wondering if this has a point, yes it does, and here it is: all the above is why my boss recently sent me a message saying, “Woah yeah ok so maybe you do need to process from raw after all. B/c idk wtf that is.”

Without any irony at all: I love my job.

@medigoth
I'm sure Sally would agree that making dinosaurs is worth the "extra".

@Marquestor Indeed, she insists on it. And you know, she can be very insistent.
