There are two philosophies toward handling questionable data. The first is to check the integrity of the data every time it's used. This takes a fair amount of time to write, and depending on the size of the data may also take a fair amount of time to run. It's a PITA to write, test, debug, and run.
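For concreteness, here is a minimal sketch of what "check it every time it's used" might look like in Python. The record layout (dicts with `id` and `value` fields) and the function names `check_integrity` and `mean_value` are hypothetical, made up for illustration; real checks depend entirely on the data set.

```python
import math


def check_integrity(rows, required=("id", "value")):
    """Raise ValueError if the data is empty, a field is missing,
    a value is not a finite number, or an id repeats."""
    if not rows:
        raise ValueError("no rows to check")
    seen_ids = set()
    for i, row in enumerate(rows):
        for key in required:
            if key not in row or row[key] is None:
                raise ValueError(f"row {i}: missing {key!r}")
        value = row["value"]
        # bool is a subclass of int, so reject it explicitly
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError(f"row {i}: bad value {value!r}")
        if row["id"] in seen_ids:
            raise ValueError(f"row {i}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
    return rows


def mean_value(rows):
    # First philosophy: re-check on every use, not just once at load time.
    check_integrity(rows)
    return sum(row["value"] for row in rows) / len(rows)
```

Calling `mean_value` on a list containing a NaN or a duplicate `id` raises `ValueError` instead of quietly returning a number that looks plausible. The cost is one extra pass over the data per call.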

The second is to say "I've already checked this data a bunch of times in the program, it's fine" and skip the integrity checks after the first time. In programming, this is particularly tempting: the data sets are huge, and writing checks is annoying. The whole thing feels like a waste of time when you're reasonably sure your code will never run on anything except this particular data set which you already see more of than your family and your pets and you just want to get the damned thing done.

About 95% of the time, I take the first approach. Every time I do it, I'm grumbling to myself. Just finish it, already! And I am uneasily aware that those who take the second approach get their work done faster than I do.

Yes. This is true.

They also get a lot of garbage results, many of which don't look like garbage at all. Here comes the ritual chest-thumping ... in , and generally, those mistakes don't just lead to flawed publications, as bad as that is. Garbage results kill people.

I just received a lesson in why the first approach is a really good idea. Let's be careful out there.

@medigoth Data sufficiency, quality, values near detection limits, collection methods, and so on are real issues for environmental investigations as well. More attention needs to be paid to these concerns so we can have more confidence in the statistics that result.


@ChemBob I believe it! Really I think all scientific programming suffers from the "I don't want it good, I want it Tuesday" problem.

Qoto Mastodon
