The rules of analysis

Opinions, I got 'em

While mentoring some juniors, I started to think about the rules of thumb for analysing data that I've built up over the years. While I'm certainly not the world's greatest data scientist (or its greatest bioinformatician, statistician, biomedical scientist, etc.), it seems worthwhile trying to capture them here. These are obviously shaped by my experience and domain, healthcare and biomedical data.

This is a living document and will be extended as I think of more.

Analysis is code

Serious data analysis should be done with code, proper code. Not Excel, not JMP, not a plotting package, not macros. Why?

  • It's reproducible
  • It's documentable (i.e. the option is there)
  • It's versionable
  • It can be worked on by multiple people
  • It's debuggable
  • It's easier to revise
  • Doing it the long (hard) way is usually faster for non-trivial tasks

I still remember with horror the 20K-strong patient cohort that a colleague split into a myriad of sub-cohorts, based on a huge list of complex criteria, in a single massive Excel spreadsheet. It involved a huge number of sub-sheets, macros and scripting, linked in a thick mesh of cross-references. It took months to put together. Did it work? There was no way of knowing - it just didn't obviously not work. That's not good enough.

Reports are generated straight from analysis

Summaries and results should be auto-generated from analysis, rather than cutting and pasting into Word or Powerpoint. Why?

  • Faster, once it's set up, and faster to repeat
  • Reproducible
  • One document / workflow
  • Prevents mistakes and omissions when you update
  • Copy-and-pasting into Powerpoint tends to encourage a style with excess verbiage and showmanship, with discreetly massaged results being shamefully tacked on the end, and questions discouraged. Look up "The Cognitive Style of Powerpoint"

This admittedly can take a while to set up the first time, but it pays off in the long term.
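
As a minimal sketch of what this can look like (the file, column and report names here are hypothetical, and the details will vary with your tooling), a script can compute its summaries and write the report in the same step:

    # report.py - a sketch: run the analysis and emit the report together,
    # so the numbers in the document always come from the code that made them.
    # File and column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("cohort.csv")
    summary = df.groupby("site")["age"].describe()

    with open("report.md", "w") as out:
        out.write("# Cohort summary\n\n")
        out.write(f"Records analysed: {len(df)}\n\n")
        out.write(summary.to_markdown())   # table taken straight from the analysis
        out.write("\n")

Literate tools (R Markdown, Jupyter, Quarto and the like) are the same idea with less plumbing.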

Don't invent your own data formats

An old rule for bioinformatics. Why?

  • You're probably not as good a format designer as you think
  • How will others know how to parse your data? Will they have to write a custom parser?
  • You'll never get around to formally documenting your format. See previous point.

Meta-formats, or formats built on top of other formats (like storing data in JSON or YAML), are acceptable. At least then there are tools to read and write them, and the base format will stop a lot of bad design decisions. If there's a published or popular format that's 90% right for you? Maybe you should use that.

See the long history of ambiguities, extensions and inconsistencies in NEXUS or even something as simple as the NEWICK tree format.
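
As a sketch of what a meta-format buys you (the record layout here is hypothetical), putting your records inside JSON means any language can read and write them without a custom parser:

    # A home-grown record structure carried by JSON rather than a bespoke
    # text format. Field names are hypothetical.
    import json

    samples = [
        {"id": "S001", "collected": "2021-03-04", "site": "Melbourne", "reads": 1523301},
        {"id": "S002", "collected": "2021-03-05", "site": "Sydney", "reads": 998210},
    ]

    with open("samples.json", "w") as out:
        json.dump(samples, out, indent=2)

    with open("samples.json") as src:
        assert json.load(src) == samples   # round-trips cleanly, no custom parser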

Store data in humane, common ways

Store data in CSV. If you can't do that, store in JSON or YAML. If you can't do that, SQLite. Why?

  • It's human-readable & thus debuggable
  • (Except for SQLite) It's plain text, so in a pinch you can look at or edit it with a text editor
  • We can deal with different text encodings easily
  • It can be read and written by a huge variety of software
  • The behaviour and definition of those formats are well understood.

Obviously, there are datasets that don't fit easily into this because they're too big, inherently binary or just weird. But most of the data we use can be readily handled by the above tools. If you have to go big, think about using something common like HDF5.
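
For the common case, the plain-text round trip costs almost nothing; a sketch with a hypothetical table and nothing but the standard library:

    # Write and re-read a small table as CSV. Column names and values
    # are hypothetical; the point is that any text editor, spreadsheet,
    # R session or pandas call can also read this file.
    import csv

    rows = [
        {"patient_id": "P01", "age": "54", "outcome": "recovered"},
        {"patient_id": "P02", "age": "61", "outcome": "readmitted"},
    ]

    with open("outcomes.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["patient_id", "age", "outcome"])
        writer.writeheader()
        writer.writerows(rows)

    with open("outcomes.csv", newline="") as src:
        assert list(csv.DictReader(src)) == rows   # humane and debuggable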

As a rule of thumb, most systems that insist on using XML are deeply pathological.

Make sure that debug and dev modifications are blindingly obvious

Make any debug code blindingly obvious, strongly labelled, easy to find and easy to switch off. Why?

  • You will forget to switch off or delete the debug code and so will accidentally ship analyses that only look at the first 100 records, use dummy datasets, short-cut algorithms or make unsuitable assumptions.

We once had a web-tool that ran observed epi data against a historical database. Every once in a while, it would crash for no obvious reason. The development version ran flawlessly. One day, I got frustrated and sat down to puzzle it out. I sprinkled diagnostic print statements through the code of the production version, repeatedly entering data into the tool until it crashed, narrowing down the problem. Eventually I found the culprit: a single line of code that referred to the debug version of the database, clearly placed there to work out some bug. In development, this worked fine because the only database the code saw was the debug one. In production, most of the data was being pulled from the production db, of course, except in this one circumstance. And the two database versions were similar but not identical. This web-tool had been running for a year, delivering potentially incorrect results.
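
One cheap pattern (a sketch, not the code from that story) is a single, loudly named switch, defined in one place, that announces itself on every run:

    # Make debug behaviour impossible to miss: one shouty flag, one place
    # to look for it, and noisy output whenever it's on. Paths are hypothetical.
    import sys

    DEBUG_USE_TEST_DATABASE = False   # XXX DEBUG - must be False in production

    def get_database_url():
        if DEBUG_USE_TEST_DATABASE:
            print("*** DEBUG MODE: using the TEST database ***", file=sys.stderr)
            return "sqlite:///test_epi.db"
        return "sqlite:///production_epi.db"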

If there's more than one substantial processing or analysis step, and this analysis will be run more than once, use a pipeline or workflow

Why?

  • It'll save you time in the long term
  • It documents your work
  • It makes your work reproducible
  • It makes your work easier to revise

The actual choice of workflow software is a Religious Issue but I'm partial to Snakemake, and many people like Nextflow and Drake.
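
As a flavour of what this looks like (the data files and scripts here are hypothetical), a two-step Snakemake workflow is a handful of declarations; the tool works out the order, re-runs only what's out of date, and the file itself documents the analysis:

    # Snakefile - a minimal sketch of a two-step pipeline.
    rule all:
        input:
            "results/summary.csv"

    rule clean:
        input:
            "data/raw.csv"
        output:
            "data/clean.csv"
        shell:
            "python scripts/clean.py {input} {output}"

    rule summarise:
        input:
            "data/clean.csv"
        output:
            "results/summary.csv"
        shell:
            "python scripts/summarise.py {input} {output}"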

Not obviously wrong does not mean right

Treat your analysis like code and sprinkle it with asserts, validations and sanity checks (e.g. what's the name of the file we're reading, how many records are we processing in this stage). Trust but verify. Actually, just verify. Why?

  • You'll accidentally load the wrong dataset, that hasn't been cleaned or that is from a previous analysis and get the wrong answers.
  • That function might work flawlessly with good data but with bad data it might silently hand you the wrong answer (true of all too much R and bioinformatics code).
  • For many problems, our ability to spot a bad answer is severely limited.
  • In any case, it's better to stop immediately upon an error occurring rather than having to check the end results and repeat the whole analysis. Fail early and fail often.
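
A sketch of what those checks look like around a single load step (the column names and thresholds are hypothetical):

    # Defensive checks around loading a dataset: say what was read, then
    # fail loudly if it isn't what was expected. Names and thresholds
    # are hypothetical.
    import pandas as pd

    INPUT = "cohort_cleaned.csv"
    df = pd.read_csv(INPUT)
    print(f"Loaded {len(df)} records from {INPUT}")

    assert len(df) > 10_000, "far fewer records than expected - wrong file?"
    assert {"patient_id", "age", "outcome"} <= set(df.columns), "missing expected columns"
    assert df["patient_id"].is_unique, "duplicate patient IDs - has this been cleaned?"
    assert df["age"].between(0, 120).all(), "implausible ages in the data"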

No closed shop, proprietary tools

Why?

  • You'll want to study them, see how they do their work
  • You'll want to share your work, and you can't do that with people who don't have a licence for it
  • It'll avoid lock-in - perhaps the biggest questions for any analysis platform are: what can't it do? And can I get my data out to use it somewhere else when that happens?

Perhaps you can't prove this is true, but is it useful?

Often, you can't treat the result of an analysis as being true in a rigorous, statistical sense. Why?

  • Data is biased, or cherry-picked
  • A la Brian Wansink, you've tried so many different approaches and slices of the data that one of them was bound to show something
  • The algorithms you're using have only the veneer of a statistical basis, and obscure assumptions
  • Biomedical data in particular is heterogeneous, biased and incomplete. Who knows if it meets the necessary assumptions.

I'm not saying that statistics is unimportant, but it is difficult and can be unreliable. You can easily lie with statistics, and maybe it's impossible to place a statistical measure on your results. So, if your results aren't necessarily true, what can you do? Ask yourself, "This is better than what? And how can I show that?" Look for ways of independently validating or reproducing your findings. Or, if your answers aren't guaranteed to be true, do they show that something is not true? Look for ways for your answers to be useful, to inform other approaches.

(Alice Wong rightly noted this was a bit of a drive-by statement - stats, p-hacking, multiplicity, etc. are vast subjects. But I can't do justice to them in this space, other than to underline their complexity.)

Maybe you should use less SQL

Why?

  • Complex subsetting and massaging of data can be difficult to express in SQL, because it's depauperate as a programming language, and often those operations are cumbersome and verbose.
  • Perhaps because of this, there's a tradition of enormous multi-line SQL statements to get work done. It is difficult to understand what these do, or to debug them. You end up taking the code on faith.
  • The internal representation of domain objects is right in your face, when you should be thinking of their external representation.
  • Asserts, checks and defensive programming? What's that?
  • And there are dialects of SQL so forget porting your analysis.
  • If you're pulling data live from a database, how do you version that?
  • In summary, excessive use of SQL moves towards the code being a blackbox.

Obviously (gestures at 50 years of use), SQL is a tremendously useful thing. But it's not a great tool for communicating intent or writing good complex code. Fortunately, you can just use a small amount of SQL to do a broad extract of data, then do the heavy lifting in a real programming language, or use an abstraction layer (like SQLAlchemy) so that you're manipulating objects, not rows scattered across tables.
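
A sketch of that split (table and column names hypothetical): one short, broad SQL extract, then the fiddly logic in ordinary code where it can be read, tested and versioned:

    # A small amount of SQL for the broad extract, with the complex
    # subsetting done in pandas. Table and column names are hypothetical.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("study.db")
    admissions = pd.read_sql_query(
        "SELECT patient_id, admitted, discharged, diagnosis FROM admissions",
        conn,
    )
    conn.close()

    # The intent-carrying logic lives in code, not in a 200-line query.
    admissions["admitted"] = pd.to_datetime(admissions["admitted"])
    admissions["discharged"] = pd.to_datetime(admissions["discharged"])
    admissions["stay_days"] = (admissions["discharged"] - admissions["admitted"]).dt.days
    long_stays = admissions[(admissions["stay_days"] > 14) & (admissions["diagnosis"] == "sepsis")]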