Tools for data

How to store data, what to use

Prompted by a recent tweet asking what people used for storing and managing their data, I wrote down my own hard-won lessons on the topic. In rough order of preference and data complexity:

A hierarchical strategy

Use reStructuredText for documentation

Or Markdown / AsciiDoc. The advantages being:

  • It's a plain text format: you can write or read it anywhere
  • It's about structure, information and relationships rather than presentation, i.e. what you should be worrying about instead of look-and-feel
  • There are many tools for exporting it to other formats
  • If necessary, it can be parsed to extract information
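To make that last point concrete: because it's plain text, even a few stdlib lines can pull structure out of it. A toy sketch with a made-up two-section document (real work should use a proper parser such as docutils):

```python
import re

# A toy reStructuredText document: section titles are underlined with punctuation.
doc = """Introduction
============

Some prose.

Methods
=======

More prose.
"""

# Find lines followed by an underline of = or - at least as long as the title.
titles = [
    m.group(1)
    for m in re.finditer(r"(?m)^(\S.*)\n(=+|-+)\s*$", doc)
    if len(m.group(2)) >= len(m.group(1))
]
print(titles)  # ['Introduction', 'Methods']
```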

Use CSV for tables

Why?

  • Every spreadsheet and stats tool can use it
  • Every programming language can parse it without problems
  • Again, it's a plain text format
  • It doesn't mangle or transform values like Excel has a habit of silently doing
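For instance, Python's stdlib csv module round-trips values untouched; everything stays a string until you decide otherwise (the gene-name and leading-zero values below are made up, but are exactly the sort Excel mangles):

```python
import csv
import io

rows = [["gene", "count"], ["SEPT2", "0042"]]

# Write to an in-memory buffer; a real file works the same way.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Read it back: nothing has been 'corrected' to a date or had its zeros stripped.
back = list(csv.reader(io.StringIO(buf.getvalue())))
print(back)  # [['gene', 'count'], ['SEPT2', '0042']]
```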

Use SQLite for relational data

If you have to store relational data and/or use a database (and I suggest you question this need very closely), use one that

  • is easily portable
  • not beholden to any company
  • uses fairly standard SQL
  • is easy to get data in and out of
  • is just sophisticated enough to store your data and no more

In the context of data science, the lack of permissions / access control in SQLite is in fact an advantage. You're usually just throwing together a database to analyse, not to mount on the web, so a permissions structure is surplus to requirements and can only cause problems.

Should your data outgrow SQLite size-wise, retreat to the other vanilla SQL databases: MySQL or Postgres.
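SQLite also ships with Python's standard library, so a throwaway analysis database is a few lines. A minimal sketch (the table and column names are invented for illustration):

```python
import sqlite3

# An in-memory database; pass a filename instead for a portable single-file db.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (name TEXT, value REAL)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("c", 4.0)],
)

# Standard SQL in, plain Python values out.
(total,) = con.execute("SELECT SUM(value) FROM samples").fetchone()
print(total)  # 8.0
con.close()
```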

Use YAML or XML or HDF5 for complex datasets

YAML for humans, XML for computers. Start with YAML, because:

  • It's highly readable (and a near-superset of JSON)
  • It's easy to write by hand
  • There are parsers for many languages
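A small, made-up example of why it reads so well: nested structure expressed with nothing but indentation, trivially written by hand:

```yaml
# A hypothetical experiment record: readable, hand-editable, easily parsed.
experiment: trial-17
date: 2013-05-02
samples:
  - id: a
    value: 1.5
  - id: b
    value: 2.5
```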

I'm tempted to say that YAML should always be used instead of XML. However, some more complex datatypes are a bit fiddly to manipulate in YAML, and you may need interoperability with XML-friendly tools. If so, use XML, because:

  • It's super-standard
  • There are many tools in every language for handling it.
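Indeed, the tooling is everywhere: Python's standard library alone can parse it. A sketch with a made-up record:

```python
import xml.etree.ElementTree as ET

# A hypothetical experiment record: more verbose than YAML, but universally parseable.
snippet = """<experiment name="trial-17">
  <sample id="a" value="1.5"/>
  <sample id="b" value="2.5"/>
</experiment>"""

root = ET.fromstring(snippet)
values = [float(s.get("value")) for s in root.findall("sample")]
print(values)  # [1.5, 2.5]
```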

If you end up with very large, complicated, heterogeneous data, use CDF or HDF5. Why?

  • At this point, readability is moot. May as well use something that is easy for computers.
  • It's a standard. That's a good enough reason.
  • You can store anything in it.
  • It was made for just this use case.


Just, for God's sake, don't write your own de novo format. It's not as easy as it looks, and untold hours have been consumed by dodgy home-grown formats. If you must define a format, build it on top of YAML or XML. Even then, it's very easy to cock it up.

For web databases

Use REDCap. Really. It does most everything you want: it has sophisticated access control and a nice body of reporting tools, it's easy to design and revise databases, it's easy to import and export data, it's secure, and there's a large body of expertise and experience out there.

(CKAN and OpenDataKit may be good fits for this niche, but I have less firsthand knowledge.)

If you have to go beyond this - say, for a database with heavy relational elements, or a lot of custom tools and datatypes - use Ruby-on-Rails. It is built for quickly producing this sort of CRUD database, making a lot of assumptions to speed development that will almost certainly hold for your data needs. It can't do everything, but it can almost certainly do what you need. Actually, use Hobo, which builds on RoR and allows very rapid database development.

The one downside is the dizzying rate of change in the RoR ecosystem, where yesterday's way is Old-and-Busted and tomorrow's New Hotness doesn't quite work yet. But a little careful, conservative development can get you around this.



What about JSON?

Certainly if you're working in Javascript (WHY?) or interacting with web services, your data may pass through JSON. But as a primary data format? No. And that's before we consider the lack of data-manipulation and search tools.



What about Excel?

Just wait until Excel silently "corrects" a very important piece of your data, or until you try to untangle the difference between a number stored in a numeric cell and one stored in a text cell. Then you'll understand.

Digression: it's often quite hard to avoid Excel, it's so embedded in everyone's workflow. Of course, if someone needs a spreadsheet they're going to use Excel, and so you are forced to deal with it. Even if you personally don't use Excel, it can still cause problems. I once set up a database and a user complained vociferously about how I'd "corrupted the data". They'd downloaded the data and imported it into Excel, and the program 'corrected' the data ...

On the other hand, Excel has little competition in the 'point and click' creation of graphs, and the pivot-table feature is actually very useful. The only thing you can do is aim for least-corruptible formatting. Send and receive data as CSV. Export data in forms that Excel can't interpret and has to treat as plain, unmangleable text.
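One such defensive trick - worth verifying against your own version of Excel - is wrapping fragile values in Excel's `="..."` text-formula syntax, so it has no choice but to treat them as literal text. The helper below is a hypothetical sketch:

```python
import csv
import io


def excel_text(value):
    # Wrap a value as ="..." so Excel treats it as literal text rather than
    # a date or number. A common workaround, not a guarantee: test it locally.
    return '="{}"'.format(value)


rows = [["gene", "code"], ["SEPT2", excel_text("0042")]]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# The csv module quotes the field (it contains a quote character), giving:
# SEPT2,"=""0042"""
print(buf.getvalue())
```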


But shouldn't you use an open-source spreadsheet instead?

And now it's okay to poke yourself in the eye with a stick, because the stick is ideologically correct? The problems above are inherent to spreadsheets, not to Excel alone.

Language-specific encodings

Python pickles, .RData, SPSS .spv and .sav files ... Ask whether you will always and forever be using this program. Ask whether you and everyone you work with will always have access to this program. Ask whether this format, and the program that reads it, will never change.
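If you already have data trapped in such a format, the pragmatic move is to re-export it to something neutral while you still can. A sketch for Python pickles (the pickle here is created inline for illustration; in reality it would be an old file from someone's analysis script):

```python
import json
import pickle

# Pretend this pickle blob came from an old analysis script.
data = {"experiment": "trial-17", "values": [1.5, 2.5, 4.0]}
blob = pickle.dumps(data)

# Round it out of pickle and into JSON: any language can read it from here.
portable = json.dumps(pickle.loads(blob), indent=2)
print(portable)
```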

General rules

  1. Where possible, use open formats that are application- and platform-agnostic. Avoid proprietary formats.
  2. Where possible, use human-readable formats.
  3. Use as simple a format (or tool) as you can get away with.
  4. Do not invent your own format.
  5. Seriously, do not invent your own format.
  6. Go with the crowd and use common tools and formats, so as to leverage others' experience.
  7. Remember the object is not the formatting or storage of the data but the analysis of the data.