(Re-)building databases with csvsql

Tips and traps when going from a database dump to a database

The scenario

You have a bunch of related CSV files.

Maybe they're the result of a raw database dump. Maybe they've been generated in some other way: experimental results, various public data sets, whatever. But the important thing is that you need to make a database from them. Perhaps you need to query them in a way that's best suited to SQL (references across tables, selecting some fields but not others, etc.). Perhaps you need to work with the information in a particular database but, in the absence of a working db connection, have to make do with a raw data dump. Perhaps the original database has been lost or is so severely dysfunctional that normal extraction and reporting tools don't work.

Just for example.

In any event, you need to turn a bunch of CSV files into a working database.

The solution

csvsql is part of the csvkit package and does what the name suggests: it converts a CSV file into a table in an SQL database:

% csvsql --db sqlite:///foo.db --insert bar.csv

The above line converts the contents of bar.csv into the table bar in the SQLite database foo.db. There's a multitude of options and features; to pick a few (with examples after the list):

  • Instead of inserting the data, just generate the relevant SQL.
  • Name the table to be created and specify a schema for it.
  • Work with assorted databases like Access, Firebird, MySQL, etc.
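
For instance (the file, table, and connection details here are placeholders, not from the original post):

% csvsql bar.csv
% csvsql --db sqlite:///foo.db --insert --tables widgets bar.csv
% csvsql --db mysql://user:pass@localhost/foo --insert bar.csv

The first prints the CREATE TABLE statement without touching a database, the second names the resulting table widgets, and the third targets a MySQL backend via an SQLAlchemy connection string.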

Multiple tables can be inserted with repeated calls:

% csvsql --db sqlite:///foo.db --insert bar.csv
% csvsql --db sqlite:///foo.db --insert baz.csv
% csvsql --db sqlite:///foo.db --insert quux.csv
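
If there are many files, a shell loop (Bourne-style syntax assumed) amounts to the same thing:

% for f in bar.csv baz.csv quux.csv; do csvsql --db sqlite:///foo.db --insert "$f"; done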

Possible problems

When making a database, csvsql has to construct a database schema, inferring a type (CHAR, INT, DATE, etc.) for every column from the original table structure. All in all, it does a creditable job of this and guesses reasonable types that you can later work with. Sometimes, however, it messes up and infers incorrectly. More to the point, this happens silently, so the problems may remain hidden.

Text fields are generally not an issue; where problems seem to arise is with numeric data and dates. The first instance I ran across was a record id column that was recorded as a number: 23456, 67890, etc. csvsql mostly, and sensibly, interpreted this as an integer. However, for a few tables it would cast it as a string instead. The problem only became obvious when I queried for certain records and was unable to find them, because I was looking for an integer value, not a string. The conversion was consistent within any given table, but I was never able to discern what in the ids triggered the difference.
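
One way to catch this early is to ask csvsql for the generated DDL alone, without touching a database, and eyeball the column types. The table and column names below are hypothetical, and the exact output depends on your csvkit version:

% csvsql bar.csv
CREATE TABLE bar (
        record_id VARCHAR NOT NULL,
        ...
);

If record_id comes out as VARCHAR rather than INTEGER, queries for integer ids against that column will quietly miss.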

This problem could be overcome in code in a number of ways: converting all ids to integers, adding a pre-processing step in which the database column types were altered, and so on. However, a greater problem arose when csvsql interpreted columns containing dates. It correctly recognised them as dates and cast the resulting database column as a date type, but it parsed the dates incorrectly: the year might be correct while the month and day were wrong.

This was obviously a huge problem. Ultimately, it was probably down to the date formats in the source tables, which used three different datetime formats, one of which had double spaces between the date and the time. But within any given column the datetimes were formatted consistently, so csvsql messed up badly.
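
To see how easily this can happen, consider that a date string like 04/05/2015 is inherently ambiguous between month-first and day-first readings. The dates below are made up, and python-dateutil is used purely as an illustration of a common inference library, not as a documented csvsql internal:

% python -c "from dateutil.parser import parse; print(parse('04/05/2015 10:30'))"
2015-04-05 10:30:00
% python -c "from dateutil.parser import parse; print(parse('04/05/2015 10:30', dayfirst=True))"
2015-05-04 10:30:00

Either reading is defensible in isolation, which is exactly how a whole column can end up with months and days swapped.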

There are a few possible solutions to these issues. csvsql can accept a schema, so it would be possible to do an initial conversion, alter the schema as desired, and use the result for the proper import (sketched below); however, this fixes the integer issue but not the dates. Preprocessing the dates into a suitable and robust format might also work, but I instead opted for a brute-force approach. csvsql includes an option, --no-inference, with which no inference of column types takes place and everything is treated as text:

% csvsql --no-inference --db sqlite:///foo.db --insert bar.csv
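
For completeness, the schema-editing route mentioned above might look something like the following sketch, assuming a csvkit recent enough to have the --no-create flag and sqlite3 on the command line:

% csvsql bar.csv > bar.sql
% $EDITOR bar.sql
% sqlite3 foo.db < bar.sql
% csvsql --db sqlite:///foo.db --insert --no-create bar.csv

That is: dump the inferred CREATE TABLE statement, correct the column types by hand, create the corrected table, then insert the data into it without recreating it.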

With --no-inference, all keys are text (everything is text) and all dates appear exactly as they do in the source. This may create additional problems. If you have a column without any data, csvsql bugs out under --no-inference, although a little bit of hacking can get you around this. Also, in a database where every column is text, comparative queries like WHERE DepartmentSize >= 100 or WHERE DateOfBirth >= '2000-12-25 00:00:00.000' will no longer work reliably, and ORDER BY will sort in textual, not numeric, order. I settled for the all-text conversion, followed by a script to extract and transform the data from the all-text database into a new, properly typed database, so I could sculpt and change the schema as needed (see the sketch below).
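
That final script might look something like this minimal sketch, which assumes hypothetical table and column names (bar, record_id, dob), a single source date format with the double space mentioned above, and nothing beyond the Python standard library:

import sqlite3
from datetime import datetime

src = sqlite3.connect("foo.db")    # the all-text database built by csvsql
dst = sqlite3.connect("clean.db")  # the new, properly typed database

# Recreate the table with the types csvsql should have inferred.
dst.execute("CREATE TABLE bar (record_id INTEGER, dob TEXT)")

for record_id, dob in src.execute("SELECT record_id, dob FROM bar"):
    # Parse the source's double-spaced datetime format (hypothetical), then
    # store it as ISO-8601 text, which sorts and compares correctly in SQLite.
    parsed = datetime.strptime(dob, "%d/%m/%Y  %H:%M:%S")
    dst.execute(
        "INSERT INTO bar VALUES (?, ?)",
        (int(record_id), parsed.isoformat(sep=" ")),
    )

dst.commit()
src.close()
dst.close()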
