Parsing Excel

2011-03-01

by Paul Agapow

in programming

tagged ruby, excel, programmning

Some (very) quick notes on libraries for manipulating (mostly reading) Excel files in Ruby.

Parseexcel

This is the usual / traditional method for reading Excel
Doesn't work on XSLX (modern Excel)
There are a number of versions handing around the web, look on rubygems for latest version
Appears to not call external libraries, uses it's own OLE module
Uses a member on Excel cells called "type" which causes spurious deprecation warnings
The project seems to be moribund, it was last touched in 2007

Writeexcel

Counterpart to Parseexcel, and the usual / traditional method for writing
An active project, last modified November 2010

Spreadsheet

Will read and write Excel
Supposedly builds on Parseexcel
"Will" handle other formats, but currently only does Excel
A quiet project, last touched in 2009
Will handle unicode
There's an irritating error involving the calling of new! method for a class that doesn't have it. Updating your Ruby OLE library should fix this.

Windows OLE bridge

Possible to use this for getting at Excel, but useless if you aren't on Windows (and I'm not)

Roo

The most promising candidate, reads Excel (including XSLX), OpenOffice and Google Docs spreadsheets
Unfortunately achieves this by importing 6 or so other libraries that gem doesn't seem to automatically installed alongside it
Seems to have the best handling / guessing of cell type
Has the Spreadsheet "New!" problem (see above)
Has decent cell type inference
Documentation is a little sketchy (and a little German) but mostly adequate
Reading isn't quite as easy as in the other modules, which allow simple iteration over rows and cells. Instead requires setting the default worksheet, finding the index of the first and last row and columns and iterating over the indices.
Generally the best, most functional and fully featured

Notes on the code

The sample program to test each library has the following parameters (based on a current work task):

It should read a given spreadsheet with an unknown number of rows and return a hash for each row
The keys for the hash are given by an initial header row, the values are the actual cell contents
Column headers can be in any order and are transformed to a canonical form to allow for variations in case, spaces, etc.
Blank cells should be caught and returned as nil
Flanking whitespace is trimmed off
Cell contents should ideally be returned as the best type (e.g. the float 1.23 instead of the string "1.23")
The reader should allow sub-classing, so as to allow custom handling of particular columns

There are some small inconsistencies between the readers (as I changed my mind about the requirements) but they all clean up column headers in the same way (lowercase, trim, spaces to underscores) and all try to guess the cell type and convert it appropriately. There's specialised conversion for two columns. There's a a whole mess of code trying to infer date formats (some of which is plain guessing, e.g. is "1/2/2011" in January or February?) that for brevity is placed in a common file. Excel-formatted dates are stored as "serial values", which is a float that has January 1, 1900 as day 1. This can be converted to an actual Ruby date thus:

def serial_to_date (s)
   # round down and convert from string if need be
   # zero day is actually "Jan 0" and indexing starts from 1, so subtract 2
   s = s.to_i() - 2
   # add to zero day
   JAN_1_1900 = Date.new(1900, 1, 1)
   return JAN_1_1900 + s
end