Parsing Excel

Some (very) quick notes on libraries for manipulating (mostly reading) Excel files in Ruby.

Parseexcel

  • This is the usual / traditional method for reading Excel
  • Doesn't work on XSLX (modern Excel)
  • There are a number of versions handing around the web, look on rubygems for latest version
  • Appears to not call external libraries, uses it's own OLE module
  • Uses a member on Excel cells called "type" which causes spurious deprecation warnings
  • The project seems to be moribund, it was last touched in 2007

Writeexcel

  • Counterpart to Parseexcel, and the usual / traditional method for writing
  • An active project, last modified November 2010

Spreadsheet

  • Will read and write Excel
  • Supposedly builds on Parseexcel
  • "Will" handle other formats, but currently only does Excel
  • A quiet project, last touched in 2009
  • Will handle unicode
  • There's an irritating error involving the calling of new! method for a class that doesn't have it. Updating your Ruby OLE library should fix this.

Windows OLE bridge

Possible to use this for getting at Excel, but useless if you aren't on Windows (and I'm not)

Roo

  • The most promising candidate, reads Excel (including XSLX), OpenOffice and Google Docs spreadsheets
  • Unfortunately achieves this by importing 6 or so other libraries that gem doesn't seem to automatically installed alongside it
  • Seems to have the best handling / guessing of cell type
  • Has the Spreadsheet "New!" problem (see above)
  • Has decent cell type inference
  • Documentation is a little sketchy (and a little German) but mostly adequate
  • Reading isn't quite as easy as in the other modules, which allow simple iteration over rows and cells. Instead requires setting the default worksheet, finding the index of the first and last row and columns and iterating over the indices.
  • Generally the best, most functional and fully featured

Notes on the code

The sample program to test each library has the following parameters (based on a current work task):

  • It should read a given spreadsheet with an unknown number of rows and return a hash for each row
  • The keys for the hash are given by an initial header row, the values are the actual cell contents
  • Column headers can be in any order and are transformed to a canonical form to allow for variations in case, spaces, etc.
  • Blank cells should be caught and returned as nil
  • Flanking whitespace is trimmed off
  • Cell contents should ideally be returned as the best type (e.g. the float 1.23 instead of the string "1.23")
  • The reader should allow sub-classing, so as to allow custom handling of particular columns

There are some small inconsistencies between the readers (as I changed my mind about the requirements) but they all clean up column headers in the same way (lowercase, trim, spaces to underscores) and all try to guess the cell type and convert it appropriately. There's specialised conversion for two columns. There's a a whole mess of code trying to infer date formats (some of which is plain guessing, e.g. is "1/2/2011" in January or February?) that for brevity is placed in a common file. Excel-formatted dates are stored as "serial values", which is a float that has January 1, 1900 as day 1. This can be converted to an actual Ruby date thus:

def serial_to_date (s)
   # round down and convert from string if need be
   # zero day is actually "Jan 0" and indexing starts from 1, so subtract 2
   s = s.to_i() - 2
   # add to zero day
   JAN_1_1900 = Date.new(1900, 1, 1)
   return JAN_1_1900 + s
end