If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html.
I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).
At this point, it looks like roo is a good option again. It supports XLSX, allows (some) iteration by just using times with cell access. I admit, it's not pretty though.
Also, RubyXL can now give you a sort of iteration using their extract_data method, which gives you a 2d array of data, which can be easily iterated over.
Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library that allows you to interface with OLE objects, like the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:
Excel must be installed
A new Excel instance is started for each document
Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.
Again, the options don't look much better, but there isn't much else out there, I'm afraid. it's hard to parse a file format that is a black box. And those few who managed to break it didn't do it that visibly. Google Docs is closed source, and LibreOffice is thousands of lines of harry C++.
The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.
I agree the syntax is not great nor ruby-like. But that can be easily achieved with something like:
class Spreadsheet
def initialize(file_path)
@xls = Roo::Spreadsheet.open(file_path)
end
def each_sheet
@xls.sheets.each do |sheet|
@xls.default_sheet = sheet
yield sheet
end
end
def each_row
0.upto(@xls.last_row) do |index|
yield @xls.row(index)
end
end
def each_column
0.upto(@xls.last_column) do |index|
yield @xls.column(index)
end
end
end
I've been working heavily with both Spreadsheet and rubyXL these past couple weeks and I must say that both are great tools. However, one area that both suffer is the lack of examples on actually implementing anything useful. Currently I'm building a crawler and using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.
require 'find'
require 'rubyXL'
count = 0
Find.find('/Users/Anconia/crawler/') do |file| # begin iteration of each file of a specified directory
if file =~ /\b.xlsx$\b/ # check if file is xlsx format
workbook = RubyXL::Parser.parse(file).worksheets # creates an object containing all worksheets of an excel workbook
workbook.each do |worksheet| # begin iteration over each worksheet
data = worksheet.extract_data.to_s # extract data of a given worksheet - must be converted to a string in order to match a regex
if data =~ /regex/
puts file
count += 1
end
end
end
end
puts "#{count} files were found"
require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
count = 0
Find.find('/Users/Anconia/crawler/') do |file| # begin iteration of each file of a specified directory
if file =~ /\b.xls$\b/ # check if a given file is xls format
workbook = Spreadsheet.open(file).worksheets # creates an object containing all worksheets of an excel workbook
workbook.each do |worksheet| # begin iteration over each worksheet
worksheet.each do |row| # begin iteration over each row of a worksheet
if row.to_s =~ /regex/ # rows must be converted to strings in order to match the regex
puts file
count += 1
end
end
end
end
end
puts "#{count} files were found"
I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess both in api and code.
So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.
The RemoteTable library uses roo internally. It makes it easy to read spreadsheets of different formats (XLS, XLSX, CSV, etc. possibly remote, possibly stored inside a zip, gz, etc.):
require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
puts row.inspect
end
I'm using creek which uses nokogiri. It is fast. Used 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is a hash of row and column name to cell content:
{"A1"=>"a cell", "B1"=>"another cell"}
The hash makes no guarantee that the keys will be in the original column order.
https://github.com/pythonicrubyist/creek
dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/thirtyseven/dullard
simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array:
["a cell", "another cell"]
https://github.com/woahdae/simple_xlsx_reader
Another interesting one is oxcelix. It uses ox's SAX parser which supposedly faster than both nokogiri's DOM and SAX parser. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.
In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_parser as it has similar performance.
Removed dullard as recommended as it's outdated and people are getting errors/having problems with it.
Most of the online examples including the author's website for the Spreadsheet gem demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.
xls = Spreadsheet.open(file_path)
For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this--albeit not well documented at this time (circa 3/2015).
Spreadsheet.open(file_path).worksheets.first.rows do |row|
# do something with the array of CSV data
end
I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.
Here is a table of different Ruby libraries and what they support:
If you care about performance, here is how the xlsx libraries compare:
I have sample code to read xlsx files with each supported library here
Here are some examples for reading xlsx files with some different libraries:
rubyXL
require 'rubyXL'
workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"
worksheets.each do |worksheet|
puts "Reading: #{worksheet.sheet_name}"
num_rows = 0
worksheet.each do |row|
row_cells = row.cells.map{ |cell| cell.value }
num_rows += 1
end
puts "Read #{num_rows} rows"
end