Programing

Ruby로 XLS 및 XLSX (MS Excel) 파일 구문 분석?

lottogame 2020. 10. 28. 07:39
반응형

Ruby로 XLS 및 XLSX (MS Excel) 파일 구문 분석?


XLS 및 XLSX 파일을 구문 분석 할 수있는 gem이 있습니까? Spreadsheet와 ParseExcel을 찾았지만 둘 다 XLSX 형식을 이해하지 못합니다.


방금 roo를 찾았 습니다. 작업을 수행 할 수 있습니다. 기본 스프레드 시트를 읽고 내 요구 사항에 적합합니다.


최근에 Ruby로 일부 Excel 파일을 구문 분석해야했습니다. 풍부한 라이브러리와 옵션이 혼란스러워서 블로그 게시물을 작성 했습니다.

다음은 다양한 Ruby 라이브러리와 지원되는 항목의 표입니다.

여기에 이미지 설명 입력

성능에 관심이 있다면 xlsx라이브러리 비교 방법은 다음 과 같습니다.여기에 이미지 설명 입력

여기에 지원되는 각 라이브러리로 xlsx 파일을 읽는 샘플 코드가 있습니다.

다음은 xlsx다른 라이브러리가있는 파일 을 읽는 몇 가지 예입니다 .

루비 XL

require 'rubyXL'

workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    row_cells = row.cells.map{ |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

require 'roo'

workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows" 
end

작은 만

require 'creek'

workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

simple_xlsx_reader

require 'simple_xlsx_reader'

workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

다음은 라이브러리를 xls사용하여 레거시 파일 을 읽는 예입니다 spreadsheet.

스프레드 시트

require 'spreadsheet'

# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.to_a.map{ |v| v.methods.include?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

보석 Excel 용 위대한 작품을 (.XLS 및 .XLSX)과 적극적으로 개발되고있다.

구문이 위대하거나 루비와 같지 않다는 데 동의합니다. 그러나 다음과 같이 쉽게 달성 할 수 있습니다.

class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  def each_row
    0.upto(@xls.last_row) do |index|
      yield @xls.row(index)
    end
  end

  def each_column
    0.upto(@xls.last_column) do |index|
      yield @xls.column(index)
    end
  end
end

노코 기리를 사용하는 개울을 사용하고 있습니다. 빠릅니다. Macbook Air의 21x11250 xlsx 테이블에서 8.3 초를 사용했습니다. 루비 1.9.3 이상에서 작동합니다. 각 행의 출력 형식은 셀 내용에 대한 행 및 열 이름의 해시입니다. { "A1"=> "a cell", "B1"=> "another cell"} 해시는 키가있을 것이라는 보장을하지 않습니다. 원래 열 순서. https://github.com/pythonicrubyist/creek

dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/thirtyseven/dullard

simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/woahdae/simple_xlsx_reader

Another interesting one is oxcelix. It uses ox's SAX parser which supposedly faster than both nokogiri's DOM and SAX parser. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.

In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_parser as it has similar performance.

Removed dullard as recommended as it's outdated and people are getting errors/having problems with it.


If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html. I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).

At this point, it looks like roo is a good option again. It supports XLSX, allows (some) iteration by just using times with cell access. I admit, it's not pretty though.

Also, RubyXL can now give you a sort of iteration using their extract_data method, which gives you a 2d array of data, which can be easily iterated over.

Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library that allows you to interface with OLE objects, like the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:

  • Excel must be installed
  • A new Excel instance is started for each document
  • Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.

But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access it through it. I'm not sure if it supports iteration, however, I don't think it would be too hard to build around the supplied methods, as it is the full Microsoft OLE API for Excel. Here's the documentation: http://support.microsoft.com/kb/222101 Here's the gem: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html

Again, the options don't look much better, but there isn't much else out there, I'm afraid. it's hard to parse a file format that is a black box. And those few who managed to break it didn't do it that visibly. Google Docs is closed source, and LibreOffice is thousands of lines of harry C++.


I've been working heavily with both Spreadsheet and rubyXL these past couple weeks and I must say that both are great tools. However, one area that both suffer is the lack of examples on actually implementing anything useful. Currently I'm building a crawler and using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.

require 'find'
require 'rubyXL'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  if file =~ /\b.xlsx$\b/                                  # check if file is xlsx format
    workbook = RubyXL::Parser.parse(file).worksheets       # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet|                           # begin iteration over each worksheet
      data = worksheet.extract_data.to_s                   # extract data of a given worksheet - must be converted to a string in order to match a regex
      if data =~ /regex/
        puts file
        count += 1
      end      
    end
  end
end

puts "#{count} files were found"

require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  if file =~ /\b.xls$\b/                                   # check if a given file is xls format
    workbook =  Spreadsheet.open(file).worksheets          # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet|                           # begin iteration over each worksheet
      worksheet.each do |row|                              # begin iteration over each row of a worksheet
        if row.to_s =~ /regex/                             # rows must be converted to strings in order to match the regex
          puts file
          count += 1
        end
      end
    end
  end
end

puts "#{count} files were found"

The rubyXL gem parses XLSX files beautifully.


I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess both in api and code.

So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.


Most of the online examples including the author's website for the Spreadsheet gem demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.

xls = Spreadsheet.open(file_path)

For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this--albeit not well documented at this time (circa 3/2015).

Spreadsheet.open(file_path).worksheets.first.rows do |row|
  # do something with the array of CSV data
end

CITE: https://github.com/zdavatz/spreadsheet


RemoteTable 라이브러리 사용은 ROO 내부. 다양한 형식 (XLS, XLSX, CSV 등)의 스프레드 시트를 쉽게 읽을 수 있습니다. 원격 일 수도 있고 zip, gz 등에 저장 될 수도 있습니다.

require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
  puts row.inspect
end

산출:

{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}

참고 URL : https://stackoverflow.com/questions/3321011/parsing-xls-and-xlsx-ms-excel-files-with-ruby

반응형