David Collins' updates, comments and news
Convert XLSX to CSV in PHP
June 1, 2012
For most basic programming requirements these days, we can usually find an available script that will do the trick. Every now and then a requirement pops up that takes a bit more than piecing together available open-source packages. After searching for available open-source packages to convert xlsx spreadsheets to csv format in a PHP Web application, I concluded the best approach would be to roll my own.
The requirement called for efficient use of memory resources -- something I didn't find in available packages. I needed to find the easiest and fastest way to convert xlsx to csv with PHP.
A little research and a few hours of testing various approaches yielded the component I needed. Before I integrated the translation script into the application, I set up a short demo script and posted it at Github as xlsx2csv.
As with most programming projects, I learned a few things. I knew Microsoft had moved toward more open-source architecture in its office productivity formats, but I'd never found an opportunity to dig into any of the formats.
What I discovered -- with the help of Google searches and today's rich online programming community -- were XML files that could readily be read as objects into arrays then put into csv files. Xlsx-to-csv conversion processes that attempt to load an entire XML file into memory can overload available memory in typical PHP configurations. Rather than increase memory and processing time-out limits, I looked for a process that would avoid loading so much of longer files into memory.
How it works
Files in the Microsoft xlsx format comprise a zipped package of xml files. Xlsx2csv uses PhpConcept's open-source pclzip libarary to unzip xlsx files into a temporary /bin directory.
Xlsx2csv then uses PHP's XMLReader libraries to read XML components of the xlsx format line by line. The iterative approach avoids resource consumption associated with loading large files into memory. XML nodes are then processed by PHP's SimpleXML library.
Xlsx2csv reads only two files from the unzipped xlsx package: /xl/sheet1.xml and sharedStrings.xml. SharedStrings contains strings referenced in sheet1.xml.
Translated files are stored in a /csv directory, then files unpacked to the /bin directory are deleted. As packaged for release, an index.php file provides for upload of xlsx files via a Web browswer and returns converted files to the user's browser for download.
For those who need a desktop application, OpenOffice Calc can handle the job. For server-side processing, where resource consumption isn't an issue, the PHPExcel libary offers a wide selection of configurable tools that can be intergrated into Web applications to process various flavors of Excel files.
Update: LibreOffice also supports xlsx formats, including exporting to other formats. LibreOffice looks and feels like OpenOffice because it's an OpenOffice fork, with support for additional file formats.
For handling large data sets, Python offers a more powerful tool set -- if a project's server environment supports Python -- and an xlsx2csv script available on GitHub. If project requirements don't specify PHP, there's an available APACHE class, and for Linux users a bash script.
Xlsx2csv is a server-side script. It's not a desktop file conversion application. If a user has a PHP-equiped server running on a desktop computer, the script can nonetheless be run on a localhost and can readily convert xlsx file to csv format.
Why I wrote xlsx2csv
The project that inspired development of an xlsx to csv conversion script involved regularly downloading a spreadsheet provided by a public agency, and comparing various versions of those spreadsheets to discover changes. The process involves three steps:
- Download the spreadsheets as cronjobs using cURL.
- Translate the scripts to csv for storage as flat files or entry into a database.
- Compare various csv files and report results
Xlsx2csv handles the second step. The next step in the production process is to develop a CSV comparison script. I already have such a script running in a beta-deployment of similar operation.
A database might seem a more likely approach for data analysis, but the production environment for this project better accomodates accumlation of large flat-file collections.
The requirement implies a need for a csvQL libary -- something I've not found so far for a PHP context. I don't anticipate developing a full query language at this point. I'll either build out from the csv-diffs script I'm already using, or more likely, use the resource-saving approaches I learned in this xlsx2csv project to start from scratch.