Extracting data from Excell
How to use Apache POI library in java

Excell files circulate in the veins of large companies. Ther are times when you want to extract some data and you have to use Java language for this purpose. Let’s see how to quickly extract data from .xls or .xlsx files using Apache POI library.
The library
Maven coordinates of the library(ies) can be found here: https://mvnrepository.com/artifact/org.apache.poi. Note that there are two types of excel spreadsheet formats and apache-poi project has two of its most important libraries to handle these data:
- HSSF (Horrible SpreadSheet Format) is Excell 97-2023 (this very tricky and complex format is handled by
apache-poi.jar
) - XSSF (XML Spreadsheet Format) is Excell 2007+ which is XML-based zipped set of files (Office Open XML format, OOXML which has nothing to do with Open Office) is handled by
apache-poi-ooxml.jar
.
Documentation
In order to extract data from Excell we can take a look into howto documentation of user api.
The librarby supports three types of APIs:
- user api which is high level and which I’m using; it is operating on a document which is read into memory and is relatievly easy to use
- event api which is more low-level and you are parsing XML yourself (for uscases with scarce memory resources)
- streaming usermodel api can be used to parse huge documents in a streaming fassion in case we think that the document might not fit into our heap; and we only register callbacks to handle specific events/tags.
Example xlsx
Let’s extract data about dogs: breed name, type and color of a fur, link to an image and life expectancy. The file was taken from kaggle.com and I assume that this file is freely distributed.
Simple application
We can take and use the latter to set up a very simple gradle project and use it as single dependency:
|
|
Model
Let’s assume we want to keep dog data in a Dog record:
|
|
We need to first define a mapper function that converts cell values of a row to proper Dog instance.
Before that, let’s firs take a look how the file looks like: values we’re interested in start from column ‘B’, so
we need to ‘feed’ Dog instance starting with column 1:
|
|
Note that we cannot be sure if the cell we get from a row is a numeric value or text value or perhaps is null. So here’s a helper function that will give us non-null String with cell contents (time to use some modern java feature like swith expression; I wrote a post (in Polish) about this nice language feature):
|
|
Now, let’s put it all together. Our main method has only single line:
|
|
… and the final part is actually opening the file, creating workbook, taking its first sheet and iterating over rows. Here’s the implementation on readListFromExcell
:
|
|
As you can see, I made this method generic - perhaps I would like to use it later :)
- it takes commandlie arguments and reads file name from first argument
- it then createx XSSF workbook from FileInputStream
- it takes 0th sheet and creates a stream of Rows from rows iterator
- it skips first row and uses mapper to convert following rows to Dog instance
- it returns (possibly empty) list of Rows
An interesting part here is that I needed to create a stream from iterator:
|
|
To cal this app using gradle, pass file name with --args=/path/to/file.xlsx
:
|
|
As a resutl, we’ll get all the dogs printed to stdout:
|
|
Summary
Iteration over cells, columns and rows is easy and this high level API is good enough to be able to do simple parsing.
- The code is available in GitHub repository; see poi-dogs App.java file.
- Header images generated using bing and rescaled and joined with:
|
|