Contents

Extracting data from Excell

How to use Apache POI library in java

Excell files circulate in the veins of large companies. Ther are times when you want to extract some data and you have to use Java language for this purpose. Let’s see how to quickly extract data from .xls or .xlsx files using Apache POI library.

The library

Maven coordinates of the library(ies) can be found here: https://mvnrepository.com/artifact/org.apache.poi. Note that there are two types of excel spreadsheet formats and apache-poi project has two of its most important libraries to handle these data:

  • HSSF (Horrible SpreadSheet Format) is Excell 97-2023 (this very tricky and complex format is handled by apache-poi.jar)
  • XSSF (XML Spreadsheet Format) is Excell 2007+ which is XML-based zipped set of files (Office Open XML format, OOXML which has nothing to do with Open Office) is handled by apache-poi-ooxml.jar.

Documentation

In order to extract data from Excell we can take a look into howto documentation of user api.

The librarby supports three types of APIs:

  • user api which is high level and which I’m using; it is operating on a document which is read into memory and is relatievly easy to use
  • event api which is more low-level and you are parsing XML yourself (for uscases with scarce memory resources)
  • streaming usermodel api can be used to parse huge documents in a streaming fassion in case we think that the document might not fit into our heap; and we only register callbacks to handle specific events/tags.

Example xlsx

Let’s extract data about dogs: breed name, type and color of a fur, link to an image and life expectancy. The file was taken from kaggle.com and I assume that this file is freely distributed.

Simple application

We can take and use the latter to set up a very simple gradle project and use it as single dependency:

1
2
// https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml
implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '5.2.3'

Model

Let’s assume we want to keep dog data in a Dog record:

1
record Dog(String breed, String fur, String color, String gender, String image, int expectancy){}

We need to first define a mapper function that converts cell values of a row to proper Dog instance. Before that, let’s firs take a look how the file looks like: values we’re interested in start from column ‘B’, so we need to ‘feed’ Dog instance starting with column 1: /en/posts/extracing-excell/lo_dogs.png

1
2
3
4
Function<Row, Dog> mapper = (Row r) ->
            new Dog(getCellValue(r.getCell(1)), getCellValue(r.getCell(2)),
                    getCellValue(r.getCell(3)), getCellValue(r.getCell(4)),
                    getCellValue(r.getCell(5)), getCellValue(r.getCell(6)));

Note that we cannot be sure if the cell we get from a row is a numeric value or text value or perhaps is null. So here’s a helper function that will give us non-null String with cell contents (time to use some modern java feature like swith expression; I wrote a post (in Polish) about this nice language feature):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
private static String getCellValue(Cell cell) {
        if (cell == null) {
            return "";
        }

        return switch (cell.getCellType()) {
            case STRING -> cell.getStringCellValue();
            case NUMERIC -> String.valueOf(cell.getNumericCellValue());
            case BOOLEAN -> String.valueOf(cell.getBooleanCellValue());
            default -> "";
        };
    }

Now, let’s put it all together. Our main method has only single line:

1
2
3
public static void main(String[] args) {
      readListFromExcell(args, mapper).forEach(System.out::println);
}

… and the final part is actually opening the file, creating workbook, taking its first sheet and iterating over rows. Here’s the implementation on readListFromExcell:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
private static <T> List<T> readListFromExcell(String[] args, Function<Row, T> mapper) {
        return Arrays.stream(args).findFirst().map(filePath -> {
            List<T> result = List.of();
            try (Workbook workbook = new XSSFWorkbook(new FileInputStream(filePath))) {

                Sheet sheet = workbook.getSheetAt(0); // Assuming the data is in the first sheet
                result = createStreamFromIterator(sheet.rowIterator()).skip(1).map(mapper).toList();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return result;
        }).orElse(List.of());

As you can see, I made this method generic - perhaps I would like to use it later :)

  • it takes commandlie arguments and reads file name from first argument
  • it then createx XSSF workbook from FileInputStream
  • it takes 0th sheet and creates a stream of Rows from rows iterator
  • it skips first row and uses mapper to convert following rows to Dog instance
  • it returns (possibly empty) list of Rows

An interesting part here is that I needed to create a stream from iterator:

1
2
3
4
private static <T> Stream<T> createStreamFromIterator(Iterator<T> iterator) {
    Spliterator<T> spliterator = Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
    return StreamSupport.stream(spliterator, false);
}

To cal this app using gradle, pass file name with --args=/path/to/file.xlsx:

1
gradle run --args='/home/karma/Pobrane/dogs.xlsx' 

As a resutl, we’ll get all the dogs printed to stdout:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
Dog[breed=Bulldog Francés, fur=Corto, color=Gris, gender=Hembra, image=https://http2.mlstatic.com/D_NQ_NP_714611-MEC50253081381_062022-O.jpg, expectancy=10]
Dog[breed=french poodle, fur=Corto, color=Blanco, gender=Macho, image=https://http2.mlstatic.com/D_NQ_NP_693633-MEC50152204932_052022-O.jpg, expectancy=16]
Dog[breed=Bulldog Francés, fur=Corto, color=Gris, gender=Hembra, image=https://http2.mlstatic.com/D_NQ_NP_627613-MEC48299727614_112021-O.jpg, expectancy=10]
Dog[breed=French Minitoy, fur=Corto, color=Blanco, gender=Hembra, image=https://http2.mlstatic.com/D_NQ_NP_843219-MEC45671834670_042021-O.jpg, expectancy=18]
Dog[breed=Pomeranian, fur=Largo, color=Blanco, gender=Hembra, image=https://http2.mlstatic.com/D_NQ_NP_795240-MEC49623128427_042022-O.jpg, expectancy=16]
Dog[breed=American Bully, fur=Corto, color=Gris, gender=Macho, image=https://http2.mlstatic.com/D_NQ_NP_649758-MEC49877576159_052022-O.jpg, expectancy=20]
Dog[breed=West Highland White Terrier, fur=Largo, color=, gender=Macho, image=https://http2.mlstatic.com/D_NQ_NP_803485-MEC44164319487_112020-O.jpg, expectancy=15]
Dog[breed=Bichón Maltés, fur=Largo, color=Blanco, gender=Macho, image=https://http2.mlstatic.com/D_NQ_NP_809643-MEC49927313716_052022-O.jpg, expectancy=15]
Dog[breed=Bulldog Francés, fur=Corto, color=Gris, gender=Macho, image=https://http2.mlstatic.com/D_NQ_NP_783819-MEC49878146251_052022-O.webp, expectancy=10]
Dog[breed=Bulldog Francés, fur=Corto, color=Crema, gender=Hembra, image=https://http2.mlstatic.com/D_NQ_NP_858327-MEC48552688569_122021-O.jpg, expectancy=14]

Summary

Iteration over cells, columns and rows is easy and this high level API is good enough to be able to do simple parsing.

1
2
ls -At|head -n3 | while read fname; do echo convert $fname -scale 800x ${fname%%.jpg}-small.jpg; done
montage -background '#444' -tile 3x1 -geometry +1+1 $(ls *small.jpg) out_dogs.jpg