calamine 0.29.0

An Excel/OpenDocument Spreadsheets reader and deserializer in pure rust
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
# calamine

An Excel/OpenDocument Spreadsheets file reader/deserializer, in pure Rust.

[![GitHub CI Rust tests](https://github.com/tafia/calamine/workflows/Rust/badge.svg)](https://github.com/tafia/calamine/actions)
[![Build status](https://ci.appveyor.com/api/projects/status/njpnhq54h5hxsgel/branch/master?svg=true)](https://ci.appveyor.com/project/tafia/calamine/branch/master)

[Documentation](https://docs.rs/calamine/)

## Description

**calamine** is a pure Rust library to read and deserialize any spreadsheet file:

- excel like (`xls`, `xlsx`, `xlsm`, `xlsb`, `xla`, `xlam`)
- opendocument spreadsheets (`ods`)

As long as your files are *simple enough*, this library should just work.

## Examples

### Serde deserialization

It is as simple as:

```rust
use calamine::{open_workbook, Error, Xlsx, Reader, RangeDeserializerBuilder};

fn example() -> Result<(), Error> {
    let path = format!("{}/tests/temperature.xlsx", env!("CARGO_MANIFEST_DIR"));
    let mut workbook: Xlsx<_> = open_workbook(path)?;
    let range = workbook.worksheet_range("Sheet1")?;


    let mut iter = RangeDeserializerBuilder::new().from_range(&range)?;

    if let Some(result) = iter.next() {
        let (label, value): (String, f64) = result?;
        assert_eq!(label, "celsius");
        assert_eq!(value, 22.2222);
        Ok(())
    } else {
        Err(From::from("expected at least one record but got none"))
    }
}
```

Calamine provides helper functions to deal with invalid type values. For
instance, to deserialize a column which should contain floats but may also
contain invalid values (i.e. strings), you can use the
[`deserialize_as_f64_or_none`](https://docs.rs/calamine/latest/calamine/fn.deserialize_as_f64_or_none.html)
helper function with Serde's
[`deserialize_with`](https://serde.rs/field-attrs.html) field attribute:

```rust
use calamine::{deserialize_as_f64_or_none, open_workbook, RangeDeserializerBuilder, Reader, Xlsx};
use serde::Deserialize;

#[derive(Deserialize)]
struct Record {
    metric: String,
    #[serde(deserialize_with = "deserialize_as_f64_or_none")]
    value: Option<f64>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = format!("{}/tests/excel.xlsx", env!("CARGO_MANIFEST_DIR"));
    let mut excel: Xlsx<_> = open_workbook(path)?;

    let range = excel
        .worksheet_range("Sheet1")
        .map_err(|_| calamine::Error::Msg("Cannot find Sheet1"))?;

    let iter_records =
        RangeDeserializerBuilder::with_headers(&["metric", "value"]).from_range(&range)?;

    for result in iter_records {
        let record: Record = result?;
        println!("metric={:?}, value={:?}", record.metric, record.value);
    }

    Ok(())
}
```

The
[`deserialize_as_f64_or_none`](https://docs.rs/calamine/latest/calamine/fn.deserialize_as_f64_or_none.html)
function discards all invalid values. If instead you would like to return them
as `String`s, you can use the similar
[`deserialize_as_f64_or_string`](https://docs.rs/calamine/latest/calamine/fn.deserialize_as_f64_or_string.html)
function.

### Reader: Simple

```rust
use calamine::{Reader, Xlsx, open_workbook};

let mut excel: Xlsx<_> = open_workbook("file.xlsx").unwrap();
if let Ok(r) = excel.worksheet_range("Sheet1") {
    for row in r.rows() {
        println!("row={:?}, row[0]={:?}", row, row[0]);
    }
}
```

### Reader: With header row

```rs
use calamine::{HeaderRow, Reader, Xlsx, open_workbook};

let mut excel: Xlsx<_> = open_workbook("file.xlsx").unwrap();

let sheet1 = excel
    .with_header_row(HeaderRow::Row(3))
    .worksheet_range("Sheet1")
    .unwrap();
```

Note that `xlsx` and `xlsb` files support lazy loading, so specifying a
header row takes effect immediately when reading a sheet range.
In contrast, for `xls` and `ods` files, all sheets are loaded at once when
opening the workbook with default settings.
As a result, setting the header row only applies afterward and does not
provide any performance benefits.

### Reader: More complex

Let's assume

- the file type (xls, xlsx ...) cannot be known at static time
- we need to get all data from the workbook
- we need to parse the vba
- we need to see the defined names
- and the formula!

```rust
use calamine::{Reader, open_workbook_auto, Xlsx, DataType};

// opens a new workbook
let path = ...; // we do not know the file type
let mut workbook = open_workbook_auto(path).expect("Cannot open file");

// Read whole worksheet data and provide some statistics
if let Some(Ok(range)) = workbook.worksheet_range("Sheet1") {
    let total_cells = range.get_size().0 * range.get_size().1;
    let non_empty_cells: usize = range.used_cells().count();
    println!("Found {} cells in 'Sheet1', including {} non empty cells",
             total_cells, non_empty_cells);
    // alternatively, we can manually filter rows
    assert_eq!(non_empty_cells, range.rows()
        .flat_map(|r| r.iter().filter(|&c| c != &DataType::Empty)).count());
}

// Check if the workbook has a vba project
if let Some(Ok(mut vba)) = workbook.vba_project() {
    let vba = vba.to_mut();
    let module1 = vba.get_module("Module 1").unwrap();
    println!("Module 1 code:");
    println!("{}", module1);
    for r in vba.get_references() {
        if r.is_missing() {
            println!("Reference {} is broken or not accessible", r.name);
        }
    }
}

// You can also get defined names definition (string representation only)
for name in workbook.defined_names() {
    println!("name: {}, formula: {}", name.0, name.1);
}

// Now get all formula!
let sheets = workbook.sheet_names().to_owned();
for s in sheets {
    println!("found {} formula in '{}'",
             workbook
                .worksheet_formula(&s)
                .expect("sheet not found")
                .expect("error while getting formula")
                .rows().flat_map(|r| r.iter().filter(|f| !f.is_empty()))
                .count(),
             s);
}
```

## Features

- `dates`: Add date related fn to `DataType`.
- `picture`: Extract picture data.

### Others

Browse the [examples](https://github.com/tafia/calamine/tree/master/examples) directory.

## Performance

As `calamine` is readonly, the comparisons will only involve reading an excel `xlsx` file and then iterating over the rows. Along with `calamine`, three other libraries were chosen, from three different languages:

- [`excelize`]https://github.com/qax-os/excelize written in `go`
- [`ClosedXML`]https://github.com/ClosedXML/ClosedXML written in `C#`
- [`openpyxl`]https://foss.heptapod.net/openpyxl/openpyxl written in `python`

The benchmarks were done using this [dataset](https://raw.githubusercontent.com/wiki/jqnatividad/qsv/files/NYC_311_SR_2010-2020-sample-1M.7z), a `186MB` `xlsx` file when the `csv` is converted. The plotting data was gotten from the [`sysinfo`](https://github.com/GuillaumeGomez/sysinfo) crate, at a sample interval of `200ms`. The program samples the reported values for the running process and records it.

The programs are all structured to follow the same constructs:

`calamine`:

```rust
use calamine::{open_workbook, Reader, Xlsx};

fn main() {
    // Open workbook
    let mut excel: Xlsx<_> =
        open_workbook("NYC_311_SR_2010-2020-sample-1M.xlsx").expect("failed to find file");

    // Get worksheet
    let sheet = excel
        .worksheet_range("NYC_311_SR_2010-2020-sample-1M")
        .unwrap()
        .unwrap();

    // iterate over rows
    for _row in sheet.rows() {}
}
```

`excelize`:

```go
package main

import (
        "fmt"
        "github.com/xuri/excelize/v2"
)

func main() {
        // Open workbook
        file, err := excelize.OpenFile(`NYC_311_SR_2010-2020-sample-1M.xlsx`)

        if err != nil {
                fmt.Println(err)
                return
        }

        defer func() {
                // Close the spreadsheet.
                if err := file.Close(); err != nil {
                        fmt.Println(err)
                }
        }()

        // Select worksheet
        rows, err := file.Rows("NYC_311_SR_2010-2020-sample-1M")
        if err != nil {
                fmt.Println(err)
                return
        }

        // Iterate over rows
        for rows.Next() {
        }
}
```

`ClosedXML`:

```csharp
using ClosedXML.Excel;

internal class Program
{
        private static void Main(string[] args)
        {
                // Open workbook
                using var workbook = new XLWorkbook("NYC_311_SR_2010-2020-sample-1M.xlsx");

                // Get Worksheet
                // "NYC_311_SR_2010-2020-sample-1M"
                var worksheet = workbook.Worksheet(1);

                // Iterate over rows
                foreach (var row in worksheet.Rows())
                {

                }
        }
}
```

`openpyxl`:

```python
from openpyxl import load_workbook

# Open workbook
wb = load_workbook(
    filename=r'NYC_311_SR_2010-2020-sample-1M.xlsx', read_only=True)

# Get worksheet
ws = wb['NYC_311_SR_2010-2020-sample-1M']

# Iterate over rows
for row in ws.rows:
    _ = row

# Close the workbook after reading
wb.close()
```

### Benchmarks

The benchmarking was done using [`hyperfine`](https://github.com/sharkdp/hyperfine) with `--warmup 3` on an `AMD RYZEN 9 5900X @ 4.0GHz` running `Windows 11`. Both `calamine` and `ClosedXML` were built in release mode.

```bash
0.22.1 calamine.exe
  Time (mean ± σ):     25.278 s ±  0.424 s    [User: 24.852 s, System: 0.470 s]
  Range (min … max):   24.980 s … 26.369 s    10 runs

v2.8.0 excelize.exe
  Time (mean ± σ):     44.254 s ±  0.574 s    [User: 46.071 s, System: 7.754 s]
  Range (min … max):   42.947 s … 44.911 s    10 runs

0.102.1 closedxml.exe
  Time (mean ± σ):     178.343 s ±  3.673 s    [User: 177.442 s, System: 2.612 s]
  Range (min … max):   173.232 s … 185.086 s    10 runs

3.0.10 openpyxl.py
  Time (mean ± σ):     238.554 s ±  1.062 s    [User: 238.016 s, System: 0.661 s]
  Range (min … max):   236.798 s … 240.167 s    10 runs
```

`calamine` is 1.75x faster than `excelize`, 7.05x faster than `ClosedXML`, and 9.43x faster than `openpyxl`.

The spreadsheet has a range of 1,000,001 rows and 41 columns, for a total of 41,000,041 cells in the range. Of those, 28,056,975 cells had values.

Going off of that number:

- `calamine` =>  1,122,279 cells per second
- `excelize` => 633,998 cells per second
- `ClosedXML` => 157,320 cells per second
- `openpyxl` => 117,612 cells per second

### Plots

#### Disk Read

![bytes_from_disk](https://github.com/RoloEdits/calamine/assets/12489689/fcca1147-d73f-4d1c-b273-e7e4c183ab29)

As stated, the filesize on disk is `186MB`:

- `calamine` => `186MB`
- `ClosedXML` => `208MB`.
- `openpyxl` =>  `192MB`.
- `excelize` => `1.5GB`.

When asking one of the maintainers of `excelize`, I got this [response](https://github.com/qax-os/excelize/issues/1695#issuecomment-1772239230):
> To avoid high memory usage for reading large files, this library allows user-specific UnzipXMLSizeLimit options when opening the workbook, to set the memory limit on the unzipping worksheet and shared string table in bytes, worksheet XML will be extracted to the system temporary directory when the file size is over this value, so you can see that data written in reading mode, and you can change the default for that to avoid this behavior.
>
> \- xuri

#### Disk Write

![bytes_to_disk](https://github.com/RoloEdits/calamine/assets/12489689/befa9893-7658-41a7-8cbd-b0ce5a7d9341)

As seen in the previous section, `excelize` is writing to disk to save memory. The others don't employ that kind of mechanism.

#### Memory

![mem_usage](https://github.com/RoloEdits/calamine/assets/12489689/c83fdf6b-1442-4e22-8eca-84cbc1db4a26)

![virt_mem_usage](https://github.com/RoloEdits/calamine/assets/12489689/840a96ed-33d7-44f7-8276-80bb7a02557f)
> [!NOTE]
> `ClosedXML` was reporting a constant `2.5TB` of virtual memory usage, so it was excluded from the chart.

The stepping and falling for `calamine` is from the grows of `Vec`s and the freeing of memory right after, with the memory usage dropping down again. The sudden jump at the end is when the sheet is being read into memory. The others, being garbage collected, have a more linear climb all the way through.

#### CPU

![cpu_usage](https://github.com/RoloEdits/calamine/assets/12489689/c3aa55a8-b008-48ee-ba04-c08bd91c1f6f)

Very noisy chart, but `excelize`'s spikes must be from the GC?

## Unsupported

Many (most) part of the specifications are not implemented, the focus has been put on reading cell **values** and **vba** code.

The main unsupported items are:

- no support for writing excel files, this is a read-only library
- no support for reading extra contents, such as formatting, excel parameter, encrypted components etc ...
- no support for reading VB for opendocuments

## Credits

Thanks to [xlsx-js](https://github.com/SheetJS/js-xlsx) developers!
This library is by far the simplest open source implementation I could find and helps making sense out of official documentation.

Thanks also to all the contributors!

## License

MIT