commonmeta 0.9.6

# commonmeta-rs

commonmeta-rs is a Rust library that implements Commonmeta, the common Metadata Model for Scholarly Metadata. Use commonmeta to convert scholarly metadata between the formats listed below. Commonmeta-rs is a work in progress; the first release was on June 17, 2026. Implementations in other languages are also available ([Go](https://github.com/front-matter/commonmeta), [Python](https://github.com/front-matter/commonmeta-py), [Ruby](https://github.com/front-matter/commonmeta-ruby)).

## Supported Metadata Formats

Commonmeta-rs reads and/or writes these metadata formats:

| Format                                                                                   | Name         | Content Type                            | Read  | Write |
| ---------------------------------------------------------------------------------------- | ------------ | --------------------------------------- | ----- | ----- |
| Commonmeta                                                                               | commonmeta   | application/vnd.commonmeta+json         | yes   | yes   |
| [Crossref XML](https://www.crossref.org/schema/documentation/unixref1.1/unixref1.1.html) | crossref_xml | application/vnd.crossref.unixref+xml    | yes   | yes   |
| [Crossref](https://api.crossref.org)                                                     | crossref     | application/vnd.crossref+json           | yes   | yes   |
| [DataCite](https://api.datacite.org/)                                                    | datacite     | application/vnd.datacite.datacite+json  | yes   | yes   |
| [DataCite XML](https://api.datacite.org/)                                                | datacite_xml | application/vnd.datacite.datacite+xml   | yes   | yes   |
| [Schema.org (in JSON-LD)](http://schema.org/)                                            | schema_org   | application/vnd.schemaorg.ld+json       | yes   | yes   |
| [RDF XML](http://www.w3.org/TR/rdf-syntax-grammar/)                                      | rdf_xml      | application/rdf+xml                     | no    | later |
| [RDF Turtle](http://www.w3.org/TeamSubmission/turtle/)                                   | turtle       | text/turtle                             | no    | later |
| [CSL-JSON](https://citationstyles.org/)                                                  | csl          | application/vnd.citationstyles.csl+json | yes   | yes   |
| [Formatted text citation](https://citationstyles.org/)                                   | citation     | text/x-bibliography                     | n/a   | yes   |
| [Codemeta](https://codemeta.github.io/)                                                  | codemeta     | application/vnd.codemeta.ld+json        | yes   | later |
| [Citation File Format (CFF)](https://citation-file-format.github.io/)                    | cff          | application/vnd.cff+yaml                | yes   | later |
| [JATS](https://jats.nlm.nih.gov/)                                                        | jats         | application/vnd.jats+xml                | later | later |
| [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)                              | csv          | text/csv                                | no    | later |
| [BibTeX](http://en.wikipedia.org/wiki/BibTeX)                                            | bibtex       | application/x-bibtex                    | yes   | yes   |
| [RIS](http://en.wikipedia.org/wiki/RIS_(file_format))                                    | ris          | application/x-research-info-systems     | yes   | yes   |
| [InvenioRDM](https://inveniordm.docs.cern.ch/reference/metadata/)                        | inveniordm   | application/vnd.inveniordm.v1+json      | yes   | yes   |
| [JSON Feed](https://www.jsonfeed.org/)                                                   | jsonfeed     | application/feed+json                   | yes   | later |
| [OpenAlex](https://www.openalex.org/)                                                    | openalex     | n/a                                     | yes   | no    |

_commonmeta_: the Commonmeta format is the library's native format and is used internally.

_later_: we plan to implement this format in a later release.

## Build & run

```sh
cargo build
cargo test
```

The `commonmeta` binary has these subcommands: `convert`, `encode`, `decode`, `import`, `list`, `match`, `migrate`, `push`, `put`, `settings`, and `validate`.

```sh
# Encode/decode a Crockford base32 identifier suffix given a DOI prefix
cargo run -- encode 10.5555
cargo run -- decode 10.5555/nwbyp-29t86

# Convert a single record between formats, fetching it by DOI
cargo run -- convert 10.5555/12345678 --from crossref --to csl

# Convert a local file and write the result to disk
cargo run -- convert record.json --from commonmeta --to csl --file out.json

# Render a formatted citation (CSL style + locale)
cargo run -- convert 10.5555/12345678 --from crossref --to citation --style apa --locale en-US

# Fetch a batch of records from an API and write them as a commonmeta JSON array
cargo run -- list --from crossref --number 100 --type journal-article --file out.json

# Read all records from a local VRAIX SQLite file and convert to another format
cargo run -- list crossref-2026-06-15.sqlite3 --number 0 --to commonmeta --file out.json.gz

# Parquet output (.parquet file extension, --to commonmeta only): records are split into batches of 100,000, written in parallel, and zstd-compressed
cargo run --release -- list crossref-2026-06-15.sqlite3 --number 0 --file out.parquet

# Import a single record by DOI into the local commonmeta database (source auto-detected)
cargo run -- import 10.7554/elife.01567

# Import all Crossref records for a ROR-identified institution (paginates through all results)
cargo run -- import --from crossref --ror 00pd74e08

# Import all records from a Crossref VRAIX daily dump
cargo run -- import --from crossref --date 2026-06-15

# See the Local database section below for the full import command reference
# including annual public data files (Crossref torrent, DataCite TAR).

# Register records with a live InvenioRDM instance (creates/updates and publishes
# real records — registration is currently only supported with --to inveniordm)
cargo run -- push --from crossref --number 10 --to inveniordm --host rogue-scholar.org --token TOKEN

# Same as push, but for a single record (DOI, URL, or file path)
cargo run -- put 10.5555/12345678 --from crossref --to inveniordm --host rogue-scholar.org --token TOKEN

# Match a free-text affiliation string to a ROR organization (uses local DB when available)
cargo run -- match "Leibniz Universität Hannover"
cargo run -- match "Leibniz Universität Hannover" --to inveniordm

# Look up a ROR organization (uses local DB when available)
cargo run -- convert https://ror.org/02nr0ka47
cargo run -- convert https://ror.org/02nr0ka47 --to inveniordm

# Work fully offline — fails fast if a network call would be required
cargo run -- convert record.json --from commonmeta --to csl --no-network
cargo run -- list crossref-2026-06-15.sqlite3 --no-network --file out.json
cargo run -- import crossref-2026-06-15.sqlite3 --no-network
cargo run -- match "Leibniz Universität Hannover" --no-network
```

Use `cargo run -- <subcommand> --help` for the full list of options for each subcommand.
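The `encode` and `decode` subcommands shown above work with Crockford base32 suffixes. As a rough illustration of that encoding only, here is a minimal sketch using the standard Crockford alphabet. The function names are hypothetical, and the sketch does not reproduce how commonmeta generates, groups, or checksums a suffix; it only shows the digit mapping, including the convention that `i`/`l` decode as `1`, `o` decodes as `0`, and hyphens are ignored.

```rust
/// Crockford base32 alphabet (lowercase): no i, l, o, or u.
const ALPHABET: &[u8; 32] = b"0123456789abcdefghjkmnpqrstvwxyz";

/// Encode a number as Crockford base32 digits (hypothetical helper).
fn encode(mut n: u64) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut out = Vec::new();
    while n > 0 {
        out.push(ALPHABET[(n % 32) as usize]);
        n /= 32;
    }
    out.reverse(); // digits were produced least-significant first
    String::from_utf8(out).expect("alphabet is ASCII")
}

/// Decode Crockford base32 digits; returns None on invalid characters or overflow.
fn decode(s: &str) -> Option<u64> {
    let mut n: u64 = 0;
    for c in s.chars() {
        let c = match c.to_ascii_lowercase() {
            '-' => continue,  // hyphens are only for readability
            'i' | 'l' => '1', // easily confused characters map to digits
            'o' => '0',
            other => other,
        };
        let digit = ALPHABET.iter().position(|&b| b as char == c)? as u64;
        n = n.checked_mul(32)?.checked_add(digit)?;
    }
    Some(n)
}

fn main() {
    let encoded = encode(123_456_789);
    println!("{} -> {:?}", encoded, decode(&encoded));
}
```

The round trip `decode(&encode(n)) == Some(n)` holds for any `u64`, which is the property a suffix decoder relies on.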

### `--no-network` flag

`convert`, `list`, `import`, and `match` all accept a `--no-network` flag. When set, any
operation that would make an outbound HTTP request is rejected immediately with a clear error
message. Operations on local files always succeed regardless of this flag. `push` and `put`
always require network access and do not expose this flag.
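
The rule above can be pictured as a single guard that runs before any fetch. The following is a minimal sketch, not the library's implementation: `Source`, `classify`, and `check` are hypothetical names, the classification (URLs and bare DOIs are remote, everything else is a local path) is an assumption, and lookups that can be served from the local database are ignored for simplicity.

```rust
/// Where an input would have to be read from (hypothetical classification).
#[derive(Debug, PartialEq)]
enum Source {
    LocalFile,
    Remote,
}

/// Assumption: URLs and bare DOIs need the network; anything else is a file path.
fn classify(input: &str) -> Source {
    let is_url = input.starts_with("http://") || input.starts_with("https://");
    let is_doi = input.starts_with("10.") && input.contains('/');
    if is_url || is_doi {
        Source::Remote
    } else {
        Source::LocalFile
    }
}

/// Fail fast: reject remote inputs before any request is attempted.
fn check(input: &str, no_network: bool) -> Result<Source, String> {
    match classify(input) {
        Source::Remote if no_network => Err(format!(
            "--no-network is set, but '{}' would require an HTTP request",
            input
        )),
        source => Ok(source),
    }
}

fn main() {
    println!("{:?}", check("record.json", true));
    println!("{:?}", check("10.5555/12345678", true));
}
```

Checking up front, rather than letting a request time out, is what makes the failure immediate.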

## Local database

The `import` command populates a local commonmeta SQLite database with scholarly metadata records. All imports upsert — existing records are updated rather than replaced. The database is also used by `match` and `convert` for offline lookups.

```sh
# Import a single record by DOI (source auto-detected from the DOI prefix)
commonmeta import 10.7554/elife.01567
commonmeta import https://doi.org/10.7554/elife.01567

# Import all Crossref records for an institution (ROR ID, paginates automatically)
commonmeta import --from crossref --ror 00pd74e08

# Import all DataCite records for an author (ORCID, paginates automatically)
commonmeta import --from datacite --orcid 0000-0003-1419-2405

# Import a full daily dump (downloads from metadata.vraix.org)
commonmeta import --from crossref --date 2026-06-15
commonmeta import --from datacite --date 2026-06-15

# Import from a locally downloaded VRAIX dump (source auto-detected from filename)
commonmeta import crossref-2026-06-15.sqlite3

# Import the Crossref annual public data file (~223 GB)
# Option A: Academic Torrents (aria2c required, free)
commonmeta import --from crossref
commonmeta import --from crossref --sample   # first 5 files only (~40 MB)
# Option B: AWS S3 requester-pays bucket (aws CLI + credentials required, ~$18)
# Bucket: s3://api-snapshots-reqpays-crossref   see https://www.crossref.org/documentation/retrieve-metadata/bulk-downloads/
# TAR cached at ~/Library/Caches/commonmeta/crossref/crossref-annual-s3.tar
commonmeta import --from crossref --s3

# Import the DataCite annual public data file (108 M records, 33 GB compressed)
# First run: obtain a time-limited download URL by submitting your email at
#   https://datafiles.datacite.org/datafiles/public-2025
# The TAR archive is cached at ~/Library/Caches/commonmeta/datacite/public-2025.tar
# for subsequent re-imports without a new token.
commonmeta import "https://datafiles.datacite.org/datafiles/public-2025/download?token=<TOKEN>"
commonmeta import "https://datafiles.datacite.org/datafiles/public-2025/download?token=<TOKEN>" --sample
# Re-import or re-parse from cache (no token needed after the first download):
commonmeta import --from datacite
commonmeta import --from datacite --sample

# Import the ORCID Public Data File into the people table
# The file (~46 GB compressed, ~220 M person records) is published annually on figshare:
#   https://figshare.com/articles/dataset/ORCID_Public_Data_File_2025/30375589
# Download the *summaries* file only (ORCID_YYYY_N_summaries.tar.gz, ~46 GB),
# not the full bundle (~221 GB).
#
# Records land in the `people` table, which by default shares the main database.
# Use --people-db to keep people in a separate file:
#   commonmeta import --from orcid --people-db /data/people.sqlite3 "<SUMMARIES_URL>"
#
# NOTE: figshare's download CDN (ndownloader.figshare.com) blocks some server/datacenter IPs.
# If you get a 403 error, download on a local machine and transfer with rsync (resumable):
#   commonmeta import --from orcid --list-releases   # get URL
#   wget -O ORCID_2025_10_summaries.tar.gz "<SUMMARIES_URL>"
#   rsync -az --progress ORCID_2025_10_summaries.tar.gz user@server:/data/
#   commonmeta import --from orcid /data/ORCID_2025_10_summaries.tar.gz
#
# Step 1 — get the direct download URL.
# On a machine where api.figshare.com is accessible (e.g. your laptop):
commonmeta import --from orcid --list-releases
# Prints something like:
#   Year/batch : 2025_10
#   SUMMARIES  : https://figshare.com/ndownloader/files/XXXXXXXX
# Copy the SUMMARIES URL and use it in Step 2.
# Alternatively: open the figshare page linked above, locate ORCID_YYYY_N_summaries.tar.gz,
# and copy its download link directly from the browser.
#
# Step 2 — import (single sequential download → cache → SQLite):
# The TAR is cached at ~/Library/Caches/commonmeta/orcid/ORCID_2025_10_summaries.tar.gz
# so subsequent re-imports read from disk without re-downloading.
commonmeta import --from orcid "<SUMMARIES_URL>"
# On servers where api.figshare.com is accessible, auto-discover and import in one step:
commonmeta import --from orcid
# Test with the first 1,000 records (~200 KB downloaded, then connection closed):
commonmeta import --from orcid --sample "<SUMMARIES_URL>"
# Re-import from a locally downloaded TAR (no URL needed):
commonmeta import --from orcid /path/to/ORCID_2025_10_summaries.tar.gz

# Import the full VRAIX pidbox dump
commonmeta import --from pidbox

# Import latest ROR organization data
commonmeta import --from ror

# Import from a dragoman cache (flushes cache after successful import)
commonmeta import --from dragoman
commonmeta import --from dragoman --dragoman-db /path/to/cache.sqlite3
DRAGOMAN_DB=/path/to/cache.sqlite3 commonmeta import --from dragoman
```

The database path is resolved in this order:

1. `COMMONMETA_DB` environment variable
2. Platform default:

| Platform | Default path                                                   |
| -------- | -------------------------------------------------------------- |
| macOS    | `~/Library/Application Support/commonmeta/commonmeta.sqlite3`  |
| Linux    | `/var/lib/commonmeta/commonmeta.sqlite3`                       |

```sh
# Use a custom path via environment variable
COMMONMETA_DB=/data/commonmeta.sqlite3 commonmeta import --from crossref --date 2026-06-15
```
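
The resolution order above can be sketched as a small pure function. This is an illustration under stated assumptions rather than the library's code: `resolve_db_path` and `platform_default` are hypothetical names, an empty `COMMONMETA_DB` value is treated as unset, and platforms not listed in the table yield `None`.

```rust
use std::env;
use std::path::PathBuf;

/// Platform default from the table above; `home` is only used on macOS.
fn platform_default(os: &str, home: &str) -> Option<PathBuf> {
    match os {
        "macos" => Some(
            PathBuf::from(home)
                .join("Library/Application Support/commonmeta/commonmeta.sqlite3"),
        ),
        "linux" => Some(PathBuf::from("/var/lib/commonmeta/commonmeta.sqlite3")),
        _ => None, // not covered by the table
    }
}

/// Step 1: a non-empty COMMONMETA_DB value wins. Step 2: the platform default.
fn resolve_db_path(env_value: Option<&str>, os: &str, home: &str) -> Option<PathBuf> {
    match env_value {
        Some(path) if !path.is_empty() => Some(PathBuf::from(path)),
        _ => platform_default(os, home),
    }
}

fn main() {
    let env_value = env::var("COMMONMETA_DB").ok();
    let home = env::var("HOME").unwrap_or_default();
    let path = resolve_db_path(env_value.as_deref(), env::consts::OS, &home);
    println!("{:?}", path);
}
```

Passing the environment value and platform in as arguments keeps the function easy to test without touching the real environment.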

### Dragoman cache import on Debian (cron)

The `--from dragoman` import reads from the [dragoman](https://crates.io/crates/dragoman) cache
SQLite database, which uses the same VRAIX transport schema (`pid`, `source_id`, `raw_metadata`). After a
successful import the cache rows are deleted and the file is vacuumed (`VACUUM`), so subsequent cron runs
only process new records.

#### System user and file permissions

```sh
# Create a system user for the commonmeta import job
adduser --system --group --no-create-home commonmeta

# Create and own the commonmeta database directory
install -d -o commonmeta -g commonmeta -m 750 /var/lib/commonmeta

# Allow the commonmeta user to read and write the dragoman cache.
# Dragoman must also be able to write to the same file.
# Option A — shared group (recommended):
adduser dragoman commonmeta   # or whatever user dragoman runs as
chmod g+rw /var/lib/dragoman/cache.sqlite3
chown dragoman:commonmeta /var/lib/dragoman/cache.sqlite3

# Option B — POSIX ACL (if the filesystem supports it):
setfacl -m u:commonmeta:rw /var/lib/dragoman/cache.sqlite3
```

#### `/etc/cron.d/commonmeta`

```cron
# Run the dragoman cache import every 15 minutes.
# Logs go to syslog via logger; the cache is flushed automatically on success.
*/15 * * * * commonmeta /usr/local/bin/commonmeta import --from dragoman 2>&1 | logger -t commonmeta-import
```

#### Environment variables (`/etc/default/commonmeta`)

```sh
# Override the default database paths if needed.
COMMONMETA_DB=/var/lib/commonmeta/commonmeta.sqlite3
DRAGOMAN_DB=/var/lib/dragoman/cache.sqlite3
```

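Load them in the cron job by prefixing the command with `env $(grep -v '^#' /etc/default/commonmeta | xargs)`
(comment lines must be filtered out, otherwise `env` treats `#` as the command to run), or
use the systemd timer below, whose service unit reads `EnvironmentFile=/etc/default/commonmeta`.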

#### systemd timer alternative (`/etc/systemd/system/commonmeta-import.service` + `.timer`)

```ini
# commonmeta-import.service
[Unit]
Description=Import records from dragoman cache into commonmeta database
After=network.target

[Service]
Type=oneshot
User=commonmeta
EnvironmentFile=/etc/default/commonmeta
ExecStart=/usr/local/bin/commonmeta import --from dragoman
StandardOutput=journal
StandardError=journal
```

```ini
# commonmeta-import.timer
[Unit]
Description=Run commonmeta dragoman import every 15 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
Persistent=true

[Install]
WantedBy=timers.target
```

```sh
systemctl daemon-reload
systemctl enable --now commonmeta-import.timer
systemctl status commonmeta-import.timer
journalctl -u commonmeta-import.service -f
```

## Migrate

The `migrate` command applies any pending schema migrations to the local database, optionally backfills the junction tables (`works_orcid`, `works_ror`, `works_references`) that enable fast reverse lookups by author ORCID, institution ROR, and cited DOI, and can rebuild the FTS5 full-text search indexes.

Migrations are idempotent — safe to run repeatedly. No existing records are modified.

```sh
# Apply any pending schema migrations (safe to run any time)
commonmeta migrate

# Populate all three junction tables in a single streaming pass (most efficient)
commonmeta migrate --backfill

# Populate individual junction tables
commonmeta migrate --orcid       # works_orcid  — author ORCID index
commonmeta migrate --ror         # works_ror    — institution ROR index
commonmeta migrate --references  # works_references — cited DOI index

# Combine multiple tables without a full --backfill
commonmeta migrate --orcid --ror

# Restrict a backfill to records from one provider
commonmeta migrate --backfill --crossref
commonmeta migrate --references --datacite

# Rebuild FTS5 full-text search indexes
commonmeta migrate --rebuild-fts

# Migrate a database at a custom path
commonmeta migrate --file /var/lib/commonmeta/commonmeta.sqlite3
```

### Junction tables

| Table | Key column | Populated by | Used for |
| ----- | ---------- | ------------ | -------- |
| `works_orcid` | `orcid` | `--orcid` / `--backfill` | Fast author lookup: all works by an ORCID iD |
| `works_ror` | `ror` | `--ror` / `--backfill` | Fast institution lookup: all works for a ROR ID |
| `works_references` | `ref_id` | `--references` / `--backfill` | Reverse citation lookup: all works that cite a given DOI |
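
To illustrate why a junction table makes reverse lookups fast, here is an in-memory analogue of `works_orcid`. It is a sketch only: the `Work` struct and `build_orcid_index` are hypothetical, the identifiers are placeholders, and the real tables live in SQLite rather than in a map.

```rust
use std::collections::BTreeMap;

/// A work reduced to the two fields the sketch needs (hypothetical shape).
struct Work {
    id: &'static str,
    author_orcids: Vec<&'static str>,
}

/// Build the reverse index in one pass: ORCID iD -> ids of works by that author.
fn build_orcid_index(works: &[Work]) -> BTreeMap<&'static str, Vec<&'static str>> {
    let mut index: BTreeMap<&'static str, Vec<&'static str>> = BTreeMap::new();
    for work in works {
        for orcid in &work.author_orcids {
            index.entry(*orcid).or_default().push(work.id);
        }
    }
    index
}

fn main() {
    // Placeholder identifiers, for illustration only.
    let works = vec![
        Work {
            id: "10.5555/work-a",
            author_orcids: vec!["0000-0000-0000-0001", "0000-0000-0000-0002"],
        },
        Work {
            id: "10.5555/work-b",
            author_orcids: vec!["0000-0000-0000-0001"],
        },
    ];
    let index = build_orcid_index(&works);
    println!("{:?}", index.get("0000-0000-0000-0001"));
}
```

Once the index exists, finding all works for an identifier is a single keyed lookup instead of a scan over every record.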

### FTS5 full-text search indexes

Three SQLite FTS5 virtual tables enable full-text search with Unicode-aware tokenization and diacritic folding (searching "muller" matches "Müller"):

| Table | Indexed columns | Content table |
| ----- | --------------- | ------------- |
| `works_fts` | `title`, `subjects` | `works` |
| `organizations_fts` | `name`, `names_flat` | `organizations` |
| `people_fts` | `name`, `keywords`, `other_names` | `people` |

All three use `content=<table>` so only the inverted index is stored separately — the full text remains in the base table. Expected index size for `works_fts` on a 200 M-row corpus: **5–10 GB**.

#### Deploying FTS5 indexes on an existing database (schema upgrade v4 → v5)

The FTS5 rebuild reads every row in `works` to build the inverted index. On a 200 M-row database this takes 10–45 minutes depending on I/O and is not done automatically on DB open to avoid blocking routine commands.

```sh
# Step 1: apply the schema migration (creates the empty virtual table — fast)
commonmeta migrate

# Step 2: populate the index (reads all rows — slow on large databases)
commonmeta migrate --rebuild-fts
```

Both steps print elapsed time per operation to stderr. Re-running either is safe.

#### FTS5 on a fresh bulk import

When importing the DataCite or Crossref annual data file, `works_fts` is rebuilt automatically at the end of the import loop — no extra step is needed:

```sh
commonmeta import --from datacite   # FTS rebuild runs automatically at the end
commonmeta import --from crossref --s3
```

#### Keeping FTS indexes current

The FTS5 indexes are not updated incrementally: they must be rebuilt when the base table changes. The rebuild is triggered automatically only after full bulk imports. After daily incremental imports or individual-record imports, run `commonmeta migrate --rebuild-fts` periodically (e.g. nightly) to keep search results current:

```sh
commonmeta migrate --rebuild-fts
```

### Resumable backfills

Each backfill flag tracks a `rowid` cursor in the `settings` table. If the run is interrupted (Ctrl-C, machine restart), re-running the same command resumes from where it left off — no records are re-scanned. The cursor is deleted on completion, so a second full run starts fresh.

On a 300 M-record database expect each full-backfill pass to take several hours. Monitor progress on stderr; the command prints a running count and final elapsed time.
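
The cursor behaviour described in this section can be simulated in a few lines. This is a sketch under assumptions, not the actual migration code: a `HashMap` stands in for the `settings` table, the key name `backfill_cursor` is made up, and `stop_after` simulates an interruption.

```rust
use std::collections::HashMap;

/// Stand-in for the `settings` table: key -> value.
type Settings = HashMap<String, i64>;

/// Assumption: the real key name is not documented; this one is hypothetical.
const CURSOR_KEY: &str = "backfill_cursor";

/// Process rows with rowid > cursor (rowids must be ascending), saving the
/// cursor after each row. `stop_after` simulates an interruption after that
/// many rows; None runs to completion. Returns the rowids processed this run.
fn backfill(rowids: &[i64], settings: &mut Settings, stop_after: Option<usize>) -> Vec<i64> {
    let start = settings.get(CURSOR_KEY).copied().unwrap_or(0);
    let mut processed = Vec::new();
    for &rowid in rowids.iter().filter(|&&r| r > start) {
        if stop_after == Some(processed.len()) {
            return processed; // interrupted: the cursor stays in settings
        }
        // ... junction-table rows for this record would be written here ...
        settings.insert(CURSOR_KEY.to_string(), rowid);
        processed.push(rowid);
    }
    settings.remove(CURSOR_KEY); // completed: the next run starts fresh
    processed
}

fn main() {
    let rows = [1, 2, 3, 4, 5];
    let mut settings = Settings::new();
    println!("first run:  {:?}", backfill(&rows, &mut settings, Some(2)));
    println!("resumed:    {:?}", backfill(&rows, &mut settings, None));
    println!("fresh run:  {:?}", backfill(&rows, &mut settings, None));
}
```

The interrupted run leaves the cursor behind, the second run picks up after it, and the third run starts from the beginning because the cursor was deleted on completion.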

## Validate

The `validate` command checks records in the local database against the [commonmeta v1.0 JSON schema](https://commonmeta.org/commonmeta_v1.0.json) and reports any violations. Each failing record shows the JSON Pointer to the offending field and a short description of the constraint that was violated.

Errors are persisted in a `validation_errors` table inside the database so that `--recheck` can quickly re-run only the records that failed last time.

```sh
# Validate all records
commonmeta validate

# Validate only DataCite records
commonmeta validate --from datacite

# Validate only DataCite datasets
commonmeta validate --from datacite --type Dataset

# Validate the first 1 000 records
commonmeta validate --number 1000

# Repair invalid records in-place (re-applies schema normalization)
commonmeta validate --fix

# Re-validate only records that failed in the previous run
commonmeta validate --recheck

# Repair only previously-failing records
commonmeta validate --recheck --fix

# Write errors as JSONL to a file instead of stderr
commonmeta validate --report errors.jsonl

# Validate a different database
commonmeta validate /path/to/other.sqlite3

# Enrich affiliation identifiers using the local ROR database
commonmeta validate --fill --organizations /path/to/ror.sqlite3

# Enrich only Crossref records, cap at 10 000
commonmeta validate --fill --from crossref --number 10000
```

### Options

| Option | Description |
| ------ | ----------- |
| `--from` / `-f` | Filter by provider (`crossref`, `datacite`, `openalex`). |
| `--type` | Filter by work type, e.g. `Dataset`, `JournalArticle`. |
| `--number` / `-n` | Maximum number of records to check (default: all). |
| `--fix` | Attempt to repair invalid records in-place. Applies `prepare()` normalization: removes non-ROR organization ids, clears invalid URIs, deduplicates geo-locations, normalizes EISSN → ISSN, etc. Repaired records are removed from `validation_errors`; records that cannot be repaired remain. |
| `--recheck` | Only re-validate records listed in the `validation_errors` table from the previous run. Combine with `--fix` for an efficient repair loop. |
| `--report` | Write errors as JSONL (one `{"id": "…", "errors": […]}` object per record) to the given file instead of printing to stderr. |
| `--fill` | Enrich affiliation and organization identifiers. See [Fill](#fill) below. |
| `--organizations` | Path to the ROR organizations SQLite database used by `--fill` (default: platform `ror.sqlite3`, env: `ROR_DB`). |

### Repair loop

A typical workflow for cleaning up an imported database:

```sh
# 1. Full first pass — saves all failures to validation_errors
commonmeta validate --from datacite --fix

# 2. Subsequent passes — only re-checks and re-repairs the remaining failures
commonmeta validate --recheck --fix
```

The command exits with a non-zero status if any records remain invalid after the run.

### Fill

`--fill` enriches affiliation and organization identifiers in the works database using the local [ROR](https://ror.org) organizations database (imported with `commonmeta import --from ror`). It runs independently of schema validation and never sets the `valid` flag.

For each record, every contributor affiliation and organization-type contributor is inspected:

| Condition | Action |
| --------- | ------ |
| `id` is a Crossref Funder ID or ISNI | Replaced with the matching ROR URL; name set to the ROR display name; `asserted_by` set to `"Commonmeta"`. |
| `id` is a ROR URL and `name` is empty | Name filled from the ROR database. |
| `id` is already a ROR URL with a name | Left unchanged. |

The organizations database path defaults to the platform `ror.sqlite3` location (macOS: `~/Library/Application Support/commonmeta/ror.sqlite3`). Override with `--organizations` or the `ROR_DB` environment variable.
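
The condition/action table above maps naturally onto a small function. The sketch below is illustrative only: `Affiliation`, `RorDb`, and `fill` are hypothetical names, the URL prefixes used to recognize Crossref Funder IDs, ISNIs, and ROR IDs are assumptions about how identifiers are stored, and leaving a record unchanged when no ROR match exists is also an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Affiliation {
    id: String,
    name: String,
    asserted_by: Option<String>,
}

/// Hypothetical stand-in for the ROR organizations database.
struct RorDb {
    /// Crossref Funder ID or ISNI URL -> ROR URL
    by_other_id: HashMap<String, String>,
    /// ROR URL -> display name
    names: HashMap<String, String>,
}

fn is_ror(id: &str) -> bool {
    id.starts_with("https://ror.org/")
}

fn is_funder_or_isni(id: &str) -> bool {
    id.starts_with("https://doi.org/10.13039/") || id.starts_with("https://isni.org/isni/")
}

/// Apply the three rows of the table to one affiliation.
fn fill(aff: &mut Affiliation, db: &RorDb) {
    if is_funder_or_isni(&aff.id) {
        // Row 1: swap in the ROR URL and display name, and record who asserted it.
        if let Some(ror) = db.by_other_id.get(&aff.id) {
            if let Some(name) = db.names.get(ror) {
                aff.id = ror.clone();
                aff.name = name.clone();
                aff.asserted_by = Some("Commonmeta".to_string());
            }
        }
    } else if is_ror(&aff.id) && aff.name.is_empty() {
        // Row 2: fill the missing name from the ROR database.
        if let Some(name) = db.names.get(&aff.id) {
            aff.name = name.clone();
        }
    }
    // Row 3: a ROR URL that already has a name is left unchanged.
}

fn main() {
    // Made-up identifiers and names, for illustration only.
    let funder_id = "https://doi.org/10.13039/000000000000".to_string();
    let ror_id = "https://ror.org/00example0".to_string();
    let db = RorDb {
        by_other_id: HashMap::from([(funder_id.clone(), ror_id.clone())]),
        names: HashMap::from([(ror_id, "Example University".to_string())]),
    };
    let mut aff = Affiliation { id: funder_id, name: String::new(), asserted_by: None };
    fill(&mut aff, &db);
    println!("{:?}", aff);
}
```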

## Documentation

Documentation (work in progress) for using the library is available at the [commonmeta-rs Documentation](https://rust.commonmeta.org/) website.

## Settings

The `settings` command reads the `settings` table of the local commonmeta SQLite database. Settings rows record installed vocabulary versions and bulk-import dates written by `commonmeta import`.

```sh
# Show all key/value settings
commonmeta settings

# Show settings for a specific database
commonmeta settings --file /data/commonmeta.sqlite3

# Show settings alongside the ORCID people database
commonmeta settings --people-db /data/people.sqlite3

# Show record counts for all main tables
commonmeta settings --stats

# Show record counts for a specific database
commonmeta settings --stats --file /data/commonmeta.sqlite3
```

### Settings options

| Option | Description |
| ------ | ----------- |
| `--file` | Path to the works SQLite database (overrides `COMMONMETA_DB` and the platform default). |
| `--people-db` | Path to the people SQLite database. Shows ORCID Public Data File version alongside the works settings. |
| `--stats` | Show record counts for all main tables instead of settings values. |

### Stats output

`--stats` reports the row count for each of the following tables, or `(table not found)` if the table has not yet been created:

| Table | Contents |
| ----- | -------- |
| `works` | Scholarly metadata records |
| `organizations` | ROR organization vocabulary |
| `people` | ORCID person records |
| `prefixes` | DOI prefix registry |
| `works_ror` | Work → ROR junction (fast institution lookup) |
| `works_orcid` | Work → ORCID junction (fast author lookup) |
| `works_references` | Work → cited DOI junction (reverse citation lookup) |

Row counts use `MAX(rowid)` rather than `COUNT(*)` for instant results even on tables with hundreds of millions of rows. The value equals the true row count when no rows have been deleted from the table.
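
A tiny worked example shows when the two numbers agree and when they diverge. The sketch models a table as the list of rowids it currently holds, assuming rows were appended with rowids 1, 2, 3, and so on; `count_rows` and `max_rowid` are hypothetical helpers that mirror `COUNT(*)` and `MAX(rowid)`.

```rust
/// Mirrors COUNT(*): the number of rows currently present.
fn count_rows(rowids: &[i64]) -> usize {
    rowids.len()
}

/// Mirrors MAX(rowid): the largest rowid present, or 0 for an empty table.
fn max_rowid(rowids: &[i64]) -> i64 {
    rowids.iter().copied().max().unwrap_or(0)
}

fn main() {
    // Five appended rows: both measures report 5.
    let mut rowids: Vec<i64> = (1..=5).collect();
    println!("count={} max={}", count_rows(&rowids), max_rowid(&rowids));

    // Deleting row 3 leaves a gap: the count drops, the maximum does not.
    rowids.retain(|&r| r != 3);
    println!("count={} max={}", count_rows(&rowids), max_rowid(&rowids));
}
```

After a deletion in the middle of the table, `MAX(rowid)` overstates the row count, which is the caveat noted above.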

## Meta

Please note that this project is released with a [Contributor Code of Conduct](https://github.com/front-matter/commonmeta-rs/blob/main/CODE_OF_CONDUCT.md). By participating in this project you agree to abide by its terms.

License: [MIT](https://github.com/front-matter/commonmeta-rs/blob/main/LICENSE)