commonmeta-rs

commonmeta-rs is a Rust library that implements Commonmeta, the common Metadata Model for Scholarly Metadata. Use commonmeta to convert scholarly metadata between a variety of formats, listed below. Commonmeta-rs is a work in progress; the first release was on June 17, 2026. Implementations in other languages (Go, Python, Ruby) are also available.

Supported Metadata Formats

Commonmeta-rs reads and/or writes these metadata formats:

Format	Name	Content Type	Read	Write
Commonmeta	commonmeta	application/vnd.commonmeta+json	yes	yes
Crossref XML	crossref_xml	application/vnd.crossref.unixref+xml	yes	yes
Crossref	crossref	application/vnd.crossref+json	yes	yes
DataCite	datacite	application/vnd.datacite.datacite+json	yes	yes
DataCite XML	datacite_xml	application/vnd.datacite.datacite+xml	yes	yes
Schema.org (in JSON-LD)	schema_org	application/vnd.schemaorg.ld+json	yes	yes
RDF XML	rdf_xml	application/rdf+xml	no	later
RDF Turtle	turtle	text/turtle	no	later
CSL-JSON	csl	application/vnd.citationstyles.csl+json	yes	yes
Formatted text citation	citation	text/x-bibliography	n/a	yes
Codemeta	codemeta	application/vnd.codemeta.ld+json	yes	later
Citation File Format (CFF)	cff	application/vnd.cff+yaml	yes	later
JATS	jats	application/vnd.jats+xml	later	later
CSV	csv	text/csv	no	later
BibTeX	bibtex	application/x-bibtex	yes	yes
RIS	ris	application/x-research-info-systems	yes	yes
InvenioRDM	inveniordm	application/vnd.inveniordm.v1+json	yes	yes
JSON Feed	jsonfeed	application/feed+json	yes	later
OpenAlex	openalex	n/a	yes	no

commonmeta: the Commonmeta format is the native format for the library and used internally.
later: we plan to implement this format in a later release.

Build & run

cargo build
cargo test

The commonmeta binary has these subcommands: convert, encode, decode, import, list, match, migrate, push, put, settings, and validate.

# Encode/decode a Crockford base32 identifier suffix given a DOI prefix
cargo run -- encode 10.5555
cargo run -- decode 10.5555/nwbyp-29t86

# Convert a single record between formats, fetching it by DOI
cargo run -- convert 10.5555/12345678 --from crossref --to csl

# Convert a local file and write the result to disk
cargo run -- convert record.json --from commonmeta --to csl --file out.json

# Render a formatted citation (CSL style + locale)
cargo run -- convert 10.5555/12345678 --from crossref --to citation --style apa --locale en-US

# Fetch a batch of records from an API and write them as a commonmeta JSON array
cargo run -- list --from crossref --number 100 --type journal-article --file out.json

# Read all records from a local VRAIX SQLite file and convert to another format
cargo run -- list crossref-2026-06-15.sqlite3 --number 0 --to commonmeta --file out.json.gz

# Parquet output (.parquet file extension, --to commonmeta only): records are split into batches of 100,000, written in parallel, and zstd-compressed
cargo run --release -- list crossref-2026-06-15.sqlite3 --number 0 --file out.parquet

# Import a single record by DOI into the local commonmeta database (source auto-detected)
cargo run -- import 10.7554/elife.01567

# Import all Crossref records for a ROR-identified institution (paginates through all results)
cargo run -- import --from crossref --ror 00pd74e08

# Import all records from a Crossref VRAIX daily dump
cargo run -- import --from crossref --date 2026-06-15

# See the Local database section below for the full import command reference
# including annual public data files (Crossref torrent, DataCite TAR).

# Register records with a live InvenioRDM instance (creates/updates and publishes
# real records — registration is currently only supported with --to inveniordm)
cargo run -- push --from crossref --number 10 --to inveniordm --host rogue-scholar.org --token TOKEN

# Same as push, but for a single record (DOI, URL, or file path)
cargo run -- put 10.5555/12345678 --from crossref --to inveniordm --host rogue-scholar.org --token TOKEN

# Match a free-text affiliation string to a ROR organization (uses local DB when available)
cargo run -- match "Leibniz Universität Hannover"
cargo run -- match "Leibniz Universität Hannover" --to inveniordm

# Look up a ROR organization (uses local DB when available)
cargo run -- convert https://ror.org/02nr0ka47
cargo run -- convert https://ror.org/02nr0ka47 --to inveniordm

# Work fully offline — fails fast if a network call would be required
cargo run -- convert record.json --from commonmeta --to csl --no-network
cargo run -- list crossref-2026-06-15.sqlite3 --no-network --file out.json
cargo run -- import crossref-2026-06-15.sqlite3 --no-network
cargo run -- match "Leibniz Universität Hannover" --no-network

Use cargo run -- <subcommand> --help for the full list of options for each subcommand.
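
The encode and decode subcommands shown earlier work with Crockford base32 suffixes such as nwbyp-29t86. As a rough illustration of that encoding, here is a minimal sketch using only the Rust standard library. The function names are hypothetical and are not the commonmeta-rs API. The sketch covers the plain number-to-string mapping only; the random suffix generation and any checksum characters the real suffixes may carry are not described in this README and are left out.

```rust
/// Crockford base32 alphabet in lowercase: digits plus letters without i, l, o, u.
const ALPHABET: &[u8; 32] = b"0123456789abcdefghjkmnpqrstvwxyz";

/// Encode a number as Crockford base32.
/// If `split_every` is greater than zero, a hyphen is inserted after every
/// `split_every` characters, counted from the left.
pub fn encode(mut n: u64, split_every: usize) -> String {
    // Collect digits least-significant first, then reverse.
    let mut digits = Vec::new();
    loop {
        digits.push(ALPHABET[(n % 32) as usize] as char);
        n /= 32;
        if n == 0 {
            break;
        }
    }
    digits.reverse();

    let mut out = String::new();
    for (i, c) in digits.iter().enumerate() {
        if split_every > 0 && i > 0 && i % split_every == 0 {
            out.push('-');
        }
        out.push(*c);
    }
    out
}

/// Decode a Crockford base32 string back into a number.
/// Hyphens are ignored, input is case-insensitive, and the look-alike
/// characters i/l and o are read as 1 and 0, as the Crockford scheme allows.
/// Returns None for characters outside the alphabet or on overflow.
pub fn decode(s: &str) -> Option<u64> {
    let mut n: u64 = 0;
    for c in s.chars() {
        if c == '-' {
            continue;
        }
        let c = match c.to_ascii_lowercase() {
            'i' | 'l' => '1',
            'o' => '0',
            other => other,
        };
        let value = ALPHABET.iter().position(|&b| b as char == c)? as u64;
        n = n.checked_mul(32)?.checked_add(value)?;
    }
    Some(n)
}
```

Excluding i, l, o and u from the alphabet is what makes these suffixes tolerant of transcription errors, which is why the decoder can safely normalize look-alike characters.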

`--no-network` flag

convert, list, import, and match all accept a --no-network flag. When set, any operation that would make an outbound HTTP request is rejected immediately with a clear error message. Operations that only read or write local files are unaffected by this flag. push and put always require network access and do not expose this flag.
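
The fail-fast behaviour can be pictured as a guard that runs before any I/O. The following is a minimal sketch, not the actual implementation: GuardError, needs_network and check_input are hypothetical names, and the rule that an input starting with http://, https:// or a DOI prefix ("10.") needs the network is a simplifying assumption. The real binary may first try the local database for such inputs.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum GuardError {
    /// The operation needs an HTTP request but --no-network was set.
    NetworkDisabled(String),
}

/// Assumption: URLs and bare DOIs need the network, everything else is a local path.
fn needs_network(input: &str) -> bool {
    input.starts_with("http://") || input.starts_with("https://") || input.starts_with("10.")
}

/// Reject network-bound inputs up front when `no_network` is set,
/// so the command fails before any request is attempted.
pub fn check_input(input: &str, no_network: bool) -> Result<(), GuardError> {
    if no_network && needs_network(input) {
        return Err(GuardError::NetworkDisabled(format!(
            "--no-network is set, but '{}' would require an HTTP request",
            input
        )));
    }
    Ok(())
}
```

Checking once at the entry point, rather than inside each HTTP call, keeps the error message tied to the user's input instead of to an internal request.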

Local database

The import command populates a local commonmeta SQLite database with scholarly metadata records. All imports upsert — existing records are updated rather than replaced. The database is also used by match and convert for offline lookups.

# Import a single record by DOI (source auto-detected from the DOI prefix)
commonmeta import 10.7554/elife.01567
commonmeta import https://doi.org/10.7554/elife.01567

# Import all Crossref records for an institution (ROR ID, paginates automatically)
commonmeta import --from crossref --ror 00pd74e08

# Import all DataCite records for an author (ORCID, paginates automatically)
commonmeta import --from datacite --orcid 0000-0003-1419-2405

# Import a full daily dump (downloads from metadata.vraix.org)
commonmeta import --from crossref --date 2026-06-15
commonmeta import --from datacite --date 2026-06-15

# Import from a locally downloaded VRAIX dump (source auto-detected from filename)
commonmeta import crossref-2026-06-15.sqlite3

# Import the Crossref annual public data file (~223 GB)
# Option A: Academic Torrents (aria2c required, free)
commonmeta import --from crossref
commonmeta import --from crossref --sample   # first 5 files only (~40 MB)
# Option B: AWS S3 requester-pays bucket (aws CLI + credentials required, ~$18)
# Bucket: s3://api-snapshots-reqpays-crossref   see https://www.crossref.org/documentation/retrieve-metadata/bulk-downloads/
# TAR cached at ~/Library/Caches/commonmeta/crossref/crossref-annual-s3.tar
commonmeta import --from crossref --s3

# Import the DataCite annual public data file (108 M records, 33 GB compressed)
# First run: obtain a time-limited download URL by submitting your email at
#   https://datafiles.datacite.org/datafiles/public-2025
# The TAR archive is cached at ~/Library/Caches/commonmeta/datacite/public-2025.tar
# for subsequent re-imports without a new token.
commonmeta import "https://datafiles.datacite.org/datafiles/public-2025/download?token=<TOKEN>"
commonmeta import "https://datafiles.datacite.org/datafiles/public-2025/download?token=<TOKEN>" --sample
# Re-import or re-parse from cache (no token needed after the first download):
commonmeta import --from datacite
commonmeta import --from datacite --sample

# Import the ORCID Public Data File into the people table
# The file (~46 GB compressed, ~220 M person records) is published annually on figshare:
#   https://figshare.com/articles/dataset/ORCID_Public_Data_File_2025/30375589
# Download the *summaries* file only (ORCID_YYYY_N_summaries.tar.gz, ~46 GB),
# not the full bundle (~221 GB).
#
# Records land in the `people` table, which by default shares the main database.
# Use --people-db to keep people in a separate file:
#   commonmeta import --from orcid --people-db /data/people.sqlite3 "<SUMMARIES_URL>"
#
# Step 1 — get the direct download URL.
# On a machine where api.figshare.com is accessible (e.g. your laptop):
commonmeta import --from orcid --list-releases
# Prints something like:
#   Year/batch : 2025_10
#   SUMMARIES  : https://figshare.com/ndownloader/files/XXXXXXXX
# Copy the SUMMARIES URL and use it in Step 2.
# Alternatively: open the figshare page linked above, locate ORCID_YYYY_N_summaries.tar.gz,
# and copy its download link directly from the browser.
#
# Step 2 — import (single sequential download → cache → SQLite):
# The TAR is cached at ~/Library/Caches/commonmeta/orcid/ORCID_2025_10_summaries.tar.gz
# so subsequent re-imports read from disk without re-downloading.
commonmeta import --from orcid "<SUMMARIES_URL>"
# On servers where api.figshare.com is accessible, auto-discover and import in one step:
commonmeta import --from orcid
# Test with the first 1,000 records (~200 KB downloaded, then connection closed):
commonmeta import --from orcid --sample "<SUMMARIES_URL>"
# Re-import from a locally downloaded TAR (no URL needed):
commonmeta import --from orcid /path/to/ORCID_2025_10_summaries.tar.gz

# Import the full VRAIX pidbox dump
commonmeta import --from pidbox

# Import latest ROR organization data
commonmeta import --from ror

# Import from a dragoman cache (flushes cache after successful import)
commonmeta import --from dragoman
commonmeta import --from dragoman --dragoman-db /path/to/cache.sqlite3
DRAGOMAN_DB=/path/to/cache.sqlite3 commonmeta import --from dragoman

The database path is resolved in this order:

1. COMMONMETA_DB environment variable
2. Platform default:

Platform	Default path
macOS	`~/Library/Application Support/commonmeta/commonmeta.sqlite3`
Linux	`/var/lib/commonmeta/commonmeta.sqlite3`

# Use a custom path via environment variable
COMMONMETA_DB=/data/commonmeta.sqlite3 commonmeta import --from crossref --date 2026-06-15
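
The resolution order above can be sketched as a small pure function. This is an illustration under stated assumptions, not the library's code: resolve_db_path is a hypothetical name, an empty COMMONMETA_DB value is treated as unset, and any platform other than macOS falls back to the Linux path.

```rust
use std::path::{Path, PathBuf};

/// Resolve the database path: explicit override first, then the platform default.
/// `env_value` stands in for the COMMONMETA_DB variable, `os` for
/// std::env::consts::OS, and `home` for the user's home directory.
pub fn resolve_db_path(env_value: Option<&str>, os: &str, home: &Path) -> PathBuf {
    // 1. COMMONMETA_DB wins when set (assumption: an empty value counts as unset).
    if let Some(value) = env_value {
        if !value.is_empty() {
            return PathBuf::from(value);
        }
    }
    // 2. Platform default, using the paths from the table above.
    match os {
        "macos" => home.join("Library/Application Support/commonmeta/commonmeta.sqlite3"),
        // Assumption: every other platform uses the Linux default.
        _ => PathBuf::from("/var/lib/commonmeta/commonmeta.sqlite3"),
    }
}
```

In a real program the first two arguments would come from std::env::var("COMMONMETA_DB").ok().as_deref() and std::env::consts::OS. Passing them in keeps the function easy to test.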

Dragoman cache import on Debian (cron)

The --from dragoman import reads from the dragoman cache SQLite database (same VRAIX transport schema: pid, source_id, raw_metadata). After a successful import the cache rows are deleted and the file is VACUUMed, so subsequent cron runs only process new records.

System user and file permissions

# Create a system user for the commonmeta import job
adduser --system --group --no-create-home commonmeta

# Create and own the commonmeta database directory
install -d -o commonmeta -g commonmeta -m 750 /var/lib/commonmeta

# Allow the commonmeta user to read and write the dragoman cache.
# Dragoman must also be able to write to the same file.
# Option A — shared group (recommended):
adduser dragoman commonmeta   # or whatever user dragoman runs as
chmod g+rw /var/lib/dragoman/cache.sqlite3
chown dragoman:commonmeta /var/lib/dragoman/cache.sqlite3

# Option B — POSIX ACL (if the filesystem supports it):
setfacl -m u:commonmeta:rw /var/lib/dragoman/cache.sqlite3

`/etc/cron.d/commonmeta`

# Run the dragoman cache import every 15 minutes.
# Logs go to syslog via logger; the cache is flushed automatically on success.
*/15 * * * * commonmeta /usr/local/bin/commonmeta import --from dragoman 2>&1 | logger -t commonmeta-import

Environment variables (`/etc/default/commonmeta`)

# Override the default database paths if needed.
COMMONMETA_DB=/var/lib/commonmeta/commonmeta.sqlite3
DRAGOMAN_DB=/var/lib/dragoman/cache.sqlite3

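Load them in the cron job by prefixing the command with env $(grep -v '^#' /etc/default/commonmeta | xargs), or install a systemd timer unit that reads EnvironmentFile=/etc/default/commonmeta instead. The grep drops the comment line in the file; with a plain cat, env would treat the leading # as the command to run and fail.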

systemd timer alternative (`/etc/systemd/system/commonmeta-import.service` + `.timer`)

# commonmeta-import.service
[Unit]
Description=Import records from dragoman cache into commonmeta database
After=network.target

[Service]
Type=oneshot
User=commonmeta
EnvironmentFile=/etc/default/commonmeta
ExecStart=/usr/local/bin/commonmeta import --from dragoman
StandardOutput=journal
StandardError=journal

# commonmeta-import.timer
[Unit]
Description=Run commonmeta dragoman import every 15 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
Persistent=true

[Install]
WantedBy=timers.target

systemctl daemon-reload
systemctl enable --now commonmeta-import.timer
systemctl status commonmeta-import.timer
journalctl -u commonmeta-import.service -f

Migrate

The migrate command applies any pending schema migrations to the local database, optionally backfills the junction tables (works_orcid, works_ror, works_references) that enable fast reverse lookups by author ORCID, institution ROR, and cited DOI, and can rebuild the FTS5 full-text search indexes.

Migrations are idempotent — safe to run repeatedly. No existing records are modified.

# Apply any pending schema migrations (safe to run any time)
commonmeta migrate

# Populate all three junction tables in a single streaming pass (most efficient)
commonmeta migrate --backfill

# Populate individual junction tables
commonmeta migrate --orcid       # works_orcid  — author ORCID index
commonmeta migrate --ror         # works_ror    — institution ROR index
commonmeta migrate --references  # works_references — cited DOI index

# Combine multiple tables without a full --backfill
commonmeta migrate --orcid --ror

# Restrict a backfill to records from one provider
commonmeta migrate --backfill --crossref
commonmeta migrate --references --datacite

# Rebuild FTS5 full-text search indexes
commonmeta migrate --rebuild-fts

# Migrate a database at a custom path
commonmeta migrate --file /var/lib/commonmeta/commonmeta.sqlite3

Junction tables

Table	Key column	Populated by	Used for
`works_orcid`	`orcid`	`--orcid` / `--backfill`	Fast author lookup: all works by an ORCID iD
`works_ror`	`ror`	`--ror` / `--backfill`	Fast institution lookup: all works for a ROR ID
`works_references`	`ref_id`	`--references` / `--backfill`	Reverse citation lookup: all works that cite a given DOI

FTS5 full-text search indexes

Three SQLite FTS5 virtual tables enable full-text search with Unicode-aware tokenization and diacritic folding (searching "muller" matches "Müller"):

Table	Indexed columns	Content table
`works_fts`	`title`, `subjects`	`works`
`organizations_fts`	`name`, `names_flat`	`organizations`
`people_fts`	`name`, `keywords`, `other_names`	`people`

All three use content=<table> so only the inverted index is stored separately — the full text remains in the base table. Expected index size for works_fts on a 200 M-row corpus: 5–10 GB.

Deploying FTS5 indexes on an existing database (schema upgrade v4 → v5)

The FTS5 rebuild reads every row in works to build the inverted index. On a 200 M-row database this takes 10–45 minutes depending on I/O and is not done automatically on DB open to avoid blocking routine commands.

# Step 1: apply the schema migration (creates the empty virtual table — fast)
commonmeta migrate

# Step 2: populate the index (reads all rows — slow on large databases)
commonmeta migrate --rebuild-fts

Both steps print elapsed time per operation to stderr. Re-running either is safe.

FTS5 on a fresh bulk import

When importing the DataCite or Crossref annual data file, works_fts is rebuilt automatically at the end of the import loop — no extra step is needed:

commonmeta import --from datacite   # FTS rebuild runs automatically at the end
commonmeta import --from crossref --s3

Keeping FTS indexes current

The FTS5 indexes are not updated incrementally: because they are external-content tables (content=<table>), SQLite does not keep them in sync with the base table, and they must be rebuilt when the base table changes. The rebuild is triggered automatically only after full bulk imports. After daily incremental imports or individual-record imports, run --rebuild-fts periodically (e.g. nightly) to keep search results current:

commonmeta migrate --rebuild-fts

Resumable backfills

Each backfill flag tracks a rowid cursor in the settings table. If the run is interrupted (Ctrl-C, machine restart), re-running the same command resumes from where it left off — no records are re-scanned. The cursor is deleted on completion, so a second full run starts fresh.

On a 300 M-record database expect each full-backfill pass to take several hours. Monitor progress on stderr; the command prints a running count and final elapsed time.
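
The cursor mechanism described above can be illustrated with an in-memory stand-in. This is a sketch under assumptions, not the actual implementation: a HashMap plays the role of the settings table, a slice of rowids plays the role of the works table, and the function name, the key name and the max_batches parameter (used here to simulate an interruption) are all hypothetical. Against SQLite, the per-batch selection would be a query for rows with rowid greater than the cursor, in rowid order, with a limit.

```rust
use std::collections::HashMap;

/// Process rowids in batches, persisting the last processed rowid as a cursor.
/// Returns the rowids handled during this call.
/// `rowids` must be sorted ascending; `max_batches` caps the work per call.
pub fn backfill(
    rowids: &[i64],
    settings: &mut HashMap<String, i64>,
    key: &str,
    batch: usize,
    max_batches: usize,
) -> Vec<i64> {
    let mut processed = Vec::new();
    for _ in 0..max_batches {
        // Resume after the saved cursor, or from the start if there is none.
        let cursor = settings.get(key).copied().unwrap_or(0);
        let chunk: Vec<i64> = rowids
            .iter()
            .copied()
            .filter(|id| *id > cursor)
            .take(batch)
            .collect();
        match chunk.last() {
            None => {
                // Nothing left: delete the cursor so the next full run starts fresh.
                settings.remove(key);
                return processed;
            }
            Some(&last) => {
                processed.extend_from_slice(&chunk);
                // Save progress after each batch so an interruption loses little work.
                settings.insert(key.to_string(), last);
            }
        }
    }
    processed
}
```

Because the cursor is saved after every batch, an interrupted run repeats at most the batch that was in flight when it stopped.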

Validate

The validate command checks records in the local database against the commonmeta v1.0 JSON schema and reports any violations. Each failing record shows the JSON Pointer to the offending field and a short description of the constraint that was violated.

Errors are persisted in a validation_errors table inside the database so that --recheck can quickly re-run only the records that failed last time.

# Validate all records
commonmeta validate

# Validate only DataCite records
commonmeta validate --from datacite

# Validate only DataCite datasets
commonmeta validate --from datacite --type Dataset

# Validate the first 1 000 records
commonmeta validate --number 1000

# Repair invalid records in-place (re-applies schema normalization)
commonmeta validate --fix

# Re-validate only records that failed in the previous run
commonmeta validate --recheck

# Repair only previously-failing records
commonmeta validate --recheck --fix

# Write errors as JSONL to a file instead of stderr
commonmeta validate --report errors.jsonl

# Validate a different database
commonmeta validate /path/to/other.sqlite3

# Enrich affiliation identifiers using the local ROR database
commonmeta validate --fill --organizations /path/to/ror.sqlite3

# Enrich only Crossref records, cap at 10 000
commonmeta validate --fill --from crossref --number 10000

Options

Option	Description
`--from` / `-f`	Filter by provider (`crossref`, `datacite`, `openalex`).
`--type`	Filter by work type, e.g. `Dataset`, `JournalArticle`.
`--number` / `-n`	Maximum number of records to check (default: all).
`--fix`	Attempt to repair invalid records in-place. Applies `prepare()` normalization: removes non-ROR organization ids, clears invalid URIs, deduplicates geo-locations, normalizes EISSN → ISSN, etc. Repaired records are removed from `validation_errors`; records that cannot be repaired remain.
`--recheck`	Only re-validate records listed in the `validation_errors` table from the previous run. Combine with `--fix` for an efficient repair loop.
`--report`	Write errors as JSONL (one `{"id": "…", "errors": […]}` object per record) to the given file instead of printing to stderr.
`--fill`	Enrich affiliation and organization identifiers. See Fill below.
`--organizations`	Path to the ROR organizations SQLite database used by `--fill` (default: platform `ror.sqlite3`, env: `ROR_DB`).

Repair loop

A typical workflow for cleaning up an imported database:

# 1. Full first pass — saves all failures to validation_errors
commonmeta validate --from datacite --fix

# 2. Subsequent passes — only re-checks and re-repairs the remaining failures
commonmeta validate --recheck --fix

The command exits with a non-zero status if any records remain invalid after the run.

Fill

--fill enriches affiliation and organization identifiers in the works database using the local ROR organizations database (imported with commonmeta import --from ror). It runs independently of schema validation and never sets the valid flag.

For each record, every contributor affiliation and organization-type contributor is inspected:

Condition	Action
`id` is a Crossref Funder ID or ISNI	Replaced with the matching ROR URL; name set to the ROR display name; `asserted_by` set to `"Commonmeta"`.
`id` is a ROR URL and `name` is empty	Name filled from the ROR database.
`id` is already a ROR URL with a name	Left unchanged.

The organizations database path defaults to the platform ror.sqlite3 location (macOS: ~/Library/Application Support/commonmeta/ror.sqlite3). Override with --organizations or the ROR_DB environment variable.
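
The condition table above can be read as a small decision function. The sketch below is illustrative only: FillAction and fill_action are hypothetical names, and the URL prefixes used to recognize each identifier type (https://ror.org/, https://doi.org/10.13039/ for Crossref Funder IDs, https://isni.org/isni/ for ISNI) are assumptions about how ids are stored. The actual lookup against the ROR database is left out, and ids that match none of the three rows are assumed to be left unchanged.

```rust
/// What --fill would do with one affiliation or organization entry.
#[derive(Debug, PartialEq)]
pub enum FillAction {
    /// Replace the id with the matching ROR URL, set the name to the ROR
    /// display name, and set asserted_by to "Commonmeta".
    ReplaceWithRor,
    /// Keep the ROR id and fill the empty name from the ROR database.
    FillName,
    /// Leave the entry as it is.
    Unchanged,
}

/// Classify an entry by its id and name, following the condition table.
pub fn fill_action(id: &str, name: &str) -> FillAction {
    // Assumed id formats; adjust to however ids are normalized in the database.
    let is_ror = id.starts_with("https://ror.org/");
    let is_funder_id = id.starts_with("https://doi.org/10.13039/");
    let is_isni = id.starts_with("https://isni.org/isni/");

    if is_funder_id || is_isni {
        FillAction::ReplaceWithRor
    } else if is_ror && name.trim().is_empty() {
        FillAction::FillName
    } else {
        // Covers ROR ids that already have a name, and (by assumption) any other id.
        FillAction::Unchanged
    }
}
```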

Documentation

Documentation (work in progress) for using the library is available at the commonmeta-rs Documentation website.

Settings

The settings command reads the settings table of the local commonmeta SQLite database. Settings rows record installed vocabulary versions and bulk-import dates written by commonmeta import.

# Show all key/value settings
commonmeta settings

# Show settings for a specific database
commonmeta settings --file /data/commonmeta.sqlite3

# Show settings alongside the ORCID people database
commonmeta settings --people-db /data/people.sqlite3

# Show record counts for all main tables
commonmeta settings --stats

# Show record counts for a specific database
commonmeta settings --stats --file /data/commonmeta.sqlite3

Settings options

Option	Description
`--file`	Path to the works SQLite database (overrides `COMMONMETA_DB` and the platform default).
`--people-db`	Path to the people SQLite database. Shows ORCID Public Data File version alongside the works settings.
`--stats`	Show record counts for all main tables instead of settings values.

Stats output

--stats reports the row count for each of the following tables, or (table not found) if the table has not yet been created:

Table	Contents
`works`	Scholarly metadata records
`organizations`	ROR organization vocabulary
`people`	ORCID person records
`prefixes`	DOI prefix registry
`works_ror`	Work → ROR junction (fast institution lookup)
`works_orcid`	Work → ORCID junction (fast author lookup)
`works_references`	Work → cited DOI junction (reverse citation lookup)

Row counts use MAX(rowid) rather than COUNT(*) for instant results even on tables with hundreds of millions of rows. The value equals the true row count when no rows have been deleted from the table.

commonmeta 0.9.3