commonmeta 0.8.26

Library for conversions to/from the Commonmeta scholarly metadata format

commonmeta-rs

commonmeta-rs is a Rust library that implements Commonmeta, the common metadata model for scholarly metadata. Use commonmeta to convert scholarly metadata between the formats listed below. Commonmeta-rs is a work in progress; the first release was on June 17, 2026. Implementations in other languages are also available (Go, Python, Ruby).

Supported Metadata Formats

Commonmeta-rs reads and/or writes these metadata formats:

| Format | Name | Content Type | Read | Write |
|---|---|---|---|---|
| Commonmeta | commonmeta | application/vnd.commonmeta+json | yes | yes |
| Crossref XML | crossref_xml | application/vnd.crossref.unixref+xml | yes | yes |
| Crossref | crossref | application/vnd.crossref+json | yes | yes |
| DataCite | datacite | application/vnd.datacite.datacite+json | yes | yes |
| DataCite XML | datacite_xml | application/vnd.datacite.datacite+xml | yes | yes |
| Schema.org (in JSON-LD) | schema_org | application/vnd.schemaorg.ld+json | yes | yes |
| RDF XML | rdf_xml | application/rdf+xml | no | later |
| RDF Turtle | turtle | text/turtle | no | later |
| CSL-JSON | csl | application/vnd.citationstyles.csl+json | yes | yes |
| Formatted text citation | citation | text/x-bibliography | n/a | yes |
| Codemeta | codemeta | application/vnd.codemeta.ld+json | yes | later |
| Citation File Format (CFF) | cff | application/vnd.cff+yaml | yes | later |
| JATS | jats | application/vnd.jats+xml | later | later |
| CSV | csv | text/csv | no | later |
| BibTeX | bibtex | application/x-bibtex | yes | yes |
| RIS | ris | application/x-research-info-systems | yes | yes |
| InvenioRDM | inveniordm | application/vnd.inveniordm.v1+json | yes | yes |
| JSON Feed | jsonfeed | application/feed+json | yes | later |
| OpenAlex | openalex | n/a | yes | no |

commonmeta: the Commonmeta format is the native format for the library and used internally.

later: we plan to implement this format in a later release.
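The name-to-content-type mapping in the table can be expressed as a small lookup. The following is only an illustrative sketch: `writable_content_type` is a hypothetical helper, not part of the commonmeta-rs API, and it covers just the formats that the table marks as writable.

```rust
/// Hypothetical helper (not a commonmeta-rs function): return the content
/// type for a format name that the table above lists with Write = yes.
pub fn writable_content_type(name: &str) -> Option<&'static str> {
    match name {
        "commonmeta" => Some("application/vnd.commonmeta+json"),
        "crossref_xml" => Some("application/vnd.crossref.unixref+xml"),
        "crossref" => Some("application/vnd.crossref+json"),
        "datacite" => Some("application/vnd.datacite.datacite+json"),
        "datacite_xml" => Some("application/vnd.datacite.datacite+xml"),
        "schema_org" => Some("application/vnd.schemaorg.ld+json"),
        "csl" => Some("application/vnd.citationstyles.csl+json"),
        "citation" => Some("text/x-bibliography"),
        "bibtex" => Some("application/x-bibtex"),
        "ris" => Some("application/x-research-info-systems"),
        "inveniordm" => Some("application/vnd.inveniordm.v1+json"),
        // Formats marked "later" or "no" in the Write column, and unknown names.
        _ => None,
    }
}
```

Formats whose Write column says "later" or "no", such as turtle or openalex, yield `None` in this sketch.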

Build & run

cargo build
cargo test

The commonmeta binary has these subcommands: convert, encode, decode, import, list, match, migrate, push, put, settings, and validate.

# Encode/decode a Crockford base32 identifier suffix given a DOI prefix
cargo run -- encode 10.5555
cargo run -- decode 10.5555/nwbyp-29t86

# Convert a single record between formats, fetching it by DOI
cargo run -- convert 10.5555/12345678 --from crossref --to csl

# Convert a local file and write the result to disk
cargo run -- convert record.json --from commonmeta --to csl --file out.json

# Render a formatted citation (CSL style + locale)
cargo run -- convert 10.5555/12345678 --from crossref --to citation --style apa --locale en-US

# Fetch a batch of records from an API and write them as a commonmeta JSON array
cargo run -- list --from crossref --number 100 --type journal-article --file out.json

# Read all records from a local VRAIX SQLite file and convert to another format
cargo run -- list crossref-2026-06-15.sqlite3 --number 0 --to commonmeta --file out.json.gz

# Parquet output (.parquet file extension, --to commonmeta only): records are split into batches of 100,000, written in parallel, and zstd-compressed
cargo run --release -- list crossref-2026-06-15.sqlite3 --number 0 --file out.parquet

# Import a single record by DOI into the local commonmeta database (source auto-detected)
cargo run -- import 10.7554/elife.01567

# Import all Crossref records for a ROR-identified institution (paginates through all results)
cargo run -- import --from crossref --ror 00pd74e08

# Import all records from a Crossref VRAIX daily dump
cargo run -- import --from crossref --date 2026-06-15

# See the Local database section below for the full import command reference
# including annual public data files (Crossref torrent, DataCite TAR).

# Register records with a live InvenioRDM instance (creates/updates and publishes
# real records — registration is currently only supported with --to inveniordm)
cargo run -- push --from crossref --number 10 --to inveniordm --host rogue-scholar.org --token TOKEN

# Same as push, but for a single record (DOI, URL, or file path)
cargo run -- put 10.5555/12345678 --from crossref --to inveniordm --host rogue-scholar.org --token TOKEN

# Match a free-text affiliation string to a ROR organization (uses local DB when available)
cargo run -- match "Leibniz Universität Hannover"
cargo run -- match "Leibniz Universität Hannover" --to inveniordm

# Look up a ROR organization (uses local DB when available)
cargo run -- convert https://ror.org/02nr0ka47
cargo run -- convert https://ror.org/02nr0ka47 --to inveniordm

# Work fully offline — fails fast if a network call would be required
cargo run -- convert record.json --from commonmeta --to csl --no-network
cargo run -- list crossref-2026-06-15.sqlite3 --no-network --file out.json
cargo run -- import crossref-2026-06-15.sqlite3 --no-network
cargo run -- match "Leibniz Universität Hannover" --no-network

Use cargo run -- <subcommand> --help for the full list of options for each subcommand.

--no-network flag

convert, list, import, and match all accept a --no-network flag. When set, any operation that would make an outbound HTTP request is rejected immediately with a clear error message. Operations on local files always succeed regardless of this flag. push and put always require network access and do not expose this flag.
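The fail-fast behaviour described above can be pictured as a guard that runs before any request is made. This is a minimal sketch under stated assumptions, not the library's implementation: the function names and the error text are hypothetical, and the rule for deciding that an input needs the network (a URL or a string starting with a DOI prefix) is a simplification that ignores lookups served from the local database.

```rust
/// Assumption for this sketch: URLs and bare DOIs need an HTTP request,
/// everything else is treated as a local file path.
pub fn requires_network(input: &str) -> bool {
    input.starts_with("http://") || input.starts_with("https://") || input.starts_with("10.")
}

/// Reject the operation up front when --no-network is set and the input
/// would need an outbound request. Names and message are hypothetical.
pub fn check_network(input: &str, no_network: bool) -> Result<(), String> {
    if no_network && requires_network(input) {
        return Err(format!(
            "--no-network is set, but '{}' would require an outbound HTTP request",
            input
        ));
    }
    Ok(())
}
```

With this guard, `check_network("record.json", true)` passes, while `check_network("10.5555/12345678", true)` returns an error before any connection is attempted.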

Local database

The import command populates a local commonmeta SQLite database with scholarly metadata records. All imports upsert — existing records are updated rather than replaced. The database is also used by match and convert for offline lookups.

# Import a single record by DOI (source auto-detected from the DOI prefix)
commonmeta import 10.7554/elife.01567
commonmeta import https://doi.org/10.7554/elife.01567

# Import all Crossref records for an institution (ROR ID, paginates automatically)
commonmeta import --from crossref --ror 00pd74e08

# Import all DataCite records for an author (ORCID, paginates automatically)
commonmeta import --from datacite --orcid 0000-0003-1419-2405

# Import a full daily dump (downloads from metadata.vraix.org)
commonmeta import --from crossref --date 2026-06-15
commonmeta import --from datacite --date 2026-06-15

# Import from a locally downloaded VRAIX dump (source auto-detected from filename)
commonmeta import crossref-2026-06-15.sqlite3

# Import the Crossref annual public data file (~223 GB)
# Option A: Academic Torrents (aria2c required, free)
commonmeta import --from crossref
commonmeta import --from crossref --sample   # first 5 files only (~40 MB)
# Option B: AWS S3 requester-pays bucket (aws CLI + credentials required, ~$18)
# Bucket: s3://api-snapshots-reqpays-crossref   see https://www.crossref.org/documentation/retrieve-metadata/bulk-downloads/
# TAR cached at ~/Library/Caches/commonmeta/crossref/crossref-annual-s3.tar
commonmeta import --from crossref --s3

# Import the DataCite annual public data file (108 M records, 33 GB compressed)
# First run: obtain a time-limited download URL by submitting your email at
#   https://datafiles.datacite.org/datafiles/public-2025
# The TAR archive is cached at ~/Library/Caches/commonmeta/datacite/public-2025.tar
# for subsequent re-imports without a new token.
commonmeta import "https://datafiles.datacite.org/datafiles/public-2025/download?token=<TOKEN>"
commonmeta import "https://datafiles.datacite.org/datafiles/public-2025/download?token=<TOKEN>" --sample
# Re-import or re-parse from cache (no token needed after the first download):
commonmeta import --from datacite
commonmeta import --from datacite --sample

# Import the full VRAIX pidbox dump
commonmeta import --from pidbox

# Import latest ROR organization data
commonmeta import --from ror

# Import from a dragoman cache (flushes cache after successful import)
commonmeta import --from dragoman
commonmeta import --from dragoman --dragoman-db /path/to/cache.sqlite3
DRAGOMAN_DB=/path/to/cache.sqlite3 commonmeta import --from dragoman

The database path is resolved in this order:

  1. COMMONMETA_DB environment variable
  2. Platform default:

| Platform | Default path |
|---|---|
| macOS | ~/Library/Application Support/commonmeta/commonmeta.sqlite3 |
| Linux | /var/lib/commonmeta/commonmeta.sqlite3 |

# Use a custom path via environment variable
COMMONMETA_DB=/data/commonmeta.sqlite3 commonmeta import --from crossref --date 2026-06-15
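The resolution order can be sketched as a pure function. This is an illustration only: `resolve_db_path` is a hypothetical name, the environment value, operating system, and home directory are passed in as parameters so the logic is easy to test, and returning `None` for other platforms is an assumption because the source lists defaults only for macOS and Linux.

```rust
use std::path::PathBuf;

/// Hypothetical sketch of the lookup order: COMMONMETA_DB first, then the
/// platform default. `env_value` is the value of COMMONMETA_DB if set,
/// `os` is a value such as std::env::consts::OS, `home` is the home directory.
pub fn resolve_db_path(env_value: Option<&str>, os: &str, home: &str) -> Option<PathBuf> {
    // 1. An explicit environment variable always wins.
    if let Some(path) = env_value {
        return Some(PathBuf::from(path));
    }
    // 2. Otherwise fall back to the documented platform default.
    match os {
        "macos" => Some(
            PathBuf::from(home)
                .join("Library/Application Support/commonmeta/commonmeta.sqlite3"),
        ),
        "linux" => Some(PathBuf::from("/var/lib/commonmeta/commonmeta.sqlite3")),
        // Assumption: no default is documented for other platforms.
        _ => None,
    }
}
```

In a real program the first argument would come from `std::env::var("COMMONMETA_DB").ok()` and the second from `std::env::consts::OS`.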

Dragoman cache import on Debian (cron)

The --from dragoman import reads from the dragoman cache SQLite database (same VRAIX transport schema: pid, source_id, raw_metadata). After a successful import the cache rows are deleted and the file is VACUUMed, so subsequent cron runs only process new records.

System user and file permissions

# Create a system user for the commonmeta import job
adduser --system --group --no-create-home commonmeta

# Create and own the commonmeta database directory
install -d -o commonmeta -g commonmeta -m 750 /var/lib/commonmeta

# Allow the commonmeta user to read and write the dragoman cache.
# Dragoman must also be able to write to the same file.
# Option A — shared group (recommended):
adduser dragoman commonmeta   # or whatever user dragoman runs as
chmod g+rw /var/lib/dragoman/cache.sqlite3
chown dragoman:commonmeta /var/lib/dragoman/cache.sqlite3

# Option B — POSIX ACL (if the filesystem supports it):
setfacl -m u:commonmeta:rw /var/lib/dragoman/cache.sqlite3

/etc/cron.d/commonmeta

# Run the dragoman cache import every 15 minutes.
# Logs go to syslog via logger; the cache is flushed automatically on success.
*/15 * * * * commonmeta /usr/local/bin/commonmeta import --from dragoman 2>&1 | logger -t commonmeta-import

Environment variables (/etc/default/commonmeta)

# Override the default database paths if needed.
COMMONMETA_DB=/var/lib/commonmeta/commonmeta.sqlite3
DRAGOMAN_DB=/var/lib/dragoman/cache.sqlite3

Load them in the cron job by prefixing the command with env $(grep -v '^#' /etc/default/commonmeta | xargs), which skips the comment line in the file, or install a systemd timer unit that reads EnvironmentFile=/etc/default/commonmeta instead.

systemd timer alternative (/etc/systemd/system/commonmeta-import.service + .timer)

# commonmeta-import.service
[Unit]
Description=Import records from dragoman cache into commonmeta database
After=network.target

[Service]
Type=oneshot
User=commonmeta
EnvironmentFile=/etc/default/commonmeta
ExecStart=/usr/local/bin/commonmeta import --from dragoman
StandardOutput=journal
StandardError=journal

# commonmeta-import.timer
[Unit]
Description=Run commonmeta dragoman import every 15 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
Persistent=true

[Install]
WantedBy=timers.target

# Reload systemd, enable the timer, then check its status and follow the logs
systemctl daemon-reload
systemctl enable --now commonmeta-import.timer
systemctl status commonmeta-import.timer
journalctl -u commonmeta-import.service -f

Migrate

The migrate command applies any pending schema migrations to the local database and optionally backfills the junction tables (works_orcid, works_ror, works_references) that enable fast reverse lookups by author ORCID, institution ROR, and cited DOI.

Migrations are idempotent — safe to run repeatedly. No existing records are modified.

# Apply any pending schema migrations (safe to run any time)
commonmeta migrate

# Populate all three junction tables in a single streaming pass (most efficient)
commonmeta migrate --backfill

# Populate individual junction tables
commonmeta migrate --orcid       # works_orcid  — author ORCID index
commonmeta migrate --ror         # works_ror    — institution ROR index
commonmeta migrate --references  # works_references — cited DOI index

# Combine multiple tables without a full --backfill
commonmeta migrate --orcid --ror

# Restrict a backfill to records from one provider
commonmeta migrate --backfill --crossref
commonmeta migrate --references --datacite

# Migrate a database at a custom path
commonmeta migrate --file /var/lib/commonmeta/commonmeta.sqlite3

Junction tables

| Table | Key column | Populated by | Used for |
|---|---|---|---|
| works_orcid | orcid | --orcid / --backfill | Fast author lookup: all works by an ORCID iD |
| works_ror | ror | --ror / --backfill | Fast institution lookup: all works for a ROR ID |
| works_references | ref_id | --references / --backfill | Reverse citation lookup: all works that cite a given DOI |

Resumable backfills

Each backfill flag tracks a rowid cursor in the settings table. If the run is interrupted (Ctrl-C, machine restart), re-running the same command resumes from where it left off — no records are re-scanned. The cursor is deleted on completion, so a second full run starts fresh.
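The cursor mechanism can be illustrated with in-memory stand-ins. This is a sketch under assumptions, not the actual implementation: a slice of ascending rowids stands in for the works table, a `HashMap` stands in for the settings table, the key name `backfill_cursor` is hypothetical, and the `limit` parameter simulates an interruption.

```rust
use std::collections::HashMap;

/// Hypothetical key under which the cursor is stored.
const CURSOR_KEY: &str = "backfill_cursor";

/// Process rows whose rowid is greater than the stored cursor. Stops after
/// `limit` rows to simulate an interruption. Returns the rowids processed.
pub fn backfill(
    rowids: &[i64],                      // stand-in for the works table (ascending)
    settings: &mut HashMap<String, i64>, // stand-in for the settings table
    limit: usize,
) -> Vec<i64> {
    // Resume after the last rowid recorded by a previous run, if any.
    let cursor = settings.get(CURSOR_KEY).copied().unwrap_or(0);
    let mut processed = Vec::new();
    for &rowid in rowids.iter().filter(|&&r| r > cursor) {
        if processed.len() == limit {
            // Interrupted: the cursor stays in place for the next run.
            return processed;
        }
        // The junction-table insert for this record would happen here.
        processed.push(rowid);
        settings.insert(CURSOR_KEY.to_string(), rowid);
    }
    // Completed: delete the cursor so a later full run starts fresh.
    settings.remove(CURSOR_KEY);
    processed
}
```

Calling `backfill` with a limit of 2 on five rows processes rows 1 and 2 and leaves the cursor at 2; a second call picks up at row 3 and removes the cursor once it reaches the end.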

On a 300 M-record database expect each full-backfill pass to take several hours. Monitor progress on stderr; the command prints a running count and final elapsed time.

Validate

The validate command checks records in the local database against the commonmeta v1.0 JSON schema and reports any violations. Each failing record shows the JSON Pointer to the offending field and a short description of the constraint that was violated.

Errors are persisted in a validation_errors table inside the database so that --recheck can quickly re-run only the records that failed last time.

# Validate all records
commonmeta validate

# Validate only DataCite records
commonmeta validate --from datacite

# Validate only DataCite datasets
commonmeta validate --from datacite --type Dataset

# Validate the first 1 000 records
commonmeta validate --number 1000

# Repair invalid records in-place (re-applies schema normalization)
commonmeta validate --fix

# Re-validate only records that failed in the previous run
commonmeta validate --recheck

# Repair only previously-failing records
commonmeta validate --recheck --fix

# Write errors as JSONL to a file instead of stderr
commonmeta validate --report errors.jsonl

# Validate a different database
commonmeta validate /path/to/other.sqlite3

# Enrich affiliation identifiers using the local ROR database
commonmeta validate --fill --organizations /path/to/ror.sqlite3

# Enrich only Crossref records, cap at 10 000
commonmeta validate --fill --from crossref --number 10000

Options

| Option | Description |
|---|---|
| --from / -f | Filter by provider (crossref, datacite, openalex). |
| --type | Filter by work type, e.g. Dataset, JournalArticle. |
| --number / -n | Maximum number of records to check (default: all). |
| --fix | Attempt to repair invalid records in-place. Applies prepare() normalization: removes non-ROR organization ids, clears invalid URIs, deduplicates geo-locations, normalizes EISSN → ISSN, etc. Repaired records are removed from validation_errors; records that cannot be repaired remain. |
| --recheck | Only re-validate records listed in the validation_errors table from the previous run. Combine with --fix for an efficient repair loop. |
| --report | Write errors as JSONL (one {"id": "…", "errors": […]} object per record) to the given file instead of printing to stderr. |
| --fill | Enrich affiliation and organization identifiers. See Fill below. |
| --organizations | Path to the ROR organizations SQLite database used by --fill (default: platform ror.sqlite3, env: ROR_DB). |

Repair loop

A typical workflow for cleaning up an imported database:

# 1. Full first pass — saves all failures to validation_errors
commonmeta validate --from datacite --fix

# 2. Subsequent passes — only re-checks and re-repairs the remaining failures
commonmeta validate --recheck --fix

The command exits with a non-zero status if any records remain invalid after the run.

Fill

--fill enriches affiliation and organization identifiers in the works database using the local ROR organizations database (imported with commonmeta import --from ror). It runs independently of schema validation and never sets the valid flag.

For each record, every contributor affiliation and organization-type contributor is inspected:

| Condition | Action |
|---|---|
| id is a Crossref Funder ID or ISNI | Replaced with the matching ROR URL; name set to the ROR display name; asserted_by set to "Commonmeta". |
| id is a ROR URL and name is empty | Name filled from the ROR database. |
| id is already a ROR URL with a name | Left unchanged. |

The organizations database path defaults to the platform ror.sqlite3 location (macOS: ~/Library/Application Support/commonmeta/ror.sqlite3). Override with --organizations or the ROR_DB environment variable.
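The three fill rules amount to a small decision function. The sketch below is illustrative and makes several assumptions that the source does not state: identifiers are recognized by URL prefix (`https://doi.org/10.13039/` for Crossref Funder IDs, `https://isni.org/isni/` for ISNI, `https://ror.org/` for ROR), any other identifier is left unchanged, and the actual lookup of the matching ROR record is omitted. `FillAction` and `fill_action` are hypothetical names.

```rust
/// Hypothetical outcome of inspecting one affiliation or organization.
#[derive(Debug, PartialEq)]
pub enum FillAction {
    /// Crossref Funder ID or ISNI: replace with the matching ROR URL and name.
    ReplaceWithRor,
    /// ROR URL without a name: fill the name from the ROR database.
    FillName,
    /// ROR URL with a name, or (assumption) any other identifier.
    Unchanged,
}

/// Decide what --fill would do for a given id and name, per the table above.
/// Prefix-based detection is an assumption of this sketch.
pub fn fill_action(id: &str, name: &str) -> FillAction {
    let is_funder_id = id.starts_with("https://doi.org/10.13039/");
    let is_isni = id.starts_with("https://isni.org/isni/");
    let is_ror = id.starts_with("https://ror.org/");
    if is_funder_id || is_isni {
        FillAction::ReplaceWithRor
    } else if is_ror && name.trim().is_empty() {
        FillAction::FillName
    } else {
        FillAction::Unchanged
    }
}
```

For example, `fill_action("https://ror.org/02nr0ka47", "")` yields `FillAction::FillName`, while the same id with a non-empty name yields `FillAction::Unchanged`.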

Documentation

Documentation (work in progress) for using the library is available at the commonmeta-rs Documentation website.

Settings

The settings command reads the settings table of the local commonmeta SQLite database. Settings rows record installed vocabulary versions and bulk-import dates written by commonmeta import.

# Show all key/value settings
commonmeta settings

# Show settings for a specific database
commonmeta settings --file /data/commonmeta.sqlite3

# Show settings alongside the ORCID people database
commonmeta settings --people-db /data/people.sqlite3

# Show record counts for all main tables
commonmeta settings --stats

# Show record counts for a specific database
commonmeta settings --stats --file /data/commonmeta.sqlite3

Settings options

| Option | Description |
|---|---|
| --file | Path to the works SQLite database (overrides COMMONMETA_DB and the platform default). |
| --people-db | Path to the people SQLite database. Shows ORCID Public Data File version alongside the works settings. |
| --stats | Show record counts for all main tables instead of settings values. |

Stats output

--stats reports the row count for each of the following tables, or (table not found) if the table has not yet been created:

| Table | Contents |
|---|---|
| works | Scholarly metadata records |
| organizations | ROR organization vocabulary |
| people | ORCID person records |
| prefixes | DOI prefix registry |
| works_ror | Work → ROR junction (fast institution lookup) |
| works_orcid | Work → ORCID junction (fast author lookup) |
| works_references | Work → cited DOI junction (reverse citation lookup) |

Row counts use MAX(rowid) rather than COUNT(*) for instant results even on tables with hundreds of millions of rows. The value equals the true row count when no rows have been deleted from the table.
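The caveat about deleted rows can be seen in a toy model. The sketch below uses a slice of rowids as a stand-in for a table; the two functions are hypothetical stand-ins for the `COUNT(*)` and `MAX(rowid)` queries, and returning 0 for an empty table is an assumption of the sketch.

```rust
/// Stand-in for SELECT COUNT(*): the true number of rows.
pub fn count_rows(rowids: &[i64]) -> usize {
    rowids.len()
}

/// Stand-in for SELECT MAX(rowid): the highest rowid still present.
/// Assumption: an empty table is reported as 0 here.
pub fn max_rowid(rowids: &[i64]) -> i64 {
    rowids.iter().copied().max().unwrap_or(0)
}
```

For rowids 1 through 5 both functions report 5. If row 3 is deleted, `count_rows` reports 4 while `max_rowid` still reports 5, so the fast estimate overstates the count by the number of deleted rows.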

Meta

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

License: MIT