commonmeta-rs
commonmeta-rs is a Rust library that implements Commonmeta, the common metadata model for scholarly metadata. Use commonmeta to convert scholarly metadata between the formats listed below. commonmeta-rs is a work in progress; the first release was on June 17, 2026. Implementations in other languages (Go, Python, Ruby) are also available.
Supported Metadata Formats
Commonmeta-rs reads and/or writes these metadata formats:
| Format | Name | Content Type | Read | Write |
|---|---|---|---|---|
| Commonmeta | commonmeta | application/vnd.commonmeta+json | yes | yes |
| Crossref XML | crossref_xml | application/vnd.crossref.unixref+xml | yes | yes |
| Crossref | crossref | application/vnd.crossref+json | yes | yes |
| DataCite | datacite | application/vnd.datacite.datacite+json | yes | yes |
| DataCite XML | datacite_xml | application/vnd.datacite.datacite+xml | yes | yes |
| Schema.org (in JSON-LD) | schema_org | application/vnd.schemaorg.ld+json | yes | yes |
| RDF XML | rdf_xml | application/rdf+xml | no | later |
| RDF Turtle | turtle | text/turtle | no | later |
| CSL-JSON | csl | application/vnd.citationstyles.csl+json | yes | yes |
| Formatted text citation | citation | text/x-bibliography | n/a | yes |
| Codemeta | codemeta | application/vnd.codemeta.ld+json | yes | later |
| Citation File Format (CFF) | cff | application/vnd.cff+yaml | yes | later |
| JATS | jats | application/vnd.jats+xml | later | later |
| CSV | csv | text/csv | no | later |
| BibTeX | bibtex | application/x-bibtex | yes | yes |
| RIS | ris | application/x-research-info-systems | yes | yes |
| InvenioRDM | inveniordm | application/vnd.inveniordm.v1+json | yes | yes |
| JSON Feed | jsonfeed | application/feed+json | yes | later |
| OpenAlex | openalex | n/a | yes | no |
commonmeta: the Commonmeta format is the library's native format and is used internally.
later: we plan to implement this format in a later release.
Build & run
The commonmeta binary has these subcommands: convert, encode, decode, import, list, match, migrate, push, put, settings, and validate.
# Encode/decode a Crockford base32 identifier suffix given a DOI prefix
# Convert a single record between formats, fetching it by DOI
# Convert a local file and write the result to disk
# Render a formatted citation (CSL style + locale)
# Fetch a batch of records from an API and write them as a commonmeta JSON array
# Read all records from a local VRAIX SQLite file and convert to another format
# Parquet output (.parquet file extension, --to commonmeta only): records are split into batches of 100,000, written in parallel, and zstd-compressed
# Import a single record by DOI into the local commonmeta database (source auto-detected)
# Import all Crossref records for a ROR-identified institution (paginates through all results)
# Import all records from a Crossref VRAIX daily dump
# See the Local database section below for the full import command reference
# including annual public data files (Crossref torrent, DataCite TAR).
# Register records with a live InvenioRDM instance (creates/updates and publishes
# real records — registration is currently only supported with --to inveniordm)
# Same as push, but for a single record (DOI, URL, or file path)
# Match a free-text affiliation string to a ROR organization (uses local DB when available)
# Look up a ROR organization (uses local DB when available)
# Work fully offline — fails fast if a network call would be required
Use cargo run -- <subcommand> --help for the full list of options for each subcommand.
--no-network flag
convert, list, import, and match all accept a --no-network flag. When set, any
operation that would make an outbound HTTP request is rejected immediately with a clear error
message. Operations on local files always succeed regardless of this flag. push and put
always require network access and do not expose this flag.
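The fail-fast behaviour described above can be sketched as a check that runs before any request is made. This is a minimal illustration, not the crate's actual code: the `Input` enum, `classify`, and `check_network` are hypothetical names, and treating every DOI as remote is a simplification (the real commands may serve some lookups from the local database).

```rust
/// Hypothetical classification of a command-line input; not the crate's actual types.
#[derive(Debug, PartialEq)]
enum Input {
    LocalFile(String),
    Remote(String),
}

/// Assumption for this sketch: URLs and DOIs need the network, everything else is a local path.
fn classify(arg: &str) -> Input {
    let is_url = arg.starts_with("http://") || arg.starts_with("https://");
    let is_doi = arg.starts_with("10.") && arg.contains('/');
    if is_url || is_doi {
        Input::Remote(arg.to_string())
    } else {
        Input::LocalFile(arg.to_string())
    }
}

/// Reject remote inputs up front when `no_network` is set; local files always pass.
fn check_network(arg: &str, no_network: bool) -> Result<Input, String> {
    match classify(arg) {
        Input::Remote(target) if no_network => Err(format!(
            "--no-network is set, but '{target}' requires an HTTP request"
        )),
        other => Ok(other),
    }
}
```

Running the check before any client is constructed is what makes the failure immediate: the error is returned without opening a connection or waiting for a timeout.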
Local database
The import command populates a local commonmeta SQLite database with scholarly metadata records. All imports upsert — existing records are updated rather than replaced. The database is also used by match and convert for offline lookups.
# Import a single record by DOI (source auto-detected from the DOI prefix)
# Import all Crossref records for an institution (ROR ID, paginates automatically)
# Import all DataCite records for an author (ORCID, paginates automatically)
# Import a full daily dump (downloads from metadata.vraix.org)
# Import from a locally downloaded VRAIX dump (source auto-detected from filename)
# Import the Crossref annual public data file (~223 GB)
# Option A: Academic Torrents (aria2c required, free)
# Option B: AWS S3 requester-pays bucket (aws CLI + credentials required, ~$18)
# Bucket: s3://api-snapshots-reqpays-crossref see https://www.crossref.org/documentation/retrieve-metadata/bulk-downloads/
# TAR cached at ~/Library/Caches/commonmeta/crossref/crossref-annual-s3.tar
# Import the DataCite annual public data file (108 M records, 33 GB compressed)
# First run: obtain a time-limited download URL by submitting your email at
# https://datafiles.datacite.org/datafiles/public-2025
# The TAR archive is cached at ~/Library/Caches/commonmeta/datacite/public-2025.tar
# for subsequent re-imports without a new token.
# Re-import or re-parse from cache (no token needed after the first download):
# Import the ORCID Public Data File into the people table
# The file (~46 GB compressed, ~220 M person records) is published annually on figshare:
# https://figshare.com/articles/dataset/ORCID_Public_Data_File_2025/30375589
# Download the *summaries* file only (ORCID_YYYY_N_summaries.tar.gz, ~46 GB),
# not the full bundle (~221 GB).
#
# Records land in the `people` table, which by default shares the main database.
# Use --people-db to keep people in a separate file:
# commonmeta import --from orcid --people-db /data/people.sqlite3 "<SUMMARIES_URL>"
#
# Step 1 — get the direct download URL.
# On a machine where api.figshare.com is accessible (e.g. your laptop):
# Prints something like:
# Year/batch : 2025_10
# SUMMARIES : https://figshare.com/ndownloader/files/XXXXXXXX
# Copy the SUMMARIES URL and use it in Step 2.
# Alternatively: open the figshare page linked above, locate ORCID_YYYY_N_summaries.tar.gz,
# and copy its download link directly from the browser.
#
# Step 2 — import (single sequential download → cache → SQLite):
# The TAR is cached at ~/Library/Caches/commonmeta/orcid/ORCID_2025_10_summaries.tar.gz
# so subsequent re-imports read from disk without re-downloading.
# On servers where api.figshare.com is accessible, auto-discover and import in one step:
# Test with the first 1,000 records (~200 KB downloaded, then connection closed):
# Re-import from a locally downloaded TAR (no URL needed):
# Import the full VRAIX pidbox dump
# Import latest ROR organization data
# Import from a dragoman cache (flushes cache after successful import)
DRAGOMAN_DB=/path/to/cache.sqlite3
The database path is resolved in this order:
- COMMONMETA_DB environment variable
- Platform default:
| Platform | Default path |
|---|---|
| macOS | ~/Library/Application Support/commonmeta/commonmeta.sqlite3 |
| Linux | /var/lib/commonmeta/commonmeta.sqlite3 |
# Use a custom path via environment variable
COMMONMETA_DB=/data/commonmeta.sqlite3
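The resolution order can be expressed as a small pure function. This is a sketch under stated assumptions rather than the library's implementation: `resolve_db_path` is a hypothetical name, an empty variable is treated as unset, and any platform other than macOS falls back to the Linux path because the table lists only those two.

```rust
use std::path::PathBuf;

/// Resolve the database path: a non-empty COMMONMETA_DB value wins,
/// otherwise fall back to the platform default from the table.
/// `env_value`, `os`, and `home` are parameters so the function stays pure and testable.
fn resolve_db_path(env_value: Option<&str>, os: &str, home: &str) -> PathBuf {
    // Assumption: an empty variable counts as "not set".
    if let Some(path) = env_value.filter(|p| !p.is_empty()) {
        return PathBuf::from(path);
    }
    match os {
        "macos" => PathBuf::from(home)
            .join("Library/Application Support/commonmeta/commonmeta.sqlite3"),
        // Assumption: every other platform is handled like Linux.
        _ => PathBuf::from("/var/lib/commonmeta/commonmeta.sqlite3"),
    }
}
```

In a real program the arguments would come from `std::env::var("COMMONMETA_DB").ok()`, `std::env::consts::OS`, and the user's home directory.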
Dragoman cache import on Debian (cron)
The --from dragoman import reads from the dragoman cache
SQLite database (same VRAIX transport schema: pid, source_id, raw_metadata). After a
successful import the cache rows are deleted and the file is VACUUMed, so subsequent cron runs
only process new records.
System user and file permissions
# Create a system user for the commonmeta import job
# Create and own the commonmeta database directory
# Allow the commonmeta user to read and write the dragoman cache.
# Dragoman must also be able to write to the same file.
# Option A — shared group (recommended):
# Option B — POSIX ACL (if the filesystem supports it):
/etc/cron.d/commonmeta
# Run the dragoman cache import every 15 minutes.
# Logs go to syslog via logger; the cache is flushed automatically on success.
*/15 * * * * commonmeta /usr/local/bin/commonmeta import --from dragoman 2>&1 | logger -t commonmeta-import
Environment variables (/etc/default/commonmeta)
# Override the default database paths if needed.
COMMONMETA_DB=/var/lib/commonmeta/commonmeta.sqlite3
DRAGOMAN_DB=/var/lib/dragoman/cache.sqlite3
Load them in the cron job by prefixing with env $(cat /etc/default/commonmeta | xargs), or
install a systemd timer unit that reads EnvironmentFile=/etc/default/commonmeta instead.
systemd timer alternative (/etc/systemd/system/commonmeta-import.service + .timer)
# commonmeta-import.service
[Unit]
Description=Import records from dragoman cache into commonmeta database
After=network.target
[Service]
Type=oneshot
User=commonmeta
EnvironmentFile=/etc/default/commonmeta
ExecStart=/usr/local/bin/commonmeta import --from dragoman
StandardOutput=journal
StandardError=journal
# commonmeta-import.timer
[Unit]
Description=Run commonmeta dragoman import every 15 minutes
[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
Persistent=true
[Install]
WantedBy=timers.target
Migrate
The migrate command applies any pending schema migrations to the local database, optionally backfills the junction tables (works_orcid, works_ror, works_references) that enable fast reverse lookups by author ORCID, institution ROR, and cited DOI, and can rebuild the FTS5 full-text search indexes.
Migrations are idempotent — safe to run repeatedly. No existing records are modified.
# Apply any pending schema migrations (safe to run any time)
# Populate all three junction tables in a single streaming pass (most efficient)
# Populate individual junction tables
# Combine multiple tables without a full --backfill
# Restrict a backfill to records from one provider
# Rebuild FTS5 full-text search indexes
# Migrate a database at a custom path
Junction tables
| Table | Key column | Populated by | Used for |
|---|---|---|---|
| works_orcid | orcid | --orcid / --backfill | Fast author lookup: all works by an ORCID iD |
| works_ror | ror | --ror / --backfill | Fast institution lookup: all works for a ROR ID |
| works_references | ref_id | --references / --backfill | Reverse citation lookup: all works that cite a given DOI |
FTS5 full-text search indexes
Three SQLite FTS5 virtual tables enable full-text search with Unicode-aware tokenization and diacritic folding (searching "muller" matches "Müller"):
| Table | Indexed columns | Content table |
|---|---|---|
| works_fts | title, subjects | works |
| organizations_fts | name, names_flat | organizations |
| people_fts | name, keywords, other_names | people |
All three use content=<table> so only the inverted index is stored separately — the full text remains in the base table. Expected index size for works_fts on a 200 M-row corpus: 5–10 GB.
Deploying FTS5 indexes on an existing database (schema upgrade v4 → v5)
The FTS5 rebuild reads every row in works to build the inverted index. On a 200 M-row database this takes 10–45 minutes depending on I/O and is not done automatically on DB open to avoid blocking routine commands.
# Step 1: apply the schema migration (creates the empty virtual table — fast)
# Step 2: populate the index (reads all rows — slow on large databases)
Both steps print elapsed time per operation to stderr. Re-running either is safe.
FTS5 on a fresh bulk import
When importing the DataCite or Crossref annual data file, works_fts is rebuilt automatically at the end of the import loop — no extra step is needed:
Keeping FTS indexes current
FTS5 content tables are not updated incrementally — they must be rebuilt when the base table changes. The rebuild is triggered automatically only after full bulk imports. After daily incremental imports or individual-record imports, run --rebuild-fts periodically (e.g. nightly) to keep search results current:
Resumable backfills
Each backfill flag tracks a rowid cursor in the settings table. If the run is interrupted (Ctrl-C, machine restart), re-running the same command resumes from where it left off — no records are re-scanned. The cursor is deleted on completion, so a second full run starts fresh.
On a 300 M-record database expect each full-backfill pass to take several hours. Monitor progress on stderr; the command prints a running count and final elapsed time.
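The cursor mechanism can be illustrated with an in-memory model. This is a sketch only: a `HashMap` stands in for the settings table, the key name is made up, and `limit` simulates an interruption after a given number of rows. It assumes `rowids` is sorted in ascending order.

```rust
use std::collections::HashMap;

/// Process rows whose rowid is greater than the saved cursor, in ascending order.
/// Returns the rowids handled in this run.
fn backfill(
    rowids: &[i64],
    settings: &mut HashMap<String, i64>, // stand-in for the settings table
    key: &str,                           // hypothetical cursor key
    limit: Option<usize>,                // simulates an interruption
) -> Vec<i64> {
    let start = settings.get(key).copied().unwrap_or(0);
    let mut done = Vec::new();
    for &rowid in rowids.iter().filter(|&&r| r > start) {
        if limit.map_or(false, |max| done.len() >= max) {
            return done; // interrupted: the cursor stays in place for the next run
        }
        // The junction rows for this record would be derived here.
        settings.insert(key.to_string(), rowid); // persist progress
        done.push(rowid);
    }
    settings.remove(key); // finished: the next run starts from the beginning
    done
}
```

An interrupted run leaves the cursor at the last processed rowid, so the next call skips everything at or below it. A completed run removes the cursor, which is why a second full run rescans from the first row.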
Validate
The validate command checks records in the local database against the commonmeta v1.0 JSON schema and reports any violations. Each failing record shows the JSON Pointer to the offending field and a short description of the constraint that was violated.
Errors are persisted in a validation_errors table inside the database so that --recheck can quickly re-run only the records that failed last time.
# Validate all records
# Validate only DataCite records
# Validate only DataCite datasets
# Validate the first 1,000 records
# Repair invalid records in-place (re-applies schema normalization)
# Re-validate only records that failed in the previous run
# Repair only previously-failing records
# Write errors as JSONL to a file instead of stderr
# Validate a different database
# Enrich affiliation identifiers using the local ROR database
# Enrich only Crossref records, cap at 10,000
Options
| Option | Description |
|---|---|
| --from / -f | Filter by provider (crossref, datacite, openalex). |
| --type | Filter by work type, e.g. Dataset, JournalArticle. |
| --number / -n | Maximum number of records to check (default: all). |
| --fix | Attempt to repair invalid records in-place. Applies prepare() normalization: removes non-ROR organization ids, clears invalid URIs, deduplicates geo-locations, normalizes EISSN → ISSN, etc. Repaired records are removed from validation_errors; records that cannot be repaired remain. |
| --recheck | Only re-validate records listed in the validation_errors table from the previous run. Combine with --fix for an efficient repair loop. |
| --report | Write errors as JSONL (one {"id": "…", "errors": […]} object per record) to the given file instead of printing to stderr. |
| --fill | Enrich affiliation and organization identifiers. See Fill below. |
| --organizations | Path to the ROR organizations SQLite database used by --fill (default: platform ror.sqlite3, env: ROR_DB). |
Repair loop
A typical workflow for cleaning up an imported database:
# 1. Full first pass — saves all failures to validation_errors
# 2. Subsequent passes — only re-checks and re-repairs the remaining failures
The command exits with a non-zero status if any records remain invalid after the run.
Fill
--fill enriches affiliation and organization identifiers in the works database using the local ROR organizations database (imported with commonmeta import ror). It runs independently of schema validation and never sets the valid flag.
For each record, every contributor affiliation and organization-type contributor is inspected:
| Condition | Action |
|---|---|
| id is a Crossref Funder ID or ISNI | Replaced with the matching ROR URL; name set to the ROR display name; asserted_by set to "Commonmeta". |
| id is a ROR URL and name is empty | Name filled from the ROR database. |
| id is already a ROR URL with a name | Left unchanged. |
The organizations database path defaults to the platform ror.sqlite3 location (macOS: ~/Library/Application Support/commonmeta/ror.sqlite3). Override with --organizations or the ROR_DB environment variable.
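The three rules in the table can be read as a single decision function. The sketch below is illustrative only: the `Affiliation` struct, the `fill` function, and the `lookup_ror` callback (a stand-in for a query against the local ROR database) are hypothetical, and the URL prefixes used to recognise Funder IDs and ISNIs are assumptions about how those identifiers are stored.

```rust
/// Simplified affiliation; field names follow the table, the struct itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct Affiliation {
    id: String,
    name: String,
    asserted_by: Option<String>,
}

const ROR_PREFIX: &str = "https://ror.org/";

/// Apply the table's rules. `lookup_ror` maps any supported identifier
/// to a (ROR URL, display name) pair, or None when there is no match.
fn fill<F>(aff: &mut Affiliation, lookup_ror: F)
where
    F: Fn(&str) -> Option<(String, String)>,
{
    if aff.id.starts_with(ROR_PREFIX) {
        if aff.name.is_empty() {
            // ROR URL without a name: fill the name from the ROR database.
            if let Some((_, name)) = lookup_ror(&aff.id) {
                aff.name = name;
            }
        }
        return; // ROR URL with a name: left unchanged
    }
    // Assumed identifier shapes: Crossref Funder IDs are DOIs under 10.13039.
    let is_funder_id = aff.id.starts_with("https://doi.org/10.13039/");
    let is_isni = aff.id.starts_with("https://isni.org/isni/");
    if is_funder_id || is_isni {
        if let Some((ror, name)) = lookup_ror(&aff.id) {
            aff.id = ror;
            aff.name = name;
            aff.asserted_by = Some("Commonmeta".to_string());
        }
    }
}
```

Identifiers with no match in the ROR data are left as they are in this sketch; the table does not say what the real command does in that case.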
Documentation
Documentation (work in progress) for using the library is available at the commonmeta-rs Documentation website.
Settings
The settings command reads the settings table of the local commonmeta SQLite database. Settings rows record installed vocabulary versions and bulk-import dates written by commonmeta import.
# Show all key/value settings
# Show settings for a specific database
# Show settings alongside the ORCID people database
# Show record counts for all main tables
# Show record counts for a specific database
Settings options
| Option | Description |
|---|---|
| --file | Path to the works SQLite database (overrides COMMONMETA_DB and the platform default). |
| --people-db | Path to the people SQLite database. Shows ORCID Public Data File version alongside the works settings. |
| --stats | Show record counts for all main tables instead of settings values. |
Stats output
--stats reports the row count for each of the following tables, or (table not found) if the table has not yet been created:
| Table | Contents |
|---|---|
| works | Scholarly metadata records |
| organizations | ROR organization vocabulary |
| people | ORCID person records |
| prefixes | DOI prefix registry |
| works_ror | Work → ROR junction (fast institution lookup) |
| works_orcid | Work → ORCID junction (fast author lookup) |
| works_references | Work → cited DOI junction (reverse citation lookup) |
Row counts use MAX(rowid) rather than COUNT(*) for instant results even on tables with hundreds of millions of rows. The value equals the true row count when no rows have been deleted from the table.
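A toy model shows why the two numbers agree only while nothing has been deleted. The `Table` type below is hypothetical and tracks nothing but the set of live rowids. It assumes new rows receive the largest existing rowid plus one, which is SQLite's usual behaviour for rowid tables.

```rust
use std::collections::BTreeSet;

/// Toy model of a rowid table: only the set of live rowids is tracked.
#[derive(Default)]
struct Table {
    rowids: BTreeSet<i64>,
}

impl Table {
    /// New rows get MAX(rowid) + 1 (assumed allocation rule).
    fn insert(&mut self) -> i64 {
        let next = self.max_rowid() + 1;
        self.rowids.insert(next);
        next
    }

    fn delete(&mut self, rowid: i64) {
        self.rowids.remove(&rowid);
    }

    /// Counterpart of SELECT MAX(rowid): one lookup at the end of the ordered set.
    fn max_rowid(&self) -> i64 {
        self.rowids.iter().next_back().copied().unwrap_or(0)
    }

    /// Counterpart of SELECT COUNT(*): walks every row.
    fn count(&self) -> usize {
        self.rowids.iter().count()
    }
}
```

After five inserts both methods report 5. Deleting a row from the middle drops `count()` to 4 while `max_rowid()` stays at 5, so the reported figure is an upper bound once deletions have occurred.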
Meta
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
License: MIT