Crate commonmeta

commonmeta — a Rust port of front-matter/commonmeta.

Convert scholarly metadata between formats. The native model is Data; format modules read into it and write out of it.
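
The read-into-a-model, write-out-of-it arrangement described above can be illustrated with a toy sketch. Everything in it is hypothetical: `Record`, `read_kv`, `write_tagged`, and `convert_kv_to_tagged` are stand-ins invented for this example and are not items from the commonmeta crate, whose actual `Data` type and function signatures are not shown on this page. The identifier in `main` is a placeholder.

```rust
// Toy illustration only: none of these names come from the commonmeta crate.

/// Stand-in for a native model that every format converts through.
#[derive(Debug, Default, PartialEq)]
struct Record {
    id: String,
    title: String,
}

/// "Reader": parse a simple `key=value` line format into the model.
fn read_kv(input: &str) -> Record {
    let mut record = Record::default();
    for line in input.lines() {
        if let Some((key, value)) = line.split_once('=') {
            match key.trim() {
                "id" => record.id = value.trim().to_string(),
                "title" => record.title = value.trim().to_string(),
                _ => {} // unknown keys are ignored in this sketch
            }
        }
    }
    record
}

/// "Writer": render the model as tagged lines.
fn write_tagged(record: &Record) -> String {
    format!("ID  - {}\nTI  - {}\n", record.id, record.title)
}

/// Conversion is a read followed by a write, with the model in between.
fn convert_kv_to_tagged(input: &str) -> String {
    write_tagged(&read_kv(input))
}

fn main() {
    // Placeholder identifier, used only to exercise the sketch.
    let input = "id=https://doi.org/10.1234/example\ntitle=An example title";
    print!("{}", convert_kv_to_tagged(input));
}
```

With this shape, supporting a new format means adding one reader and one writer against the shared model, not one converter per pair of formats.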

Re-exports§

pub use data::Citation;
pub use data::Data;
pub use error::Error;
pub use error::Result;
pub use schema_utils::SCHEMA_JSON;

Modules§

author_utils
cmd
constants
Controlled vocabularies and cross-format type/role translation tables.
crockford
Generate, encode and decode random base32 identifiers.
crossref
data
Core Commonmeta data model.
date_utils
Date and datetime utilities.
doi_utils
Utilities for working with DOIs
error
geonames
GeoNames populated-places reference data.
io_utils
progress
pubmed
ror_countries
schema_utils
JSON Schema and XSD validation utilities.
spdx
SPDX license vocabulary lookup.
utils
vocabularies
Embedded controlled vocabulary data files.
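
For readers unfamiliar with the encoding named by the crockford module above, here is a minimal sketch of Crockford base32 for unsigned integers. It is not the module's implementation: the function names `encode` and `decode` are hypothetical, the uppercase alphabet and the alias handling follow the general Crockford base32 convention, and random generation and checksums are left out.

```rust
// Hypothetical sketch; not the API of commonmeta's crockford module.

/// Crockford base32 alphabet: digits plus letters, without I, L, O, U.
const ALPHABET: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Encode a number as Crockford base32, most significant symbol first.
fn encode(mut n: u64) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut out = Vec::new();
    while n > 0 {
        out.push(ALPHABET[(n % 32) as usize]);
        n /= 32;
    }
    out.reverse();
    String::from_utf8(out).expect("alphabet is ASCII")
}

/// Decode Crockford base32, ignoring case and hyphens and mapping the
/// look-alike letters O to 0 and I/L to 1. Returns None on any other
/// symbol, on overflow, or on input with no symbols at all.
fn decode(s: &str) -> Option<u64> {
    let mut n: u64 = 0;
    let mut seen_symbol = false;
    for c in s.chars() {
        let c = match c.to_ascii_uppercase() {
            '-' => continue,
            'O' => '0',
            'I' | 'L' => '1',
            other => other,
        };
        let value = ALPHABET.iter().position(|&b| b as char == c)? as u64;
        n = n.checked_mul(32)?.checked_add(value)?;
        seen_symbol = true;
    }
    if seen_symbol { Some(n) } else { None }
}

fn main() {
    let encoded = encode(1234);
    println!("{encoded} -> {:?}", decode(&encoded));
}
```

The alias mapping is the main design point: identifiers that a person retypes with a lowercase l or a letter O still decode to the same value.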

Structs§

AffiliationMatch
A single match result from the ROR affiliation API.
FillReport
PersonAffiliation
PushResult
The outcome of pushing a single record to InvenioRDM.
Ror
RorRelease
Metadata about a ROR data release published on Zenodo.
ValidationError
A single record that failed commonmeta v1.0 schema validation.
ValidationReport
Summary returned by validate_sqlite.

Enums§

JunctionTable
Populate works_references for all existing rows in works that have no entry yet. Reads metadata blobs in streaming batches, extracts resolved DOI reference IDs, and inserts them with INSERT OR IGNORE (safe to re-run).

Constants§

VERSION

Functions§

backfill_junction_tables
Backfill one or more junction tables (works_orcid, works_ror, works_references) for every row in works. providers restricts the backfill to specific provider values (e.g. ["Crossref"]); an empty list means all providers. Reads blobs in streaming batches of 50,000 rows and uses INSERT OR IGNORE, so it is safe to re-run or to interrupt and resume. Returns (works_scanned, rows_inserted).
backfill_works_references
Convenience wrapper: backfill only works_references.
convert
Read from one format and write to another in a single call.
convert_citation
Like convert, but passes CSL style and locale through to the citation writer.
count_sqlite_works
Return the total number of rows in the works table of a commonmeta SQLite database — useful for reporting the cumulative count after an upsert.
crossref_fetch_page_with_cursor
Fetch one page of Crossref works using cursor-based pagination.
download_ror_all
Convenience: fetch the latest release metadata then immediately download and parse the dump. Returns (RorRelease, Vec<Ror>, from_cache).
download_ror_release
Download and parse the zip archive described by release. The zip is cached locally for 30 days so repeat installs of the same version skip the network round-trip. Returns (records, from_cache).
enrich_citations
Populate data.citations from the works_references junction table, merging with any citations already present (e.g. from DataCite/OpenAlex). No-op when db_path does not exist or the lookup fails.
enrich_ror_locations
Enrich missing geonames_details fields for each location in a Ror record using the locally installed GeoNames SQLite database. Only fills empty fields.
fetch_all_crossref_by_orcid
Fetch all works by ORCID from Crossref using cursor-based pagination.
fetch_all_datacite_by_orcid
Fetch all works by ORCID from DataCite, iterating pages until exhausted.
fetch_crossref_by_orcid
Fetch works by ORCID from Crossref, sorted by date descending. page is 1-based; Crossref offset is computed as (page-1) * limit.
fetch_crossref_by_ror
Fetch works by ROR from Crossref, sorted by date descending. page is 1-based; Crossref offset is computed as (page-1) * limit.
fetch_datacite_by_orcid
Fetch works by ORCID from DataCite, sorted by date descending. page is 1-based and maps directly to DataCite’s page[number] parameter.
fetch_datacite_by_ror
Fetch works by ROR from DataCite, sorted by date descending. page is 1-based and maps directly to DataCite’s page[number] parameter.
fetch_geonames_sqlite
Look up a GeoNames place by its integer id from the local SQLite database.
fetch_installed_geonames_date
Return the GeoNames install date stored in the local database’s settings table, or None when the database does not exist or no date has been recorded yet.
fetch_installed_orcid_public_data_version
Read the installed ORCID Public Data File version from the settings table. Returns None when no version has been recorded yet.
fetch_installed_ror_version
Return the ROR version string stored in the local database’s settings table, or None when the database does not exist or no version has been recorded yet.
fetch_installed_vraix_date
Return the vraix_date (pidbox install date, YYYY-MM-DD) stored in the local works database’s settings table, or None when the database does not exist or no date has been recorded yet.
fetch_latest_orcid_release
Fetch the latest ORCID Public Data File release metadata from figshare.
fetch_latest_ror_release
Fetch metadata for the latest ROR data release from Zenodo (InvenioRDM) without downloading the full archive. Returns the version tag, release date, Zenodo record ID, zip filename, and direct download URL.
fetch_orcid
Fetch a person from the ORCID public API and return their record as Data. Accepts a bare ORCID iD (0000-0003-1419-2405) or a full ORCID URL.
fetch_orcid_affiliations
Fetch employment and education records from the ORCID public API, returning them as a combined list sorted by start date. Supersedes fetch_orcid_employments when both affiliation types are needed.
fetch_orcid_affiliations_sqlite
Read affiliations stored in the affiliations column of the people SQLite table. Returns an empty vec when the record is absent or the column is empty.
fetch_orcid_employments
Fetch employment records from the ORCID public API for the given ORCID URL. Returns affiliations sorted by start date. When db_path is provided, non-ROR organization identifiers (GRID, ISNI, FundRef, Wikidata) are resolved to ROR IDs via the local organizations SQLite table.
fetch_orcid_person_json
Fetch a person from the ORCID public API and return the raw ORCID 3.0 person JSON conforming to orcid_schema_v3.0.json.
fetch_orcid_person_json_sqlite
Look up a person from a local people SQLite table and return the raw ORCID 3.0 person JSON conforming to orcid_schema_v3.0.json.
fetch_orcid_sqlite
Look up a person from a local people SQLite table and return their record as Data. Accepts a bare ORCID iD or a full ORCID URL. Handles both XML blobs (bulk import) and JSON blobs (single-record API import).
fetch_orcid_with_json
Fetch a person from the ORCID public API and return both the parsed Data and the raw ORCID 3.0 person JSON in a single HTTP request.
fetch_orcid_work_dois
Fetch the DOIs of all works listed on an ORCID profile, returned as normalised https://doi.org/… URLs in response order.
fetch_reference_works
Fetch the referenced works of data that have a DOI.
fetch_ror
Fetch a ROR organization by its ROR URL or other organization identifier from the ROR API. Returns the record converted to the commonmeta Data model.
fetch_ror_raw
Fetch the raw Ror struct from the ROR v2 API, bypassing the lossy Data conversion.
fetch_ror_raw_sqlite
Return the raw Ror struct for a given ROR URL from the local SQLite database, bypassing the lossy Data conversion.
fetch_ror_sqlite
Look up a ROR organization by its full URL (e.g. https://ror.org/012xzy7a9) from a local SQLite database written by write_ror_sqlite. Returns the record converted to the commonmeta Data model, or an error when not found.
fetch_vraix_dump
Fetch commonmeta records from a VRAIX daily dump for the source given by from (“crossref” or “datacite”) and the day given by date (YYYY-MM-DD).
fill_sqlite
Fill missing or convertible affiliation/organization identifiers in the works database.
flush_dragoman_cache
Delete all rows from the VRAIX-schema transport table in the dragoman cache at path and VACUUM to reclaim disk space. Call this after a successful stream_pidbox_to_sqlite import to prevent re-importing the same records on the next run. Returns the number of rows deleted.
get_all_sqlite_settings
Return all rows from the settings table, sorted by key.
get_sqlite_setting
Read a value from the settings table. Returns None when the key is absent.
import_orcid_person
Fetch a single person record from the ORCID public API, upsert the person into people_db, and fetch their works from Crossref and DataCite and upsert them into works_db (may be the same path as people_db). Accepts a bare ORCID iD or a full ORCID URL. Returns the number of works written.
import_orcid_public_data
Download and import the ORCID Public Data File summaries into the people table at output_path. Skips the download if the current version is already installed; resumes partial downloads automatically.
import_prefixes
Bulk-resolve all distinct DOI prefixes in the works database against the DOI RA API and populate the prefixes table.
install_geonames_sqlite
Download the GeoNames cities500 dump, admin1 codes, and country info; parse them; and write the records to the geonames, geonames_admin1, and geonames_countries tables in the SQLite database at path. Caches all three files for 30 days; other tables in the database are untouched. Returns (record_count, from_cache).
match_ror_affiliation
Match a free-text affiliation string against ROR organizations using the ROR v2 affiliation endpoint.
match_ror_affiliation_sqlite
Match a free-text affiliation string against a local ROR SQLite database written by write_ror_sqlite. Uses Turso’s Tantivy-backed FTS index for full-text search across all organization name variants. Returns results in relevance order with chosen set on the top result.
prepare_commonmeta
Prepare a Data record for commonmeta v1.0 JSON serialization: normalises IDs, strips schema-private reference fields, clears invalid ROR/ORCID ids, etc.
push_inveniordm
Create-or-update, then publish, a list of records in InvenioRDM.
put_inveniordm
Create-or-update, then publish, a single record in InvenioRDM.
read
Read a single record in the format named by from, without writing it back out.
read_parquet
Read a list of commonmeta records back from the Parquet schema written by write_parquet. Lossless: each record is restored from its json column, the complete original serialization.
read_ror_sqlite
Read a page of ROR organizations from the local SQLite database as Data records. limit caps the number of records returned; offset is the zero-based row offset. country_code filters by ISO 3166-1 alpha-2 code; query applies a full-text search.
read_ror_sqlite_raw
read_sqlite_by_arxiv
read_sqlite_by_citation
Fetch all works that cite doi (i.e. have it in their reference list), ordered by date_published descending.
read_sqlite_by_dois
Fetch all works whose DOI matches any entry in dois in a single SQL query. DOIs are normalised before lookup; records not found are silently omitted.
read_sqlite_by_id
Look up a single record by its id (DOI URL) in a commonmeta SQLite database. Returns None when the record is not present.
read_sqlite_by_openalex
read_sqlite_by_orcid
Fetch all works with a contributor whose ORCID matches orcid_url, ordered by date_published descending.
read_sqlite_by_pmcid
read_sqlite_by_pmid
read_sqlite_by_ror
Fetch all works with a contributor affiliated with ror_url, ordered by date_published descending.
read_sqlite_commonmeta
Read records from a commonmeta SQLite database written by write_sqlite.
read_vraix_sqlite
Read commonmeta records from a VRAIX daily dump SQLite file already on disk at sqlite_path, e.g. an already-downloaded crossref-2026-06-14.sqlite3.
rebuild_organizations_fts
Drop and rebuild the organizations_fts FTS5 virtual table.
rebuild_people_fts
Drop and rebuild the people_fts FTS5 virtual table.
rebuild_works_fts
Drop and rebuild the works_fts FTS5 virtual table from the content in works.
run_cli
Run any commonmeta CLI subcommand from a list of arguments.
run_migrations
Apply any pending schema migrations to an existing database, printing per-step progress and timing to stderr. Returns (steps_applied, version).
sample_ror_sqlite
Return a random sample of ROR organizations from the local SQLite database.
sample_ror_sqlite_raw
set_sqlite_setting
Write a key/value pair into the settings table of a commonmeta SQLite database.
stream_cache_orcid_to_people_sqlite
Read ORCID person rows written by dragoman into cache.sqlite3 and upsert them into the people table at people_path.
stream_pidbox_to_sqlite
Stream the pidbox dump (a mixed-source VRAIX SQLite file containing crossref, datacite, and ROR rows) directly to a commonmeta SQLite database. Each row is routed to the appropriate parser by its source_id; ROR rows are skipped. When update is false the output file is recreated; when true rows are upserted by id. Returns the number of records written.
stream_pmc_ids_to_sqlite
Stream a gzip-compressed PMC-ids CSV file into the commonmeta SQLite database at output_path, upserting rows that have a DOI. Pass limit = 0 to process all rows. Returns the number of records written.
stream_vraix_to_sqlite
Stream a VRAIX daily dump at input_path directly to a commonmeta SQLite database at output_path in batches of 10 000 rows, converting each batch with the parser for the from format and writing it in a single transaction. limit caps the total number of records written; pass 0 for all rows. When update is false (the default) the output file is deleted and recreated. When update is true the existing file is kept and rows are upserted by their id primary key: new rows are inserted, existing rows are replaced. Returns the number of records written. No Vec<Data> is held for the whole file, so peak memory is proportional to one batch, not the whole dump.
stream_zst_pidbox_to_sqlite
Like stream_pidbox_to_sqlite but reads directly from the zstd-compressed pidbox file without decompressing it to disk first. Requires the database to be well-organised (VACUUM’d or sequential bulk inserts) so that pages appear in DFS pre-order.
upsert_sqlite
Like write_sqlite but opens an existing database instead of recreating it. Rows whose id already exists are replaced; new rows are inserted.
validate_sqlite
Validate records in a commonmeta SQLite database against the v1.0 JSON schema.
write
Write an already-loaded record in the format named by to.
write_archive
Render list in the format named by to, split into entries of at most batch_size records each, suitable for packing into an archive via io_utils::write_zip_archive or io_utils::write_tar_gz_archive. base_name (e.g. "out.json") names the single entry directly when there is only one batch, or gets a numbered suffix ("out-00000.json", "out-00001.json", …) when there are several.
write_archive_citation
Like write_archive, but passes CSL style/locale through to the citation writer when to == "citation".
write_list
Render a list of records in the format named by to as a single buffer: a JSON array for object-shaped formats (commonmeta, csl, datacite, inveniordm, schemaorg, ror), or newline-joined output for line/document-shaped formats (e.g. bibtex, ris, crossref_xml).
write_list_citation
Like write_list, but passes CSL style/locale through to the citation writer when to == "citation" (ignored for every other format, same as convert_citation/write_citation).
write_orcid_commonmeta
Convert ORCID 3.0 person JSON + resolved affiliations + works to a commonmeta array validated against the commonmeta v1.0 schema. works may be empty.
write_orcid_inveniordm_yaml
Serialize a person to InvenioRDM names YAML (list form). person_json is the ORCID 3.0 /person response; affiliations is the list returned by fetch_orcid_employments.
write_orcid_json
Serialize an ORCID 3.0 person JSON value (from fetch_orcid_person_json or fetch_orcid_person_json_sqlite) to bytes.
write_parquet
Write a list of commonmeta records as a single Parquet file. Alongside a flattened tabular projection of each record’s fields (for filtering in tools like DuckDB without parsing JSON), every row also carries a json column with the record’s complete serialization, so read_parquet round-trips losslessly.
write_ror_commonmeta
Serialize a ROR organization Data as a v1.0-compliant commonmeta JSON array.
write_ror_json
Write a ROR-derived record as raw ROR-shaped JSON (as opposed to write("ror", data), which produces InvenioRDM vocabulary YAML).
write_ror_sqlite
Write a list of ROR records to a SQLite3 database at path with an organizations table. Existing file is deleted first. JSON array columns (types, locations, names, external_ids) are queryable via SQLite’s json_each(). The metadata column stores the full ROR JSON as a zstd-compressed BLOB for lossless round-trips.
write_ror_v2_json
Serialize a Ror record as ROR v2-compatible JSON, converting empty-string lang and preferred fields to JSON null to match the canonical API output.
write_sqlite
Write list as a SQLite3 database with a works table whose columns mirror the commonmeta v1.0 schema. Simple string fields are stored as TEXT; complex fields are stored as compact JSON TEXT. Any existing file at path is deleted first.
write_vraix_table_parquet
Write a VRAIX dump’s transport table (e.g. pid_records) to a single Parquet file’s bytes, using its raw columns (pid, source_id, raw_metadata, …) as-is rather than converting them to commonmeta Data the way read_vraix_sqlite does. Intended for analytics over the dump itself (e.g. via DataFusion/Polars/DuckDB), not for ingesting it as commonmeta records. batch_size controls how many rows land in each internal Parquet row group (see the analogous ROW_GROUP_SIZE in formats::commonmeta::write_parquet_all for why this matters for large dumps).
write_with_style
Like write, but forwards style and locale to the citation writer. For non-"citation" formats both parameters are ignored.
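
The entry-naming rule described for write_archive above (the base name used directly for a single batch, a numbered suffix for several) can be sketched as follows. This is an illustration under assumptions, not the crate's code: `entry_names` is a hypothetical helper, and the handling of an empty list, a zero batch_size, and a base name without an extension is guessed for the sake of a complete example.

```rust
// Hypothetical helper; not part of the commonmeta API.

/// Compute archive entry names for `record_count` records split into
/// batches of at most `batch_size` records each.
fn entry_names(base_name: &str, record_count: usize, batch_size: usize) -> Vec<String> {
    // Assumption: a batch size of 0 is treated as 1, and an empty list
    // still yields a single entry.
    let batch_size = batch_size.max(1);
    let batches = ((record_count + batch_size - 1) / batch_size).max(1);

    // One batch: the base name is used directly.
    if batches == 1 {
        return vec![base_name.to_string()];
    }

    // Several batches: insert a zero-padded, five-digit index before the
    // extension, e.g. "out.json" becomes "out-00000.json".
    (0..batches)
        .map(|i| match base_name.rsplit_once('.') {
            Some((stem, ext)) => format!("{stem}-{i:05}.{ext}"),
            // Assumption: names without an extension get a plain suffix.
            None => format!("{base_name}-{i:05}"),
        })
        .collect()
}

fn main() {
    for name in entry_names("out.json", 25, 10) {
        println!("{name}");
    }
}
```

Zero-padding keeps the entries in batch order when an archive tool lists them alphabetically.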