Crate commonmeta


commonmeta — a Rust port of front-matter/commonmeta.

Convert scholarly metadata between formats. The native model is Data; format modules read into it and write out of it.
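The hub-and-spoke design described above (every format reads into one shared model and writes out of it) can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: the two-field Data struct, the toy "kv" and "tsv" formats, and the argument order of read, write and convert are assumptions made for illustration and are not the crate's actual types or signatures.

```rust
/// Toy stand-in for the crate's Data model; the two fields are hypothetical.
#[derive(Debug, Default, PartialEq)]
pub struct Data {
    pub id: String,
    pub title: String,
}

/// Parse `input` in the toy format named by `from` into the shared model.
pub fn read(input: &str, from: &str) -> Result<Data, String> {
    let mut data = Data::default();
    match from {
        // "kv": one `key=value` pair per line; unknown keys are ignored.
        "kv" => {
            for line in input.lines() {
                match line.split_once('=') {
                    Some(("id", value)) => data.id = value.trim().to_string(),
                    Some(("title", value)) => data.title = value.trim().to_string(),
                    _ => {}
                }
            }
        }
        // "tsv": a single `id<TAB>title` line.
        "tsv" => {
            let (id, title) = input
                .trim_end()
                .split_once('\t')
                .ok_or("tsv input must be id<TAB>title")?;
            data.id = id.to_string();
            data.title = title.to_string();
        }
        other => return Err(format!("unsupported input format: {other}")),
    }
    Ok(data)
}

/// Serialize the shared model to the toy format named by `to`.
pub fn write(data: &Data, to: &str) -> Result<String, String> {
    match to {
        "kv" => Ok(format!("id={}\ntitle={}", data.id, data.title)),
        "tsv" => Ok(format!("{}\t{}", data.id, data.title)),
        other => Err(format!("unsupported output format: {other}")),
    }
}

/// Convert in one call by going through the shared model.
pub fn convert(input: &str, from: &str, to: &str) -> Result<String, String> {
    let data = read(input, from)?;
    write(&data, to)
}
```

With this shape, supporting a new format means adding one reader and one writer rather than a converter for every pair of formats.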

Re-exports

pub use data::Data;
pub use error::Error;
pub use error::Result;

Modules

author_utils
constants
Controlled vocabularies and cross-format type/role translation tables.
crockford
Generate, encode and decode random base32 identifiers.
crossref
data
Core Commonmeta data model.
doi_utils
Utilities for working with DOIs
error
file_utils
progress
schema_utils
JSON Schema and XSD validation utilities.
spdx
SPDX license vocabulary lookup.
utils
vocabularies
Embedded controlled vocabulary data files.
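As background for the crockford module listed above, here is a minimal sketch of Crockford base32 encoding and decoding of an unsigned integer. It follows the published Crockford alphabet (no I, L, O or U; decoding is case-insensitive, maps I and L to 1 and O to 0, and ignores hyphens). The function names encode and decode are assumptions for illustration; the module's real API, its random generation and any checksum handling are not shown here.

```rust
/// Crockford base32 alphabet: digits plus uppercase letters without I, L, O, U.
const ALPHABET: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Encode a number as a Crockford base32 string (hypothetical helper).
pub fn encode(mut n: u64) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut out = Vec::new();
    // Collect digits from least to most significant, then reverse.
    while n > 0 {
        out.push(ALPHABET[(n % 32) as usize]);
        n /= 32;
    }
    out.reverse();
    String::from_utf8(out).expect("alphabet is ASCII")
}

/// Decode a Crockford base32 string (hypothetical helper).
/// Returns None for empty input, invalid symbols, or overflow of u64.
pub fn decode(s: &str) -> Option<u64> {
    let mut n: u64 = 0;
    let mut seen_digit = false;
    for c in s.chars() {
        let c = match c.to_ascii_uppercase() {
            '-' => continue,  // hyphens are separators only
            'I' | 'L' => '1', // commonly confused with 1
            'O' => '0',       // commonly confused with 0
            other => other,
        };
        let digit = ALPHABET.iter().position(|&b| b as char == c)? as u64;
        n = n.checked_mul(32)?.checked_add(digit)?;
        seen_digit = true;
    }
    if seen_digit {
        Some(n)
    } else {
        None
    }
}
```

For example, encode(1234) yields "16J", and decode accepts "16j" or "I6-J" as the same value because of the case folding and symbol substitutions.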

Structs

AffiliationMatch
A single match result from the ROR affiliation API.
PushResult
The outcome of pushing a single record to InvenioRDM.
RorRelease
Metadata about a ROR data release published on Zenodo.

Constants

VERSION

Functions

convert
Read from one format and write to another in a single call.
convert_citation
Like convert, but passes CSL style and locale through to the citation writer.
count_sqlite_works
Return the total number of rows in the works table of a commonmeta SQLite database — useful for reporting the cumulative count after an upsert.
crossref_fetch_page_with_cursor
Fetch one page of Crossref works using cursor-based pagination.
download_ror_all
Convenience: fetch the latest release metadata then immediately download and parse the dump. Returns (RorRelease, Vec<Ror>, from_cache).
download_ror_release
Download and parse the zip archive described by release. The zip is cached locally for 30 days so repeat installs of the same version skip the network round-trip. Returns (records, from_cache).
fetch_installed_ror_version
Return the ROR version string stored in the local database’s settings table, or None when the database does not exist or no version has been recorded yet.
fetch_installed_vraix_date
Return the vraix_date (pidbox install date, YYYY-MM-DD) stored in the local works database’s settings table, or None when the database does not exist or no date has been recorded yet.
fetch_latest_ror_release
Fetch metadata for the latest ROR data release from Zenodo (InvenioRDM) without downloading the full archive. Returns the version tag, release date, Zenodo record ID, zip filename, and direct download URL.
fetch_ror
Fetch a ROR organization by its ROR URL or other organization identifier from the ROR API. Returns the record converted to the commonmeta Data model.
fetch_ror_sqlite
Look up a ROR organization by its full URL (e.g. https://ror.org/012xzy7a9) from a local SQLite database written by write_ror_sqlite. Returns the record converted to the commonmeta Data model, or an error when not found.
fetch_vraix_dump
Fetch commonmeta records from a VRAIX daily dump for the source given by from (“crossref” or “datacite”) and the day given by date (YYYY-MM-DD).
match_ror_affiliation
Match a free-text affiliation string against ROR organizations using the ROR v2 affiliation endpoint.
match_ror_affiliation_sqlite
Match a free-text affiliation string against a local ROR SQLite database written by write_ror_sqlite. Uses Turso’s Tantivy-backed FTS index for full-text search across all organization name variants. Returns results in relevance order, with the chosen field set on the top result.
push_inveniordm
Create-or-update, then publish, a list of records in InvenioRDM.
put_inveniordm
Create-or-update, then publish, a single record in InvenioRDM.
read
Read a single record in the format named by the from argument, without writing it back out.
read_parquet
Read a list of commonmeta records back from the Parquet schema written by write_parquet. Lossless: each record is restored from its json column, the complete original serialization.
read_sqlite_by_id
Look up a single record by its id (DOI URL) in a commonmeta SQLite database. Returns None when the record is not present.
read_sqlite_commonmeta
Read records from a commonmeta SQLite database written by write_sqlite.
read_vraix_sqlite
Read commonmeta records from a VRAIX daily dump SQLite file already on disk at sqlite_path, e.g. an already-downloaded crossref-2026-06-14.sqlite3.
stream_pidbox_to_sqlite
Stream the pidbox dump (a mixed-source VRAIX SQLite file containing crossref, datacite, and ROR rows) directly to a commonmeta SQLite database. Each row is routed to the appropriate parser by its source_id; ROR rows are skipped. When update is false, the output file is recreated; when update is true, rows are upserted by id. Returns the number of records written.
stream_vraix_to_sqlite
Stream a VRAIX daily dump at input_path directly to a commonmeta SQLite database at output_path in batches of 10 000 rows, converting each batch with the parser for the from source and writing it in a single transaction. limit caps the total number of records written; pass 0 for all rows. When update is false (the default), the output file is deleted and recreated. When update is true, the existing file is kept and rows are upserted by their id primary key: new rows are inserted and existing rows are replaced. Returns the number of records written. No Vec<Data> is held for the whole file, so peak memory is proportional to one batch rather than to the whole dump.
stream_zst_pidbox_to_sqlite
Like stream_pidbox_to_sqlite but reads directly from the zstd-compressed pidbox file without decompressing it to disk first. Requires the database to be well-organised (VACUUM’d or sequential bulk inserts) so that pages appear in DFS pre-order.
upsert_sqlite
Like write_sqlite but opens an existing database instead of recreating it. Rows whose id already exists are replaced; new rows are inserted.
write
Write an already-loaded record in the format named by the to argument.
write_archive
Render list in the to format, split into entries of at most batch_size records each, suitable for packing into an archive via file_utils::write_zip_archive/file_utils::write_tar_gz_archive. base_name (e.g. "out.json") names the single entry directly when there is only one batch, or gets a numbered suffix ("out-00000.json", "out-00001.json", …) when there are several.
write_archive_citation
Like write_archive, but passes CSL style/locale through to the citation writer when to == "citation".
write_list
Render a list of records in the to format as a single buffer: a JSON array for object-shaped formats (commonmeta, csl, datacite, inveniordm, schemaorg, ror), or newline-joined output for line/document-shaped formats (e.g. bibtex, ris, crossref_xml).
write_list_citation
Like write_list, but passes CSL style/locale through to the citation writer when to == "citation" (ignored for every other format, same as convert_citation/write_citation).
write_parquet
Write a list of commonmeta records as a single Parquet file. Alongside a flattened tabular projection of each record’s fields (for filtering in tools like DuckDB without parsing JSON), every row also carries a json column with the record’s complete serialization, so read_parquet round-trips losslessly.
write_ror_json
Write a ROR-derived record as raw ROR-shaped JSON (as opposed to write("ror", data), which produces InvenioRDM vocabulary YAML).
write_ror_sqlite
Write a list of ROR records to a SQLite3 database at path with an organizations table. Any existing file is deleted first. JSON array columns (types, locations, names, external_ids) are queryable via SQLite’s json_each(). The metadata column stores the full ROR JSON as a zstd-compressed BLOB for lossless round-trips.
write_sqlite
Write list as a SQLite3 database with a works table whose columns mirror the commonmeta v1.0 schema. Simple string fields are stored as TEXT; complex fields are stored as compact JSON TEXT. Any existing file at path is deleted first.
write_vraix_table_parquet
Write a VRAIX dump’s transport table (e.g. pid_records) to the bytes of a single Parquet file, using its raw columns (pid, source_id, raw_metadata, …) as-is, without converting them to commonmeta Data the way read_vraix_sqlite does. Intended for analytics over the dump itself (e.g. via DataFusion/Polars/DuckDB), not for ingesting it as commonmeta records. batch_size controls how many rows land in each internal Parquet row group (see the analogous ROW_GROUP_SIZE in formats::commonmeta::write_parquet_all for why this matters for large dumps).
write_with_style
Like write, but forwards style and locale to the citation writer. For non-"citation" formats both parameters are ignored.
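The entry naming described for write_archive above (a single batch keeps base_name, several batches get a zero-padded numeric suffix before the extension) can be sketched as follows. The helper entry_names is hypothetical and not part of the crate. Two behaviours are assumptions made here: an empty list still yields one entry, and a base_name without an extension gets the suffix appended at the end.

```rust
/// Hypothetical helper: compute archive entry names for `total` records
/// split into batches of at most `batch_size` records each.
pub fn entry_names(base_name: &str, total: usize, batch_size: usize) -> Vec<String> {
    assert!(batch_size > 0, "batch_size must be positive");

    // Ceiling division; assumption: an empty list still produces one entry.
    let batches = std::cmp::max(1, (total + batch_size - 1) / batch_size);

    // A single batch uses base_name directly, as the description states.
    if batches == 1 {
        return vec![base_name.to_string()];
    }

    // Split "out.json" into stem "out" and extension ".json".
    // Assumption: names without a dot get the suffix appended at the end.
    let (stem, ext) = match base_name.rfind('.') {
        Some(i) => base_name.split_at(i),
        None => (base_name, ""),
    };

    // Number entries from 0 with five zero-padded digits: out-00000.json, ...
    (0..batches)
        .map(|n| format!("{stem}-{n:05}{ext}"))
        .collect()
}
```

Calling entry_names("out.json", 25, 10) gives out-00000.json, out-00001.json and out-00002.json, while entry_names("out.json", 3, 10) gives just out.json.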