Module text

Structs

ExtractionMeta
Metadata written alongside each DataLab extraction cache entry as meta.json.
WorkTextResult
Result of extracting text from a work’s PDF.
ZoteroItemInfo
Brief Zotero library info for a work matched by DOI.

Enums

PdfSource
Where the PDF was obtained from.
ProcessingMode
WorkTextError
Errors from the work_text pipeline.

Functions

datalab_cache_dir_path
Return the local cache directory path for cache_id if determinable.
datalab_cached_item_keys
Return the keys of all locally cached DataLab extractions.
datalab_cached_json
Return the cached JSON for cache_id if it exists, otherwise None.
datalab_cached_markdown
Return the cached markdown for cache_id if it exists, otherwise None.
do_extract
Extract text from PDF bytes, routing through DataLab if datalab is Some.
download_extraction_from_zotero
Download papers_extract_{item_key}.zip from Zotero (identified by att_key) and restore it to the local cache directory.
extract_text_bytes
Extract text from PDF bytes using pdf-extract.
find_work_in_zotero
Check if a work exists in the Zotero library, matched by DOI.
poll_zotero_for_work
Poll Zotero for a work by DOI. Waits 5s initially, then polls every 2s for up to ~2 min.
read_extraction_meta
Read the meta.json for cache_id from the local DataLab cache, if present.
try_zotero
Try to find and download a PDF from Zotero (local storage first, then remote API).
upload_extraction_to_zotero
Upload the local DataLab cache for item_key to Zotero as papers_extract_{item_key}.zip attached to that same item.
work_text
Download and extract the full text of a scholarly work.
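
The timing described for poll_zotero_for_work (a 5 s initial wait, then a check every 2 s for up to roughly 2 minutes) can be illustrated with a small, generic polling loop. This is only a sketch under stated assumptions: `PollSchedule`, `poll_until`, and the injected `check` and `sleep` closures are hypothetical names introduced here, not part of this module, and the real function's signature, error handling, and exact cutoff are not shown on this page.

```rust
use std::time::Duration;

/// Hypothetical schedule type; the field values below mirror the timings
/// stated in the poll_zotero_for_work description.
pub struct PollSchedule {
    pub initial_delay: Duration,
    pub interval: Duration,
    pub max_total: Duration,
}

impl PollSchedule {
    /// 5 s initial wait, 2 s between checks, about 2 minutes overall.
    /// The exact 120 s budget is an assumption ("~2 min" in the docs).
    pub fn documented() -> Self {
        PollSchedule {
            initial_delay: Duration::from_secs(5),
            interval: Duration::from_secs(2),
            max_total: Duration::from_secs(120),
        }
    }
}

/// Waits `initial_delay`, then calls `check` repeatedly, sleeping `interval`
/// between attempts, until `check` yields a value or the next sleep would
/// exceed `max_total`. `sleep` is injected so callers (and tests) control
/// how waiting actually happens.
pub fn poll_until<T, C, S>(schedule: &PollSchedule, mut check: C, mut sleep: S) -> Option<T>
where
    C: FnMut() -> Option<T>,
    S: FnMut(Duration),
{
    sleep(schedule.initial_delay);
    let mut waited = schedule.initial_delay;
    loop {
        if let Some(found) = check() {
            return Some(found);
        }
        // Give up once another interval would go past the time budget.
        if waited + schedule.interval > schedule.max_total {
            return None;
        }
        sleep(schedule.interval);
        waited += schedule.interval;
    }
}
```

In blocking code the `sleep` argument could simply be `std::thread::sleep`, and `check` would wrap whatever lookup is appropriate (for example, a DOI search). Injecting both keeps the schedule testable without real delays.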