Module gdocs

Google Docs capture module.

Supports API-based capture of Google Docs documents via the export URL pattern:

https://docs.google.com/document/d/{DOCUMENT_ID}/export?format={FORMAT}

§Supported Export Formats

  • html — HTML document (images as base64 data URIs)
  • txt — Plain text
  • md — Markdown (native Google Docs export)
  • pdf — PDF document
  • docx — Microsoft Word document
  • epub — EPUB ebook format
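
The export URL pattern and the format list above can be illustrated with a small standalone sketch. The names `sketch_extract_document_id` and `sketch_build_export_url` are hypothetical and exist only for this illustration; they are not the crate's `extract_document_id` or `build_export_url`, whose signatures are not shown on this page. The sketch assumes the document ID is the path segment that follows `/document/d/`.

```rust
/// Export formats listed in the module documentation.
const SUPPORTED_FORMATS: [&str; 6] = ["html", "txt", "md", "pdf", "docx", "epub"];

/// Hypothetical helper: pull the document ID out of a URL shaped like
/// `https://docs.google.com/document/d/{DOCUMENT_ID}/...`.
fn sketch_extract_document_id(url: &str) -> Option<&str> {
    const MARKER: &str = "docs.google.com/document/d/";
    let start = url.find(MARKER)? + MARKER.len();
    let rest = &url[start..];
    // Assumption: the ID ends at the next path separator, query, or fragment.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let id = &rest[..end];
    if id.is_empty() {
        None
    } else {
        Some(id)
    }
}

/// Hypothetical helper: fill in the export URL pattern for a listed format.
/// Returns `None` for a format that is not in `SUPPORTED_FORMATS`.
fn sketch_build_export_url(document_id: &str, format: &str) -> Option<String> {
    if !SUPPORTED_FORMATS.contains(&format) {
        return None;
    }
    Some(format!(
        "https://docs.google.com/document/d/{document_id}/export?format={format}"
    ))
}
```

Calling `sketch_extract_document_id` on an `/edit` URL and passing the result to `sketch_build_export_url` with `"pdf"` would yield the corresponding export URL; an unlisted format such as `"odt"` yields `None`.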

§Example

use web_capture::gdocs;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // An editor (/edit) URL; it is checked before any export is attempted.
    let url = "https://docs.google.com/document/d/abc123/edit";
    if gdocs::is_google_docs_url(url) {
        // Request the HTML export format.
        let result = gdocs::fetch_google_doc(url, "html", None).await?;
        println!("Content length: {}", result.content.len());
    }
    Ok(())
}

Structs§

CapturedDocument
Parsed Google Docs model/document capture.
ExtractedImage
An image extracted from base64 data URIs in HTML.
GDocsArchiveResult
Result of fetching a Google Doc as an archive.
GDocsExportPreprocessResult
Result of running the Google Docs export HTML pre-processor.
GDocsRenderedOutput
Rendered document output.
GDocsRenderedResult
Rendered Google Docs content from either Docs API or editor model data.
GDocsResult
Result of fetching a Google Docs document.
ListMeta
RemoteImage
Remote image reference extracted from browser-model capture.
TableBlock
Captured table.
TableCell
Captured table cell.
TableRow
Captured table row.
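
Since the HTML export embeds images as base64 data URIs, a rough idea of what locating them involves may be useful. The following is a minimal sketch using only the standard library; `SketchDataUri` and `sketch_find_base64_images` are hypothetical names, not the crate's `ExtractedImage` or `extract_base64_images`. It assumes each URI has the form `data:image/<subtype>;base64,<payload>` and ends at a quote, a closing parenthesis, or whitespace. The payload is left undecoded.

```rust
/// Hypothetical record for one embedded image; not the crate's `ExtractedImage`.
#[derive(Debug, PartialEq)]
struct SketchDataUri {
    mime_type: String,
    base64_payload: String,
}

/// Scan HTML for `data:image/...;base64,...` URIs, in document order.
fn sketch_find_base64_images(html: &str) -> Vec<SketchDataUri> {
    const PREFIX: &str = "data:image/";
    const MARKER: &str = ";base64,";
    let mut found = Vec::new();
    let mut rest = html;
    while let Some(pos) = rest.find(PREFIX) {
        // Skip the `data:` scheme so the slice starts at the MIME type.
        let after_scheme = &rest[pos + "data:".len()..];
        // Assumption: the URI ends at a quote, `)`, or whitespace.
        let end = after_scheme
            .find(|c: char| c == '"' || c == '\'' || c == ')' || c.is_whitespace())
            .unwrap_or(after_scheme.len());
        let uri_body = &after_scheme[..end];
        // Only base64-encoded URIs are collected; others are skipped.
        if let Some((mime, payload)) = uri_body.split_once(MARKER) {
            found.push(SketchDataUri {
                mime_type: mime.to_string(),
                base64_payload: payload.to_string(),
            });
        }
        // Continue scanning after the URI just examined.
        rest = &after_scheme[end..];
    }
    found
}
```

A real implementation would also decode the payload and handle malformed input; this sketch only separates the MIME type from the encoded data.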

Enums§

CapturedBlock
Captured block.
ContentNode
Captured inline content node.
GDocsCaptureMethod
Google Docs capture backend selected from the CLI --capture flag.

Functions§

build_docs_api_url
Build a Google Docs REST API URL.
build_edit_url
Build a Google Docs editor URL.
build_export_url
Build a Google Docs export URL.
create_archive_zip
Create a ZIP archive from a GDocsArchiveResult.
extract_base64_images
Extract base64 data URI images from HTML content.
extract_bearer_token
Extract a Bearer token from an Authorization header value.
extract_document_id
Extract the document ID from a Google Docs URL.
fetch_google_doc
Fetch a Google Docs document via the export URL.
fetch_google_doc_as_archive
Fetch a Google Docs document as a ZIP archive.
fetch_google_doc_as_markdown
Fetch a Google Docs document and convert to Markdown.
fetch_google_doc_from_docs_api
Fetch and render a Google Docs document via the authenticated REST API.
fetch_google_doc_from_model
Fetch and render the model data embedded in the Google Docs /edit route.
is_google_docs_url
Check if a URL is a Google Docs document URL.
localize_rendered_remote_images_for_archive
Build a self-contained archive result from browser-model rendered output.
normalize_google_docs_export_markdown
Normalize Markdown emitted from Google Docs public-export HTML converters.
parse_model_chunks
Parse captured DOCS_modelChunk values.
parse_model_chunks_with_export_html
Parse captured DOCS_modelChunk values and optionally merge semantic hints from Google Docs export HTML.
preprocess_google_docs_export_html
Pre-process Google Docs export HTML so the generic html2md pipeline preserves inline formatting, heading numbering, and link targets.
render_captured_document
Render a parsed Google Docs capture as Markdown, HTML, or text.
render_docs_api_document
Render a Google Docs REST API document value.
select_capture_method
Select a Google Docs capture backend from the CLI --capture value.
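
As an illustration of the kind of parsing `extract_bearer_token` describes, here is a minimal standalone sketch. The function name `sketch_extract_bearer_token` and its signature are assumptions made for this example; the crate's actual signature and edge-case behavior are not shown on this page. The sketch treats the scheme name as case-insensitive and requires a non-empty token after it.

```rust
/// Hypothetical helper: return the token from an `Authorization` header value
/// of the form `Bearer <token>`, or `None` for any other shape.
fn sketch_extract_bearer_token(header_value: &str) -> Option<&str> {
    let trimmed = header_value.trim();
    // Split at the first whitespace: scheme on the left, credentials on the right.
    // A value with no whitespace (for example just "Bearer") yields `None`.
    let (scheme, token) = trimmed.split_once(char::is_whitespace)?;
    // HTTP authentication scheme names are compared case-insensitively.
    if !scheme.eq_ignore_ascii_case("bearer") {
        return None;
    }
    // Tolerate more than one space between the scheme and the token.
    Some(token.trim_start())
}
```

With this sketch, `"Bearer abc.def"` yields `Some("abc.def")`, while `"Basic dXNlcg=="` and a bare `"Bearer"` both yield `None`.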