faucet-source-gcs

Google Cloud Storage source connector for the faucet-stream ecosystem.

Lists objects in a bucket (with optional prefix) or accepts an explicit list of object keys, then fetches and parses each object as one of three formats — JSON Lines, JSON Array, or raw text — yielding records as serde_json::Value. Mirrors the existing faucet-source-s3 crate structurally and shares its semantics.

Built on the official google-cloud-storage SDK (1.12).

Config

pub struct GcsSourceConfig {
    pub bucket: String,
    pub prefix: Option<String>,
    pub object_keys: Option<Vec<String>>,
    pub credentials: GcsCredentials,
    pub file_format: GcsFileFormat,
    pub max_objects: Option<usize>,
    pub concurrency: usize,        // default 10
    pub batch_size: usize,         // default DEFAULT_BATCH_SIZE; 0 = one page per object
    pub storage_host: Option<String>,
}

Field	Description
`bucket`	GCS bucket name (no `gs://` prefix, no path).
`prefix`	Object-name prefix filter for listing. Ignored when `object_keys` is set.
`object_keys`	Explicit list of object names. When set, listing is skipped.
`credentials`	See `GcsCredentials` below. Defaults to `application_default`.
`file_format`	`json_lines` (default), `json_array`, or `raw_text`.
`max_objects`	Hard cap on the number of objects scanned.
`concurrency`	Maximum concurrent object reads.
`batch_size`	Records per emitted `StreamPage`. See Streaming.
`storage_host`	Endpoint override (integration tests only — production users leave unset).

YAML example:

source:
  type: gcs
  bucket: my-bucket
  prefix: events/2026/
  auth:
    type: service_account_json_file
    config:
      path: /run/secrets/gcp.json
  file_format: json_lines
  concurrency: 20
  batch_size: 5000

Authentication

See faucet-common-gcs for the full GcsCredentials reference. v1 supports:

application_default (ADC — recommended on GCE/GKE).
service_account_json_file (path to a key file).
service_account_json_inline (key as an inline string, env-injectable via ${env:GCP_SA_JSON} in CLI configs).

HMAC-key auth and signed-URL generation are out of scope for v1.

File formats

Format	Behaviour
`json_lines`	One JSON record per line. Blank lines skipped. Streams line-by-line.
`json_array`	The entire object is a JSON array of records. Buffered fully per object.
`raw_text`	The whole object becomes one record `{"key": <name>, "content": <utf-8>}`.

JSONL parse errors carry the object key + 1-based line number. JSON-array parse errors include the object key. Non-UTF-8 bodies surface as FaucetError::Source with a "not valid UTF-8" hint.

Streaming and batching

Source::stream_pages decodes JsonLines line-by-line so client-side memory stays bounded at O(batch_size) regardless of file size. RawText emits one record per object ({"key": ..., "content": ...}): the whole file is inherently its record, but it is streamed straight into one String via the same decoding reader JsonLines uses — no separate raw + decompressed copies for compressed objects. JsonArray buffers each object fully (the closing ] is required to parse the structure) and then chunks; very large arrays hold the full object in memory once.

batch_size = 0 is the no-batching sentinel: every page contains one complete object, with no within-object chunking and no cross-object accumulation. Useful for small lookup tables.

For non-zero batch_size, lines from multiple objects can share a page (cross-object flattening). The S3 source documents the same caveat — this is intentional.

Memory ceiling — RawText / JsonArray. Both hold one whole decoded object in memory at a time (inherent: RawText's record is the whole file, and a JSON array isn't valid until its closing ]). Objects are fetched concurrently, so peak memory is bounded by roughly concurrency × (largest object's decoded size), not by batch_size. For large RawText / JsonArray objects, lower concurrency to cap peak memory, or re-emit the data as JsonLines upstream so it streams at O(batch_size).

Errors

Failure	`FaucetError` variant	Message shape
Bad / missing credentials	`Auth`	`"GCS auth: ..."`
List API error	`Source`	`"GCS list error for bucket '{bucket}': {e}"`
Get object API error	`Source`	`"GCS get error for bucket '{bucket}' key '{key}': {e}"`
Body read error	`Source`	`"GCS read body error for key '{key}': {e}"`
Read / decode / non-UTF-8 body (`RawText` / `JsonArray`)	`Source`	`"GCS read/decode error for key '{key}' (not valid UTF-8?): {e}"`
JSON parse error	`Source`	`"GCS JSON parse error in '{key}' at line N: ..."`

Running the tests

cargo test -p faucet-source-gcs              # unit tests (no network)
cargo test -p faucet-source-gcs --test integration -- --ignored

Integration tests are marked #[ignore] because they require a real GCS-compatible gRPC backend. The google-cloud-storage SDK uses gRPC for control-plane operations (listing, metadata), and fake-gcs-server only speaks the REST API — so cargo test against the emulator fails with h2 protocol error / GoAway. Run the --ignored suite against a real GCS bucket or a gRPC-capable emulator when validating changes.

Compression

Behind the crate-local compression Cargo feature. Adds a compression config field with values none, gzip, zstd, or auto (the default — detects .gz / .zst from the file path / object key).

YAML example:

kind: gcs
config:
  # ... existing fields ...
  compression: auto  # or 'gzip' | 'zstd' | 'none'

The codec resolves per object key, so a single source can read a mix of compressed and uncompressed objects in one run.

Out of scope (v1)

HMAC-key auth.
Signed URL generation.
Mid-scan resumable bookmarks (matches faucet-source-s3 behaviour).
KMS CMEK encryption configuration.
Server-streaming gRPC reads.

License

Dual-licensed under MIT and Apache-2.0, per the workspace license field.

faucet-source-gcs 1.0.1