rust-data-processing 0.3.0

Schema-first ingestion (CSV, JSON, Parquet, Excel) into an in-memory DataSet, plus Polars-backed pipelines, SQL, profiling, validation, and map/reduce-style processing.
# Using `ffi_manifest.json` from Java (JAR + native `rdp_jvm_sys`)

The file **`bindings/jvm-sys/ffi_manifest.json`** is the **source of truth** for which `extern "C"` symbols exist in the **`rdp_jvm_sys`** shared library. The same bytes are bundled in the **`rust-data-processing-jvm`** JAR so applications can **discover symbols and ABI version at runtime** without parsing C headers.

| Location | Role |
| --- | --- |
| **`bindings/jvm-sys/ffi_manifest.json`** | Canonical manifest (Rust build / reviews) |
| **`rust-data-processing-jvm` JAR** | Classpath resource **`RdpNativeJson.FFI_MANIFEST_RESOURCE`** (`/io/github/scorpio_datalake/rust_data_processing/ffi_manifest.json`) |
| **`RdpNativeJson`** | High-level calls: `invokeAbiVersion`, `invokeParityExport` (JSON `RdpJsonSlice` protocol) |

**CI** enforces that the bundled copy matches `bindings/jvm-sys/ffi_manifest.json` (`python scripts/check_jvm_ffi_manifest.py`).
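
The same byte-for-byte comparison can be approximated locally in Java. This is a minimal sketch, not the CI script: the internals of `check_jvm_ffi_manifest.py` are not shown in this document, so the sketch only assumes a plain byte comparison. The class `ManifestCopyCheck` and its methods are hypothetical names, not part of the JAR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: compares the canonical manifest with a bundled copy, byte for byte. */
final class ManifestCopyCheck {
  private ManifestCopyCheck() {}

  /** Pure comparison, usable on bytes read from a file or from a classpath resource. */
  static boolean sameBytes(byte[] canonical, byte[] bundled) {
    return Arrays.equals(canonical, bundled);
  }

  /** Convenience overload for two files on disk. */
  static boolean sameBytes(Path canonical, Path bundled) throws IOException {
    // Files.mismatch returns -1 when both files have identical content.
    return Files.mismatch(canonical, bundled) == -1L;
  }
}
```

The byte-array overload pairs naturally with `getResourceAsStream(...).readAllBytes()` when the bundled copy is read from the classpath rather than from disk.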

---

## 1. Maven dependency

Use the same **`groupId`** / **`artifactId`** / **`version`** as the published module (or `0.1.0-SNAPSHOT` when building locally after `mvn install`):

```xml
<dependency>
  <groupId>io.github.scorpio-datalake.rust-data-processing</groupId>
  <artifactId>rust-data-processing-jvm</artifactId>
  <version>0.1.0-SNAPSHOT</version>
</dependency>
```

You still need the **native** library (`librdp_jvm_sys.so`, `rdp_jvm_sys.dll`, or `librdp_jvm_sys.dylib`) built from **`bindings/jvm-sys`** (see **`bindings/java/rust-data-processing-jvm/README.md`**). The JAR does **not** embed that binary.

---

## 2. JVM flags and native library path

Panama **FFM** downcalls require native access:

```text
--enable-native-access=ALL-UNNAMED
```

Set the library path (absolute) via environment variable:

```bash
export RDP_JVM_SYS=/absolute/path/to/librdp_jvm_sys.so
```

or Java system property:

```text
-Drdp.jvm.sys.library=C:\absolute\path\to\rdp_jvm_sys.dll
```

The examples module uses the same resolution as tests (`ExamplesNativeLibrary`).
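
An application can resolve the two settings above with a few lines of plain Java. The sketch below is illustrative only: the precedence (system property first, then environment variable) is an assumption, and `NativeLibraryLocator` is a hypothetical class, so `ExamplesNativeLibrary` may behave differently.

```java
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical resolver for the rdp_jvm_sys library path. */
final class NativeLibraryLocator {
  static final String PROPERTY = "rdp.jvm.sys.library"; // set with -Drdp.jvm.sys.library=...
  static final String ENV_VAR = "RDP_JVM_SYS";

  private NativeLibraryLocator() {}

  /** Reads the real process settings. */
  static Optional<Path> locate() {
    return locate(System.getProperty(PROPERTY), System.getenv(ENV_VAR));
  }

  /** Testable core. Assumed precedence: system property, then environment variable. */
  static Optional<Path> locate(String propertyValue, String envValue) {
    String raw = (propertyValue != null && !propertyValue.isBlank()) ? propertyValue : envValue;
    if (raw == null || raw.isBlank()) {
      return Optional.empty(); // nothing configured
    }
    Path path = Path.of(raw.trim());
    if (!path.isAbsolute()) {
      // The document asks for an absolute path, so reject relative ones early.
      throw new IllegalArgumentException("native library path must be absolute: " + raw);
    }
    return Optional.of(path);
  }
}
```

Failing fast on a relative path gives a clearer error than a later `libraryLookup` failure would.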

---

## 3. Read the manifest from the JAR

Use a class from **`rust-data-processing-jvm`** so the resource loads from that JAR:

```java
import io.github.scorpio_datalake.rust_data_processing.ffi.RdpNativeJson;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.json.JSONObject;

try (InputStream in = RdpNativeJson.class.getResourceAsStream(RdpNativeJson.FFI_MANIFEST_RESOURCE)) {
  if (in == null) {
    throw new IllegalStateException("ffi_manifest.json not on classpath");
  }
  JSONObject manifest = new JSONObject(new String(in.readAllBytes(), StandardCharsets.UTF_8));
  int abi = manifest.getInt("abi_version_constant");
  var symbols = manifest.getJSONArray("exported_symbols");
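  for (int i = 0; i < symbols.length(); i++) {
    String symbol = symbols.getString(i); // name of one exported extern "C" symbol
    System.out.println(symbol);
  }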
}
```

**Runnable repo example:** `LoadFfiManifestExample` in **`bindings/java/rust-data-processing-jvm-examples/`** (prints all symbols and probes ABI when `RDP_JVM_SYS` is set).

---

## 4. Call an exported symbol (parity JSON exports)

Every name in **`exported_symbols`** except the free helper **`rdp_json_slice_free`** is an entry point intended to be resolved through a lookup created with **`SymbolLookup.libraryLookup`**. Parity exports (`rdp_parity_*`) share one C signature, `void (*)(RdpJsonSlice* out)`, which is the convention **`RdpNativeJson.invokeParityExport`** uses: the native side writes JSON into the slice, and the caller must then release it with **`rdp_json_slice_free`**. **`invokeParityExport`** already frees the slice for you.

Minimal usage:

```java
import io.github.scorpio_datalake.rust_data_processing.ffi.RdpNativeJson;
import java.lang.foreign.Arena;
import java.lang.foreign.Linker;
import java.lang.foreign.SymbolLookup;
import java.nio.file.Path;
import org.json.JSONObject;

Linker linker = Linker.nativeLinker();
Path lib = Path.of(System.getenv("RDP_JVM_SYS"));
try (Arena arena = Arena.ofConfined()) {
  SymbolLookup lookup = SymbolLookup.libraryLookup(lib, arena);
  JSONObject root = RdpNativeJson.invokeParityExport(linker, lookup, arena, "rdp_parity_bindings_mirror");
  // root: keys ok, interchange, notes — same envelope as python-wrapper parity tests
}
```

**Runnable repo examples:** `RunPytestMirrorExample` — pass any **`rdp_parity_*`** name from the manifest as the sole CLI argument. **`ParityScenariosWalkthrough`** (under `rust-data-processing-jvm-examples`) runs several exports in one run and prints short **`interchange`** summaries (see that module’s `README.md`).

To validate JSON shape the same way as unit tests, use **`PytestMirrorAssertions.validateMirrorExport(exportName, root)`** for `*_mirror` exports.

---

## 5. ABI version only

```java
// linker and lookup are created as in section 4
int abi = RdpNativeJson.invokeAbiVersion(linker, lookup);
```

Compare with **`abi_version_constant`** from the manifest; they must match for a compatible **`rdp_jvm_sys`** build.
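
The comparison can be made explicit with a small guard that fails fast on a mismatch. This is a sketch under the assumption that a strict equality check is what you want; `AbiGuard` is a hypothetical class, not part of the JAR.

```java
/** Hypothetical guard: compares the manifest's ABI constant with the loaded library's ABI. */
final class AbiGuard {
  private AbiGuard() {}

  /** Throws when the two ABI versions differ, so no further downcalls are attempted. */
  static void requireMatch(int manifestAbi, int nativeAbi) {
    if (manifestAbi != nativeAbi) {
      throw new IllegalStateException(
          "rdp_jvm_sys ABI mismatch: manifest expects " + manifestAbi
              + " but the loaded library reports " + nativeAbi);
    }
  }
}
```

In an application, the two arguments would come from `manifest.getInt("abi_version_constant")` and `RdpNativeJson.invokeAbiVersion(linker, lookup)`.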

---

## 6. Running the examples with a plain `java -cp` classpath

From **`rust-data-processing-jvm-examples`** after `mvn -DskipTests package` (with the main module already `mvn install`’d):

```bash
export RDP_JVM_SYS=/path/to/librdp_jvm_sys.so
export JAVA_TOOL_OPTIONS='--enable-native-access=ALL-UNNAMED'
java -cp "target/rust-data-processing-jvm-examples-0.1.0-SNAPSHOT.jar:../rust-data-processing-jvm/target/rust-data-processing-jvm-0.1.0-SNAPSHOT.jar" \
  io.github.scorpio_datalake.rust_data_processing.examples.LoadFfiManifestExample
java -cp "target/rust-data-processing-jvm-examples-0.1.0-SNAPSHOT.jar:../rust-data-processing-jvm/target/rust-data-processing-jvm-0.1.0-SNAPSHOT.jar" \
  io.github.scorpio_datalake.rust_data_processing.examples.RunPytestMirrorExample rdp_parity_bindings_mirror
```

On Windows, use `;` instead of `:` as the `-cp` separator, and use absolute paths.
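
When the classpath is assembled programmatically (for example, to launch a child JVM), `File.pathSeparator` picks the right separator for the current platform. A minimal sketch; `ClasspathBuilder` is a hypothetical helper.

```java
import java.io.File;
import java.util.List;

/** Hypothetical helper that joins classpath entries with the platform separator. */
final class ClasspathBuilder {
  private ClasspathBuilder() {}

  /** Uses ";" on Windows and ":" on Unix-like systems. */
  static String join(List<String> entries) {
    return join(File.pathSeparator, entries);
  }

  /** Explicit-separator variant, useful when generating a command line for another OS. */
  static String join(String separator, List<String> entries) {
    return String.join(separator, entries);
  }
}
```

For the two JARs above, `ClasspathBuilder.join(List.of(examplesJar, mainJar))` produces the value to pass to `-cp`, where `examplesJar` and `mainJar` are the paths shown in the shell commands.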

---

## 7. Large results: prefer Rust-side ETL and files

Many `rdp_parity_*` exports return **`interchange.dataset`** as **JSON** (`schema` + `rows`). That is the default interchange for **tests, contracts, and small tables**.

For **production** and **large** `DataSet` / Polars outputs, **do not** rely on shipping the full table through the JVM as JSON. Instead:

1. Run ingest, transforms, SQL, and validation **in Rust** (or Python calling Rust).
2. **Write** results to **Parquet**, **CSV**, or a **database** (or object storage).
3. Use the JVM only for **orchestration**, **small JSON responses**, or reading **paths** to files written by Rust — then let **Spark (`local[*]`)** or other readers consume those files.

The same idea applies to **every** parity export that materializes a full **`dataset`** in JSON. See **[`EXAMPLES.md` § Rust-first ETL vs JVM consumption](EXAMPLES.md#rust-first-etl-vs-jvm-consumption)** for the full list and rationale. Arrow-based interchange remains a future milestone (**[`ARROW_FFI_JVM.md`](ARROW_FFI_JVM.md)**).
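
When the JVM only receives a path to a Rust-written Parquet file, a cheap sanity check before handing it to a reader is to verify the 4-byte `PAR1` magic that plain Parquet files carry at both the start and the end (files with an encrypted footer use a different magic and would fail this check). The sketch below is an illustration, not part of the library; `ParquetHandoffCheck` is a hypothetical class.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical pre-flight check for a Parquet file handed over by path. */
final class ParquetHandoffCheck {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  private ParquetHandoffCheck() {}

  /** True when both the first and the last four bytes equal the Parquet magic. */
  static boolean hasParquetMagic(byte[] firstFour, byte[] lastFour) {
    return Arrays.equals(MAGIC, firstFour) && Arrays.equals(MAGIC, lastFour);
  }

  /** Reads only eight bytes, so it stays cheap even for very large files. */
  static boolean looksLikeParquet(Path file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      long length = raf.length();
      if (length < 12) { // leading magic + 4-byte footer length + trailing magic
        return false;
      }
      byte[] head = new byte[4];
      byte[] tail = new byte[4];
      raf.readFully(head);
      raf.seek(length - 4);
      raf.readFully(tail);
      return hasParquetMagic(head, tail);
    }
  }
}
```

This does not validate the file's contents; it only catches obvious mistakes such as a truncated write or a path pointing at the wrong file.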

---

## 8. What the manifest does *not* tell you

- **JSON schema** per export — infer from **`python-wrapper/tests`** and **`PytestMirrorAssertions`**, or inspect **`bindings/jvm-sys`** / Rust parity sources.
- **Future non-parity APIs** — when new `extern "C"` entry points ship, they must be added to **`ffi_manifest.json`**, and the copy bundled in the JAR must be regenerated; **`FfiExportedSymbolsContractTest`** catches drift for symbols listed in the manifest.

For the high-level Phase 3 policy (semver, Panama), see **ADR [005](../adr/005-jvm-panama-production-policy.md)** and **[FFI_API_SLICE.md](FFI_API_SLICE.md)**.

---

## 9. Production path ingest and pipeline JSON (non-parity)

These symbols are listed in **`exported_symbols`** and covered by **`FfiExportedSymbolsContractTest`** + **`DocsExampleNativeIntegrationTest`** / **`JvmNativeContractScenarios`**. They return the same `{ ok, interchange, notes }` envelope as parity exports.

| Symbol | Role |
| --- | --- |
| `rdp_ingest_csv_path` | Path + schema JSON + options JSON → `ingest_path_csv` |
| `rdp_ingest_json_path` | JSON / NDJSON path ingest |
| `rdp_ingest_parquet_path` | Parquet path ingest |
| `rdp_ingest_xml_path` | XML path ingest (`format: xml` in options or extension) |
| `rdp_excel_ingest_path_sheet` | Excel sheet ingest (schema inferred in Rust; no schema on the wire) |
| `rdp_ingest_ordered_paths_json` | Multi-path payload: `paths`, `schema` / `schema_ref`, `options`, `response.mode` (`dataset` \| `parquet_temp` \| `arrow_ipc_temp`) |
| `rdp_run_pipeline_json` | Declarative pipeline: sources → optional `transform.sql` on `df` → sinks (`parquet_file`, `xml_file`, …) |
| `rdp_export_parquet_temp` | Small Rust-built sample Parquet in OS temp dir (handoff) |
| `rdp_export_arrow_ipc_temp` | Temp Arrow IPC file handoff |
| `rdp_export_polars_parquet_temp` | Temp Parquet via Polars writer |

**Fixture JSON** lives under `tests/fixtures/<bundle>/` (`schemas/`, `pipelines/`, `payloads/`). Java: **`io.github.scorpio_datalake.rust_data_processing.fixture.PipelineJsonFixtures`**; resolve templates before calling native code. Tour: **[EXAMPLES.md](EXAMPLES.md)**; runnable sources: **`docs/java/examples/*.java`**.
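
The three `response.mode` values listed for `rdp_ingest_ordered_paths_json` can be kept type-safe on the Java side. The enum below is a hypothetical sketch, not part of the JAR: the wire names are copied from the table above, while `isFileHandoff` is an assumption based only on the mode names (`parquet_temp` and `arrow_ipc_temp` suggest a temp-file handoff rather than inline rows).

```java
import java.util.Arrays;
import java.util.Optional;

/** Hypothetical enum mirroring the documented response.mode values. */
enum ResponseMode {
  DATASET("dataset"),
  PARQUET_TEMP("parquet_temp"),
  ARROW_IPC_TEMP("arrow_ipc_temp");

  private final String wireName;

  ResponseMode(String wireName) {
    this.wireName = wireName;
  }

  /** The string to place in the JSON payload. */
  String wireName() {
    return wireName;
  }

  /** Assumption: every mode other than "dataset" hands back a file instead of inline rows. */
  boolean isFileHandoff() {
    return this != DATASET;
  }

  /** Parses a wire value; empty when the value is not one of the documented modes. */
  static Optional<ResponseMode> fromWire(String value) {
    return Arrays.stream(values()).filter(m -> m.wireName.equals(value)).findFirst();
  }
}
```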

**Build / test locally:**

```powershell
pwsh -File scripts/build_all.ps1
# or: python scripts/python_scripts/build_all.py
```

Requires `rdp_jvm_sys` built with **`--features full`** (Excel + linked core), `RDP_JVM_SYS` pointing at the release `cdylib`, and `--enable-native-access=ALL-UNNAMED`.