difi-rs 2.0.0-rc7

Did I Find It? — Evaluate linkage completeness and purity for astronomical surveys
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
# difi

Did I Find It?

[![Rust CI](https://github.com/moeyensj/difi/actions/workflows/rust.yml/badge.svg)](https://github.com/moeyensj/difi/actions/workflows/rust.yml)
[![Coverage Status](https://coveralls.io/repos/github/moeyensj/difi/badge.svg?branch=main)](https://coveralls.io/github/moeyensj/difi?branch=main)
[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![DOI](https://zenodo.org/badge/152989392.svg)](https://zenodo.org/badge/latestdoi/152989392)
[![Built with Claude Code](https://img.shields.io/badge/Built%20with-Claude%20Code-D97757?logo=anthropic&logoColor=white&style=flat-square)](https://claude.ai)  
<a href="https://b612.ai/"><img src="https://img.shields.io/badge/Asteroid%20Institute-b612foundation.org-1a1a2e?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyNCIgaGVpZ2h0PSIyNCIgdmlld0JveD0iMCAwIDI0IDI0IiBmaWxsPSJub25lIiBzdHJva2U9IndoaXRlIiBzdHJva2Utd2lkdGg9IjIiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIgc3Ryb2tlLWxpbmVqb2luPSJyb3VuZCI+PGNpcmNsZSBjeD0iMTIiIGN5PSIxMiIgcj0iMTAiLz48bGluZSB4MT0iMiIgeTE9IjEyIiB4Mj0iMjIiIHkyPSIxMiIvPjxwYXRoIGQ9Ik0xMiAyYTE1LjMgMTUuMyAwIDAgMSA0IDEwIDE1LjMgMTUuMyAwIDAgMS00IDEwIDE1LjMgMTUuMyAwIDAgMS00LTEwIDE1LjMgMTUuMyAwIDAgMSA0LTEweiIvPjwvc3ZnPg==&logoColor=white&style=flat-square" alt="Asteroid Institute"></a>
<a href="https://dirac.astro.washington.edu/"><img src="https://img.shields.io/badge/DIRAC%20Institute-dirac.astro.washington.edu-1a1a2e?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyNCIgaGVpZ2h0PSIyNCIgdmlld0JveD0iMCAwIDI0IDI0IiBmaWxsPSJub25lIiBzdHJva2U9IndoaXRlIiBzdHJva2Utd2lkdGg9IjIiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIgc3Ryb2tlLWxpbmVqb2luPSJyb3VuZCI+PGNpcmNsZSBjeD0iMTIiIGN5PSIxMiIgcj0iMTAiLz48bGluZSB4MT0iMiIgeTE9IjEyIiB4Mj0iMjIiIHkyPSIxMiIvPjxwYXRoIGQ9Ik0xMiAyYTE1LjMgMTUuMyAwIDAgMSA0IDEwIDE1LjMgMTUuMyAwIDAgMS00IDEwIDE1LjMgMTUuMyAwIDAgMS00LTEwIDE1LjMgMTUuMyAwIDAgMSA0LTEweiIvPjwvc3ZnPg==&logoColor=white&style=flat-square" alt="DIRAC Institute"></a>

---

difi evaluates the completeness and purity of astronomical object linkage results
from software such as [THOR](https://github.com/moeyensj/thor),
[HelioLinC](https://github.com/lsst-dm/heliolinc2), or
[MOPS](https://github.com/lsst/mops_daymops). Given observations with known
object associations and a set of predicted linkages, difi determines which
objects were successfully discovered and how clean each linkage is.

This is difi v2 — a ground-up rewrite in Rust with Python bindings. It uses
[Rayon](https://github.com/rayon-rs/rayon) for parallelism and
[Apache Arrow](https://arrow.apache.org/) for zero-copy data interchange.

## Installation

### Python (recommended)

Requires Python 3.10+ and a Rust toolchain (1.85+).

```bash
pip install maturin
git clone https://github.com/moeyensj/difi.git && cd difi
maturin develop --release
```

Verify:

```python
>>> import difi
>>> difi.__version__  # matches the published rc
```

### Rust only

```bash
cargo build --release
```

### CLI

The `difi` binary ships with the crate behind a `cli` Cargo feature (so
library consumers don't pull clap/toml/anyhow transitively):

```bash
cargo install difi-rs --features cli
# or from a checkout
cargo build --release --features cli   # binary: ./target/release/difi
```

## Input format

difi requires two input tables:

### Observations

Each row is a single detection with an optional ground-truth object label.

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `id` | string | no | Unique observation identifier |
| `time` | struct{days: i64, nanos: i64} | no | Epoch as MJD days + nanoseconds |
| `ra` | f64 | no | Right ascension (degrees, 0-360) |
| `dec` | f64 | no | Declination (degrees, -90 to 90) |
| `ra_sigma` | f64 | yes | RA uncertainty (degrees) |
| `dec_sigma` | f64 | yes | Dec uncertainty (degrees) |
| `observatory_code` | string | no | Observatory/telescope identifier |
| `object_id` | string | yes | Ground-truth object label (null if unknown) |
| `night` | i64 | no | Local observing night identifier |

### Linkage members

Each row maps a predicted linkage to one of its constituent observations.

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `linkage_id` | string | no | Linkage identifier |
| `obs_id` | string | no | Observation identifier (foreign key to observations.id) |

Both tables are read from Parquet files or passed as quivr/PyArrow objects.

## Usage

### Python

Each function accepts either a file path (str/Path) or a quivr Table object.

```python
from difi import analyze_observations, analyze_linkages

# Step 1: CIFI — determine which objects are findable
# Pass file paths...
cifi_result = analyze_observations("observations.parquet")

# ...or quivr Tables
from difi import Observations
observations = Observations.from_parquet("observations.parquet")
cifi_result = analyze_observations(observations)

# Step 2: DIFI — classify linkages and compute completeness
difi_result = analyze_linkages(
    "observations.parquet",
    "linkage_members.parquet",
    min_obs=6,
    contamination_percentage=20.0,
)
print(f"Completeness: {difi_result['completeness']:.1f}%")
print(f"Pure: {difi_result['num_pure']}, Mixed: {difi_result['num_mixed']}")
# num_ignored_linkages counts linkages excluded because they had no
# observations inside the analysis partition (non-zero signals a mismatch
# between your --linkages file and the observation set).
print(f"Ignored: {difi_result['num_ignored_linkages']}")
```

#### Findability metrics

The `metric` parameter controls what observation pattern makes an object "findable":

```python
from difi import analyze_observations

# Singleton metric (default): object needs >= min_obs detections
# across >= min_nights distinct nights
result = analyze_observations(
    "observations.parquet",
    metric="singletons",
    min_obs=6,
    min_nights=3,
)

# Tracklet metric: object needs intra-night tracklets (multiple
# detections within max_obs_separation hours showing angular motion)
# on >= min_nights distinct nights
result = analyze_observations(
    "observations.parquet",
    metric="tracklets",
    min_nights=3,
)
```

In Rust, metrics are structs with full configuration:

```rust
use difi::metrics::singleton::SingletonMetric;
use difi::metrics::tracklet::TrackletMetric;

// Singleton: 6 obs across 3 nights, at least 2 obs/night when exactly 3 nights
let singleton = SingletonMetric {
    min_obs: 6,
    min_nights: 3,
    min_nightly_obs_in_min_nights: 2,
};

// Tracklet: 2+ obs per tracklet within 1.5 hours, 1" angular separation,
// tracklets on 3+ nights
let tracklet = TrackletMetric {
    tracklet_min_obs: 2,
    max_obs_separation: 1.5 / 24.0,  // days
    min_linkage_nights: 3,
    min_obs_angular_separation: 1.0,  // arcseconds
};
```

### CLI

A thin wrapper over the library, for shell pipelines and reproducible runs.
Subcommand names mirror the Python verbs; short aliases keep shell use ergonomic.

```bash
# CIFI: findability from observations  (alias: `difi cifi`)
difi analyze-observations \
    -i observations.parquet \
    -o out/ \
    --metric singletons --min-obs 6 --min-nights 3

# CIFI + DIFI: classify linkages end-to-end  (alias: `difi analyze`)
difi analyze-linkages \
    -i observations.parquet \
    -l linkage_members.parquet \
    -o out/ \
    --contamination-percentage 20
```

Outputs in `<output-dir>/`:

| File | Written by | Contents |
|---|---|---|
| `all_objects.parquet` | both | One row per (object, partition) — CIFI findability flag plus DIFI linkage stats merged in |
| `findable_observations.parquet` | both | One row per findable (object, partition) with discovery night |
| `partition_summaries.parquet` | both | One row per partition with observation / findable / found / completeness counts |
| `all_linkages.parquet` | `analyze-linkages` | One row per classified (linkage, partition) with pure/contaminated/mixed flags |
| `ignored_linkages.parquet` | `analyze-linkages` (only when non-empty) | Linkages excluded from classification, with reason + partition |
| `run_manifest.json` | both | argv, input SHA-256 prefixes, host, per-scenario timings, `warnings` counts, optional `reused_cifi` provenance |

#### Partitioned CIFI

```bash
# Sliding 30-night windows
difi cifi -i observations.parquet -o out/ \
    --partition-mode sliding --partition-window 30

# Tracklets with non-overlapping 15-night blocks
difi cifi -i observations.parquet -o out/ \
    --metric tracklets --min-linkage-nights 3 \
    --partition-mode blocks --partition-window 15
```

Partition flags also apply to `analyze-linkages` — the CLI loops DIFI over
each partition's summary and writes a combined `all_linkages.parquet` keyed by
`partition_id`. Linkages whose observations fall entirely outside a given
partition are excluded from `all_linkages.parquet` and reported in a separate
`ignored_linkages.parquet`; the manifest's `warnings` section surfaces counts
so a run with an unexpectedly high `orphan_linkages` value (linkages that
never intersect any partition) flags a likely mismatched `--linkages` file.

#### Reusing a CIFI snapshot

CIFI is the expensive phase on survey-scale inputs. Run it once, reuse across
multiple linkage sets:

```bash
# Produce a reusable CIFI snapshot
difi cifi -i observations.parquet -o cifi_snapshot/

# Classify two independent linkage sets against the same CIFI work
difi analyze-linkages -i observations.parquet -l thor_linkages.parquet \
    --cifi-output-dir cifi_snapshot/ -o difi_thor/
difi analyze-linkages -i observations.parquet -l precovery_linkages.parquet \
    --cifi-output-dir cifi_snapshot/ -o difi_precovery/
```

`--cifi-output-dir` is mutually exclusive with partition flags (the snapshot
encodes its own partitions). A SHA-256 prefix of the observations file is
stored in each manifest; mismatches between the snapshot and the current
observations fail fast with a clear error.

#### Batch scenarios

Declare scenarios in TOML for findability sweeps (LSST baselines, etc.):

```toml
# lsst_findability.toml
[defaults]
observations = "/path/to/observations.parquet"

[[scenario]]
name = "singleton_6obs_3nights"
metric = "singletons"
min_obs = 6
min_nights = 3

[[scenario]]
name = "tracklet_3pairs_15nights"
metric = "tracklets"
min_linkage_nights = 3
partition_mode = "sliding"
partition_window = 15
```

```bash
difi cifi --scenarios lsst_findability.toml -o results/
# results/<scenario>/all_objects.parquet, partition_summaries.parquet, ...
# results/run_manifest.json summarizes every scenario
```

Per-scenario `observations = "..."` overrides `[defaults]`.

#### Machine-readable progress

`--progress-json` emits one NDJSON event per line on stdout; human text still
goes to stderr.

```bash
difi --progress-json cifi -i observations.parquet -o out/ \
    | jq -c 'select(.event == "scenario_done")'
```

Errors always produce a human line on stderr; under `--progress-json` an
`{"event":"error", ...}` line is additionally written to stdout so machine
consumers see them too.

### Rust

```rust
use difi::cifi::analyze_observations;
use difi::difi::analyze_linkages;
use difi::io::{read_observations, read_linkage_members};
use difi::metrics::singleton::SingletonMetric;

// Load from Parquet
let (obs, mut interner, _) = read_observations(Path::new("observations.parquet"))?;
let lm = read_linkage_members(Path::new("linkage_members.parquet"), &mut interner)?;

// Step 1: CIFI — determine findability
let metric = SingletonMetric::default();
let (mut all_objects, findable, mut summaries) =
    analyze_observations(&obs, None, &metric)?;

// Step 2: DIFI — classify linkages. Returns (AllLinkages, IgnoredLinkages).
// Linkages whose observations all fall outside summaries[0]'s night range
// are redirected to `ignored` with reason NoObservationsInPartition, instead
// of producing phantom pure/contaminated/mixed rows.
let (all_linkages, ignored) = analyze_linkages(
    &obs, &lm, &mut all_objects, &mut summaries[0], 6, 20.0,
)?;
```

For multi-partition DIFI, loop over `summaries` and concatenate each partition's
`AllLinkages` / `IgnoredLinkages`. The `update_all_objects` call inside
`analyze_linkages` scopes its writes to the current partition's rows in
`AllObjects`, so multi-partition loops are safe.

### Cross-crate usage (e.g. from THOR)

difi defines `ObservationTable` and `LinkageMemberTable` traits. Implement
them for your own types to call difi directly without data conversion:

```rust
impl difi::types::ObservationTable for MyObservations {
    fn len(&self) -> usize { self.ids.len() }
    fn ids(&self) -> &[u64] { &self.ids }
    fn nights(&self) -> &[i64] { &self.nights }
    fn object_ids(&self) -> &[u64] { &self.object_ids }
    // ...
}

let (objects, findable, summaries) =
    difi::cifi::analyze_observations(&my_obs, None, &metric)?;
```

## Pipeline

difi operates in two phases:

1. **CIFI (Can I Find It?)** — Determines which objects are "findable" based on
   observation patterns. Supports two metrics:
   - **SingletonMetric**: >= `min_obs` observations across >= `min_nights` nights
   - **TrackletMetric**: intra-night tracklets with temporal and angular separation constraints

2. **DIFI (Did I Find It?)** — Classifies each linkage as exactly one of:
   - **Pure** — all observations belong to one object
   - **Pure complete** — pure + contains all partition observations of that object
   - **Contaminated** — mostly one object, contamination <= threshold
   - **Mixed** — too contaminated to attribute to a single object

   **Completeness** = (found objects / findable objects) × 100%, where
   *found* counts objects for which at least one pure linkage contains
   ≥ `min_obs` observations **inside the partition**. Single-partition runs see
   no difference from the whole-linkage interpretation; multi-partition runs
   stay bounded by in-partition evidence instead of inflating when a
   cross-boundary linkage has enough total obs but few inside any one window.

### Per-partition vs. cross-partition (survey-wide) counts

`partition_summaries.parquet` carries per-partition rows with the same
*completeness* formula scoped to each window. For a sliding-window run, the
same real object typically appears in many windows, so naively summing
`findable` / `found` across partitions double-counts.

`run_manifest.json → scenarios[0]` carries both views:

| Field | Meaning |
|---|---|
| `findable_count` | **Sum** across partitions. Useful for workload sizing |
| `found_count` | Sum across partitions (DIFI runs only) |
| `unique_findable_count` | **Distinct objects** findable in ≥ 1 partition |
| `unique_found_count` | Distinct objects with ≥ 1 found-pure linkage across all partitions (DIFI runs only) |
| `unique_completeness` | `unique_found_count / unique_findable_count × 100` — the number to quote for "how many asteroids did the linker recover?" (DIFI runs only) |

In single-partition runs the `unique_*` values equal the sum counterparts by
construction. In multi-partition sliding-window runs they diverge: for a 76-partition
survey run, `findable_count` can be ~200k (object, partition) pairs while
`unique_findable_count` is ~15k distinct objects.

## Performance

Benchmarked on the neomod_quads survey dataset (166M observations, 15,935 objects):

| Scale | Python (v2rc3) | Rust (v2rc4) | Speedup |
|-------|-----------|---------|---------|
| 55M obs (30 nights) | 23.0s | 0.42s | 55x |
| 111M obs (60 nights) | 67.3s | 0.85s | 80x |
| 166M obs (90 nights) | 132.9s | 1.24s | 107x |

Memory at 100M observations: ~3.2 GB (DIFI), ~5.6 GB (CIFI with tracklets).

## Development

```bash
# Build the library
cargo build

# Build the library + CLI binary
cargo build --features cli

# Verify lib-only build pulls no CLI deps
cargo build --no-default-features

# Full test suite (library + CLI integration tests)
cargo test --features cli

# Lint
cargo clippy --all-targets --features cli -- -D warnings

# Verify Cargo.toml and pyproject.toml versions agree (CI runs this on every
# PR; the publish workflow also checks tag vs both files before any upload)
./scripts/check_versions.sh

# Benchmarks
cargo bench

# Build Python package
maturin develop --release
```

## Acknowledgments

This work was supported by the [Asteroid Institute](https://asteroidinstitute.org/)
(a program of the [B612 Foundation](https://b612foundation.org/)) and the
[DIRAC Institute](https://dirac.astro.washington.edu/) at the University of Washington.

## License

BSD 3-Clause. See [LICENSE.md](LICENSE.md) for details.