difi
Did I Find It?
difi evaluates the completeness and purity of astronomical object linkage results from software such as THOR, HelioLinC, or MOPS. Given observations with known object associations and a set of predicted linkages, difi determines which objects were successfully discovered and how clean each linkage is.
This is difi v2 — a ground-up rewrite in Rust with Python bindings. It uses Rayon for parallelism and Apache Arrow for zero-copy data interchange.
Installation
Python (recommended)
Requires Python 3.10+ and a Rust toolchain (1.85+).
&&
Verify:
>>>
>>> # matches the published rc
Rust only
CLI
The difi binary ships with the crate behind a cli Cargo feature (so
library consumers don't pull clap/toml/anyhow transitively):
# or from a checkout
Input format
difi requires two input tables:
Observations
Each row is a single detection with an optional ground-truth object label.
| Column | Type | Nullable | Description |
|---|---|---|---|
| id | string | no | Unique observation identifier |
| time | struct{days: i64, nanos: i64} | no | Epoch as MJD days + nanoseconds |
| ra | f64 | no | Right ascension (degrees, 0-360) |
| dec | f64 | no | Declination (degrees, -90 to 90) |
| ra_sigma | f64 | yes | RA uncertainty (degrees) |
| dec_sigma | f64 | yes | Dec uncertainty (degrees) |
| observatory_code | string | no | Observatory/telescope identifier |
| object_id | string | yes | Ground-truth object label (null if unknown) |
| night | i64 | no | Local observing night identifier |
Linkage members
Each row maps a predicted linkage to one of its constituent observations.
| Column | Type | Nullable | Description |
|---|---|---|---|
| linkage_id | string | no | Linkage identifier |
| obs_id | string | no | Observation identifier (foreign key to observations.id) |
Both tables are read from Parquet files or passed as quivr/PyArrow objects.
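As a minimal sketch of the two conventions above, using plain Python dicts in place of real Arrow tables: the time struct is converted to a floating-point MJD, and every obs_id in the linkage members table is checked against observations.id. The helper names (`mjd_from_struct`, `dangling_obs_ids`) are hypothetical and not part of difi's API, and the sketch assumes nanos is the offset within the day.

```python
NANOS_PER_DAY = 86_400 * 1_000_000_000


def mjd_from_struct(time):
    """Convert {'days': int, 'nanos': int} to a float MJD.

    Assumes 'nanos' counts nanoseconds since the start of the MJD day.
    """
    return time["days"] + time["nanos"] / NANOS_PER_DAY


def dangling_obs_ids(observations, linkage_members):
    """Return obs_ids referenced by linkage members but absent from observations."""
    known = {row["id"] for row in observations}
    referenced = {member["obs_id"] for member in linkage_members}
    return sorted(referenced - known)


# Toy rows carrying only the columns this sketch needs.
observations = [
    {"id": "obs1", "time": {"days": 60000, "nanos": 0}, "object_id": "A", "night": 1},
    {"id": "obs2", "time": {"days": 60000, "nanos": NANOS_PER_DAY // 2}, "object_id": None, "night": 1},
]
linkage_members = [
    {"linkage_id": "L1", "obs_id": "obs1"},
    {"linkage_id": "L1", "obs_id": "obs9"},  # no matching observation
]
```

Running a check like `dangling_obs_ids` before an analysis is one way to catch a linkage file that was produced from a different observation set.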
Usage
Python
Each function accepts either a file path (str/Path) or a quivr Table object.
# Step 1: CIFI — determine which objects are findable
# Pass file paths...
=
# ...or quivr Tables
=
=
# Step 2: DIFI — classify linkages and compute completeness
=
# num_ignored_linkages counts linkages excluded because they had no
# observations inside the analysis partition (non-zero signals a mismatch
# between your --linkages file and the observation set).
Findability metrics
The metric parameter controls what observation pattern makes an object "findable":
# Singleton metric (default): object needs >= min_obs detections
# across >= min_nights distinct nights
=
# Tracklet metric: object needs intra-night tracklets (multiple
# detections within max_obs_separation hours showing angular motion)
# on >= min_nights distinct nights
=
In Rust, metrics are structs with full configuration:
use SingletonMetric;
use TrackletMetric;
// Singleton: 6 obs across 3 nights, at least 2 obs/night when exactly 3 nights
let singleton = SingletonMetric ;
// Tracklet: 2+ obs per tracklet within 1.5 hours, 1" angular separation,
// tracklets on 3+ nights
let tracklet = TrackletMetric ;
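The singleton rule described above can be sketched in a few lines of plain Python. This is an illustration of the rule only, not difi's implementation: the function name is hypothetical, the default values are borrowed from the "6 obs across 3 nights" example, and the per-night minimum mentioned in the Rust comment is deliberately left out.

```python
from collections import defaultdict


def singleton_findable(observations, min_obs=6, min_nights=3):
    """Return the object ids with >= min_obs detections on >= min_nights nights.

    `observations` is an iterable of dicts with 'object_id' and 'night' keys.
    Rows with an unknown (None) object_id are skipped.
    """
    nights_by_object = defaultdict(list)
    for row in observations:
        if row["object_id"] is not None:
            nights_by_object[row["object_id"]].append(row["night"])
    return {
        object_id
        for object_id, nights in nights_by_object.items()
        if len(nights) >= min_obs and len(set(nights)) >= min_nights
    }
```

An object with six detections spread over only two nights fails the check, as does one with five detections on five different nights; both conditions must hold.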
CLI
A thin wrapper over the library, for shell pipelines and reproducible runs. Subcommand names mirror the Python verbs; short aliases keep shell use ergonomic.
# CIFI: findability from observations (alias: `difi cifi`)
# CIFI + DIFI: classify linkages end-to-end (alias: `difi analyze`)
Outputs in <output-dir>/:
| File | Written by | Contents |
|---|---|---|
| all_objects.parquet | both | One row per (object, partition): CIFI findability flag plus DIFI linkage stats merged in |
| findable_observations.parquet | both | One row per findable (object, partition) with discovery night |
| partition_summaries.parquet | both | One row per partition with observation / findable / found / completeness counts |
| all_linkages.parquet | analyze-linkages | One row per classified (linkage, partition) with pure/contaminated/mixed flags |
| ignored_linkages.parquet | analyze-linkages (only when non-empty) | Linkages excluded from classification, with reason + partition |
| run_manifest.json | both | argv, input SHA-256 prefixes, host, per-scenario timings, warnings counts, optional reused_cifi provenance |
Partitioned CIFI
# Sliding 30-night windows
# Tracklets with non-overlapping 15-night blocks
Partition flags also apply to analyze-linkages: the CLI loops DIFI over
each partition's summary and writes a combined all_linkages.parquet keyed by
partition_id. Linkages whose observations fall entirely outside a given
partition are excluded from all_linkages.parquet and reported in a separate
ignored_linkages.parquet. The manifest's warnings section surfaces the counts,
so an unexpectedly high orphan_linkages value (linkages that never intersect
any partition) flags a likely mismatched --linkages file.
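To make the difference between an ignored linkage and an orphan linkage concrete, here is a small sketch in plain Python. It assumes each partition is an inclusive range of night ids and that a linkage is summarized by the set of nights its observations fall on; the function name and data shapes are hypothetical, not difi's.

```python
def split_linkages(linkage_nights, partitions):
    """Sort (linkage, partition) pairs into classified vs. ignored.

    linkage_nights: {linkage_id: set of night ids of its observations}
    partitions: {partition_id: (start_night, end_night)}, inclusive (assumed)
    """
    classified, ignored = [], []
    for partition_id, (start, end) in partitions.items():
        for linkage_id, nights in linkage_nights.items():
            inside = any(start <= night <= end for night in nights)
            (classified if inside else ignored).append((linkage_id, partition_id))
    # Orphans are linkages that were never classified in any partition.
    seen = {linkage_id for linkage_id, _ in classified}
    orphans = sorted(set(linkage_nights) - seen)
    return classified, ignored, orphans
```

A linkage can be ignored in one partition and classified in another, which is expected. Only a linkage that is ignored everywhere counts as an orphan.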
Reusing a CIFI snapshot
CIFI is the expensive phase on survey-scale inputs. Run it once, reuse across multiple linkage sets:
# Produce a reusable CIFI snapshot
# Classify two independent linkage sets against the same CIFI work
--cifi-output-dir is mutually exclusive with partition flags (the snapshot
encodes its own partitions). A SHA-256 prefix of the observations file is
stored in each manifest; mismatches between the snapshot and the current
observations fail fast with a clear error.
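The fail-fast check can be pictured with Python's standard hashlib. This is only a sketch of the idea: the prefix length of 16 hex characters, the function names, and the error message are assumptions, not what difi actually stores or prints.

```python
import hashlib


def sha256_prefix(path, length=16, chunk_size=1 << 20):
    """Hash a file in chunks and return the first `length` hex characters."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def check_snapshot(observations_path, recorded_prefix):
    """Raise if the observations file does not match the recorded prefix."""
    actual = sha256_prefix(observations_path, length=len(recorded_prefix))
    if actual != recorded_prefix:
        raise ValueError(
            f"observations hash {actual} does not match snapshot {recorded_prefix}"
        )
```

Reading in chunks keeps memory use flat even for multi-gigabyte Parquet files.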
Batch scenarios
Declare scenarios in TOML for findability sweeps (LSST baselines, etc.):
# lsst_findability.toml
[defaults]
observations = "/path/to/observations.parquet"
[[]]
= "singleton_6obs_3nights"
= "singletons"
= 6
= 3
[[]]
= "tracklet_3pairs_15nights"
= "tracklets"
= 3
= "sliding"
= 15
# results/<scenario>/all_objects.parquet, partition_summaries.parquet, ...
# results/run_manifest.json summarizes every scenario
Per-scenario observations = "..." overrides [defaults].
Machine-readable progress
--progress-json emits one NDJSON event per line on stdout; human text still
goes to stderr.
|
Errors always produce a human line on stderr; under --progress-json an
{"event":"error", ...} line is additionally written to stdout so machine
consumers see them too.
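A machine consumer of that stream might look like the sketch below, which parses one JSON object per line and separates error events from everything else. Only the `"event": "error"` shape comes from the text above; the other event name and the `message` and `name` fields in the sample are hypothetical.

```python
import json


def split_events(lines):
    """Parse NDJSON lines and return (non_error_events, error_events)."""
    events, errors = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        (errors if event.get("event") == "error" else events).append(event)
    return events, errors


# Hypothetical sample stream; real event names and fields may differ.
sample = [
    '{"event": "scenario_start", "name": "demo"}',
    "",
    '{"event": "error", "message": "boom"}',
]
```

In practice `lines` would be the stdout of the difi process, for example the `stdout` attribute of a `subprocess.Popen` opened in text mode.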
Rust
use analyze_observations;
use analyze_linkages;
use ;
use SingletonMetric;
// Load from Parquet
let = read_observations?;
let lm = read_linkage_members?;
// Step 1: CIFI — determine findability
let metric = default;
let =
analyze_observations?;
// Step 2: DIFI — classify linkages. Returns (AllLinkages, IgnoredLinkages).
// Linkages whose observations all fall outside summaries[0]'s night range
// are redirected to `ignored` with reason NoObservationsInPartition, instead
// of producing phantom pure/contaminated/mixed rows.
let = analyze_linkages?;
For multi-partition DIFI, loop over summaries and concatenate each partition's
AllLinkages / IgnoredLinkages. The update_all_objects call inside
analyze_linkages scopes its writes to the current partition's rows in
AllObjects, so multi-partition loops are safe.
Cross-crate usage (e.g. from THOR)
difi defines ObservationTable and LinkageMemberTable traits. Implement
them for your own types to call difi directly without data conversion:
let =
analyze_observations?;
Pipeline
difi operates in two phases:
- CIFI (Can I Find It?): determines which objects are "findable" based on observation patterns. Supports two metrics:
  - SingletonMetric: >= min_obs observations across >= min_nights nights
  - TrackletMetric: intra-night tracklets with temporal and angular separation constraints
- DIFI (Did I Find It?): classifies each linkage as exactly one of:
  - Pure: all observations belong to one object
  - Pure complete: pure + contains all partition observations of that object
  - Contaminated: mostly one object, contamination <= threshold
  - Mixed: too contaminated to attribute to a single object
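The four linkage classes can be sketched as a small decision function. This is an illustration under stated assumptions, not difi's classifier: contamination is taken to be the fraction of member observations that do not belong to the most common object, the 0.2 default threshold is made up, and every member observation is assumed to carry a ground-truth label.

```python
from collections import Counter


def classify_linkage(member_object_ids, partition_obs_counts, contamination_threshold=0.2):
    """Return (class_name, attributed_object_id or None) for one linkage.

    member_object_ids: ground-truth label of each observation in the linkage
    partition_obs_counts: {object_id: number of its observations in the partition}
    """
    counts = Counter(member_object_ids)
    dominant, n_dominant = counts.most_common(1)[0]
    total = len(member_object_ids)
    if n_dominant == total:
        # Every observation belongs to the same object.
        if n_dominant == partition_obs_counts.get(dominant):
            return "pure_complete", dominant
        return "pure", dominant
    contamination = 1.0 - n_dominant / total
    if contamination <= contamination_threshold:
        return "contaminated", dominant
    return "mixed", None
```

For example, a ten-observation linkage with nine observations of one object has a contamination of about 0.1 and is attributed to that object as contaminated, while a three-observation linkage with three different labels is mixed.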
Completeness = (found objects / findable objects) × 100%, where found counts objects for which at least one pure linkage contains ≥ min_obs observations inside the partition. Single-partition runs see no difference from the whole-linkage interpretation; multi-partition runs stay bounded by in-partition evidence instead of inflating when a cross-boundary linkage has enough total obs but few inside any one window.
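A worked example of the formula, as a sketch: the function below counts an object as found only when some pure linkage has at least min_obs observations inside the partition. Restricting found objects to the findable set and returning 0.0 when nothing is findable are choices made for this sketch, not confirmed difi behavior.

```python
def completeness(findable, pure_linkages, min_obs):
    """Completeness in percent for a single partition.

    findable: set of findable object ids
    pure_linkages: list of (object_id, n_obs_inside_partition) for pure linkages
    """
    found = {
        object_id
        for object_id, n_inside in pure_linkages
        if n_inside >= min_obs and object_id in findable
    }
    if not findable:
        return 0.0  # sketch-only convention for an empty partition
    return 100.0 * len(found) / len(findable)


# Object B has a pure linkage, but only 4 of its observations fall inside
# this partition, so it does not count as found here.
findable = {"A", "B", "C", "D"}
pure_linkages = [("A", 6), ("A", 7), ("B", 4)]
```

With min_obs = 6, only object A is found out of four findable objects, so completeness is 25%.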
Per-partition vs. cross-partition (survey-wide) counts
partition_summaries.parquet carries per-partition rows with the same
completeness formula scoped to each window. For a sliding-window run, the
same real object typically appears in many windows, so naively summing
findable / found across partitions double-counts.
run_manifest.json → scenarios[0] carries both views:
| Field | Meaning |
|---|---|
| findable_count | Sum across partitions. Useful for workload sizing |
| found_count | Sum across partitions (DIFI runs only) |
| unique_findable_count | Distinct objects findable in ≥ 1 partition |
| unique_found_count | Distinct objects with ≥ 1 found-pure linkage across all partitions (DIFI runs only) |
| unique_completeness | unique_found_count / unique_findable_count × 100: the number to quote for "how many asteroids did the linker recover?" (DIFI runs only) |
In single-partition runs the unique_* values equal the sum counterparts by
construction. In multi-partition sliding-window runs they diverge: for a 76-partition
survey run, findable_count can be ~200k (object, partition) pairs while
unique_findable_count is ~15k distinct objects.
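The divergence between summed and unique counts is easy to reproduce with sets. The sketch below uses the manifest field names from the table above, but the function itself and its input shape (a mapping from partition id to a set of object ids) are hypothetical.

```python
def survey_counts(findable_by_partition, found_by_partition):
    """Compute summed and distinct-object counts across partitions."""
    unique_findable = set().union(*findable_by_partition.values())
    unique_found = set().union(*found_by_partition.values())
    return {
        "findable_count": sum(len(s) for s in findable_by_partition.values()),
        "found_count": sum(len(s) for s in found_by_partition.values()),
        "unique_findable_count": len(unique_findable),
        "unique_found_count": len(unique_found),
        "unique_completeness": (
            100.0 * len(unique_found) / len(unique_findable)
            if unique_findable
            else 0.0  # sketch-only convention when nothing is findable
        ),
    }


# Object B is findable in all three overlapping windows.
findable_by_partition = {"p0": {"A", "B"}, "p1": {"B", "C"}, "p2": {"B"}}
found_by_partition = {"p0": {"A"}, "p1": {"B"}, "p2": {"B"}}
```

Here the summed findable_count is 5 while only 3 distinct objects are findable, and 2 of those 3 are found, so unique_completeness is about 66.7%.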
Performance
Benchmarked on the neomod_quads survey dataset (166M observations, 15,935 objects):
| Scale | Python (v2rc3) | Rust (v2rc4) | Speedup |
|---|---|---|---|
| 55M obs (30 nights) | 23.0s | 0.42s | 55x |
| 111M obs (60 nights) | 67.3s | 0.85s | 80x |
| 166M obs (90 nights) | 132.9s | 1.24s | 107x |
Memory at 100M observations: ~3.2 GB (DIFI), ~5.6 GB (CIFI with tracklets).
Development
# Build the library
# Build the library + CLI binary
# Verify lib-only build pulls no CLI deps
# Full test suite (library + CLI integration tests)
# Lint
# Verify Cargo.toml and pyproject.toml versions agree (CI runs this on every
# PR; the publish workflow also checks tag vs both files before any upload)
# Benchmarks
# Build Python package
Acknowledgments
This work was supported by the Asteroid Institute (a program of the B612 Foundation) and the DIRAC Institute at the University of Washington.
License
BSD 3-Clause. See LICENSE.md for details.