difi

Did I Find It?

difi evaluates the completeness and purity of astronomical object linkage results from software such as THOR, HelioLinC, or MOPS. Given observations with known object associations and a set of predicted linkages, difi determines which objects were successfully discovered and how clean each linkage is.

This is difi v2 — a ground-up rewrite in Rust with Python bindings. It uses Rayon for parallelism and Apache Arrow for zero-copy data interchange.

Installation

Python (recommended)

Requires Python 3.10+ and a Rust toolchain (1.85+).

pip install maturin
git clone https://github.com/moeyensj/difi.git && cd difi
maturin develop --release

Verify:

>>> import difi
>>> difi.__version__  # matches the published rc

Rust only

cargo build --release

CLI

The difi binary ships with the crate behind a cli Cargo feature (so library consumers don't pull clap/toml/anyhow transitively):

cargo install difi-rs --features cli
# or from a checkout
cargo build --release --features cli   # binary: ./target/release/difi

Input format

difi requires two input tables:

Observations

Each row is a single detection with an optional ground-truth object label.

Column	Type	Nullable	Description
`id`	string	no	Unique observation identifier
`time`	struct{days: i64, nanos: i64}	no	Epoch as MJD days + nanoseconds
`ra`	f64	no	Right ascension (degrees, 0-360)
`dec`	f64	no	Declination (degrees, -90 to 90)
`ra_sigma`	f64	yes	RA uncertainty (degrees)
`dec_sigma`	f64	yes	Dec uncertainty (degrees)
`observatory_code`	string	no	Observatory/telescope identifier
`object_id`	string	yes	Ground-truth object label (null if unknown)
`night`	i64	no	Local observing night identifier
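
The `time` column stores each epoch as integer MJD days plus nanoseconds. If your catalog holds floating-point MJDs, a conversion along the following lines may help. This is a minimal sketch: `split_mjd` is a hypothetical helper (not part of difi), and it assumes `nanos` counts nanoseconds since the start of the MJD day.

```python
import math

NANOS_PER_DAY = 86_400 * 1_000_000_000  # nanoseconds in one day


def split_mjd(mjd: float) -> tuple[int, int]:
    """Split a floating-point MJD into (whole days, nanoseconds into the day).

    Hypothetical helper for illustration; not part of the difi API.
    """
    days = math.floor(mjd)
    nanos = round((mjd - days) * NANOS_PER_DAY)
    # Rounding can land exactly on the next day boundary; carry it over.
    if nanos == NANOS_PER_DAY:
        days += 1
        nanos = 0
    return days, nanos


days, nanos = split_mjd(60000.5)  # noon on MJD 60000
```

A float64 MJD near 60000 resolves only to roughly a microsecond, so the lowest nanosecond digits produced this way carry no real information.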

Linkage members

Each row maps a predicted linkage to one of its constituent observations.

Column	Type	Nullable	Description
`linkage_id`	string	no	Linkage identifier
`obs_id`	string	no	Observation identifier (foreign key to observations.id)

Both tables can be read from Parquet files or passed in directly as quivr/PyArrow objects.
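
Because `obs_id` is a foreign key into `observations.id`, it can be worth checking the linkage members for dangling references before a run. The sketch below does this on plain Python tuples; `group_linkages` and the sample identifiers are hypothetical and not part of difi.

```python
from collections import defaultdict


def group_linkages(members, observation_ids):
    """Group (linkage_id, obs_id) rows by linkage and collect dangling obs_ids.

    Hypothetical helper for illustration; not part of the difi API.
    """
    known = set(observation_ids)
    linkages = defaultdict(list)
    dangling = []
    for linkage_id, obs_id in members:
        if obs_id in known:
            linkages[linkage_id].append(obs_id)
        else:
            # obs_id has no matching row in the observations table
            dangling.append((linkage_id, obs_id))
    return dict(linkages), dangling


observation_ids = ["obs_1", "obs_2", "obs_3"]
members = [("L1", "obs_1"), ("L1", "obs_2"), ("L2", "obs_3"), ("L2", "obs_9")]
linkages, dangling = group_linkages(members, observation_ids)
```

Here `dangling` ends up holding the one row that points at an observation missing from the sample table.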

Usage

Python

Each function accepts either a file path (str/Path) or a quivr Table object.

from difi import analyze_observations, analyze_linkages

# Step 1: CIFI — determine which objects are findable
# Pass file paths...
cifi_result = analyze_observations("observations.parquet")

# ...or quivr Tables
from difi import Observations
observations = Observations.from_parquet("observations.parquet")
cifi_result = analyze_observations(observations)

# Step 2: DIFI — classify linkages and compute completeness
difi_result = analyze_linkages(
    "observations.parquet",
    "linkage_members.parquet",
    min_obs=6,
    contamination_percentage=20.0,
)
print(f"Completeness: {difi_result['completeness']:.1f}%")
print(f"Pure: {difi_result['num_pure']}, Mixed: {difi_result['num_mixed']}")
# num_ignored_linkages counts linkages excluded because they had no
# observations inside the analysis partition. A non-zero value signals a
# mismatch between the linkage members input and the observation set.
print(f"Ignored: {difi_result['num_ignored_linkages']}")

Findability metrics

The metric parameter controls what observation pattern makes an object "findable":

from difi import analyze_observations

# Singleton metric (default): object needs >= min_obs detections
# across >= min_nights distinct nights
result = analyze_observations(
    "observations.parquet",
    metric="singletons",
    min_obs=6,
    min_nights=3,
)

# Tracklet metric: object needs intra-night tracklets (multiple
# detections within max_obs_separation hours showing angular motion)
# on >= min_nights distinct nights
result = analyze_observations(
    "observations.parquet",
    metric="tracklets",
    min_nights=3,
)
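
To make the singleton rule concrete, here is a rough pure-Python sketch of the check described in the comments above: an object counts as findable when it has at least `min_obs` labeled detections on at least `min_nights` distinct nights. The function name and input layout are hypothetical, and this is not difi's implementation.

```python
from collections import defaultdict


def findable_singletons(rows, min_obs=6, min_nights=3):
    """Return the set of object ids that satisfy the singleton rule.

    `rows` is an iterable of (object_id, night) pairs, one per detection.
    Hypothetical helper for illustration; not part of the difi API.
    """
    nights_by_object = defaultdict(list)
    for object_id, night in rows:
        if object_id is not None:  # unlabeled detections belong to no object
            nights_by_object[object_id].append(night)
    return {
        obj
        for obj, nights in nights_by_object.items()
        if len(nights) >= min_obs and len(set(nights)) >= min_nights
    }


rows = (
    [("A", night) for night in (1, 1, 2, 2, 3, 3)]    # 6 obs over 3 nights
    + [("B", night) for night in (1, 1, 1, 2, 2, 2)]  # 6 obs over only 2 nights
    + [(None, 1)]                                     # unknown object
)
findable = findable_singletons(rows)
```

Object `B` has enough detections but too few distinct nights, so only `A` is findable in this sample.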

In Rust, metrics are structs with full configuration:

use difi::metrics::singleton::SingletonMetric;
use difi::metrics::tracklet::TrackletMetric;

// Singleton: 6 obs across 3 nights; if the object is observed on exactly
// 3 nights, each night needs at least 2 obs
let singleton = SingletonMetric {
    min_obs: 6,
    min_nights: 3,
    min_nightly_obs_in_min_nights: 2,
};

// Tracklet: 2+ obs per tracklet within 1.5 hours, at least 1 arcsecond of
// angular separation, tracklets on 3+ nights
let tracklet = TrackletMetric {
    tracklet_min_obs: 2,
    max_obs_separation: 1.5 / 24.0,  // days
    min_linkage_nights: 3,
    min_obs_angular_separation: 1.0,  // arcseconds
};
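
The tracklet parameters above can be read as the following check, sketched in Python for the two-detection case only. This is one plausible interpretation, not difi's implementation: it assumes a night qualifies when some pair of detections lies within `max_obs_separation` days of each other and at least `min_obs_angular_separation` arcseconds apart. Both function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def tracklet_findable(detections, max_obs_separation=1.5 / 24.0,
                      min_obs_angular_separation=1.0, min_linkage_nights=3):
    """Check one object's detections, given as (night, mjd, ra_deg, dec_deg).

    Hypothetical helper for illustration; not part of the difi API.
    """
    by_night = defaultdict(list)
    for night, mjd, ra, dec in detections:
        by_night[night].append((mjd, ra, dec))
    tracklet_nights = 0
    for obs in by_night.values():
        # A night counts if any pair is close in time and far enough on sky.
        if any(
            abs(t2 - t1) <= max_obs_separation
            and angular_separation_arcsec(r1, d1, r2, d2)
            >= min_obs_angular_separation
            for (t1, r1, d1), (t2, r2, d2) in combinations(obs, 2)
        ):
            tracklet_nights += 1
    return tracklet_nights >= min_linkage_nights


# Three nights, each with two detections 0.02 d apart and 3.6" apart in Dec.
moving = [
    row
    for n in range(3)
    for row in ((n, 60000.0 + n, 10.0, 0.0), (n, 60000.02 + n, 10.0, 0.001))
]
is_findable = tracklet_findable(moving)
```

Raising `tracklet_min_obs` above 2 would need a rule for longer tracklets, which this sketch does not attempt.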

CLI

A thin wrapper over the library, for shell pipelines and reproducible runs. Subcommand names mirror the Python verbs; short aliases keep shell use ergonomic.

# CIFI: findability from observations  (alias: `difi cifi`)
difi analyze-observations \
    -i observations.parquet \
    -o out/ \
    --metric singletons --min-obs 6 --min-nights 3

# CIFI + DIFI: classify linkages end-to-end  (alias: `difi analyze`)
difi analyze-linkages \
    -i observations.parquet \
    -l linkage_members.parquet \
    -o out/ \
    --contamination-percentage 20

Outputs in <output-dir>/:

File	Written by	Contents
`all_objects.parquet`	both	One row per (object, partition) — CIFI findability flag plus DIFI linkage stats merged in
`findable_observations.parquet`	both	One row per findable (object, partition) with discovery night
`partition_summaries.parquet`	both	One row per partition with observation / findable / found / completeness counts
`all_linkages.parquet`	`analyze-linkages`	One row per classified (linkage, partition) with pure/contaminated/mixed flags
`ignored_linkages.parquet`	`analyze-linkages` (only when non-empty)	Linkages excluded from classification, with reason + partition
`run_manifest.json`	both	argv, input SHA-256 prefixes, host, per-scenario timings, `warnings` counts, optional `reused_cifi` provenance

Partitioned CIFI

# Sliding 30-night windows
difi cifi -i observations.parquet -o out/ \
    --partition-mode sliding --partition-window 30

# Tracklets with non-overlapping 15-night blocks
difi cifi -i observations.parquet -o out/ \
    --metric tracklets --min-linkage-nights 3 \
    --partition-mode blocks --partition-window 15

Partition flags also apply to analyze-linkages: the CLI runs DIFI once per partition summary and writes a combined all_linkages.parquet keyed by partition_id. Linkages whose observations fall entirely outside a given partition are excluded from all_linkages.parquet and reported in a separate ignored_linkages.parquet. The manifest's warnings section surfaces the counts, so an unexpectedly high orphan_linkages value (linkages that never intersect any partition) flags a likely mismatched --linkages file.
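
The difference between an ignored linkage (no observations in one particular partition) and an orphan linkage (no observations in any partition) can be sketched as follows. This is an illustration only: it assumes each partition is an inclusive range of night identifiers, and `split_ignored` is a hypothetical name.

```python
def split_ignored(linkage_nights, partitions):
    """Sort (linkage, partition) pairs into classified and ignored.

    linkage_nights: {linkage_id: [night, ...]}
    partitions:     {partition_id: (first_night, last_night)}, inclusive
    Hypothetical helper for illustration; not part of the difi API.
    """
    classified, ignored = [], []
    for partition_id, (start, end) in partitions.items():
        for linkage_id, nights in linkage_nights.items():
            if any(start <= night <= end for night in nights):
                classified.append((linkage_id, partition_id))
            else:
                ignored.append((linkage_id, partition_id))
    # Orphans never intersect any partition at all.
    seen = {linkage_id for linkage_id, _ in classified}
    orphans = sorted(set(linkage_nights) - seen)
    return classified, ignored, orphans


partitions = {0: (1, 15), 1: (16, 30)}
linkage_nights = {"a": [2, 3, 5], "b": [14, 17], "c": [40, 41]}
classified, ignored, orphans = split_ignored(linkage_nights, partitions)
```

Linkage `a` is ignored only in partition 1, which is expected for a windowed run, while `c` is an orphan and points at a likely input mismatch.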

Reusing a CIFI snapshot

CIFI is the expensive phase on survey-scale inputs. Run it once, reuse across multiple linkage sets:

# Produce a reusable CIFI snapshot
difi cifi -i observations.parquet -o cifi_snapshot/

# Classify two independent linkage sets against the same CIFI work
difi analyze-linkages -i observations.parquet -l thor_linkages.parquet \
    --cifi-output-dir cifi_snapshot/ -o difi_thor/
difi analyze-linkages -i observations.parquet -l precovery_linkages.parquet \
    --cifi-output-dir cifi_snapshot/ -o difi_precovery/

--cifi-output-dir is mutually exclusive with partition flags (the snapshot encodes its own partitions). A SHA-256 prefix of the observations file is stored in each manifest; mismatches between the snapshot and the current observations fail fast with a clear error.
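
The fingerprint check can be pictured with the sketch below, which hashes a byte stream and compares a truncated hex digest. The prefix length of 16 characters and both function names are assumptions for illustration; the actual manifest layout may differ.

```python
import hashlib
import io


def sha256_prefix(stream, length=16, chunk_size=1 << 20):
    """Hex SHA-256 of a binary stream, truncated to `length` characters.

    The default prefix length is an assumption, not difi's documented value.
    """
    digest = hashlib.sha256()
    # Read in chunks so large Parquet files never sit fully in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()[:length]


def snapshot_matches(stream, recorded_prefix):
    """Compare a stream's hash prefix with the one recorded in a manifest."""
    return sha256_prefix(stream, length=len(recorded_prefix)) == recorded_prefix


prefix = sha256_prefix(io.BytesIO(b"abc"))
```

For a real file, pass `open(path, "rb")` in place of the `io.BytesIO` object.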

Batch scenarios

Declare scenarios in TOML for findability sweeps (LSST baselines, etc.):

# lsst_findability.toml
[defaults]
observations = "/path/to/observations.parquet"

[[scenario]]
name = "singleton_6obs_3nights"
metric = "singletons"
min_obs = 6
min_nights = 3

[[scenario]]
name = "tracklet_3pairs_15nights"
metric = "tracklets"
min_linkage_nights = 3
partition_mode = "sliding"
partition_window = 15

difi cifi --scenarios lsst_findability.toml -o results/
# results/<scenario>/all_objects.parquet, partition_summaries.parquet, ...
# results/run_manifest.json summarizes every scenario

Per-scenario observations = "..." overrides [defaults].

Machine-readable progress

--progress-json emits one NDJSON event per line on stdout; human text still goes to stderr.

difi --progress-json cifi -i observations.parquet -o out/ \
    | jq -c 'select(.event == "scenario_done")'

Errors always produce a human line on stderr; under --progress-json an {"event":"error", ...} line is additionally written to stdout so machine consumers see them too.
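
A Python consumer of the NDJSON stream might look like the sketch below. Only the `event` key and the `scenario_done` and `error` values come from the text above; the `scenario_start` event and the `scenario` and `message` fields in the sample lines are invented for illustration.

```python
import json


def summarize_events(lines):
    """Split NDJSON progress lines into finished scenarios and errors."""
    done, errors = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("event") == "scenario_done":
            done.append(event)
        elif event.get("event") == "error":
            errors.append(event)
    return done, errors


# Sample lines with hypothetical fields; real events may carry other keys.
sample = [
    '{"event": "scenario_start", "scenario": "demo"}',
    '{"event": "scenario_done", "scenario": "demo"}',
    '{"event": "error", "message": "example failure"}',
]
done, errors = summarize_events(sample)
```

In practice `lines` would be the stdout of a `difi --progress-json ...` subprocess, read line by line.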

Rust

use std::path::Path;

use difi::cifi::analyze_observations;
use difi::difi::analyze_linkages;
use difi::io::{read_observations, read_linkage_members};
use difi::metrics::singleton::SingletonMetric;

// Load from Parquet
let (obs, mut interner, _) = read_observations(Path::new("observations.parquet"))?;
let lm = read_linkage_members(Path::new("linkage_members.parquet"), &mut interner)?;

// Step 1: CIFI — determine findability
let metric = SingletonMetric::default();
let (mut all_objects, findable, mut summaries) =
    analyze_observations(&obs, None, &metric)?;

// Step 2: DIFI — classify linkages. Returns (AllLinkages, IgnoredLinkages).
// Linkages whose observations all fall outside summaries[0]'s night range
// are redirected to `ignored` with reason NoObservationsInPartition, instead
// of producing phantom pure/contaminated/mixed rows.
let (all_linkages, ignored) = analyze_linkages(
    &obs, &lm, &mut all_objects, &mut summaries[0], 6, 20.0,
)?;

For multi-partition DIFI, loop over summaries and concatenate each partition's AllLinkages / IgnoredLinkages. The update_all_objects call inside analyze_linkages scopes its writes to the current partition's rows in AllObjects, so multi-partition loops are safe.

Cross-crate usage (e.g. from THOR)

difi defines ObservationTable and LinkageMemberTable traits. Implement them for your own types to call difi directly without data conversion:

impl difi::types::ObservationTable for MyObservations {
    fn len(&self) -> usize { self.ids.len() }
    fn ids(&self) -> &[u64] { &self.ids }
    fn nights(&self) -> &[i64] { &self.nights }
    fn object_ids(&self) -> &[u64] { &self.object_ids }
    // ...
}

let (objects, findable, summaries) =
    difi::cifi::analyze_observations(&my_obs, None, &metric)?;

Pipeline

difi operates in two phases:

1. CIFI (Can I Find It?): determines which objects are "findable" based on observation patterns. Supports two metrics:
   - SingletonMetric: >= min_obs observations across >= min_nights nights
   - TrackletMetric: intra-night tracklets with temporal and angular separation constraints
2. DIFI (Did I Find It?): classifies each linkage as exactly one of:
   - Pure: all observations belong to one object
   - Pure complete: pure, and contains all partition observations of that object
   - Contaminated: mostly one object, contamination <= threshold
   - Mixed: too contaminated to attribute to a single object

Completeness = (found objects / findable objects) × 100%, where found counts objects for which at least one pure linkage contains ≥ min_obs observations inside the partition. Single-partition runs see no difference from the whole-linkage interpretation; multi-partition runs stay bounded by in-partition evidence instead of inflating when a cross-boundary linkage has enough total obs but few inside any one window.
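
The classification and completeness rules can be sketched in a few lines of Python. This is an approximation under stated assumptions, not difi's implementation: contamination is taken as the percentage of a linkage's observations that do not belong to its most common object, unlabeled observations count as contamination, and only findable objects count toward completeness. All function names are hypothetical.

```python
from collections import Counter


def classify_linkage(object_ids, contamination_percentage=20.0):
    """Return (label, object) for the ground-truth labels of one linkage."""
    counts = Counter(oid for oid in object_ids if oid is not None)
    if not counts:
        return "mixed", None  # no labeled observations at all
    dominant, n_dominant = counts.most_common(1)[0]
    contamination = 100.0 * (len(object_ids) - n_dominant) / len(object_ids)
    if contamination == 0.0:
        return "pure", dominant
    if contamination <= contamination_percentage:
        return "contaminated", dominant
    return "mixed", None


def found_objects(linkages, min_obs=6):
    """Objects with at least one pure linkage of >= min_obs observations."""
    found = set()
    for object_ids in linkages.values():
        label, obj = classify_linkage(object_ids)
        if label == "pure" and len(object_ids) >= min_obs:
            found.add(obj)
    return found


def completeness(found, findable):
    """(found objects / findable objects) x 100, restricted to findable ones."""
    findable = set(findable)
    if not findable:
        return 0.0
    return 100.0 * len(set(found) & findable) / len(findable)


linkages = {
    "l1": ["A"] * 6,               # pure
    "l2": ["B"] * 8 + ["C"] * 2,   # 20% contamination: contaminated
    "l3": ["B"] * 3 + ["C"] * 3,   # 50% contamination: mixed
}
found = found_objects(linkages)
score = completeness(found, {"A", "B"})
```

Only `A` is found here, because `B` never appears in a pure linkage, so completeness over the findable set `{A, B}` is 50%.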

Per-partition vs. cross-partition (survey-wide) counts

partition_summaries.parquet carries per-partition rows with the same completeness formula scoped to each window. For a sliding-window run, the same real object typically appears in many windows, so naively summing findable / found across partitions double-counts.

run_manifest.json → scenarios[0] carries both views:

Field	Meaning
`findable_count`	Sum across partitions. Useful for workload sizing
`found_count`	Sum across partitions (DIFI runs only)
`unique_findable_count`	Distinct objects findable in ≥ 1 partition
`unique_found_count`	Distinct objects with ≥ 1 found-pure linkage across all partitions (DIFI runs only)
`unique_completeness`	`unique_found_count / unique_findable_count × 100` — the number to quote for "how many asteroids did the linker recover?" (DIFI runs only)

In single-partition runs the unique_* values equal the sum counterparts by construction. In multi-partition sliding-window runs they diverge: for a 76-partition survey run, findable_count can be ~200k (object, partition) pairs while unique_findable_count is ~15k distinct objects.
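
A small worked example shows why the summed and unique counts diverge. The sketch below computes both views from per-partition sets of object ids. The output keys reuse the manifest field names from the table above, but the function and its inputs are hypothetical.

```python
def summarize(findable_by_partition, found_by_partition):
    """Compute summed and unique counts from per-partition object-id sets."""
    unique_findable = set().union(*findable_by_partition.values())
    unique_found = set().union(*found_by_partition.values())
    return {
        # Summed views count (object, partition) pairs.
        "findable_count": sum(len(s) for s in findable_by_partition.values()),
        "found_count": sum(len(s) for s in found_by_partition.values()),
        # Unique views count each real object once.
        "unique_findable_count": len(unique_findable),
        "unique_found_count": len(unique_found),
        "unique_completeness": (
            100.0 * len(unique_found) / len(unique_findable)
            if unique_findable else 0.0
        ),
    }


# Three overlapping windows in which the same objects recur.
findable = {0: {"A", "B"}, 1: {"A", "B", "C"}, 2: {"B", "C"}}
found = {0: {"A"}, 1: {"A", "C"}, 2: {"C"}}
summary = summarize(findable, found)
```

Here `findable_count` is 7 while `unique_findable_count` is 3: the same three objects recur across windows, which is the double-counting that the `unique_*` fields avoid.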

Performance

Benchmarked on the neomod_quads survey dataset (166M observations, 15,935 objects):

Scale	Python (v2rc3)	Rust (v2rc4)	Speedup
55M obs (30 nights)	23.0s	0.42s	55x
111M obs (60 nights)	67.3s	0.85s	80x
166M obs (90 nights)	132.9s	1.24s	107x

Memory at 100M observations: ~3.2 GB (DIFI), ~5.6 GB (CIFI with tracklets).

Development

# Build the library
cargo build

# Build the library + CLI binary
cargo build --features cli

# Verify lib-only build pulls no CLI deps
cargo build --no-default-features

# Full test suite (library + CLI integration tests)
cargo test --features cli

# Lint
cargo clippy --all-targets --features cli -- -D warnings

# Verify Cargo.toml and pyproject.toml versions agree (CI runs this on every
# PR; the publish workflow also checks tag vs both files before any upload)
./scripts/check_versions.sh

# Benchmarks
cargo bench

# Build Python package
maturin develop --release

Acknowledgments

This work was supported by the Asteroid Institute (a program of the B612 Foundation) and the DIRAC Institute at the University of Washington.

License

BSD 3-Clause. See LICENSE.md for details.

difi-rs 2.0.0-rc7