Module rustitude_core::dataset
This module contains all the resources needed to load and examine datasets.
A Dataset is, in essence, a list of Events, each of which contains all the pertinent
information about a single set of initial- and final-state particles, as well as an index
and weight within the Dataset.
This crate currently supports loading Datasets from ROOT and Parquet files (see
Dataset::from_root and Dataset::from_parquet). These methods require the following
“branches” or “columns” to be present in the file:
| Branch Name | Data Type | Notes |
|---|---|---|
| Weight | Float32 | |
| E_Beam | Float32 | |
| Px_Beam | Float32 | |
| Py_Beam | Float32 | |
| Pz_Beam | Float32 | |
| E_FinalState | [Float32] | [recoil, daughter #1, daughter #2, …] |
| Px_FinalState | [Float32] | [recoil, daughter #1, daughter #2, …] |
| Py_FinalState | [Float32] | [recoil, daughter #1, daughter #2, …] |
| Pz_FinalState | [Float32] | [recoil, daughter #1, daughter #2, …] |
| EPS | [Float32] | [$P_\gamma \cos(\Phi)$, $P_\gamma \sin(\Phi)$, $0.0$] for linear polarization with magnitude $P_\gamma$ and angle $\Phi$ |
The EPS branch is optional, and files without it can be loaded in one of three ways.
First, if we don’t care about polarization and wish to set EPS = [0.0, 0.0, 0.0], we can
use Dataset::from_root_unpolarized or Dataset::from_parquet_unpolarized. Second, if a
data file contains events with only one polarization, we can compute the EPS vector
ourselves and use Dataset::from_root_with_eps or Dataset::from_parquet_with_eps to load
the same vector for every event. Finally, for compatibility with the way polarization is
sometimes included in AmpTools files, note that the beam usually moves only along the
$z$-axis, so the $x$ and $y$ components of its momentum are typically 0.0 anyway. Those
two slots in the beam’s four-momentum can therefore hold the $x$ and $y$ components of
EPS, and the methods Dataset::from_root_eps_in_beam or Dataset::from_parquet_eps_in_beam
extract them.
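To make the EPS conventions concrete, here is a minimal sketch of how the vector from the table could be computed by hand. Both `linear_eps` and `eps_from_beam` are hypothetical helpers written for illustration, not part of rustitude_core, and the sketch assumes $\Phi$ is given in radians. How the resulting vector is passed to Dataset::from_root_with_eps should be checked against that method's own documentation.

```rust
/// Hypothetical helper (not a rustitude_core function): builds the EPS vector
/// [P_gamma * cos(Phi), P_gamma * sin(Phi), 0.0] for linear polarization.
/// Assumes `phi` is in radians.
pub fn linear_eps(p_gamma: f32, phi: f32) -> [f32; 3] {
    [p_gamma * phi.cos(), p_gamma * phi.sin(), 0.0]
}

/// Hypothetical illustration of the "EPS in beam" convention: the x and y
/// slots of the beam momentum are read as the x and y components of EPS,
/// and the z component is taken to be 0.0.
pub fn eps_from_beam(px_beam: f32, py_beam: f32) -> [f32; 3] {
    [px_beam, py_beam, 0.0]
}
```

For a file with a single polarization setting, one such vector would be computed once and reused for every event.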
There are also several methods used to split up Datasets based on their component
values. The Dataset::select method takes mutable access to a dataset along with a query
function which takes an Event and returns a bool. For each event, if the query
returns true, the event is removed from the original dataset and added to a new dataset
which is then returned by the select function. The Dataset::reject method does the
opposite. For example,
    let ds_original = Dataset::from_root("path.root").unwrap();
    // select and reject take mutable access, so the working copies must be mutable.
    let mut ds_a = ds_original.clone();
    let mut ds_b = ds_original.clone();
    // Query: is the invariant mass of the first two daughters above 1 GeV?
    let mass_gt_1_gev = |e: &Event| -> bool {
        (e.daughter_p4s[0] + e.daughter_p4s[1]).m() > 1.0
    };
    let ds_a_selected = ds_a.select(mass_gt_1_gev);
    let ds_b_rejected = ds_b.reject(mass_gt_1_gev);

After this, ds_a and ds_b_rejected will contain events where the combined four-momentum
of the first two daughter particles has a mass of at most $1.0$ GeV. On the other hand,
ds_a_selected and ds_b will have events where the opposite is true and the mass is
greater than $1.0$ GeV. The reason for this logic is two-fold. First, we might be
dealing with large datasets, so we don’t want to create copies of events if it can be
avoided. If copies are needed, they should be made explicitly with Dataset::clone.
Otherwise, we just extract the events from the dataset. The other reason is that the syntax
reads in a “correct” way. We expect let selected = data.select(condition); to put the
selected data into the selected dataset. We can then choose if we want to hold on to the
rejected data.
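The move semantics described above can be sketched without the crate itself. The block below uses hypothetical stand-ins (`ToyEvent`, `ToyDataset`) rather than the real Event and Dataset types, and is only an illustration of the behavior, not the library's implementation: events matching the query are moved out of the original collection, and nothing is copied unless the caller clones first.

```rust
/// Hypothetical stand-in for an event: only the invariant mass of a daughter pair.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyEvent {
    pub mass: f64,
}

/// Hypothetical stand-in for a dataset: an owned list of events.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ToyDataset {
    pub events: Vec<ToyEvent>,
}

impl ToyDataset {
    /// Moves every event for which `query` returns true into a new dataset,
    /// leaving the remaining events in `self`. No event is copied.
    pub fn select(&mut self, query: impl Fn(&ToyEvent) -> bool) -> ToyDataset {
        // Take ownership of the events, then partition them in original order.
        let (selected, kept): (Vec<_>, Vec<_>) = std::mem::take(&mut self.events)
            .into_iter()
            .partition(|e| query(e));
        self.events = kept;
        ToyDataset { events: selected }
    }

    /// Mirror image of `select`: moves out the events for which `query` returns false.
    pub fn reject(&mut self, query: impl Fn(&ToyEvent) -> bool) -> ToyDataset {
        self.select(|e| !query(e))
    }
}
```

With this sketch, calling `select` on one clone and `reject` on another with the same query yields the two complementary partitions, mirroring the ds_a / ds_b example.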
Since it is a common operation, there is also a method Dataset::split which will bin data
by a query which takes an Event and returns a Field value (rather than a bool).
This method also takes a range: (Field, Field) and a number of bins nbins: usize, and it
returns a (Vec<Dataset>, Dataset, Dataset). The tuple elements correspond to the binned
datasets, the underflow bin, and the overflow bin respectively, so no data should ever be
“lost” by this operation. There is also a convenience method, Dataset::split_m, to split
the dataset by the mass of the summed four-momenta of a chosen set of daughter particles,
specified by their indices.
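The binning logic can be illustrated with a small generic function. This is a sketch under stated assumptions, not the implementation of Dataset::split: `split_by` is a hypothetical name, `f64` stands in for Field, bins are assumed to be equal-width and half-open (`lo` inclusive, `hi` exclusive), and query values are assumed to be finite. The edge conventions of the real method may differ.

```rust
/// Hypothetical sketch: sorts `items` into `nbins` equal-width bins over
/// [lo, hi) according to `value`, returning (bins, underflow, overflow).
/// Every item ends up in exactly one of the returned collections.
pub fn split_by<T>(
    items: Vec<T>,
    value: impl Fn(&T) -> f64,
    range: (f64, f64),
    nbins: usize,
) -> (Vec<Vec<T>>, Vec<T>, Vec<T>) {
    assert!(nbins > 0, "at least one bin is required");
    let (lo, hi) = range;
    let width = (hi - lo) / nbins as f64;
    let mut bins: Vec<Vec<T>> = (0..nbins).map(|_| Vec::new()).collect();
    let mut underflow = Vec::new();
    let mut overflow = Vec::new();
    for item in items {
        let v = value(&item);
        if v < lo {
            underflow.push(item);
        } else if v >= hi {
            overflow.push(item);
        } else {
            // The clamp guards against floating-point rounding at the upper edge.
            let idx = (((v - lo) / width) as usize).min(nbins - 1);
            bins[idx].push(item);
        }
    }
    (bins, underflow, overflow)
}
```

Because items are moved rather than copied, the sizes of the bins, the underflow, and the overflow always add up to the size of the input.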