Random Forests for Change Point Detection
Change point detection aims to identify structural breaks in the probability distribution of a time series. Existing methods either assume a parametric model for the within-segment distributions or are based on ranks or distances, and thus fail in scenarios with reasonably large dimensionality.
changeforest implements a classifier-based algorithm that consistently estimates
change points without any parametric assumptions, even in high-dimensional scenarios.
It uses the out-of-bag probability predictions of a random forest to construct a
pseudo-log-likelihood that is optimized with a computationally feasible two-step
method.
See [1] for details.
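The core idea can be sketched in a few lines. The following is a minimal, illustrative implementation of the classifier-based gain using scikit-learn's RandomForestClassifier; the helper name approx_gain and its prior correction are our own, and this is a sketch of the idea rather than the library's (Rust) implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def approx_gain(X, split, n_trees=100):
    """Pseudo-log-likelihood gain of one candidate split (illustrative only)."""
    n = len(X)
    # Train a classifier to distinguish observations before and after the split.
    y = (np.arange(n) >= split).astype(int)
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True)
    forest.fit(X, y)
    # Out-of-bag probability that each observation lies after the split.
    p = forest.oob_decision_function_[:, 1]
    p = np.clip(np.nan_to_num(p, nan=0.5), 1e-6, 1 - 1e-6)
    # Log-likelihood ratio against the no-change model, correcting for the
    # class proportions split/n and (n - split)/n seen by the classifier.
    before = np.log(n / split * (1 - p[:split]))
    after = np.log(n / (n - split) * p[split:])
    return before.sum() + after.sum()
```

If the distribution does not change at the candidate split, the out-of-bag predictions stay close to the class proportions and the gain is small; a distributional change lets the classifier separate the two halves and the gain grows.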
changeforest is available as a Rust crate, as a Python package (on PyPI and
conda-forge) and as an R package (on conda-forge, Linux and macOS only). See below
for the respective user guides.
Python
Installation
To install from conda-forge (recommended), simply run
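```
conda install -c conda-forge changeforest
```

The package is also on PyPI, so installing it with pip install changeforest should work as well.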
Example
The following example performs random forest-based change point detection on the iris
dataset. It contains three classes, setosa, versicolor and virginica, with 50
observations each. We interpret this as a simulated time series with change points at
t = 50, 100.
```python
In [1]: from changeforest import changeforest
   ...: from sklearn.datasets import load_iris
   ...:
   ...: # The iris observations are ordered by class, so the class changes at
   ...: # t = 50 and t = 100 act as change points.
   ...: data = load_iris()
   ...: X = data.data
   ...: changeforest(X, "random_forest", "bs")
Out[1]:
                    best_split   max_gain  p_value is_significant
(0, 150]                    50  96.173664     0.01           True
 ¦--(0, 50]                 34  -5.262184     1.00          False
 °--(50, 150]              100  51.557473     0.01           True
     ¦--(50, 100]           80  -3.068934     1.00          False
     °--(100, 150]         134  -2.063508     1.00          False
```
changeforest also implements the methods change_in_mean and knn. While random_forest
and knn use the TwoStepSearch optimizer described in [1], change_in_mean uses the
GridSearch optimizer. Both random_forest and knn perform model selection via a
pseudo-permutation test (see [1]). For change_in_mean, split candidates are kept
whenever max_gain > control.minimal_gain_to_split.
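For intuition, a naive GridSearch-style optimizer could simply evaluate such a gain at every admissible split, as in the sketch below, which reuses the illustrative approx_gain helper from above (again an assumption-laden sketch, not the library's optimizer). The TwoStepSearch of [1] avoids fitting one classifier per candidate by approximating entire gain curves from a handful of fits.

```python
def grid_search(X, minimal_relative_segment_length=0.1):
    """Naive grid-search optimizer (illustrative only): fit one classifier
    per candidate split and keep the split with the largest gain."""
    n = len(X)
    lo = int(minimal_relative_segment_length * n)
    gains = {split: approx_gain(X, split) for split in range(lo, n - lo)}
    best_split = max(gains, key=gains.get)
    return best_split, gains[best_split]
```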
The iris dataset
allows for rather simple classification due to large mean shifts between classes. As a
result, both simpler methods, change_in_mean and knn, correctly identify the true change points.
```python
In [2]: result = changeforest(X, "change_in_mean", "bs")
   ...: result.split_points()
Out[2]: [50, 100]

In [3]: result = changeforest(X, "knn", "bs")
   ...: result.split_points()
Out[3]: [50, 100]
```
changeforest returns a tree-like object with attributes start, stop, best_split, max_gain, p_value, is_significant, optimizer_result, model_selection_result, left, right and segments. These are useful for investigating the output of the algorithm further. Here we
plot the approximate gain curves of the first three segments:
```python
In [4]: import matplotlib.pyplot as plt
   ...:
   ...: result = changeforest(X, "random_forest", "bs")
   ...: # The last gain_result of each segment's optimizer_result holds the final
   ...: # approximate gain curve of the two-step search; the exact attribute path
   ...: # below may differ between package versions.
   ...: plt.plot(range(150), result.optimizer_result.gain_results[-1].gain)
   ...: plt.plot(range(50), result.left.optimizer_result.gain_results[-1].gain)
   ...: plt.plot(range(50, 150), result.right.optimizer_result.gain_results[-1].gain)
   ...: plt.legend(["(0, 150]", "(0, 50]", "(50, 150]"])
   ...: plt.show()
```
One can clearly observe that the approximate gain curves are piecewise linear, with maxima at the true underlying change points.
The changeforest algorithm can be tuned with hyperparameters. See here
for their descriptions and default values. In Python, the parameters can
be specified with the Control class,
which can be passed to changeforest. The following will build random forests with
very few trees:
```python
In [5]: from changeforest import Control
   ...: # random_forest_n_estimators controls the number of trees; the parameter
   ...: # name may differ between package versions.
   ...: changeforest(X, "random_forest", "bs", Control(random_forest_n_estimators=20))
```
R
Installation
To install from conda-forge, simply run
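```
# conda-forge distributes R packages under an r- prefix; the package name
# below follows that convention.
conda install -c conda-forge r-changeforest
```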
Example
The following example performs random forest-based change point detection on the iris
dataset. It contains three classes, setosa, versicolor and virginica, with 50
observations each. We interpret this as a simulated time series with change points at
t = 50, 100.
```r
> library(changeforest)
> data(iris)
> X <- as.matrix(iris[, 1:4])
> changeforest(X, "random_forest", "bs")
                name best_split   max_gain p_value is_significant
1 (0, 150]                   50  96.173664    0.01           TRUE
2  ¦--(0, 50]                34  -5.262184    1.00          FALSE
3  °--(50, 150]             100  51.557473    0.01           TRUE
4      ¦--(50, 100]          80  -3.068934    1.00          FALSE
5      °--(100, 150]        134  -2.063508    1.00          FALSE
```
changeforest also implements the methods change_in_mean and knn. While random_forest
and knn use the TwoStepSearch optimizer described in [1], change_in_mean uses the
GridSearch optimizer. Both random_forest and knn perform model selection via a
pseudo-permutation test (see [1]). For change_in_mean, split candidates are kept
whenever max_gain > control.minimal_gain_to_split.
The iris dataset allows for rather simple classification due to large mean shifts between classes. As a
result, change_in_mean and knn also correctly identify the true change points.
```r
> result <- changeforest(X, "change_in_mean", "bs")
> result$split_points()
[1]  50 100

> result <- changeforest(X, "knn", "bs")
> result$split_points()
[1]  50 100
```
changeforest returns a tree-like object with attributes start, stop, best_split, max_gain, p_value, is_significant, optimizer_result, model_selection_result, left, right and segments. These are useful for investigating the output of the algorithm further. Here we
plot the approximate gain curves of the first three segments:
```r
> library(ggplot2)
> result <- changeforest(X, "random_forest", "bs")
> # Gather the approximate gain curves of the first three segments. The
> # attribute path below mirrors the Python interface and may differ
> # between package versions.
> data = data.frame(
+     t = c(1:150, 1:50, 51:150),
+     gain = c(
+         result$optimizer_result$gain_results[[3]]$gain,
+         result$left$optimizer_result$gain_results[[3]]$gain,
+         result$right$optimizer_result$gain_results[[3]]$gain
+     ),
+     segment = rep(c("(0, 150]", "(0, 50]", "(50, 150]"), c(150, 50, 100))
+ )
> ggplot(data, aes(x = t, y = gain, color = segment)) + geom_line()
```
One can clearly observe that the approximate gain curves are piecewise linear, with maxima at the true underlying change points.
The changeforest algorithm can be tuned with hyperparameters. See here
for their descriptions and default values. In R, the parameters can
be specified with the Control class,
which can be passed to changeforest. The following
will build random forests with very few trees:
```r
> # The Control interface in R mirrors the Python one; the constructor and
> # parameter name below are assumed and may differ between package versions.
> changeforest(X, "random_forest", "bs", Control(random_forest_n_estimators = 20))
```
References
[1] M. Londschien, S. Kovács and P. Bühlmann (2022), "Random Forests for Change Point Detection", working paper.