Crate group_similar

This crate enables grouping values based on string similarity via Jaro-Winkler distance and complete-linkage clustering.
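For readers unfamiliar with the metric, the following is a minimal sketch of the textbook Jaro-Winkler similarity using only the standard library. It is not the crate's implementation: the function names `jaro` and `jaro_winkler`, the prefix cap of 4, and the scaling factor of 0.1 are assumptions taken from the conventional definition. The result is a similarity in the range 0.0 to 1.0; the corresponding distance is one minus that value.

```rust
/// Jaro similarity in [0.0, 1.0]; 1.0 means the strings are identical.
/// Hypothetical helper for illustration, not part of group_similar.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }

    // Two equal characters only count as a match when their positions
    // are at most `window` apart.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut matches = 0usize;

    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }

    // Walk the matched characters of both strings in order; every pair
    // that differs is half of a transposition.
    let a_seq = a.iter().zip(&a_matched).filter(|(_, m)| **m).map(|(c, _)| c);
    let b_seq = b.iter().zip(&b_matched).filter(|(_, m)| **m).map(|(c, _)| c);
    let out_of_order = a_seq.zip(b_seq).filter(|(x, y)| x != y).count();

    let m = matches as f64;
    let t = (out_of_order / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix
/// (at most 4 characters, scaling factor 0.1; both are assumed defaults).
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let sim = jaro(a, b);
    let prefix = a
        .chars()
        .zip(b.chars())
        .take(4)
        .take_while(|(x, y)| x == y)
        .count();
    sim + prefix as f64 * 0.1 * (1.0 - sim)
}
```

With this sketch, `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961, identical strings score 1.0, and strings with no characters in common score 0.0.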

Example: Identify likely repeated merchants based on merchant name

use group_similar::{Config, Named, Threshold, group_similar};

#[derive(Eq, PartialEq, std::hash::Hash, Debug)]
struct Merchant {
    id: usize,
    name: String,
}

impl Named for Merchant {
    fn name(&self) -> &str {
        &self.name
    }
}

let merchants = vec![
    Merchant {
        id: 1,
        name: "McDonalds 105109".to_string()
    },
    Merchant {
        id: 2,
        name: "McDonalds 105110".to_string()
    },
    Merchant {
        id: 3,
        name: "Target ID1244".to_string()
    },
    Merchant {
        id: 4,
        name: "Target ID125".to_string()
    },
    Merchant {
        id: 5,
        name: "Amazon.com TID120159120".to_string()
    },
    Merchant {
        id: 6,
        name: "Target".to_string()
    },
    Merchant {
        id: 7,
        name: "Target.com".to_string()
    },
];

let config = Config::jaro_winkler(Threshold::default());
let results = group_similar(&merchants, &config);

assert_eq!(results.get(&merchants[0]), Some(&vec![&merchants[1]]));
assert_eq!(results.get(&merchants[2]), Some(&vec![&merchants[3], &merchants[5], &merchants[6]]));
assert_eq!(results.get(&merchants[4]), Some(&vec![]));

Structs

Config

Config manages the grouping configuration through three internally managed settings: threshold, method, and compare.

Threshold

Threshold is a newtype wrapper describing how permissive comparisons are for a given comparison closure.

Traits

Named

Named describes data structures that expose a name as a &str; that name is the value used for grouping.

Functions

group_similar

Groups records according to the given configuration.
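
The crate description names complete-linkage clustering as the grouping strategy. As a rough illustration of that idea, and not the crate's actual code, the sketch below merges clusters bottom-up and stops once no pair of clusters is within a cutoff. The names `complete_linkage`, `cluster_complete_linkage`, and `max_distance` are hypothetical, and the distance function is supplied by the caller.

```rust
/// Complete-linkage distance between two clusters: the largest pairwise
/// distance between any member of `a` and any member of `b`.
fn complete_linkage<T>(
    a: &[usize],
    b: &[usize],
    items: &[T],
    dist: &impl Fn(&T, &T) -> f64,
) -> f64 {
    let mut worst = 0.0_f64;
    for &i in a {
        for &j in b {
            worst = worst.max(dist(&items[i], &items[j]));
        }
    }
    worst
}

/// Naive agglomerative clustering with complete linkage.
/// Returns clusters as lists of indices into `items`.
fn cluster_complete_linkage<T>(
    items: &[T],
    max_distance: f64,
    dist: impl Fn(&T, &T) -> f64,
) -> Vec<Vec<usize>> {
    // Start with every item in its own cluster.
    let mut clusters: Vec<Vec<usize>> = (0..items.len()).map(|i| vec![i]).collect();

    loop {
        // Find the closest pair of clusters that is still within the cutoff.
        let mut best: Option<(usize, usize, f64)> = None;
        for a in 0..clusters.len() {
            for b in (a + 1)..clusters.len() {
                let d = complete_linkage(&clusters[a], &clusters[b], items, &dist);
                if d <= max_distance && best.map_or(true, |(_, _, bd)| d < bd) {
                    best = Some((a, b, d));
                }
            }
        }

        match best {
            Some((a, b, _)) => {
                // `b > a`, so removing `b` leaves index `a` valid.
                let merged = clusters.remove(b);
                clusters[a].extend(merged);
            }
            None => break, // no pair is close enough; clustering is done
        }
    }
    clusters
}
```

Because complete linkage uses the farthest pair, every member of a cluster ends up within `max_distance` of every other member. For the points 0.0, 0.4, and 0.9 with a cutoff of 0.6 and absolute difference as the distance, the first two are grouped and the third stays alone, even though 0.4 and 0.9 are within the cutoff of each other.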