tld_extract 0.1.0

Summary

tldextract-rs is a high-performance effective top-level domain (eTLD) extraction module that splits a domain name into its subcomponents: subdomain, domain, and suffix.

Hostname

Cargo.toml:

tld_extract = { git = "https://github.com/emo-cat/tldextract-rs" }

Example code

use tld_extract::TLDExtract;

fn main() {
    let source = tld_extract::Source::Hardcode;
    let suffix = tld_extract::SuffixList::new(source, false, None);
    let mut extract = TLDExtract::new(suffix, true).unwrap();
    let e = extract.extract("  mirrors.tuna.tsinghua.edu.cn").unwrap();
    // Serialize the result for display; this requires serde_json as a dependency.
    let s = serde_json::to_string_pretty(&e).unwrap();
    println!("{}", s);
}

ExtractResult

{
  "subdomain": "mirrors.tuna",
  "domain": "tsinghua",
  "suffix": "edu.cn",
  "registered_domain": "tsinghua.edu.cn"
}

Implementation details

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element works for single-label eTLDs such as com, but it fails for multi-label eTLDs such as oseto.nagasaki.jp, where it would return only jp.
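
To make the difference concrete, here is a minimal sketch that compares the naive split with a lookup against known suffixes. The function names and the small `known_suffixes` slice are hypothetical stand-ins for illustration; they are not part of the tld_extract API, and a real lookup would use the full Public Suffix List.

```rust
/// Naive approach: treat the last dot-separated label as the suffix.
fn naive_suffix(host: &str) -> &str {
    host.rsplit('.').next().unwrap_or(host)
}

/// Suffix-aware approach: return the longest known suffix that ends `host`
/// on a label boundary. `known_suffixes` is a hypothetical, hand-picked
/// stand-in for the Public Suffix List.
fn longest_suffix<'a>(host: &'a str, known_suffixes: &[&str]) -> Option<&'a str> {
    let mut best: Option<&'a str> = None;
    for suffix in known_suffixes {
        // The suffix must be the whole host or be preceded by a '.'.
        let matches = host == *suffix || host.ends_with(&format!(".{}", suffix));
        if matches && best.map_or(true, |b| suffix.len() > b.len()) {
            best = Some(&host[host.len() - suffix.len()..]);
        }
    }
    best
}
```

With the suffixes jp, nagasaki.jp and oseto.nagasaki.jp, the naive function reports jp for city.oseto.nagasaki.jp, while the suffix-aware one reports oseto.nagasaki.jp. The linear scan here is only for clarity; the trie described in the next section avoids checking every suffix.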

eTLD tries

tldextract-rs stores eTLDs in compressed tries.

Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label in reverse order, so the rightmost label sits closest to the root.

Given the following eTLDs:
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`, the compressed trie is structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`
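
The lookup illustrated above can be sketched in a few lines of Rust. This is a simplified illustration, not the crate's actual implementation: `Node`, `SuffixTrie` and their methods are hypothetical names, the trie is a plain per-label `HashMap` trie rather than a compressed one, and the wildcard and exception rules of the Public Suffix List are ignored.

```rust
use std::collections::HashMap;

/// One trie node per label; `is_suffix` plays the role of the 🚩 marker.
#[derive(Default)]
struct Node {
    is_suffix: bool,
    children: HashMap<String, Node>,
}

/// Hypothetical label trie keyed on reversed labels.
#[derive(Default)]
struct SuffixTrie {
    root: Node,
}

impl SuffixTrie {
    /// Insert an eTLD, walking its labels from right to left.
    fn insert(&mut self, suffix: &str) {
        let mut node = &mut self.root;
        for label in suffix.rsplit('.') {
            node = node.children.entry(label.to_string()).or_default();
        }
        node.is_suffix = true;
    }

    /// Return the longest eTLD that ends `host`, if any.
    fn longest_suffix<'a>(&self, host: &'a str) -> Option<&'a str> {
        let mut node = &self.root;
        let mut start = host.len(); // byte offset where the matched tail begins
        let mut best = None;
        for (i, label) in host.rsplit('.').enumerate() {
            // Stop as soon as the host leaves the trie.
            let Some(child) = node.children.get(label) else { break };
            // Every label after the first also consumes its '.' separator.
            start -= label.len() + usize::from(i > 0);
            if child.is_suffix {
                best = Some(&host[start..]);
            }
            node = child;
        }
        best
    }
}
```

Walking `example.nsw.edu.au` visits au (flagged), edu (not flagged) and nsw (flagged), then stops at example, so the longest match is nsw.edu.au. Remembering the last flagged node rather than the last visited node is what keeps an unflagged intermediate label such as edu from being reported as a suffix.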