py-canon
The Python frontend for find-dup-defs:
Python source → a CPython ast.dump-shape canonical form plus a top-level definition scan.
Parsing uses the Ruff Python parser, so modern syntax such as PEP 695 type parameters and PEP 701 f-strings is supported. Two layers:
- find_module_defs scans files for each module-level definition (function, class, UPPER_CASE constant, type alias) → ModuleDef { kind, name, file, line, col, text }.
- Canonicalization of a definition's source text: ast_canonical (a structural canonical form matching CPython's ast.dump shape, docstrings stripped; the input to byte-for-byte Ratcliff–Obershelp similarity), plus normalize_functions / analyze_functions for the alpha-renamed and name-agnostic forms used to detect renamed copy-paste.
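
To make the two canonical forms concrete, here is a minimal sketch of the reference behaviour written against CPython's own ast module. It is not this crate's code: the helper names (strip_docstrings, reference_canonical, AlphaRename, alpha_canonical) are hypothetical, and the renaming scheme (placeholders in order of first appearance, function name blanked) is an assumption for illustration rather than what normalize_functions actually emits.

```python
import ast


def strip_docstrings(tree: ast.AST) -> ast.AST:
    """Drop a leading string-literal statement from modules, classes and functions."""
    owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, owners) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                node.body = node.body[1:]
    return tree


def reference_canonical(source: str) -> str:
    # ast.dump omits line/column attributes by default, so layout does not matter.
    return ast.dump(strip_docstrings(ast.parse(source)))


class AlphaRename(ast.NodeTransformer):
    """Toy alpha-renaming (assumed scheme): every identifier becomes v0, v1, ..."""

    def __init__(self) -> None:
        self.names: dict[str, str] = {}

    def _placeholder(self, name: str) -> str:
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "_"  # make the form independent of the function's own name
        self.generic_visit(node)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._placeholder(node.id)
        return node


def alpha_canonical(source: str) -> str:
    tree = strip_docstrings(ast.parse(source))
    return ast.dump(AlphaRename().visit(tree))


with_doc = 'def area(w, h):\n    """Return the area."""\n    return w * h\n'
reformatted = "def area(w,h):\n\n    return w*h\n"
renamed = "def size(width, height):\n    return width * height\n"

print(reference_canonical(with_doc) == reference_canonical(reformatted))  # True
print(reference_canonical(with_doc) == reference_canonical(renamed))      # False
print(alpha_canonical(with_doc) == alpha_canonical(renamed))              # True
```

The structural form already ignores docstrings and whitespace, while only the alpha-renamed form treats the renamed copy as identical. The toy renamer also rewrites globals and builtins, which a real implementation would likely handle more carefully.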
The canonicalization is validated byte-for-byte against a golden corpus produced by CPython's own
ast module (examples/verify_golden.rs).
use ;
let defs = find_module_defs; // reads the files, returns top-level defs
for d in &defs
Reusable on its own; pairs with difflib-fast for the
similarity/clustering step.
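
As a rough illustration of the similarity step, the sketch below scores two strings with CPython's difflib.SequenceMatcher, whose ratio is based on the Ratcliff–Obershelp approach. It stands in for difflib-fast and does not show that crate's API; the similarity helper is a hypothetical name, and comparing UTF-8 bytes is an assumption about what "byte-for-byte" means here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return 2*M/T over UTF-8 bytes: M matched bytes, T combined length."""
    matcher = SequenceMatcher(None, a.encode("utf-8"), b.encode("utf-8"), autojunk=False)
    return matcher.ratio()


# Longest common block is "bcd" (3 bytes); combined length is 8, so 2*3/8.
print(similarity("abcd", "bcde"))  # 0.75
```

In practice the inputs would be canonical strings for two definitions, and pairs scoring above a chosen threshold would be grouped as likely duplicates.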
License
MIT.