wdict
Create dictionaries by scraping webpages.
Similar tools exist; some of wdict's features were inspired by them.
Take it for a spin
Build with nix and run the resulting binary:
nix build .#
./result/bin/wdict --help
...or run it directly with nix:
nix run .# -- --help
...or run it straight from GitHub:
nix run github:pyqlsa/wdict -- --help
...or install it with cargo:
cargo install wdict
...or build from source inside a nix dev shell:
nix develop .#
cargo build
./target/debug/wdict --help
cargo build --release
./target/release/wdict --help
Usage
Create dictionaries by scraping webpages.
Usage: wdict [OPTIONS] <--url <URL>|--theme <THEME>>
Options:
-u, --url <URL>
URL to start crawling from
--theme <THEME>
Pre-canned theme URLs to start crawling from (for fun, demoing features, and sparking new ideas)
Possible values:
- star-wars: Star Wars themed URL <https://www.starwars.com/databank>
- tolkien: Tolkien themed URL <https://www.quicksilver899.com/Tolkien/Tolkien_Dictionary.html>
- witcher: Witcher themed URL <https://witcher.fandom.com/wiki/Elder_Speech>
- pokemon: Pokemon themed URL <https://www.smogon.com>
- bebop: Cowboy Bebop themed URL <https://cowboybebop.fandom.com/wiki/Cowboy_Bebop>
- greek: Greek Mythology themed URL <https://www.theoi.com>
- greco-roman: Greek and Roman Mythology themed URL <https://www.gutenberg.org/files/22381/22381-h/22381-h.htm>
- lovecraft: H.P. Lovecraft themed URL <https://www.hplovecraft.com>
-d, --depth <DEPTH>
Limit the depth of crawling urls
[default: 1]
-m, --min-word-length <MIN_WORD_LENGTH>
Only save words with length greater than or equal to this value
[default: 3]
-r, --req-per-sec <REQ_PER_SEC>
Number of requests to make per second
[default: 20]
-o, --output <OUTPUT>
File to write dictionary to (will be overwritten if it already exists)
[default: wdict.txt]
--output-urls
Write discovered urls to a file
--output-urls-file <OUTPUT_URLS_FILE>
File to write urls to, json formatted (will be overwritten if it already exists)
[default: urls.json]
--filters <FILTERS>...
Filter strategy for words; multiple can be specified (comma separated)
[default: none]
Possible values:
- deunicode: Transform unicode according to <https://github.com/kornelski/deunicode>
- decancer: Transform unicode according to <https://github.com/null8626/decancer>
- all-numbers: Ignore words that consist of all numbers
- any-numbers: Ignore words that contain any number
- no-numbers: Ignore words that contain no numbers
- only-numbers: Keep only words that exclusively contain numbers
- all-ascii: Ignore words that consist of all ascii characters
- any-ascii: Ignore words that contain any ascii character
- no-ascii: Ignore words that contain no ascii characters
- only-ascii: Keep only words that exclusively contain ascii characters
- none: Leave the word as-is
-j, --inclue-js
Include javascript from <script> tags and urls
-c, --inclue-css
Include CSS from <style> tags and urls
--site-policy <SITE_POLICY>
Site policy for discovered urls
[default: same]
Possible values:
- same: Allow crawling urls, only if the domain exactly matches
- subdomain: Allow crawling urls if they are the same domain or subdomains
- sibling: Allow crawling urls if they are the same domain or a sibling
- all: Allow crawling all urls, regardless of domain
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
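As a quick, hypothetical example (the theme choice and output filenames below are only illustrative), the following crawls the built-in Lovecraft theme two links deep, keeps words of at least five characters, transliterates unicode, ignores pure-number tokens, and writes both the dictionary and the discovered urls to files:
wdict --theme lovecraft --depth 2 --min-word-length 5 --filters deunicode,all-numbers --output lovecraft.txt --output-urls
To crawl an arbitrary site instead, a gentler crawl restricted to the same domain and its subdomains might look like (the URL and filename are placeholders):
wdict --url https://example.com --depth 1 --req-per-sec 5 --site-policy subdomain -o example.txt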
Lib
This crate exposes a library, but for the time being, the interfaces should be considered unstable.
TODO
A list of ideas for future work:
- archive mode to crawl and save pages locally
- build dictionaries from local (archived) pages
- support different mime types
- better async?
License
Licensed under either of
- Apache License, Version 2.0
- MIT License
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be
dual licensed as above, without any additional terms or conditions.