scrapelect
scrapelect is a web scraping language inspired by CSS that turns
a web page into structured JSON data. Select elements with CSS
selectors, apply filters to extract and modify the data you want from
a web page, and get the output in a structured, machine-readable,
interoperable format.
installation
Install the Rust toolchain, then use cargo to run:
$ cargo install scrapelect
to install the scrapelect interpreter.
usage
Write a scrapelect program in a .scrp file. Documentation
for the language can be found in the scrapelect book.
A quick example, title.scrp, retrieves the title of a Wikipedia article:
title: .mw-page-title-main {
  content: $element | text();
};
Run the program with the URL of the web page to scrape:
$ scrapelect title.scrp "https://en.wikipedia.org/wiki/Cat"
It will output the scraped data as structured JSON.
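Because the result is JSON, another program can consume it directly. The following is a minimal Python sketch, not part of scrapelect itself: it assumes the scrapelect binary is on your PATH and that it writes its JSON result to standard output. The function names run_scrapelect and parse_scrapelect_output are hypothetical helpers introduced only for illustration.

```python
import json
import subprocess


def parse_scrapelect_output(raw: str):
    """Decode the JSON text produced by the interpreter into Python objects."""
    return json.loads(raw)


def run_scrapelect(program: str, url: str):
    """Run `scrapelect <program> <url>` and return the decoded result.

    Assumption: the interpreter prints its JSON result to standard output.
    """
    completed = subprocess.run(
        ["scrapelect", program, url],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the captured bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return parse_scrapelect_output(completed.stdout)
```

For the example above, a call such as run_scrapelect("title.scrp", "https://en.wikipedia.org/wiki/Cat") would return the scraped data as ordinary Python dictionaries and lists, ready to be filtered or stored.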
documentation
- The scrapelect book contains documentation on language concepts and how to write a scrapelect program.
- Additionally, documentation for scrapelect's built-in filters is located at docs.rs.
- Developer-level documentation is also at docs.rs, but it is currently incomplete.
community
- GitHub issues and discussions are great places to report bugs, request features, and get help using scrapelect.
- Also, consider submitting a pull request to contribute to the code or documentation.
- See the contributing chapter of the scrapelect book for more information on contributing to scrapelect.
license
scrapelect is available under the MIT or Apache 2 licenses, at your
option. Copies of these licenses are included as LICENSE-MIT and
LICENSE-APACHE in the root directory.
scrapelect: scrape + select, also -lect