§Ready-to-use web-scraping, data processing and NLP pipelines
Rust-native, state-of-the-art library for simplifying web scraping of news and public data. It is a port of the earlier Python application, the NewsLookout package.
This package enables building a full-fledged multi-threaded web scraping solution that runs with very meagre resources (e.g. a single-core CPU with less than 4 GB RAM). It is driven by the configuration specified in a config file and is intended to be invoked in batch mode.
This library is the main entry point for the package: it loads the config, initialises the workers and starts the scraping pipeline.
§Architecture
This library sets up a web scraping pipeline and executes it as follows:
- Starts each web retriever module in its own separate thread; these threads run in parallel to fetch content from their respective websites.
- Each page's content is populated into a document struct and transmitted by the web retriever threads to the data processing chain.
- Simultaneously, the data processing modules are started in their own threads, which together form the data processing chain. The retrieved documents are passed to these threads serially, in the order of the priority configured for each data processing module.
- Each data processing module processes the content and may add to or modify the document it receives, then passes it on to the next data processing thread in order of priority.
- Popular LLM services such as ChatGPT, Google Gemini and self-hosted LLMs served via Ollama are supported in the data processing pipelines. The relevant API keys need to be configured as environment variables before using these plugins.
- At the end of the chain, the document is written to disk as a JSON file.
- The retrieved URLs are saved to an SQLite database table so that they are not retrieved again in the next run.
- Adequate wait times are applied during web retrieval to avoid overloading the target website. All events and actions are logged to a central log file. Multiple instances are prevented by writing and checking for a PID file; if desired, however, multiple instances can be launched by running the application with separate config files. A minimal sketch of this threaded pipeline is given below.
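To make the flow above concrete, here is a minimal, self-contained sketch of such a threaded pipeline built only on the standard library. It is not the crate's actual implementation: the Document struct, the channel wiring and the single processing step are illustrative assumptions that mirror the description above.

use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for the crate's document struct.
struct Document {
    url: String,
    content: String,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Document>();

    // Retriever threads: each fetches content from "its" website in parallel
    // and sends the populated document on to the processing chain.
    let sites = vec!["https://example.com/a", "https://example.com/b"];
    let mut retrievers = Vec::new();
    for site in sites {
        let tx = tx.clone();
        retrievers.push(thread::spawn(move || {
            let doc = Document {
                url: site.to_string(),
                content: format!("  raw page fetched from {}  ", site),
            };
            tx.send(doc).expect("processing chain hung up");
        }));
    }
    drop(tx); // drop the original sender so the channel closes once all retrievers finish

    let processor = thread::spawn(move || {
        for mut doc in rx {
            // Data processing chain: in the real crate each module runs in its own
            // thread and documents pass through them in configured priority order;
            // a single cleanup step stands in for that chain here.
            doc.content = doc.content.trim().to_string();
            // The real pipeline writes the finished document to disk as JSON and
            // records its URL in SQLite so it is not fetched again; here we just print it.
            println!("{} -> {}", doc.url, doc.content);
        }
    });

    for r in retrievers {
        r.join().unwrap();
    }
    processor.join().unwrap();
}

In the crate itself, the retriever and data processing modules are loaded as plugins and ordered by the priorities given in the config file, as described above.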
§Get Started
Get started using this crate in just a few lines of code, for example:
use std::env;
use newslookout::run_app;

fn main() {
    if env::args().len() < 2 {
        println!("Usage: newslookout_app <config_file>");
        panic!("Provide config file as parameter on the command line (need 2 parameters, got {})",
            env::args().len()
        );
    }

    let config_file = env::args().nth(1).unwrap();
    println!("Loading configuration from file: {}", config_file);
    let app_config: config::Config = newslookout::utils::read_config(config_file);

    let docs_retrieved: Vec<newslookout::document::DocInfo> = run_app(app_config);
    // use this collection of retrieved documents information for any further custom processing
}
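The returned collection and the loaded config::Config can be used for further custom processing. As a small hedged sketch, assuming a configuration key named "data_dir" exists in your config file (the key name is purely illustrative and not defined by this crate), you could inspect the loaded settings and summarise the run like this:

use config::Config;

// Illustrative helper: read one (hypothetical) key from the loaded configuration
// and report how many documents the pipeline returned.
fn report_run(app_config: &Config, docs_count: usize) {
    // "data_dir" is an example key, not one mandated by this crate.
    match app_config.get::<String>("data_dir") {
        Ok(dir) => println!("Documents were written under: {}", dir),
        Err(e) => println!("Key 'data_dir' not found in the config: {}", e),
    }
    println!("Pipeline retrieved {} documents in this run.", docs_count);
}

Since run_app in the example above takes the Config by value, call such a helper before handing the configuration over, or keep a clone of it.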
§Create your own custom plugins and run these in the Pipeline
Declare custom retriever plugins and add them to the pipeline to fetch data using your custom logic.
fn run_pipeline(config: &config::Config) -> Vec<newslookout::document::DocInfo> {
    newslookout::init_logging(config);
    newslookout::init_pid_file(config);

    log::info!("Starting the custom pipeline");

    let mut retriever_plugins = newslookout::pipeline::load_retriever_plugins(config);
    let mut data_proc_plugins = newslookout::pipeline::load_dataproc_plugins(config);

    // add custom data retriever:
    retriever_plugins.push(my_plugin);

    let docs_retrieved = newslookout::pipeline::start_data_pipeline(
        retriever_plugins,
        data_proc_plugins,
        config
    );

    log::info!("Data pipeline completed processing {} documents.", docs_retrieved.len());
    // use docs_retrieved for any further custom processing.

    newslookout::cleanup_pid_file(&config);

    docs_retrieved
}
Similarly, you can also declare and use custom data processing plugins, e.g.:
data_proc_plugins.push(my_own_data_processing);

Note that all data processing plugins are run in the serial order of priority as defined in the config file.
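The exact plugin signature expected by newslookout::pipeline is defined by the crate and is not reproduced here. Purely as an illustration of running processing steps in serial order of priority, the following self-contained sketch sorts a set of hypothetical plugin functions by a numeric priority (assuming lower values run first) and applies them one after another; all names in it are made up for the example.

// Hypothetical document type and processing plugins, for illustration only.
struct ArticleDoc {
    text: String,
    tags: Vec<String>,
}

fn strip_whitespace(mut doc: ArticleDoc) -> ArticleDoc {
    doc.text = doc.text.trim().to_string();
    doc
}

fn tag_language(mut doc: ArticleDoc) -> ArticleDoc {
    doc.tags.push("lang:en".to_string());
    doc
}

fn main() {
    // (priority, plugin) pairs, as they might be declared in a config file.
    // We assume here that a lower number runs earlier; the crate defines its
    // own convention in its configuration.
    let mut plugins: Vec<(u8, fn(ArticleDoc) -> ArticleDoc)> = Vec::new();
    plugins.push((20, tag_language));
    plugins.push((10, strip_whitespace));

    // Sort by priority, then apply the plugins serially to the document.
    plugins.sort_by_key(|&(priority, _)| priority);

    let mut doc = ArticleDoc {
        text: "  Some scraped text  ".to_string(),
        tags: Vec::new(),
    };
    for (_, plugin) in &plugins {
        doc = plugin(doc);
    }
    println!("processed text: {:?}, tags: {:?}", doc.text, doc.tags);
}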
A few pre-built modules are provided for specific websites; these can be readily extended to other websites as required. Refer to the README file and the source code of these modules in the plugins folder to roll out your own plugins.
§Functions
- cleanup_pid_file - Shuts down the application by performing any cleanup required.
- init_logging - Initialise the application by configuring the log file, and setting the PID file to prevent duplicate instances from running simultaneously.
- init_pid_file
- load_and_run_pipeline - Runs the web scraping application plugins. Refers to the config object passed as the parameter. Initialises the logging, PID and multi-threaded web scraping modules as well as the data processing modules of the pipeline. All of these are configured and enabled via the config file.