newsfresh 0.1.0-beta.2

CLI and library for querying, filtering, and analyzing GDELT Global Knowledge Graph (GKG) v2.1 data — the world's largest open news event dataset
docs.rs failed to build newsfresh-0.1.0-beta.2
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

NewsFresh

A fast CLI tool for querying, filtering, and analyzing GDELT Global Knowledge Graph (GKG) v2.1 data — the world's largest open dataset of global news events, updated every 15 minutes. Includes an LLM-powered agent that turns GDELT into a personalized agentic news feed, delivering real-time email alerts tailored to your interests.

What is GDELT?

The GDELT Project monitors news media worldwide — print, broadcast, and web — in over 100 languages. Every 15 minutes it publishes structured records representing news articles, each annotated with people, organizations, themes, locations, sentiment, quotations, event counts, and more.

Each GKG record is a tab-delimited line with 27 fields. newsfresh handles the full pipeline: downloading, decompressing, parsing, filtering, searching, and outputting results in JSON or TeaLeaf format.

Features

Personalized Agentic News Feed

An LLM-powered monitoring agent runs 24/7, polling GDELT every 15 minutes and delivering email alerts for topics you care about. Define interest profiles in a simple JSON config and the agent handles the rest — search, deduplication, summarization, and notification.

  • Multi-provider LLM support — swap between Claude (Haiku, Sonnet) and OpenAI (GPT-4o, GPT-4o-mini) with a single config change
  • Interest profiles — define multiple search topics with country, person, theme, and tone filters
  • Smart deduplication — 24-hour rolling window prevents duplicate alerts
  • Email notifications — HTML summaries with who/what/where/sentiment via SendGrid
  • Live config reload — change profiles or switch LLM providers without restarting

See agent/README.md for setup and usage.

Full-Text Search & Analysis

  • Natural language search over GKG records using an in-memory Tantivy index with BM25 ranking
  • Search enrichment — FIPS country codes, ADM1 state/province codes, and theme taxonomy prefixes are expanded so searching "United States" or "California" matches raw codes like US or US06
  • Aggregate statistics — frequency tables and tone analysis over search results using Polars DataFrames (--stats)

Flexible Data Pipeline

  • Fetch, parse, filter, search, output — each stage works independently or composed via query and analyze commands
  • 12 composable filters — person, organization, theme, location, country, tone range, date range, source, image, and quotation filters, all AND-composed
  • 4 output formats — JSON, JSON-compact, TeaLeaf (~47% fewer tokens), TeaLeaf-compact (~60% fewer tokens)
  • Field projection — output only the fields you need with --fields
  • Streaming parser — handles large GKG files without loading everything into memory

Installation

Pre-built binaries

Download the latest release from GitHub Releases:

Linux / macOS:

# Linux (x64)
curl -L https://github.com/krishjag/newsfresh/releases/latest/download/newsfresh-linux-x64.tar.gz | tar xz
sudo mv newsfresh /usr/local/bin/

# Linux (arm64)
curl -L https://github.com/krishjag/newsfresh/releases/latest/download/newsfresh-linux-arm64.tar.gz | tar xz
sudo mv newsfresh /usr/local/bin/

# macOS (Apple Silicon)
curl -L https://github.com/krishjag/newsfresh/releases/latest/download/newsfresh-macos-arm64.tar.gz | tar xz
sudo mv newsfresh /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/krishjag/newsfresh/releases/latest/download/newsfresh-macos-x64.tar.gz | tar xz
sudo mv newsfresh /usr/local/bin/

Windows (PowerShell):

# Windows (x64)
Invoke-WebRequest -Uri https://github.com/krishjag/newsfresh/releases/latest/download/newsfresh-windows-x64.zip -OutFile newsfresh.zip
Expand-Archive newsfresh.zip -DestinationPath .
Move-Item newsfresh.exe "$env:USERPROFILE\.cargo\bin\"   # or any directory in your PATH
Remove-Item newsfresh.zip

# Windows (arm64)
Invoke-WebRequest -Uri https://github.com/krishjag/newsfresh/releases/latest/download/newsfresh-windows-arm64.zip -OutFile newsfresh.zip
Expand-Archive newsfresh.zip -DestinationPath .
Move-Item newsfresh.exe "$env:USERPROFILE\.cargo\bin\"   # or any directory in your PATH
Remove-Item newsfresh.zip

From crates.io

cargo install newsfresh

From source

git clone https://github.com/krishjag/newsfresh
cd newsfresh
cargo build --release
# Binary at target/release/newsfresh

Agent setup

The monitoring agent requires the newsfresh CLI and Python 3.10+:

# Install the CLI (any method above), then:
cd agent
pip install -r requirements.txt
cp config.json.example config.json  # edit with your profiles
python main.py --once               # test a single cycle

See agent/README.md for full setup instructions.

Commands

fetch — Download GKG data

# Latest 15-minute update (default)
newsfresh fetch

# Historical file by date
newsfresh fetch --date 20250217150000

# Non-English variant, custom output directory, keep zip
newsfresh fetch --translation -o ./my-data --keep-zip

parse — Parse a local GKG file

# Parse a CSV file with filters
newsfresh parse data/20250217150000.gkg.csv \
  --country US --person "Trump" --limit 10

# Parse directly from a zip file
newsfresh parse data/20250217150000.gkg.csv.zip -f json

# Output specific fields only
newsfresh parse data/gkg.csv \
  -f json --fields document_identifier,source_common_name,tone

query — Fetch + parse + filter in one step

# Fetch latest and filter by theme
newsfresh query --country US --theme "CLIMATE_CHANGE" --limit 5

# Historical data with tone filter
newsfresh query --date 20250201120000 --tone-min=-10 --tone-max=-2

# Persist downloaded files for reuse
newsfresh query --persist-data-file --country US --limit 20

# Output in compact TeaLeaf format
newsfresh query --country UK --has-quote -f tealeaf-compact

analyze — Natural language full-text search

# Search the latest GDELT data with natural language
newsfresh analyze --latest \
  --search "elections Congress US economy" --limit 20

# Search with additional structured filters
newsfresh analyze --latest \
  --search "climate carbon emissions policy" \
  --country US --limit 10

# From a local file
newsfresh analyze data/gkg.csv \
  --search "Ukraine Russia ceasefire negotiations" --limit 15

# Compact TeaLeaf output for LLM consumption (~47% fewer tokens)
newsfresh analyze --latest \
  --search "AI regulation technology" --limit 10 -f tealeaf-compact

# Aggregate statistics with Polars DataFrames
newsfresh analyze --latest \
  --search "US politics" --limit 50 --stats

# Top 5 per category
newsfresh analyze data/gkg.csv \
  --search "climate change" --stats --stats-top-n 5 --limit 100

The analyze command builds an in-memory Tantivy full-text index with BM25 ranking. Search enrichment layers expand raw codes into readable text for better recall:

  • FIPS country codes expanded to full names (240+ countries) — searching "United States" matches records with code US
  • ADM1 state/province codes expanded — searching "California" matches code US06
  • Theme codes canonicalized — TAX_FNCACT_PRESIDENT becomes searchable as "President"

--stats — Aggregate statistics

When --stats is passed, instead of outputting individual records, analyze computes frequency tables and statistics over the matched results using Polars DataFrames:

=== GDELT Analysis Stats (50 records) ===

--- Top Themes ---
   1. LEADER                        24  (2.0%)
   2. GENERAL GOVERNMENT            24  (2.0%)
   3. GOVERNMENT                    21  (1.8%)

--- Top Countries ---
   1. United States (US)            48  (31.2%)
   2. United Kingdom (UK)            9  (5.8%)

--- Tone ---
  Mean: -0.82  Std: 3.45  Range: [-8.12, 4.91]
  Most positive: [4.91] https://example.com/positive-article
  Most negative: [-8.12] https://example.com/negative-article

--- Top Persons ---
   1. donald trump                    9  (4.9%)
   2. elon musk                       3  (1.6%)

--- Top Organizations ---
   1. congress                       3  (1.7%)

--- Top Sources ---
   1. nytimes.com                   16  (32.0%)
Flag Description Default
--stats Show aggregate statistics instead of records off
--stats-top-n Number of entries per frequency table 10

schema — Print GKG type definitions

# TeaLeaf schema format
newsfresh schema

# JSON Schema format
newsfresh schema -f json-schema

Filter Options

All filters compose with AND logic. Available on parse, query, and analyze:

Flag Description Example
--person Person name (case-insensitive substring) --person "Trump"
--org Organization name --org "United Nations"
--theme GKG theme code --theme "TAX_POLICY"
--location Location name --location "Washington"
--country FIPS country code --country US
--tone-min / --tone-max Tone score range --tone-min -5 --tone-max 5
--date-from / --date-to Date range (YYYYMMDD) --date-from 20250201
--source Source name --source "bbc"
--has-image Only records with a sharing image --has-image
--has-quote Only records with quotations --has-quote

Output Formats

Format Flag Description
JSON (pretty) -f json Pretty-printed JSON array (default)
JSON (compact) -f json-compact Minified single-line JSON
TeaLeaf -f tealeaf Schema-driven format, ~47% fewer tokens than JSON
TeaLeaf (compact) -f tealeaf-compact Minified TeaLeaf, additional ~21% savings

Use --fields for field projection — output only the fields you need:

newsfresh query --limit 5 -f json --fields document_identifier,source_common_name,tone

GKG v2.1 Record Fields

Each record contains up to 27 fields:

Field Type Description
gkg_record_id String Unique record identifier
date Integer Publication date (YYYYMMDDHHMMSS)
source_collection_id Integer Source type (1=Web, 2=Citation, 3=Core, ...)
source_common_name String Human-readable source name
document_identifier String Article URL
v1_counts Array Event counts (protests, arrests, etc.)
v21_counts Array V2.1 counts with character offsets
v1_themes Array Theme codes (e.g., TAX_POLICY)
v2_enhanced_themes Array Themes with character offsets
v1_locations Array Geocoded locations (country, lat/lon)
v2_enhanced_locations Array V2 locations with ADM2 codes
v1_persons Array Person names mentioned
v2_enhanced_persons Array Persons with character offsets
v1_organizations Array Organization names
v2_enhanced_organizations Array Organizations with offsets
tone Object Sentiment (tone, polarity, pos/neg scores, word count)
v21_enhanced_dates Array Dates mentioned in the article
gcam Array GCAM content-analysis dimension scores
sharing_image String? Primary sharing image URL
related_images Array Related image URLs
social_image_embeds Array Social media image embeds
social_video_embeds Array Social media video embeds
quotations Array Direct quotes with attribution verbs
all_names Array All named entities
amounts Array Monetary/numerical amounts
translation_info Object? Translation source language and engine
extras_xml String? Extra XML content

Library Usage

newsfresh can also be used as a Rust library:

use newsfresh::parse::{self, GkgReader};
use newsfresh::filter::{CompositeFilter, RecordFilter};
use newsfresh::filter::predicates::*;
use newsfresh::search;

// Parse records from a file
let file = std::fs::File::open("data/gkg.csv")?;
let reader = std::io::BufReader::new(file);
let gkg_reader = GkgReader::new(reader);

let mut records = Vec::new();
for result in gkg_reader {
    let (line_num, line) = result?;
    if let Ok(record) = parse::parse_record(&line, line_num) {
        records.push(record);
    }
}

// Apply filters
let mut filters = CompositeFilter::new();
filters.add(Box::new(CountryFilter { code: "US".to_string() }));
filters.add(Box::new(PersonFilter { pattern: "Trump".to_string() }));

let filtered: Vec<_> = records.iter()
    .filter(|r| filters.matches(r))
    .collect();

// Full-text search with BM25 ranking
let mut engine = search::create_engine();
engine.build(&records)?;
let hits = engine.search("climate policy carbon", 20)?;

for hit in &hits {
    let record = &records[hit.record_index];
    println!("[{:.2}] {}", hit.score, record.document_identifier);
}

Module Overview

Module Description
cli CLI argument definitions (clap derive)
error NewsfreshError — HTTP, I/O, parse, ZIP, and JSON errors
fetch HTTP client, ZIP extraction, lastupdate.txt parsing
filter RecordFilter trait and 10 composable predicate filters
model Complete GKG v2.1 data model — GkgRecord with 27 fields, 14 sub-types
output OutputFormatter trait — JSON and TeaLeaf formatters, field projection, schema printing
parse Streaming GKG parser — GkgReader iterator, tab-delimited field parsing
search Tantivy full-text search — SearchEngine trait, BM25, FIPS/ADM1/theme enrichment
stats Polars DataFrame aggregation — theme/country/person/org/source frequency, tone statistics

Data Source

GDELT data is freely available at http://data.gdeltproject.org/gdeltv2/. The project publishes new data every 15 minutes. See the GDELT documentation for details on the GKG v2.1 format.

License

MIT