🔍 PolicyCheck
Web Attribution and Compliance Scanner
A fast, portable tool for checking web scraping compliance across robots.txt, RSL licenses, and TDM policies. Built with Rust for the OpenAttribution initiative.
What is PolicyCheck?
PolicyCheck helps you scrape responsibly by checking multiple compliance signals:
- ✅ Robots.txt - What paths you can crawl (REP/RFC 9309)
- 📜 RSL Licenses - Required licensing terms (Responsible Sourcing License)
- 🎯 Content Signals - AI usage preferences (Cloudflare's policy framework)
- 🤖 TDM Policies - Text & Data Mining permissions (coming soon)
- 🔒 Privacy Controls - DNT, GPC signals (coming soon)
- 📧 Security Contacts - Who to contact about scraping (coming soon)
Features
- 🤖 AI Bot Analysis - Check 26 known AI crawlers (GPTBot, ClaudeBot, CCBot, etc.)
- 🎯 Content Signals - Detect Cloudflare's AI policy signals (search, ai-input, ai-train)
- 📊 CSV Export - Major AI bots as columns for advertiser analysis
- 🚀 Fast - Built with Rust, battle-tested parser (34M+ robots.txt files)
- 📦 Portable - Single binary, no dependencies
- 🔍 Comprehensive - User agents, crawl delays, sitemaps, paths, licenses
- 📜 RSL License Detection - Automatically finds Responsible Sourcing Licenses
- 📈 Multiple Formats - Table, JSON, CSV, or compact text output
- 🌐 HTTP API - Run as a service for integration
- 📝 CSV Batch Processing - Analyze thousands of URLs concurrently
- ⚡ Concurrent - Parallel URL analysis
Web UI
Try PolicyCheck instantly at openattribution.org/policycheck
- 🌐 No installation required
- 📊 Interactive analysis with visual results
- 📥 Export to CSV for bulk analysis
- 🤖 See AI bot blocking status at a glance
Perfect for quick checks before integrating the API or CLI.
Quick Start
Installation
From crates.io (Recommended)
Requires Rust 1.75+:
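Assuming the crate is published under the name `policycheck`:

```sh
cargo install policycheck
```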
From Source
For development or the latest unreleased features:
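A typical flow (the repository path is assumed from the container image name used later in this README):

```sh
git clone https://github.com/openattribution-org/policycheck.git
cd policycheck
cargo build --release
```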
The binary will be at `target/release/policycheck`.
Basic Usage
```sh
# Analyze a single URL
policycheck analyze --url https://example.com

# Check multiple URLs
policycheck analyze -u https://example.com -u https://example.org

# Analyze from a CSV file (advertiser use case)
policycheck analyze --csv publishers.csv

# Check for a specific user agent
policycheck analyze -u https://example.com --user-agent GPTBot

# Output as JSON
policycheck analyze -u https://example.com --format json

# Output as CSV with AI bot columns
policycheck analyze --csv publishers.csv --format csv

# Save to file
policycheck analyze --csv publishers.csv --format csv --output results.csv
```
AI Bot Analysis
PolicyCheck analyzes 26 known AI crawlers including GPTBot, ClaudeBot, CCBot, and more. Perfect for two key use cases:
Publisher Use Case: Protecting Content
Check which AI training bots can access your content:
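For example, using the documented flags (the URL is illustrative):

```sh
policycheck analyze --url https://yoursite.example --format compact
```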
Shows a comprehensive breakdown of which bots are blocked vs. allowed.
Advertiser Use Case: Evaluating Publisher Partnerships
Analyze multiple publishers to see which ones block AI search engines (affecting brand visibility):
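For example (`publishers.csv` is an illustrative filename):

```sh
policycheck analyze --csv publishers.csv --format csv --output results.csv
```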
Example output:
```csv
URL,Status,Path Allowed,RSL Licenses,TDM Reserved,GPTBot,ClaudeBot,Google-Extended,Meta-ExternalAgent,CCBot,Bytespider,OAI-SearchBot,PerplexityBot
https://www.nytimes.com,Success,Yes,0,N/A,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked
https://github.com,Success,Yes,0,N/A,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed
https://techcrunch.com,Success,Yes,0,N/A,Blocked,Blocked,Blocked,Allowed,Blocked,Blocked,Allowed,Allowed
```
Key insights:
- NYTimes: Blocks all AI bots (zero AI search visibility)
- GitHub: Allows all AI bots (maximum AI discoverability)
- TechCrunch: Selectively blocks training bots, allows some search bots
Perfect for advertisers evaluating whether publisher placements will appear in ChatGPT, Perplexity, Claude, etc.
RSL (Responsible Sourcing License) Support
PolicyCheck automatically detects RSL license directives from robots.txt files. RSL extends the Robots Exclusion Protocol to enable websites to declare governing license documents for automated crawlers.
How RSL Works
RSL introduces a `License:` directive that can be:
- Global: Outside any User-agent group (applies to all bots)
- Group-scoped: Inside a User-agent group (applies only to that bot)
Precedence rule: Group-scoped licenses override global licenses.
Example robots.txt with RSL
```
# Global license (applies to all bots unless overridden)
License: https://acme.com/global-license.xml

User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /
License: https://acme.com/gptbot-specific-license.xml
```
In this example:
- Most bots will see the global license
- GPTBot will see only the group-scoped license (global is ignored)
Real-world example: NYTimes blocks AI bots comprehensively:
```sh
policycheck analyze --url https://www.nytimes.com --user-agent GPTBot
# Shows: Blocked, with legal notice about prohibited uses
```
RSL in Output
PolicyCheck reports three license fields:
- `active_licenses`: The licenses that actually apply (follows RSL precedence rules)
- `global_licenses`: Licenses defined outside user-agent groups
- `group_licenses`: Licenses defined for the specific user agent
Compact output example:
```
================================================================================
URL: https://www.nytimes.com
Robots.txt: https://www.nytimes.com/robots.txt
Status: ✓ Success

User Agents:
  • *
  • GPTBot
  • ClaudeBot
  • (40+ more...)

Path Access (for GPTBot): ✗ Disallowed

AI Bot Analysis:
  🚫 GPTBot: Blocked
  🚫 ClaudeBot: Blocked
  🚫 CCBot: Blocked
  ✓ Googlebot: Allowed (with restrictions)

Sitemaps:
  • https://www.nytimes.com/sitemaps/new/news.xml.gz
  • (15+ more sitemaps)
================================================================================
```
JSON output example:
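A sketch of how the license fields might appear (field names taken from this README; the exact structure may differ):

```json
{
  "url": "https://example.com",
  "active_licenses": ["https://example.com/gptbot-license.xml"],
  "global_licenses": ["https://example.com/global-license.xml"],
  "group_licenses": ["https://example.com/gptbot-license.xml"]
}
```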
For more information about RSL, see the RSL Standard.
Content Signals (Cloudflare AI Policy Framework)
PolicyCheck automatically detects Content Signals - Cloudflare's framework for expressing AI usage preferences in robots.txt. Adopted by over 3.8 million domains using Cloudflare's managed robots.txt.
What are Content Signals?
Content Signals allow websites to express preferences for how their content can be used after it's been accessed. Three signals are defined:
- `search` - Traditional search indexing and results (not AI-generated summaries)
- `ai-input` - Inputting content into AI models (RAG, grounding, generative AI search)
- `ai-train` - Training or fine-tuning AI models
Format
```
User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
```
Values can be `yes` (permitted) or `no` (not permitted). Omitting a signal means no preference is expressed.
Example Output
Compact format:
Content Signals:
✓ search: yes
✗ ai-train: no
✓ ai-input: yes
CSV format includes columns: `CS-Search`, `CS-AI-Input`, `CS-AI-Train`
JSON format:
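A hedged sketch of the shape (field names assumed for illustration, not taken from a live response):

```json
{
  "content_signals": {
    "search": "yes",
    "ai_input": "yes",
    "ai_train": "no"
  }
}
```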
Real-World Example
Cloudflare's blog permits all AI usage:
- `search=yes` - Allowed in search indexes
- `ai-input=yes` - Allowed for AI search/RAG
- `ai-train=yes` - Allowed for model training
For more information, see Cloudflare's Content Signals announcement.
Output Formats
CSV Format (Best for Advertisers)
Perfect for bulk analysis with AI bot columns:
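For example (`publishers.csv` is an illustrative filename):

```sh
policycheck analyze --csv publishers.csv --format csv
```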
Creates a spreadsheet with major AI bots as columns - ideal for Excel/Google Sheets analysis:
```csv
URL,Status,Path Allowed,RSL Licenses,TDM Reserved,GPTBot,ClaudeBot,Google-Extended,Meta-ExternalAgent,CCBot,Bytespider,OAI-SearchBot,PerplexityBot
https://www.nytimes.com,Success,Yes,0,N/A,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked
https://github.com,Success,Yes,0,N/A,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed
```
Table Format (Default)
Perfect for quick checks:
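Using the documented flags (table is the default format):

```sh
policycheck analyze --url https://example.com
```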
Shows summary information in a clean ASCII table.
Compact Format
Detailed, human-readable output with full AI bot breakdown:
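For example:

```sh
policycheck analyze --url https://example.com --format compact
```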
Shows all details including blocked/allowed AI bots, paths, sitemaps, and licenses.
JSON Format
For programmatic use:
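For example:

```sh
policycheck analyze --url https://example.com --format json
```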
Includes ai_bot_analysis array with per-bot status - perfect for integration with other tools.
Running as a Service
Production API
The PolicyCheck API is available at https://policycheck-d7wv0g.fly.dev
No authentication required for public use. Rate limits may apply.
Run Your Own Server
Start the HTTP API server locally:
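Using the documented `serve` options (defaults shown explicitly):

```sh
policycheck serve --host 127.0.0.1 --port 3000
```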
Features:
- ✅ CORS enabled (all origins)
- ✅ JSON request/response
- ✅ Concurrent request handling
- ✅ 10s timeout per URL
API Endpoints
GET /health
Health check endpoint.
Response:
POST /analyze
Analyze robots.txt and RSL licenses for given URLs.
Request:
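A sketch of the request body (field names assumed from the CLI options, not from a published schema):

```json
{
  "urls": ["https://example.com", "https://example.org"],
  "user_agent": "GPTBot"
}
```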
Success Response:
Error Response (with failures):
Example with curl
Using production API:
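For example (the request body shape is assumed from the CLI options):

```sh
curl -X POST https://policycheck-d7wv0g.fly.dev/analyze \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "user_agent": "GPTBot"}'
```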
Using local server:
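Against a locally running server (same assumed body shape):

```sh
curl -X POST http://127.0.0.1:3000/analyze \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "user_agent": "GPTBot"}'
```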
CSV Batch Processing
Create a CSV file with URLs to check:
```csv
url
https://acme.com
https://example.org
https://test.io
```
Or with identifiers for tracking:
```csv
source_id,url
acme,https://acme.com
example,https://example.org
test,https://test.io
```
Analyze all URLs:
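With the documented `--csv` flag (`urls.csv` is an illustrative filename):

```sh
policycheck analyze --csv urls.csv
```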
PolicyCheck will automatically:
- Find the URL column (looks for headers containing "url", "link", "website", etc.)
- Default to the first column if no URL header is found
- Add `https://` prefix if missing
- Skip empty rows
- Process all URLs in parallel
Note: Only the URL column is used for analysis. Additional columns (like source_id) can be present for your own tracking but are ignored by PolicyCheck.
Integration Examples
Python
```python
import requests

API_URL = "https://policycheck-d7wv0g.fly.dev/analyze"

def check_compliance(urls, user_agent="*"):
    # Request body field names assumed from the CLI options
    response = requests.post(API_URL, json={"urls": urls, "user_agent": user_agent})
    response.raise_for_status()
    return response.json()

# Advertiser use case: check which publishers block AI bots
publishers = ["https://www.nytimes.com", "https://techcrunch.com"]
result = check_compliance(publishers)

# Check specific AI bots
gptbot_result = check_compliance(publishers, user_agent="GPTBot")
```
Node.js
```javascript
const axios = require('axios');

// Advertiser use case: analyze publisher AI visibility
const publishers = ['https://www.nytimes.com', 'https://techcrunch.com'];

async function main() {
  // Request body field names assumed from the CLI options
  const result = await axios.post('https://policycheck-d7wv0g.fly.dev/analyze', {
    urls: publishers,
    user_agent: 'GPTBot',
  });
  console.log(result.data);
}

main();
```
Go
```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// AnalyzeRequest field names are assumed from the CLI options.
type AnalyzeRequest struct {
	URLs      []string `json:"urls"`
	UserAgent string   `json:"user_agent"`
}

func checkCompliance(urls []string, userAgent string) (*http.Response, error) {
	body, err := json.Marshal(AnalyzeRequest{URLs: urls, UserAgent: userAgent})
	if err != nil {
		return nil, err
	}
	return http.Post("https://policycheck-d7wv0g.fly.dev/analyze", "application/json", bytes.NewReader(body))
}

func main() {
	if resp, err := checkCompliance([]string{"https://example.com"}, "GPTBot"); err == nil {
		resp.Body.Close()
	}
}
```
Deployment
Docker
Using Pre-built Image (Recommended)
Pull and run the official image from GitHub Container Registry:
```sh
# Latest version
docker pull ghcr.io/openattribution-org/policycheck:latest
docker run -p 3000:3000 ghcr.io/openattribution-org/policycheck:latest

# Specific version (replace <version> with a published tag)
docker pull ghcr.io/openattribution-org/policycheck:<version>
```
Multi-platform images available for linux/amd64 and linux/arm64.
Building from Source
```dockerfile
FROM rust:1.92-slim AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/policycheck /usr/local/bin/policycheck
EXPOSE 3000
CMD ["policycheck", "serve", "--host", "0.0.0.0", "--port", "3000"]
```
Build and run:
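A minimal sketch (the image tag is illustrative):

```sh
docker build -t policycheck .
docker run -p 3000:3000 policycheck
```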
Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: policycheck
spec:
  replicas: 3
  selector:
    matchLabels:
      app: policycheck
  template:
    metadata:
      labels:
        app: policycheck
    spec:
      containers:
        - name: policycheck
          image: ghcr.io/openattribution-org/policycheck:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: policycheck-service
spec:
  selector:
    app: policycheck
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer
```
Podman Compose
For local development or production deployments using podman-compose:
```yaml
# podman-compose.yml
services:
  policycheck:
    image: ghcr.io/openattribution-org/policycheck:latest
    ports:
      - "3000:3000"
    healthcheck:
      test:
      interval: 30s
      timeout: 10s
      retries: 3
```
Run with:
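From the directory containing the compose file:

```sh
podman-compose up -d
```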
Roadmap
✅ Completed
- Robots.txt parsing (REP/RFC 9309)
- RSL license detection
- User agent matching
- Crawl delay detection
- Sitemap discovery
- Path permission checking
- CSV batch processing
- HTTP API server
- Multiple output formats
🚧 In Progress
- TDM (Text & Data Mining) policy detection (`/.well-known/tdmrep.json`)
- Security contact discovery (`/.well-known/security.txt`)
- Privacy control detection (DNT, GPC)
📋 Planned
- AI plugin manifest detection (`/.well-known/ai-plugin.json`)
- OpenID configuration for gated content
- Caching layer for repeated checks
- GitHub Action for PR compliance checks
- Pre-commit hook for URL validation
Command Reference
policycheck analyze
Analyze robots.txt and RSL licenses from URLs.
Options:
- `-u, --url <URL>` - URL to analyze (can be repeated)
- `-c, --csv <PATH>` - CSV file containing URLs
- `-a, --user-agent <AGENT>` - User agent to check (default: `*`)
- `-f, --format <FORMAT>` - Output format: table, json, csv, compact (default: table)
- `-o, --output <PATH>` - Save output to file
policycheck serve
Start HTTP API server.
Options:
- `-p, --port <PORT>` - Port to listen on (default: 3000)
- `--host <HOST>` - Host to bind to (default: 127.0.0.1)
Performance
PolicyCheck is designed for speed:
- Concurrent analysis: Multiple URLs analyzed in parallel
- Optimized builds: Release builds use LTO and aggressive optimization
- Battle-tested parser: Based on `texting_robots`, tested against 34M+ real-world files
- Low memory footprint: Efficient parsing with minimal allocations
Typical performance:
- Single URL analysis: ~50-200ms (network dependent)
- 100 URLs analyzed concurrently: ~2-5 seconds
Limitations and Considerations
Datacenter IP Blocking
The Paradox: robots.txt exists for bots to check before crawling, but some sites block datacenter IPs, preventing policy checkers from accessing robots.txt.
Why this happens:
- Sites like Medium block cloud provider IP ranges to prevent scraping
- PolicyCheck runs from cloud infrastructure (Fly.io)
- Appears as "generic scraper" rather than "compliance checker"
How legitimate crawlers solve this:
- IP whitelisting - Googlebot, GPTBot, ClaudeBot use published IP ranges that sites whitelist
- Reverse DNS verification - Sites verify bot identity via DNS lookups
- User agent + IP combo - Both must match expected patterns
Impact on PolicyCheck:
- ✅ Works: Most sites (GitHub, Cloudflare, NYTimes, etc.)
- ❌ Blocked: Some sites that aggressively block datacenter IPs (e.g., Medium)
- 💡 Workaround: Test locally with `cargo run` or use sites that don't block datacenter IPs
Why this matters: If compliance checkers are blocked, publishers can't verify their own policies are working correctly. This is a gap in the current web crawling ecosystem.
Security Considerations
- Input validation: URLs are validated before processing
- Size limits: robots.txt files limited to 500KB (Google's recommendation)
- Timeouts: HTTP requests timeout after 10 seconds
- No arbitrary code execution: Pure parsing, no eval or dynamic code
- CORS enabled: API server has CORS enabled by default
Standards Compliance
PolicyCheck implements the following standards:
- ✅ RFC 9309: Robots Exclusion Protocol (REP)
- ✅ RSL Standard: Responsible Sourcing License
- ✅ Content Signals: Cloudflare's AI Policy Framework (CC0 License)
- 🚧 W3C TDMRep: Text and Data Mining Reservation Protocol (planned)
- 🚧 RFC 9116: security.txt (planned)
- 🚧 RFC 8615: Well-Known URIs (planned)
Contributing
PolicyCheck is part of the OpenAttribution initiative. Contributions welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests if applicable
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see LICENSE for details.
Third-party software notices are in NOTICE.
Key Dependencies
- texting_robots (MIT OR Apache-2.0) - Robust robots.txt parsing by @Smerity
- reqwest (MIT OR Apache-2.0) - HTTP client
- clap (MIT OR Apache-2.0) - CLI argument parsing
- axum (MIT) - HTTP server framework
- See NOTICE for complete attribution list
OpenAttribution Initiative
PolicyCheck is built for the OpenAttribution initiative, which aims to make web attribution transparent, accessible, and machine-readable.
Mission: Enable responsible AI development through clear content licensing and attribution standards.
Support
- 🐛 Report issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Contact: openattribution.org
- 🌐 Website: OpenAttribution.org
Acknowledgments
Built with ❤️ by the OpenAttribution community.
Special thanks to:
- @Smerity for texting_robots
- The Rust community for excellent tooling
- Everyone contributing to open web standards
Made with Rust 🦀 | Part of OpenAttribution 🔍 | MIT Licensed 📜