# 🔍 PolicyCheck
**Web Attribution and Compliance Scanner**
A fast, portable tool for checking web scraping compliance across robots.txt, RSL licenses, and TDM policies. Built with Rust for the [OpenAttribution](https://openattribution.org) initiative.
[crates.io](https://crates.io/crates/policycheck) · [MIT License](LICENSE) · [Rust](https://www.rust-lang.org/) · [OpenAttribution](https://openattribution.org)
## What is PolicyCheck?
PolicyCheck helps you **scrape responsibly** by checking multiple compliance signals:
- ✅ **Robots.txt** - What paths you can crawl (REP/RFC 9309)
- 📜 **RSL Licenses** - Required licensing terms (Responsible Sourcing License)
- 🎯 **Content Signals** - AI usage preferences (Cloudflare's policy framework)
- 🤖 **TDM Policies** - Text & Data Mining permissions (coming soon)
- 🔒 **Privacy Controls** - DNT, GPC signals (coming soon)
- 📧 **Security Contacts** - Who to contact about scraping (coming soon)
## Features
- 🤖 **AI Bot Analysis** - Check 26 known AI crawlers (GPTBot, ClaudeBot, CCBot, etc.)
- 🎯 **Content Signals** - Detect Cloudflare's AI policy signals (search, ai-input, ai-train)
- 📊 **CSV Export** - Major AI bots as columns for advertiser analysis
- 🚀 **Fast** - Built with Rust, battle-tested parser (34M+ robots.txt files)
- 📦 **Portable** - Single binary, no dependencies
- 🔍 **Comprehensive** - User agents, crawl delays, sitemaps, paths, licenses
- 📜 **RSL License Detection** - Automatically finds Responsible Sourcing Licenses
- 📈 **Multiple Formats** - Table, JSON, CSV, or compact text output
- 🌐 **HTTP API** - Run as a service for integration
- 📝 **CSV Batch Processing** - Analyze thousands of URLs concurrently
- ⚡ **Concurrent** - Parallel URL analysis
## Web UI
Try PolicyCheck instantly at **[openattribution.org/policycheck](https://openattribution.org/policycheck/)**
- 🌐 No installation required
- 📊 Interactive analysis with visual results
- 📥 Export to CSV for bulk analysis
- 🤖 See AI bot blocking status at a glance
Perfect for quick checks before integrating the API or CLI.
## Quick Start
### Installation
#### From crates.io (Recommended)
Requires Rust 1.75+:
```bash
cargo install policycheck
```
#### From Source
For development or the latest unreleased features:
```bash
git clone https://github.com/openattribution-org/policycheck.git
cd policycheck
cargo build --release
```
The binary will be at `target/release/policycheck`.
### Basic Usage
```bash
# Analyze a single URL
policycheck analyze --url https://www.nytimes.com
# Check multiple URLs
policycheck analyze \
  --url https://www.nytimes.com \
  --url https://github.com \
  --url https://techcrunch.com
# Analyze from CSV file (advertiser use case)
policycheck analyze --csv publishers.csv --format csv --output results.csv
# Check for specific user agent
policycheck analyze --url https://www.nytimes.com --user-agent GPTBot
# Output as JSON
policycheck analyze --url https://www.nytimes.com --format json
# Output as CSV with AI bot columns
policycheck analyze --url https://www.nytimes.com --format csv
# Save to file
policycheck analyze --url https://www.nytimes.com --output results.json
```
## AI Bot Analysis
PolicyCheck analyzes **26 known AI crawlers** including GPTBot, ClaudeBot, CCBot, and more. Perfect for two key use cases:
### Publisher Use Case: Protecting Content
Check which AI training bots can access your content:
```bash
policycheck analyze --url https://www.nytimes.com --format compact
```
Shows a comprehensive breakdown of which bots are blocked versus allowed.
### Advertiser Use Case: Evaluating Publisher Partnerships
Analyze multiple publishers to see which ones block AI search engines (affecting brand visibility):
```bash
policycheck analyze --csv publishers.csv --format csv --output analysis.csv
```
**Example output:**
```csv
URL,Status,Path Allowed,RSL Licenses,TDM Reserved,GPTBot,ClaudeBot,Google-Extended,Meta-ExternalAgent,CCBot,Bytespider,OAI-SearchBot,PerplexityBot
https://www.nytimes.com,Success,Yes,0,N/A,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked
https://github.com,Success,Yes,0,N/A,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed
https://techcrunch.com,Success,Yes,0,N/A,Blocked,Blocked,Blocked,Allowed,Blocked,Blocked,Allowed,Allowed
```
**Key insights:**
- **NYTimes**: Blocks all AI bots (zero AI search visibility)
- **GitHub**: Allows all AI bots (maximum AI discoverability)
- **TechCrunch**: Selectively blocks training bots, allows some search bots
Perfect for advertisers evaluating whether publisher placements will appear in ChatGPT, Perplexity, Claude, etc.
## RSL (Responsible Sourcing License) Support
PolicyCheck automatically detects **RSL license directives** from robots.txt files. RSL extends the Robots Exclusion Protocol to enable websites to declare governing license documents for automated crawlers.
### How RSL Works
RSL introduces a `License:` directive that can be:
- **Global**: Outside any User-agent group (applies to all bots)
- **Group-scoped**: Inside a User-agent group (applies only to that bot)
**Precedence rule**: Group-scoped licenses override global licenses.
### Example robots.txt with RSL
```
# Global license (applies to all bots unless overridden)
License: https://acme.com/global-license.xml

User-agent: *
Disallow: /private/
Allow: /public/

User-agent: GPTBot
Disallow: /
License: https://acme.com/gptbot-specific-license.xml
```
In this example:
- Most bots will see the global license
- GPTBot will see only the group-scoped license (global is ignored)
**Real-world example:** NYTimes blocks AI bots comprehensively:
```bash
policycheck analyze --url https://www.nytimes.com --user-agent GPTBot
# Shows: Blocked, with legal notice about prohibited uses
```
### RSL in Output
PolicyCheck reports three license fields:
- **`active_licenses`**: The licenses that actually apply (follows RSL precedence rules)
- **`global_licenses`**: Licenses defined outside user-agent groups
- **`group_licenses`**: Licenses defined for the specific user agent
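The precedence rule reduces to a one-line resolver. A sketch (`resolve_active_licenses` is a hypothetical name for illustration, not PolicyCheck's API):

```python
def resolve_active_licenses(global_licenses, group_licenses):
    """RSL precedence: group-scoped licenses, when present for the
    matched user agent, completely override global ones."""
    return list(group_licenses) if group_licenses else list(global_licenses)

# Most bots: no group-scoped license, so the global one applies
print(resolve_active_licenses(["https://acme.com/global-license.xml"], []))
# ['https://acme.com/global-license.xml']

# GPTBot: the group-scoped license wins; the global one is ignored
print(resolve_active_licenses(
    ["https://acme.com/global-license.xml"],
    ["https://acme.com/gptbot-specific-license.xml"],
))
# ['https://acme.com/gptbot-specific-license.xml']
```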
**Compact output example:**
```
================================================================================
URL: https://www.nytimes.com
Robots.txt: https://www.nytimes.com/robots.txt
Status: ✓ Success

User Agents:
  • *
  • GPTBot
  • ClaudeBot
  • (40+ more...)

Path Access (for GPTBot): ✗ Disallowed

AI Bot Analysis:
  🚫 GPTBot: Blocked
  🚫 ClaudeBot: Blocked
  🚫 CCBot: Blocked
  ✓ Googlebot: Allowed (with restrictions)

Sitemaps:
  • https://www.nytimes.com/sitemaps/new/news.xml.gz
  • (15+ more sitemaps)
================================================================================
```
**JSON output example:**
```json
{
  "url": "https://github.com",
  "robots_url": "https://github.com/robots.txt",
  "status": "success",
  "user_agents": ["*"],
  "ai_bot_analysis": [
    {"bot_name": "GPTBot", "company": "OpenAI", "category": "Training", "status": "allowed"},
    {"bot_name": "ClaudeBot", "company": "Anthropic", "category": "Training", "status": "allowed"}
  ],
  "global_licenses": [],
  "group_licenses": [],
  "active_licenses": [],
  "crawl_delay": null,
  "sitemaps": ["https://github.com/sitemap.xml"],
  "is_path_allowed": true
}
```
For more information about RSL, see the [RSL Standard](https://rslstandard.org/rsl#_4-associating-rsl-licenses-with-digital-assets).
## Content Signals (Cloudflare AI Policy Framework)
PolicyCheck automatically detects **Content Signals** - Cloudflare's framework for expressing AI usage preferences in robots.txt. Adopted by over 3.8 million domains using Cloudflare's managed robots.txt.
### What are Content Signals?
Content Signals allow websites to express preferences for how their content can be used **after** it's been accessed. Three signals are defined:
- **`search`** - Traditional search indexing and results (not AI-generated summaries)
- **`ai-input`** - Inputting content into AI models (RAG, grounding, generative AI search)
- **`ai-train`** - Training or fine-tuning AI models
### Format
```
User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
```
Values can be `yes` (permitted) or `no` (not permitted). Omitting a signal means no preference is expressed.
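A minimal parser for the directive might look like this (a sketch of the documented `key=value` list format, not PolicyCheck's internal parser):

```python
def parse_content_signal(line):
    """Parse a 'Content-Signal:' robots.txt line into a dict of
    signal -> 'yes'/'no'. Omitted signals simply don't appear."""
    _, _, value = line.partition(":")
    signals = {}
    for part in value.split(","):
        if "=" in part:
            key, _, val = part.strip().partition("=")
            signals[key.strip().lower()] = val.strip().lower()
    return signals

print(parse_content_signal("Content-Signal: search=yes, ai-train=no, ai-input=yes"))
# {'search': 'yes', 'ai-train': 'no', 'ai-input': 'yes'}
```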
### Example Output
**Compact format:**
```
Content Signals:
  ✓ search: yes
  ✗ ai-train: no
  ✓ ai-input: yes
```
**CSV format** includes columns: `CS-Search`, `CS-AI-Input`, `CS-AI-Train`
**JSON format:**
```json
{
  "content_signal_search": "yes",
  "content_signal_ai_input": "yes",
  "content_signal_ai_train": "no"
}
```
### Real-World Example
```bash
policycheck analyze --url https://blog.cloudflare.com --format compact
```
Cloudflare's blog permits all AI usage:
- `search=yes` - Allowed in search indexes
- `ai-input=yes` - Allowed for AI search/RAG
- `ai-train=yes` - Allowed for model training
For more information, see [Cloudflare's Content Signals announcement](https://blog.cloudflare.com/content-signals-policy).
## Output Formats
### CSV Format (Best for Advertisers)
**Perfect for bulk analysis with AI bot columns:**
```bash
policycheck analyze --csv publishers.csv --format csv --output analysis.csv
```
Creates a spreadsheet with major AI bots as columns - ideal for Excel/Google Sheets analysis:
```csv
URL,Status,Path Allowed,RSL Licenses,TDM Reserved,GPTBot,ClaudeBot,Google-Extended,Meta-ExternalAgent,CCBot,Bytespider,OAI-SearchBot,PerplexityBot
https://www.nytimes.com,Success,Yes,0,N/A,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked,Blocked
https://github.com,Success,Yes,0,N/A,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed,Allowed
```
### Table Format (Default)
Perfect for quick checks:
```bash
policycheck analyze --url https://github.com --format table
```
Shows summary information in a clean ASCII table.
### Compact Format
Detailed, human-readable output with full AI bot breakdown:
```bash
policycheck analyze --url https://www.nytimes.com --format compact
```
Shows all details including blocked/allowed AI bots, paths, sitemaps, and licenses.
### JSON Format
For programmatic use:
```bash
policycheck analyze --url https://www.nytimes.com --format json > results.json
```
Includes `ai_bot_analysis` array with per-bot status - perfect for integration with other tools.
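For instance, pulling the blocked bots out of a saved result takes a few lines (the sample below is trimmed to the fields used, matching the JSON shape shown above):

```python
import json

# Trimmed sample of PolicyCheck's JSON output (field names from the docs above)
sample = json.loads("""
{
  "url": "https://www.nytimes.com",
  "is_path_allowed": false,
  "ai_bot_analysis": [
    {"bot_name": "GPTBot", "status": "blocked"},
    {"bot_name": "OAI-SearchBot", "status": "allowed"}
  ]
}
""")

blocked = [b["bot_name"] for b in sample["ai_bot_analysis"] if b["status"] == "blocked"]
print(blocked)  # ['GPTBot']
```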
## Running as a Service
### Production API
The PolicyCheck API is available at **https://policycheck-d7wv0g.fly.dev**
No authentication required for public use. Rate limits may apply.
### Run Your Own Server
Start the HTTP API server locally:
```bash
policycheck serve --port 3000 --host 0.0.0.0
```
**Features:**
- ✅ CORS enabled (all origins)
- ✅ JSON request/response
- ✅ Concurrent request handling
- ✅ 10s timeout per URL
### API Endpoints
#### `GET /health`
Health check endpoint.
**Response:**
```json
{
  "status": "healthy",
  "service": "policycheck",
  "version": "0.1.0"
}
```
#### `POST /analyze`
Analyze robots.txt and RSL licenses for given URLs.
**Request:**
```json
{
  "urls": ["https://www.nytimes.com", "https://github.com"],
  "user_agent": "GPTBot"
}
```
**Success Response:**
```json
{
  "total": 2,
  "successful": 2,
  "failed": 0,
  "results": [
    {
      "url": "https://www.nytimes.com",
      "robots_url": "https://www.nytimes.com/robots.txt",
      "status": "success",
      "user_agents": ["*", "GPTBot", "ClaudeBot", "..."],
      "crawl_delay": null,
      "sitemaps": ["https://www.nytimes.com/sitemaps/new/news.xml.gz"],
      "allowed_paths": [],
      "disallowed_paths": ["/"],
      "is_path_allowed": false,
      "global_licenses": [],
      "group_licenses": [],
      "active_licenses": [],
      "ai_bot_analysis": [
        {"bot_name": "GPTBot", "company": "OpenAI", "category": "Training", "status": "blocked"},
        {"bot_name": "ClaudeBot", "company": "Anthropic", "category": "Training", "status": "blocked"}
      ],
      "error": null
    }
  ]
}
```
**Error Response (with failures):**
```json
{
  "total": 2,
  "successful": 1,
  "failed": 1,
  "results": [
    {
      "url": "https://invalid-domain-xyz.com",
      "robots_url": "https://invalid-domain-xyz.com/robots.txt",
      "status": "fetch_error",
      "error": "Failed to fetch robots.txt",
      "user_agents": [],
      "ai_bot_analysis": []
    },
    {
      "url": "https://github.com",
      "status": "success",
      "error": null
    }
  ]
}
```
### Example with curl
**Using production API:**
```bash
curl -X POST https://policycheck-d7wv0g.fly.dev/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.nytimes.com"],
    "user_agent": "GPTBot"
  }'
```
**Using local server:**
```bash
curl -X POST http://localhost:3000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://www.nytimes.com"],
    "user_agent": "GPTBot"
  }'
```
## CSV Batch Processing
Create a CSV file with URLs to check:
```csv
url
https://acme.com
https://example.org
https://test.io
```
Or with identifiers for tracking:
```csv
source_id,url
acme,https://acme.com
example,https://example.org
test,https://test.io
```
Analyze all URLs:
```bash
policycheck analyze --csv partners.csv --format compact > results.txt
```
PolicyCheck will automatically:
- Find the URL column (looks for headers containing "url", "link", "website", etc.)
- Default to the first column if no URL header is found
- Add `https://` prefix if missing
- Skip empty rows
- Process all URLs in parallel
**Note**: Only the URL column is used for analysis. Additional columns (like `source_id`) can be present for your own tracking but are ignored by PolicyCheck.
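Those rules can be sketched in a few lines (a hypothetical re-implementation for illustration, not PolicyCheck's actual code):

```python
import csv
import io

URL_HINTS = ("url", "link", "website")

def extract_urls(csv_text):
    """Mimic the column-detection rules above: find a URL-like header,
    fall back to the first column, add a missing scheme, skip blanks."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Find the URL column by header hint; default to the first column
    col = next((i for i, h in enumerate(header)
                if any(hint in h.lower() for hint in URL_HINTS)), 0)
    urls = []
    for row in data:
        if not row or not row[col].strip():
            continue  # skip empty rows
        url = row[col].strip()
        if not url.startswith(("http://", "https://")):
            url = "https://" + url  # add missing scheme
        urls.append(url)
    return urls

print(extract_urls("source_id,url\nacme,acme.com\nexample,https://example.org\n"))
# ['https://acme.com', 'https://example.org']
```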
## Integration Examples
### Python
```python
import requests

def check_ai_bot_access(urls, user_agent="GPTBot"):
    response = requests.post(
        "http://localhost:3000/analyze",
        json={"urls": urls, "user_agent": user_agent}
    )
    return response.json()

# Advertiser use case: check which publishers block AI bots
publishers = [
    "https://www.nytimes.com",
    "https://github.com",
    "https://techcrunch.com"
]

result = check_ai_bot_access(publishers)
for site in result['results']:
    print(f"\n{site['url']}")
    print(f"  GPTBot access: {'❌ Blocked' if not site['is_path_allowed'] else '✅ Allowed'}")
    # Check specific AI bots
    for bot in site['ai_bot_analysis']:
        if bot['bot_name'] in ['GPTBot', 'OAI-SearchBot', 'PerplexityBot']:
            status = '❌' if bot['status'] == 'blocked' else '✅'
            print(f"  {status} {bot['bot_name']}")
```
### Node.js
```javascript
const axios = require('axios');

async function checkAIBotAccess(urls, userAgent = 'GPTBot') {
  const response = await axios.post('http://localhost:3000/analyze', {
    urls,
    user_agent: userAgent
  });
  return response.data;
}

// Advertiser use case: analyze publisher AI visibility
const publishers = [
  'https://www.nytimes.com',
  'https://github.com',
  'https://techcrunch.com'
];

// Top-level await isn't available in CommonJS, so run inside a promise chain
checkAIBotAccess(publishers).then(result => {
  console.log(`Analyzed ${result.total} publishers`);
  result.results.forEach(site => {
    const blocked = site.ai_bot_analysis.filter(b => b.status === 'blocked').length;
    const allowed = site.ai_bot_analysis.filter(b => b.status === 'allowed').length;
    console.log(`${site.url}: ${blocked} blocked, ${allowed} allowed`);
  });
});
```
### Go
```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

type AnalyzeRequest struct {
	URLs      []string `json:"urls"`
	UserAgent string   `json:"user_agent"`
}

// AnalyzeResponse mirrors the /analyze JSON shown above
// (only the fields used here; the full response carries more detail).
type AnalyzeResponse struct {
	Total      int          `json:"total"`
	Successful int          `json:"successful"`
	Failed     int          `json:"failed"`
	Results    []SiteResult `json:"results"`
}

type SiteResult struct {
	URL           string `json:"url"`
	Status        string `json:"status"`
	IsPathAllowed bool   `json:"is_path_allowed"`
}

func checkCompliance(urls []string, userAgent string) (*AnalyzeResponse, error) {
	reqBody := AnalyzeRequest{URLs: urls, UserAgent: userAgent}
	jsonData, err := json.Marshal(reqBody)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(
		"http://localhost:3000/analyze",
		"application/json",
		bytes.NewBuffer(jsonData),
	)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var result AnalyzeResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, err
	}
	return &result, nil
}
```
## Deployment
### Docker
#### Using Pre-built Image (Recommended)
Pull and run the official image from GitHub Container Registry:
```bash
# Latest version
docker pull ghcr.io/openattribution-org/policycheck:latest
docker run -p 3000:3000 ghcr.io/openattribution-org/policycheck:latest
# Specific version
docker pull ghcr.io/openattribution-org/policycheck:0.2.0
docker run -p 3000:3000 ghcr.io/openattribution-org/policycheck:0.2.0
```
Multi-platform images available for `linux/amd64` and `linux/arm64`.
#### Building from Source
```dockerfile
FROM rust:1.92-slim AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/policycheck /usr/local/bin/policycheck
EXPOSE 3000
CMD ["policycheck", "serve", "--host", "0.0.0.0", "--port", "3000"]
```
Build and run:
```bash
docker build -t policycheck .
docker run -p 3000:3000 policycheck
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: policycheck
spec:
  replicas: 3
  selector:
    matchLabels:
      app: policycheck
  template:
    metadata:
      labels:
        app: policycheck
    spec:
      containers:
        - name: policycheck
          image: ghcr.io/openattribution-org/policycheck:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: policycheck-service
spec:
  selector:
    app: policycheck
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer
```
### Podman Compose
For local development or production deployments using podman-compose:
```yaml
# podman-compose.yml
services:
  policycheck:
    image: ghcr.io/openattribution-org/policycheck:latest
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```
Run with:
```bash
podman-compose up -d
```
## Roadmap
### ✅ Completed
- [x] Robots.txt parsing (REP/RFC 9309)
- [x] RSL license detection
- [x] User agent matching
- [x] Crawl delay detection
- [x] Sitemap discovery
- [x] Path permission checking
- [x] CSV batch processing
- [x] HTTP API server
- [x] Multiple output formats
### 🚧 In Progress
- [ ] TDM (Text & Data Mining) policy detection (`/.well-known/tdmrep.json`)
- [ ] Security contact discovery (`/.well-known/security.txt`)
- [ ] Privacy control detection (DNT, GPC)
### 📋 Planned
- [ ] AI plugin manifest detection (`/.well-known/ai-plugin.json`)
- [ ] OpenID configuration for gated content
- [ ] Caching layer for repeated checks
- [ ] GitHub Action for PR compliance checks
- [ ] Pre-commit hook for URL validation
## Command Reference
### `policycheck analyze`
Analyze robots.txt and RSL licenses from URLs.
**Options:**
- `-u, --url <URL>` - URL to analyze (can be repeated)
- `-c, --csv <PATH>` - CSV file containing URLs
- `-a, --user-agent <AGENT>` - User agent to check (default: "*")
- `-f, --format <FORMAT>` - Output format: table, json, csv, compact (default: table)
- `-o, --output <PATH>` - Save output to file
### `policycheck serve`
Start HTTP API server.
**Options:**
- `-p, --port <PORT>` - Port to listen on (default: 3000)
- `--host <HOST>` - Host to bind to (default: 127.0.0.1)
## Performance
PolicyCheck is designed for speed:
- **Concurrent analysis**: Multiple URLs analyzed in parallel
- **Optimized builds**: Release builds use LTO and aggressive optimization
- **Battle-tested parser**: Based on `texting_robots`, tested against 34M+ real-world files
- **Low memory footprint**: Efficient parsing with minimal allocations
Typical performance:
- Single URL analysis: ~50-200ms (network dependent)
- 100 URLs analyzed concurrently: ~2-5 seconds
## Limitations and Considerations
### Datacenter IP Blocking
**The Paradox:** robots.txt exists for bots to check before crawling, but some sites block datacenter IPs, preventing policy checkers from accessing robots.txt.
**Why this happens:**
- Sites like Medium block cloud provider IP ranges to prevent scraping
- PolicyCheck runs from cloud infrastructure (Fly.io)
- Appears as "generic scraper" rather than "compliance checker"
**How legitimate crawlers solve this:**
- **IP whitelisting** - Googlebot, GPTBot, ClaudeBot use published IP ranges that sites whitelist
- **Reverse DNS verification** - Sites verify bot identity via DNS lookups
- **User agent + IP combo** - Both must match expected patterns
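The reverse-DNS check works along these lines (a sketch; the Googlebot domains are the ones Google documents, and `verify_googlebot` needs network access, so only the pure hostname check is exercised here):

```python
import socket

# Official crawler domains; Googlebot's are documented by Google,
# other vendors publish their own
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_matches(hostname, suffixes=GOOGLEBOT_SUFFIXES):
    """Pure check: does a reverse-DNS hostname fall under an official domain?"""
    return hostname.endswith(suffixes)

def verify_googlebot(ip):
    """Two-step verification: reverse DNS, then forward-confirm the result
    so an attacker can't just set a fake PTR record."""
    hostname = socket.gethostbyaddr(ip)[0]    # reverse lookup
    if not hostname_matches(hostname):
        return False
    forward = socket.gethostbyname(hostname)  # forward-confirm
    return forward == ip

print(hostname_matches("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_matches("fake.example.com"))                 # False
```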
**Impact on PolicyCheck:**
- ✅ Works: Most sites (GitHub, Cloudflare, NYTimes, etc.)
- ❌ Blocked: Some sites that aggressively block datacenter IPs (e.g., Medium)
- 💡 Workaround: Test locally with `cargo run` or use sites that don't block datacenter IPs
**Why this matters:** If compliance checkers are blocked, publishers can't verify their own policies are working correctly. This is a gap in the current web crawling ecosystem.
## Security Considerations
- **Input validation**: URLs are validated before processing
- **Size limits**: robots.txt files limited to 500KB (Google's recommendation)
- **Timeouts**: HTTP requests timeout after 10 seconds
- **No arbitrary code execution**: Pure parsing, no eval or dynamic code
- **CORS enabled**: API server has CORS enabled by default
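The size and timeout limits combine into a capped fetch along these lines (a client-side Python sketch, not the Rust implementation):

```python
from urllib.request import urlopen

MAX_ROBOTS_SIZE = 500 * 1024  # 500 KB cap, per Google's guidance

def truncate_robots(body: bytes) -> str:
    """Enforce the size limit; truncate rather than reject oversized files."""
    return body[:MAX_ROBOTS_SIZE].decode("utf-8", errors="replace")

def fetch_robots(url: str, timeout: float = 10.0) -> str:
    """Fetch robots.txt with the time and size limits listed above."""
    with urlopen(url, timeout=timeout) as resp:
        # Read at most one byte past the cap so truncation is detectable
        return truncate_robots(resp.read(MAX_ROBOTS_SIZE + 1))
```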
## Standards Compliance
PolicyCheck implements the following standards:
- ✅ **RFC 9309**: Robots Exclusion Protocol (REP)
- ✅ **RSL Standard**: Responsible Sourcing License
- ✅ **Content Signals**: Cloudflare's AI Policy Framework (CC0 License)
- 🚧 **W3C TDMRep**: Text and Data Mining Reservation Protocol (planned)
- 🚧 **RFC 9116**: security.txt (planned)
- 🚧 **RFC 8615**: Well-Known URIs (planned)
## Contributing
PolicyCheck is part of the [OpenAttribution](https://openattribution.org) initiative. Contributions welcome!
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests if applicable
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request
## License
This project is licensed under the MIT License - see [LICENSE](LICENSE) for details.
Third-party software notices are in [NOTICE](NOTICE).
### Key Dependencies
- **texting_robots** (MIT OR Apache-2.0) - Robust robots.txt parsing by [@Smerity](https://github.com/Smerity)
- **reqwest** (MIT OR Apache-2.0) - HTTP client
- **clap** (MIT OR Apache-2.0) - CLI argument parsing
- **axum** (MIT) - HTTP server framework
- See [NOTICE](NOTICE) for complete attribution list
## OpenAttribution Initiative
PolicyCheck is built for the [OpenAttribution](https://openattribution.org) initiative, which aims to make web attribution transparent, accessible, and machine-readable.
**Mission**: Enable responsible AI development through clear content licensing and attribution standards.
## Support
- 🐛 **Report issues**: [GitHub Issues](https://github.com/openattribution-org/policycheck/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/openattribution-org/policycheck/discussions)
- 📧 **Contact**: [openattribution.org](https://openattribution.org)
- 🌐 **Website**: [OpenAttribution.org](https://openattribution.org)
## Acknowledgments
Built with ❤️ by the OpenAttribution community.
Special thanks to:
- [@Smerity](https://github.com/Smerity) for texting_robots
- The Rust community for excellent tooling
- Everyone contributing to open web standards
---
**Made with Rust 🦀 | Part of OpenAttribution 🔍 | MIT Licensed 📜**