DataFold
A Rust-based distributed data platform with schema-based storage, AI-powered ingestion, and real-time data processing capabilities. DataFold provides a complete solution for distributed data management with automatic schema generation, field mapping, and extensible ingestion pipelines.
✨ Features
- 🤖 AI-Powered Data Ingestion - Automatic schema creation and field mapping using AI [initial prototype]
- 🔄 Real-Time Processing - Event-driven architecture with automatic transform execution [working]
- 🌐 Distributed Architecture - P2P networking with automatic peer discovery [untested]
- 📊 Flexible Schema System - Dynamic schema management with validation [working]
- 🔐 Permission Management - Fine-grained access control and trust-based permissions [working]
- ⚡ High Performance - Rust-based core with optimized storage and query execution [not yet benchmarked]
- ☁️ Serverless Ready - S3-backed storage for AWS Lambda and serverless deployments [working]
- 🔌 Extensible Ingestion - Plugin system for social media and external data sources [not yet begun]
🚀 Quick Start
Installation
Add DataFold to your Cargo.toml:
```toml
[dependencies]
datafold = "0.1.0"
```
Or install the CLI tools:
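One option, assuming you are working from a local checkout of the repository (the crates.io package name is not confirmed here):

```bash
# From the repository root: build and install the crate's binaries
cargo install --path .
```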
This provides three binaries:
- datafold_cli - Command-line interface
- datafold_http_server - HTTP server with web UI
- datafold_node - P2P node server
Optional TypeScript Bindings
The crate ships without generating TypeScript artifacts by default so it can
compile cleanly in any environment. If you need the auto-generated bindings for
the web UI, enable the ts-bindings feature when building or testing:
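For example, with standard Cargo feature flags:

```bash
# Build or test with TypeScript bindings generation enabled
cargo build --features ts-bindings
cargo test --features ts-bindings
```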
The feature keeps the ts-rs dependency optional and writes the generated
definitions to the existing bindings/ directory, matching the files already
checked into the repository.
Basic Usage
```rust
use datafold::DataFoldNode; // illustrative import path; adjust to the crate's public API
use serde_json::json;

#[tokio::main]
async fn main() {
    // Create a node and execute operations against it (see Core Concepts below).
}
```
Running the HTTP Server
```bash
# Start the HTTP server with web UI
cargo run --bin datafold_http_server
```
Then visit http://localhost:9001 for the web interface.
📖 Core Concepts
Schemas
DataFold uses dynamic schemas that define data structure and operations:
```rust
use datafold::{Query, Schema}; // illustrative import paths
use std::fs::read_to_string;

// Load a schema from a JSON file
let schema_json = read_to_string("path/to/schema.json")?;
let schema: Schema = serde_json::from_str(&schema_json)?;

// Execute operations against a node
let operation = Query { /* query fields */ };
let result = node.execute_operation(operation).await?;
```
AI-Powered Ingestion
Automatically analyze and ingest data from any source:
```rust
use datafold::ingestion::IngestionConfig; // illustrative import path

// Configure with OpenRouter API
let config = IngestionConfig { /* OpenRouter API key and model settings */ };

// Create the ingestion pipeline (constructor type name is illustrative)
let ingestion = IngestionCore::new(config)?;

// Process any JSON data
let result = ingestion.process_json_ingestion(data).await?; // `data`: any JSON value
```
Distributed Networking
Connect nodes in a P2P network:
```rust
use datafold::network::{NetworkConfig, NetworkCore}; // illustrative import paths

let network_config = NetworkConfig::default();
let network = NetworkCore::new(network_config).await?;

// Start networking
network.start().await?;

// Discover peers
let peers = network.discover_peers().await?;
```
🌐 Frontend Development
DataFold includes a comprehensive React frontend with a unified API client architecture that provides type-safe, standardized access to all backend operations.
Frontend API Clients
The frontend uses specialized API clients that eliminate boilerplate code and provide consistent error handling, caching, and authentication:
```typescript
import { schemaClient, securityClient, systemClient } from '../api/clients';

// Schema operations with automatic caching
const response = await schemaClient.getSchemas();
if (response.success) {
  const schemas = response.data; // Fully typed SchemaData[]
}

// System monitoring with intelligent caching
const status = await systemClient.getSystemStatus(); // 30-second cache

// Security operations with built-in validation
const verification = await securityClient.verifyMessage(signedMessage);
```
Key Features
- 🔒 Type Safety - Full TypeScript support with comprehensive interfaces
- ⚡ Intelligent Caching - Operation-specific caching (30s for status, 5m for schemas, 1h for keys)
- 🔄 Automatic Retries - Configurable retry logic with exponential backoff
- 🛡️ Error Handling - Standardized error types with user-friendly messages
- 🔐 Built-in Authentication - Automatic auth header management
- 📊 Request Deduplication - Prevents duplicate concurrent requests
- 🎯 Batch Operations - Efficient multi-request processing (see the sketch below)
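As a quick sketch of batching several calls (client and method names as shown above; plain Promise.all is used here, since this README does not document the dedicated batch API):

```typescript
import { schemaClient, systemClient } from '../api/clients';

// Issue several client calls concurrently; caching and request
// deduplication in the shared client layer still apply to each call.
const [schemasResponse, statusResponse] = await Promise.all([
  schemaClient.getSchemas(),
  systemClient.getSystemStatus(),
]);
```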
Available Clients
- SchemaClient - Schema management and SCHEMA-002 compliance
- SecurityClient - Authentication, key management, cryptographic operations
- SystemClient - System operations, logging, database management
- TransformClient - Data transformation and queue management
- IngestionClient - AI-powered data ingestion (60s timeout for AI processing)
- MutationClient - Data mutation operations and query execution
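All of the clients above are imported from the same module as in the earlier examples; whether each one is exported as a ready-to-use instance is an assumption here:

```typescript
// Assumed to be instance exports, mirroring schemaClient/securityClient/systemClient above
import {
  schemaClient,
  securityClient,
  systemClient,
  transformClient,
  ingestionClient,
  mutationClient,
} from '../api/clients';
```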
Error Handling
```typescript
import {
  isNetworkError,
  isAuthenticationError,
  isSchemaStateError
} from '../api/core/errors';

try {
  const response = await schemaClient.approveSchema('users');
} catch (error) {
  if (isAuthenticationError(error)) {
    redirectToLogin();
  } else if (isSchemaStateError(error)) {
    showMessage(`Schema "${error.schemaName}" is ${error.currentState}`);
  } else {
    showMessage(error.toUserMessage());
  }
}
```
Frontend Development Setup
```bash
# Start the backend server
cargo run --bin datafold_http_server

# In another terminal, start the React frontend (run from the frontend directory)
npm run dev
```
The frontend will be available at http://localhost:5173 with hot-reload.
Frontend Documentation
- Architecture Guide - Technical architecture and design patterns
- Developer Guide - Usage examples and best practices
- Migration Reference - Migration from direct fetch() usage
🔌 Extensible Ingestion
DataFold's adapter-based ingestion architecture is designed to support data from a variety of sources:
- Social Media APIs - Twitter, Facebook, Reddit, TikTok
- Real-time Streams - WebSockets, Server-Sent Events
- File Uploads - JSON, CSV, JSONL
- Webhooks - Real-time event processing
- Custom Adapters - Extensible plugin system
See SOCIAL_MEDIA_INGESTION_PROPOSAL.md for the complete ingestion architecture.
🛠️ Development Setup
Prerequisites
- Rust 1.70+ with Cargo
- Node.js 16+ (for web UI development)
Building from Source
```bash
# Clone the repository
git clone <repository-url>
cd datafold

# Install web UI dependencies (run from the frontend directory; Cargo resolves Rust dependencies automatically)
npm install

# Build all components
cargo build --workspace

# Run tests
cargo test --workspace
```
Running the Web UI
For development with hot-reload:
```bash
# Start the Rust backend
cargo run --bin datafold_http_server

# In another terminal, start the React frontend (run from the frontend directory)
npm run dev
```
The UI will be available at http://localhost:5173.
☁️ Serverless Deployment (S3 Storage)
DataFold can run in serverless environments like AWS Lambda using S3-backed storage:
```rust
use datafold::DataFoldNode; // illustrative; see the S3 Configuration Guide for exact types

async fn handler() {
    // Initialize a node with S3-backed storage instead of the local database.
}
```
Configuration can also be supplied through environment variables. See the S3 Configuration Guide for complete setup instructions, AWS Lambda deployment, and cost optimization.
📊 Examples
Loading Sample Data
Use the datafold_cli binary to load sample schemas, query data, and execute mutations; see the CLI Guide below for the exact commands.
Python Integration
See datafold_api_examples/ for Python scripts demonstrating:
- Schema management
- Data querying
- Mutations and updates
- User management
🔧 Configuration
DataFold is configured through JSON configuration files and ships with a default configuration.
Environment variables:
- OPENROUTER_API_KEY - API key for AI-powered ingestion
- DATAFOLD_CONFIG - Path to configuration file
- DATAFOLD_LOG_LEVEL - Logging level (trace, debug, info, warn, error)
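For example, in a shell (all values below are placeholders):

```bash
# Placeholder values; set these before starting a node or the HTTP server
export OPENROUTER_API_KEY="<your-openrouter-key>"
export DATAFOLD_CONFIG="/path/to/config.json"
export DATAFOLD_LOG_LEVEL=info
```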
🔐 Public Key Persistence
DataFold stores registered Ed25519 public keys in the sled database. When the node starts it loads all saved keys, and new keys are persisted as soon as they are registered. This keeps authentication intact across restarts. See PBI SEC-8 documentation for implementation details.
📚 Documentation
- API Documentation - Complete API reference
- CLI Guide - Command-line interface usage
- Ingestion Guide - AI-powered data ingestion
- S3 Storage Guide - Serverless deployment with S3
- Architecture - System design and patterns
🤝 Contributing
We welcome contributions! Please see our contributing guidelines:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Run cargo test --workspace
- Submit a pull request
📄 License
This project is licensed under either of two licenses, at your option; see the license files in the repository for details.
🌟 Community
- Issues - Report bugs and request features on GitHub Issues
- Discussions - Join discussions on GitHub Discussions
DataFold - Distributed data platform for the modern world 🚀