# Offline Intelligence Library
High-performance LLM inference engine with advanced memory management and context orchestration capabilities. Built in Rust for maximum performance across Windows, macOS, and Linux platforms.
## Architecture Overview
This project follows a dual-licensing model:
- Open Source Core (80%): Publicly available under Apache 2.0 license
- Proprietary Extensions (20%): Private plugins for advanced features
### Core Components (Public)
- LLM Integration Engine
- Basic Memory Management
- Configuration System
- Metrics and Telemetry
- API Proxy Layer
- Administration Interface
### Proprietary Extensions (Private/Future)

- Advanced Context Management (`context_engine`)
- Key-Value Cache System (`cache_management`)
- Enhanced Memory Components
- Advanced API Features
## Platform Support
| Platform | Architecture | Status |
|---|---|---|
| Windows | x86_64 | ✅ Supported |
| macOS | x86_64, ARM64 | ✅ Supported |
| Linux | x86_64, ARM64 | ✅ Supported |
## Language Bindings
The library provides native bindings for multiple languages:
### Native Rust

Direct access to all core functionality. The original snippet was truncated, so the crate path and item names below are illustrative assumptions:

```rust
// Hypothetical crate and item names; check the actual API documentation.
use offline_intelligence::{run_server, Config};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Load configuration from environment variables (LLAMA_BIN, MODEL_PATH, ...)
    let config = Config::from_env()?;
    // Start the inference server
    run_server(config).await?;
    Ok(())
}
```
### Python

Install via pip (package name assumed from the project name):

```bash
pip install offline-intelligence
```

Usage (the original snippet was truncated, so module and class names are illustrative assumptions):

```python
# Hypothetical module and class names; check the actual API documentation.
from offline_intelligence import Config

config = Config.from_env()
```
### C++

CMake integration (package and target names assumed):

```cmake
find_package(offline_intelligence REQUIRED)
target_link_libraries(my_app PRIVATE offline_intelligence::offline_intelligence)
```

Usage (the original snippet was truncated, so namespace and function names are illustrative assumptions):

```cpp
// Hypothetical namespace and function names; check the actual API documentation.
auto config = offline_intelligence::Config::from_env();
offline_intelligence::run_server(config);
```
### JavaScript/Node.js

NPM package (package name assumed from the project name):

```bash
npm install offline-intelligence
```

Usage (the original snippet was truncated, so module and function names are illustrative assumptions):

```javascript
// Hypothetical module and function names; check the actual API documentation.
const { Config, runServer } = require('offline-intelligence');

const config = Config.fromEnv();
runServer(config);
```
### Java

Maven dependency:

```xml
<dependency>
    <groupId>com.offlineintelligence</groupId>
    <artifactId>offline-intelligence-java</artifactId>
    <version>0.1.0</version>
</dependency>
```

Usage (the original snippet was truncated, so class and method names are illustrative assumptions):

```java
// Hypothetical class and method names; check the actual API documentation.
import com.offlineintelligence.Config;
import com.offlineintelligence.Server;

Config config = Config.fromEnv();
Server.run(config);
```
## Building from Source
### Prerequisites
- Rust 1.70+
- CMake 3.16+ (for C++ bindings)
- Python 3.8+ (for Python bindings)
- Node.js 16+ (for JavaScript bindings)
- Java 11+ (for Java bindings)
### Build Process

#### Windows

```bat
build.bat
```

#### Linux/macOS

The original command was missing here; assuming the repository provides a matching shell script:

```bash
./build.sh
```
### Build Output

The build process creates distribution packages in the `dist/` directory:

- `rust/` - Native Rust binaries
- `python/` - Python wheels
- `cpp-lib/` - C++ libraries and headers
- `javascript/` - Node.js packages
- `java/` - Java JAR files
## Configuration

The library uses environment variables for configuration:

```bash
# Core settings
LLAMA_BIN=/path/to/llama-server
MODEL_PATH=/path/to/model.gguf
API_HOST=127.0.0.1
API_PORT=8000

# Resource allocation
THREADS=auto
GPU_LAYERS=auto
CTX_SIZE=auto
BATCH_SIZE=auto
```
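Several settings accept the sentinel value `auto`, which lets the engine pick a value at startup. As a minimal sketch of how such a resolver might behave (this helper is illustrative, not part of the library's API):

```python
import os

def resolve_setting(name: str, default: int) -> int:
    """Resolve an environment setting, treating the value "auto"
    (or an unset variable) as a request for the given default."""
    value = os.environ.get(name, "auto")
    if value == "auto":
        return default
    return int(value)

# THREADS=auto falls back to the machine's CPU count
os.environ["THREADS"] = "auto"
threads = resolve_setting("THREADS", os.cpu_count() or 1)

# Explicit numeric values override the default
os.environ["CTX_SIZE"] = "4096"
ctx_size = resolve_setting("CTX_SIZE", 2048)
print(threads, ctx_size)
```

The actual auto-tuning logic (e.g. how `GPU_LAYERS=auto` probes VRAM) is internal to the engine.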
## API Endpoints

### Core Endpoints

- `POST /generate/stream` - Stream generation
- `GET /healthz` - Health check
- `GET /readyz` - Readiness check
- `GET /metrics` - Prometheus metrics

### Admin Endpoints

- `GET /admin/status` - System status
- `POST /admin/load` - Load model
- `POST /admin/stop` - Stop backend

### Memory Endpoints

- `GET /memory/stats/{session_id}` - Memory statistics
- `POST /memory/optimize` - Optimize memory
- `POST /memory/cleanup` - Cleanup memory
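A client targets these endpoints relative to the configured `API_HOST`/`API_PORT`. The sketch below builds the URL and body for a streaming generation request and the URL for a per-session memory-stats lookup; the JSON payload shape (a `prompt` field) is an assumption, so consult the actual API documentation before relying on it:

```python
import json
from urllib.parse import quote

# Matches the API_HOST/API_PORT defaults shown in the Configuration section
BASE_URL = "http://127.0.0.1:8000"

def stream_request(prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for a POST /generate/stream call.
    The request schema is assumed, not confirmed by this README."""
    url = f"{BASE_URL}/generate/stream"
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return url, body

def memory_stats_url(session_id: str) -> str:
    """Build the URL for GET /memory/stats/{session_id},
    percent-encoding the session id."""
    return f"{BASE_URL}/memory/stats/{quote(session_id, safe='')}"

url, body = stream_request("Hello")
print(url)
print(memory_stats_url("session-42"))
```

These URLs can then be issued with any HTTP client (e.g. `urllib.request` or `requests`), reading the `/generate/stream` response incrementally since it is a streaming endpoint.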
## Performance Characteristics
- Low Latency: Optimized for real-time inference
- Memory Efficient: Smart caching and garbage collection
- Multi-threaded: Automatic thread pool management
- GPU Accelerated: CUDA support for NVIDIA GPUs
## Contributing
We welcome contributions to the open-source core components. Please see our Contributing Guide for details.
## License
- Core library: Apache 2.0 License
- Proprietary extensions: Commercial licensing available
## Support
For support, please open an issue on our GitHub repository or contact our team at support@offlineintelligence.com.
## Roadmap

### Short Term (0.2.0)
- Enhanced documentation
- Additional language bindings
- Performance optimizations
### Medium Term (0.3.0)
- Plugin architecture for proprietary extensions
- Cloud deployment support
- Enhanced monitoring capabilities
### Long Term (1.0.0)
- Full commercial plugin ecosystem
- Enterprise features
- Advanced orchestration capabilities