# ares-server 0.7.5

A.R.E.S (Agentic Retrieval Enhanced Server): a production-grade agentic chatbot server with multi-provider LLM support, tool calling, RAG, and MCP integration.
# Known Issues

## OpenAI Integration

**Status**: ✅ Compiles against async-openai 0.31.1; needs live API verification

**Issue**: The provider was updated to the 0.31.1 API (tool enums, tool list conversion, and tool-call parsing). Compile errors are resolved; runtime/tool-calling correctness still needs validation with a real OpenAI endpoint.

**Impact**: 
- Builds now succeed with the `openai` feature
- Tool calling should work, but has not been exercised against the real API
- Further adjustments may be needed after end-to-end testing

**Workaround**:
- Prefer Ollama or LlamaCpp for local-first workflows
- If using OpenAI, run targeted E2E tests with a real API key
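
One way to gate those live tests is on the presence of the API key; this sketch assumes the E2E tests are marked `#[ignore]` so that plain `cargo test` skips them (the `--ignored` flag is standard cargo; the test-marking convention is an assumption about this repo):

```bash
# Run live OpenAI E2E tests only when a key is present.
# Assumes live tests are marked #[ignore] so normal CI skips them.
if [ -n "${OPENAI_API_KEY:-}" ]; then
  mode=live
  cargo test --features openai -- --ignored
else
  mode=skip
  echo "OPENAI_API_KEY not set; skipping live OpenAI tests"
fi
```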

**Next Steps**:
1. Run live tests with a real OpenAI key to validate tool calling and streaming
2. Add mocked/OpenAI-contract tests if feasible
3. Update docs with any model-specific nuances

## GPU Backend Compilation

**Status**: ⚠️ Requires platform-specific SDKs

**Issue**: Building with GPU features requires installed SDKs:
- `llamacpp-cuda`: Requires CUDA Toolkit
- `llamacpp-metal`: macOS only, requires Xcode
- `llamacpp-vulkan`: Requires Vulkan SDK

**Impact**:
- `--all-features` builds will fail without SDKs installed
- Per-platform builds work fine

**Workaround**:
```bash
# Use specific features instead of --all-features
cargo build --features "ollama,llamacpp,local-db"

# Or enable GPU only if SDK is installed
cargo build --features "llamacpp-cuda"  # if CUDA is available
```

**Documentation**: See `docs/GGUF_USAGE.md` for GPU setup instructions
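
A small heuristic can pick the right feature flag automatically; this sketch only detects common SDK entry points (`nvcc`, `vulkaninfo`) and prints the build command rather than running it, so you can sanity-check the choice first:

```bash
# Heuristic: pick a GPU feature flag based on installed SDKs, then print
# the build command to run once the choice looks right.
if command -v nvcc >/dev/null 2>&1; then
  features="llamacpp-cuda"     # CUDA Toolkit found
elif [ "$(uname)" = "Darwin" ]; then
  features="llamacpp-metal"    # Metal is available on macOS with Xcode
elif command -v vulkaninfo >/dev/null 2>&1; then
  features="llamacpp-vulkan"   # Vulkan SDK found
else
  features="llamacpp"          # CPU-only fallback
fi
echo "cargo build --features \"$features\""
```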

## Test Coverage

**Status**: ✅ Core features well covered; some providers still need tests

**Details**:
- 277+ tests total (152 lib + 125 integration)
- Ollama: full coverage with wiremock
- LlamaCpp: needs integration tests with real GGUF models
- OpenAI: tests disabled pending live API validation (see "OpenAI Integration" above)

**Recommendations**:
1. Add E2E tests with real Ollama instance in CI
2. Add LlamaCpp tests with tiny test model
3. Fix OpenAI and re-enable tests
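
Recommendation 1 can be wired up as a conditional CI step; a sketch, assuming an Ollama instance listens on the default port and the tests honor an `OLLAMA_HOST` variable (the model tag is illustrative, any small model works for smoke tests):

```bash
# Run Ollama-backed tests against a real local instance when available.
if command -v ollama >/dev/null 2>&1; then
  ollama pull llama3.2:1b
  OLLAMA_HOST=http://localhost:11434 cargo test --features ollama
  e2e=ran
else
  e2e=skipped
  echo "ollama CLI not found; running mocked tests only"
fi
```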

## Windows-Specific

**Status**: ✅ Works with minor notes

**Notes**:
- PowerShell script provided: `scripts/dev-setup.ps1`
- CUDA requires Visual Studio Build Tools
- Long path support may be needed for some GGUF models

**Fix** (run PowerShell as Administrator; restart the session afterwards):
```powershell
# Enable long paths in Windows (requires elevated PowerShell)
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" `
  -Name "LongPathsEnabled" -Value 1 -Type DWord
```

## Docker Compose

**Status**: ✅ Functional with notes

**Notes**:
- GPU passthrough on Linux requires the NVIDIA Container Toolkit (the successor to `nvidia-docker2`)
- Windows/Mac GPU support varies by Docker Desktop version
- Health checks may timeout on slow systems

**Workaround**:
```yaml
# Increase health check intervals in docker-compose.dev.yml
healthcheck:
  interval: 60s  # was 30s
  timeout: 20s   # was 10s
  start_period: 120s  # was 60s
```

## Memory Usage

**Status**: ⚠️ Large models require significant RAM

**Issue**: 
- 7B Q4 models: ~6-8GB RAM
- 13B Q4 models: ~10-12GB RAM
- 70B Q4 models: ~40-50GB RAM

**Workaround**:
1. Use smaller models (1B-3B) for development
2. Use more aggressive quantization (Q3_K_M, Q4_0)
3. Reduce context size: `LLAMACPP_N_CTX=2048`
4. Enable GPU offloading to move memory to VRAM
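
As a combined sketch: only `LLAMACPP_N_CTX` appears in this document; the other variable names and the model path below are illustrative placeholders to adapt to your actual configuration:

```bash
# Trade quality for memory during development.
export LLAMACPP_N_CTX=2048                           # smaller context window (documented above)
export LLAMACPP_MODEL=models/tiny-1b-q4_0.gguf       # small, aggressively quantized model (path illustrative)
export LLAMACPP_N_GPU_LAYERS=32                      # offload layers to VRAM if a GPU feature is built (name illustrative)
```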

## MCP Integration

**Status**: ✅ Complete

**Implementation**: Full MCP (Model Context Protocol) server with tool support.

**Files**:
- `src/mcp/server.rs` - MCP server implementation
- 14+ tests for MCP functionality

**Features**:
- Tool registration and execution
- Protocol compliance
- Tested with comprehensive test suite
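
As a reference point, MCP tool discovery is a JSON-RPC 2.0 exchange; a minimal `tools/list` request looks like this (the message shape follows the MCP specification; how the server is wired to stdio or HTTP depends on your deployment):

```bash
# Minimal MCP tool-discovery request (JSON-RPC 2.0).
req='{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
echo "$req"
```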

## Performance Notes

**Not Issues, Just Notes**:

1. **First Request Slowness**: Model loading can take 5-30 seconds on the first request
2. **Context Building**: Large contexts (8K+ tokens) slow down generation significantly
3. **Concurrent Requests**: CPU-only builds handle only 1-2 concurrent generations efficiently
4. **Streaming Latency**: The first token can take 1-2 seconds to arrive
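
One way to hide the first-request model-load cost is to send a throwaway request at deploy time; a sketch (the URL and payload shape are assumptions, adjust them to the server's actual chat endpoint):

```bash
# Warm the model once after startup so users don't pay the load cost.
warm() {
  curl -fsS -X POST "$1" \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"ping"}]}' \
    >/dev/null 2>&1 || echo "warm-up failed (is the server running?)"
}
warm "http://localhost:8080/v1/chat"
```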

---

## Reporting Issues

If you encounter issues not listed here:

1. Check `CONTRIBUTING.md` for development setup
2. Verify environment variables in `.env`
3. Run with debug logging: `RUST_LOG=debug cargo run`
4. Search existing GitHub issues
5. Open a new issue with:
   - OS and Rust version
   - Feature flags used
   - Full error message
   - Steps to reproduce

## Fixed in This Release

✅ Turso cloud dependency removed (local-first by default)  
✅ Qdrant cloud dependency removed (optional feature)  
✅ Ollama tool calling implemented and tested  
✅ LlamaCpp streaming working  
✅ 175+ tests passing for core features  
✅ CI/CD pipeline configured  
✅ Documentation complete  
✅ MCP server fully implemented  
✅ RAG pipeline with pure-Rust vector store  
✅ Rate limiting infrastructure  
✅ Improved CORS configuration  
✅ **Vector persistence bug fixed** - Vectors now properly saved to disk (commit 354a771)  
✅ **Race condition in parallel model loading fixed** - Added per-model initialization locks (commit 354a771)  
✅ **Fuzzy search query typo correction** - Query-level typo correction implemented (commit 1eda28b, closes #4)
✅ **Embedding cache implemented** - In-memory LRU cache for embeddings (commit c6c25dd)

## Open Issues

*No major open issues at this time.*

---

**Last Updated**: 2026-02-01  
**Version**: 0.5.0