§Web-RWKV
This is an inference engine for the RWKV language model, implemented in pure WebGPU.
§Features
- No dependencies on CUDA/Python.
- Supports Nvidia/AMD/Intel GPUs, including integrated GPUs.
- Vulkan/Dx12/OpenGL backends.
- Batched inference.
- Int8 and NF4 quantization.
- Very fast.
- LoRA merging at loading time.
- Supports RWKV V4, V5 and V6.
§Notes
Note that `web-rwkv` is only an inference engine. It provides only the following functionality:
- A tokenizer.
- Model loading.
- State creation and updating.
- A `run` function that takes in prompt tokens and returns logits (which become predicted next-token probabilities once `softmax` is applied; see the sketch below).
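To make the last point concrete, here is a minimal, numerically stable softmax sketch. It is plain Rust with no dependency on `web-rwkv`'s API, purely to illustrate how raw logits become a probability distribution:

```rust
/// Numerically stable softmax: converts raw logits into a
/// probability distribution over the vocabulary.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum logit before exponentiating to avoid overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exp: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exp.iter().sum();
    exp.iter().map(|&x| x / sum).collect()
}
```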
It does not provide the following:
- An OpenAI API, or APIs of any kind.
  - If you would like to deploy an API server, check out AI00 RWKV Server, a fully-functional OpenAI-compatible API server built upon `web-rwkv`.
  - You could also check out the `web-rwkv-axum` project if you want fancier inference pipelines, including Classifier-Free Guidance (CFG), Backus–Naur Form (BNF) guidance, and more.
- Samplers. A basic nucleus sampler is implemented in the examples, but it is not included in the library itself (see the sketch after this list).
- A state caching or management system.
- Python (or any other language) bindings.
- A runtime. The absence of a runtime makes `web-rwkv` easy to integrate into any application, from servers and front-end apps (yes, `web-rwkv` can run in the browser) to game engines.
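Since the library ships no sampler, callers pick tokens from the probability distribution themselves. Below is a minimal nucleus (top-p) sampling sketch, not the exact code from the examples; the `top_p` parameter and the use of the `rand` crate (0.8-style API) are assumptions of this illustration:

```rust
use rand::Rng;

/// Minimal nucleus (top-p) sampling sketch: keep the smallest set of
/// tokens whose cumulative probability reaches `top_p`, then sample
/// from that set proportionally to the original probabilities.
fn sample_nucleus(probs: &[f32], top_p: f32, rng: &mut impl Rng) -> usize {
    // Sort token indices by descending probability.
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    indices.sort_unstable_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // Truncate to the nucleus: the shortest prefix whose mass reaches top_p.
    let mut cumulative = 0.0;
    let mut nucleus = Vec::new();
    for &i in &indices {
        cumulative += probs[i];
        nucleus.push(i);
        if cumulative >= top_p {
            break;
        }
    }

    // Draw a token from the truncated, renormalized distribution.
    let mut threshold = rng.gen::<f32>() * cumulative;
    for &i in &nucleus {
        threshold -= probs[i];
        if threshold <= 0.0 {
            return i;
        }
    }
    *nucleus.last().unwrap()
}
```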
§Crate Features
- `subgroup-ops` (enabled by default) — Enables subgroup operations in the kernels. Accelerates inference on some devices.
- `tokio` (enabled by default) — Enables tokio's multi-threaded runtime. Doesn't work on web platforms.
- `trace` — Enables performance tracing.
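For example, to opt out of the defaults and re-enable only `tokio`, a dependent crate's Cargo.toml might look like this (the version is a placeholder; pin the release you actually target):

```toml
[dependencies]
# Version is illustrative; replace "*" with a concrete release.
web-rwkv = { version = "*", default-features = false, features = ["tokio"] }
```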
§Re-exports
pub use wgpu;
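Because `wgpu` is re-exported, downstream code can use the exact `wgpu` version `web-rwkv` was built against instead of declaring a separate, possibly mismatched, dependency. A minimal sketch (note that `enumerate_adapters` is available on native platforms, and the exact `wgpu` API may vary by version):

```rust
// Use the re-exported wgpu so types match the ones web-rwkv expects.
use web_rwkv::wgpu;

/// List the GPU adapters visible to wgpu across all backends.
fn print_adapters() {
    let instance = wgpu::Instance::default();
    for adapter in instance.enumerate_adapters(wgpu::Backends::all()) {
        println!("{:?}", adapter.get_info());
    }
}
```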