Your local AI workstation.

Discover, download, convert, serve, chat with, benchmark, and generate images from open-weight models — all in one desktop app. Powered by llama.cpp, Apple MLX, DFlash/DDTree speculative decoding, and five pluggable cache compression strategies.

macOS · Linux · Windows · signed in-app updates

ChaosEngineAI walkthrough

Everything you need, locally.

A single desktop control plane that replaces a stack of scripts, notebooks, and CLIs.

Dashboard

Backend health, engine, loaded model, hardware, and live warm-pool telemetry.

Discover

Curated catalogs with capability tags — chat, coding, vision, reasoning, tools, multilingual.

My Models

Local library sorted by name, format, size, context length, or modified date — one-click launch.

Chat

Streaming multi-thread chat with document RAG attachments and image inputs for vision models.

Server

OpenAI-compatible HTTP server with bind address, request stats and a built-in test panel.
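
Because the API is OpenAI-compatible, any OpenAI SDK can talk to the running server. A minimal Python sketch; the port, API key, and model name below are placeholder assumptions, not ChaosEngineAI defaults:

```python
# Point the official openai client at the local server. The base_url,
# api_key, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local-model",  # whichever model is currently loaded
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```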

Benchmarks

Live tok/sec & TTFT, prompt sets, full report cards, saved history with side-by-side diffs.

Image Discover

Curated Stable Diffusion catalog — browse, filter by compatibility, and install with one click.

Image Studio

Prompt-based image generation with aspect ratio, quality presets, negative prompts, and live progress.

Image Gallery

Browse, filter, and reuse saved outputs — compare models and re-run with the same seed.

Conversion

Hugging Face → MLX with optional compression. Layer-by-layer live progress.

Warm pool

Recently used models stay hot — subsequent loads are instant.

In-app updates

Signed, verified, cross-platform. Launch → prompt → relaunch.

DFlash / DDTree

Speculative decoding — 3-5x faster generation with zero quality loss. Tree-based candidate exploration for higher acceptance.

Agent Tools

Web search, calculator, code executor, and file reader — models call tools during conversations.

Fine-Tuning

LoRA adapter training for MLX models with configurable learning rate, rank, and epochs.

Plugins

Extensible plugin system — cache strategies, inference engines, tools, model sources, and post-processors.

Logs

Live tail of the backend stream — load events, server requests, errors — with level filters.

Settings

Directories, default launch preferences, cache strategy, fused attention, FP16 layers.

Telemetry

Backend health, engine in use, hardware, memory pressure — always one glance away.

Speculative decoding

3-5x faster generation, zero quality loss.

A small draft model proposes a block of tokens; the target model verifies them in a single forward pass. Accepted tokens are committed instantly. Two modes, one goal: faster inference.

  • DFlash — linear draft-and-verify with auto-resolved draft checkpoints from the z-lab collection
  • DDTree — tree-structured candidate exploration using a max-probability heap, verified with a tree attention mask in one pass
  • Supported families — Qwen3, Qwen3.5, Qwen3-Coder, LLaMA 3.1, gpt-oss, Kimi-K2.5
  • Graceful fallback — DDTree falls back to DFlash, DFlash falls back to standard generation

  • Speedup: 3-5x
  • Quality loss: Zero
  • DDTree budget: 0–64 nodes
  • Backends: MLX + dflash-mlx
  • Cache: Native f16 (auto)
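
To make draft-and-verify concrete, here is a minimal sketch of the linear (DFlash-style) mode with greedy acceptance. The model callables and function names are illustrative stand-ins, not ChaosEngineAI internals:

```python
# Linear draft-and-verify with greedy acceptance. Both model callables
# below are toy stand-ins (this sketch is not ChaosEngineAI's code).
from typing import Callable, List

def speculative_step(
    target_next: Callable[[List[int]], List[int]],  # one pass: argmax next token at EVERY position
    draft_next: Callable[[List[int]], int],         # cheap: argmax next token only
    tokens: List[int],
    k: int = 4,
) -> List[int]:
    # 1. The small draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))

    # 2. The target model verifies the whole block in a single forward
    #    pass; predicted[i] is the target's choice for position i + 1.
    predicted = target_next(proposal)

    # 3. Commit the longest agreeing prefix. On the first disagreement,
    #    the target's own token is committed instead, so the output is
    #    identical to plain greedy decoding with the target model alone.
    out = list(tokens)
    for i in range(len(tokens), len(proposal)):
        out.append(predicted[i - 1])
        if predicted[i - 1] != proposal[i]:
            break
    return out  # always advances by at least one token per target pass

# Toy demo: both models "predict t + 1", so every draft token is accepted
# and one verification pass yields k new tokens.
def target(seq): return [(t + 1) % 100 for t in seq]
def draft(seq): return (seq[-1] + 1) % 100

print(speculative_step(target, draft, [1, 2, 3]))  # -> [1, 2, 3, 4, 5, 6, 7]
```

DDTree generalizes this loop by verifying a tree of candidate continuations with a tree attention mask in a single pass, which raises the expected number of accepted tokens.
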
Pluggable compression

Cutting-edge KV cache compression.

ChaosEngineAI supports multiple cache compression backends via a pluggable strategy system. Install a backend, restart the app, and it appears in the cache selector — no config needed. Run bigger models in tighter memory budgets without giving up quality.

  • RotorQuant — 4D quaternion & 2D Givens rotation compression (3-4 bit)
  • TriAttention — transparent vLLM-integrated cache budget management
  • TurboQuant — PolarQuant with fused Metal kernels for Apple Silicon
  • ChaosEngine — PCA-based decorrelation + hybrid quantization (2-8 bit)
  • Native f16 — full-precision baseline, always available

  • Backends: 5 strategies
  • Bit depths: 1–8 bit
  • Compression: up to ~7×
  • Install: pip install
  • Platforms: MLX · CUDA · Metal
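
As an illustration of how a pip-installed backend could surface in the cache selector, here is a sketch of a registry-style strategy interface. All names here are hypothetical, not the actual plugin API:

```python
# Hypothetical registry for pluggable cache strategies; the class and
# function names are illustrative, not ChaosEngineAI's plugin API.
from abc import ABC, abstractmethod

_REGISTRY: dict[str, type] = {}

class CacheStrategy(ABC):
    """Compress and restore one KV-cache block."""

    @abstractmethod
    def compress(self, keys, values): ...

    @abstractmethod
    def decompress(self, blob): ...

def register_strategy(name: str):
    """Decorator a backend package applies at import time so the host
    app can list it in the cache selector after a restart."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

@register_strategy("native-f16")
class NativeF16(CacheStrategy):
    """Full-precision baseline: stores tensors unchanged."""

    def compress(self, keys, values):
        return (keys, values)

    def decompress(self, blob):
        return blob

print(sorted(_REGISTRY))  # -> ['native-f16']
```

In practice each pip-installed backend could expose its strategy through a Python entry point that the app scans at startup, which would account for the install, restart, select flow with no config.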

Download & install.

Signed, cross-platform builds with in-app auto-updates from v0.4.20 onward.