Bonsai 8B: Full AI model in 1.15 GB — works on iPhone and doesn't need the cloud

A complete language model with 8 billion parameters in a 1.15 GB file. You run it on an iPhone, Raspberry Pi, or an old laptop — without an internet connection, without cloud fees, without sharing your data with a third-party server. This is not sci-fi. This is Bonsai 8B, the first commercially viable 1-bit language model from the startup PrismML, which quietly emerged from stealth mode on March 31, 2026, and shook the developer community.

What is a 1-bit model and why does it matter

Classic language models like Llama or Mistral store each of their weights as a floating-point number, typically in 16- or 32-bit format. Bonsai takes a radically different approach: each weight is represented only by its sign, i.e., a value of −1 or +1, and each group of weights shares a single scaling factor. Nothing more.
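The sign-plus-scale scheme can be sketched in a few lines. This is an illustration of the general technique only, not PrismML's actual (unpublished) algorithm; the group size of 128 and the mean-absolute-value scale are assumptions.

```python
import numpy as np

def quantize_1bit(weights: np.ndarray, group_size: int = 128):
    """Quantize weights to {-1, +1} plus one fp16 scale per group.

    Illustrative sketch only: the group size (128) and the choice of
    scale (mean absolute value of the group) are assumptions, not
    PrismML's published method.
    """
    groups = weights.reshape(-1, group_size)
    signs = np.where(groups >= 0, 1, -1).astype(np.int8)      # 1 bit per weight
    scales = np.abs(groups).mean(axis=1).astype(np.float16)   # shared group scale
    return signs, scales

def dequantize(signs: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Reconstruct approximate weights: each sign times its group's scale
    return (signs * scales[:, None]).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
signs, scales = quantize_1bit(w)
w_hat = dequantize(signs, scales)
```

Each weight now costs one bit plus its share of a single group scale, which is where the roughly fourteenfold memory saving comes from.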

The result? Where a standard 8B model in 16-bit format takes up approximately 16 GB of memory, Bonsai 8B makes do with just 1.15 GB. This is a reduction to one-fourteenth of the original size. The project founder Babak Hassibi, a professor of electrical engineering at Caltech, spent years building the mathematical theory that enables this compression without a destructive loss of the model's reasoning capabilities.
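The back-of-the-envelope arithmetic behind that ratio, assuming one fp16 scale per group of 128 weights (the group size is an assumption; the remaining ~0.03 GB of the real file presumably covers embeddings and metadata):

```python
PARAMS = 8e9  # 8 billion weights

# 16-bit baseline: 2 bytes per weight
fp16_gb = PARAMS * 2 / 1e9  # 16.0 GB

# 1-bit variant: 1 bit per weight, plus one 2-byte fp16 scale per group
GROUP = 128  # assumed group size; not published by PrismML
onebit_gb = (PARAMS / 8 + (PARAMS / GROUP) * 2) / 1e9  # ~1.125 GB

ratio = fp16_gb / onebit_gb  # ~14.2, i.e. roughly one-fourteenth
```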

Numbers that don't feel like a marketing trick

PrismML has published specific benchmarks that can be verified. Bonsai 8B achieves an average score of 70.5 points across standard tests (MMLU Redux, MuSR, GSM8K, and others). For comparison: Llama 3 8B achieves 67.1 points and requires fourteen times more memory to run. Mistral 3B has a score of 71.0 — but it is a smaller model trained on a different dataset.

The metric that shows true efficiency is called intelligence density — intelligence per gigabyte. Bonsai 8B achieves a value of 1.06/GB. Qwen3 8B, one of the best open-source models today, achieves only 0.10/GB. A tenfold difference.

The text generation speed is surprisingly high:

  • M4 Pro Mac: 131 tokens per second
  • NVIDIA RTX 4090: 368 tokens per second
  • iPhone 17 Pro Max: 44 tokens per second

For comparison: a standard Llama 3 in 16-bit format generates approximately 17 tokens per second on an M4 Pro Mac, so Bonsai is roughly eight times faster. Energy consumption is approximately four to five times lower than that of 16-bit models; on an iPhone, it comes to 0.068 mWh per token.
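To put the 0.068 mWh/token figure in perspective, here is a rough estimate of what one battery charge buys. The answer length and battery capacity are assumptions for illustration, not PrismML numbers:

```python
ENERGY_PER_TOKEN_MWH = 0.068  # iPhone 17 Pro Max, per PrismML's benchmarks
TOKENS_PER_ANSWER = 500       # assumed length of a medium answer
BATTERY_WH = 17.0             # assumed iPhone-class battery capacity

answer_wh = ENERGY_PER_TOKEN_MWH * TOKENS_PER_ANSWER / 1000  # Wh per answer
answers_per_charge = BATTERY_WH / answer_wh
print(f"{answer_wh:.3f} Wh per answer, ~{answers_per_charge:.0f} answers per charge")
```

Even under these generous assumptions, inference energy is a rounding error next to the display and radios, which is the practical point of the four-to-fivefold saving.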

Who is it actually for

PrismML makes no secret that their primary goal is to free AI from the cloud. Bonsai was designed for scenarios where the cloud simply doesn't work or isn't desirable:

  • Industrial robotics — real-time decision-making without network connection latency
  • Healthcare — sensitive data remains on the device, out of reach of external servers
  • Corporate deployment — AI assistant directly on the internal network, no transfer of corporate data outside
  • Developers and enthusiasts — local experimentation on Mac, Raspberry Pi, or an old GPU

The models are available in three sizes: 1.7B, 4B, and 8B parameters. All run natively via MLX on Apple devices (Mac, iPhone, iPad) and via llama.cpp on NVIDIA GPUs. The entire family is released under the Apache 2.0 license, meaning it can be used commercially, free of charge.

Why this is important for Europe and the Czech Republic

The shift of AI to local devices has a specific regulatory dimension in the European context. GDPR and the upcoming implementing regulations for the EU AI Act place increasingly stringent requirements on where and how personal data is processed. Cloud-based language models are problematic from this perspective: you never know exactly where your data travels or where it is processed.

Models like Bonsai eliminate this problem at its root: the data doesn't go anywhere, because the model runs directly on your device. For Czech companies in healthcare, law, or the financial sector, where data protection is crucial, this is an argument that open-source models have so far been unable to make: either the model was too large to deploy on a company's own hardware, or too weak for real-world use.

Czech is not yet a primarily supported language; Bonsai is trained mainly on English, and PrismML has not provided specific information about support for other languages. For Czech deployment, you would need to fine-tune the model or interact with it in English. Developer communities are watching the situation, however, and a Czech fine-tune will likely appear on HuggingFace within weeks.

One question remains: what do you lose?

Compression always brings compromises. Bonsai 8B significantly lags behind full-fledged models like GPT-4o or Claude 3.7 Sonnet in creative tasks and complex reasoning. Tests from HPCwire show that the model is strongest in classification tasks, document reading, and simple code generation. For complex multi-turn reasoning or writing long coherent texts, the 1-bit architecture is not yet sufficient.

PrismML openly states that Bonsai 8B is not a replacement for GPT-5 or Gemini Ultra. It is a specialized tool for specific scenarios where latency, privacy, and energy consumption matter — and in these scenarios, it currently has no competition in the open-source model category.

Whether the 1-bit architecture will establish itself as a new standard or remain a niche solution, practice will show. In any case, April 1, 2026, was no joke — Bonsai is a real thing you can download and run today.

Where can I download and try Bonsai 8B?

The models are freely available on HuggingFace under the Apache 2.0 license (prism-ml/Bonsai-8B-gguf and prism-ml/Bonsai-8B-mlx-1bit). On Apple devices (Mac, iPhone, iPad), they work via the MLX framework; on Windows and Linux with NVIDIA GPUs, via llama.cpp. Sample code and instructions can also be found on the PrismML-Eng/Bonsai-demo GitHub project.

Does Bonsai 8B work in Czech?

The model is primarily trained on English text, and Czech is not an officially supported language. The model shows a basic understanding of Czech (it is trained on multilingual data), but reliable Czech output will require a community fine-tune, which will likely emerge on HuggingFace within weeks.

Is the 1-bit model safe for processing sensitive corporate data?

Yes, and that is one of its main advantages. Because the model runs exclusively locally on your device, no data leaves your infrastructure. For sectors such as healthcare, law, or finance, where strict GDPR and data protection rules apply, this is a key argument. The model does not require an internet connection and does not send any queries to external servers.