Run large language models at home, BitTorrent‑style
- Generate text with Llama 2 (70B), Falcon (40B+), BLOOM (176B), or their derivatives, and fine‑tune them for your tasks, all using a consumer-grade GPU or Google Colab.
- You load a small part of the model, then join a network of people serving the other parts. Single‑batch inference runs at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B) — enough for chatbots and interactive apps.
- Go beyond classic LLM APIs: employ any fine-tuning and sampling methods, execute custom paths through the model, or inspect its hidden states. You get the comforts of an API with the flexibility of PyTorch and 🤗 Transformers.
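The serving scheme described in the list above can be illustrated with a toy, pure-Python sketch. It is not Petals' API. `make_block`, `Server`, `partition`, and `run_chain` are hypothetical names, each "block" is a trivial function standing in for a transformer layer, and all peers run in one process instead of over a network. Each server loads only a contiguous slice of the blocks. The client passes hidden states from one server to the next, which is also what makes custom paths and hidden-state inspection possible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Hidden = List[float]
Block = Callable[[Hidden], Hidden]


def make_block(i: int) -> Block:
    """Hypothetical stand-in for transformer block i: adds i + 1 to every entry."""
    def block(h: Hidden) -> Hidden:
        return [x + (i + 1) for x in h]
    return block


@dataclass
class Server:
    """A toy peer that hosts blocks [start, end) of the model."""
    start: int
    end: int
    blocks: List[Block] = field(init=False)

    def __post_init__(self) -> None:
        # Each peer loads only its own slice of the model.
        self.blocks = [make_block(i) for i in range(self.start, self.end)]

    def forward(self, h: Hidden) -> Hidden:
        for block in self.blocks:
            h = block(h)
        return h


def partition(num_blocks: int, num_servers: int) -> List[Server]:
    """Split the blocks into contiguous, nearly equal spans, one per server."""
    base, extra = divmod(num_blocks, num_servers)
    servers, start = [], 0
    for s in range(num_servers):
        end = start + base + (1 if s < extra else 0)
        servers.append(Server(start, end))
        start = end
    return servers


def run_chain(servers: List[Server], h: Hidden) -> Tuple[Hidden, List[Hidden]]:
    """Client-side loop: send hidden states through the given servers in order,
    keeping every intermediate hidden state so it can be inspected."""
    trace = [h]
    for server in servers:
        h = server.forward(h)
        trace.append(h)
    return h, trace


# Usage: 10 toy blocks spread over 3 peers.
servers = partition(num_blocks=10, num_servers=3)
print([(s.start, s.end) for s in servers])  # [(0, 4), (4, 7), (7, 10)]

out, trace = run_chain(servers, [0.0, 1.0])
print(out)       # [55.0, 56.0]  (blocks add 1 + 2 + ... + 10 = 55)
print(trace[1])  # [10.0, 11.0]  hidden state after the first peer

# A custom path: skip the middle peer entirely.
skipped, _ = run_chain([servers[0], servers[2]], [0.0, 1.0])
print(skipped)   # [37.0, 38.0]
```

Keeping the blocks contiguous per server means the client makes one hop per peer rather than one per layer. Because the client owns the loop in `run_chain`, it can reorder, skip, or tap into any stage.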
Follow development in Discord or via email.
This project is a part of the BigScience research workshop.