Needle: Distilling Gemini Tool Calling into a 26M Parameter Micro-Model

If you have been building agentic workflows over the past year, you know the fundamental tension: tool calling requires intelligence, and intelligence has traditionally required large models. We have grown accustomed to routing our function calls through hosted APIs or settling for the latency of gigabyte-sized local weights.

Today, that assumption took a hit. Cactus Compute posted a "Show HN" on Hacker News that immediately caught our attention: Needle, a hyper-specialized, 26 million parameter model explicitly distilled from Google's Gemini 3.1 Flash Lite. It doesn't write poetry or generate Python scripts. It does exactly one thing: it matches user intent against tool schemas and emits the corresponding JSON call. And it does it very quickly.

#What Happened?

Cactus Compute has open-sourced Needle under the MIT license, including its weights on Hugging Face. At a mere 26M parameters, the model is astonishingly small. To put this in perspective, Needle is roughly a tenth the size of FunctionGemma-270M and less than a twentieth the size of Qwen-0.6B, models previously considered "tiny."

Despite its size, Needle is highly capable at its designated task. It handles single-shot tool calling across 15 distinct categories, ranging from smart home controls and messaging to navigation and timers. By distilling the relevant capabilities of Gemini 3.1 Flash Lite into a hyper-focused architecture, the team has shown that you don't need billions of parameters to parse a schema and extract arguments.

#Why It Matters: Extreme Efficiency at the Edge

The most compelling aspect of Needle isn't just its size; it is what that size enables. When quantized to INT4, the entire model occupies roughly 14MB of memory.
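The 14MB figure lines up with simple arithmetic. Here is a rough check, assuming exactly 4 bits per weight; the gap between the raw result and the reported size is presumably overhead such as quantization scales or layers kept at higher precision, though the source does not break this down.

```python
def quantized_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

NEEDLE_PARAMS = 26_000_000  # parameter count reported in the article

int4_mb = quantized_size_mb(NEEDLE_PARAMS, 4)   # assumed 4 bits per weight
fp16_mb = quantized_size_mb(NEEDLE_PARAMS, 16)  # for comparison only

print(f"INT4 weights: {int4_mb:.1f} MB")  # INT4 weights: 13.0 MB
print(f"FP16 weights: {fp16_mb:.1f} MB")  # FP16 weights: 52.0 MB
```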

Let that sink in for a moment. This model doesn't require a dedicated GPU cluster; it barely requires a modern CPU. This unlocks sophisticated, local-first tool calling for environments where it was previously impractical:

- Wearables: Smartwatches and AR glasses can process voice commands into structured API calls locally, avoiding the cloud round trip entirely.
- IoT Devices: Smart home hubs can handle intent routing on a low-end ARM chip, and potentially on microcontroller-class hardware such as an ESP32 given enough external memory, without round-tripping to a server.
- Mobile Apps: Applications can embed the model natively, keeping UI interactions near-instant and preserving user privacy by keeping queries on-device.

Performance-wise, Needle is extremely fast. On consumer hardware, it reportedly achieves 6,000 tokens per second for prefill and 1,200 tokens per second for decode. For a short prompt and a JSON payload of a few dozen tokens, that puts the whole call in the range of tens of milliseconds, often before a loading indicator would have time to appear.
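To make those throughput numbers concrete, here is a back-of-the-envelope latency estimate. The token counts are hypothetical, and the model ignores tokenization, startup, and any runtime overhead, so treat it as a lower bound rather than a benchmark.

```python
PREFILL_TPS = 6_000  # prefill throughput reported in the article (tokens/s)
DECODE_TPS = 1_200   # decode throughput reported in the article (tokens/s)

def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Estimated latency in ms, ignoring tokenization and startup cost."""
    prefill_s = prompt_tokens / PREFILL_TPS
    decode_s = output_tokens / DECODE_TPS
    return (prefill_s + decode_s) * 1000

# Hypothetical request: 300 tokens of tool schemas plus query, 30-token JSON call.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=30)
print(f"{latency:.0f} ms")  # 75 ms
```

Under these assumptions, decode accounts for a third of the total even though it produces a tenth of the tokens, so keeping the emitted JSON compact matters more than trimming the schema prompt.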

#Technical Implications: The "No-FFN" Architecture

For engineers, the architectural choices behind Needle are arguably the most fascinating part of the release. The Cactus Compute team introduced what they call the Simple Attention Network (SAN).

Standard transformer architectures are typically built from alternating Multi-Head Attention and Feed-Forward Network (FFN, or MLP) sublayers. A common interpretation in deep learning circles is that FFNs act as the "memory" of the model, storing world knowledge and facts, while Attention handles the dynamic routing of context.

The key insight behind Needle is that tool calling is not a reasoning or memory task; it is a retrieval and assembly task.

When you prompt a model with a list of available tool schemas and a user query, the model does not need to know the capital of France. It only needs to align the semantic spans of the user's request (e.g., "turn off the living room lights") with the required slots in the provided JSON schema.
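As an illustration of that alignment, the sketch below pairs a tool schema with the call a model would be expected to emit for the lights example, then checks the output against the schema. The schema layout, the `set_light` tool, and the `name`/`arguments` output shape are all assumptions in the common JSON-Schema style, not Needle's documented format.

```python
import json

# Hypothetical tool schema; not taken from the Needle repository.
SET_LIGHT = {
    "name": "set_light",
    "parameters": {
        "type": "object",
        "properties": {
            "room": {"type": "string"},
            "state": {"type": "string", "enum": ["on", "off"]},
        },
        "required": ["room", "state"],
    },
}

def validate_call(raw: str, tool: dict) -> dict:
    """Parse a model's JSON output and check it against one tool schema."""
    call = json.loads(raw)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    params = tool["parameters"]
    missing = [k for k in params["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key!r}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{key!r} must be one of {spec['enum']}")
    return call

# What a correct call for "turn off the living room lights" could look like.
model_output = '{"name": "set_light", "arguments": {"room": "living room", "state": "off"}}'
call = validate_call(model_output, SET_LIGHT)
print(call["arguments"])
```

Every value in the output is either copied from the query ("living room") or selected from the schema ("off", `set_light`), which is the sense in which the task is retrieval and assembly rather than recall.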

Therefore, Needle completely strips out the FFN layers. It uses a 12-layer encoder and an 8-layer decoder built entirely from attention and gating mechanisms. By dropping the MLPs, the team eliminated the bulk of the parameter count that a standard transformer layer would carry, drastically reducing computational overhead without sacrificing the specific routing capabilities required for function calling.
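To see why the FFN accounts for the bulk of a layer, consider the parameter count of a textbook transformer layer. This is a sketch for a generic layer with the usual 4x FFN expansion and biases ignored; the width below is hypothetical, and it does not account for Needle's gating parameters or decoder cross-attention, whose sizes are not given here.

```python
def attention_params(d_model: int) -> int:
    """Q, K, V, and output projections: four d_model x d_model matrices."""
    return 4 * d_model * d_model

def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Up- and down-projection of a standard two-layer FFN."""
    return 2 * d_model * (expansion * d_model)

D_MODEL = 512  # hypothetical width, chosen only for illustration

attn = attention_params(D_MODEL)
ffn = ffn_params(D_MODEL)
ffn_share = ffn / (attn + ffn)

print(f"attention: {attn:,}")          # attention: 1,048,576
print(f"ffn:       {ffn:,}")           # ffn:       2,097,152
print(f"ffn share: {ffn_share:.0%}")   # ffn share: 67%
```

With a 4x expansion the ratio is independent of the width: the FFN holds two thirds of each layer's weights, so removing it cuts per-layer parameters by roughly that much before any gating is added back.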

#The Training Pipeline

Training a model this specific required a clever pipeline:

- Pretraining: The model was trained from scratch on 200 billion tokens. Because of its microscopic size, this phase took only 27 hours on a cluster of 16 TPU v6e chips.
- Post-Training (Distillation): The team generated 2 billion tokens of highly complex, synthetic function-calling data using Gemini 3.1 Flash Lite. This phase took a mere 45 minutes, effectively transferring Gemini's robust instruction-following and schema-parsing behavior into the SAN architecture.
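The figures above imply some striking throughput numbers. The quick calculation below uses only the token counts, durations, and chip count quoted in this post; the per-chip figure is computed for pretraining only, since the hardware used for the distillation phase is not stated.

```python
def tokens_per_second(tokens: float, hours: float) -> float:
    """Average training throughput over a phase of the given duration."""
    return tokens / (hours * 3600)

PARAMS = 26e6            # model size reported in the article
PRETRAIN_TOKENS = 200e9  # pretraining tokens
PRETRAIN_HOURS = 27
CHIPS = 16               # TPU v6e chips used for pretraining
DISTILL_TOKENS = 2e9     # synthetic function-calling tokens
DISTILL_HOURS = 45 / 60

pretrain_tps = tokens_per_second(PRETRAIN_TOKENS, PRETRAIN_HOURS)
per_chip_tps = pretrain_tps / CHIPS
distill_tps = tokens_per_second(DISTILL_TOKENS, DISTILL_HOURS)
tokens_per_param = PRETRAIN_TOKENS / PARAMS

print(f"pretraining:  {pretrain_tps:,.0f} tokens/s ({per_chip_tps:,.0f} per chip)")
print(f"distillation: {distill_tps:,.0f} tokens/s")
print(f"pretraining tokens per parameter: {tokens_per_param:,.0f}")
```

That works out to roughly 2 million tokens per second across the cluster during pretraining, and about 7,700 training tokens for every parameter in the model.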

#What's Next?

Needle is available right now, and the barrier to entry is virtually zero. You can clone the repository, install the dependencies, and start experimenting with your own local schemas in minutes.

If you want to test it locally, Cactus Compute has provided a streamlined setup:

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This launches a local playground where you can inject custom tool schemas (perhaps internal microservice APIs or local system scripts) and watch the model route commands to them. Because the model is so small, fine-tuning it on proprietary, domain-specific tools should be far cheaper and faster than with larger models.
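Once a model emits a tool call, something still has to execute it. Below is a minimal sketch of that execution layer, assuming the call arrives as a JSON string with `name` and `arguments` fields. That output shape and both example tools are hypothetical, and no Needle API is used here.

```python
import json
from typing import Any, Callable

# Hypothetical local tools; real ones might wrap microservice clients or scripts.
def set_timer(minutes: int) -> str:
    return f"timer set for {minutes} min"

def send_message(to: str, body: str) -> str:
    return f"sent to {to}: {body}"

TOOLS: dict[str, Callable[..., Any]] = {
    "set_timer": set_timer,
    "send_message": send_message,
}

def dispatch(model_output: str) -> Any:
    """Route a JSON tool call to the matching local function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"no local handler for tool {name!r}")
    return TOOLS[name](**call.get("arguments", {}))

# The string below stands in for whatever the model emits.
result = dispatch('{"name": "set_timer", "arguments": {"minutes": 10}}')
print(result)  # timer set for 10 min
```

Keeping the registry as a plain dictionary makes it easy to expose only the tools you are willing to let a model trigger; anything not registered fails loudly instead of being executed.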

#Conclusion

The release of Needle is a strong validation of the "micro-model" philosophy. While frontier foundation models will continue to grow in size to push the boundaries of general reasoning, the execution layer of software engineering is moving in the opposite direction.

By aggressively pruning architectures to fit specific operational patterns, like ripping out FFNs for purely context-driven routing tasks, we are entering an era of hyper-optimized, localized AI components. Needle makes a strong case that for the mechanical plumbing of agentic systems, distillation and architectural minimalism can trump sheer parameter scale. At Ichiban Tools, we will absolutely be experimenting with embedding this into our local utility pipelines.