Why small language models can be better
Iain Mackie
Large language models have become a big part of how we work now. Tools like ChatGPT, Claude, Grok, and Gemini handle everything from drafting emails to writing code. In image generation there is DALL-E and Midjourney, in audio there is Whisper and ElevenLabs, and in video there is Sora. They are coaching tools, brainstorming partners, coding buddies, mental health companions. Imagination really is the only limit.
But there is a catch. These models are eye-wateringly expensive to run.
OpenAI, the company behind ChatGPT (one of the fastest-growing products in history), was projected to operate at a $5 billion loss in 2024 alone. And OpenAI is best in class. The economics of large AI models are genuinely challenging, which makes it hard for most companies to justify the investment, especially in industries where data is sensitive and returns on AI are still being proven.
This is where small language models come in. And they are more interesting than their name suggests.
The cost gap is bigger than you think
A large language model can consume 100 to 1,000 times more power than a small one. Large models need entire clusters of specialised computing chips running in data centres to function. Small models can run on a single device, sometimes even a phone or a laptop.
Once a system is deployed, that difference means day-to-day AI operations can be 100 to 1,000 times cheaper. For any organisation thinking seriously about deploying AI at scale, that gap matters enormously.
AI adoption is already expensive before the running costs even kick in. Specialised talent is hard to find and retain. Gathering, cleaning, and preparing quality data takes time and money. If the ongoing costs are prohibitive on top of that, the business case simply does not stack up. Small language models change that calculation.
Small but mighty: what small language models are actually good at
A small language model will not beat a large one at everything. Large models are trained to be generalists, and at that they excel. All else being equal, bigger tends to mean better for broad, open-ended tasks.
But most real business problems are not broad. They are specific.
A hospital system does not need a model that can write poetry and debug Python. It needs one that can triage patient notes accurately and consistently. A legal team does not need a model that knows about everything. It needs one that understands contract language. A customer support team needs a model that knows their product, not the whole internet.
This is where small language models shine. When adapted for a specific domain, they often match or even outperform much larger models on that focused task, at a fraction of the cost.
There is also a smarter way to build them, one that closes the gap between small and large even further.
How small models learn from large ones
One of the most powerful ideas in this space is called knowledge distillation. The concept is surprisingly intuitive.
You start with a large, capable "teacher" model. You then train a smaller "student" model not just on raw data, but on the teacher's outputs and reasoning. The student learns to mimic how the teacher thinks, not just what answers it gives. Over time, it picks up the teacher's expertise in a way that standard training alone cannot replicate.
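The core of this teacher-student idea can be sketched in a few lines. The snippet below is a minimal, self-contained illustration of the classic soft-target loss, not a production training loop: the teacher's logits are turned into a softened probability distribution with a temperature, and the student is penalised (via KL divergence) for diverging from that full distribution rather than just the top answer. The function names and example logits are invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher temperature produces
    softer targets that expose how the teacher ranks *all* options."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's soft targets and the
    student's predictions. Training the student to minimise this
    teaches it to mimic how the teacher distributes its confidence,
    not just which single answer it picks."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The teacher's near-miss scores carry useful signal: the second
# option scoring 2.1 rather than -5.0 tells the student that option
# is a plausible alternative, which hard labels alone never reveal.
teacher = [4.0, 2.1, 0.3]
student = [3.0, 1.0, 1.5]
loss = distillation_loss(student, teacher)
```

When the student's distribution matches the teacher's exactly, the loss is zero; the further its confidence pattern drifts, the larger the penalty.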
The result is a compact model that punches well above its weight. Large models paved the way. Now we use them to train the next generation of focused specialists.
This unlocks several real advantages beyond lower running costs:
Better privacy and security. When a model runs on your own infrastructure rather than sending data to an external server, sensitive information stays where it belongs. This matters a great deal in industries like healthcare, finance, and law, where data handling is tightly regulated.
Stronger performance on specific tasks. A small model built for a focused job, like reviewing legal contracts or screening insurance claims, will often outperform a large general model on that same job. Precision beats breadth when the task is well-defined.
Runs almost anywhere. Small models can be embedded into mobile apps, factory floor devices, or lightweight cloud setups. They do not require expensive infrastructure to work well.
Lower environmental impact. Less compute means less energy. For organisations with sustainability targets, this is a meaningful benefit, not just a nice-to-have.
Three ways to make a small language model work for you
Knowing you want a small language model is one thing. Getting it to perform well for your specific needs is another. There are three main approaches, each with a different level of effort and payoff.
1. Prompt engineering
This is the lightest-touch option. Instead of changing the model itself, you change how you talk to it. Carefully written instructions, useful context, and well-placed examples in your prompts can dramatically improve what a model produces.
Think of it like briefing a contractor. A vague brief gets generic results. A clear brief, with examples of what good looks like, gets you something far more useful.
A helpful variation of this is few-shot prompting, where you give the model a small set of examples before asking your actual question. Show it two or three correctly formatted legal clauses or well-structured medical summaries, and it will follow those patterns much more reliably going forward.
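In practice, a few-shot prompt is just careful string assembly. The sketch below shows one common pattern (instruction, then worked input/output pairs, then the real query); the helper name and the legal-clause examples are invented for illustration, and real systems vary the exact formatting.

```python
def few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, a handful of worked
    examples, then the actual query, so the model imitates the
    demonstrated pattern when it completes the final 'Output:'."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

# Two or three examples are often enough to lock in a format.
examples = [
    ("The indemnity clause survives termination.", "Clause type: indemnity"),
    ("Either party may terminate on 30 days' notice.", "Clause type: termination"),
]
prompt = few_shot_prompt(
    "Classify the legal clause type.",
    examples,
    "Fees are payable within 14 days of invoice.",
)
```

The prompt ends mid-pattern, right at "Output:", which is exactly what nudges the model to continue in the same style as the examples.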
Prompt engineering is fast, cheap, and requires no specialist setup. The downside is that it has a ceiling. You can guide a model's responses, but you cannot fundamentally change what it knows or how deeply it understands a topic.
2. Fine-tuning
Fine-tuning goes deeper. Rather than adjusting the instructions, you adjust the model itself by training it on domain-specific data. A fine-tuned model does not just follow prompts about legal documents; it has absorbed the language, structure, and logic of legal documents at a level that prompt engineering cannot reach.
The results are more consistent and reliable across a wide range of situations within that domain. The trade-off is real: fine-tuning requires quality training data, additional computing resources, and ongoing upkeep if the domain changes. It is a meaningful investment. For the right use case, it is well worth it.
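The shape of fine-tuning can be shown with a deliberately tiny stand-in. The toy below is not a language model: it is a one-parameter-pair linear model, "pretrained" on general data and then fine-tuned on a small domain dataset by continuing gradient descent from its existing weights rather than starting from scratch. That reuse of learned weights is the essential move; everything else here (the data, the learning rate) is invented for illustration.

```python
def train(w, b, data, lr=0.1, epochs=200):
    """One-feature linear model (y = w*x + b) trained by gradient
    descent on squared error. Crucially, training resumes from the
    (w, b) passed in; fine-tuning adapts existing weights rather
    than relearning everything from zero."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# "Pretraining" on general data that follows y = 2x.
general = [(x * 0.1, 2 * x * 0.1) for x in range(10)]
w, b = train(0.0, 0.0, general)

# Fine-tuning on a small domain dataset with a shifted pattern,
# y = 2x + 1: the model keeps what transfers (the slope) and
# adjusts what differs (the offset).
domain = [(x * 0.1, 2 * x * 0.1 + 1) for x in range(10)]
w_ft, b_ft = train(w, b, domain)
```

In real systems the model has billions of weights and the "domain dataset" is thousands of curated documents, but the mechanics are the same: a small, targeted dataset steers an already-capable model.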
3. Knowledge distillation
When you have access to a high-performing large model, distillation lets you transfer its expertise into a smaller, faster, more deployable version. This is the most involved approach of the three, but it produces the strongest results for organisations that need to scale AI operations efficiently.
These three approaches are not mutually exclusive. Most robust AI deployments combine them, starting with prompt engineering to quickly test and validate an idea, adding fine-tuning for depth and consistency, then using distillation when efficiency at scale becomes the priority.
Small language models you may already recognise
The SLM space has moved fast. Several well-known models already demonstrate what is possible:
Qwen 2.5 (Alibaba): A family of open models whose smallest variants (roughly 0.5 to 3 billion parameters) are built for resource-constrained environments and competitive on a wide range of language tasks.
Llama 3.2 (Meta): The 1B and 3B versions run well on limited hardware and handle tasks like summarisation and instruction-following effectively.
Phi 3.5 Mini (Microsoft): A model that performs comparably to systems many times its size, largely thanks to careful training on high-quality curated data.
DistilBERT: A distilled version of the influential BERT model that keeps 97% of its language understanding ability using only 60% of the original's size.
DistilWhisper: A smaller, faster version of OpenAI's Whisper speech recognition model, built for efficient audio processing.
These models show that the performance gap between large and small has been closing quickly. In many focused applications, there is no longer much of a gap at all.
Where this is heading
Small language models are not a replacement for large ones. For genuinely open-ended tasks requiring broad knowledge and flexible reasoning, large models will remain the better choice.
But the more interesting future may lie in hybrid systems: a large model handling high-level reasoning and coordination, with a network of small specialist models doing the focused, high-volume work. Each part of the system doing what it does best.
The field is also moving fast on model compression. Researchers keep finding that large models contain more redundancy than expected, and that removing it carefully can produce models that are smaller, faster, and nearly as capable. Models with 1 to 2 billion parameters are now beginning to rival models with 7 to 13 billion on specific tasks.
The future of AI does not just lie in going bigger. It lies in being smarter about where and how you deploy it.

