What is a TaskBot?
Regular conversational agents, or “SocialBots”, simply aim to engage users in general conversation around topics such as politics, travel, or movies. A “TaskBot” is an AI system that works with a user to accomplish a specific goal [1], for example, walking a user through identifying what’s wrong with their bike and then supporting them in fixing it. We focused on domains that require precise explanations, such as cooking, crafts, and do-it-yourself (DIY) projects.
Network of Small Language Models (SLMs)
Our solution was to develop a system based on multiple Small Language Models (SLMs). These open-source AI models have between 100 million and 10 billion parameters (10-100x smaller than GPT-3), allowing secure on-prem deployment on a single GPU.
Due to latency requirements, we predominantly used T5 [5] and BERT [6] large models, which are under 1 billion parameters. Each SLM could focus on a specific task:
- Offline pipelines: We used SLMs to automatically extract millions of tasks from trusted sources, e.g., Gordon Ramsay’s cooking recipes.
- Functionalities: We created specialised SLMs for different sub-components, e.g., searching for tasks, answering questions, or using other tools such as setting timers.
- Neural Decision Parser: This was the brain of the system, managing the conversational state. Specifically, this SLM selected which sub-system to call given a user utterance (see the sketch after this list).
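As a rough illustration, each sub-component can be thought of as a callable registered under the name the Neural Decision Parser emits. The stubs and names below (search_tasks, answer_question, set_timer, SUBSYSTEMS) are hypothetical placeholders, not the actual system interfaces.

```python
# Hypothetical sketch: specialised sub-components behind a common registry.
# Each entry maps a name the Neural Decision Parser can generate (e.g. "Search")
# to the stub that would invoke the corresponding SLM or tool.
from typing import Callable, Dict

def search_tasks(query: str) -> str:
    """Stub for the task-search SLM (e.g. finding a shepherd's pie recipe)."""
    return f"Top task result for '{query}'"

def answer_question(question: str) -> str:
    """Stub for the question-answering SLM."""
    return f"Answer to '{question}'"

def set_timer(minutes: int) -> str:
    """Stub for a non-neural tool, such as setting a timer."""
    return f"Timer set for {minutes} minutes"

SUBSYSTEMS: Dict[str, Callable] = {
    "Search": search_tasks,
    "QA": answer_question,
    "SetTimer": set_timer,
}
```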
Overall, this system met all our criteria, retaining the positive aspects of LLMs (natural, flexible, scalable) but with a more controlled experience, low latency, and affordable self-hosting.
Example: using an SLM to manage conversational state
With a flexible network of SLMs in place, we wondered: how do we actually manage these diverse daily conversations? We developed the Neural Decision Parser, a T5 Large SLM (700m parameters) that takes in the conversation context and task state and generates Python code to call system sub-components. For example, if the user says “shepherd’s pie please”, the Neural Decision Parser will generate parameterised code, Search("shepherd's pie"), to call the task search system. This SLM “brain” allows for flexible handling of complex conversations.
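A minimal sketch of that decision step is shown below, assuming a fine-tuned T5 checkpoint loaded with the Hugging Face transformers library; the checkpoint path, prompt format, and function name are illustrative placeholders rather than our exact implementation.

```python
# Minimal sketch of the Neural Decision Parser at inference time.
# "path/to/neural-decision-parser" and the prompt format are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/neural-decision-parser")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/neural-decision-parser")

def parse_decision(conversation: str, task_state: str) -> str:
    """Generate a parameterised sub-system call from conversation context and task state."""
    prompt = f"task state: {task_state} | conversation: {conversation}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

call = parse_decision("User: shepherd's pie please", "no active task")
# Illustrative output: Search("shepherd's pie")
```

The generated string can then be matched against the registry of sub-components sketched earlier and executed with its parameters.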
Training these SLMs, however, requires large volumes of labelled conversational data, which would be slow and expensive to annotate by hand. Our solution was to use knowledge distillation [7] to create scalable, high-quality synthetic data to train our SLMs. Distillation is where we use the output of a larger, more computationally intensive AI model (the “teacher”) to train a smaller AI model (the “student”) that is suitable for production, with lower latency and a lower 24/7 resource cost. Thus, unlike a manual annotation process, distillation creates high-quality training data for machine learning models in a scalable and low-cost manner.
Example: Distilling the Neural Decision Parser
As discussed earlier, the Neural Decision Parser is a key component that manages conversational flows. This model takes into account the conversation context and task state, and needs to output the correct function call. This is a complex semantic parsing problem with a large input space (conversational context) and output space (parameterised sub-system calls). Furthermore, we need the model’s inference latency to be under 1 second, and the model must be capable of being self-hosted on a small GPU. However, due to the size of the input/output space, it would not be practical to manually annotate a dataset capable of training an SLM for this task.
Therefore, we used distillation to create a large, high-quality synthetic dataset. We started with hundreds of thousands of raw tasks and some initial seed annotations. We used a large model [3] (175 billion parameters) to synthesise conversations for different task states (our synthetic inputs). We then used another pass of LLM prompting to create the synthetic outputs. We kept domain experts in the loop to improve the chaining of the LLMs and refine the LLM prompts, ensuring the generated data was of sufficient quality.
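The sketch below illustrates the two-pass generation, assuming a generic call_teacher_llm helper standing in for the large teacher model; the prompt wording is hypothetical and, as noted above, the real prompts were refined iteratively with domain experts.

```python
# Hypothetical sketch of the two-pass synthetic data generation.
# call_teacher_llm stands in for a call to the large teacher model;
# the prompt templates are illustrative only.

def call_teacher_llm(prompt: str) -> str:
    """Placeholder for querying the large teacher model."""
    raise NotImplementedError

def synthesise_example(task: str, task_state: str) -> dict:
    # Pass 1: synthesise a plausible conversation for this task and state
    # (the synthetic input).
    conversation = call_teacher_llm(
        f"Write a short user/assistant exchange about the task '{task}' "
        f"where the current task state is '{task_state}'."
    )
    # Pass 2: label the conversation with the sub-system call the
    # Neural Decision Parser should produce (the synthetic output).
    target_call = call_teacher_llm(
        "Given the conversation below, output the single parameterised "
        "sub-system call the assistant should make next.\n" + conversation
    )
    return {
        "input": f"task state: {task_state} | conversation: {conversation}",
        "target": target_call,
    }
```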
We used this large dataset to fine-tune a T5 Large SLM (0.7bn parameters) that achieved performance on par with GPT-3. However, this SLM was 100x smaller, had an inference latency of 0.5 seconds versus 4-5 minutes for our data generation pipeline, and could be deployed on a small GPU for £700/month.
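A minimal fine-tuning sketch is shown below, assuming the synthetic examples have been written to a JSONL file with "input" and "target" fields; the file path and hyperparameters are illustrative, not the values we used.

```python
# Sketch: fine-tuning the T5 Large student on the distilled dataset.
# "synthetic_decisions.jsonl" and all hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

dataset = load_dataset("json", data_files="synthetic_decisions.jsonl")["train"]

def preprocess(batch):
    # Tokenise the conversational context as the encoder input and the
    # parameterised call as the decoder target.
    model_inputs = tokenizer(batch["input"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="neural-decision-parser",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```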
A continuously improving system
Another benefit of having a cost-effective and scalable way of creating training data was that we could regularly re-run the distillation pipelines. Specifically, the system could be updated based on new task data and user sessions. This led to a system that continuously improved, becoming better and more robust for users.
Results
The winning formula for the Alexa Prize TaskBot Challenge was a network of specialised SLMs and a scalable way to create high-quality training data. This is reflected in strong user ratings that showed a consistent uptrend as the system learned and improved. Our approach was especially effective for complex, long conversations (over 3 minutes), which require a flexible system to succeed.
1. Gottardi, Anna, Osman Ipek, et al. “Alexa, Let’s Work Together: Introducing the First Alexa Prize TaskBot Challenge on Conversational Task Assistance.” arXiv preprint arXiv:2209.06321 (2022).
2. Xie, Tian, et al. “Converse: A tree-based modular task-oriented dialogue system.” arXiv preprint arXiv:2203.12187 (2022).
3. Brown, Tom B., et al. “Language models are few-shot learners.” arXiv preprint arXiv:2005.14165 (2020).
4. Foosherian, Mina, et al. “Enhancing pipeline-based conversational agents with large language models.” arXiv preprint arXiv:2309.03748 (2023).
5. Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research 21.140 (2020): 1-67.
6. Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019).
7. Gou, Jianping, et al. “Knowledge distillation: A survey.” International Journal of Computer Vision 129.6 (2021): 1789-1819.