Mistral 7B: Setting New Benchmarks Beyond Llama2 in the Open-Source Space

Large Language Models (LLMs) have recently taken center stage, thanks to standout performers like ChatGPT. When Meta introduced their Llama models, it sparked a renewed interest in open-source LLMs. The aim? To create affordable, open-source LLMs that are as good as top-tier models such as GPT-4, but without the hefty price tag or complexity.

This mix of affordability and efficiency not only opened up new avenues for researchers and developers but also set the stage for a new era of technological advancements in natural language processing.

Recently, generative AI startups have been on a roll with funding. Together raised $20 million, aiming to shape open-source AI. Anthropic also raised an impressive $450 million, and Cohere, partnering with Google Cloud, secured $270 million in June this year.

Introduction to Mistral 7B: Size & Availability

Mistral AI, based in Paris and co-founded by alums from Google’s DeepMind and Meta, announced its first large language model: Mistral 7B. This model can be easily downloaded by anyone from GitHub and even via a 13.4-gigabyte torrent.

This startup secured record-breaking seed funding before it even had a product out. Its first model, with 7 billion parameters, surpasses Llama 2 13B in all tests and beats Llama 1 34B on many metrics.

Compared to other models like Llama 2, Mistral 7B provides similar or better capabilities but with less computational overhead. While foundational models like GPT-4 can achieve more, they come at a higher cost and aren't as user-friendly since they're mainly accessible through APIs.

When it comes to coding tasks, Mistral 7B gives CodeLlama 7B a run for its money. Plus, it's compact enough at 13.4 GB to run on standard machines.
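As a rough sanity check on that size, the weight footprint of a model can be estimated as parameter count times bytes per parameter. The sketch below uses the nominal "7 billion" figure from the article (the exact parameter count is somewhat higher) and assumes 16-bit weights; the function name is purely illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal parameter count taken from the "7B" in the model name.
NOMINAL_PARAMS = 7e9

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # 16-bit weights
fp32_gb = weight_memory_gb(NOMINAL_PARAMS, 4)  # 32-bit weights

print(f"fp16: ~{fp16_gb:.1f} GB, fp32: ~{fp32_gb:.1f} GB")
```

The 16-bit estimate lands in the same ballpark as the 13.4 GB download, which suggests the published weights are stored in a 16-bit format. Activations and the key-value cache add to this at inference time.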

Additionally, Mistral 7B Instruct, a version fine-tuned on instruction datasets available on Hugging Face, has shown great performance. It outperforms other 7B models on MT-Bench and stands shoulder to shoulder with 13B chat models.

Hugging Face Mistral 7B Example

Performance Benchmarking

In a detailed performance analysis, Mistral 7B was measured against the Llama 2 family of models. The results were clear: Mistral 7B substantially surpassed Llama 2 13B across all benchmarks. In fact, it matched the performance of Llama 1 34B, standing out especially in code and reasoning benchmarks.

The benchmarks were organized into several categories, such as Commonsense Reasoning, World Knowledge, Reading Comprehension, Math, and Code, among others. A particularly noteworthy observation was Mistral 7B's cost-performance metric, termed “equivalent model sizes”. In areas like reasoning and comprehension, Mistral 7B demonstrated performance akin to a Llama 2 model three times its size, signifying potential savings in memory and an uptick in throughput. However, in knowledge benchmarks, Mistral 7B aligned closely with Llama 2 13B, which is likely due to its limited parameter count restricting how much knowledge it can compress.

What really makes Mistral 7B better than most other language models?

Simplifying Attention Mechanisms

While the subtleties of attention mechanisms are technical, their foundational idea is relatively simple. Imagine reading a book and highlighting important sentences; this is analogous to how attention mechanisms “highlight” or give importance to specific data points in a sequence.

In the context of language models, these mechanisms enable the model to focus on the most relevant parts of the input data, ensuring the output is coherent and contextually accurate.

In standard transformers, attention scores are calculated with the formula:

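$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the keys.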

The formula for these scores involves a crucial step – the matrix multiplication of Q and K. The challenge here is that as the sequence length grows, both matrices expand accordingly, leading to a computationally intensive process. This scalability concern is one of the major reasons why standard transformers can be slow, especially when dealing with long sequences.
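To make the cost concrete, here is a minimal pure-Python sketch of standard scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, on a toy input. The function names and the tiny matrices are illustrative assumptions; real implementations use batched tensor libraries. The point to notice is that the score matrix has one entry per pair of tokens, so it grows with the square of the sequence length.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of rows."""
    d_k = len(K[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k): an n x n matrix for n tokens
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]
    # output_i is the attention-weighted average of the value rows
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Toy example: 3 tokens, key/query dimension 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
print(len(weights), "x", len(weights[0]), "score matrix")
```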

Attention mechanisms help models focus on specific parts of the input data. Typically, these mechanisms use “heads” to manage this attention. The more heads you have, the more specific the attention, but it also becomes more complex and slower.

Multi-query attention (MQA) speeds things up by sharing a single key-value head across all query heads, but it sometimes sacrifices quality. Now, you might wonder, why not combine the speed of MQA with the quality of multi-head attention? That's where grouped-query attention (GQA) comes in.

Grouped-query Attention (GQA)


GQA is a middle-ground solution. Instead of giving every query head its own key-value head (as in multi-head attention) or sharing a single one across all of them (as in MQA), it splits the query heads into groups, and each group shares one key-value head. This way, GQA achieves quality close to multi-head attention with speed close to MQA. For models like Mistral, this means efficient performance without compromising too much on quality.
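The grouping can be sketched as a simple mapping from query heads to key-value heads. The helper names below are hypothetical, and the head counts in the usage lines (32 query heads, 8 key-value heads) are assumed here to match Mistral 7B's published configuration; treat them as illustrative rather than authoritative.

```python
def kv_head_for_query_head(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Return the index of the key-value head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key-value groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_fraction(n_query_heads: int, n_kv_heads: int) -> float:
    """Key-value cache size relative to full multi-head attention."""
    return n_kv_heads / n_query_heads


# Assumed head counts (see lead-in); the same helper covers all three schemes.
N_QUERY_HEADS = 32

mha = [kv_head_for_query_head(h, N_QUERY_HEADS, 32) for h in range(N_QUERY_HEADS)]  # one KV head each
mqa = [kv_head_for_query_head(h, N_QUERY_HEADS, 1) for h in range(N_QUERY_HEADS)]   # one shared KV head
gqa = [kv_head_for_query_head(h, N_QUERY_HEADS, 8) for h in range(N_QUERY_HEADS)]   # groups of 4

print("GQA mapping:", gqa)
print("GQA cache vs. MHA:", kv_cache_fraction(N_QUERY_HEADS, 8))
```

With these numbers, each key-value head serves four query heads, so the key-value cache is a quarter of what multi-head attention would need.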

Sliding Window Attention (SWA)

The sliding window is another method used in processing attention sequences. This method uses a fixed-size attention window around each token in the sequence. With multiple layers stacking this windowed attention, the top layers eventually gain a broader perspective, encompassing information from the entire input. This mechanism is analogous to the receptive fields seen in Convolutional Neural Networks (CNNs).
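A minimal sketch of the idea, assuming a causal (decoder-style) window in which each token sees itself and the tokens immediately before it. The function names are hypothetical, and the window size of 4096 and depth of 32 layers in the last line are assumed here to be the values reported for Mistral 7B.

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[i][j] is True when token i may attend to token j.

    Causal variant: token i sees itself and the (window - 1) tokens before it.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def receptive_field(window: int, n_layers: int) -> int:
    """Upper bound on how far back information can travel through stacked layers."""
    return window * n_layers


# Toy mask: 5 tokens, window of 2
mask = sliding_window_mask(5, 2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Assumed values for Mistral 7B (see lead-in)
print(receptive_field(4096, 32))
```

Each layer only looks back one window, but because layer k reads hidden states that already summarize a window from layer k - 1, the reach grows roughly linearly with depth, much like stacked convolutions.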

On the other hand, the “dilated sliding window attention” of the Longformer model, which is conceptually similar to the sliding window method, computes just a few diagonals of the $QK^T$ matrix. This change results in memory usage increasing linearly rather than quadratically, making it a more efficient method for longer sequences.
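The linear-versus-quadratic difference can be checked by counting score entries. This is a simplified sketch with hypothetical helper names: it treats full attention as storing one score per token pair and windowed attention as storing at most one score per token per window position, ignoring dilation and edge effects.

```python
def full_attention_entries(seq_len: int) -> int:
    """Every token scores every token: seq_len * seq_len entries."""
    return seq_len * seq_len


def windowed_attention_entries(seq_len: int, window: int) -> int:
    """Each token scores at most `window` positions (upper bound)."""
    return seq_len * min(window, seq_len)


WINDOW = 512  # illustrative window size, not taken from the article

for n in (4096, 8192, 16384):
    print(n, full_attention_entries(n), windowed_attention_entries(n, WINDOW))
```

Doubling the sequence length quadruples the full-attention count but only doubles the windowed one.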

Mistral AI's Transparency vs. Safety Concerns in Decentralization

In their announcement, Mistral AI also emphasized transparency with the statement: “No tricks, no proprietary data.” At the same time, their only available model at the moment, “Mistral-7B-v0.1”, is a pretrained base model, so it can generate a response to any query without moderation, which raises potential safety concerns. While models like GPT and Llama have mechanisms to discern when to respond, Mistral's fully decentralized nature could be exploited by bad actors.

However, the decentralization of Large Language Models has its merits. While some might misuse it, people can harness its power for societal good and make intelligence accessible to all.

Deployment Flexibility

One of the highlights is that Mistral 7B is available under the Apache 2.0 license. This means there aren't any real barriers to using it, whether you're an individual using it for personal purposes, a huge corporation, or even a governmental entity. You just need the right system to run it, or you might have to invest in cloud resources.

While there are other licenses such as the simpler MIT License and the cooperative CC BY-SA-4.0, which mandates credit and similar licensing for derivatives, Apache 2.0 provides a robust foundation for large-scale endeavors.

Final Thoughts

The rise of open-source Large Language Models like Mistral 7B signifies a pivotal shift in the AI industry, making high-quality language models accessible to a wider audience. Mistral AI's innovative approaches, such as Grouped-query attention and Sliding Window Attention, promise efficient performance without compromising quality.

While the decentralized nature of Mistral poses certain challenges, its flexibility and open-source licensing underscore the potential for democratizing AI. As the landscape evolves, the focus will inevitably be on balancing the power of these models with ethical considerations and safety mechanisms.

Up next for Mistral? The 7B model was just the beginning. The team aims to launch even bigger models soon. If these new models match the 7B's performance, Mistral might quickly rise as a top player in the industry, all within their first year.

The post Mistral 7B: Setting New Benchmarks Beyond Llama2 in the Open-Source Space appeared first on Unite.AI.