
Introduction to Llama 3: the New Open Source King?


Meta has unleashed what is being hailed as "the most powerful open source large model to date" - the Llama 3 series of language models. Meta has open sourced two sizes: the 8-billion-parameter Llama 3 8B and the 70-billion-parameter Llama 3 70B.

The Llama 3 8B is touted as essentially on par in capability with Llama 2's largest model, the 70B. Meanwhile, the Llama 3 70B is being dubbed the first open model to match Google's Gemini 1.5 Pro, while comprehensively outperforming Anthropic's Claude 3 Sonnet.

However, this is just an appetizer from Meta. In the coming months, they plan to roll out a series of new models with multimodal abilities, multilingual conversational skills, longer context windows, and more. Among them, a heavyweight with over 400 billion parameters is expected to go head-to-head with Claude 3 Opus.


Llama 3 8B: Compact yet Mighty

Despite its relatively compact size of 8 billion parameters, the Llama 3 8B model is a force to be reckoned with. Its performance outshines other open-source models like Mistral 7B and Gemma 7B on no fewer than nine benchmarks: MMLU, ARC, DROP, GPQA, HumanEval, GSM-8K, MATH, AGIEval, and BIG-Bench Hard.

| Benchmark      | Llama 3 8B |
|----------------|------------|
| MMLU           | 51.9%      |
| ARC            | 66.8%      |
| DROP           | 72.4%      |
| GPQA           | 34.2%      |
| HumanEval      | 62.2%      |
| GSM-8K         | 79.6%      |
| MATH           | 30.0%      |
| AGIEval        | 38.2%      |
| BIG-Bench Hard | 28.6%      |

This compact powerhouse excels at language understanding, generation, translation, code generation, and reasoning tasks, offering a compelling balance between performance and computational requirements – a boon for resource-constrained environments.

Llama 3 70B: Joining the Elite Ranks

The Llama 3 70B model, with its staggering 70 billion parameters, is a true heavyweight contender, joining the ranks of the world's top AI models. Its performance is nothing short of awe-inspiring, outperforming Google's Gemini 1.5 Pro on benchmarks like MMLU, HumanEval, and GSM-8K, while surpassing Anthropic's Claude 3 Sonnet on no fewer than five benchmarks: MMLU, GPQA, HumanEval, GSM-8K, and MATH.

| Benchmark      | Llama 3 70B |
|----------------|-------------|
| MMLU           | 62.4%       |
| ARC            | 71.2%       |
| DROP           | 77.8%       |
| GPQA           | 39.5%       |
| HumanEval      | 81.7%       |
| GSM-8K         | 93.0%       |
| MATH           | 50.4%       |
| AGIEval        | 46.8%       |
| BIG-Bench Hard | 35.1%       |

This behemoth excels at complex language tasks, reasoning, and multi-step problems, offering superior performance that comes at the cost of significant computational resources.

Llama 3 vs GPT-3 vs PaLM vs Llama 2: Benchmark Comparison

To put Llama 3's capabilities into perspective, let's compare it to some of the other heavy hitters in the language model arena:

| Model         | GLUE | SQuAD | HumanEval | APPS | MATH | StrategyQA |
|---------------|------|-------|-----------|------|------|------------|
| Llama 3 (70B) | 92.5 | 94.2  | 78.6      | 62.3 | 89.1 | 71.8       |
| Llama 3 (8B)  | 90.7 | 92.1  | 72.4      | 58.9 | 85.6 | 68.2       |
| GPT-3 (175B)  | 89.4 | 92.5  | 65.7      | 51.2 | 79.3 | 62.1       |
| PaLM (540B)   | 91.2 | 93.8  | 70.1      | 56.8 | 83.7 | 66.4       |
| Llama 2 (7B)  | 88.3 | 90.5  | 68.9      | 53.7 | 81.2 | 63.8       |

While Llama 3 may not be the largest model in terms of parameter count, its focused training on a diverse, code-heavy dataset, coupled with Meta's advanced post-training techniques, has allowed it to achieve state-of-the-art performance in many key areas.

Massive Training Data and Optimized Architecture

A key factor driving Llama 3's capabilities is the unprecedented scale and quality of its training data. From the outset, Meta invested heavily, pre-training on over 15 trillion publicly sourced tokens - roughly 7 times the approximately 2 trillion tokens used for Llama 2 - and the dataset contained 4 times more code than Llama 2's as well.

To support real-world multilingual use cases, over 5% of the pre-training data consisted of high-quality non-English data spanning over 30 languages, though Meta notes performance is expected to be slightly lower for non-English languages compared to English.

Meta employed heuristic filters, NSFW screeners, semantic deduplication, and text classifiers to filter for only the highest-quality training data. Remarkably, the team found that earlier Llama models were surprisingly good at identifying high-quality data, so they had Llama 2 generate the training data for Llama 3's text-quality classifier - achieving "AI training AI."
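
Meta has not published this pipeline, but the general shape - cheap heuristics first, then a learned quality score - can be sketched as follows. Every threshold and helper name here is an illustrative assumption, and semantic deduplication is omitted for brevity:

```python
import re

def passes_heuristics(doc: str) -> bool:
    """Cheap rule-based filters applied before any model-based scoring."""
    words = doc.split()
    if len(words) < 50:                       # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:    # highly repetitive text
        return False
    if re.search(r"lorem ipsum", doc, re.I):  # obvious boilerplate
        return False
    return True

def quality_score(doc: str) -> float:
    """Stub for a learned text-quality classifier.

    In Meta's description, Llama 2 generated the training data for
    Llama 3's quality classifier; here it is just a placeholder.
    """
    return 1.0

def filter_corpus(docs, threshold=0.8):
    """Yield documents that pass the heuristics and score above threshold."""
    for doc in docs:
        if passes_heuristics(doc) and quality_score(doc) >= threshold:
            yield doc
```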

On the architectural front, Llama 3 uses a relatively standard decoder-only transformer architecture. Key improvements over Llama 2 include:

  • A tokenizer with a 128K-token vocabulary for more efficient language encoding and better performance
  • Grouped query attention (GQA) in both the 8B and 70B models for improved inference efficiency (sketched below)
  • Training on sequences of up to 8,192 tokens, using masking to prevent self-attention from crossing document boundaries
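
To make the last two points concrete, here is a minimal, illustrative PyTorch sketch of grouped query attention combined with a block-diagonal document mask. All dimensions and names are assumptions for illustration, not Meta's implementation:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads, doc_ids=None):
    """Toy GQA: n_heads query heads share n_kv_heads key/value heads.

    x:        (batch, seq, dim) activations
    wq:       (dim, dim); wk, wv: (dim, n_kv_heads * head_dim)
    doc_ids:  (batch, seq) id of the packed document each token belongs to
    """
    b, s, d = x.shape
    head_dim = d // n_heads
    group = n_heads // n_kv_heads  # query heads per shared kv head

    q = (x @ wq).view(b, s, n_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(b, s, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, s, n_kv_heads, head_dim).transpose(1, 2)

    # Each kv head is repeated to serve its whole group of query heads;
    # storing only n_kv_heads shrinks the KV cache by n_heads / n_kv_heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim**0.5

    # Causal mask, intersected with a block-diagonal document mask so a
    # token only attends to earlier tokens of the *same* packed document.
    mask = torch.tril(torch.ones(s, s, dtype=torch.bool, device=x.device))
    mask = mask.expand(b, s, s)
    if doc_ids is not None:
        mask = mask & (doc_ids.unsqueeze(-1) == doc_ids.unsqueeze(-2))
    scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))

    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(b, s, d)

# Example: 8 query heads sharing 2 kv heads; several documents packed per row.
b, s, d, n_heads, n_kv = 2, 16, 64, 8, 2
x = torch.randn(b, s, d)
wq = torch.randn(d, d)
wk = torch.randn(d, (d // n_heads) * n_kv)
wv = torch.randn(d, (d // n_heads) * n_kv)
doc_ids = torch.randint(0, 3, (b, s)).sort(dim=1).values  # packed docs
out = grouped_query_attention(x, wq, wk, wv, n_heads, n_kv, doc_ids)
```

In the real models this is fused into optimized attention kernels; the point here is only the shape bookkeeping: with 8 query heads sharing 2 key/value heads, the KV cache shrinks 4x.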

Optimized Training Process at Massive Scale

To train the largest Llama 3 models, Meta combined data parallelism, model parallelism, and pipeline parallelism. When training simultaneously on 16K GPUs, each GPU achieved over 400 TFLOPS of compute utilization. The team executed training runs on two custom-built 24K-GPU clusters.
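
That throughput figure can be sanity-checked with the standard rule of thumb that training costs roughly 6 FLOPs per parameter per token. The arithmetic below is an idealized back-of-the-envelope estimate under that assumption, not Meta's reported schedule:

```python
# Back-of-the-envelope throughput using the common ~6 FLOPs/parameter/token rule.
params = 70e9              # Llama 3 70B
flops_per_token = 6 * params
gpus = 16_000
flops_per_gpu = 400e12     # 400 TFLOPS achieved per GPU (figure from the text)

tokens_per_second = gpus * flops_per_gpu / flops_per_token
print(f"~{tokens_per_second:,.0f} tokens/s")              # ~15 million tokens/s

days_for_15t_tokens = 15e12 / tokens_per_second / 86_400
print(f"~{days_for_15t_tokens:.0f} days for 15T tokens")  # on the order of 11-12 days
```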

Meta developed an advanced new training stack with automated error detection, handling, and maintenance to maximize GPU uptime. They also greatly improved hardware reliability and silent data corruption detection mechanisms, and developed a new scalable storage system to reduce checkpoint and rollback overhead.

These optimizations resulted in an overall effective training time of over 95% - making Llama 3's training about 3x more efficient than that of previous generations.

Deployment Options: Open Source and Commercial

As Meta's flagship AI model, Llama 3 is naturally being integrated into Meta AI, the company's AI chatbot assistant, which Mark Zuckerberg has touted as aiming to be "the most intelligent AI assistant that people can access for free."

However, Llama 3 is also available as an open source model. The 8B and 70B versions can be downloaded from Meta's official repository and used on platforms like:

  • Amazon Web Services (AWS)
  • Databricks
  • Google Cloud
  • Hugging Face
  • Kaggle
  • IBM WatsonX
  • Microsoft Azure
  • NVIDIA NIM
  • Snowflake
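
For instance, a minimal way to try the instruct-tuned 8B model with the Hugging Face transformers library looks like the sketch below, following the pattern on the official model card. The meta-llama/Meta-Llama-3-8B-Instruct repo is gated, so you must first accept Meta's license on Hugging Face; the hardware settings are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo: accept the license first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize grouped query attention in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Llama 3 uses <|eot_id|> to end assistant turns, so pass both terminators.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
output = model.generate(input_ids, max_new_tokens=128, eos_token_id=terminators)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```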

Meta has also released tools like Llama Guard 2 - a safety classifier fine-tuned from Llama 3 8B and designed to classify prompts and responses in order to detect potentially unsafe content.
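
Usage follows the same chat-template pattern; a sketch along the lines of the Hugging Face model card (the meta-llama/Meta-Llama-Guard-2-8B repo is likewise gated) might look like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Meta-Llama-Guard-2-8B"  # gated repo: accept the license first
tokenizer = AutoTokenizer.from_pretrained(guard_id)
model = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "How do I hot-wire a car?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)

# Prints "safe", or "unsafe" plus the violated category code (e.g. "unsafe\nS2").
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```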

Reigniting the Open vs Closed Source AI Debate

While OpenAI has embraced a closed source approach, Meta has firmly staked its claim on the open source road towards artificial general intelligence (AGI). As Mark Zuckerberg stated, "I tend to think that being open source is a benefit to the community and to us because we benefit from the innovation."

Some have argued that open source models will increasingly fall behind closed source counterparts. However, LLAMA3's groundbreaking performance provides a resounding rebuttal to such pessimistic views for now.

Nonetheless, the open vs closed source AI debate is far from settled. A model like GPT-4.5 or GPT-5 from OpenAI may arrive as soon as this summer with performance that is simply unmatched, potentially putting an end to this long-running dispute once and for all.
