WizardLM 2: Microsoft's Groundbreaking Advancement in Large Language Models

Microsoft has recently unveiled WizardLM 2, a family of large language models that represents a significant step forward for open-source AI. The family includes WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B. According to Microsoft, these models show marked improvements in complex chat, multilingual understanding, reasoning, and agent capabilities, surpassing both their predecessor, WizardLM, and other leading open-source models.

The Evolution of WizardLM

WizardLM 2 is the culmination of Microsoft's ongoing efforts to scale up large language model post-training. Over the past year, the company has been iterating on the training of the Wizard series, starting with their work on empowering large language models to follow complex instructions. They then accelerated the evolution to code and math reasoning scenarios. As a result, Evol-Instruct and Instruction&Process Supervised Reinforcement Learning (RLEIF) have become fundamental technologies for the GenAI community.

WizardLM 2 Models

The WizardLM 2 family consists of three cutting-edge models, each designed to cater to specific needs and performance requirements:

WizardLM-2 8x22B: Microsoft's most advanced model and, in the company's internal evaluation, the best open-source LLM for highly complex tasks.
WizardLM-2 70B: Reaches top-tier reasoning capabilities and is positioned as the first choice in its size category.
WizardLM-2 7B: The fastest model, with performance comparable to leading open-source models 10 times its size.

Method Overview

As human-generated data becomes increasingly exhausted, Microsoft believes that data carefully created by AI and models supervised by AI will be the sole path towards more powerful AI. To achieve this, they have built a fully AI-powered synthetic training system, which consists of several key components:

Data Pre-Processing

The data pre-processing pipeline includes the following steps:

Data Analysis: This step helps to understand the distribution of different attributes in the new source data.
Weighted Sampling: The distribution of the best training data is not always consistent with the natural distribution of human chat corpora. Therefore, the weights of various attributes in the training data are adjusted based on experimental experience.
Progressive Learning: Unlike the common practice of using all data for one-time training, Microsoft found that using different data partitions and progressively training stage-by-stage can achieve better results with less data.
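
The three pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Microsoft's pipeline: the `attribute` field, the weight values, and all function names are hypothetical, and the source does not say how the partitions are actually chosen (a simple round-robin split is assumed here).

```python
import random
from collections import Counter

def attribute_distribution(examples):
    """Data analysis: fraction of examples carrying each attribute value."""
    counts = Counter(ex["attribute"] for ex in examples)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}

def weighted_sample(examples, weights, k, seed=0):
    """Weighted sampling: draw k examples (with replacement), scaling each
    example's chance by the weight assigned to its attribute."""
    rng = random.Random(seed)
    per_example = [weights.get(ex["attribute"], 1.0) for ex in examples]
    return rng.choices(examples, weights=per_example, k=k)

def partition_into_stages(examples, num_stages):
    """Progressive learning: split the data into disjoint partitions,
    one per training stage (round-robin split, an assumption)."""
    return [examples[i::num_stages] for i in range(num_stages)]

# Toy corpus: 25 "code" and 75 "chat" examples (hypothetical attributes).
corpus = [{"id": i, "attribute": "code" if i % 4 == 0 else "chat"}
          for i in range(100)]
print(attribute_distribution(corpus))   # natural distribution
sampled = weighted_sample(corpus, {"code": 3.0, "chat": 1.0}, k=1000)
print(attribute_distribution(sampled))  # "code" is now upweighted
stages = partition_into_stages(sampled, num_stages=3)
```

In a real run, each entry of `stages` would feed one training stage in turn instead of the whole dataset being used in a single pass.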

Evol Lab

The Evol Lab is responsible for generating more diverse and complex [instruction, response] pairs. It consists of two main components:

Evol-Instruct: This method enables various agents to automatically generate high-quality instructions.
Evol-Answer: Guides the model to generate and rewrite responses multiple times, which can improve their logic, correctness, and affinity.
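
The two Evol Lab components can be pictured as a pair of loops around a text generator. The sketch below is an assumption-laden illustration: `generate` is a hypothetical stand-in for any LLM call (prompt in, text out), the rewrite templates are paraphrased examples and not the prompts Microsoft uses, and the round counts are arbitrary.

```python
import random

# Illustrative rewrite prompts (hypothetical wording).
EVOLVE_TEMPLATES = [
    "Rewrite the instruction below so it adds one more constraint:\n{instruction}",
    "Rewrite the instruction below so it needs more reasoning steps:\n{instruction}",
    "Write a new instruction in the same domain, on a rarer topic:\n{instruction}",
]

def evolve_instruction(instruction, generate, rounds=2, seed=0):
    """Evol-Instruct sketch: repeatedly ask the generator to rewrite the
    instruction into a harder or more diverse one."""
    rng = random.Random(seed)
    for _ in range(rounds):
        template = rng.choice(EVOLVE_TEMPLATES)
        candidate = generate(template.format(instruction=instruction)).strip()
        # Keep the previous instruction if the rewrite is empty or unchanged.
        if candidate and candidate != instruction:
            instruction = candidate
    return instruction

def evolve_answer(instruction, generate, rewrites=2):
    """Evol-Answer sketch: draft a response, then rewrite it several times."""
    answer = generate(instruction)
    for _ in range(rewrites):
        answer = generate(
            f"Instruction:\n{instruction}\n\nDraft answer:\n{answer}\n\n"
            "Rewrite the draft to be more correct and logical."
        )
    return answer

def build_pair(seed_instruction, generate):
    """Produce one [instruction, response] pair from a seed instruction."""
    instruction = evolve_instruction(seed_instruction, generate)
    return instruction, evolve_answer(instruction, generate)
```

Passing the generator in as a parameter keeps the sketch independent of any particular model or API; a production system would also filter out failed evolutions before training on the pairs.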

AI Align AI (AAA)

AI Align AI (AAA) is a framework that collects WizardLMs and various state-of-the-art models to co-teach and improve each other. It consists of two main components:

Co-Teaching: The models engage in simulated chat, quality judging, improvement suggestions, and closing skill gaps to teach and improve each other.
Self-Teaching: WizardLM can also generate new evolution training data for supervised learning, and preference data for reinforcement learning, by actively learning from itself.

Learning

The learning process involves three main steps:

Supervised Learning: The models are trained using labeled data.
Stage-DPO: For more effective offline reinforcement learning, the preference data is split into different slices, and the model is progressively improved stage by stage.
RLEIF: This approach employs instruction quality reward models (IRM) combined with process supervision reward models (PRM) to achieve more precise correctness in online reinforcement learning.
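
To make the Stage-DPO step more concrete, the sketch below pairs the standard DPO loss for a single preference pair with a loop that trains on one slice of preference data at a time. The loss follows the published DPO formulation; everything else is an assumption for illustration: the slicing rule, the `beta` value, and the hypothetical `train_on_slice` callable are not specified by the source.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Inputs are log-probabilities of the chosen and rejected responses under
    the policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), equal to -log(sigmoid(z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def split_into_slices(pairs, num_stages):
    """Cut the preference data into num_stages contiguous slices."""
    n = len(pairs)
    return [pairs[i * n // num_stages:(i + 1) * n // num_stages]
            for i in range(num_stages)]

def stage_dpo(model, pairs, num_stages, train_on_slice):
    """Improve the model stage by stage, one slice per stage.

    train_on_slice(model, slice) -> model is a hypothetical training routine
    that would minimise dpo_loss over the slice.
    """
    for stage_pairs in split_into_slices(pairs, num_stages):
        model = train_on_slice(model, stage_pairs)
    return model
```

When the policy prefers the chosen response more strongly than the reference model does, the margin is positive and the loss falls below log 2, its value when the policy and the reference agree.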

WizardLM 2 Capabilities

To evaluate the performance of WizardLM 2, Microsoft conducted both human and automatic evaluations, comparing its models with a diverse set of baselines. According to the reported results, WizardLM 2 is highly competitive with leading proprietary models and consistently outperforms the existing state-of-the-art open-source models.

Human Preferences Evaluation

In a blind pairwise comparison, WizardLM 2 models were evaluated against baselines using a complex and challenging set of real-world instructions. The results showed that:

WizardLM-2 8x22B is just slightly behind GPT-4-1106-preview and significantly stronger than Command R Plus and GPT-4-0314.
WizardLM-2 70B is better than GPT-4-0613, Mistral-Large, and Qwen1.5-72B-Chat.
WizardLM-2 7B is comparable with Qwen1.5-32B-Chat and surpasses Qwen1.5-14B-Chat and Starling-LM-7B-beta.

MT-Bench

Microsoft also adopted the automatic MT-Bench evaluation framework based on GPT-4 to assess the performance of their models. The results are summarized in the following table:

Model	Performance
WizardLM-2 8x22B	Highly competitive with GPT-4-Turbo and Claude-3
WizardLM-2 70B	Top-performing model among leading baselines at 70B scale
WizardLM-2 7B	Top-performing model among leading baselines at 7B scale

These results demonstrate that WizardLM 2 models consistently outperform other open-source models in their respective size categories and are highly competitive with the most advanced proprietary models.

Usage

The model weights of WizardLM-2 8x22B and WizardLM-2 7B are shared on Hugging Face; WizardLM-2 70B and a demo of all the models are to follow in the coming days. To preserve generation quality, users should use exactly the system prompt provided by Microsoft.

WizardLM-2 adopts the prompt format from Vicuna and supports multi-turn conversation. The prompt should be as follows:

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: Hi
ASSISTANT: Hello.
USER: Who are you?
ASSISTANT: I am WizardLM.
...
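
A small helper can assemble this prompt from a conversation history. The sketch below reproduces the layout exactly as shown above, with one line per turn; the function name and the turn representation are hypothetical, and the separators or end-of-sequence tokens expected by the released checkpoints should be confirmed against the official model card.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the "
    "user's questions."
)

def build_prompt(turns, system=SYSTEM_PROMPT):
    """Build a Vicuna-style multi-turn prompt.

    turns is a list of (user_message, assistant_reply) tuples; use None as
    the reply of the final turn to leave it open for the model to complete.
    """
    lines = [system]
    for user, assistant in turns:
        lines.append(f"USER: {user}")
        if assistant is None:
            lines.append("ASSISTANT:")
        else:
            lines.append(f"ASSISTANT: {assistant}")
    return "\n".join(lines)

prompt = build_prompt([("Hi", "Hello."), ("Who are you?", None)])
print(prompt)
```

The resulting string ends with an open `ASSISTANT:` line, so the model's continuation becomes the next reply.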

Microsoft also provides WizardLM-2 inference demo code in its GitHub repository.

Conclusion

WizardLM 2 represents a significant milestone in the development of large language models, showcasing Microsoft's commitment to advancing the field of artificial intelligence. By leveraging innovative training methodologies, such as Evol-Instruct, RLEIF, and AAA, Microsoft has created a family of models that consistently outperform existing open-source alternatives and rival the most advanced proprietary models.

As the AI community continues to explore the capabilities of WizardLM 2 and build upon its foundations, we can expect to see further advancements in natural language processing, reasoning, and agent interactions. With the release of these models and the accompanying research, Microsoft has set the stage for a new era of AI-driven innovation that will transform the way we interact with technology and each other.
