We are excited to introduce the DeepSpeed- and Megatron-powered Megatron-Turing Natural Language Generation model (MT-NLG), the largest and the most powerful monolithic transformer language model trained to date, with 530 billion parameters. It is the result of a research collaboration between Microsoft and NVIDIA to further parallelize and optimize the training of very large AI models.

As the successor to Turing NLG 17B and Megatron-LM, MT-NLG has 3x the number of parameters compared to the existing largest model of this type and demonstrates unmatched accuracy in a broad set of natural language tasks such as:

  • Completion prediction
  • Reading comprehension
  • Commonsense reasoning
  • Natural language inferences
  • Word sense disambiguation

The 105-layer, transformer-based MT-NLG improved upon the prior state-of-the-art models in zero-, one-, and few-shot settings and set the new standard for large-scale language models in both model scale and quality.

Large-scale language models

Transformer-based language models in natural language processing (NLP) have driven rapid progress in recent years fueled by computation at scale, large datasets, and advanced algorithms and software to train these models.

Language models with large numbers of parameters, more data, and more training time acquire a richer, more nuanced understanding of language. As a result, they generalize well as effective zero– or few-shot learners, with high accuracy on many NLP tasks and datasets. Exciting downstream applications include summarization, automatic dialogue generation, translation, semantic search, and code autocompletion. It’s no surprise that the number of parameters in state-of-the-art NLP models have grown at an exponential rate (Figure 1).

Training such models, however, is challenging for two main reasons:

It is no longer possible to fit the parameters of these models in the memory of even the largest GPU.

The large number of compute operations required can result in unrealistically long training times if special attention is not paid to optimizing the algorithms, software, and hardware stack all together.

Training MT-NLG was made feasible by numerous innovations and breakthroughs along all AI axes. For example, working closely together, NVIDIA and Microsoft achieved an unprecedented training efficiency by converging a state-of-the-art GPU-accelerated training infrastructure with a cutting-edge distributed learning software stack. We built high-quality, natural language training corpora with hundreds of billions of tokens, and co-developed training recipes to improve optimization efficiency and stability.

In this post, we elaborate on each aspect of the training and describe our methods as well as results.

Large-scale training infrastructure

Powered by NVIDIA A100 Tensor Core GPUs and HDR InfiniBand networking, state-of-the-art supercomputing clusters such as the NVIDIA Selene and Microsoft Azure NDv4 have enough compute power to train models with trillions of parameters within a reasonable timeframe. However, achieving the full potential of these supercomputers requires parallelism across thousands of GPUs, efficient and scalable on both memory and compute.

In isolation, existing parallelism strategies such as data, pipeline, or tensor-slicing have trade-offs in memory and compute efficiency and cannot be used to train models at this scale.

Data parallelism achieves good compute efficiency, but it replicates model states and cannot leverage aggregate distributed memory.

Tensor-slicing requires significant communication between GPUs that limits compute efficiency beyond a single node where high-bandwidth NVLink is not available.

Pipeline parallelism can scale efficiently across nodes. However, to be compute-efficient, it requires large batch sizes, coarse grain parallelism, and perfect load balancing, which is not possible at scale.

Software design

Through a collaboration between NVIDIA Megatron-LM and Microsoft DeepSpeed, we created an efficient and scalable 3D parallel system capable of combining data, pipeline, and tensor-slicing based parallelism together to address these challenges.

By combining tensor-slicing and pipeline parallelism, we can operate them within the regime where they are most effective. More specifically, the system uses tensor-slicing from Megatron-LM to scale the model within a node and uses pipeline parallelism from DeepSpeed to scale the model across nodes.

For example, for the 530 billion model, each model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes. We then use data parallelism from DeepSpeed to scale out further to thousands of GPUs.

Hardware system

Model training is done with mixed precision on the NVIDIA DGX SuperPOD-based Selene supercomputer powered by 560 DGX A100 servers networked with HDR InfiniBand in a full fat tree configuration. Each DGX A100 has eight NVIDIA A100 80GB Tensor Core GPUs, fully connected to each other by NVLink and NVSwitch. A similar reference architecture is used by Microsoft for Azure NDv4 cloud supercomputers.

System throughput

We considered the end-to-end throughput of our system for the 530 billion parameters model with batch size 1920 on 280, 350, and 420 DGX A100 servers on Selene. We observed iteration time of 60.1, 50.2, and 44.4 seconds, respectively. These correspond to 126, 121, and 113 teraFLOP/s per GPU, respectively.

Training dataset and model configuration

We used the architecture of the transformer decoder, which is a left-to-right generative transformer-based language model consisting of 530 billion parameters. The number of layers, hidden dimensions, and attention heads are 105, 20480, and 128, respectively.

We used an 8-way tensor and 35-way pipeline parallelism. The sequence length is 2048 and the global batch size is 1920. Over the first 12 billion training tokens, we gradually increased the batch size by 32, starting at 32, until we reach the final batch size of 1920. We used one billion tokens for the learning rate warmup in our training.

We largely built our training dataset based on prior work, The Pile. First, we selected the subset of datasets (the top 11 rows in Figure 2, below) from The Pile that we found to be of the highest relative quality. Then, following a similar approach as that used to generate Pile-CC, we downloaded and filtered two recent Common Crawl (CC) snapshots.

The steps we took for the CC data included text extraction from raw HTML files, scoring extracted documents using a classifier trained on high-quality data, and filtering documents according to their scores. To diversify the training, we also collected the RealNews and CC-Stories datasets.

Document deduplication is necessary in building training datasets because the same content can be present in multiple documents of different datasets. We used a fuzzy deduplication process at the document level using min-hash LSH to compute a sparse document graph and the connected components in it to identify duplicate documents.

We then used a priority order based on the quality of the datasets when selecting a representative document from the duplicate documents in each connected component. Finally, we used n-gram based filtering to remove downstream task data from the training datasets to avoid contamination.

We ended with a set of 15 datasets consisting of a total of 339 billion tokens. During training, we opted to blend the datasets into heterogeneous batches according to variable sampling weights given in Figure 2, with an emphasis on higher-quality datasets. We trained the model on 270 billion tokens.

WordPress Appliance - Powered by TurnKey Linux