
NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer

Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts the performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.

Meta’s Llama 3.1 405B large language model (LLM) is reaching new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The improvements have delivered up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Excellent Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has already delivered exceptional inference throughput for Llama 3.1 405B since the model’s release. This was achieved through various optimizations, including in-flight batching, KV caching, and optimized attention kernels. These techniques have accelerated inference performance while taking advantage of lower-precision compute.
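
To see why KV caching matters, consider the toy Python sketch below (illustrative only: a single attention head on random data, not TensorRT-LLM’s implementation). At each decode step only the newest token’s key and value are computed and appended, so earlier keys and values never need to be recomputed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k_cache, v_cache):
    # Scaled dot-product attention of the newest query against
    # every cached key/value pair.
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_cache

d_head = 64
k_cache = np.empty((0, d_head))
v_cache = np.empty((0, d_head))

for step in range(4):  # autoregressive decode loop
    # In a real model q, k, v come from projecting the newest token's
    # hidden state; random vectors stand in here.
    q, k, v = (np.random.randn(1, d_head) for _ in range(3))
    # Append only the new key/value instead of recomputing them for
    # the whole sequence: this is the saving the KV cache provides.
    k_cache = np.concatenate([k_cache, k])
    v_cache = np.concatenate([v_cache, v])
    out = attend(q, k_cache, v_cache)  # shape (1, d_head)
```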

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
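
As a rough illustration of what static versus dynamic scaling factors mean (back-of-envelope NumPy, not the official recipe’s code), the FP8 E4M3 format can represent magnitudes up to 448, so a per-tensor scaling factor is simply the observed absolute maximum divided by that limit:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def static_scale(calibration_batches):
    # Static scaling factor: computed once offline from calibration
    # data, then fixed at inference time.
    amax = max(float(np.abs(b).max()) for b in calibration_batches)
    return amax / FP8_E4M3_MAX

def dynamic_scale(tensor):
    # Dynamic scaling factor: recomputed from the live tensor's range
    # at every step, trading a little overhead for a tighter fit.
    return float(np.abs(tensor).max()) / FP8_E4M3_MAX

def fake_quant_fp8(tensor, scale):
    # Simulated FP8 round trip: divide by the scale, clip to the
    # format's range, and rescale. (Real FP8 also rounds the mantissa.)
    clipped = np.clip(tensor / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return clipped * scale

calib = [np.random.randn(1024) * 3 for _ in range(8)]  # toy calibration set
s = static_scale(calib)
x_hat = fake_quant_fp8(np.random.randn(1024), s)
```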

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, improves Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. The recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
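
For readers who want to try this themselves, TensorRT Model Optimizer ships as the nvidia-modelopt Python package. The sketch below shows the general PTQ flow under some assumptions: API names follow the library’s documentation and may vary across versions, the tiny model and random calibration data are placeholders for Llama 3.1 405B and a real calibration set, and NVIDIA’s production recipe configures additional options (such as the FP8 KV cache quantization mentioned above) beyond the default config used here.

```python
import torch
import modelopt.torch.quantization as mtq

# Toy stand-ins for Llama 3.1 405B and a real calibration set.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64)
)
calib_batches = [torch.randn(8, 64) for _ in range(16)]

def forward_loop(m):
    # Run calibration data through the model so activation ranges
    # (amax values) can be recorded for the static scaling factors.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

# Apply a default FP8 post-training quantization configuration.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```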

Table 1 shows the maximum throughput performance, with significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.

Maximum Throughput Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        463.1          320.1             71.5
Official Llama FP8 Recipe           399.9          230.8             49.6
Speedup                             1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.

Batch Size = 1 Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        49.6           44.2              27.2
Official Llama FP8 Recipe           37.4           33.1              22.8
Speedup                             1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method significantly reduces the required memory footprint by compressing the weights down to 4-bit integers while encoding activations in FP16.
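
The arithmetic is straightforward: at 4 bits per weight, 405 billion parameters occupy roughly 405e9 × 0.5 bytes ≈ 203 GB, which fits within two H200s’ combined 282 GB of HBM3e with room left for FP16 activations and the KV cache, whereas even an FP8 copy of the weights (roughly 405 GB) would not. The NumPy sketch below shows the basic group-wise INT4 weight quantization idea; it is illustrative only, since real AWQ also rescales salient weight channels using activation statistics, and the group size of 128 is an assumption.

```python
import numpy as np

def int4_groupwise_quantize(w, group_size=128):
    # Toy group-wise 4-bit quantization to the signed range [-8, 7],
    # with one FP16 scale per group of `group_size` weights. Real AWQ
    # also rescales salient channels from activation statistics before
    # quantizing; that step is omitted here.
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    # Weights are expanded back to FP16 for the matmul, while
    # activations stay in FP16 throughout.
    return q.astype(np.float16) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # stand-in weight matrix
q, s = int4_groupwise_quantize(w)
w_hat = dequantize(q, s).reshape(w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```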

Tables 4 and 5 present the maximum throughput and minimum latency performance measurements; per NVIDIA, the INT4 AWQ method also delivers accuracy scores comparable to Meta’s official Llama 3.1 FP8 recipe.

Maximum Throughput Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

Batch Size = 1 Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths      2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ    21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for improved performance and efficiency in running large language models like Llama 3.1 405B. These improvements offer developers greater flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.

Image source: Shutterstock

