Intel Is Ready For Meta Llama 3 GenAI Workloads: Optimized For Xeon & Core Ultra CPUs, Arc GPUs & Gaudi Accelerators

Why this is important: As part of its mission to make AI ubiquitous, Intel invests in software and the AI ecosystem to ensure its products are ready for the latest innovations in the dynamic AI space. In the data center, Gaudi accelerators and Xeon processors with Intel Advanced Matrix Extensions (AMX) acceleration give customers options to meet dynamic and wide-ranging needs.

Intel Core Ultra processors and Arc graphics products provide both a native development vehicle and deployment on millions of devices, with support for comprehensive software frameworks and tools, including PyTorch and the Intel Extension for PyTorch used for local research and development, and the OpenVINO toolkit for model deployment and inference.

About Llama 3 running on Intel: Intel's initial testing and performance results for the Llama 3 8B and 70B models use open-source software to provide the latest optimizations, including PyTorch, DeepSpeed, the Optimum Habana library, and the Intel Extension for PyTorch (a minimal inference sketch follows the list below).

  • Intel Gaudi 2 accelerators have optimized performance on the Llama 2 models (7B, 13B, and 70B parameters) and now have preliminary performance measurements for the new Llama 3 model. With the maturity of the Gaudi software, Intel easily ran the new Llama 3 model and generated results for both inference and fine-tuning. Llama 3 is also supported on the recently announced Gaudi 3 accelerator.
  • Intel Xeon processors address demanding end-to-end AI workloads, and Intel invests in optimizing LLM results to reduce latency. Xeon 6 processors with Performance-cores (code-named Granite Rapids) show a 2x improvement in Llama 3 8B inference latency over 4th Gen Xeon processors and the ability to run large language models like Llama 3 70B at under 100 ms per generated token.
  • Intel Core Ultra and Arc graphics deliver impressive performance for Llama 3. In early testing, Core Ultra processors already generate output faster than typical human reading speed. Further, the Arc A770 GPU has Xe Matrix eXtensions (XMX) AI acceleration and 16GB of dedicated memory to deliver exceptional performance for LLM workloads.
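For orientation, the sketch below shows the generic Hugging Face Transformers path for running Llama 3 8B Instruct inference. It is a minimal illustration only; the Intel-specific pieces named above (Optimum Habana for Gaudi, DeepSpeed, the Intel Extension for PyTorch for Xeon) layer on top of this flow, and their exact integration calls are not shown.

```python
# Minimal Llama 3 8B Instruct text-generation sketch with Hugging Face Transformers.
# The Intel-specific backends mentioned in the article plug into this same workflow;
# their integration details are assumptions and are omitted here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated model; requires access approval

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Explain what AMX acceleration does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```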

Xeon Scalable Processors

Intel is continuously improving LLM inference on Xeon platforms. For example, compared with the Llama 2 launch software, current optimizations in PyTorch and the Intel Extension for PyTorch deliver a 5x latency reduction. The optimization uses paged attention and tensor parallelism to maximize the use of available compute and memory bandwidth. Figure 1 shows the performance of Meta Llama 3 8B inference on an AWS m7i.metal-48x instance, which is based on a 4th Gen Xeon Scalable processor.

Figure 1: Meta Llama 3 8B inference performance on an AWS m7i.metal-48x instance (4th Gen Xeon Scalable processor).
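The Xeon results described above rely on the Intel Extension for PyTorch. The sketch below illustrates how such an optimization pass is typically applied to a Hugging Face model on a Xeon host; the ipex.llm.optimize call and its arguments are an assumption based on recent Intel Extension for PyTorch releases rather than the exact configuration used for these measurements, and the multi-socket tensor-parallel setup (e.g. via DeepSpeed) is omitted.

```python
# Sketch: applying Intel Extension for PyTorch (IPEX) optimizations to a
# Hugging Face causal LM on a Xeon CPU. Assumes the ipex.llm.optimize entry
# point from recent IPEX releases; flags and defaults may differ by version.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Rewrite attention/MLP blocks with IPEX's optimized CPU kernels (bf16 here,
# which maps onto AMX tiles on 4th Gen Xeon and newer).
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

inputs = tokenizer("What is paged attention?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```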

We benchmarked Meta Llama 3 on a Xeon 6 processor with Performance-cores (formerly code-named Granite Rapids) to share a performance preview. These preview numbers show that Xeon 6 offers a 2x improvement over widely available 4th Gen Xeon processors on Llama 3 8B inference latency, and the ability to run large language models like Llama 3 70B at less than 100 ms per generated token on a two-socket server.

| Model | TP | Precision | Input length | Output length | Throughput | Latency* | Batch size |
|---|---|---|---|---|---|---|---|
| Meta-Llama-3-8B-Instruct | 1 | fp8 | 2k | 4k | 1549.27 tokens/sec | 7.747 ms | 12 |
| Meta-Llama-3-8B-Instruct | 1 | bf16 | 1k | 3k | 469.11 tokens/sec | 8.527 ms | 4 |
| Meta-Llama-3-70B-Instruct | 8 | fp8 | 2k | 4k | 4927.31 tokens/sec | 56.23 ms | 277 |
| Meta-Llama-3-70B-Instruct | 8 | bf16 | 2k | 2k | 3574.81 tokens/sec | 60.425 ms | 216 |
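The throughput and latency columns are mutually consistent under the relationship latency ≈ batch size / throughput. The short check below is written against the table values above, assuming the final column is batch size (a reconstruction of the original table's heading, supported by the arithmetic).

```python
# Consistency check for the table above: per-token latency (ms) should be
# roughly batch_size / throughput * 1000, assuming the final column is batch size.
rows = [
    ("Meta-Llama-3-8B-Instruct fp8",   1549.27,  7.747,  12),
    ("Meta-Llama-3-8B-Instruct bf16",   469.11,  8.527,   4),
    ("Meta-Llama-3-70B-Instruct fp8",  4927.31, 56.23,  277),
    ("Meta-Llama-3-70B-Instruct bf16", 3574.81, 60.425, 216),
]

for name, throughput_tok_s, latency_ms, batch in rows:
    implied_latency_ms = batch / throughput_tok_s * 1000.0
    print(f"{name:34s} reported {latency_ms:7.3f} ms, implied {implied_latency_ms:7.3f} ms")
```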

Client Platforms

In early evaluation, the Intel Core Ultra processor already generates output faster than typical human reading speed. These results are driven by the built-in Arc GPU with 8 Xe-cores, including DP4a AI acceleration, and up to 120 GB/s of system memory bandwidth. We're excited to continue investing in performance and power-efficiency improvements for Llama 3, especially as we move to our next-generation processors.

With launch-day support for Core Ultra processors and Arc graphics products, the collaboration between Intel and Meta provides both a native development vehicle and deployment on millions of devices. Intel client hardware is accelerated through comprehensive software frameworks and tools, including PyTorch and the Intel Extension for PyTorch used for local research and development, and the OpenVINO toolkit for model deployment and inference.
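For the client deployment path described here, a minimal optimum-intel/OpenVINO sketch might look like the following. The export flag and device selection are illustrative assumptions based on the public optimum-intel workflow, not the exact setup Intel used for its Core Ultra and Arc measurements.

```python
# Sketch: exporting Llama 3 8B Instruct to OpenVINO IR and running it on an
# Intel GPU (e.g. the built-in Arc GPU in Core Ultra) via optimum-intel.
# export=True and the "GPU" device string are assumptions, not article settings.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.to("GPU")  # target the Intel GPU; "CPU" also works

inputs = tokenizer("Summarize what OpenVINO is in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```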

What's next: In the coming months, Meta expects to introduce new capabilities, additional model sizes, and improved performance. Intel will continue to optimize the performance of its AI products to support this new LLM.
