NVIDIA GPUs Power Meta's Next-Gen Llama 3 Model, Optimized AI Across All Platforms Including RTX

NVIDIA has announced that Meta's Llama 3 LLMs were built with NVIDIA GPUs and are optimized to run on all platforms, from servers to PCs.

Meta's next-gen Llama 3 LLMs are here, and NVIDIA is the driving force behind them, with optimized support in the cloud, at the edge, and on RTX PCs.

NVIDIA today announced optimizations across all of its platforms to accelerate Meta Llama 3, the latest generation of the large language model (LLM). The open model, combined with NVIDIA accelerated computing, equips developers, researchers, and enterprises to innovate responsibly across a wide variety of applications.

Trained on NVIDIA AI

Meta engineers trained Llama 3 on a compute cluster of 24,576 NVIDIA H100 Tensor Core GPUs connected by an NVIDIA Quantum-2 InfiniBand network. In collaboration with NVIDIA, Meta tuned its network, software, and model architectures for its flagship LLM.

To further advance the state of the art in generative AI, Meta recently outlined plans to scale its infrastructure to 350,000 H100 GPUs.

Putting Llama 3 to work

Versions of Llama 3, accelerated on NVIDIA GPUs, are available today for use in the cloud, the data center, at the edge, and on PCs.

Image Source: Wccftech (AI-generated)

Businesses can fine-tune Llama 3 with their own data using NVIDIA NeMo, an open-source framework for LLMs that is part of the secure, supported NVIDIA AI Enterprise platform. Custom models can be optimized for inference with NVIDIA TensorRT-LLM and deployed with NVIDIA Triton Inference Server.

Bringing Llama 3 to Devices and PCs

Llama 3 also runs on NVIDIA Jetson Orin for robotics and edge computing devices, creating interactive agents like those in the Jetson AI Lab. What's more, NVIDIA RTX and GeForce RTX GPUs for workstations and PCs speed inference on Llama 3. These systems give developers a target of more than 100 million NVIDIA-accelerated systems worldwide.

Getting the best performance with Llama 3

Best practices for deploying an LLM for a chatbot involve balancing low latency, good reading speed, and GPU usage to keep serving costs down. Such a service needs to deliver tokens, the rough equivalent of words to an LLM, at about twice a user's reading speed, which is about 10 tokens/second.

Applying these metrics, in initial tests a single NVIDIA H200 Tensor Core GPU running a version of Llama 3 with 70 billion parameters generated about 3,000 tokens/second, enough to serve about 300 users simultaneously. That means a single NVIDIA HGX server with eight H200 GPUs could deliver 24,000 tokens/second, further lowering costs by supporting more than 2,400 users at the same time.
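As a back-of-the-envelope check, the capacity figures above follow directly from dividing measured throughput by the per-user target rate. The sketch below just redoes that arithmetic; the constants are the article's published numbers, not additional benchmark data.

```python
# Rough serving-capacity estimate from the article's figures.
TOKENS_PER_USER = 10      # target tokens/second per user (~2x reading speed)
GPU_THROUGHPUT = 3_000    # measured tokens/second, one H200 running Llama 3 70B
GPUS_PER_SERVER = 8       # GPUs in an NVIDIA HGX server

users_per_gpu = GPU_THROUGHPUT // TOKENS_PER_USER
server_throughput = GPU_THROUGHPUT * GPUS_PER_SERVER
users_per_server = server_throughput // TOKENS_PER_USER

print(users_per_gpu)      # 300 concurrent users per GPU
print(server_throughput)  # 24000 tokens/second per server
print(users_per_server)   # 2400 concurrent users per server
```

Real deployments would also account for batching overhead and latency targets, so these figures are an upper bound rather than a sizing guide.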

For edge devices, a version of Llama 3 with eight billion parameters generated up to 40 tokens/second on Jetson AGX Orin and up to 15 tokens/second on Jetson Orin Nano.

Advancing community models

An active open-source contributor, NVIDIA is committed to improving community software that helps customers solve their toughest challenges. Open-source models also promote AI transparency and let users broadly share work on AI safety and resilience.
