Meta’s next generation model, Llama 3.1 405B is now available on Azure AI (2024)

In collaboration with Meta, Microsoft is announcing Llama 3.1 405B available today through Azure AI’s Models-as-a-Service as a serverless API endpoint. The latest fine-tuned versions of Llama 3.1 8B and Llama 3.1 70B are also now available on Azure AI Model Catalog. Developers can rapidly try, evaluate and provision these models in Azure AI Studio using popular LLM developer tools like Azure AI prompt flow, OpenAI, LangChain, LiteLLM, CLI with curl and Python web requests.

We are also announcing Llama 3.1 8B, Llama 3.1 70B, Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, Llama Guard 3 8B and Prompt Guard now available in Azure AI through managed compute deployments.

We are thrilled to be one of Meta’s launch partners in this innovative release for advanced synthetic data generation and distillation where 405B-Instruct is used as a teacher model and 8B-Instruct/70B-Instruct models serving as student models.Enterprises and developers can now streamline the development process while maintaining performance and cost efficiency, leveraging AI to build complex applications for a variety industry task-specific use case.

The Growing Need for Specialized AI Models

Large Language Models (LLMs) are known for their impressive few-shot learning and reasoning abilities. However, for applications that need tailored responses, the comprehensive capabilities of larger models can be excessive. This over-qualification leads to high computational demands and increased latency, making them less suitable for specific-use scenarios.

As such, customers can leverage powerful large models as a teacher model to train small student through distillation, resulting in tailored models ready for use in domain-specific use cases:

Customer Support: Automated systems need to provide accurate, relevant responses to diverse customer queries.

Healthcare: AI-driven diagnostics and patient interaction require precise, context-sensitive information.

Legal Services: Document drafting, and legal advice must be tailored to specific legal scenarios and client needs.

Education: Personalized tutoring systems that cater to individual learning paces and styles.

Finance: Tailored financial advice and portfolio management based on individual client profiles and market conditions.

Introducing Llama 3.1 405B on Azure AI

According to Meta, Llama 3.1 405B is expected to be the largest and most powerful open-source model available, built with delivering specific capabilities to developers:

Synthetic data generation and distillation

Asignificant hurdle in customizing smaller models is the substantial computational effort required to annotate vast datasets. Here, the Llama 3.1 405B Instruct synthetic data generation capability through distillation becomes invaluable.

Direct model usage

Using a combination of quantization, speculative decoding or other optimization techniques, Llama 3.1 405B will be a highly advanced model for both batch and online inference.

Domain specific model

Llama 3.1 405B can serve as a base model for specialized continual pre-training or fine-tuning in a specific industry domain.

Open models Accessibility and Enterprise-Grade Reliability

The potential of the Llama 3.1 405B is magnified by its availability under an open license, allowing unrestricted access for commercial and research purposes. This openness encourages widespread adoption and innovation, offering developers the freedom to experiment and tailor solutions to their specific needs without the overhead of licensing restrictions.

Exploring the Llama-3.1 models and benefits on Azure AI

The Llama 3.1 collection of LLM includes pretrained and instruction-tuned generative models in 8B, 70B, and 405B sizes, supporting long context lengths (128k) and optimized for inference with grouped query attention (GQA). These models are designed for multilingual (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai) dialogue use cases. According to Meta, these models outperform many open-source chat models on industry benchmarks.

The Llama 3.1 collection of models employs an optimized transformer architecture and uses supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) for alignment with human preferences. The instruction-tuned text-only models are particularly effective for tool use, supporting zero-shot tool use and specific capabilities like search, image generation, code execution, and mathematical reasoning. For more information, see the Meta Blog post.

Why Azure AI for Meta Llama 3.1?

Developers using Llama- 3.1 models can work seamlessly with tools in Azure AI Studio, such as Azure AI Content Safety, Azure AI Search, and prompt flow to enhance ethical and effective AI practices.

From today, customers can access the following models through serverless APIs:

Meta- Llama-3.1-405B-Instruct

Meta-Llama-3.1-70B-Instruct

Meta-Llama-3.1-8B-Instruct

Through managed compute deployment, customers can provision the following models using their available quota:

Getting Started with-Llama-3.1 on Azure AI

To get started and deploy your first model, follow these clear steps: 

Familiarize Yourself: If you're new to Azure AI Studio, start by reviewing this documentation to understand the basics and set up your first project.

Access the Model Catalog: Open the model catalog in AI Studio.

Find the Model: Use the filter to select the Meta collection or click the “View models” button on the MaaS announcement card.

Select the Model: Open the Meta-Llama-3.1-405B-Instruct text model from the list.

Deploy the Model: Click on ‘Deploy’ and choose the Pay-as-you-go (PAYG) deployment option.

Subscribe and Access: Subscribe to the offer to gain access to the model (usage charges apply), then proceed to deploy it.

Explore the Playground: After deployment, you will automatically be redirected to the Playground. Here, you can explore the model's capabilities.

Customize Settings: Adjust the context or inference parameters to fine-tune the model's predictions to your needs.

Access Programmatically: Click on the “View code” button to obtain the API, keys, and a code snippet. This enables you to access and integrate the model programmatically. 
Generate Data/Distillation: Use the distillation recipe or data generation recipe to generate data and/or distill models using the deployed models.

Integrate with Tools: Use the provided API in Large Language Model (LLM) tools such as prompt flow, Semantic Kernel, LangChain, or any other tools that support REST API with key-based authentication for making inferences.

Looking Forward

Microsoft’s introduction of the Llama 3.1 405B models underscores our commitment to providing cutting-edge AI models that drive business transformation. By integrating this powerful model into your operations, customers can leverage its advanced capabilities for synthetic data generation and model distillation, producing domain task-specific models for tailored industry use cases.

FAQ 

Cost: What does it cost to use Llama 3.1 405B on Azure?

You are billed based on the number of prompt and completions tokens. You can review the pricing on the Llama 3.1 405B offer in the Azure Marketplace offer details tab when deploying the model. You can also find the pricing on the Azure Marketplace.

Regional availability: Are Llama 3.1 models' region specific on Azure?

Llama 3.1 405B,70B and 8B are available through MaaS as serverless API endpoints.

These endpoints can be created in Azure AI Studio projects or Azure Machine Learning workspaces. Cross-regional support for these endpoints is available for any region in the US.

Fine-tuning jobs for 8B Instruct and 70B Instruct are available in West US 3.

Please note that if you would like to use any of these 3 MaaS models in prompt flow within Azure AI Studio projects or Azure Machine Learning workspaces in other regions, you can use the API endpoint and key as a connection to prompt flow manually. Meaning which, you can use the AI endpoint from any Azure region once it’s been created in East US 2 (for 405B Instruct, 70B Instruct, 8B Instruct) and/or in Sweden Central (70B Instruct, 8B Instruct).

GPU capacity quota: Which models do I require GPU capacity quota in my Azure subscription?

Meta-Llama -3.1-405B-Instruct, Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-8B-Instruct are available through MaaS as serverless API endpoints. You don’t require GPU capacity quota in your Azure subscription to deploy these models.

However, if you would like to deploy any of: Meta-Llama-3.1-70B-Instruct, Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.1-70B, Meta-Llama-3.1-8B, Llama-Guard-3-8B and Prompt-Guard-86M, provided you have the relevant associated GPU capacity quota availability as part of a managed compute offering, you will be able to deploy these models.

Azure Marketplace: Llama3.1 405B is listed on the Azure Marketplace. Can I purchase and use Llama 3.1 405B directly from Azure Marketplace?

Azure Marketplace is our foundation for commercial transactions for models built on or built for Azure. The Azure Marketplace enables the purchasing and billing of Llama 3.1 405B. However, model discoverability occurs in both Azure Marketplace and the Azure AI model catalog. Meaning you can search and find Llama 3.1 405B in both the Azure Marketplace and Azure AI Model Catalog.

If you search for Llama 3.1 405B in Azure Marketplace, you can subscribe to the offer before being redirected to the Azure AI Model Catalog in Azure AI Studio where you can complete subscribing and can deploy the model.

If you search for Llama 3.1 405B in the Azure AI Model Catalog, you can subscribe and deploy the model from the Azure AI Model Catalog without starting from the Azure Marketplace. The Azure Marketplace still tracks the underlying commerce flow.

The above is true for Llama 3.1 70B and Llama 3.1 8B as MaaS models, where the commerce flow is supported by Azure Marketplace.

MACC: Given that Llama 3.1 405Bis billed through the Azure Marketplace, does it retire my Azure consumption commitment (aka MACC)?

Yes, Llama 3.1 405B is an “Azure benefit eligible” Marketplace offer, which indicates MACC eligibility. Learn more about MACC here: https://learn.microsoft.com/en-us/marketplace/azure-consumption-commitment-benefit

Data privacy: Is my inference data shared with Meta?

No, Microsoft does not share the content of any inference request or response data with Meta.

Microsoft acts as the data processor for prompts and outputs sent to and generated by a model deployed for pay-as-you-go inferencing (MaaS). Microsoft doesn't share these prompts and outputs with the model provider, and Microsoft doesn't use these prompts and outputs to train or improve Microsoft's, the model providers, or any third party's models. Read more on data, security and privacy for Models-as-a-Service.

Are there rate limits for the Meta models on Azure?

Meta models come with 400 K tokens per minute and 1 K requests per minute limit. Reach out to Azure customer support if this doesn’t suffice.

Can I use MaaS models in any Azure subscription types?

Customers can use MaaS models in all Azure subsection types with a valid payment method, except for the CSP (Cloud Solution Provider) program. Free or trial Azure subscriptions are not supported.

Can I fine-tune the Llama 3.1 405B model? What about other models?

Not yet for 405B Instruct – stay tuned!

Models available to fine-tune today:

Deployment as serverless API (MaaS): 8B Instruct and 70B Instruct.

Deployment as managed compute: 8B Instruct, 70B Instruct, 8B, 70B.