Michael Dong, Greetings!
Do we support batch inference for model serving?
The latency of deploying a fine-tuned model like LLaMA 3.1-8B on an A100 GPU can vary based on several factors, including the model architecture, the batch size, and the framework used for deployment.
I would suggest checking the blog post for more details.
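For a rough sense of how batch size affects latency in practice, here is a minimal measurement sketch using Hugging Face Transformers; the model id, prompt, and generation settings are placeholders, and you would substitute your own fine-tuned checkpoint.

```python
# Hedged sketch: assumes torch + transformers installed and an A100 (CUDA) available.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; point at your fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # needed so batched prompts can be padded
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt = "Explain batch inference in one sentence."
for batch_size in (1, 4, 8):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.2f}s total, {elapsed / batch_size:.2f}s per request")
```

Numbers from a sketch like this are only indicative; the serving framework and optimizations (quantization, KV-cache settings, tensor parallelism) can change them significantly.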
What is the cost of each A100 GPU?
The cost of using an A100 GPU on Azure depends on the specific VM size and region.
You can check "Estimate costs before using Azure AI services" and "Monitor costs for models offered through the Azure Marketplace".
Also, see Cost and quotas for more details.
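As a back-of-the-envelope illustration (not a quote), you could estimate a monthly figure like this; the hourly rate and VM size below are assumptions, so please check the Azure pricing calculator for the actual rate in your region.

```python
# Hedged estimate: the hourly rate is a placeholder, not an official Azure price.
hourly_rate_usd = 3.67      # assumed example rate for a single-A100 VM size
hours_per_month = 24 * 30   # always-on deployment
gpu_count = 1

monthly_cost = hourly_rate_usd * hours_per_month * gpu_count
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")
```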
Do we support vLLM to serve model inference?
Do you mean vision models? Azure supports vLLM for serving model inference.
See this for details.
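If it is the vLLM inference library you mean, a minimal offline-inference sketch looks like the following; the model id is a placeholder, and you would point it at your own weights. Recent vLLM versions can also expose an OpenAI-compatible HTTP server (for example via the `vllm serve <model>` command).

```python
# Minimal vLLM sketch (pip install vllm); model id and prompts are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B")       # or the path to your fine-tuned checkpoint
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = ["What is batch inference?", "Summarize vLLM in one line."]
outputs = llm.generate(prompts, params)           # vLLM batches the prompts internally
for out in outputs:
    print(out.outputs[0].text)
```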
Do we support batch inference for model serving?
Yes, you can configure your model serving setup to handle batch inference.
Please see "Deploy models for scoring in batch endpoints" and "Overview of Batch Inferencing".
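As a rough illustration of invoking an existing batch endpoint with the Azure ML Python SDK v2 (azure-ai-ml), here is a sketch; the endpoint name, workspace details, and data path are placeholders for your own resources.

```python
# Hedged sketch: assumes azure-ai-ml and azure-identity are installed and a batch
# endpoint with a default deployment already exists in your workspace.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Submit a folder of input files to the batch endpoint for scoring.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",  # placeholder endpoint name
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/batch-inputs/",  # placeholder path
    ),
)
print(f"Submitted batch scoring job: {job.name}")
```

The batch endpoint then runs the scoring job asynchronously over the whole input folder, which is what distinguishes it from the real-time (online) endpoints used for low-latency requests.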
Do let me know if that helps or if you have any other queries.