Michael Dong, Greetings!
Do we support batch inference for model serving?
The latency of deploying a fine-tuned model like LLaMA 3.1-8B on an A100 GPU can vary based on several factors, including the model architecture, the batch size, and the framework used for deployment.
I would suggest checking the blog post for more details.
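For a rough sense of how batch size affects latency in practice, here is a minimal measurement sketch using Hugging Face Transformers; the model id, prompt, and generation settings are placeholders, and you would substitute your own fine-tuned checkpoint.

```python
# Hedged sketch: assumes torch + transformers installed and an A100 (CUDA) available.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; point at your fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # needed so batched prompts can be padded
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt = "Explain batch inference in one sentence."
for batch_size in (1, 4, 8):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {elapsed:.2f}s total, {elapsed / batch_size:.2f}s per request")
```

Numbers from a sketch like this are only indicative; the serving framework and optimizations (quantization, KV-cache settings, tensor parallelism) can change them significantly.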
What is the cost of each A100 GPU?
The cost of using an A100 GPU on Azure depends on the specific VM size and region.
You can check "Estimate costs before using Azure AI services" and "Monitor costs for models offered through the Azure Marketplace".
Also, see Cost and quotas for more details.
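As a back-of-the-envelope illustration (not a quote), you could estimate a monthly figure like this; the hourly rate and VM size below are assumptions, so please check the Azure pricing calculator for the actual rate in your region.

```python
# Hedged estimate: the hourly rate is a placeholder, not an official Azure price.
hourly_rate_usd = 3.67      # assumed example rate for a single-A100 VM size
hours_per_month = 24 * 30   # always-on deployment
gpu_count = 1

monthly_cost = hourly_rate_usd * hours_per_month * gpu_count
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")
```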
Do we support vLLM to serve model inference?
Do you mean vision models? Azure supports vLLM for serving model inference.
See this for details.
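If it is the vLLM inference library you mean, a minimal offline-inference sketch looks like the following; the model id is a placeholder, and you would point it at your own weights. Recent vLLM versions can also expose an OpenAI-compatible HTTP server (for example via the `vllm serve <model>` command).

```python
# Minimal vLLM sketch (pip install vllm); model id and prompts are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B")       # or the path to your fine-tuned checkpoint
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = ["What is batch inference?", "Summarize vLLM in one line."]
outputs = llm.generate(prompts, params)           # vLLM batches the prompts internally
for out in outputs:
    print(out.outputs[0].text)
```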
Do we support batch inference for model serving?
Yes, you can configure your model serving setup to handle batch inference.
Please see "Deploy models for scoring in batch endpoints" and "Overview of Batch Inferencing".
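As a rough illustration of invoking an existing batch endpoint with the Azure ML Python SDK v2 (azure-ai-ml), here is a sketch; the endpoint name, workspace details, and data path are placeholders for your own resources.

```python
# Hedged sketch: assumes azure-ai-ml and azure-identity are installed and a batch
# endpoint with a default deployment already exists in your workspace.
from azure.ai.ml import MLClient, Input
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Submit a folder of input files to the batch endpoint for scoring.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",  # placeholder endpoint name
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/batch-inputs/",  # placeholder path
    ),
)
print(f"Submitted batch scoring job: {job.name}")
```

The batch endpoint then runs the scoring job asynchronously over the whole input folder, which is what distinguishes it from the real-time (online) endpoints used for low-latency requests.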
Do let me know if that helps or if you have any other queries.