Use vision-enabled chat models
Vision-enabled chat models are large multimodal models (LMMs) developed by OpenAI that can analyze images and provide textual responses to questions about them. They incorporate both natural language processing and visual understanding. The current vision-enabled models are GPT-4 Turbo with Vision, GPT-4o, and GPT-4o-mini.
The vision-enabled models answer general questions about what's present in the images you upload.
Tip
To use vision-enabled models, you call the Chat Completion API on a supported model that you have deployed. If you're not familiar with the Chat Completion API, see the Vision-enabled chat how-to guide.
GPT-4 Turbo model upgrade
The latest GA release of GPT-4 Turbo is:
- gpt-4 Version: turbo-2024-04-09

This is the replacement for the following preview models:
- gpt-4 Version: 1106-Preview
- gpt-4 Version: 0125-Preview
- gpt-4 Version: vision-preview
Differences between OpenAI and Azure OpenAI GPT-4 Turbo GA Models
- OpenAI's version of the latest 0409 turbo model supports JSON mode and function calling for all inference requests.
- Azure OpenAI's version of the latest turbo-2024-04-09 currently doesn't support the use of JSON mode and function calling when making inference requests with image (vision) input. Text-based input requests (requests without image_url and inline images) do support JSON mode and function calling.
Differences from gpt-4 vision-preview
- Azure AI-specific Vision enhancements integration with GPT-4 Turbo with Vision isn't supported for gpt-4 Version: turbo-2024-04-09. This includes Optical Character Recognition (OCR), object grounding, video prompts, and improved handling of your data with images.
Important
Vision enhancements preview features, including Optical Character Recognition (OCR), object grounding, and video prompts, will be retired and no longer available once gpt-4 Version: vision-preview is upgraded to turbo-2024-04-09. If you are currently relying on any of these preview features, this automatic model upgrade will be a breaking change.
GPT-4 Turbo provisioned managed availability
gpt-4 Version: turbo-2024-04-09 is available for both standard and provisioned deployments. Currently, the provisioned version of this model doesn't support image/vision inference requests; provisioned deployments of this model accept text input only. Standard model deployments accept both text and image/vision inference requests.
Deploying GPT-4 Turbo with Vision GA
To deploy the GA model from the AI Foundry portal, select GPT-4 and then choose the turbo-2024-04-09 version from the dropdown menu. The default quota for the gpt-4-turbo-2024-04-09 model is the same as the current quota for GPT-4 Turbo. See the regional quota limits.
Call the Chat Completion APIs
The following command shows the most basic way to use the GPT-4 Turbo with Vision model with code. If this is your first time using these models programmatically, we recommend starting with our GPT-4 Turbo with Vision quickstart.
Send a POST request to https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/{DEPLOYMENT_NAME}/chat/completions?api-version=2024-02-15-preview
where
- RESOURCE_NAME is the name of your Azure OpenAI resource
- DEPLOYMENT_NAME is the name of your GPT-4 Turbo with Vision model deployment
Required headers:
- Content-Type: application/json
- api-key: {API_KEY}
Body: The following is a sample request body. The format is the same as the chat completions API for GPT-4, except that the message content can be an array containing text and images (either a valid HTTP or HTTPS URL to an image, or a base-64-encoded image).
Important
Remember to set a "max_tokens" value, or the return output will be cut off.
Important
When uploading images, there is a limit of 10 images per chat request.
{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this picture:"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<image URL>"
                    }
                }
            ]
        }
    ],
    "max_tokens": 100,
    "stream": false
}
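The same request can also be sent from code. The following is a minimal Python sketch using the requests library; the resource name, deployment name, API key, and image URL are placeholders for your own values.

import requests

# Placeholders: substitute your own resource name, deployment name, and API key.
RESOURCE_NAME = "<your-resource-name>"
DEPLOYMENT_NAME = "<your-deployment-name>"
API_KEY = "<your-api-key>"

url = (
    f"https://{RESOURCE_NAME}.openai.azure.com/openai/deployments/"
    f"{DEPLOYMENT_NAME}/chat/completions?api-version=2024-02-15-preview"
)
headers = {"Content-Type": "application/json", "api-key": API_KEY}

# Same body as the sample request above.
body = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this picture:"},
                {"type": "image_url", "image_url": {"url": "<image URL>"}}
            ]
        }
    ],
    "max_tokens": 100,
    "stream": False
}

response = requests.post(url, headers=headers, json=body)
print(response.json())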
Tip
Use a local image
If you want to use a local image, you can use the following Python code to convert it to base64 so it can be passed to the API. Alternative file conversion tools are available online.
import base64
from mimetypes import guess_type

# Function to encode a local image into a data URL
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"

# Example usage
image_path = '<path_to_image>'
data_url = local_image_to_data_url(image_path)
print("Data URL:", data_url)
When your base64 image data is ready, you can pass it to the API in the request body like this:
...
"type": "image_url",
"image_url": {
    "url": "data:image/jpeg;base64,<your_image_data>"
}
...
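For example, continuing the Python sketch above, you could build the user message content with the generated data URL like this (the prompt text is only illustrative):

# Build the user message content using the data URL from local_image_to_data_url
user_content = [
    {"type": "text", "text": "Describe this picture:"},
    {"type": "image_url", "image_url": {"url": data_url}}
]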
Output
The API response should look like the following.
{
    "id": "chatcmpl-8VAVx58veW9RCm5K1ttmxU6Cm4XDX",
    "object": "chat.completion",
    "created": 1702439277,
    "model": "gpt-4",
    "prompt_filter_results": [
        {
            "prompt_index": 0,
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The picture shows an individual dressed in formal attire, which includes a black tuxedo with a black bow tie. There is an American flag on the left lapel of the individual's jacket. The background is predominantly blue with white text that reads \"THE KENNEDY PROFILE IN COURAGE AWARD\" and there are also visible elements of the flag of the United States placed behind the individual."
            },
            "content_filter_results": {
                "hate": {
                    "filtered": false,
                    "severity": "safe"
                },
                "self_harm": {
                    "filtered": false,
                    "severity": "safe"
                },
                "sexual": {
                    "filtered": false,
                    "severity": "safe"
                },
                "violence": {
                    "filtered": false,
                    "severity": "safe"
                }
            }
        }
    ],
    "usage": {
        "prompt_tokens": 1156,
        "completion_tokens": 80,
        "total_tokens": 1236
    }
}
Every response includes a "finish_reason" field. It has the following possible values:
- stop: API returned complete model output.
- length: Incomplete model output due to the max_tokens input parameter or the model's token limit.
- content_filter: Omitted content due to a flag from our content filters.
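As a sketch of how you might branch on this field in code (continuing the earlier requests-based example, where response holds the HTTP response):

# Check why the model stopped generating before using the output
completion = response.json()
finish_reason = completion["choices"][0]["finish_reason"]

if finish_reason == "length":
    print("Output was truncated; consider increasing max_tokens.")
elif finish_reason == "content_filter":
    print("Content was omitted by the content filters.")
else:  # "stop": complete model output
    print(completion["choices"][0]["message"]["content"])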
Detail parameter settings in image processing: Low, High, Auto
The detail parameter in the model offers three choices: low, high, or auto, to adjust the way the model interprets and processes images. The default setting is auto, where the model decides between low and high based on the size of the image input.
- low setting: the model does not activate "high res" mode and instead processes a lower-resolution 512x512 version of the image, resulting in quicker responses and reduced token consumption for scenarios where fine detail isn't crucial.
- high setting: the model activates "high res" mode. Here, the model initially views the low-resolution image and then generates detailed 512x512 segments from the input image. Each segment uses double the token budget, allowing for a more detailed interpretation of the image.

For details on how the image parameters impact tokens used and pricing, see What is Azure OpenAI? - Image Tokens.
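You set the level with a detail field on the image_url object in the request body. For example, the following fragment (mirroring the earlier sample body; the image URL is a placeholder) requests high detail:

{
    "type": "image_url",
    "image_url": {
        "url": "<image URL>",
        "detail": "high"
    }
}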
Output
The API response should look like the following.
{
    "id": "chatcmpl-8UyuhLfzwTj34zpevT3tWlVIgCpPg",
    "object": "chat.completion",
    "created": 1702394683,
    "model": "gpt-4",
    "choices": [
        {
            "finish_reason": {
                "type": "stop",
                "stop": "<|fim_suffix|>"
            },
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows a close-up of an individual with dark hair and what appears to be a short haircut. The person has visible ears and a bit of their neckline. The background is a neutral light color, providing a contrast to the dark hair."
            }
        }
    ],
    "usage": {
        "prompt_tokens": 816,
        "completion_tokens": 49,
        "total_tokens": 865
    }
}