Develop applications with Semantic Kernel and Azure AI Foundry

In this article, you learn how to use Semantic Kernel with models deployed from the Azure AI model catalog in Azure AI Foundry portal.

Prerequisites

  • An Azure subscription.

  • An Azure AI project as explained at Create a project in Azure AI Foundry portal.

  • A model deployment that supports the Azure AI model inference API. In this example, we use a Mistral-Large deployment, but you can use any model of your preference. To use embeddings capabilities in Semantic Kernel, you need an embedding model such as cohere-embed-v3-multilingual.

  • Python 3.10 or later installed, including pip.

  • Semantic Kernel installed. You can install it with:

    pip install semantic-kernel
    
  • In this example, we work with the Azure AI model inference API, so we also install the relevant Azure dependencies. You can install them with:

    pip install semantic-kernel[azure]
    

Configure the environment

To use LLMs deployed in Azure AI Foundry portal, you need the endpoint and credentials to connect to them. Follow these steps to get the information you need from the model you want to use:

  1. Go to the Azure AI Foundry portal.

  2. Open the project where the model is deployed, if it isn't already open.

  3. Go to Models + endpoints and select the model you deployed as indicated in the prerequisites.

  4. Copy the endpoint URL and the key.

    Screenshot of the option to copy endpoint URI and keys from an endpoint.

    Tip

    If your model was deployed with Microsoft Entra ID support, you don't need a key.

In this scenario, we placed both the endpoint URL and key in the following environment variables:

export AZURE_AI_INFERENCE_ENDPOINT="<your-model-endpoint-goes-here>"
export AZURE_AI_INFERENCE_API_KEY="<your-key-goes-here>"

Once configured, create a client to connect to the endpoint:

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatCompletion

chat_completion_service = AzureAIInferenceChatCompletion(ai_model_id="<deployment-name>")

Tip

The client automatically reads the environment variables AZURE_AI_INFERENCE_ENDPOINT and AZURE_AI_INFERENCE_API_KEY to connect to the model. However, you can also pass the endpoint and key directly to the client via the endpoint and api_key parameters on the constructor.
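
For example, here's a minimal sketch that passes both values explicitly instead of relying on environment variables (the placeholders are yours to fill in):

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatCompletion

# Equivalent to the environment-variable approach, but with explicit values
chat_completion_service = AzureAIInferenceChatCompletion(
    ai_model_id="<deployment-name>",
    endpoint="<your-model-endpoint-goes-here>",
    api_key="<your-key-goes-here>",
)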

Alternatively, if your endpoint supports Microsoft Entra ID, set only the endpoint environment variable:

export AZURE_AI_INFERENCE_ENDPOINT="<your-model-endpoint-goes-here>"

Then create the client the same way:

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatCompletion

chat_completion_service = AzureAIInferenceChatCompletion(ai_model_id="<deployment-name>")

Note

When using Microsoft Entra ID, make sure that the endpoint was deployed with that authentication method and that you have the required permissions to invoke it.
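
If you prefer to be explicit about the credential instead of relying on the connector's defaults, you can construct the underlying ChatCompletionsClient yourself with DefaultAzureCredential. The following is a minimal sketch that mirrors the Azure OpenAI pattern shown in the next section; the credential scope is an assumption borrowed from that example and might differ for your deployment:

from azure.ai.inference.aio import ChatCompletionsClient
from azure.identity.aio import DefaultAzureCredential

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatCompletion

chat_completion_service = AzureAIInferenceChatCompletion(
    ai_model_id="<deployment-name>",
    client=ChatCompletionsClient(
        endpoint="<your-model-endpoint-goes-here>",
        credential=DefaultAzureCredential(),
        # Assumption: same scope as the Azure OpenAI example below; adjust if your endpoint requires a different one
        credential_scopes=["https://cognitiveservices.azure.com/.default"],
    ),
)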

Azure OpenAI models

If you're using an Azure OpenAI model, you can use the following code to create the client:

from azure.ai.inference.aio import ChatCompletionsClient
from azure.identity.aio import DefaultAzureCredential

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatCompletion

chat_completion_service = AzureAIInferenceChatCompletion(
    ai_model_id="<deployment-name>",
    client=ChatCompletionsClient(
        endpoint=f"{str(<your-azure-open-ai-endpoint>).strip('/')}/openai/deployments/{<deployment_name>}",
        credential=DefaultAzureCredential(),
        credential_scopes=["https://cognitiveservices.azure.com/.default"],
    ),
)

Inference parameters

You can configure how inference is performed by using the AzureAIInferenceChatPromptExecutionSettings class:

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatPromptExecutionSettings

execution_settings = AzureAIInferenceChatPromptExecutionSettings(
    max_tokens=100,
    temperature=0.5,
    top_p=0.9,
    # extra_parameters={...},    # model-specific parameters
)
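
As a sketch of the extra_parameters escape hatch, you can forward options that are specific to the deployed model. safe_prompt below is a Mistral-specific flag used purely as an illustration; check your model's documentation for the extra parameters it accepts:

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceChatPromptExecutionSettings

execution_settings = AzureAIInferenceChatPromptExecutionSettings(
    max_tokens=100,
    temperature=0.5,
    top_p=0.9,
    # Illustration only: safe_prompt is specific to Mistral models
    extra_parameters={"safe_prompt": True},
)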

Calling the service

Let's first call the chat completion service with a simple chat history:

Tip

Semantic Kernel is an asynchronous library, so you need to use the asyncio library to run the code.

import asyncio

async def main():
    ...

if __name__ == "__main__":
    asyncio.run(main())

The snippets that follow assume they run inside an async function like main above.

from semantic_kernel.contents.chat_history import ChatHistory

chat_history = ChatHistory()
chat_history.add_user_message("Hello, how are you?")

response = await chat_completion_service.get_chat_message_content(
    chat_history=chat_history,
    settings=execution_settings,
)
print(response)

Alternatively, you can stream the response from the service:

chat_history = ChatHistory()
chat_history.add_user_message("Hello, how are you?")

response = chat_completion_service.get_streaming_chat_message_content(
    chat_history=chat_history,
    settings=execution_settings,
)

chunks = []
async for chunk in response:
    chunks.append(chunk)
    print(chunk, end="")

# The streamed chunks support the + operator, so summing them reconstructs the full message
full_response = sum(chunks[1:], chunks[0])

Create a long-running conversation

You can create a long-running conversation by using a loop:

while True:
    response = await chat_completion_service.get_chat_message_content(
        chat_history=chat_history,
        settings=execution_settings,
    )
    print(response)
    chat_history.add_message(response)
    # Read the next user turn and add it to the history
    user_input = input("User:> ")
    chat_history.add_user_message(user_input)

If you're streaming the response, you can use the following code:

while True:
    response = chat_completion_service.get_streaming_chat_message_content(
        chat_history=chat_history,
        settings=execution_settings,
    )

    chunks = []
    async for chunk in response:
        chunks.append(chunk)
        print(chunk, end="")

    full_response = sum(chunks[1:], chunks[0])
    chat_history.add_message(full_response)
    # Read the next user turn and add it to the history
    user_input = input("User:> ")
    chat_history.add_user_message(user_input)
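
For reference, here's a minimal runnable sketch that assembles the non-streaming loop inside the asyncio entry point from the earlier Tip. The empty-input exit check is our own addition, not part of the original sample:

import asyncio

from semantic_kernel.connectors.ai.azure_ai_inference import (
    AzureAIInferenceChatCompletion,
    AzureAIInferenceChatPromptExecutionSettings,
)
from semantic_kernel.contents.chat_history import ChatHistory

async def main():
    chat_completion_service = AzureAIInferenceChatCompletion(ai_model_id="<deployment-name>")
    execution_settings = AzureAIInferenceChatPromptExecutionSettings(max_tokens=100)

    chat_history = ChatHistory()
    chat_history.add_user_message("Hello, how are you?")

    while True:
        response = await chat_completion_service.get_chat_message_content(
            chat_history=chat_history,
            settings=execution_settings,
        )
        print(response)
        chat_history.add_message(response)

        user_input = input("User:> ")
        if not user_input:
            break  # exit on empty input (our own addition)
        chat_history.add_user_message(user_input)

if __name__ == "__main__":
    asyncio.run(main())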

Use embeddings models

Configure your environment similarly to the previous steps, but use the AzureAIInferenceTextEmbedding class:

from semantic_kernel.connectors.ai.azure_ai_inference import AzureAIInferenceTextEmbedding

embedding_generation_service = AzureAIInferenceTextEmbedding(ai_model_id="<deployment-name>")

The following code shows how to get embeddings from the service:

embeddings = await embedding_generation_service.generate_embeddings(
    texts=["My favorite color is blue.", "I love to eat pizza."],
)

for embedding in embeddings:
    print(embedding)
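
As a quick usage example, you can compare the two vectors with cosine similarity. This sketch assumes the embeddings come back as numpy arrays and that numpy is installed:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity between the two sentences: {similarity:.3f}")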