Jobs Getting Suspended in Azure Container Apps (KEDA Queue-Based Trigger)

Arthur Kerst 0 Reputation points
2025-03-04T12:51:51.4533333+00:00

Some of our Azure Container App Jobs are being suspended unexpectedly. The logs show "Suspending Scale Job: jobname". We suspect the issue is related to the scaling of job executions. The jobs are triggered via KEDA based on messages in a Service Bus queue. However, some jobs are suspended/stopped even though the queue still has messages, and no new job executions are started.

Additionally, the following messages appear intermittently:

0/3 nodes are available: 1 node(s) had untolerated taint {virtual-kubelet.io/provider: legion}, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

We need help understanding why jobs are suspended and why these node-related log messages appear.

Relevant Configuration Details:

The job and its workload profile are configured as follows:

resource containerAppJob 'Microsoft.App/jobs@2024-08-02-preview' = {
  name: jobname
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    environmentId: managedEnvironment.id
    workloadProfileName: 'pt-D8-8-32'
    template: {
      containers: [
        {
          name: 'containername'
          image: 'name.azurecr.io/image:latest'
          imageType: 'ContainerImage'
          command: [
            'python'
          ]
          args: [
            '-m'
            'src.core_processing.process'
          ]
          resources: {
            cpu: json('8')
            memory: '32Gi'
          }
        }
      ]
    }
    configuration: {
      registries: [
        {
          server: 'name.azurecr.io'
          identity: 'system'
        }
      ]
      triggerType: 'Event'
      replicaTimeout: 3600
      replicaRetryLimit: 0
      eventTriggerConfig: {
        replicaCompletionCount: 1
        parallelism: 1
        scale: {
          minExecutions: 0
          maxExecutions: 2
          pollingInterval: 5
          rules: [
            {
              name: 'message-start-job'
              type: 'azure-servicebus'
              metadata: {
                activationMessageCount: '0'
                messageCount: '1'
                namespace: serviceBusNamespaceName
                queueName: serviceBusQueueProcess
              } 
              auth: []
              identity: 'system'
            }
          ]
        }
      }
    } 
  }
}
The corresponding workload profile entry on the managed environment:

{
  workloadProfileType: 'D8'
  name: 'pt-D8-8-32'
  enableFips: false
  minimumCount: 0
  maximumCount: 4
}
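For context, a profile entry of this shape normally sits in the managed environment's workloadProfiles array. The sketch below is purely illustrative: the environment resource name, API version, and the Consumption entry are assumptions, not taken from our actual template.

resource managedEnvironment 'Microsoft.App/managedEnvironments@2024-08-02-preview' = {
  name: environmentName // hypothetical parameter
  location: location
  properties: {
    workloadProfiles: [
      {
        // a Consumption profile is commonly present alongside dedicated profiles
        workloadProfileType: 'Consumption'
        name: 'Consumption'
      }
      {
        workloadProfileType: 'D8'
        name: 'pt-D8-8-32'
        enableFips: false
        minimumCount: 0
        maximumCount: 4
      }
    ]
  }
}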

What I Need Help With:

  • Why are jobs getting suspended despite messages in the queue?
  • Are the node-related errors causing job suspensions?
  • Is there any configuration adjustment needed to prevent suspensions and improve scaling?

1 answer

  1. Khadeer Ali 3,750 Reputation points Microsoft External Staff
    2025-03-04T15:25:46.98+00:00

    @Arthur Kerst ,

    Thanks for reaching out. Jobs in Azure Container Apps can get suspended for various reasons, especially when using KEDA for scaling based on Service Bus Queue messages. Here are some possible reasons and related log messages:

    1. Insufficient Resources: If no node in the workload profile has enough free CPU and memory, the job's pod cannot be scheduled. The 0/3 nodes are available message means the scheduler found no usable node: one node carries the virtual-kubelet.io/provider: legion taint that the pod does not tolerate, and the other two do not match the pod's node affinity/selector.
    2. Scaling Configuration: If the scale settings are too restrictive, KEDA can pause scaling. With maxExecutions: 2, no new executions start while two are already running or pending, even if more messages are queued.
    3. Job Execution Limits: With messageCount: '1' and parallelism: 1, each execution handles a single message. If pending executions cannot be scheduled, messages remain in the queue without triggering additional executions.
    4. Node Pool Configuration: With minimumCount: 0 on the workload profile, nodes scale to zero when idle. A new execution then has to wait for a node to be provisioned, and scheduling can fail in the meantime; raising the minimum or maximum node count can help.

    To avoid suspensions and improve scaling, consider these tweaks:

    • Raise the workload profile's minimumCount above zero so a node is always warm and ready to pick up new executions.
    • Check the job's resource requests against the profile size: an 8-vCPU / 32Gi request on a D8 profile (8 vCPU, 32 GiB) consumes essentially a whole node per execution, so concurrency is limited by the node count.
    • Monitor node health and capacity (for example, via the environment's system logs) to confirm nodes can be provisioned when executions scale out.
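
    As an illustrative sketch only (the values below are assumptions to adapt to your workload, not a verified fix), the scale block and workload profile could be tuned along these lines:

    scale: {
      minExecutions: 0
      maxExecutions: 4        // allow as many concurrent executions as the profile has nodes
      pollingInterval: 30     // seconds; very short intervals can cause scaling churn
      rules: [
        {
          name: 'message-start-job'
          type: 'azure-servicebus'
          metadata: {
            messageCount: '1' // one execution per queued message
            namespace: serviceBusNamespaceName
            queueName: serviceBusQueueProcess
          }
          auth: []
          identity: 'system'
        }
      ]
    }

    {
      workloadProfileType: 'D8'
      name: 'pt-D8-8-32'
      enableFips: false
      minimumCount: 1         // keep one node warm to avoid cold-start scheduling failures
      maximumCount: 4
    }

    Since each execution requests a full D8 node's worth of CPU and memory, maxExecutions should not exceed the profile's maximumCount.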
