Cluster creation errors on Azure HDInsight on AKS
Note
We will retire Azure HDInsight on AKS on January 31, 2025. Before January 31, 2025, you will need to migrate your workloads to Microsoft Fabric or an equivalent Azure product to avoid abrupt termination of your workloads. The remaining clusters on your subscription will be stopped and removed from the host.
Only basic support will be available until the retirement date.
Important
This feature is currently in preview. The Supplemental Terms of Use for Microsoft Azure Previews include more legal terms that apply to Azure features that are in beta, in preview, or otherwise not yet released into general availability. For information about this specific preview, see Azure HDInsight on AKS preview information. For questions or feature suggestions, please submit a request on AskHDInsight with the details and follow us for more updates on Azure HDInsight Community.
This article describes how to troubleshoot and resolve errors that could occur when you create Azure HDInsight on AKS clusters.
Sr. No | Error message | Cause | Resolution |
---|---|---|---|
1 | InternalServerError UnrecognizableError | This error could indicate that an incorrect template was used. Currently, database connectors are allowed only through an ARM template, so the configuration can't be validated on the template. | |
2 | InvalidClusterSpec - ServiceDependencyFailure - Invalid configuration | Max memory per node error. | Refer to the maximum memory configurations property value types. |
3 | WaitingClusterResourcesReadyTimeOut - Metastoreservice unready | This error could be due to an invalid storage container name. | Use a container name that contains only lowercase letters, numbers, and hyphens, and that begins with a letter or a number. Each hyphen must be preceded and followed by a nonhyphen character. The name must also be between 3 and 63 characters long. |
4 | InvalidClusterSpec - Invalid configuration - ClusterUpsertActivity | Error: Invalid configuration property hive.metastore.uri: may not be null. | Refer to the Hive connector documentation. |
5 | InternalServerError - An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call. | | Retry the operation or open a support ticket with the Azure HDInsight team. |
6 | InternalServerError - ObjectDisposedException occurs in RP code. | | Retry the operation or open a support ticket with the Azure HDInsight team. |
7 | PreconditionFailed - Operation failure due to quota limits on user subscription. | Quota is validated before cluster creation. However, when several clusters are created under the same subscription at the same time, the first cluster can occupy the quota and the others fail because of a quota shortage. | Confirm there's enough quota and retry cluster/cluster pool creation. |
8 | ReconcileApplicationSecurityGroupError - Internal AKS error | | Retry the operation or open a support ticket with the Azure HDInsight team. |
9 | ResourceGroupBeingDeleted | During HDInsight on AKS resource creation or update, the user is also deleting resources in related resource groups. | Don't delete resources in HDInsight-related resource groups while HDInsight on AKS resources are being created or updated. |
10 | UpsertNodePoolTimeOut - Async operation dependentArmResourceTask has timed out. | AKS issue, possibly due to high traffic in a particular region at the time of the operation. | Retry the operation after some time. If possible, use another region. |
11 | Authorization_IdentityNotFound - {"code":null,"message":"The identity of the calling application could not be established."} | The first-party (1-p) service principal isn't onboarded to the tenant. | Run the command that provisions the first-party service principal on the new tenant to onboard it. |
12 | NotFound - ARM/AKS SDK error | The user tried to update an HDInsight on AKS cluster, but the corresponding agent pool has been deleted. | The corresponding agent pool has been deleted. Operating on the AKS agent pool directly isn't recommended. |
13 | AuthorizationFailed - Scope invalid role assignment issue with managed RG and cluster MSI | Lack of permission to perform the operation. | Check whether the service principal app ID mentioned in the error message is owned by you. If yes, grant the permission according to the error message. If no, open a support ticket with the Azure HDInsight team. |
14 | DeleteAksClusterFailed - {"code":"DeleteAksClusterFailed","message":"An Azure service request has failed. ErrorCode: 'DeleteAksClusterFailed', ErrorMessage: 'Delete HDI cluster namespace failed. Additional info: 'Can't access a disposed object.\\r\\nObject name: 'Microsoft.Azure.Common.Configuration.ManagedConfiguration was already disposed'.''."} | The resource provider (RP) switched to a new role instance unexpectedly. | Retry the operation or open a support ticket with the Azure HDInsight team. |
15 | EntityStoreOperationError - ARM/AKS SDK error | A database operation failed on the AKS side during cluster update. | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
16 | InternalServerError - {"exception":"System.Threading.Tasks.TaskCanceledException","message":"The operation was canceled."} | This error can be caused by various issues. | Retry the operation or open a support ticket with the Azure HDInsight team. |
17 | InternalServerError - {"exception":"System.IO.IOException","message":"Unable to read data from the transport connection: A connection attempt failed because the connected party didn't properly respond after a period of time, or established connection failed because connected host has failed to respond."} | This error can be caused by various issues. | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
18 | InternalServerError - Null reference exception occurs in RP code. | This error can be caused by various issues. | Retry the operation or open a support ticket with the Azure HDInsight team. |
19 | InternalServerError - {"code":"InternalServerError","message":"An internal error has occurred, exception: 'InvalidOperationException, Sequence contains no elements.'"} | This error can be caused by various issues. | Retry the operation or open a support ticket with the Azure HDInsight team. |
20 | InternalServerError - {"code":"InternalServerError","message":"An internal error has occurred, exception: 'ArgumentNullException, Value can't be null. (Parameter 'roleAssignmentGuid')'"} | This error can be caused by various issues. | Retry the operation or open a support ticket with the Azure HDInsight team. |
21 | OperationNotAllowed - {"code":"OperationNotAllowed","message":"An Azure service request has failed. ErrorCode: 'OperationNotAllowed', ErrorMessage: 'Service request failed.\\r\\nStatus: 409 (Conflict)\\r\\n\\r\\nContent:\\r\\n{\\n \\"code\\": \\"OperationNotAllowed\\",\\n \\"details\\": null,\\n \\"message\\": \\"Operation isn't allowed: Another agent pool operation (Scaling) is in progress, wait for it to finish before starting a new operation. | Another agent pool operation (Scaling) is in progress. This error is caused by an RP Service Fabric reboot. | Wait for the previous operation to finish before starting a new operation. If the issue persists after a retry, open a support ticket with the Azure HDInsight team. |
22 | ReconcileVMSSAgentPoolFailed | Quota is validated before cluster creation. However, when several clusters are created under the same subscription at the same time, the first cluster can occupy the quota and the others fail because of a quota shortage. | Confirm there's enough quota and retry cluster/cluster pool creation. |
23 | ReconcileVMSSAgentPoolFailed - Unable to establish outbound connection from agents | AKS/VMSS side issue: VM has reported a failure. | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
24 | InternalServerError - {"code":"InternalServerError","message":"An internal error has occurred, exception: 'SqlException'"} | This error is caused by a transient SQL connection issue. | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
25 | NotLatestOperation - ARM/AKS SDK error | The operation can't proceed. Either the operation has been preempted by another one, or the information needed by the operation failed to be saved (or hasn't been saved yet). | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
26 | ReconcileVMSSAgentPoolFailed - Agent pool drain failed | There was an issue with the scale-down operation. | Open a support ticket with the Azure HDInsight team. |
27 | ResourceNotFound - ARM/AKS SDK error | This error occurs when a required resource has been removed or deleted by the user. | Make sure the resource that is mentioned in the error message exists, then retry the operation. If the issue persists, open a support ticket with the Azure HDInsight team. |
28 | InvalidClusterSpec - The cluster instance deployment failed with reason 'System.DependencyFailure' and message 'Metastoreservice instance '_xyz_' has invalid request due to - [Hive metastore storage location access check timed out.]. | The HMS initialization might time out due to SQL server or storage-related issues. | Open a support ticket with the Azure HDInsight team. |
29 | InvalidClusterSpec - The cluster instance deployment failed with reason 'System.DependencyFailure' and message 'Metastoreservice instance '_xyz_' has invalid request due to - [Keyvault secrets weren't configured properly. Failed to fetch secrets from keyvault.]. | This error can occur when the key vault is inaccessible or the secret isn't available. In rare cases, it might be due to slow initialization of the pod identity infrastructure on the cluster nodes. | If you have Log Analytics enabled, check the logs of the secretprovider-validate job to identify the reason. Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
30 | FlinkCluster unready - {"FlinkCluster": "Status can't be determined"} | This error can occur for various reasons, such as an image pull issue, controller pods not being ready, or an issue with MSI. | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
31 | FlinkCluster unready - {"FlinkCluster": "StatefulSet instance 'flink-taskmanager' isn't ready due to - [Ready replicas don't match desired replica count]."} | This error can occur for various reasons, such as an image pull issue, controller pods not being ready, or an issue with MSI. | Retry the operation after some time. If the issue persists, open a support ticket with the Azure HDInsight team. |
32 | InvalidClusterSpec (class com.microsoft.azure.hdinsight.services.spark.exception.ClusterConfigException:[SparkClusterValidator#ConfigurationValidator#][ISSUE:(1)-Component config valid:[[{serviceName='yarn-service,componentName=hadoop-config-client}, {serviceName='yarn-service,componentName=hadoop-config}]],current:[[{serviceName='yarn-service,componentName=yarn-config}'. | This error can occur if the service config contains components that aren't allowed. | Validate the service config components and retry. If the issue persists, open a support ticket with the Azure HDInsight team. |
33 | InvalidClusterSpec -1,"conditions":[{"type":"RequestIsValid","status":"UNKNOWN","reason":"UNKNOWN","message":"Unable to determine status of one or more dependencies. | This error can occur when the HMS, Spark, or YARN services aren't up. It could also be related to storage. | Open a support ticket with the Azure HDInsight team. |
34 | WaitingClusterResourcesReadyTimeOut - Failed to reconcile from generation 1 to 1. | | Open a support ticket with the Azure HDInsight team. |
35 | WaitingClusterResourcesReadyTimeOut - {"YarnService":"StatefulSet instance 'resourcemanager' isn't ready due to - `` see service status for specific details and how to fix it. Failing services are: YarnService, SparkService"} | This error can occur when the HMS, Spark, or YARN services aren't up. It could also be related to storage. | Open a support ticket with the Azure HDInsight team. |
36 | InvalidClusterSpec - [spec.configs[0].files[3].fileName: Invalid value: "yarn-env.sh": spec.configs[0].files[3].fileName in body should match '(^yarn-site\\.xml$)|(^capacity-scheduler\\.xml$)|(^core-site\\.xml$)|(^mapred-site\\.xml$)', spec.configs[0].files[3].values: Required value, spec.configs[1].files[2].fileName: Invalid value: "yarn-env.sh": spec.configs[1].files[2].fileName in body should match '(^yarn-site\\.xml$)|(^capacity-scheduler\\.xml$)|(^core-site\\.xml$)|(^mapred-site\\.xml$)', spec.configs[1].files[2].values: Required value]. | This error can occur when unsupported files are passed in the services configuration. | Validate the service config components and retry. If the issue persists, open a support ticket with the Azure HDInsight team. |
37 | InvalidClusterSpec - ".AccessDeniedException: Operation failed: "Server failed to authenticate the request. InvalidAuthenticationInfo, "Server failed to authenticate the request.." | Invalid authentication parameters: the storage location is inaccessible. | Correct the authentication parameters and retry. If the issue persists, open a support ticket with the Azure HDInsight team. |
38 | InvalidClusterSpec - “_xyz_.dfs.core.windows.net isn't accessible. Reason: HTTP Error -1; url=. AzureADAuthenticator.getTokenCall threw java.net.SocketTimeoutException :. AzureADAuthenticator.getTokenCall threw java.net.SocketTimeoutException : Read timed out.]. | This error can occur when the pod identity resources take too long to start on the node where the HMS pod is scheduled. | Retry the operation. If the issue persists, open a support ticket with the Azure HDInsight team. |
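
The container naming rules listed for error 3 can be checked locally before a cluster creation request is submitted. The following Python snippet is a minimal sketch of such a check, not part of any Azure SDK or tool: the function name `is_valid_container_name` is hypothetical, and the check covers only the rules stated in the table (lowercase letters, numbers, and hyphens; starts with a letter or number; every hyphen surrounded by nonhyphen characters; 3 to 63 characters long).

```python
import re

# Pattern for the character rules from error 3 in the table above:
# - starts with a lowercase letter or a digit
# - every hyphen is preceded and followed by a nonhyphen character,
#   which rules out consecutive hyphens and a trailing hyphen
_CONTAINER_NAME_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_container_name(name: str) -> bool:
    """Hypothetical helper: return True if `name` satisfies the stated rules."""
    # The length rule (3 to 63 characters) is checked separately for clarity.
    if not 3 <= len(name) <= 63:
        return False
    # fullmatch ensures the whole string conforms, not just a prefix.
    return _CONTAINER_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_container_name("hive-metastore-01")` returns `True`, while `is_valid_container_name("my--container")` and `is_valid_container_name("MyContainer")` both return `False`. A passing check doesn't guarantee that cluster creation succeeds; it only rules out this one cause.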