AzureMLCompute job failed 500: [REDACTED]: Some(true) Error while creating custom environment in azure ml

Question

Hello everyone,

I am trying to create a custom environment to train and deploy a CatBoost regression model with the Azure ML SDK. However, when I submit the job, it runs for a while and then fails with the error "AzureMLCompute job failed 500: [REDACTED]: Some(true)". When I checked the logs for the job, they were empty, so I couldn't find anything to help solve the problem. Can you please help me identify and solve it?

Here is my environment definition, and the job to create the env.

channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=1.0.2
  - scipy=1.7.1
  - pandas~=1.5.3
  - catboost
  - pip:
      - inference-schema[numpy-support]~=1.5.0
      - packaging==23.2
      - cloudpickle==2.2.1
      - mlflow==2.8.0
      - mlflow-skinny==2.8.0
      - azureml-mlflow==1.51.0
      - psutil==5.8.0
      - pyyaml==6.0.1
      - tqdm>=4.59,<4.60
      - ipykernel~=6.0
      - azureml-inference-server-http
      - azureml-core
      - azureml-dataset-runtime[fuse]
      - azureml-fsspec
name: model-env

import os
#create a source folder for the script
train_src_dir = "./pipeline_src"
os.makedirs(train_src_dir, exist_ok=True)


from azure.ai.ml.entities import Environment
#create and register this custom environment in your workspace:
custom_env_name = "model-env"
custom_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for catboost reg",
    tags={"scikit-learn": "1.0.2"},
    conda_file=os.path.join(train_src_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}")

Accepted Answer

Hello Sena and Alex,

Thanks for sharing the solution. I will escalate this issue to the documentation team so that it can be documented. If Sena finds Alex's answer helpful, please accept it so that more people can see it.

I will repost Sena's answer here so that Sena can accept it, since a platform limitation prevents the question poster from accepting their own answer.

Thanks again for reporting the issue and posting the solution.

For a private workspace, you only need to run this code once. There is no need to run it every time you create an environment.


# Set the compute cluster used for environment image builds

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = ""
resource_group = ""
workspace = ""
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# Get workspace info
ws = ml_client.workspaces.get(name=workspace)

# Update to use cpu-cluster for image builds
ws.image_build_compute = ""

# To switch back to using ACR to build (if ACR is not in the VNet):
# ws.image_build_compute = ''
ml_client.workspaces.begin_update(ws)

# Set the legacy mode of the workspace to False
from azureml.core import Workspace
ws = Workspace.from_config()
ws.update(v1_legacy_mode=False)

Appreciated again.

Regards,

Yutong

-Please accept the answer if you find it helpful, to support the community. Thanks a lot.

Answer

Hi, I'm having the same issue, but in my case I can't disable v1_legacy_mode. Based on Sena's fix, I think the issue is that the compute used by default to prepare the images is the serverless one, which is outside the private endpoint configuration.

I resorted to specifying the compute and building the environment explicitly, but this appears to be a breaking change in behaviour. I'm still on the old version of the SDK, but for anyone looking for an answer, here is what worked for me:

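from azureml.core import Environment, Workspace

ws = Workspace.from_config()  # assumption: workspace loaded from a local config, as in the other answers
my_environment = Environment('')
compute_name = ""
my_environment.build(ws, compute_name)  # line I didn't need prior to this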
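
With the v1 SDK, the object returned by `Environment.build` can also be used to wait for the build and stream its log, which may help given how little the job logs show in this thread. A minimal sketch, assuming `environment` is an `azureml.core.Environment` and `workspace` is an `azureml.core.Workspace`; `build_on_compute` is a hypothetical helper name, not part of the SDK:

```python
def build_on_compute(environment, workspace, compute_name):
    """Build the environment image on the named compute and block until done.

    Hypothetical helper: `environment` and `workspace` are assumed to be
    azureml.core objects created elsewhere.
    """
    # The second positional argument is the image build compute,
    # matching the call in the snippet above.
    build = environment.build(workspace, compute_name)
    # Streams the image build log to stdout while waiting for completion.
    build.wait_for_completion(show_output=True)
    return build
```

Calling it as `build_on_compute(my_environment, ws, compute_name)` would replace the bare `build` call and surface any build error directly in the console.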

If possible, it would be great to make the logs for this more explicit to ease troubleshooting.

Answer

I was able to solve the problem. If you encounter this error, here is what you should do:

Note: For a private workspace, you only need to run this code once. There is no need to run it every time you create an environment.

# Set the compute cluster used for environment image builds

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = ""
resource_group = ""
workspace = ""

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# Get workspace info
ws = ml_client.workspaces.get(name=workspace)

# Update to use cpu-cluster for image builds
ws.image_build_compute = ""

# To switch back to using ACR to build (if ACR is not in the VNet):
# ws.image_build_compute = ''
ml_client.workspaces.begin_update(ws)

# Set the legacy mode of the workspace to False
from azureml.core import Workspace
ws = Workspace.from_config()
ws.update(v1_legacy_mode=False)
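
Note that `begin_update` starts a long-running operation and returns a poller, so the snippet above does not wait for the workspace change to finish. A minimal sketch that blocks until the update completes, assuming `ml_client` is an authenticated `MLClient`; `set_image_build_compute` and its `compute_name` argument are hypothetical names used for illustration:

```python
def set_image_build_compute(ml_client, workspace_name, compute_name):
    """Point image builds at a compute cluster and wait for the update.

    Hypothetical helper: `ml_client` is assumed to be an azure.ai.ml.MLClient
    and `compute_name` the name of an existing compute cluster.
    """
    ws = ml_client.workspaces.get(name=workspace_name)
    ws.image_build_compute = compute_name
    poller = ml_client.workspaces.begin_update(ws)
    # result() blocks until the long-running update finishes
    # and returns the updated workspace object.
    return poller.result()
```

Waiting on the poller makes it easier to tell whether the setting was actually applied before the next environment build is submitted.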

Answer

There can be a slightly different scenario:

- v1_legacy_mode is already set to False
- setting image_build_compute is not an option
- a private subnet is available for the workspace

In this case, "serverlessComputeSettings" of the AML workspace resource should be updated using the following YAML config:

serverless_compute:
  custom_subnet: [SUBNET_ID]
  no_public_ip: true

And then:

az ml workspace update --name [WORKSPACE_NAME] --resource-group [RESOURCE_GROUP] --subscription [SUBSCRIPTION_ID] --file [THE_FILE_ABOVE]

Then the serverless compute should work.
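
For anyone scripting this, the two steps can be sketched in Python using only the standard library: write the YAML file, then assemble the CLI call as an argument list. The function names and the file path are hypothetical; the YAML keys and CLI flags are the ones from this answer, with the resource group flag spelled `--resource-group` as the Azure CLI expects.

```python
from pathlib import Path


def write_serverless_config(path, subnet_id, no_public_ip=True):
    """Write the serverless_compute YAML shown above and return its text."""
    text = (
        "serverless_compute:\n"
        f"  custom_subnet: {subnet_id}\n"
        f"  no_public_ip: {str(no_public_ip).lower()}\n"
    )
    Path(path).write_text(text, encoding="utf-8")
    return text


def build_update_command(workspace_name, resource_group, subscription_id, config_path):
    """Return the `az ml workspace update` call as an argument list."""
    return [
        "az", "ml", "workspace", "update",
        "--name", workspace_name,
        "--resource-group", resource_group,
        "--subscription", subscription_id,
        "--file", str(config_path),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI with the `ml` extension is installed and logged in.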
