Hello Mark Lui,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are seeing the error: Failure starting repl. Try detaching and re-attaching the notebook.
Since you said that you haven't changed any settings and no user changes were made, an automatic update or a change in the Azure Databricks infrastructure might have introduced this issue. This is not uncommon in managed services. As a best practice, I suggest regularly testing and validating your configuration with every Databricks runtime update or when scaling workloads, and maintaining documentation of your cluster configurations to enable faster issue resolution.
The list below is a generic guide to resolving the error:
- The notebook's state may be corrupted. Detaching the notebook from the cluster and reattaching it often resolves such issues.
- Confirm that the cluster has adequate CPU, memory, and disk resources for running the workload. If possible, temporarily upgrade to a larger instance.
- Review the logs in the cluster UI under "Driver Logs" and "Event Logs" for errors or warnings and look for library dependency errors, memory allocation failures, or REPL-specific issues.
- Use the %pip list command to verify installed library versions and make sure there are no conflicts, particularly with critical libraries like Pandas, NumPy, or PySpark. Reinstall problematic libraries if necessary using %pip uninstall and %pip install.
- If initialization scripts are configured, disable them temporarily to rule out conflicts. To do so, go to the cluster configuration under "Advanced Options > Init Scripts" and uncheck any enabled scripts.
- Make sure your cluster is using the latest Databricks runtime version, which may include bug fixes or compatibility updates.
- Create a new notebook and run simple commands like 1+1 or print("hello") to determine if the issue is specific to the original notebook.
- Make sure there are no firewall or connectivity issues between your Databricks workspace and the underlying infrastructure. For instance, verify that the workspace can access Azure Storage if required.
- Temporarily create a minimal cluster configuration (e.g., a small, single-node cluster with no custom libraries) and test if the problem persists.
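As a complement to the %pip list step in the list above, here is a minimal Python sketch that reports the installed versions of a few key libraries from a notebook cell. The function name report_versions and the default package list are illustrative assumptions, not part of any Databricks API; the sketch relies only on the standard library module importlib.metadata.

```python
from importlib import metadata


def report_versions(packages=("pandas", "numpy", "pyspark")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None. The default package
    list is only an example; pass your own names as needed.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing this output between a working cluster and the failing one may help narrow down whether a library version changed.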
If none of the above resolves the issue, submit a support request through the Azure portal. Provide logs and steps to reproduce the issue for faster assistance.
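Before attaching logs to a support request, it can help to pull out the lines most likely to matter. The following is a rough sketch that assumes you have already downloaded the driver log as plain text. The helper name find_suspect_lines, the file name in the usage comment, and the keyword list are all hypothetical and should be adjusted to what actually appears in your logs; no particular Databricks log format is assumed.

```python
import re

# Illustrative keywords only; adjust to match your own logs.
DEFAULT_PATTERN = re.compile(r"error|exception|outofmemory|repl", re.IGNORECASE)


def find_suspect_lines(log_text, pattern=DEFAULT_PATTERN):
    """Return (line_number, line) pairs for lines that match the pattern.

    Line numbers are 1-based so they are easy to quote in a support request.
    """
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Example usage with a hypothetical local file name:
# with open("driver_log.txt", encoding="utf-8") as handle:
#     for number, line in find_suspect_lines(handle.read()):
#         print(f"{number}: {line}")
```

This is a plain keyword filter, so it will produce false positives; it is meant only to shorten the initial review, not to replace reading the full log.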
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close out the thread by upvoting this answer and accepting it if it is helpful.