Unable to read the content of a file despite being able to see its FileInfo: urlopen error [Errno 5] Input/output error

Question

I have two notebooks in different accounts, Staging and Production. Both use a linked service configured with a system-assigned managed identity and a mounted drive, and both run exactly the same code.

Both can see the FileInfo (name, size, and so on):

FileInfo(path=file:/synfs/notebook/23/mount1/staging_path/ABC.zip, name=ABC.zip, size=1024),

The Staging environment can read the contents of the file, while Production raises an error:

URLError:

Code for your reference:

mssparkutils.fs.mount("abfss://container_name@account_name.dfs.core.windows.net", "/mount1", {"linkedService": "workspace_storage_test"})

mssparkutils.fs.ls(path)

mssparkutils.fs.ls(f'file:{mssparkutils.fs.getMountPath("/mount1")}{staging_path}')


df0 = pd.read_csv(f'file:{mssparkutils.fs.getMountPath("/mount1")}{staging_path}ABC.zip', compression ='zip', sep='|', names = abc, dtype= xyz)
df1 = spark.createDataFrame(df0)
display(df1)

Answer

First, check the permissions. The Managed Identity used in the Production environment needs permission to access both the storage account and the specific file; being able to list a file does not by itself confirm that its contents can be read.

You also need to verify that the mount point (/mount1) is correctly mounted in the Production environment. You can list the current mounts with the following command:


mssparkutils.fs.mounts()

If the mount is not present or incorrect, remount it:


mssparkutils.fs.unmount("/mount1")

mssparkutils.fs.mount("abfss://container_name@account_name.dfs.core.windows.net", "/mount1", {"linkedService": "workspace_storage_test"})

Verify that the file path is correct and accessible. You can list the contents of the directory to verify:


mssparkutils.fs.ls(f'file:{mssparkutils.fs.getMountPath("/mount1")}{staging_path}')
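Because the listing succeeds while the read fails, it may help to test the metadata lookup and an actual byte read separately. The following is a minimal sketch using only the standard library; `probe_file` is a hypothetical helper (not part of mssparkutils), and it assumes it is given the local mount path without the `file:` prefix.

```python
import errno
import os


def probe_file(local_path, num_bytes=16):
    """Check metadata and a small read separately; report where it fails."""
    result = {"stat_ok": False, "read_ok": False, "size": None, "error": None}
    try:
        # Step 1: metadata only (roughly what a directory listing needs).
        result["size"] = os.stat(local_path).st_size
        result["stat_ok"] = True
        # Step 2: read a few bytes, which requires real access to the data.
        with open(local_path, "rb") as handle:
            handle.read(num_bytes)
        result["read_ok"] = True
    except OSError as exc:
        # Record the symbolic errno name (for example EIO or EACCES).
        code = errno.errorcode.get(exc.errno, "unknown")
        result["error"] = f"{code}: {exc.strerror}"
    return result
```

In the notebook this could be called with the path built from mssparkutils.fs.getMountPath("/mount1"), staging_path and the file name. If stat_ok is True but read_ok is False in Production only, that points at data access (permissions or network) rather than at the path.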

The Input/output error might indicate a network issue. Check if there are any network restrictions or firewall rules that might be blocking access to the storage account in the Production environment.

Implement retry logic in your code to handle transient errors:


import time
from urllib.error import URLError

retries = 3
for attempt in range(retries):
    try:
        df0 = pd.read_csv(f'file:{mssparkutils.fs.getMountPath("/mount1")}{staging_path}ABC.zip', compression='zip', sep='|', names=abc, dtype=xyz)
        break
    except URLError:
        if attempt == retries - 1:
            raise  # out of attempts: re-raise the original error
        time.sleep(5)  # wait 5 seconds before retrying
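If the same retry behaviour is needed in more than one place, it can be pulled into a small helper. This is only a sketch: `retry` and its parameters are hypothetical names, and the growing delay between attempts is an added assumption rather than something the original loop does. It catches OSError by default, which also covers URLError because URLError is a subclass of OSError.

```python
import time


def retry(operation, retries=3, delay=5.0, backoff=2.0, exceptions=(OSError,)):
    """Call operation() until it succeeds or the attempts are used up."""
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: surface the last error unchanged
            time.sleep(delay)
            delay *= backoff  # wait longer before each further attempt
```

The pd.read_csv call from the loop above could then be wrapped in a lambda and passed as `operation`. Retrying only helps with transient failures; a permission or firewall problem will fail on every attempt.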

Add logging to capture more details about the error:


import logging
from urllib.error import URLError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    df0 = pd.read_csv(f'file:{mssparkutils.fs.getMountPath("/mount1")}{staging_path}ABC.zip', compression='zip', sep='|', names=abc, dtype=xyz)
except URLError as e:
    # e.reason carries the underlying cause of the URLError
    logger.error("Failed to read file: %s (reason: %r)", e, e.reason)
    raise
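The "urlopen error" wording suggests the read goes through urllib because the path starts with `file:`. To check whether the archive itself is readable, independent of pandas and urllib, one option is to convert the URI to a plain local path and open it with the standard zipfile module. This is a sketch under that assumption; `to_local_path` and `check_zip` are hypothetical helper names.

```python
import zipfile
from urllib.parse import urlparse
from urllib.request import url2pathname


def to_local_path(path_or_uri):
    """Turn a file: URI into a plain filesystem path; leave other strings alone."""
    parsed = urlparse(path_or_uri)
    if parsed.scheme == "file":
        return url2pathname(parsed.path)
    return path_or_uri


def check_zip(path_or_uri):
    """Read every member of the archive and return the names found."""
    with zipfile.ZipFile(to_local_path(path_or_uri)) as archive:
        bad_member = archive.testzip()  # reads all data and checks CRCs
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member: {bad_member}")
        return archive.namelist()
```

If check_zip fails in Production with an OSError, the problem lies in reading the mounted file rather than in pandas. The path returned by to_local_path can also be passed to pd.read_csv directly, in which case an I/O failure should surface as a plain OSError instead of a URLError.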
