

Using Custom Python Libraries with U-SQL

The U-SQL/Python extensions for Azure Data Lake Analytics ship with the standard Python libraries and include pandas and numpy. We've been getting a lot of questions about how to use custom libraries. It takes only a few steps, thanks to Python's zipimport mechanism.

Introducing zipimport

PEP 273 (zipimport) gave Python's import statement the ability to import modules from ZIP files. Take a moment to review the zipimport documentation before we proceed.

Here are the basics:

  • Create a module (a .py file, etc.)
  • ZIP up the module into a .zip file
  • Add the full path of the zip file to sys.path
  • Import the module
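
The four steps above can be sketched end to end in one small script. This is only an illustration that works in a temporary folder; the file names match the ones used later in this post, and the module is written straight into the archive with the standard zipfile module rather than zipped by hand.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Work in a throwaway folder so nothing is left next to your own files.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "mycustommodules.zip")

# Steps 1 and 2: create the module and zip it, with mymodule.py at the archive root.
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr(
        "mymodule.py",
        'hello_world = "Hello World! This is code from a custom module"\n',
    )

# Step 3: add the full path of the zip file to sys.path.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()

# Step 4: import the module; the import system finds it inside the archive.
import mymodule

print(mymodule.hello_world)
print(mymodule.__file__)  # the reported location is inside mycustommodules.zip
```

Printing mymodule.__file__ is a quick way to confirm that the module really came from the zip file and not from a stray copy elsewhere on sys.path.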

Build and test a simple zipped package

Before you try this with U-SQL, first master the mechanics of zipimport on your own box.

Create a file called mymodule.py with the following contents:

# demo module
hello_world = "Hello World! This is code from a custom module"

This module defines a single variable called hello_world.

Create a zip file that contains mymodule.py at the root of the archive.

  • In Windows you can right-click on mymodule.py and select Send to > Compressed (zipped) folder
    • This will create a file called mymodule.zip
  • Rename mymodule.zip to mycustommodules.zip
    • NOTE: This renaming step isn't strictly mandatory when using zipimport, but it highlights that the name of the zip file does not have to match the name of the module inside it
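
If you are not on Windows, or prefer to script this step, the archive can also be built with Python's zipfile module. The sketch below is one possible way to do it; the helper name zip_module_at_root is made up for this example, and the demo writes its own copy of mymodule.py into a temporary folder so that it can run anywhere.

```python
import os
import tempfile
import zipfile


def zip_module_at_root(module_path, zip_path):
    """Store module_path in zip_path under its bare file name (no folders)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # arcname drops the directory part, so the module lands at the archive root.
        zf.write(module_path, arcname=os.path.basename(module_path))
    # Read the archive back and report what it contains.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


# Demo: create the module in a temporary folder, then zip it.
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "mymodule.py")
with open(module_path, "w") as f:
    f.write('# demo module\n')
    f.write('hello_world = "Hello World! This is code from a custom module"\n')

names = zip_module_at_root(module_path, os.path.join(workdir, "mycustommodules.zip"))
print(names)
```

The arcname argument is the important part: without it the archive would record the folder path of the source file, and import mymodule would no longer find the module at the root of the zip.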

Create a test.py Python file in the same folder as mycustommodules.zip.

import sys
sys.path.insert(0, 'mycustommodules.zip')
import mymodule
print(mymodule.hello_world)

Your folder should contain:

  • test.py
  • mycustommodules.zip

Now run the test.py program:

python test.py

The output should look like this:

Hello World! This is code from a custom module

Deploying Custom Python Modules with U-SQL

First, upload the mycustommodules.zip file to your ADLS store. In this case we will upload it to the root of the default ADLS account for the ADLA account we are using, so its path is "/mycustommodules.zip".

Now, run this U-SQL script:

REFERENCE ASSEMBLY [ExtPython];
DEPLOY RESOURCE @"/mycustommodules.zip";

// mymodule.py is inside the mycustommodules.zip file

DECLARE @myScript = @"
import sys
sys.path.insert(0, 'mycustommodules.zip')
import mymodule

def usqlml_main(df):
 del df['number']
 df['hello_world'] = str(mymodule.hello_world)
 return df
";

@rows = 
 SELECT * FROM (VALUES (1)) AS D(number);

@rows =
 REDUCE @rows ON number
 PRODUCE hello_world string
 USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @rows
 TO "/demo_python_custom_module.csv"
 USING Outputters.Csv(outputHeader: true);

It will produce a simple CSV file with a hello_world header line followed by a single row containing "Hello World! This is code from a custom module".
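
Before submitting a job, it can help to exercise the body of usqlml_main locally. The sketch below is a rough local harness, not part of the extension: it assumes pandas is installed on your machine, builds a one-row DataFrame that mimics the @rows rowset, and uses a plain constant (HELLO_WORLD, a stand-in introduced here) in place of mymodule.hello_world so that it runs without the zip file.

```python
import pandas as pd

# Stand-in for mymodule.hello_world, so this check runs without the zip file.
HELLO_WORLD = "Hello World! This is code from a custom module"


def usqlml_main(df):
    # Same body as in the U-SQL script, with the stand-in constant.
    del df['number']
    df['hello_world'] = str(HELLO_WORLD)
    return df


# Mimic the rowset built by SELECT * FROM (VALUES (1)) AS D(number).
rows = pd.DataFrame({"number": [1]})
result = usqlml_main(rows)

# The remaining columns should match the PRODUCE clause: hello_world string.
print(list(result.columns))
print(result.to_csv(index=False))
```

If the column list printed here does not match the PRODUCE clause of the REDUCE statement, fix the mismatch locally before running the job.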

Comments

  • Anonymous
    June 28, 2017
    This is very helpful!
  • Anonymous
    June 28, 2017
    Great article! A follow up question: If I wanted to use a custom library such as tensorflow that uses different versions of numpy than what is pre-installed with the U-sql python extension, what would be the best way to do this ?