Jaa


Building an Azure ML SSIS Task

In several previous blog posts (HERE and HERE), I’ve introduced and discussed the Azure Machine Learning service, its features, benefits and general capabilities. Since that time I have been toying with the idea of a building a custom SSIS Task to integrate Azure ML into SSIS.

My vision of the project is pretty simple and is roughly modeled after the built-in Execute SQL and Web Service Tasks. First, since the Azure ML batch execution service requires the input data to be hosted in an Azure Blob Storage account, the task would be capable of either uploading the required input data to an Azure Blob Storage account or simply using an URI to an existing blob. If the input source is not an existing blob, that data could come from a variable, a file connection or via direct input.

Next, the task would execute and monitor the batch prediction job via the end point provided. The task would also periodically log the status for the purpose of visibility. When the job completes, the task would allow the results to be downloaded to either a string variable or file connection.

To start, I re-purposed the Azure Storage Connection Manager found in SSIS Azure Blob Storage sample project on MSDN. The project itself consists of just a single task which I called ExecuteAzureMLBatch and its supporting classes. While a deep dive into the code is outside the scope of this post, those that are interested can download the project directly from my Github project: https://github.com/bluewatersql/AzureMLSSIS.

Building the ExecureAzureMLBatch Task

1. Download the code from https://github.com/bluewatersql/AzureMLSSIS

2. The solution consists of two projects: one for the connection manager and one for the task itself. The build events for both projects handle all the heavy lifting for using the components on your location machine. Ensure the path to GacUtil.exe as well as your SQL Server installation found in the Post-Build Events is correct for your machine. Note that you need to do this for both projects.

image

3. Build the project. The post-build events discussed in step 2 take care of registering the required assemblies in the GAC and copying the files to the required directories. If you get any squirrely errors, you may need to run Visual Studio as an Administrator to successful run GacUtil.

Using the ExecuteAzureMLBatch Task

After the solution has been successful built, your ready to use the task in an SSIS package. To use the task successfully, use the following steps:

  1. Add a new Windows Azure Storage connection manager to your package. This connection does not have a design-time UI for configuration so you will need to use the properties window to set the following properties for where experiment sample data will live:
    • UseDevelopmentStorage = false
    • StorageName = <STORAGE_ACCOUNT />
    • AccessKey = <STORAGE_ACCOUNT_ACCESS_KEY />
    • Path = <CONTAINER_NAME /
  2. Add the ExecuteAzureML task to the control flow design surface and configure.
    • Set the Azure ML Batch URL Endpoint (ex: https://ussouthcentral.services.azureml.net/workspaces/\<….>/services/<…>/jobs)
    • Set the Azure ML Key
    • The experiment input source can come from a variable, a file connection or by direct input.
    • The experiment destination can be none, variable or a file connection and determines where the results are loaded to. Note that when none is specified the results are not downloaded.
    • image
  3. Run your package

image

Once the rask runs, you can process the results as required. I hope you find this as interesting as I do and feel free to let me know if you have suggestions, feedback or would like to actively contribute.

Till next time!

Chris

Comments

  • Anonymous
    July 14, 2015
    Thanks for this Chris! I look forward to exploring it :)

  • Anonymous
    September 09, 2015
    Tire looks really interesting. Thank you for putting all this together. I tried this with SQL 2014 (vs2013) without luck.. Would you be able to share the exact platform your project is created on i.e. versions of SQL and VS and data tools you run your package on. Also does this work for retraining the models? Thanks very much.