Udostępnij za pośrednictwem


How to rename data columns in Azure ML programmatically

In Azure ML you usually rename columns using the Metadata Editor component. This works well in common scenarios, but I ran into a situation where this was extremely impractical and I needed a way to rename the columns in bulk. I found the following way to do that:

First, add an Execute R Script item to the model.

image

Then, adjust the code as follows:

    1: # Map 1-based optional input ports to variables
    2: dataset1 <- maml.mapInputPort(1) # class: data.frame
    3:  
    4: colnames(dataset1) <- gsub("Scored Probabilities for Class ","",colnames(dataset1))
    5: colnames(dataset1) <- gsub("\"","",colnames(dataset1))
    6:  
    7: # Select data.frame to be sent to the output Dataset port
    8: maml.mapOutputPort("dataset1");

The trick is to use the gsub function in R to replace a text with another text, and use that on the column names. In the example above I did two replacements, but normally you would need to do only one.

The reason I wanted to do this is that my experiment has a multi-class classification model with a few thousand classes, which results in a column to be added to the data for each of the classes while it is scored. This is exactly what I needed, however, Azure ML inconveniently names each of the added columns Scored Probabilities for Class “ <class name> . The most annoying thing about this is that the column name include the double quotes, and the SQL transformation module which I added downstream choked on that. Hence I needed to rename a couple thousand columns. In the code above, the second gsub call removes the double-quote characters. The result of this code is that each scored column is simply named after the class.

There is however a potential issue with this. Even though the script doesn’t do much, the entire data set needs to be loaded into the dataset1 variable, and later the same data is passed back to the Azure ML runtime. When the dataset is large, this can take a lot of time.