Azure Machine Learning: Text Processing Using Python Scripts (NLTK)
Introduction
When I started with Machine Learning, I was trying to understand its core concepts. Microsoft generally eases the development experience and lets developers focus on the business problem to solve, and this is true for Azure Machine Learning as well. Azure Machine Learning Studio is a tool for creating machine learning experiments: developers can simply click activities and use the tool to build experiments.
Microsoft Azure covers most of the popular algorithms for solving business problems. For example, predictive analysis, classification, clustering, and text mining are very frequent business scenarios for machine learning, and all of these algorithms are covered to some extent.
Microsoft understands the tool's limitations and provides an option for using Python and R packages. Python and R are the two most popular languages used by data scientists and developers to solve machine learning business scenarios.
Business Scenario
I came across a scenario from a customer, and it is one of the most common problems in the industry. The customer wants us to parse resumes and extract information such as the person's name, mobile number, email address, skill set, experience, strengths, and so on.
There is no straightforward answer to this problem, because a CV/resume is unstructured content with no definite format. It looks like we need to mine the text content and make the system understand how to extract the information from the resume.
We looked at various Microsoft products and technologies, and finally decided to explore Azure Machine Learning to solve this problem. Azure Machine Learning looks promising, but there is a limitation here as well: it doesn't expose text mining tools for pre-processing the content, such as tokenization, stemming, filtering out stop words, and so on. On exploring further, I came across Python's NLTK (Natural Language Toolkit); many people on the internet have used this toolkit and were able to solve similar problems.
On the other side, people suggested using the R programming language, which is also popular with data scientists and has useful libraries (like the tm package for text mining). In a coming week, I will try to share my comparison between Python and R.
What is Python and how is it useful in Machine Learning?
Python is one of the most popular and powerful languages in the data science world for solving machine learning business problems. There are popular libraries, such as scikit-learn and NLTK, that cover most machine learning business scenarios. So it is a must for Azure Machine Learning developers to know Python or R and these libraries; this will definitely give them an edge in solving problems quickly.
Python NLTK
The Natural Language Toolkit is a set of Python libraries for processing text. It provides many useful text processing libraries for classification, tokenization, stemming, tagging, and parsing. To understand this, I installed Python and the NLTK libraries on my machine and played around for a week to understand the available interfaces. There is good documentation for these libraries, and it is worth the time to go through it to learn how to use them, especially for text processing.
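For a first taste, here is a minimal sketch of tokenizing and POS-tagging a sentence with NLTK (it assumes the punkt and averaged_perceptron_tagger data packages have already been downloaded via nltk.download):
import nltk

# One-time data download (uncomment on first run):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("NLTK makes text processing easy.")
print(nltk.pos_tag(tokens))  # e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]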
How to Install Python on Windows
Installing Python is very simple, and remember, it's open source. Download the MSI from here. Download Python 3.4 and install the x86 libraries; there appear to be some issues with the Windows 64-bit libraries.
Note: You need to install Python only for your development experience. When the package executes, Microsoft Azure takes care of it, so you don't need to worry about this.
How to Install NLTK
Installing NLTK is straightforward. The installation file can be downloaded here. It is recommended to use the 32-bit libraries. The installer also downloads test data, which makes it easy for developers to explore the libraries.
Since it's all open source, there is no single package that installs everything; developers need to install every dependency as well. Here, NumPy is also a dependent package; basically, it's a numeric computing package.
Note: As I said earlier, this installation is only required for developers exploring the features of NLTK. The steps differ in Azure ML, as we will see below.
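After installation, a quick sanity check from a local Python prompt confirms that NLTK and NumPy import correctly (the version numbers printed are just whatever your local install reports):
import nltk
import numpy

print(nltk.__version__, numpy.__version__)
nltk.download()  # opens the NLTK downloader so you can fetch corpora and models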
Some important interfaces to know for Text Mining
Interface | Description
sent_tokenize | Splits a paragraph into sentences.
word_tokenize | Splits a sentence into word tokens.
PorterStemmer | Reduces a word to its root form ("running" and "ran" become "run").
LancasterStemmer | Another stemming algorithm, generally more aggressive than Porter.
WordNetLemmatizer | Converts a word to its lemma (dictionary form), optionally guided by a POS (part of speech) tag. Note: Recommended over stemming, because lemmatization keeps the context of the text; stemming simply chops words down to a root form, so you may lose the context of the sentences.
stopwords | A corpus of stop words. Use this library to remove the stop words from your paragraph.
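As a quick illustration, the sketch below exercises these interfaces on a small paragraph; it assumes the punkt, stopwords, and wordnet data have already been downloaded (locally via nltk.download, or as shown in the Azure ML steps below):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

paragraph = "Dogs are running in the park. The cats were sleeping."

sentences = nltk.sent_tokenize(paragraph)   # ['Dogs are running in the park.', 'The cats were sleeping.']
words = nltk.word_tokenize(sentences[0])    # ['Dogs', 'are', 'running', 'in', 'the', 'park', '.']

print(PorterStemmer().stem("running"))      # run
print(LancasterStemmer().stem("running"))   # run
print(WordNetLemmatizer().lemmatize("cats"))  # cat

english_stopwords = stopwords.words('english')
print([w for w in words if w.lower() not in english_stopwords])  # ['Dogs', 'running', 'park', '.']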
Code example:
colnames = dataframe1.columns
texts = dataframe1[colnames[0]]
token_list = []
token_list1 = []
for index, row in dataframe1.iterrows():
    corpus = row['text']
    for sentence in nltk.sent_tokenize(corpus):
        for token in nltk.word_tokenize(sentence):
            if token.lower() not in l_stopwords:
                token_list.append(token.lower())
wnl = WordNetLemmatizer()
for word in token_list:
    token_list1.append(wnl.lemmatize(word).encode("utf8"))
dataframe_output = pd.DataFrame(np.array(token_list1), columns=['tokens'])
Note: I am not a professional Python developer, so the above code is intended only as a PoC.
A few statements in the above code may be hard to follow at first; the code is explained line by line below.
How to use Python script/NLTK in Azure Machine Learning Studio
Azure ML provides an Execute Python Script module to run Python scripts. On MSDN, there is a getting-started tutorial on how to use the Python script module in Azure ML. Azure takes care of all the dependent Python libraries and even pre-installs popular libraries such as scikit-learn and NumPy. The list of libraries installed in Azure is given in this blog.
Unfortunately, it doesn't include the NLTK package.
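For context, every Execute Python Script module must define an azureml_main entry point; a minimal pass-through version looks like the sketch below, with the two input ports arriving as pandas data frames:
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Azure ML passes each connected input port as a pandas DataFrame.
    # Whatever is returned (a sequence of DataFrames) feeds the output port.
    return [dataframe1]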
Each Execute Python Script module has three inputs:
- Dataset1 - An optional dataset from your Machine Learning Studio workspace, containing input data or values.
- Dataset2 - A second dataset, also optional.
- Script bundle - A zipped file containing custom resources. This input is extracted during execution and can be very useful.
For example, suppose you have a custom Python script that solves some problem. Upload this script as a zip file to Azure ML, drag the zip file into the experiment, and connect the Script bundle input to it. You can then use the methods or definitions from that script in the Python script module, as sketched below.
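As a rough sketch (the module name mytextutils and its clean function are hypothetical, and "Script Bundle" is the extraction folder I have seen reported for Azure ML Studio):
import sys

def azureml_main(dataframe1 = None, dataframe2 = None):
    # The zip connected to the Script bundle port is extracted to ".\Script Bundle"
    sys.path.insert(0, ".\\Script Bundle")
    import mytextutils  # hypothetical module packaged in the uploaded zip
    # Apply a hypothetical helper from the bundle to the 'text' column
    dataframe1['text'] = dataframe1['text'].apply(mytextutils.clean)
    return [dataframe1]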
Steps to use NLTK in Azure ML
Get your input data however you want: use an input module, or even feed the text in manually.
Drag and drop the Execute Python Script module from the Python Language Modules category.
Since the NLTK package is not available by default, you need to download it during execution. NLTK is big, so it's not recommended to download the entire package; download the individual packages one by one, as you need them.
nltk.download(info_or_id='punkt', download_dir='C:/users/client/nltk_data')
The above statement downloads the punkt package, which is used to split paragraphs into sentences and sentences into tokens.
nltk.download(info_or_id='stopwords', download_dir='C:/users/client/nltk_data')
The above statement downloads the stopwords package. Use this library to remove stop words from the text.
nltk.download(info_or_id='wordnet',download_dir='C:/users/client/nltk_data')
The above statement downloads the WordNet package; use this library to lemmatize the words in the text.
The code below shows how to use the NLTK package in Azure ML to pre-process text. Comments are added for each line so that even .NET developers can follow this Python script.
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Import the nltk module
    import nltk
    # Import the pandas package, generally used for transforming data
    import pandas as pd
    # Import NumPy, the fundamental package for scientific computing in Python,
    # used here for its powerful N-dimensional array object
    import numpy as np
    # Regular expression package
    import re

    # Lists for the raw and lemmatized tokens
    token_list = []
    token_list1 = []

    # Download the punkt package, used by sent_tokenize and word_tokenize
    nltk.download(info_or_id='punkt', download_dir='C:/users/client/nltk_data')
    # Download the stopwords package, used for removing stop words
    nltk.download(info_or_id='stopwords', download_dir='C:/users/client/nltk_data')
    # Download the wordnet package, used for lemmatization
    nltk.download(info_or_id='wordnet', download_dir='C:/users/client/nltk_data')

    # Import stopwords
    from nltk.corpus import stopwords
    # Import WordNetLemmatizer
    from nltk.stem import WordNetLemmatizer

    # Get the stop words for the English dictionary
    l_stopwords = stopwords.words('english')

    # dataframe1 is one of the inputs of this module, similar to a dataset in .NET.
    # Get the columns of the dataset, then the text of the first column.
    colnames = dataframe1.columns
    texts = dataframe1[colnames[0]]

    for index, row in dataframe1.iterrows():            # loop through each row in the dataset
        corpus = row['text']                            # get the 'text' column of the current row
        for sentence in nltk.sent_tokenize(corpus):     # split the paragraph into sentences
            for token in nltk.word_tokenize(sentence):  # split each sentence into tokens
                if token.lower() not in l_stopwords:    # check each token against the stop words
                    token_list.append(token.lower())    # if it is not a stop word, add it to the list

    wnl = WordNetLemmatizer()                           # create a WordNetLemmatizer object
    # For each word in the token list, get the lemma; for example, "cats" returns "cat"
    for word in token_list:
        token_list1.append(wnl.lemmatize(word).encode("utf8"))

    # Transfer the list of tokens to a dataset
    dataframe_output = pd.DataFrame(np.array(token_list1), columns=['tokens'])
    # Return the dataset. Note: in Azure, the Python script module always accepts
    # datasets as input, and the output must also be a dataset.
    return [dataframe_output]
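To sanity-check this module outside Azure, you can call azureml_main directly with a small data frame; this assumes the input column is named 'text', which is what the code above expects:
import pandas as pd

# A hypothetical one-row input resembling what Azure ML would feed the module
df = pd.DataFrame({'text': ["Dogs are running in the park. The cats were sleeping."]})
result = azureml_main(df)[0]
print(result)  # one lowercased, stop-word-free, lemmatized token per row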
See Also
Another important place to find an extensive amount of Cortana Intelligence Suite related articles is the TechNet Wiki itself. The best entry point is Cortana Intelligence Suite Resources on the TechNet Wiki.