Step by Step Machine Learning: A Classification model to help the Humane Society

06/13/2016

This tutorial will walk you through a classification machine learning experiment to predict one of several possible outcomes for an animal brought to the humane society.

If you missed the previous tutorials, you can find them here:

Predicting survivors on the Titanic (two-class prediction)
Analyzing breast cancer data (two-class prediction)

For a more comprehensive introduction to data science and Azure Machine Learning Studio, check out Data Science and Machine Learning Essentials on MVA.

In this post I will show you how to:

Create your machine learning workspace
Create a basic classification experiment
Work the data to improve accuracy
Deploy the trained model as a predictive web service and call it from Python code

I recently adopted two cats from the humane society. That’s Oola, our acrobatic ninja cat, curled up on the higher bed and Cluseau, the lovable but somewhat clueless addition to the family, giving himself a bath on the lower level. As I was considering interesting new data sets to explore, I stumbled across a data set from the SPCA (Society for the Prevention of Cruelty to Animals) on Kaggle. They wanted to predict the outcome for animals that arrived at the shelter.

This data set helps illustrate two very important principles of machine learning:

Defining the business problem and benefits

Whenever you start an experiment, you should have a problem in mind that you want to solve. What’s the goal of the SPCA? To find homes for as many animals as possible. If you are running the SPCA and machine learning lets you predict that an animal is going to be difficult to place in a home, you might be able to take action to increase that animal’s chance of adoption. In my home town, the humane society has animals on display at local pet stores. I am sure some pet stores get more traffic and have higher adoption rates, so perhaps you could give priority at those stores to animals who are less likely to find a home. As I created this tutorial, I also discovered that animals that were not neutered or spayed had a lower chance of adoption, so using funds to neuter or spay animals at the center should also increase the adoption rate. Learnings from the experiment can be used directly to improve the adoption rate of animals at the center.

Data preparation

Executing your machine learning experiment is only a part of machine learning. The biggest chunk of the work is collecting the data, becoming familiar with it, and getting it into a suitable format for machine learning. The original data set from the SPCA is not optimized for machine learning and needs some work! I’ll talk more about this in the video “Work the data to improve accuracy”, which you will find in section 3 of this post.

On with the tutorial!

Okay, enough talk. Let’s get started.

1. Create a machine learning workspace

First things first: you are going to need a Machine Learning workspace to run any experiment.

2. Create a basic classification experiment

If you have followed the tutorials for the two-class experiments, you will notice the steps here are very similar! The biggest difference is in the evaluation report. Because the model now chooses among several possible outcomes instead of two, the results for this classification experiment look quite different from those of a two-class experiment.
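To make that difference concrete, here is a minimal sketch in plain Python of the kind of summary a multiclass evaluation is built around: a confusion matrix with one row per actual outcome and one column per predicted outcome, plus recall for each class. The outcome labels and the handful of sample predictions are invented for illustration and do not come from the SPCA data or from the Studio report itself.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def per_class_recall(matrix, labels):
    """Fraction of each actual class that the model predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall

# Hypothetical outcome labels and made-up predictions, for illustration only.
labels = ["Adoption", "Transfer", "Return_to_owner"]
actual = ["Adoption", "Adoption", "Transfer",
          "Return_to_owner", "Transfer", "Adoption"]
predicted = ["Adoption", "Transfer", "Transfer",
             "Adoption", "Transfer", "Adoption"]

matrix = confusion_matrix(actual, predicted, labels)
recall = per_class_recall(matrix, labels)
overall_accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

for label, row in zip(labels, matrix):
    print(label, row)
print("overall accuracy:", round(overall_accuracy, 3))
print("recall per class:", recall)
```

With two classes you can often get by with a single accuracy or AUC number. With several classes, a decent overall accuracy can hide an outcome the model never predicts correctly, which is why the per-class view matters.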

3. Work the data to improve accuracy

When you get unsatisfactory results from your model, it’s time to go back to your data. Machine learning is an iterative process. You will find yourself constantly going back, tweaking the data, and trying a new experiment in the hope of getting better results. In this video I will share a few examples of ways you could work with the data in the SPCA data set to improve your results.

I could have spent even more time improving this data set for my experiment. I didn’t even get to analyzing the data set for dogs! Think about the factors families consider when they adopt a dog: size, temperament, purebred vs. mixed breed, and how allergy friendly the dog is (i.e. how much fur it sheds). A lot of this can be determined from information about the breeds. Obviously it would be a fair bit of work to extract the breed for each dog and fetch information from another data source that tells us the temperament, size, and allergy-friendly level of each breed. We would also need to make sure each of those values was categorized in a way that was helpful. Should we break size down into small/medium/large, or into average expected height in inches? Should we simply identify which breeds are considered allergy friendly, such as poodles?
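As a rough sketch of what deriving those breed features might look like in Python, the snippet below turns a breed string and an expected height into three categorical values. Everything specific here is an assumption for illustration: the function names, the height thresholds, the idea that mixed breeds are marked with a trailing "Mix" or a "/" in the breed text, and the example heights. Only poodles are listed as allergy friendly because that is the one example named above; a real list would come from an external breed data source.

```python
# Hypothetical list; only "poodle" is taken from the text above.
ALLERGY_FRIENDLY_BREEDS = {"poodle"}

def size_category(height_in, small_max=15, medium_max=22):
    """Bucket an expected height in inches (thresholds are arbitrary)."""
    if height_in <= small_max:
        return "small"
    if height_in <= medium_max:
        return "medium"
    return "large"

def breed_features(breed_text, height_in):
    """Derive categorical features from a breed string and expected height."""
    name = breed_text.strip().lower()
    # Assumed convention: "X Mix" or "X/Y" marks a mixed breed.
    is_mix = name.endswith(" mix") or "/" in name
    allergy_friendly = any(b in name for b in ALLERGY_FRIENDLY_BREEDS)
    return {
        "size": size_category(height_in),
        "mixed_breed": is_mix,
        "allergy_friendly": allergy_friendly,
    }

# Example heights are made up for illustration.
poodle_mix = breed_features("Poodle Mix", 14)
great_dane = breed_features("Great Dane", 30)
print(poodle_mix)
print(great_dane)
```

Whether small/medium/large or a numeric height works better is exactly the kind of question you answer by rerunning the experiment with each version and comparing the results.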

Machine learning is an iterative process. If you aren’t happy with the results, you can change the algorithm you use to train the model or tweak its parameters, and either can improve your accuracy. But don’t forget to go back and re-evaluate the data itself! Is there additional data you can fetch or derive? Are there columns that are not formatted in an optimal way for machine learning?
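One common case of a column that is not formatted well for machine learning is a value stored as free text when it is really a number. The sketch below assumes, purely for illustration, an age column holding strings such as "2 years" or "3 weeks" and converts it to an approximate number of days. The column format, the function name, and the 30-day month and 365-day year are all assumptions, not details taken from the SPCA data set.

```python
# Approximate conversion factors; months and years are rounded.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age_text):
    """Convert text like '2 years' to approximate days, or None if unparseable."""
    if not age_text:
        return None
    parts = age_text.strip().lower().split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    unit = parts[1].rstrip("s")  # treat "years" and "year" the same
    if unit not in DAYS_PER_UNIT:
        return None
    return int(parts[0]) * DAYS_PER_UNIT[unit]

# Made-up sample values, for illustration only.
samples = ["2 years", "3 weeks", "1 month", "5 days", "", "unknown"]
converted = [age_to_days(s) for s in samples]
print(converted)
```

A numeric age lets the algorithm treat "3 weeks" and "1 month" as close together, where the raw text values would look like unrelated categories.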

4. Deploy the trained model as a web service

Of course, a trained model is only useful if you have a way to use it! In this short video I show you how to deploy the trained model as a predictive web service and call it from a simple Python application.
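If you want a feel for the calling side before watching the video, here is a minimal sketch that builds the HTTP request using only the Python standard library. The URL, API key, and column names are placeholders, and the JSON layout (an "Inputs" object holding "ColumnNames" and "Values") is an assumption based on the shape that Azure Machine Learning Studio sample code typically uses. Copy the real endpoint, key, and request format from your own web service’s API help page rather than trusting this sketch.

```python
import json
import urllib.request

def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a POST request for a scoring web service."""
    payload = {
        # Assumed request layout; check your service's API help page.
        "Inputs": {"input1": {"ColumnNames": column_names, "Values": rows}},
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    # Supplying data makes urllib send this as a POST.
    return urllib.request.Request(url, data=body, headers=headers)

# Placeholder endpoint, key, and columns; none of these are real.
request = build_scoring_request(
    url="https://example.invalid/score",
    api_key="YOUR_API_KEY",
    column_names=["AnimalType", "Sex", "AgeInDays"],
    rows=[["Cat", "Neutered Male", "730"]],
)

# To actually call the service, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
#     print(result)
```

Separating "build the request" from "send the request" makes it easy to inspect the payload while you are still matching column names to the ones your experiment expects.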
