Would you have survived the Titanic? Try this step-by-step Machine Learning experiment to find out!

In this post I will show you, step by step, how to create a machine learning experiment with Azure Machine Learning Studio that allows you to predict whether you or your friends would have survived the sinking of the Titanic!

If you prefer to learn with a video, check out this great video by Jennifer Marsman. For a more comprehensive introduction to data science and Azure Machine Learning Studio, check out Data Science and Machine Learning Essentials on MVA.

Creating a Machine Learning Workspace

To use Azure Machine Learning Studio from your Azure account, you need a Machine Learning workspace. This workspace contains the tools you need to create, manage, and publish machine learning experiments.

To create a workspace, sign in to your Microsoft Azure account.

How do I get access to Azure Machine Learning Studio?

Azure Machine Learning Studio is part of Microsoft Azure. Microsoft Azure is a paid service, but there are a number of programs and trials you can use to explore its capabilities:

  • You can try the guest sign in for Azure ML Studio to explore many (but not all) features of Azure Machine Learning Studio
  • Sign up for a one month free trial of Microsoft Azure
  • Students get some Azure features for free through DreamSpark. Learn how to sign up for student Azure (unfortunately, Azure ML Studio is not included in the DreamSpark/Azure offer)
  • Will you use Machine Learning in your start-up? Check out Microsoft BizSpark, a program for startups which includes Azure benefits through MSDN
  • If you work at a company, ask if you have an MSDN subscription. You may already have access to Azure
  • If you want to use it in a course at school, faculty can apply for Azure Education grants, which provide all students in the class with a 6 month Azure pass.
  • If you want to use it for academic research, you can apply for Azure Research grants at azure4research.com

1. Navigate to the Microsoft Azure portal (portal.azure.com) and log in using your Microsoft account credentials.

2. In the Microsoft Azure portal, create a new machine learning workspace. Select + New | Data + Analytics | Machine Learning.

You will be redirected to the original Azure portal to enter the details for your machine learning workspace.

1. Enter a WORKSPACE NAME for your workspace

NOTE: Later, you can share the experiments you're working on by inviting others to your workspace. You can do this in Machine Learning Studio on the SETTINGS page. You just need the Microsoft account or organizational account for each user.

2. Specify the Azure LOCATION

3. Select an existing Azure STORAGE ACCOUNT or select Create a new storage account to create a new one and give your new storage account a name.

4. Select CREATE AN ML WORKSPACE.

Creating a new experiment in Azure Machine Learning Studio

After your Machine Learning workspace is created, you will see it listed on the portal under MACHINE LEARNING. At the time this post was written, Machine Learning workspaces are always displayed in the Azure Classic portal (even if you select the menu option from the new portal to create them). At some point the new portal will be updated so you can list them without going to the Classic view.

Once your Machine Learning workspace is created, select your workspace from the list and then select Sign-in to ML Studio to access the Machine Learning Studio so you can create your first experiment!

When prompted to take a tour select Not Now. You may want to take a tour later when you are exploring this tool on your own.

At the bottom of the screen, select +NEW, then select +Blank Experiment.

Change the title at the top of the experiment to read “Titanic survival predictor”.

Loading the data set

The Titanic data set is not one of the sample data sets preloaded in Azure Machine Learning Studio. It is an open data set you can download from various sources on the internet. Different files have slightly different columns and formats. The file used in this example is the train.csv Titanic file from Kaggle. Downloading it requires a Kaggle account, but if you are exploring machine learning, you might want to consider creating one anyway to access other interesting datasets or to try some of their competitions.

I have renamed the train.csv file on my computer to TitanicSurvival.csv

Once you have downloaded the file, you will need to create a dataset in Azure Machine Learning Studio for the Titanic csv file.

Select + NEW at the bottom of the screen

Select DATASET | FROM LOCAL FILE

1. Select the DATA TO UPLOAD by browsing to the csv file you downloaded containing the Titanic data.

2. Enter NAME FOR THE NEW DATASET

3. Specify the TYPE FOR THE NEW DATASET as Generic CSV File with a header (.csv). This indicates we have a csv file and the first row of the csv file contains the headers for the data columns

4. Enter a description of the dataset to help you remember the dataset contents

5. Select the checkmark to start uploading the data into a dataset

Expand Saved Datasets | My Datasets and drag your newly created Titanic dataset to the experiment

Right-click on the dataset on your worksheet and select dataset | visualize from the pop-up menu. Explore the dataset by clicking on different columns. It’s essential in Machine Learning to be familiar with your data. This dataset contains information about passengers on the Titanic and whether or not they survived.

  • PassengerId is a unique identifier assigned to each passenger
  • Survived is a flag that indicates if a passenger survived. 0 = No, 1 = Yes
  • Pclass is the passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
  • Name is the name of the passenger
  • Sex indicates the gender of the passenger
  • Age indicates the age of the passenger
  • SibSp indicates the number of siblings or spouses aboard the Titanic with the passenger
  • Parch indicates the number of parents or children aboard
  • Ticket indicates the ticket number issued to the passenger
  • Fare indicates the amount of money spent on their ticket
  • Cabin indicates the cabin occupied by the passenger
  • Embarked indicates the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
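
If you would like to poke at these columns outside the Studio as well, a few lines of Python will do. This is only a minimal sketch using the standard library: the two sample rows are made up for illustration and simply follow the column layout described above, and the helper names `load_rows` and `count_missing` are my own. With the real file you would read TitanicSurvival.csv instead of the in-memory string.

```python
import csv
import io

# Two made-up rows in the column layout described above (not real passenger data).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,0,0,A 0001,8.05,,S
2,1,1,"Doe, Mrs. Jane",female,28,1,0,B 0002,71.50,C10,C
"""


def load_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows, column):
    """Count the rows where the given column is empty."""
    return sum(1 for row in rows if row[column] == "")


rows = load_rows(SAMPLE_CSV)
print(list(rows[0].keys()))          # column names taken from the header row
print(count_missing(rows, "Cabin"))  # how many rows have no cabin recorded
```

Counting empty values per column is a quick way to spot columns that may be too sparse to be useful as features.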

We are going to use Machine Learning to create a model that predicts whether fictional passengers (you? your friends?) would have survived the sinking of the Titanic.

Selecting Features for the Machine Learning Experiment

Some of the columns in the dataset are not meaningful for predicting whether or not a passenger would have survived. We know they evacuated women and children first, so age and gender are certainly important columns (referred to as features) that we want to consider when making our predictions. But PassengerId is just a number assigned to each passenger, and the name of the passenger isn’t going to help us predict survival either.

Let’s select only the significant features in our dataset to use in our machine learning experiment.

Type “Select” into the search bar and drag the Select Columns in Dataset task to the workspace. Connect the output of your dataset to the input of the Select Columns in Dataset task.

The Select Columns in Dataset task allows you to specify which columns in the data set you think are significant to a prediction (i.e. your features). You need to look at the data in the dataset and decide which columns represent data that you think will affect whether or not a passenger would survive. You also need to select the column you want to predict. In this case we are going to try to predict the value of Survived. This is a 0/1 column that indicates whether a passenger survived the sinking of the Titanic.

Click on the Select Columns in Dataset task. On the properties pane on the right-hand side, select Launch column selector. Select the columns you think affect whether or not a passenger would have survived, as well as the column we want to predict: Survived. I selected Survived, Pclass, Sex, Age, Parch, and SibSp.
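
Conceptually, the column selection just keeps a subset of fields from every record. Here is a rough Python sketch of that idea; it is not how the Studio implements the task, and the `select_columns` helper and the sample passenger are hypothetical.

```python
FEATURES = ["Pclass", "Sex", "Age", "Parch", "SibSp"]
LABEL = "Survived"


def select_columns(rows, columns):
    """Keep only the named columns from each row (a list of dictionaries)."""
    return [{name: row[name] for name in columns} for row in rows]


# A made-up record in the dataset's column layout (not real passenger data).
passenger = {
    "PassengerId": "1", "Survived": "0", "Pclass": "3",
    "Name": "Doe, Mr. John", "Sex": "male", "Age": "30",
    "SibSp": "0", "Parch": "0", "Ticket": "A 0001",
    "Fare": "8.05", "Cabin": "", "Embarked": "S",
}

selected = select_columns([passenger], [LABEL] + FEATURES)
print(selected[0])
```

Note that the label column (Survived) is kept alongside the features, because the training step needs to know the outcome for each historical record.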

 

Setting aside data for testing

Whenever we execute machine learning experiments, we use some of our data to train the model and we put some data aside to test the model. In Azure Machine Learning Studio, we use the Split Data task to put aside data for testing.

The Split Data task allows us to divide up our data. We need some of the data to try to find patterns, and we need to save some of the data to test whether the model we create successfully makes predictions. Traditionally you will split the data 80/20 or 70/30.
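
To make the idea of a split concrete, here is a small Python sketch of a randomized 80/20 split. It is an illustration only, not the Studio's implementation: the function name `split_rows` and the fixed seed are my own choices, and the ten integers stand in for ten dataset rows.

```python
import random


def split_rows(rows, fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten dataset rows
train_rows, test_rows = split_rows(rows, fraction=0.8)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters when the file is sorted in some way, since otherwise the test portion might not look like the training portion.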

Type “split” into the search bar and drag the Split Data task to the workspace. Connect the output of the Select Columns in Dataset task to the input of the Split Data task.

Click on the Split Data task to bring up its properties, and specify .8 as the Fraction of rows in the first output.

Training the model

Now we can get Azure Machine Learning Studio to train the model so we can find the patterns in the historical data to make predictions for new records.

Type “train model” into the search bar. Drag the Train Model task to the workspace. Connect the first output (the one on the left) of the Split Data task to the rightmost input of the Train Model task. This will take 80% of our data and use it to train/teach our model to make predictions.

We need to tell the Train Model task which column we are trying to predict with our model. In our case we are trying to predict the value of the column Survived, which indicates if a passenger survived the sinking of the Titanic.

Click on the Train Model task. In the properties window select Launch Column Selector. Select the column Survived.

If you are a data scientist who creates your own algorithms, you could now import your own R code to try and analyze the patterns. But, we can also use one of the existing built-in standard algorithms.

Different types of machine learning use different algorithms. Since we are trying to predict if an output has one of two values, we want to use a two-class algorithm to train our model. Two-class algorithms are used to predict outcomes that can only have two possible values. In our case, that is a value of 1 or 0, which indicates survival.

Type “two-class” into the search bar. You will see a number of different classification algorithms listed. Each algorithm has its advantages and disadvantages. Check out the Azure Machine Learning Studio Cheat Sheet for a quick reference guide to algorithm selection. I am going to select the Two-Class Decision Forest to train my model. Select one of the two-class algorithms and drag it to the workspace.

Connect the output of the Algorithm task to the leftmost input of the train model task.

Testing your model

After the model is trained, we need to see how well it predicts survival, so we need to score the model by having it test against the 20% of the data we split to our second output using the Split Data task.

Type “score” into the search bar and drag the Score Model task to the workspace. Connect the output of Train Model to the left input of the Score model task. Connect the right output of the Split Data task to the right input of the Score Model task as shown in the following screenshot.

image

Now we need a report on our test results.

Type “evaluate” into the search bar and drag the Evaluate Model task to the bottom of the workspace. Connect the output of the Score model task to the left input of the Evaluate Model task.

You are now ready to run your experiment!

Press Run on the bottom toolbar. You will see green checkmarks appear on each task as it completes. When the entire experiment is completed you can check how well your model makes predictions.

How to interpret your results

To see your test results, right-click on the Evaluate Model task and select “Evaluation results | Visualize”.

The closer the graph is to a straight diagonal line the more your model is guessing randomly. You want your line to get as close to the upper left corner as possible.

If you scroll down you can see the detailed results. AUC (Area Under Curve) is a great overall indicator of your model performance. The closer AUC is to 1, the better your model is making predictions.
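
One way to build intuition for AUC: it equals the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor, with ties counting as half. The Python sketch below computes it that way for a handful of made-up labels and scores. I am not claiming this is how the Studio calculates the number; it is just the textbook pairwise definition.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = survived) and model scores, for illustration only.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.2]
print(auc(labels, scores))  # 0.75
```

A model that scores every passenger identically ends up at 0.5, which matches the straight diagonal line of random guessing.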

You can also see the number of false and true positive and negative predictions:

  • True positives are how often your model correctly predicted someone would survive
  • False positives are how often your model predicted a passenger would survive, when they did not survive (i.e. your model predicted incorrectly)
  • True negatives indicate how often your model correctly predicted a passenger would not survive
  • False negatives indicate how often your model predicted a passenger would not survive, when in fact they did survive (i.e. your model predicted incorrectly)

You want high values for true positives and true negatives, and low values for false positives and false negatives.
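
If you want to see how these four counts come out of a list of predictions, here is a small Python sketch. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for 0/1 labels, treating 1 (survived) as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


# Made-up actual outcomes and model predictions (1 = survived).
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)  # share of predictions that were correct
print(tp, fp, tn, fn, round(accuracy, 3))
```

Accuracy is simply the correct predictions (true positives plus true negatives) divided by the total number of predictions.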

Creating a web service for your trained model

Once you have trained a model with a satisfactory level of accuracy, how do you use it? One of the great things about Azure Machine Learning Studio is how easy it is to take your model and deploy it as a web service. Then you can simply have a website or app call the web service and pass in a set of values for the columns you selected, and the web service will return the predicted value and the confidence of the result.

Convert the training experiment to a predictive experiment

Once you've trained your model, you're ready to use it to make predictions for new data. To do this, you convert your training experiment into a predictive experiment. By converting to a predictive experiment, you're getting your trained model ready to be deployed as a web service. Users of the web service will send input data to your model and your model will send back the prediction results.

To convert your training experiment to a predictive experiment, click Run at the bottom of the experiment canvas, then select Set Up Web Service.

Select Set Up Web Service, then select Predictive Web Service.

This will create a new predictive experiment for your web service. The predictive experiment doesn’t have as many components as your original experiment. You will notice a few differences:

  • You don’t need the data set, because when someone calls the web service they will pass in the data to use for the prediction.
  • You still need to identify which columns will be used for predictions if you pass in a full record of data.
  • Your algorithm and Train Model tasks have now become a single trained model, which will be used to analyze the data passed in and make a prediction.
  • We don’t need to evaluate the model to test its accuracy. All we need is a Score Model task to return a result from our trained model.
  • Two new tasks are added to indicate how the data from the web service is input to the experiment, and how the data from the experiment is returned to the web service.

Delete the connection from the Web input to the Select Columns in Dataset task and redraw the connection from the Web input to the Score Model task. If you leave the web input connected to Select Columns in Dataset, the web service will prompt you for values for all the data columns even though we don’t use them to make our prediction. If you connect the web input to the Score Model task directly, the web service will only expect the columns we selected in our Select Columns in Dataset task, which we determined are relevant for making predictions.

For more details on how to do this conversion, see Convert a Machine Learning training experiment to a predictive experiment

Deploy the predictive experiment as a web service

Now that the predictive experiment has been sufficiently prepared, you can deploy it as an Azure web service. Using the web service, users can send data to your model and the model will return its predictions.

To deploy your predictive experiment, click Run at the bottom of the experiment canvas. After it runs successfully, select Deploy Web Service. The web service is set up and you are placed in the web service dashboard.

Test the web service

Select the Test link in the web service dashboard. A dialog pops up to ask you for the input data for the service. These are the columns expected by the scoring experiment. Enter a set of data and then select OK. The results generated by the web service are displayed at the bottom of the dashboard.

You may have to scroll down to see all the fields you need to enter:

  • Survived – leave as 0, that is the value the web service will predict
  • PClass – Enter 1, 2 or 3 for first, second or third class
  • Sex – Enter male or female (lowercase letters)
  • Age – Enter a numeric age
  • SIBSP – Enter the number of siblings and spouses travelling with the passenger
  • PARCH – Enter the number of parents and children travelling with the passenger

The results of the test will appear at the bottom of the screen.

Select Details to see the full record returned

You will see the record you entered followed by the predicted output and the probability (columns Scored Labels and Scored Probabilities respectively). In the screenshot below, the predicted outcome for Survived is 0 with a scored probability of .125. The scored probability refers to the positive outcome (Survived = 1), so the model gives my imaginary passenger only a 12.5% chance of surviving the sinking of the Titanic. The value you see returned will vary depending on the data you specified. Try changing the passenger from 3rd class to 1st class, try changing their age, or try giving them parents or children on board, and see how the predicted output and probability change based on the different values.

image

Calling the web service from your code

Once you deploy your web service from Machine Learning Studio, you can send data to the service and receive responses programmatically.

The dashboard provides all the information you need to access your web service. For example, the API key is provided to allow authorized access to the service, and API help pages are provided to help you get started writing your code. Select Request/Response if you are going to call the web service passing one record at a time. Select Batch Execution if you are going to pass multiple records to the web service at a time.

On the API help page, select Sample code.

You will be presented with code samples for calling the web service from C#, Python and R.

Replace the apiKey of abc123 with the API key displayed in the dashboard of your web service.

Replace the values with the values you wish to pass into the web service and you can now call the web service from your code to retrieve predictions!
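
To give you an idea of what such a call can look like, here is a rough Python sketch using only the standard library. Please treat the sample code on your own API help page as the authoritative version: the URL below is a placeholder, and the request shape (an `Inputs` object with `ColumnNames` and `Values`, plus a `Bearer` authorization header) is my assumption of what the generated Request/Response sample typically contains. The helper names `build_request` and `predict` are my own.

```python
import json
import urllib.request

# Placeholders: copy the real values from your web service dashboard and API help page.
URL = "https://example.invalid/your-service/execute?api-version=2.0&details=true"
API_KEY = "abc123"

# Columns the scoring experiment expects (assumed to match the test dialog).
COLUMNS = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch"]


def build_request(url, api_key, passenger):
    """Build a POST request carrying one passenger record as JSON."""
    body = {
        "Inputs": {
            "input1": {
                "ColumnNames": COLUMNS,
                "Values": [[str(passenger[name]) for name in COLUMNS]],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def predict(url, api_key, passenger):
    """Send the request and return the decoded JSON response (makes a network call)."""
    with urllib.request.urlopen(build_request(url, api_key, passenger)) as response:
        return json.loads(response.read().decode("utf-8"))


# Survived is left as 0; it is the value the web service will predict.
passenger = {"Survived": 0, "Pclass": 3, "Sex": "male", "Age": 30, "SibSp": 0, "Parch": 0}
request = build_request(URL, API_KEY, passenger)
print(request.data.decode("utf-8"))  # inspect the JSON body before sending it
```

Once you have swapped in your real URL and API key, calling `predict(URL, API_KEY, passenger)` would send the record to the service. Keeping the request building separate from the network call makes it easy to check the payload first.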

For more information about accessing a Machine Learning web service, see How to consume a deployed Azure Machine Learning web service.

Now get your friends to enter their data and find out who would survive the sinking of the Titanic!