A hands-on, high-level introduction to TensorFlow, using the Titanic dataset. Can you predict which passengers survived the catastrophe?
Machine learning is everywhere. In every industry, every sector of technology. Consequently, knowledge about machine learning is turning into a vital skill for everybody working in tech.
Thankfully, frameworks like TensorFlow provide well-structured, high-level functions to get started, and tutorials are available to learn the basics of machine learning in no time.
Predicting survival on the Titanic
This tutorial builds on the foundation of TensorFlow's own Iris flower classification tutorial. Having gone through that tutorial is helpful, but not required.
In this tutorial, we’re taking a similar approach to the TensorFlow tutorial, but we’re going further and working with a slightly more complex dataset: the Titanic dataset.
In 1912, the famous ocean liner sank on its maiden voyage from Southampton in the UK to New York City when it rammed an iceberg. Of the 2,224 passengers, over 1,500 died, making the sinking of the Titanic one of the deadliest maritime disasters in history.
The dataset we’re using in this tutorial is the passenger list of the Titanic. Each passenger entry is augmented with additional information such as the passenger’s age, class, the ticket price they paid, and more. The data also denotes whether a passenger survived the disaster.
We’re going to train a machine learning model to predict whether a passenger survived. There is also a Kaggle challenge surrounding this task and dataset.
The features we’re using are the passengers’ age, the class they traveled in, their sex, the number of family members that were on board the ship as well, and their port of embarkation. Using these features, our model will learn to predict the survival of the passengers in the test set with a success rate of approximately 80%.
What you are going to need for this tutorial
In order to follow along, you should be able to program some Python and have a basic idea what machine learning is about.
To run the code on Clusterone, you need to sign up. It’s free and only takes a minute. And you even get $10 in free credits, so you can run your code right away.
Here’s what should be in your toolbox:
- Python 3.5 or higher
- TensorFlow 1.5
- A Clusterone account
- The clusterone Python package. Get it with pip:
pip install clusterone
- The tutorial code and the Titanic dataset. Download them both from our GitHub.
A closer look at the dataset
Before we dive into the implementation in TensorFlow, let’s get an overview of the dataset we’re working with.
As mentioned before, the data is the passenger list of the maiden voyage of the RMS Titanic. Each passenger is mentioned by name. The list contains personal information, such as age, sex, and information about family members that were on board as well. The list also contains information connected to the voyage, such as the class the passenger traveled in (1st, 2nd, or 3rd), the ticket price, and the cabin number if they had a cabin (this is usually not the case for 3rd class passengers). Finally, the list contains a “Survived” entry, indicating whether the passenger survived the catastrophe.
As is usual for machine learning datasets, the list is divided into a training set and a test set. We’ll use the training set to train our model and test its performance with the test set.
Since this is real historical data, not all information is complete. For example, the age of about 20% of the passengers is unknown.
Designing the model
Our task is to predict if a person survived the sinking or not. Before we dive into the implementation, we need to figure out what we’re actually doing. Which of the information that we get about each passenger could help us predict their survival?
From the historical accounts, we know that children and women were allowed to board the lifeboats first. Therefore, age and sex may be vital indicators for survival.
Families were also often allowed to board lifeboats together. The “sibsp” feature indicates the number of siblings and spouses of a passenger that were aboard as well. The “parch” feature is very similar, counting how many parents and children a passenger had aboard the ship. If being part of a family actually increased the chances of survival, these features could be a useful indicator.
Additionally, first- and second-class passengers were allowed into the lifeboats much earlier than third-class passengers. The third class was also located deep inside the ship, so it was much harder for third-class passengers to even get to the lifeboats.
Another feature is the port of embarkation, which could be either Southampton, Cherbourg in France, or Queenstown in Ireland. I’m not sure how this feature should influence the chances of survival, but we’ll add it just for the heck of it.
Age, class, and family relationships are all stored as numeric values (the class is coded as 1 for first class, 2 for second class, and — you guessed it — 3 for third class). The sex is stored as a string value, being either “male” or “female”. Likewise, the embarkation port is either “C”, “S”, or “Q”.
Time to pick a network architecture. Since this is a relatively simple task with a low number of inputs, the network doesn’t have to be all that big. After some trial and error that I am going to spare you, I have found that three fully connected layers of 20 neurons each work pretty well. So, that’s what we’re going to use!
Feel free to try a few different setups, maybe you can find one that works much better. If you do, let me know!
Let’s see how to implement our network. For now, we’ll build the code to only predict survival based on two features, age and class. Later, we’ll add the other features to improve our accuracy.
The first step is to load the data. Then we’ll define our network and eventually feed the data to the network and train it. Ready? Okay, let’s go!
Loading the data
Our data is stored in 2 separate CSV files, one for training and one for testing.
For reading the data from file, we define a function called
load_data(). It takes the paths to the training data file and the test data file as inputs.
To read the CSV values, we’re using pandas and its
read_csv() function. Since we want our model to predict survival, we designate the “Survived” feature as our output (y), while using the other features as input (x).
train_dir is the path to the training CSV file. And did you notice the dropna() function call? This is a method pandas offers to drop rows of the dataset that contain NaN (Not a Number) values. This is important, since our neural network will later expect numeric values as input and wouldn’t know what to do with undefined NaNs.
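A minimal sketch of what load_data() might look like follows; the exact signature, the “Survived” column name, and the return order are assumptions based on the description above:

```python
import pandas as pd

def load_data(train_path, test_path, label="Survived"):
    """Read the train and test CSV files and split each into
    input features (x) and the label we want to predict (y)."""
    # dropna() removes rows with NaN values, which the network
    # could not handle as numeric input.
    train = pd.read_csv(train_path).dropna()
    test = pd.read_csv(test_path).dropna()

    # The "Survived" column is the output (y); everything else is input (x).
    train_x, train_y = train.drop(columns=[label]), train[label]
    test_x, test_y = test.drop(columns=[label]), test[label]
    return train_x, train_y, test_x, test_y
```
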
Prepare for Clusterone
Because we want to run our model on our local machine as well as on Clusterone, we use Clusterone’s get_data_path() function. The function automatically detects whether the script is running locally or in the cloud. On the local machine, the local data path is used, while on Clusterone the script falls back to the platform’s predefined data location.
The same goes for the path to the log files. TensorFlow creates a bunch of logging data throughout the training and evaluation process. This data can be visualized in TensorBoard and is a great help in assessing how well the network is doing and where improvements could be made.
But in order to do this, TensorBoard needs to know where these log files are located. While you can store the log files wherever you want on your local machine and then point TensorBoard to it, they need to be placed in a specific spot for TensorBoard to function on Clusterone.
This is where the get_logs_path() function comes in. Just like get_data_path(), this function detects whether a script is running locally or on Clusterone and adjusts the path to the logs accordingly. We feed it our local logs path (~/Documents/tf-logs/logs in my case), and get_logs_path() returns the correct path depending on the environment.
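The idea behind both helpers is simple enough to sketch in a few lines. Note that this is a hypothetical stand-in for illustration only — the real implementations live in the clusterone package, and the environment variable checked below is made up:

```python
import os

def get_path(local_path, cloud_path):
    # Hypothetical stand-in for get_data_path()/get_logs_path():
    # use the local path when running on our own machine, and the
    # platform's predefined path when running in the cloud.
    # (The environment variable name here is an assumption.)
    running_on_clusterone = "CLUSTERONE_CLOUD" in os.environ
    return cloud_path if running_on_clusterone else local_path
```
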
Building the neural network
The first step to building our network is telling TensorFlow what input vectors it should expect and what formats they have. We do this using TensorFlow’s feature columns. There’s a variety of different types of feature columns, but we only need the simplest one,
tf.feature_column.numeric_column. Since our input features passenger class and age are both numbers, a numeric feature column will serve us just fine.
Note that we pass the name of the feature in our input dictionary to the key parameter of the column. This ensures the feature column can later find the values it should contain.
We’re also collecting all our columns in a list called
passenger_features. We’ll now create our classifier and pass it the list of feature columns:
We’re using a premade classifier from TensorFlow’s estimator module. Estimators are a fairly new component of TensorFlow that has been introduced to make models more reusable. Thankfully, TensorFlow also provides a variety of pre-made estimators that we can use right away without having to worry about any of the low-level API.
The DNNClassifier estimator is also used in TensorFlow’s Getting Started guide and performs reasonably well in classification problems such as ours.
This is also where we define the structure of our network. As mentioned before, we’re going with three layers of 20 neurons each, passed as a list via the hidden_units argument. The model_dir argument defines a directory where the model parameters should be saved.
n_classes defines how many different classes the classifier should learn to differentiate between. In our case this is two: survival or no survival.
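Putting the feature columns and the classifier together might look like this. The feature names and the model_dir value are assumptions for illustration; hidden_units carries our three layers of 20 neurons:

```python
import tensorflow as tf

# One numeric feature column per input, with `key` matching the
# feature's name in the input dictionary.
passenger_features = [
    tf.feature_column.numeric_column(key="Pclass"),
    tf.feature_column.numeric_column(key="Age"),
]

# A pre-made DNN classifier: three fully connected hidden layers of
# 20 neurons each, distinguishing 2 classes (survived or not).
classifier = tf.estimator.DNNClassifier(
    feature_columns=passenger_features,
    hidden_units=[20, 20, 20],
    n_classes=2,
    model_dir="logs/titanic",  # where checkpoints get saved
)
```
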
Training the model
Our model is defined, now it’s time to train it. In TensorFlow, this is almost as easy as calling the
train() method of the classifier object:
Almost, because there’s one more function we need to write first: the train_input_fn() function we’re passing to the input_fn parameter of train(). By the way, don’t get confused by the lambda in front of the function name. This is just a Python trick that lets us pass arguments to train_input_fn(), even though train() requires input_fn to be a function that takes no arguments of its own. Now for the train_input_fn() function itself. Let’s take a look at it:
It takes the features and labels of the training data as input. From them, it creates a dataset object using TensorFlow’s Dataset API. Then we shuffle the dataset and tell it to repeat, which allows us to train for more steps than there are elements in the data. Finally, we use
batch() to separate our data into batches of 100 samples each and return the dataset.
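Under the assumptions above, the input function could look like this. The batch size of 100 comes from the text; the shuffle buffer size is a guess:

```python
import tensorflow as tf

def train_input_fn(features, labels, batch_size=100):
    # Build a dataset from the feature dictionary and the labels.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle the samples, repeat indefinitely so we can train for more
    # steps than there are samples, and cut the data into batches.
    return dataset.shuffle(1000).repeat().batch(batch_size)
```

It is then handed to the classifier wrapped in a lambda, e.g. classifier.train(input_fn=lambda: train_input_fn(train_x, train_y), steps=500) — the step count here is an arbitrary example.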
All that’s left to do now is to pass train_x and train_y to the training function as features and labels, respectively.
Evaluating the model
To test how well our model trains, we add an evaluation step. This is done by calling the classifier’s evaluate() method. The syntax is very similar to the train() method call. We again pass an input function wrapped in a lambda. Here’s the code for the evaluation input function:
Again, we create a dataset, this time from our test data. Again we slice it into batches. No shuffling and repeating is needed this time.
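A sketch of the evaluation input function under the same assumptions as before:

```python
import tensorflow as tf

def eval_input_fn(features, labels, batch_size=100):
    # Same dataset construction as for training, but without shuffling
    # or repeating: each test sample should be evaluated exactly once.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.batch(batch_size)
```
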
In the last line of our main() function, we print the result of our evaluation using the object returned by evaluate().
And that’s all the code we need for now. Let’s run our model and see how well it performs!
Run the model by heading to your console, navigating into the directory you cloned the repository to, and typing:
$ python titanic_basic.py
Once the model is done running, you should see the test accuracy displayed:
Test set accuracy: 0.69
69% is not bad for just looking at two features! It seems like the class and age of the passengers had quite a bit of influence on their chances of survival.
But let’s add the other features and improve our accuracy some more, shall we?
Adding more features
The features we’re adding to the model are the “sibsp” and “parch” features denoting family relations on the boat, as well as the sex of a passenger and their port of embarkation.
“sibsp” and “parch” are numerical values, so all we need to do is add numeric feature columns for them:
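Assuming the column names used earlier, this is just two more entries in the feature-column list (the exact key strings depend on how the CSV names its columns):

```python
import tensorflow as tf

# Numeric columns for class, age, and the two family-relation features.
passenger_features = [
    tf.feature_column.numeric_column(key=name)
    for name in ["Pclass", "Age", "SibSp", "Parch"]
]
```
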
The sex and port of embarkation are denoted by strings. They can have the values “male” or “female” for sex, and “S”, “Q”, or “C” for the ports of embarkation (Southampton, Queenstown, and Cherbourg). To make these features usable for our machine learning model, we need to do some pre-processing.
Pre-processing the sex feature
We’re going to separate the sex feature into two features, called “Sex_male” and “Sex_female”. Each of these new features will be numeric: a field contains a 1 if the passenger is of this sex, and a 0 otherwise.
Why are we making this split? It allows us to train our model on four features instead of three, giving the model another dimension to work with and learn from.
Pre-processing is very common in machine learning and data science, so many libraries come with powerful tools to help us do it. For our purposes, we’ll use sklearn and its encoder classes LabelEncoder and OneHotEncoder.
Our pre-processing will live in a new function called
label_to_onehot(). It takes the data we read from the CSV files as input. We’ll go through the code with the sex feature in mind, but the embarkation feature is processed the same way using the same code.
First, we use the LabelEncoder to create numeric labels for the “male” and “female” strings.
data[feature_name] refers to the “Sex” column in the passenger list. Then, we fire up the OneHotEncoder to turn the numeric feature (either “0” or “1”) into a one-hot vector (either (0,1) or (1,0)).
Now we create two new features (“Sex_male” and “Sex_female”) in our dataset and fill them with the data from our one-hot vector. We also delete the old “Sex” feature, since it’s not needed anymore.
The embarkation feature goes through the same transformation process, creating three new numerical features called “embarked_C”, “embarked_Q”, and “embarked_S”. Again, a “1” denotes the passenger boarded at this port, while a “0” denotes he or she didn’t.
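Here is a sketch of how label_to_onehot() might be implemented with sklearn’s encoders. The function name comes from the text; the column-naming scheme is inferred from the examples above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def label_to_onehot(data, feature_name):
    """Replace a string column of a DataFrame with one 0/1 column per category."""
    # Turn the strings into integer labels (e.g. "female" -> 0, "male" -> 1)...
    encoder = LabelEncoder()
    labels = encoder.fit_transform(data[feature_name])
    # ...then turn the integers into one-hot vectors, e.g. 1 -> (0, 1).
    onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
    # Add one new 0/1 column per category, e.g. "Sex_female" and "Sex_male"...
    for i, category in enumerate(encoder.classes_):
        data[f"{feature_name}_{category}"] = onehot[:, i].astype(int)
    # ...and drop the original string column, which is no longer needed.
    return data.drop(columns=[feature_name])
```

pandas’ get_dummies() achieves the same in a single call, but the encoder-based version mirrors the individual steps described above.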
Changing the model
Now we have to add the new features to the model. Since the features are numeric, we simply create additional numeric feature columns.
That’s all we had to do! Now it’s time to see how our enhanced model performs.
Run the updated model:
$ python titanic.py
This time, you should see a test result similar to this:
Test set accuracy: 0.810
81% is a pretty solid result, especially for the relatively small number of features we’ve used. Further improvements could be reached by using more of the features and possibly by tuning the hyper-parameters of the model, but we’ll leave it at this for this article.
Running on Clusterone
Now, let’s see how we can run the model on Clusterone. The good news is we don’t need to make any changes to the code! What we do need to do is to link the GitHub repo to the Clusterone servers.
Link the GitHub repo
Follow the Create a project section of the Clusterone guide to add the clusterone-tutorials project. Use the clusterone/clusterone-tutorials repository instead of the one shown in the guide.
That’s it. You should be able to see the list of commits and files directly in the Matrix.
Run the job
Now that everything is linked, let’s run the model on the server. We create a job on Clusterone like this:
$ just create job single --project clusterone-tutorials \
    --name titanic-job \
    --command "python titanic/code/titanic.py" \
    --setup-command "pip install -r titanic/code/requirements.txt" \
    --docker-image TensorFlow-1.8.0-cpu-py36 \
    --instance-type aws-t2-small
This creates a job called “titanic-job” on the platform and tells it to use the code stored in the “clusterone-tutorials” project. We also specify the Python script we want to run, a requirements file listing the packages to install, and the Docker image and instance type we want to use.
That’s it! Now, start the job and see how it runs:
$ just start job -p clusterone-tutorials/titanic-job
To see how our job is doing, head over to Clusterone’s web interface, the Matrix. There, you should find your job listed under the clusterone-tutorials project. You can see more information about the status of the job, add it to TensorBoard, and more.
We’ve built a very basic and simple model in TensorFlow. We have used a premade estimator, significantly reducing development time and complexity. But even so, our model performed reasonably well, correctly predicting survival for 81% of the passengers in the test set. Yay!
We have also seen how easy it is to add support for Clusterone to a TensorFlow model. All we really needed to do was call the get_data_path() and get_logs_path() functions, which have no impact on our program’s performance on a local machine.
Many a great article has been written about work on the Titanic dataset. I especially enjoyed reading this entry to the Titanic Kaggle competition by Steffan Jonkers, which offers a lot of insight into the structure of the dataset.
This article from SocialCops explores the fate of some of the people who got misclassified and looks into why these people either defied the odds or died despite a good chance of survival.
If you want to learn more about TensorFlow and Clusterone, check out this article about how to build distributed TensorFlow code and run it on Clusterone.