RMS Titanic (Simple EDA)
January 12, 2021
Notebook Source: Repo Link
# Data manipulation imports
import pandas as pd
# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
# sns.set_style('whitegrid')
# plt.style.use('dark_background')
%matplotlib inline
Import data
DATA_DIR='../data/'
# Import train data and test data
train = pd.read_csv(DATA_DIR + 'raw_titanic_train.csv')
test = pd.read_csv(DATA_DIR + 'raw_titanic_test.csv')
EDA
General Outline of EDA
- Summarize representation and objective
- High level data preview (features, targets, train and test sets)
- Visualize any null values and duplicate entries
- Visual EDA
Note that Survived, Pclass, Sex and Embarked are categorical variables here, so we need dummy variables to represent each category using one-hot encoding. One category per variable is omitted so that the resulting dummy columns are not perfectly correlated with one another.
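A minimal sketch of the encoding described above, using `pd.get_dummies` with `drop_first=True`. The `toy` frame and its values are assumptions made up for illustration, not rows taken from the dataset:

```python
import pandas as pd

# Toy frame standing in for the categorical features (illustrative values only)
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# drop_first=True omits one category per feature, so the remaining
# dummy columns of a feature are not perfectly correlated
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
print(encoded.columns.tolist())
```

Each feature with k categories ends up as k - 1 dummy columns; the omitted category is the one represented by all zeros.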
1a. Representation
Survived: binary, 1 represents survived and 0 represents not survived
Pclass: 1 represents 'First Class', 2 represents 'Second Class' and 3 represents 'Third Class'
Sex: Male and female
Embarked: C represents 'Cherbourg', S represents 'Southampton' and Q represents 'Queenstown'
1b. Objective
Predict survival (survived or not) of passengers on the RMS Titanic
2. High level data preview (features, targets, train and test sets)
Train data
# Check info of train dataset
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the Non-Null Count column above, see that the Age feature has missing values (177 to be exact), Embarked has 2 missing values and Cabin has an astounding 687 missing values.
# Check first 10 rows of train data set
train.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
# Check shape of train dataset
train.shape
(891, 12)
From the table below, the columns with meaningful summary statistics (the continuous, non-categorical features) are Age and Fare. In the train data the mean Age is 29.699118 and the median is 28, so at least half of the passengers with a recorded age were 28 or younger. The youngest passenger was about 5 months old (0.42 years) and the oldest was 80 years old.
# Show description of train dataset for numerical columns
# (may include some categorical columns as well)
train.describe()
# Note that 'Survived' and 'Pclass' shown are categorical variables and not numerical
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
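One way to read the shape of a distribution from summary statistics like these is to compare the mean with the median and look at the sample skewness. A small sketch on made-up ages (the `ages` values are assumptions for illustration; on the real data, `train['Age']` would take their place):

```python
import pandas as pd

# Illustrative ages only, not taken from the dataset
ages = pd.Series([2.0, 22.0, 24.0, 26.0, 28.0, 35.0, 54.0, 80.0])

mean_age = ages.mean()
median_age = ages.median()
skewness = ages.skew()  # positive values indicate a longer right tail

print(f'mean={mean_age:.2f}, median={median_age:.2f}, skew={skewness:.2f}')
```

A mean above the median together with positive skewness suggests that a few older passengers pull the average up.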
Test data
# Check info of test dataset
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
For the test dataset, the Non-Null Count column above shows that Age has 86 missing values, Cabin has 327 missing values and Fare has 1 missing value. Since Fare has no missing values in the train dataset, the missing Fare in the test set has to be imputed separately.
# Check first 10 rows of test data set
test.head(10)
# Has same columns as the train data except the label, 'Survived' is out
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S |
6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S |
8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S |
# Check the dimensions (row count) of the test data
test.shape
(418, 11)
The test dataset is different from the train dataset, as it should be. As shown in the table below, the mean Age in the test data is 30.272590. The youngest passenger was about 2 months old (0.17 years) and the oldest was 76 years old.
# Show the description of the test data, excluding non-numeric columns
test.describe()
# Note that 'Pclass' shown is a categorical variable
| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|
count | 418.000000 | 418.000000 | 332.000000 | 418.000000 | 418.000000 | 417.000000 |
mean | 1100.500000 | 2.265550 | 30.272590 | 0.447368 | 0.392344 | 35.627188 |
std | 120.810458 | 0.841838 | 14.181209 | 0.896760 | 0.981429 | 55.907576 |
min | 892.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 996.250000 | 1.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
50% | 1100.500000 | 3.000000 | 27.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1204.750000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.500000 |
max | 1309.000000 | 3.000000 | 76.000000 | 8.000000 | 9.000000 | 512.329200 |
3. Visualize any null values and duplicate entries
Visualize null values in train data
# Visualize null values in train data
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
Visualize null values in test data
# Visualize null values in test data
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
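The heatmaps show where values are missing, and a numeric summary can complement them, including the duplicate check named in this section's heading. A minimal sketch on a toy frame (the `toy` frame and its values are assumptions for illustration; `train` or `test` would be used in practice):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values and one fully duplicated row
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 3],
    'Age': [22.0, np.nan, 26.0, 26.0],
    'Cabin': [np.nan, 'C85', 'E46', 'E46'],
})

# Number of missing values per column
null_counts = toy.isnull().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = toy.duplicated().sum()

print(null_counts)
print('duplicate rows:', duplicate_rows)
```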
4. Visual EDA
Visualization of label categories (survived = 1, not survived = 0) in train set
# Visualizing classification label 'Survived'
sns.countplot(x='Survived', data=train)
plt.title('Count of Survived and Not Survived')
Text(0.5, 1.0, 'Count of Survived and Not Survived')
# Visualizing classification label 'Survived' for 'Sex' feature
sns.countplot(x='Survived',data=train,hue='Sex',palette='RdBu_r')
plt.title('Count of Survived distributed against Sex')
Text(0.5, 1.0, 'Count of Survived distributed against Sex')
# Visualizing classification label 'Survived' for 'Pclass' feature
sns.countplot(x='Survived',data=train,hue='Pclass',palette='viridis')
plt.title('Count of Survived distributed against Pclass')
Text(0.5, 1.0, 'Count of Survived distributed against Pclass')
train['Age'].dropna().plot.hist(bins=40)
plt.title('Age distribution plot')
Text(0.5, 1.0, 'Age distribution plot')
sns.countplot(x='SibSp',data=train)
plt.title('Count of SibSp')
Text(0.5, 1.0, 'Count of SibSp')
sns.countplot(x='Survived', data=train, hue='Embarked')
plt.title('Count of Survived distributed against Embarked')
# It looks like many passengers embarked from Southampton.
# Nonetheless, this distribution does not clearly show any relevant correlation between embarkation and survival.
Text(0.5, 1.0, 'Count of Survived distributed against Embarked')
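Raw counts can hide a relationship when one group is much larger than the others, so it may help to look at survival rates per port as well. A hedged sketch on a toy frame (the values are made up for illustration and do not reflect the real survival rates):

```python
import pandas as pd

# Toy data: Southampton is deliberately over-represented
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 0, 1, 1, 0, 0, 1],
})

# Share of each Survived value within every port (each row sums to 1)
rates = pd.crosstab(toy['Embarked'], toy['Survived'], normalize='index')

# Equivalent shortcut for a 0/1 label: the mean is the survival rate
survival_rate = toy.groupby('Embarked')['Survived'].mean()

print(rates)
print(survival_rate)
```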
sns.histplot(train['Fare'], kde=False, bins=40)
plt.title('Fare distribution plot')
Text(0.5, 1.0, 'Fare distribution plot')
plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass',y='Age',data=train)
plt.title('Boxplot of Pclass against Age')
Text(0.5, 1.0, 'Boxplot of Pclass against Age')
Imputation
General Outline of Imputation
- Impute missing values in Age based on mean of age grouped by Pclass
- Engineer Title feature from Name feature
- Drop Cabin feature
- Impute missing values in Embarked with most frequent embarkation
- Impute missing value in Fare feature
NB: Imputation done separately for train and test datasets
import re
# Feature engineering on the Name feature. It looks like the passenger's title could have an effect on survival
def eng_name(data):
    """
    helper method to engineer new Title feature from Name feature
    :param data: train or test data
    :return: None (the Title column is added to data in place)
    """
    # 'Braund, Mr. Owen Harris' splits into ['Braund', 'Mr', 'Owen Harris']
    data['Title'] = data['Name'].apply(lambda name: re.split('[,.]+ *', name)[1])
def impute_age(data):
    """
    helper method to perform imputation on Age
    :param data: train or test data
    :return: None (the Age column is updated in place)
    """
    # Mean age of each passenger class, computed from the non-missing ages
    Pclass1Mean = data['Age'][data['Pclass'] == 1].mean()
    Pclass2Mean = data['Age'][data['Pclass'] == 2].mean()
    Pclass3Mean = data['Age'][data['Pclass'] == 3].mean()
    def _impute(cols):
        Age = cols['Age']
        Pclass = cols['Pclass']
        if pd.isnull(Age):
            if Pclass == 1:
                return Pclass1Mean
            elif Pclass == 2:
                return Pclass2Mean
            else:
                return Pclass3Mean
        else:
            return Age
    data['Age'] = data[['Age', 'Pclass']].apply(_impute, axis=1)
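The per-class mean imputation in `impute_age` can also be sketched more compactly with `groupby(...).transform('mean')`. This is an alternative formulation, not the notebook's own code, and the `toy` frame is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age in each class
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean age of each row's own class, aligned with the original index
class_mean_age = toy.groupby('Pclass')['Age'].transform('mean')

# Fill only the missing ages; existing ages are left untouched
toy['Age'] = toy['Age'].fillna(class_mean_age)
print(toy)
```

This form avoids hard-coding the three class values and a row-wise `apply`, at the cost of being a little less explicit.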
Train dataset
1. Impute missing values in Age based on mean of age grouped by Pclass
# perform imputation on train data
impute_age(train)
Visualize null values after Age imputation in train data
# Visualize null values in train data
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
2. Engineer Title feature from Name feature
eng_name(train)
train.head()[['PassengerId', 'Name', 'Title']]
| | PassengerId | Name | Title |
---|---|---|---|
0 | 1 | Braund, Mr. Owen Harris | Mr |
1 | 2 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs |
2 | 3 | Heikkinen, Miss. Laina | Miss |
3 | 4 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs |
4 | 5 | Allen, Mr. William Henry | Mr |
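As an alternative sketch, the same title can be pulled out with the vectorised `Series.str.extract`. The regular expression below is an assumption about the name format (surname, comma, title, period), which matches the rows shown above:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Capture the text between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

# Frequency of each title, useful for spotting rare ones
title_counts = titles.value_counts()
print(titles.tolist())
```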
3. Drop Cabin feature
# Cabin has too many missing values; it could instead be made categorical as Cabin Available vs Cabin Unavailable
train.drop('Cabin',axis=1,inplace=True)
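The comment above mentions an alternative to dropping Cabin outright: keep a flag for whether a cabin was recorded. A minimal sketch of that idea, where the `HasCabin` column name and the `toy` frame are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123']})

# 1 if a cabin was recorded, 0 if the value is missing
toy['HasCabin'] = toy['Cabin'].notna().astype(int)

# The sparse original column can then be dropped
toy = toy.drop(columns='Cabin')
print(toy)
```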
4. Impute missing values in Embarked with most frequent embarkation
# Impute the two missing Embarked values with the most frequent port of embarkation
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
Visualize null values after dropping the Cabin feature, engineering the Title feature and imputing the Embarked feature in train data
# Visualize null values in train data
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
plt.figure(figsize=(10,7))
sns.boxplot(x='Pclass',y='Age',data=test)
<AxesSubplot:xlabel='Pclass', ylabel='Age'>
Test dataset
1. Impute missing values in Age based on mean of age grouped by Pclass
# Impute missing age values in test dataset
impute_age(test)
2. Engineer Title feature from Name feature
eng_name(test)
test.head()[['PassengerId', 'Name', 'Title']]
| | PassengerId | Name | Title |
---|---|---|---|
0 | 892 | Kelly, Mr. James | Mr |
1 | 893 | Wilkes, Mrs. James (Ellen Needs) | Mrs |
2 | 894 | Myles, Mr. Thomas Francis | Mr |
3 | 895 | Wirz, Mr. Albert | Mr |
4 | 896 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | Mrs |
3. Drop Cabin feature
# Cabin has too many missing values; it could instead be made categorical as Cabin Available vs Cabin Unavailable
test.drop('Cabin',axis=1,inplace=True)
5. Impute missing value in Fare feature
# Impute (one) missing value for Fare column. The culprit is a male and third class passenger
test['Fare'] = test['Fare'].fillna(test['Fare'][(test['Pclass'] == 3) & (test['Sex'] == 'male')].mean())
Visualize null values after dropping Cabin feature, imputing Age, imputing Fare in test data
# Visualize null values in test data
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')
<AxesSubplot:>
# Save clean train and test data to CSV files
train.to_csv(DATA_DIR + 'clean_titanic_train.csv',index=False)
test.to_csv(DATA_DIR + 'clean_titanic_test.csv',index=False)
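Before relying on the saved files, a quick sanity check that no missing values remain and that the data survives a CSV round trip can be useful. A hedged sketch using an in-memory buffer instead of a file (the `clean` frame is an assumption standing in for the cleaned train or test data):

```python
import io
import pandas as pd

# Stand-in for a cleaned dataset (illustrative values only)
clean = pd.DataFrame({
    'PassengerId': [1, 2],
    'Age': [22.0, 38.0],
    'Embarked': ['S', 'C'],
})

# Fail early if any missing value slipped through the imputation steps
assert not clean.isnull().values.any(), 'missing values remain'

# Write to an in-memory buffer and read it back, mimicking to_csv/read_csv
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.shape)
```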