Titanic Survival Data Analysis


Titanic Data Analysis for survival of passengers

Details

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Data Description

(from https://www.kaggle.com/c/titanic)

survival: Survival (0 = No; 1 = Yes)

pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

name: Name

sex: Sex

age: Age

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

ticket: Ticket Number

fare: Passenger Fare

cabin: Cabin

embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Special Notes:

Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic

Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)

Parent: Mother or Father of Passenger Aboard Titanic

Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them.

As well, some travelled with very close friends or neighbors in a village; however, the definitions do not support such relations.

JUPYTER NOTEBOOK

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
titanic = sns.load_dataset('titanic')
titanic.head()
Out[2]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

Detail of data

In [718]:
titanic.describe()
Out[718]:
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Step 1-Data Cleaning

We will check for NaN or null values and also check whether any columns need to be added or removed.

In [3]:
# total number of null values in each column
titanic.isnull().sum()
Out[3]:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
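
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch of how the percentage missing per column could be computed; it uses a made-up four-row frame called `sample` (an assumption for illustration, not the real data).

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Titanic data (values are illustrative only)
sample = pd.DataFrame({
    "age":  [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives a boolean frame; the column mean of booleans is the fraction missing
missing_share = sample.isnull().mean() * 100
print(missing_share)
```

Applied to `titanic`, the same expression would turn the counts above into shares of the 891 rows: 177 missing ages is about 20%, and 688 missing deck values is about 77%.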

We see that there are 177 missing values in “age” and 688 missing values in “deck”. First, check the values in “deck”.

In [4]:
titanic['deck'].value_counts()
Out[4]:
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64

More than 75% of the “deck” values are missing (688 of 891), so the column is of little use and we remove it. We also remove the “alive” column, which duplicates “survived”.

In [5]:
# remove unwanted columns
def remove_columns(df, column):
    # drop the column in place; drop(..., inplace=True) returns None, so nothing is returned
    df.drop(column, inplace=True, axis=1)

remove_columns(titanic, 'alive')
remove_columns(titanic, 'deck')
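
Dropping in place changes `titanic` for every later cell. A possible alternative, shown here only as a sketch on a made-up frame called `sample`, is to let `drop` return a new frame so the original stays available.

```python
import pandas as pd

# Made-up two-row frame; only the column names mirror the Titanic data
sample = pd.DataFrame({
    "survived": [0, 1],
    "alive": ["no", "yes"],
    "deck": [None, "C"],
})

# Without inplace=True, drop() returns a new frame and leaves the original untouched
cleaned = sample.drop(columns=["alive", "deck"])
print(list(cleaned.columns))
print(list(sample.columns))
```

Several columns can be dropped in one call by passing a list, which avoids calling a helper once per column.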

Number of Men, Women and Children

In [6]:
titanic['who'].value_counts()
Out[6]:
man      537
woman    271
child     83
Name: who, dtype: int64
In [101]:
sns.barplot(data=titanic, x='who', y='survived')
plt.title('Survival Data for Men, Women and Children')
plt.xlabel('Men, Women and Children')
plt.ylabel('Survival Probability')
Out[101]:

There are 177 null values in age, which can affect our results, so we have to fill them with an appropriate value.

In [102]:
#first we find the mean
titanic['age'].mean()
Out[102]:
29.345679012345681

The mean is 29, but we can't fill all 177 values with 29, as there may be many children among them. Let us check the median.

In [103]:
titanic['age'].median()
Out[103]:
28.0

The median is almost the same as the mean. We can't fill all 177 values with 28 either, as there may be many children among them. So now we check how many children there are.

In [104]:
# The 'who' column tells us whether the entry is a 'man', 'woman' or 'child',
# so we check the number of children here
In [105]:
def count_values(df,col_name):
    return df[col_name].value_counts()
count_values(titanic,'who')
Out[105]:
man      537
woman    271
child     83
Name: who, dtype: int64

The number of children is 83, so we cannot blindly fill the NaN values with the median (28) or the mean (29). We can fill the NaN entries with the median if they belong to an adult ‘man’ or ‘woman’. Let us check the NaN entries in titanic[‘age’] corresponding to titanic[‘who’]==‘child’.

Checking how many Children have missing values

In [106]:
titanic.loc[titanic['who']=='child'].isnull().sum()
Out[106]:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alone          0
Sex_integer    0
child          0
parents        0
dtype: int64

We see that none of the 83 children have NaN values in their age column, so it is safe to replace the missing values of age with the median (28).

In [107]:
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
In [108]:
# total number of null values in each column
titanic.isnull().sum()
Out[108]:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alone          0
Sex_integer    0
child          0
parents        0
dtype: int64
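
The fill above uses one median for every missing age. A variant that is not used in this analysis is to fill each missing age with the median of its own ‘who’ group. The sketch below shows the idea on a made-up frame called `sample`; the column name `age_filled` is also just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age among the men and one among the women
sample = pd.DataFrame({
    "who": ["man", "man", "man", "woman", "woman", "child"],
    "age": [30.0, 40.0, np.nan, 20.0, np.nan, 5.0],
})

# transform("median") returns, for every row, the median age of that row's group
group_median = sample.groupby("who")["age"].transform("median")

# fill only the missing entries, keeping the known ages as they are
sample["age_filled"] = sample["age"].fillna(group_median)
print(sample)
```

Whether this changes anything on the real data depends on how different the group medians are, which the notebook does not report.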
In [109]:
#making a new variable for Gender
titanic['Sex_integer']=np.where(titanic.sex=='male',1,0)
In [110]:
titanic.head()
Out[110]:
survived pclass sex age sibsp parch fare embarked class who adult_male embark_town alone Sex_integer child parents
0 0 3 male 22 1 0 7.2500 S Third man True Southampton False 1 0 0
1 1 1 female 38 1 0 71.2833 C First woman False Cherbourg False 0 0 0
2 1 3 female 26 0 0 7.9250 S Third woman False Southampton True 0 0 0
3 1 1 female 35 1 0 53.1000 S First woman False Southampton False 0 0 0
4 0 3 male 35 0 0 8.0500 S Third man True Southampton True 1 0 0
In [111]:
# convert floats to integer
titanic['age']=titanic['age'].astype(int)

Observations

Total Number of male and female

In [112]:
titanic['sex'].value_counts()
Out[112]:
male      577
female    314
Name: sex, dtype: int64
In [113]:
sns.countplot(x='sex', data=titanic)
plt.xlabel('Gender')
plt.title('Male and Female Passengers')
Out[113]:

Number of adult males, adult females and children

In [114]:
titanic_gender_pivot=titanic.pivot_table('sex', 'who',aggfunc='count')
titanic_gender_pivot
Out[114]:
sex
who
child 83
man 537
woman 271
In [115]:
titanic_gender_pivot.plot(kind='bar',figsize=(15,10))
plt.ylabel('Number of persons')
plt.xlabel('Child or Man or Woman')
Out[115]:

Age distribution

In [116]:
# distplot is deprecated in recent seaborn versions; histplot draws the same histogram
sns.histplot(titanic['age'], color='red')
plt.xlabel('Age')
plt.title('Age of passengers')
Out[116]:

Most passengers are between 20 and 40 years old. The peak at 28 is inflated by the 177 missing values that were filled with the median.

Question: Is there a relation between “Survival” and “Gender”?

In [117]:
titanic_male_female=titanic.pivot_table('survived','sex',aggfunc='sum')
titanic_male_female
Out[117]:
survived
sex
female 233
male 109
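
The pivot above counts survivors, which depends on how many passengers of each sex were aboard. A small sketch of how counts and rates could be shown side by side, using a made-up frame called `sample`:

```python
import pandas as pd

# Made-up data: 3 females (2 survived) and 5 males (1 survived)
sample = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0, 0, 0, 0],
})

# sum of a 0/1 column = number of survivors, count = group size, mean = survival rate
summary = sample.groupby("sex")["survived"].agg(
    survivors="sum", passengers="count", rate="mean"
)
print(summary)
```

With the numbers already shown in this notebook, 233 of 314 females (about 74%) and 109 of 577 males (about 19%) survived.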
In [118]:
titanic_male_female.plot(kind='bar',figsize=(10,6))
plt.ylabel('survived')
Out[118]:

In [119]:
p= sns.barplot(x="sex", y="survived", data=titanic)
p.set(title = 'Gender Distribution by Survival', 
        xlabel = 'Gender', 
        ylabel = 'Whether Survived', 
        xticklabels = ['Male', 'Female']);
plt.show()

The above calculation and graph show that females had a higher chance of survival.

Question 2: Did the economic condition of the passengers play any role in survival?

We will answer this in two parts.

First: we check survival by passenger class (i.e. First, Second and Third class).

Second: we check survival by fare.

In [120]:
titanic.head()
Out[120]:
survived pclass sex age sibsp parch fare embarked class who adult_male embark_town alone Sex_integer child parents
0 0 3 male 22 1 0 7.2500 S Third man True Southampton False 1 0 0
1 1 1 female 38 1 0 71.2833 C First woman False Cherbourg False 0 0 0
2 1 3 female 26 0 0 7.9250 S Third woman False Southampton True 0 0 0
3 1 1 female 35 1 0 53.1000 S First woman False Southampton False 0 0 0
4 0 3 male 35 0 0 8.0500 S Third man True Southampton True 1 0 0
In [121]:
def percentage_survival_by_two_factors(data,factor1,factor2):
    
    return data.pivot_table('survived',factor1,factor2)*100
In [122]:
# finding survival percentage between class and gender
survival_class_gender=percentage_survival_by_two_factors(titanic,'class','sex')
survival_class_gender
Out[122]:
sex female male
class
First 96.808511 36.885246
Second 92.105263 15.740741
Third 50.000000 13.544669

We see that about 97% of females and 37% of males survived in first class, compared to 50% and 14% respectively in third class.
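
The helper above relies on `pivot_table` averaging by default: the mean of a 0/1 column is the share of ones, so multiplying by 100 gives a survival percentage. The sketch below checks this against an explicit `groupby` on a made-up frame called `sample`.

```python
import pandas as pd

# Made-up data with the same column names as the real frame
sample = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third", "Third"],
    "sex": ["female", "male", "female", "female", "male", "male"],
    "survived": [1, 0, 1, 0, 0, 0],
})

# pivot_table uses aggfunc='mean' unless told otherwise
via_pivot = sample.pivot_table("survived", "class", "sex") * 100

# the same table built step by step: group, average, then move 'sex' into the columns
via_groupby = sample.groupby(["class", "sex"])["survived"].mean().unstack("sex") * 100

print(via_pivot)
print(via_groupby)
```

Both tables hold the same values, so either form can be used.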

In [123]:
# catplot is the current name of factorplot in seaborn
g=sns.catplot(x='sex',
                y='survived',hue='pclass' , kind='bar',data=titanic)
# Fix up the labels
g.set(xlabel='gender',ylabel='survived', title='Gender, Class and Survival'
)

plt.show()
In [124]:
g = sns.catplot(x="survived", col="pclass", col_wrap=4,
                    data=titanic,
                    kind="count", height=4.5, aspect=.8,)

g.set_axis_labels("", "Count")
g.set_xticklabels(["Died", "Alive"])
Out[124]:

In [125]:
g= sns.catplot(x='pclass',y='survived',data=titanic,kind='point')
g.set_axis_labels("Passenger Class", "Survival Probability")
g.set_xticklabels(["Class1", "Class2", "Class3"])
g.set(ylim=(0, 1))
Out[125]:

The above plots and calculations show that first class had a higher chance of survival than third class. After first class, second class had the next highest chance of survival. Third class had the lowest chance of survival.

Both males and females had a higher chance of survival in the upper classes, as the data shows.

All of the above plots show that a higher passenger class meant a higher chance of survival.

Second: now we check survival against the fares.

In [126]:
fare = pd.qcut(titanic['fare'], 4)
tt=titanic.pivot_table('survived',['sex',fare])
tt.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.show()

The above graph shows that people paying higher fares had a higher chance of survival. This is consistent with the class result, since higher fares correspond to the upper classes, which we already showed had a higher chance of survival.

I will divide age into two groups, i.e. 0-18 and 18-80, for females and males.
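
The fare groups in the cell above come from `pd.qcut`, which cuts at quantiles so that each bin holds roughly the same number of passengers. A minimal sketch with eight made-up fares (the Series name `fares` is an assumption for illustration):

```python
import pandas as pd

# Eight made-up fares, strongly skewed like real ticket prices
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 30.0, 80.0, 500.0])

# 4 quantile bins: edges are placed at the quartiles of the data
fare_bins = pd.qcut(fares, 4)
print(fare_bins.value_counts().sort_index())
```

Each of the four bins ends up with two fares even though the top bin spans a much wider price range than the bottom one. This is why quantile bins suit a skewed column like fare better than equal-width bins.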

In [127]:
age = pd.cut(titanic['age'], [0, 18, 80])
# mean survival by sex and age group (rows) and passenger class (columns)
titanic_fare=titanic.pivot_table('survived', ['sex', age], 'class',aggfunc='mean')
titanic_fare
Out[127]:
class First Second Third
sex age
female (0, 18] 0.909091 1.000000 0.487805
(18, 80] 0.975904 0.903226 0.495050
male (0, 18] 0.750000 0.500000 0.200000
(18, 80] 0.350427 0.086022 0.121622
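
One detail of `pd.cut` matters here: the intervals are closed on the right by default, so (0, 18] does not contain 0. Because ages were converted to integers in In [111], any infant younger than one year now has age 0 and falls outside every bin. The sketch below shows the effect on a made-up Series called `ages`; how many rows this affects in the real data is not shown in the notebook.

```python
import pandas as pd

# Made-up integer ages, including an infant recorded as 0
ages = pd.Series([0, 1, 18, 19, 80])

# default: right-closed intervals (0, 18] and (18, 80], so age 0 gets no bin
default_bins = pd.cut(ages, [0, 18, 80])

# include_lowest=True extends the first interval to cover the left edge
inclusive_bins = pd.cut(ages, [0, 18, 80], include_lowest=True)

print(default_bins.isnull().sum(), inclusive_bins.isnull().sum())
```

Rows whose bin is NaN are silently left out of a pivot table, so passing `include_lowest=True` is one way to keep them.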
In [128]:
titanic_fare.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.xlabel('Sex and age group, by class')
Out[128]:

We see that third class has the lowest survival rate for every age and sex group except males between 18 and 80 years, where second class (8.6%) is lower than third class (12.2%). Men of all age groups have their highest survival rate in first class.

The above graph again shows that first class had the highest chance of survival in almost all of the above groups. The exception is females aged 0 to 18, where second class is highest.

Which age group has more survival chance?

In [129]:
age_groups = pd.cut(titanic['age'], [0,20,40,60,81])
survival_by_age_gender=titanic.pivot_table('survived', ['sex', age_groups])*100
survival_by_age_gender
Out[129]:
survived
sex age
female (0, 20] 68.000000
(20, 40] 75.661376
(40, 60] 75.555556
(60, 81] 100.000000
male (0, 20] 24.489796
(20, 40] 16.577540
(40, 60] 19.753086
(60, 81] 10.526316
In [130]:
survival_by_age_total=titanic.pivot_table('survived', age_groups)*100
survival_by_age_total
Out[130]:
survived
age
(0, 20] 43.352601
(20, 40] 36.412078
(40, 60] 39.682540
(60, 81] 22.727273
In [131]:
age_groups = pd.cut(titanic['age'], [0,20,40,60,81])
titanic.groupby(age_groups).size().plot(kind='bar',stacked=True)
plt.title("Distribution of Age Groups",fontsize=14)
plt.ylabel('Count')
plt.xlabel('Age Group');
In [132]:
p = sns.violinplot(data = titanic, x = 'survived', y = 'age')
p.set(title = 'Survival by Age', 
        xlabel = 'Survival', 
        ylabel = 'Age Distribution', 
        xticklabels = ['Died', 'Survived']);
plt.show()

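The tables above show that the 0 to 20 age group has the highest overall survival rate (43%), followed by 40 to 60 (40%) and 20 to 40 (36%); passengers over 60 have the lowest (23%). The 20 to 40 group is the largest (it also contains the 177 ages filled with the median of 28), so it contributes the most survivors in absolute numbers, but not the highest survival rate.

Question: What is the survival chance for passengers travelling alone?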
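
Survival rate and number of survivors can point in different directions when group sizes differ. The following sketch puts both side by side for the same age bins, on a made-up frame called `sample`.

```python
import pandas as pd

# Made-up data: a small young group and a larger 20-40 group
sample = pd.DataFrame({
    "age": [5, 15, 25, 28, 28, 30, 35, 50, 70],
    "survived": [1, 1, 1, 1, 1, 0, 0, 0, 0],
})

groups = pd.cut(sample["age"], [0, 20, 40, 60, 81])

# group size, number of survivors and survival rate per age bin
by_age = sample.groupby(groups, observed=False)["survived"].agg(
    passengers="count", survivors="sum", rate="mean"
)
print(by_age)
```

In this toy data the (20, 40] bin has the most survivors (3) but a lower rate (0.6) than the (0, 20] bin (1.0), which is the kind of difference the age tables need to be read for.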

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

In [133]:
titanic.pivot_table('survived' ,'alone' ,aggfunc='count')
Out[133]:
survived
alone
False 354
True 537

Note that this table counts passengers, not survivors: 537 passengers travelled alone and 354 travelled with at least one family member.
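
With `aggfunc='count'` the pivot only gives group sizes. To compare survival, the mean of `survived` per group is needed. A minimal sketch on a made-up frame called `sample`:

```python
import pandas as pd

# Made-up data: four passengers travelling alone, two with family
sample = pd.DataFrame({
    "alone": [True, True, True, True, False, False],
    "survived": [0, 0, 1, 0, 1, 0],
})

# count = group size, mean of the 0/1 column = survival rate
by_alone = sample.groupby("alone")["survived"].agg(passengers="count", rate="mean")
print(by_alone)
```

On the real frame, `titanic.pivot_table('survived', 'alone')` would give the same kind of rate, since `pivot_table` averages by default.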

In [134]:
ax = sns.violinplot(x="alone", y="survived", data=titanic)

The above graph shows that passengers travelling alone had a lower chance of survival.

Question: What is the survival chance for children without parents?

In [135]:
titanic['child']=np.where(titanic.who=='child',1,0)
titanic['parents']=np.where(titanic.parch!=0,1,0)
In [136]:
# Draw a nested barplot to show survival for children and adults, with and without parents/children aboard
g = sns.catplot(x="child", y="survived", hue="parents", data=titanic,
                   height=6, kind="bar", palette="muted",
                   )
g.set(title = 'Children Survival with respect to Family');
g.set_axis_labels("Child or Adult", "Survival Probability")
g.set_xticklabels(["Adult", "Child"])
Out[136]:

The above graph shows that children with parents had a higher chance of survival than children without parents or with nannies.

As shown earlier, passengers travelling alone had a lower chance of survival.
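
A bar plot of rates hides how many passengers stand behind each bar, and the group of children without parents may be small. The sketch below shows one way to look at group sizes and rates together, using a made-up frame called `sample` and the same `child` and `parents` flags as above. Note that for adults `parents` = 1 only means `parch` is non-zero, which by the definitions above can also mean children aboard.

```python
import numpy as np
import pandas as pd

# Made-up data with the columns needed to rebuild the two flags
sample = pd.DataFrame({
    "who": ["child", "child", "child", "man", "woman", "man"],
    "parch": [1, 2, 0, 0, 1, 0],
    "survived": [1, 1, 0, 0, 1, 0],
})
sample["child"] = np.where(sample["who"] == "child", 1, 0)
sample["parents"] = np.where(sample["parch"] != 0, 1, 0)

# how many passengers fall into each child/parents combination
group_sizes = pd.crosstab(sample["child"], sample["parents"])

# survival rate for the same combinations
group_rates = sample.pivot_table("survived", "child", "parents")

print(group_sizes)
print(group_rates)
```

If a cell in `group_sizes` holds only a handful of passengers, the matching rate should be read with caution.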

Conclusions
Based on the above calculations we can approximately say that:
1-Females had a higher chance of survival than males
2-First class passengers had a higher chance of survival than the lower classes (economic factor)
3-Passengers who paid higher fares had a higher chance of survival (also an economic factor)
4-The age group between 0 and 20 had the highest survival rate; the 20 to 40 group had the most survivors in absolute numbers
5-Passengers travelling alone had a lower chance of survival than those travelling with companions
6-Children travelling without parents had a lower chance of survival than children travelling with parents

Limitations:

The above findings may not be accurate for several reasons. For example, we are missing a lot of age data (177 entries), which were filled with the median.

Also, we do not know which age range was considered a “child” in those days.

There are also 688 missing entries in the “deck” column, which could affect our finding that Class-1 passengers survived more. We also do not know the locations of these decks; it is possible that the location of a deck mattered more for survival than the class of the passengers.

It is also possible that Class-3 passengers were housed in a location that made it difficult for them to escape, or that the iceberg struck near the location of the Class-3 passengers.

So these conclusions cannot be 100% correct, as there are many factors about which we have no information, and there are many missing values.

 
