# Titanic Survival Data Analysis

Titanic Data Analysis for survival of passengers

**Details**

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Data Description

(from https://www.kaggle.com/c/titanic)

survival: Survival (0 = No; 1 = Yes)

pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

name: Name

sex: Sex

age: Age

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

ticket: Ticket Number

fare: Passenger Fare

cabin: Cabin

embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

**Special Notes:**

Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic

Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)

Parent: Mother or Father of Passenger Aboard Titanic

Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them.

As well, some travelled with very close friends or neighbors in a village; however, the definitions do not support such relations.

**JUPYTER NOTEBOOK**

```
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

```
titanic = sns.load_dataset('titanic')
titanic.head()
```

## Detail of data

```
titanic.describe()
```

# Step 1-Data Cleaning

We will check for NaN or Null values and also check whether any of the columns need to be added or removed

```
# total number of null values in each column
titanic.isnull().sum()
```

We see that there are 177 missing values in “age” and 688 missing values in “deck”. First check the values in “deck”

```
titanic['deck'].value_counts()
```

**There is little use for the “deck” values, as 688 of the 891 entries (about 77%) are missing, so we remove the “deck” field. We also remove the ‘alive’ column, which duplicates ‘survived’**
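The decision to drop a column because of missing data can be made easier to compare across columns by looking at the share of missing values rather than the raw count. The following is a minimal sketch of that idea; the helper name `missing_share` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 4.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 21.08],
})

shares = missing_share(toy)
print(shares)
```

Applied to the real frame, `missing_share(titanic)` would list the columns with the largest gaps first, which makes a cut-off for dropping a column easier to state.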

```
#remove unwanted columns
def remove_columns(df,column):
    # drop the column in place; nothing is returned
    df.drop(columns=column,inplace=True)
remove_columns(titanic,'alive')
remove_columns(titanic,'deck')
```

**Number of Men, Women and Children**

```
titanic['who'].value_counts()
```

```
sns.barplot(data=titanic,x='who',y='survived')
plt.title('Survival Data for Men, Women and Children')
plt.xlabel('Men, Women and Children')
plt.ylabel('Survival Probability')
```

## There are 177 null values in age which can affect our results, so we have to fill them with an appropriate value

```
#first we find the mean
titanic['age'].mean()
```

**The mean is about 29.7, but we can't fill all 177 values with it, as there can be many children among these 177 values. Let us check the median**

```
titanic['age'].median()
```

**The median (28) is close to the mean. We can't fill all 177 values with 28 either, as there can be many children among these 177 values. So now check how many children there are**

```
# the 'who' column tells us whether the entry is a 'man', 'woman' or 'child',
# so we count the number of children here
```

```
def count_values(df,col_name):
    return df[col_name].value_counts()
count_values(titanic,'who')
```

The number of children is 83, so we cannot blindly fill the NaN values with the median (28) or the mean (about 29.7). We can fill the NaN entries with the median if they belong to an adult ‘man’ or ‘woman’. Let us check the NaN entries in titanic[‘age’] corresponding to titanic[‘who’]==’child’

**Checking how many Children have missing values**

```
titanic.loc[titanic['who']=='child'].isnull().sum()
```

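We see that none of the 83 children have NaN values in their age column, so it is safe to replace the missing values of age with the median (28). One caveat: in seaborn's copy of the data the ‘who’ column appears to be derived from ‘age’ (passengers under 16 are labelled ‘child’), so a passenger with a missing age would never be labelled a child, and this check cannot fully rule out children among the 177.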

```
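# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
titanic['age']=titanic['age'].fillna(titanic['age'].median())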
```
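The concern raised above is that one global median may not suit every kind of passenger. An alternative that is not what the notebook does, shown here only as a sketch, is to fill each missing age with the median of the passenger's own group. The toy frame and the column name `age_filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up); 'who' plays the same role as in the notebook
toy = pd.DataFrame({
    "who": ["man", "man", "man", "child", "child", "woman", "woman"],
    "age": [30.0, np.nan, 40.0, 4.0, 8.0, np.nan, 26.0],
})

# Median age of each group, aligned row by row with the original frame
group_median = toy.groupby("who")["age"].transform("median")

# Fill only the missing ages, each with the median of its own group
toy["age_filled"] = toy["age"].fillna(group_median)
print(toy)
```

Whether this is better than the global median depends on how the grouping column was built; if the group label itself depends on age, the approach adds little.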

```
# total number of null values in each column
titanic.isnull().sum()
```

```
#making a new variable for Gender
titanic['Sex_integer']=np.where(titanic.sex=='male',1,0)
```

```
titanic.head()
```

```
# convert ages to integers (astype(int) truncates, so ages below 1 become 0)
titanic['age']=titanic['age'].astype(int)
```

# Observations

## Total Number of male and female

```
titanic['sex'].value_counts()
```

```
sns.countplot(x='sex',data=titanic)
plt.xlabel('Gender')
plt.title('Male and Female Passengers')
```

## Number of adult males, adult females and children

```
titanic_gender_pivot=titanic.pivot_table('sex', 'who',aggfunc='count')
titanic_gender_pivot
```

```
titanic_gender_pivot.plot(kind='bar',figsize=(15,10))
plt.ylabel('Number of persons')
plt.xlabel('Child or Man or Woman')
```

# Age distribution

```
# histplot replaces the deprecated distplot
sns.histplot(titanic['age'],color='red')
plt.xlabel('Age')
plt.title('Age of passengers')
```

**Most of the age distribution is between 20 and 40. The peak at 28 comes from the missing values that were filled with the median**

# Question : Is there a relation between “Survival” and “Gender”?

```
titanic_male_female=titanic.pivot_table('survived','sex',aggfunc='sum')
titanic_male_female
```
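The pivot above sums the 0/1 `survived` flag, which gives the number of survivors per sex rather than a survival rate. The sketch below, on made-up toy data, puts the count of survivors, the number of passengers and the rate side by side; the output column names are chosen here for illustration.

```python
import pandas as pd

# Toy data (made up): four men of whom one survived, two women who both survived
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male", "female", "female"],
    "survived": [0, 0, 0, 1, 1, 1],
})

# sum = number of survivors, count = number of passengers, mean = survival rate
summary = toy.groupby("sex")["survived"].agg(
    survivors="sum", passengers="count", rate="mean"
)
print(summary)
```

Because the groups differ in size, the rate is usually the fairer basis for comparing survival chances.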

```
titanic_male_female.plot(kind='bar',figsize=(10,6))
plt.ylabel('survived')
```

```
p= sns.barplot(x="sex", y="survived", data=titanic)
p.set(title = 'Gender Distribution by Survival',
      xlabel = 'Gender',
      ylabel = 'Whether Survived',
      xticklabels = ['Male', 'Female']);
plt.show()
```

## The above calculation and graph show that females had a higher chance of survival

# Question-2 : Does the economic condition of the passengers play any role in survival?

## We will solve this in two parts

## First : We check the survival related to cabin class (i.e. First, Second & Third class)

## Second : We check the survival related to the fare

```
titanic.head()
```

```
def percentage_survival_by_two_factors(data,factor1,factor2):
    # mean of 'survived' per (factor1, factor2) cell, as a percentage
    return data.pivot_table('survived',factor1,factor2)*100
```

```
# finding survival percentage between class and gender
survival_class_gender=percentage_survival_by_two_factors(titanic,'class','sex')
survival_class_gender
```

**We see that about 97% of females and 37% of males in the first class survived, compared to 50% and about 14% respectively in the third class**
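The helper above relies on `pivot_table` using the mean as its default aggregation, and the mean of a 0/1 column is a survival rate. As a rough sketch on made-up toy data, the same table can be built with `groupby` and `unstack`, which may make it clearer what the pivot computes.

```python
import pandas as pd

# Toy data (made up), using the same column names as the notebook
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3, 1],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Survival percentage per class (rows) and sex (columns) via pivot_table
via_pivot = toy.pivot_table(
    values="survived", index="pclass", columns="sex", aggfunc="mean"
) * 100

# The same numbers via groupby: mean per (pclass, sex), then sex moved to columns
via_groupby = toy.groupby(["pclass", "sex"])["survived"].mean().unstack() * 100

print(via_pivot)
```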

```
# factorplot was renamed to catplot in newer seaborn releases
g=sns.catplot(x='sex',
              y='survived',hue='pclass' , kind='bar',data=titanic)
# Fix up the labels
g.set(xlabel='gender',ylabel='survived', title='Gender, Class and Survival')
plt.show()
```

```
g = sns.catplot(x="survived", col="pclass", col_wrap=4,
                data=titanic,
                kind="count", height=4.5, aspect=.8)
g.set_axis_labels("", "Count")
g.set_xticklabels(["Died", "Alive"])
```

```
# kind='point' reproduces the old factorplot default
g= sns.catplot(x='pclass',y='survived',data=titanic,kind='point')
g.set_axis_labels("Passenger Class", "Survival Probability")
g.set_xticklabels(["Class1", "Class2", "Class3"])
g.set(ylim=(0, 1))
```

**The above plots and calculations show that First Class had the highest chance of survival, followed by Second Class. Third Class had the lowest chance of survival**

Both males and females had a higher chance of survival in the upper class, as the data shows.

All of the above plots show that a higher cabin class meant a higher chance of survival

## Second : Now we check the survival against the fares

```
fare = pd.qcut(titanic['fare'], 4)
tt=titanic.pivot_table('survived',['sex',fare])
tt.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.show()
```
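The block above uses `pd.qcut` to split the fares into four groups holding roughly the same number of passengers, as opposed to `pd.cut`, which splits on fixed edges. A small sketch on made-up fares shows the effect; the labels `Q1` to `Q4` are chosen here for readability.

```python
import pandas as pd

# Toy fares (made up), deliberately skewed by one very expensive ticket
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.5, 52.0, 71.28, 512.33])

# Quantile-based bins: each quartile receives the same number of passengers
quartile = pd.qcut(fares, 4, labels=["Q1", "Q2", "Q3", "Q4"])
counts = quartile.value_counts().sort_index()
print(counts)
```

With skewed data such as fares, quantile bins keep every group populated, whereas equal-width bins would put almost everyone in the lowest one.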

The above graph shows that people paying higher fares had a higher chance of survival. This is likely because they were in the upper class; we already showed that the upper class had a higher chance of survival

**I will divide age into two groups, i.e. 0-18 and 18-80, for females and males**

```
# include_lowest=True keeps ages recorded as 0 (infants, after the integer conversion) in the first bin
age = pd.cut(titanic['age'], [0, 18, 80], include_lowest=True)
titanic_fare=titanic.pivot_table('survived', ['sex', age], 'class',aggfunc='mean')
titanic_fare
```
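The age grouping above depends on how `pd.cut` treats the bin edges: by default every bin is open on the left, so a value equal to the lowest edge falls outside all bins. The sketch below, on a few made-up ages, compares the default with `include_lowest=True`.

```python
import pandas as pd

# Toy ages (made up); 0 stands for an infant whose fractional age was truncated
ages = pd.Series([0, 5, 18, 19, 80])

# Default: bins are (0, 18] and (18, 80], so age 0 is not assigned to any bin
default_bins = pd.cut(ages, [0, 18, 80])

# include_lowest=True closes the first bin on the left, so age 0 is kept
closed_bins = pd.cut(ages, [0, 18, 80], include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_bins, "closed": closed_bins}))
```

Rows that fall outside every bin become NaN and are silently left out of any pivot table built on the grouping.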

```
titanic_fare.plot(kind='bar',figsize=(15,10))
plt.ylabel('survival probability')
plt.xlabel('Sex and age group')
```

We see that the chance of survival in Class-3 is the lowest for all age and sex groups, except for males between 18 and 80 years, where Class-2 survival is lower than Class-3. Men of all age groups have their highest survival in the upper class. Women of

**The above graph again shows that first class had the highest chance of survival in all of the above age groups**

# Which age group has more survival chance?

```
age_groups = pd.cut(titanic['age'], [0,20,40,60,81], include_lowest=True)
survival_by_age_gender=titanic.pivot_table('survived', ['sex', age_groups])*100
survival_by_age_gender
```

```
survival_by_age_total=titanic.pivot_table('survived', age_groups)*100
survival_by_age_total
```

```
age_groups = pd.cut(titanic['age'], [0,20,40,60,81], include_lowest=True)
titanic.groupby(age_groups).size().plot(kind='bar',stacked=True)
plt.title("Distribution of Age Groups",fontsize=14)
plt.ylabel('Count')
plt.xlabel('Age Group');
```

```
p = sns.violinplot(data = titanic, x = 'survived', y = 'age')
p.set(title = 'Survival by Age',
      xlabel = 'Survival',
      ylabel = 'Age Distribution',
      xticklabels = ['Died', 'Survived']);
plt.show()
```

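**The above observations show that most survivors were between 20 and 40, which is also by far the largest age group. The violin plot shows where survivors' ages concentrate, not a survival rate, so the percentage tables above are the better guide to each group's chance of survival**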
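A group can contribute the most survivors simply because it is the largest, so it helps to read the group size, the number of survivors and the rate together. The following sketch uses made-up toy data and illustrative column names to show the three side by side per age group.

```python
import pandas as pd

# Toy data (made up): the 20-40 group is the largest but not the safest
toy = pd.DataFrame({
    "age": [2, 10, 25, 28, 28, 30, 35, 38, 45, 70],
    "survived": [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
})

age_groups = pd.cut(toy["age"], [0, 20, 40, 60, 81], include_lowest=True)

# passengers = group size, survivors = count of 1s, rate = share of 1s
summary = toy.groupby(age_groups, observed=False)["survived"].agg(
    passengers="count", survivors="sum", rate="mean"
)
print(summary)
```

In this toy example the 20-40 group has the most survivors while the youngest group has the highest rate, which is the distinction the real tables should be checked for.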

# Question : What is the survival chance for passengers travelling alone?

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

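```
# share of survivors among passengers travelling alone (True) and with company (False)
titanic.pivot_table('survived' ,'alone' ,aggfunc='mean')
```

**So a slight majority (about 51%) of passengers with a companion survived, compared with about 30% of those travelling alone**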

```
ax = sns.violinplot(x="alone", y="survived", data=titanic)
```

**The above graph shows that passengers travelling alone had a lower chance of survival**
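The `alone` flag can be related back to the `sibsp` and `parch` columns listed at the start of this question. The sketch below assumes that "alone" means no siblings/spouses and no parents/children aboard; that definition and the toy data are assumptions for illustration, not something the notebook states.

```python
import pandas as pd

# Toy data (made up), using the notebook's column names
toy = pd.DataFrame({
    "sibsp": [0, 1, 0, 0, 2],
    "parch": [0, 0, 2, 0, 1],
    "survived": [0, 1, 1, 0, 1],
})

# Assumed definition: a passenger is alone when both family counts are zero
toy["alone"] = (toy["sibsp"] + toy["parch"]) == 0

# Survival rate for passengers with company (False) and alone (True)
rate_by_alone = toy.groupby("alone")["survived"].mean()
print(rate_by_alone)
```

Note that under this definition a passenger travelling with friends or a nanny still counts as alone, in line with the special notes in the data description.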

# Question : What is survival chance for children without parents?

```
titanic['child']=np.where(titanic.who=='child',1,0)
titanic['parents']=np.where(titanic.parch!=0,1,0)
```

```
# Draw a nested barplot to show survival for children and adults, split by whether parch is non-zero
g = sns.catplot(x="child", y="survived", hue="parents", data=titanic,
                height=6, kind="bar", palette="muted")
g.set(title = 'Children Survival with respect to Family')
g.set_axis_labels("Child or Adult", "Survival Probability")
g.set_xticklabels(["Adult", "Child"])
```

**The above graph shows that children with parents had more chance of survival compared to children without parents or with nannies**
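Since the groups behind this comparison can be small, it may be worth checking how many passengers fall into each combination before reading much into the bars. The sketch below builds the same two flags on made-up toy data and tabulates both the group sizes and the survival rates. One point to keep in mind: `parch` counts parents and children together, so for adults the `parents` flag can also mean travelling with a child.

```python
import numpy as np
import pandas as pd

# Toy data (made up), using the notebook's column names
toy = pd.DataFrame({
    "who": ["child", "child", "man", "woman", "child"],
    "parch": [1, 0, 0, 2, 2],
    "survived": [1, 0, 0, 1, 1],
})

# Same flags as in the notebook: 1 for a child, 1 for a non-zero parch
toy["child"] = np.where(toy["who"] == "child", 1, 0)
toy["parents"] = np.where(toy["parch"] != 0, 1, 0)

# Number of passengers in each child/parents combination
sizes = pd.crosstab(toy["child"], toy["parents"])

# Survival rate in each combination
rates = toy.pivot_table(values="survived", index="child", columns="parents", aggfunc="mean")

print(sizes)
print(rates)
```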

**The above calculations show that passengers travelling alone had a lower chance of survival**

##### Conclusions

##### Based on the above calculations we can approximately say that:

##### 1-Females had a higher chance of survival than males

##### 2-First class passengers had a higher chance of survival than the lower classes (economic factor)

##### 3-More passengers who paid higher fares survived (also an economic factor)

##### 4-The age group between 20 and 40 accounted for the most survivors, being also the largest group

##### 5-Passengers travelling alone had a lower chance of survival than those travelling with companions

##### 6-Children travelling without parents had a lower chance of survival than children with parents

#### Limitations:

The above findings cannot be fully accurate for several reasons. First, a lot of age data is missing (177 entries).

We also do not know which age category was considered a “child” in those days.

There are also 688 missing entries in the “deck” column, which can affect our finding that Class-1 passengers survived more. We also do not know the locations of these decks; it is possible that the location of a deck mattered more for survival than the class of the passengers.

It is also possible that Class-3 passengers were housed in a location that made it difficult for them to survive, or that the iceberg struck near the location of the Class-3 passengers.

So these conclusions cannot be 100% correct, as there are many factors about which we have no information and because there are so many missing values.
