Reference

Titanic Survivor Prediction

Predict which passengers survived the sinking of the Titanic, using the information recorded about each passenger.

Data Dictionary

Variable   Definition                                    Key
Survived   Survival                                      0 = No, 1 = Yes
Pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
Sex        Sex
Age        Age in years
SibSp      # of siblings / spouses aboard the Titanic
Parch      # of parents / children aboard the Titanic
Ticket     Ticket number
Fare       Passenger fare
Cabin      Cabin number
Embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Stage I

import

import numpy as np
import pandas as pd

code

import os

print(os.getcwd())
C:\Users\godgk\Desktop\Project\kaggle\Titanic
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
print('train data shape', train.shape)
print('test data shape', test.shape)
print('-----[train information]-----')
print(train.info())
print('-----[test information]-----')
print(test.info())
train data shape (891, 12)
test data shape (418, 11)
-----[train information]-----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
-----[test information]-----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
None
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
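The raw null counts above are easier to compare across the two files when expressed as a share of the rows. A minimal sketch, assuming a hypothetical helper `missing_summary` (not part of the original notebook) and a toy frame standing in for the Kaggle CSVs:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps, worst first
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Toy stand-in; in the notebook you would pass `train` or `test`
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(demo))
```

On the real training set the same calculation puts Cabin at 687 of 891 rows missing (about 77%), which is consistent with the column being dropped later on.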

Stage II

import

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
def pie_chart(feature):
    # Share of each category in the whole training set
    feature_ratio = train[feature].value_counts(sort=False)
    feature_size = feature_ratio.size
    feature_index = feature_ratio.index
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()

    plt.pie(feature_ratio, labels=feature_index, autopct='%1.1f%%')
    plt.title(feature + '\'s ratio in total')
    plt.show()

    # One pie per category: survived vs. dead (0 if a category has no rows in a group)
    for i, index in enumerate(feature_index):
        plt.subplot(1, feature_size + 1, i + 1, aspect='equal')
        plt.pie([survived.get(index, 0), dead.get(index, 0)], labels=['Survived', 'Dead'], autopct='%1.1f%%')
        plt.title(str(index) + '\'s ratio')

    plt.show()
pie_chart("Sex")
  • There are more male passengers than female passengers.

  • Female passengers have a higher survival rate than male passengers.

pie_chart("Pclass")
  • The survival rate is highest in 1st class, followed by 2nd class and then 3rd class.
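The ordering read off the pie charts can also be checked numerically. Because `Survived` is coded 0/1, its mean per category is the survival ratio. A small sketch, assuming a hypothetical helper `survival_rate` and toy data with the same column names as train.csv:

```python
import pandas as pd

def survival_rate(df, feature):
    """Mean of Survived per category, i.e. the survival ratio (hypothetical helper)."""
    return df.groupby(feature)['Survived'].mean().sort_values(ascending=False)

# Toy stand-in; the values are made up for illustration only
demo = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})
rates = survival_rate(demo, 'Pclass')
print(rates)
```

In the notebook the call would be `survival_rate(train, 'Pclass')` or `survival_rate(train, 'Sex')`.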
pie_chart("Embarked")
train['Ticket'][0:50]
0            A/5 21171
1             PC 17599
2     STON/O2. 3101282
3               113803
4               373450
5               330877
6                17463
7               349909
8               347742
9               237736
10             PP 9549
11              113783
12           A/5. 2151
13              347082
14              350406
15              248706
16              382652
17              244373
18              345763
19                2649
20              239865
21              248698
22              330923
23              113788
24              349909
25              347077
26                2631
27               19950
28              330959
29              349216
30            PC 17601
31            PC 17569
32              335677
33          C.A. 24579
34            PC 17604
35              113789
36                2677
37          A./5. 2152
38              345764
39                2651
40                7546
41               11668
42              349253
43       SC/Paris 2123
44              330958
45     S.C./A.4. 23567
46              370371
47               14311
48                2662
49              349237
Name: Ticket, dtype: object
train.Ticket
0             A/5 21171
1              PC 17599
2      STON/O2. 3101282
3                113803
4                373450
             ...       
886              211536
887              112053
888          W./C. 6607
889              111369
890              370376
Name: Ticket, Length: 891, dtype: object

Stage III

def bar_chart(feature):
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10,5))
bar_chart("SibSp")
bar_chart("Parch")

Data Preprocessing

train_and_test = [train, test]

Name Feature

for dataset in train_and_test:
    # Raw string avoids an invalid-escape warning; expand=False returns a Series
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Miss
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S Mrs
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S Mr
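To see what the pattern ` ([A-Za-z]+)\.` picks out, it helps to run it on a few names in isolation. The sketch below uses three names taken from the rows shown above; the single capture group matches a run of letters that follows a space and is immediately followed by a literal dot.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])
# expand=False returns a Series rather than a one-column DataFrame
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

Only the first match per name is returned, so later dots in a name do not affect the extracted title.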
pd.crosstab(train['Title'], train['Sex'])
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
for dataset in train_and_test:
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Countess', 'Don','Dona', 'Dr',
                                                 'Jonkheer','Lady','Major', 'Rev', 'Sir'], 'Other')
    dataset['Title'] = dataset['Title'].replace(['Mlle', 'Ms'], 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs') 
pd.crosstab(train['Title'], train['Sex'])
Sex female male
Title
Master 0 40
Miss 185 0
Mr 0 517
Mrs 126 0
Other 3 20
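The explicit list of rare titles works for the values seen here, but a title that appears only in some other file would slip through unchanged. One alternative, sketched below as an assumption rather than the notebook's method, is to name the common titles and send everything else to 'Other':

```python
import pandas as pd

titles = pd.Series(['Mr', 'Mlle', 'Ms', 'Mme', 'Dr', 'Rev', 'Master', 'Miss', 'Mrs'])
synonyms = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
common = {'Mr', 'Mrs', 'Miss', 'Master'}

# Normalise synonyms first, then collapse every remaining uncommon title
normalised = titles.replace(synonyms)
grouped = normalised.where(normalised.isin(common), 'Other')
print(grouped.tolist())
```

The names `synonyms` and `common` are illustrative; the groupings themselves mirror the replacements in the loop above.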
train[['Title', 'Survived']].groupby('Title').mean()
Survived
Title
Master 0.575000
Miss 0.702703
Mr 0.156673
Mrs 0.793651
Other 0.347826
train[['Title', 'Survived']].groupby('Title', as_index = False).mean()

# With as_index=True (the default), Title becomes the index; as_index=False keeps it as a regular column.
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Other 0.347826
for dataset in train_and_test:
    dataset['Title'] = dataset['Title'].astype(str)

Sex Feature

for dataset in train_and_test:
    dataset['Sex'] = dataset['Sex'].astype(str)

Embarked Feature

train.Embarked.value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
for dataset in train_and_test:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].astype(str)
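The fill value 'S' is the most frequent port in the counts above. A minimal sketch of deriving that value from the data instead of hard-coding it, using a toy Series in place of `train['Embarked']`:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S', np.nan])
# mode() ignores NaN by default and returns a Series; take its first entry
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.isna().sum())
```

In the notebook it would be reasonable to compute the mode on `train` only and reuse it for `test`, so both files are filled with the same value.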

Age Feature

Binning

train.Age.isna().sum()
177
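for dataset in train_and_test:
    # Fill missing ages with the mean age of the same dataset, then truncate to int
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
    dataset['Age'] = dataset['Age'].astype(int)
# AgeBand only inspects the training ages, so compute it once, outside the loop
train['AgeBand'] = pd.cut(train['Age'], 5)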
train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean()
AgeBand Survived
0 (-0.08, 16.0] 0.550000
1 (16.0, 32.0] 0.344762
2 (32.0, 48.0] 0.403226
3 (48.0, 64.0] 0.434783
4 (64.0, 80.0] 0.090909
for dataset in train_and_test:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    
    dataset['Age'] = dataset['Age'].map( { 0:'Child', 1:'Young', 2:'Middle', 3:'Prime', 4:'Old' } ).astype(str)
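The five `.loc` assignments followed by `.map` can be expressed in one step with `pd.cut` and explicit labels. This is a sketch of an equivalent formulation, not the notebook's own code; the bin edges are the same right-closed boundaries used in the loop above.

```python
import numpy as np
import pandas as pd

age = pd.Series([4, 16, 17, 32, 40, 48, 60, 64, 80])
# Right-closed intervals: (..16], (16..32], (32..48], (48..64], (64..)
bins = [-np.inf, 16, 32, 48, 64, np.inf]
labels = ['Child', 'Young', 'Middle', 'Prime', 'Old']
age_group = pd.cut(age, bins=bins, labels=labels).astype(str)
print(age_group.tolist())
```

Keeping the edges and labels in two lists makes it harder for a boundary and its label to drift apart when the bands are changed.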

Fare Feature

for dataset in train_and_test:
    print(dataset['Fare'].isna().sum())
0
1
train[['Pclass', 'Fare']].groupby(['Pclass'], as_index=False).mean()
Pclass Fare
0 1 84.154687
1 2 20.662183
2 3 13.675550
test[test['Fare'].isna()]['Pclass']
152    3
Name: Pclass, dtype: int64
for dataset in train_and_test:
    dataset['Fare'] = dataset['Fare'].fillna(13.675) # mean Fare of Pclass 3 passengers
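The constant 13.675 works because the only missing fare belongs to a 3rd-class passenger. A more general sketch, assuming a toy frame and filling each gap with the mean fare of that row's own class:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Fare':   [80.0, 90.0, 10.0, 14.0, np.nan],
})
# Mean Fare of each row's Pclass, aligned to the original index (NaN is skipped)
class_mean = df.groupby('Pclass')['Fare'].transform('mean')
df['Fare'] = df['Fare'].fillna(class_mean)
print(df)
```

Note that this sketch computes the class means within a single frame, whereas the notebook fills the test set with a mean taken from the training set.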
train['FareBand'] = pd.qcut(train['Fare'], 5)

for dataset in train_and_test:
    dataset.loc[ dataset['Fare'] <= 7.854, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.854) & (dataset['Fare'] <= 10.5), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 10.5) & (dataset['Fare'] <= 21.679), 'Fare']   = 2
    dataset.loc[(dataset['Fare'] > 21.679) & (dataset['Fare'] <= 39.688), 'Fare']   = 3
    dataset.loc[ dataset['Fare'] > 39.688, 'Fare'] = 4
    
    dataset['Fare'] = dataset['Fare'].map( { 0:'XS', 1:'S', 2:'M', 3:'L', 4:'XL' } ).astype(str)
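The thresholds in the loop above (7.854, 10.5, 21.679, 39.688) were presumably read off the `FareBand` intervals produced by `pd.qcut`. A sketch of doing the same thing without copying numbers by hand: ask `qcut` for its edges with `retbins=True`, then reuse them on the test fares. The toy fares below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy fares; the notebook would use train['Fare'] and test['Fare']
train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 13.0, 20.0, 26.0, 35.0, 50.0, 80.0])
test_fare = pd.Series([6.0, 12.0, 500.0])
labels = ['XS', 'S', 'M', 'L', 'XL']

# retbins=True also returns the quantile edges computed on the training fares
train_band, edges = pd.qcut(train_fare, 5, labels=labels, retbins=True)
# Open the outer edges so test fares outside the training range still get a band
edges[0], edges[-1] = -np.inf, np.inf
test_band = pd.cut(test_fare, bins=edges, labels=labels)
print(train_band.tolist())
print(test_band.tolist())
```

Deriving the edges from the training data only keeps the test set from influencing the binning.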

SibSp & Parch Feature (Family)

for dataset in train_and_test:
    dataset['Family'] = dataset['Parch'] + dataset['SibSp']
    dataset['Family'] = dataset['Family'].astype(int)

Other Feature

features_drop = ['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']
train = train.drop(features_drop, axis = 1)
test = test.drop(features_drop, axis = 1)
train = train.drop(['PassengerId', 'AgeBand', 'FareBand'], axis = 1)
train.head()
Survived Pclass Sex Age Fare Embarked Title Family
0 0 3 male Young XS S Mr 1
1 1 1 female Middle XL C Mrs 1
2 1 3 female Young S S Miss 0
3 1 1 female Middle XL S Mrs 1
4 0 3 male Middle S S Mr 0
test.head()
PassengerId Pclass Sex Age Fare Embarked Title Family
0 892 3 male Middle XS Q Mr 0
1 893 3 female Middle XS S Mrs 1
2 894 2 male Prime S Q Mr 0
3 895 3 male Young S S Mr 0
4 896 3 female Young M S Mrs 2
train = pd.get_dummies(train)
test = pd.get_dummies(test)
train_label = train['Survived']
train_data = train.drop('Survived', axis = 1)
test_data = test.drop('PassengerId', axis = 1).copy()
print(train_data.shape, train_label.shape, test_data.shape)
(891, 22) (891,) (418, 22)
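Here both frames end up with 22 feature columns, so the model can be fit on one and applied to the other. That is not guaranteed when `pd.get_dummies` is called on each frame separately: a category absent from one file yields no column there. A defensive sketch on toy data, reindexing the test dummies to the training columns:

```python
import pandas as pd

train_df = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Other'], 'Family': [1, 0, 2]})
test_df = pd.DataFrame({'Title': ['Mr', 'Mrs'], 'Family': [0, 3]})  # no 'Other' here

train_X = pd.get_dummies(train_df)
# Missing categories become all-zero columns; extra test-only columns are dropped
test_X = pd.get_dummies(test_df).reindex(columns=train_X.columns, fill_value=0)
print(list(test_X.columns))
```

The frame names `train_df`, `test_df`, `train_X` and `test_X` are placeholders for this example only.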
train
Survived Pclass Family Sex_female Sex_male Age_Child Age_Middle Age_Old Age_Prime Age_Young ... Fare_XL Fare_XS Embarked_C Embarked_Q Embarked_S Title_Master Title_Miss Title_Mr Title_Mrs Title_Other
0 0 3 1 0 1 0 0 0 0 1 ... 0 1 0 0 1 0 0 1 0 0
1 1 1 1 1 0 0 1 0 0 0 ... 1 0 1 0 0 0 0 0 1 0
2 1 3 0 1 0 0 0 0 0 1 ... 0 0 0 0 1 0 1 0 0 0
3 1 1 1 1 0 0 1 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
4 0 3 0 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 0 0 1 0 0 0 0 1 ... 0 0 0 0 1 0 0 0 0 1
887 1 1 0 1 0 0 0 0 0 1 ... 0 0 0 0 1 0 1 0 0 0
888 0 3 3 1 0 0 0 0 0 1 ... 0 0 0 0 1 0 1 0 0 0
889 1 1 0 0 1 0 0 0 0 1 ... 0 0 1 0 0 0 0 1 0 0
890 0 3 0 0 1 0 0 0 0 1 ... 0 1 0 1 0 0 0 1 0 0

891 rows × 23 columns

test
PassengerId Pclass Family Sex_female Sex_male Age_Child Age_Middle Age_Old Age_Prime Age_Young ... Fare_XL Fare_XS Embarked_C Embarked_Q Embarked_S Title_Master Title_Miss Title_Mr Title_Mrs Title_Other
0 892 3 0 0 1 0 1 0 0 0 ... 0 1 0 1 0 0 0 1 0 0
1 893 3 1 1 0 0 1 0 0 0 ... 0 1 0 0 1 0 0 0 1 0
2 894 2 0 0 1 0 0 0 1 0 ... 0 0 0 1 0 0 0 1 0 0
3 895 3 0 0 1 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
4 896 3 2 1 0 0 0 0 0 1 ... 0 0 0 0 1 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
413 1305 3 0 0 1 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
414 1306 1 0 1 0 0 1 0 0 0 ... 1 0 1 0 0 0 0 0 0 1
415 1307 3 0 0 1 0 1 0 0 0 ... 0 1 0 0 1 0 0 1 0 0
416 1308 3 0 0 1 0 0 0 0 1 ... 0 0 0 0 1 0 0 1 0 0
417 1309 3 2 0 1 0 0 0 0 1 ... 0 0 1 0 0 1 0 0 0 0

418 rows × 23 columns

Learning

import

!pip install scikit-learn
Requirement already satisfied: scikit-learn in c:\users\godgk\anaconda3\envs\py39r40\lib\site-packages (1.0.2)
Requirement already satisfied: numpy>=1.14.6 in c:\users\godgk\anaconda3\envs\py39r40\lib\site-packages (from scikit-learn) (1.20.3)
Requirement already satisfied: scipy>=1.1.0 in c:\users\godgk\anaconda3\envs\py39r40\lib\site-packages (from scikit-learn) (1.7.1)
Requirement already satisfied: joblib>=0.11 in c:\users\godgk\anaconda3\envs\py39r40\lib\site-packages (from scikit-learn) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\godgk\anaconda3\envs\py39r40\lib\site-packages (from scikit-learn) (3.1.0)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.utils import shuffle
train_data, train_label = shuffle(train_data, train_label, random_state = 5)
# Named fit_and_predict so it does not shadow the train_and_test list defined earlier
def fit_and_predict(model):
    # Fit on the full training set and predict the test set
    model.fit(train_data, train_label)
    prediction = model.predict(test_data)
    # Scored on the same rows the model was fit on, so this is training accuracy
    accuracy = round(model.score(train_data, train_label) * 100, 2)
    print("Training accuracy : ", accuracy, "%")
    return prediction
# Logistic Regression
log_pred = fit_and_predict(LogisticRegression())
# SVM
svm_pred = fit_and_predict(SVC())
# kNN
knn_pred_4 = fit_and_predict(KNeighborsClassifier(n_neighbors = 4))
# Random Forest
rf_pred = fit_and_predict(RandomForestClassifier(n_estimators=100))
# Naive Bayes
nb_pred = fit_and_predict(GaussianNB())
Training accuracy :  82.27 %
Training accuracy :  83.61 %
Training accuracy :  84.74 %
Training accuracy :  88.55 %
Training accuracy :  79.35 %
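The figures above are computed on the same rows the models were fit on, so they tend to flatter flexible models such as random forests. A rough sketch of estimating held-out accuracy with k-fold cross-validation, using synthetic data as a stand-in for `train_data` and `train_label` (the sample sizes and seeds are arbitrary assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (train_data, train_label)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
train_acc = model.fit(X, y).score(X, y)          # scored on the rows it was fit on
cv_scores = cross_val_score(model, X, y, cv=5)   # each fold scored on held-out rows

print("training accuracy:", round(train_acc * 100, 2))
print("5-fold CV accuracy:", round(cv_scores.mean() * 100, 2))
```

Comparing models by their cross-validated score is usually a safer basis for choosing which predictions to submit.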
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": rf_pred
})

submission.to_csv('submission_rf.csv', index=False)