Titanic Survivor Prediction
The sinking of the Titanic
Predict which passengers survived, using the information recorded about the people on board at the time.
Variable | Definition | Key |
---|---|---|
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Sex | Sex | |
Age | Age in years | |
SibSp | # of siblings / spouses aboard the Titanic | |
Parch | # of parents / children aboard the Titanic | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).
parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
import numpy as np
import pandas as pd
import os
print(os.getcwd())
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
train.head()
print('train data shape', train.shape)
print('test data shape', test.shape)
print('-----[train information]-----')
train.info()  # info() prints its report directly and returns None
print('-----[test information]-----')
test.info()
train.isnull().sum()
test.isnull().sum()
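The raw counts from `isnull().sum()` are easier to compare once they are shown next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the dataset); the same expression could be applied to `train` or `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train/test; values are made up.
toy = pd.DataFrame({
    "Age": [22, np.nan, 30, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

# Missing count per column, plus the percentage of rows that are missing.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})
print(summary)
```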
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
def pie_chart(feature):
    # Share of each category of the feature among all passengers
    feature_ratio = train[feature].value_counts(sort=False)
    feature_size = feature_ratio.size
    feature_index = feature_ratio.index
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()

    plt.pie(feature_ratio, labels=feature_index, autopct='%1.1f%%')
    plt.title(feature + "'s ratio in total")
    plt.show()

    # One pie per category: survived vs. dead within that category
    for i, index in enumerate(feature_index):
        plt.subplot(1, feature_size + 1, i + 1, aspect='equal')
        # .get(index, 0) avoids a KeyError when a category has no survivors or no deaths
        plt.pie([survived.get(index, 0), dead.get(index, 0)], labels=['Survived', 'Dead'], autopct='%1.1f%%')
        plt.title(str(index) + "'s ratio")

    plt.show()
pie_chart("Sex")
- There are more male passengers than female passengers.
- Female passengers have a higher survival rate than male passengers.
pie_chart("Pclass")
- The survival rate is highest in 1st class, followed by 2nd class, then 3rd class.
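The observations read off the pie charts can also be checked numerically: the mean of `Survived` within each group is that group's survival rate. This is a minimal sketch on made-up rows (the `toy` frame and the `survival_rate` helper are hypothetical names introduced here); calling it as `survival_rate(train, "Pclass")` would be the analogous check on the real data.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

def survival_rate(df, feature):
    """Return the mean of Survived (the survival rate) per value of feature."""
    return df.groupby(feature)["Survived"].mean()

rates = survival_rate(toy, "Sex")
print(rates)
```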
pie_chart("Embarked")
train['Ticket'][0:50]
train.Ticket
def bar_chart(feature):
    # Stacked bars: counts per category, split into survived and dead
    survived = train[train['Survived'] == 1][feature].value_counts()
    dead = train[train['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10,5))
bar_chart("SibSp")
bar_chart("Parch")
train_and_test = [train, test]
for dataset in train_and_test:
    # Raw string avoids an invalid-escape warning; expand=False returns a Series
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
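The regular expression above captures the run of letters that sits between a space and a period, which is where the title appears in the `Name` column. A small sketch on a few sample strings in that name format shows what it extracts (the `names` and `titles` variables are illustrative only):

```python
import pandas as pd

# Sample strings in the "Surname, Title. Given names" format of the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# The capture group is the letters between a space and the first following period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```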
train.head()
pd.crosstab(train['Title'], train['Sex'])
for dataset in train_and_test:
    # Group rare titles together and normalize spelling variants
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Countess', 'Don', 'Dona', 'Dr',
                                                 'Jonkheer', 'Lady', 'Major', 'Rev', 'Sir'], 'Other')
    dataset['Title'] = dataset['Title'].replace(['Mlle', 'Ms'], 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
pd.crosstab(train['Title'], train['Sex'])
train[['Title', 'Survived']].groupby('Title').mean()
train[['Title', 'Survived']].groupby('Title', as_index = False).mean()
# With as_index=True (the default), Title becomes the index of the result.
for dataset in train_and_test:
    dataset['Title'] = dataset['Title'].astype(str)
for dataset in train_and_test:
    dataset['Sex'] = dataset['Sex'].astype(str)
train.Embarked.value_counts(dropna=False)
for dataset in train_and_test:
    # 'S' is used as the fill value for missing ports of embarkation
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    dataset['Embarked'] = dataset['Embarked'].astype(str)
train.Age.isna().sum()
for dataset in train_and_test:
    # Assign the result back; inplace=True on a selected column is unreliable in recent pandas
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())
    dataset['Age'] = dataset['Age'].astype(int)
train['AgeBand'] = pd.cut(train['Age'], 5)
train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean()
for dataset in train_and_test:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    dataset['Age'] = dataset['Age'].map( { 0:'Child', 1:'Young', 2:'Middle', 3:'Prime', 4:'Old' } ).astype(str)
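The chain of `.loc` assignments maps ages to five bands with edges at 16, 32, 48 and 64. As an alternative sketch, not the approach used in this notebook, the same right-closed bands can be expressed in one call to `pd.cut` with explicit edges and labels (the `ages`, `age_bins`, `age_labels` and `age_group` names are introduced here for illustration):

```python
import pandas as pd

# Made-up ages chosen to sit on and around the band edges.
ages = pd.Series([2, 16, 17, 32, 40, 48, 60, 64, 70])

age_labels = ["Child", "Young", "Middle", "Prime", "Old"]
# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_bins = [-float("inf"), 16, 32, 48, 64, float("inf")]

age_group = pd.cut(ages, bins=age_bins, labels=age_labels).astype(str)
print(age_group.tolist())
```

Keeping the edges in one list makes it harder for the thresholds and the label mapping to drift apart.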
for dataset in train_and_test:
    print(dataset['Fare'].isna().sum())
train[['Pclass', 'Fare']].groupby(['Pclass'], as_index=False).mean()
test[test['Fare'].isna()]['Pclass']
for dataset in train_and_test:
    dataset['Fare'] = dataset['Fare'].fillna(13.675) # mean Fare of passengers with Pclass 3
train['FareBand'] = pd.qcut(train['Fare'], 5)
for dataset in train_and_test:
    dataset.loc[ dataset['Fare'] <= 7.854, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.854) & (dataset['Fare'] <= 10.5), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 10.5) & (dataset['Fare'] <= 21.679), 'Fare'] = 2
    dataset.loc[(dataset['Fare'] > 21.679) & (dataset['Fare'] <= 39.688), 'Fare'] = 3
    dataset.loc[ dataset['Fare'] > 39.688, 'Fare'] = 4
    dataset['Fare'] = dataset['Fare'].map( { 0:'XS', 1:'S', 2:'M', 3:'L', 4:'XL' } ).astype(str)
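The fare thresholds above are typed in by hand from the quantile bands. A possible variation, sketched here on made-up fares rather than the real data, is to let `pd.qcut` return the edges computed on the training fares and reuse them for the test fares with `pd.cut`. All variable names in this block are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up fares; the real notebook would use train['Fare'] and test['Fare'].
train_fare = pd.Series([5.0, 7.25, 8.05, 10.5, 13.0, 26.0, 30.0, 53.1, 71.28, 80.0])
test_fare = pd.Series([3.0, 9.0, 100.0])
fare_labels = ["XS", "S", "M", "L", "XL"]

# retbins=True also returns the quantile edges computed on the training fares.
_, edges = pd.qcut(train_fare, 5, retbins=True)

# Open the outer edges so test fares outside the training range still get a band.
edges[0], edges[-1] = -np.inf, np.inf

test_band = pd.cut(test_fare, bins=edges, labels=fare_labels).astype(str)
print(test_band.tolist())
```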
for dataset in train_and_test:
    # Family size aboard, excluding the passenger: parents/children plus siblings/spouses
    dataset['Family'] = dataset['Parch'] + dataset['SibSp']
    dataset['Family'] = dataset['Family'].astype(int)
features_drop = ['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']
train = train.drop(features_drop, axis = 1)
test = test.drop(features_drop, axis = 1)
train = train.drop(['PassengerId', 'AgeBand', 'FareBand'], axis = 1)
train.head()
test.head()
train = pd.get_dummies(train)
test = pd.get_dummies(test)
train_label = train['Survived']
train_data = train.drop('Survived', axis = 1)
test_data = test.drop('PassengerId', axis = 1).copy()
print(train_data.shape, train_label.shape, test_data.shape)
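`pd.get_dummies` is applied to `train` and `test` separately, so a category that occurs in only one of them would produce different columns and the model could not predict on the test frame. The sketch below, on hypothetical toy frames, shows one way to guard against that by reindexing the encoded test columns to the encoded train columns; `toy_train`, `toy_test`, `train_enc` and `test_enc` are names assumed for this example.

```python
import pandas as pd

# Hypothetical toy frames: 'Q' appears only in the training rows.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Family": [0, 1, 2]})
toy_test = pd.DataFrame({"Embarked": ["S", "S", "C"], "Family": [1, 0, 3]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Give the test frame exactly the training columns, in the same order;
# a dummy column missing from test is created and filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```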
train
test
!pip install scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import shuffle
train_data, train_label = shuffle(train_data, train_label, random_state = 5)
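def train_and_test(model):
    # Note: this rebinds the name train_and_test, which earlier held the list [train, test].
    model.fit(train_data, train_label)
    prediction = model.predict(test_data)
    # This score is computed on the training data itself, so it is not a held-out estimate.
    accuracy = round(model.score(train_data, train_label) * 100, 2)
    print("Training accuracy : ", accuracy, "%")
    return prediction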
# Logistic Regression
log_pred = train_and_test(LogisticRegression())
# SVM
svm_pred = train_and_test(SVC())
# kNN
knn_pred_4 = train_and_test(KNeighborsClassifier(n_neighbors = 4))
# Random Forest
rf_pred = train_and_test(RandomForestClassifier(n_estimators=100))
# Naive Bayes
nb_pred = train_and_test(GaussianNB())
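The accuracy printed by `train_and_test` comes from `model.score(train_data, train_label)`, that is, from the same rows the model was fitted on, so it tends to be optimistic, especially for a random forest. A common complement is k-fold cross-validation. The sketch below uses synthetic data (the `X`, `y` arrays and the 5-fold setting are assumptions for illustration); in this notebook, `train_data` and `train_label` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary features; the label depends on the first two columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0] | X[:, 1]

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each of the 5 scores is accuracy on a fold the model did not see while fitting.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Mean CV accuracy:", round(scores.mean() * 100, 2), "%")
```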
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": rf_pred
})
submission.to_csv('submission_rf.csv', index=False)