Reference

Pima Indians Diabetes Prediction

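Variable                  Definition
Pregnancies               Number of pregnancies
Glucose                   Glucose tolerance test value
BloodPressure             Blood pressure
SkinThickness             Subcutaneous fat measured at the back of the triceps (mm)
Insulin                   Serum insulin (mu U/ml)
BMI                       Body mass index $(\frac{kg}{m^2})$
DiabetesPedigreeFunction  Diabetes pedigree (family history) weight
Age                       Age
Outcome                   Class label (0 or 1)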

Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Binarizer

import warnings
warnings.filterwarnings(action='ignore')

Preprocessing

diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
diabetes_data.Outcome.value_counts()
0    500
1    268
Name: Outcome, dtype: int64
diabetes_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
  • Glucose, BloodPressure, SkinThickness, Insulin, and BMI should never be 0, so a 0 in these columns is not a valid measurement.
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

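plt.figure(figsize=(10, 10))
for i, feature in enumerate(zero_features):
    plt.subplot(3, 2, i + 1)  # 3x2 grid, one histogram per feature
    plt.hist(diabetes_data[feature])
    plt.title(feature)
plt.tight_layout()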
diabetes_data[diabetes_data['Glucose'] == 0]['Glucose'].count()
5
for feature in zero_features:
    # Count the rows where this feature is 0 and report the share of all rows.
    zero_count = diabetes_data[diabetes_data[feature] == 0][feature].count()
    print('{0}: zero count = {1}, percentage = {2:.2f}%'.format(feature, zero_count,
                                                 100*zero_count / diabetes_data[feature].count()))
Glucose: zero count = 5, percentage = 0.65%
BloodPressure: zero count = 35, percentage = 4.56%
SkinThickness: zero count = 227, percentage = 29.56%
Insulin: zero count = 374, percentage = 48.70%
BMI: zero count = 11, percentage = 1.43%
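
The per-feature loop above can also be written without an explicit loop. The following is a minimal sketch on a made-up toy frame (the `toy` values and the `toy_zero_features` name are illustrative assumptions, not the notebook's data); it relies on the fact that summing a boolean column counts its True values and averaging it gives their fraction.

```python
import pandas as pd

# Toy frame standing in for diabetes_data (values are made up for illustration).
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'Insulin': [0, 0, 94, 168],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})
toy_zero_features = ['Glucose', 'Insulin', 'BMI']

# Comparing to 0 yields a boolean frame; sum() counts the True values per column.
is_zero = toy[toy_zero_features] == 0
zero_counts = is_zero.sum()
# The mean of a boolean column is the fraction of True values.
zero_pct = is_zero.mean() * 100

summary = pd.DataFrame({'zero_count': zero_counts, 'zero_pct': zero_pct})
print(summary)
```

On the real data the same two expressions would be applied to `diabetes_data[zero_features]`.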

Dropping the rows where SkinThickness or Insulin is 0 would lose too much data, so the zeros are replaced with each feature's mean instead.

mean_zero_features = diabetes_data[zero_features].mean()
diabetes_data[zero_features] = diabetes_data[zero_features].replace(0, mean_zero_features)
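
One caveat with the replacement above: the column means are computed while the zeros are still present, so they are pulled toward 0. A minimal sketch of an alternative, which is not what this notebook does, marks the zeros as missing first so that the mean covers only the valid measurements. The toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for diabetes_data (values are made up for illustration).
toy = pd.DataFrame({
    'Insulin': [0.0, 0.0, 94.0, 168.0],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})
cols = ['Insulin', 'BMI']

# Mean as computed in the notebook: zeros are included in the average.
mean_with_zeros = toy[cols].mean()

# Alternative: turn zeros into NaN so mean() skips them, then fill the gaps
# column by column with the mean of the remaining values.
masked = toy[cols].replace(0, np.nan)
mean_without_zeros = masked.mean()
filled = masked.fillna(mean_without_zeros)

print(mean_with_zeros['Insulin'], mean_without_zeros['Insulin'])
```

For a column such as Insulin, where nearly half the values are 0, the two means can differ substantially, so which one to use is a modelling choice worth stating explicitly.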

Prediction

def precision_recall_curve_plot(y_test, pred_proba_c1):
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)
    
    plt.figure(figsize=(8,6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle = '-', label = 'precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label = 'recall')
    
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))
    
    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()

def get_clf_eval(y_test, pred=None, pred_proba=None): # model evaluation helper
    # pred holds hard 0/1 predictions; pred_proba holds positive-class probabilities.
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    
    # AUC is computed from the probabilities, so it does not depend on the threshold.
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    print('Accuracy : {0:.3f}, Precision : {1:.3f}, Recall : {2:.3f}, F1 : {3:.3f}, AUC : {4:.3f}'.format( accuracy, precision, recall, f1, roc_auc))
    

def get_eval_by_threshold(y_test, pred_proba_c1, thresholds):
    # pred_proba_c1 must be 2D (n_samples, 1) because Binarizer expects a 2D array.
    for custom_threshold in thresholds:
        # Probabilities strictly above the threshold become 1, the rest 0.
        binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_c1)
        custom_predict = binarizer.transform(pred_proba_c1)
        print('-------------------------------------------------------------------------')
        print('Threshold:', round(custom_threshold,2))
        get_clf_eval(y_test, custom_predict, pred_proba_c1)
feature_name = diabetes_data.columns[:-1]
target_name = diabetes_data.columns[-1]

X = diabetes_data.loc[:, feature_name]
y = diabetes_data.loc[:, target_name]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 156, stratify=y)
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, pred, pred_proba)
Confusion matrix
[[90 10]
 [21 33]]
Accuracy : 0.799, Precision : 0.767, Recall : 0.611, F1 : 0.680, AUC : 0.845
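
As a cross-check, the reported scores can be recomputed by hand from the confusion matrix printed above. This short sketch uses only the four counts from that matrix, read in scikit-learn's layout `[[TN, FP], [FN, TP]]`; AUC is left out because it needs the predicted probabilities, not just the counts.

```python
# Counts taken from the confusion matrix above:
# rows are actual classes, columns are predicted classes.
tn, fp = 90, 10
fn, tp = 21, 33

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}')
# prints: accuracy=0.799, precision=0.767, recall=0.611, f1=0.680
```

The low recall comes from the 21 false negatives: 21 of the 54 actual positive cases are missed at the default threshold of 0.5.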
pred_proba_c1 = lr_clf.predict_proba(X_test)[:, 1]
precision_recall_curve_plot(y_test, pred_proba_c1)
thresholds = np.arange(0.3, 0.5, 0.03)
pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_threshold(y_test, pred_proba[:, 1].reshape(-1,1), thresholds)

-------------------------------------------------------------------------
Threshold: 0.3
Confusion matrix
[[67 33]
 [11 43]]
Accuracy : 0.714, Precision : 0.566, Recall : 0.796, F1 : 0.662, AUC : 0.845
-------------------------------------------------------------------------
Threshold: 0.33
Confusion matrix
[[73 27]
 [12 42]]
Accuracy : 0.747, Precision : 0.609, Recall : 0.778, F1 : 0.683, AUC : 0.845
-------------------------------------------------------------------------
Threshold: 0.36
Confusion matrix
[[76 24]
 [15 39]]
Accuracy : 0.747, Precision : 0.619, Recall : 0.722, F1 : 0.667, AUC : 0.845
-------------------------------------------------------------------------
Threshold: 0.39
Confusion matrix
[[79 21]
 [17 37]]
Accuracy : 0.753, Precision : 0.638, Recall : 0.685, F1 : 0.661, AUC : 0.845
-------------------------------------------------------------------------
Threshold: 0.42
Confusion matrix
[[84 16]
 [18 36]]
Accuracy : 0.779, Precision : 0.692, Recall : 0.667, F1 : 0.679, AUC : 0.845
-------------------------------------------------------------------------
Threshold: 0.45
Confusion matrix
[[85 15]
 [18 36]]
Accuracy : 0.786, Precision : 0.706, Recall : 0.667, F1 : 0.686, AUC : 0.845
-------------------------------------------------------------------------
Threshold: 0.48
Confusion matrix
[[89 11]
 [19 35]]
Accuracy : 0.805, Precision : 0.761, Recall : 0.648, F1 : 0.700, AUC : 0.845
binarizer = Binarizer(threshold=0.48)

pred_th_048 = binarizer.fit_transform(pred_proba[:, 1].reshape(-1,1))
get_clf_eval(y_test, pred_th_048, pred_proba[:, 1])
Confusion matrix
[[89 11]
 [19 35]]
Accuracy : 0.805, Precision : 0.761, Recall : 0.648, F1 : 0.700, AUC : 0.845
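
The threshold of 0.48 was picked by reading the printed results. A minimal sketch of making that choice programmatically, assuming F1 is the score to maximize, is shown below. The labels and probabilities are hypothetical, and `f1_at` is an illustrative helper rather than a scikit-learn function; it applies the same rule as `Binarizer` (values strictly greater than the threshold become 1).

```python
import numpy as np

# Hypothetical labels and positive-class probabilities (not from the notebook).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
proba = np.array([0.10, 0.41, 0.34, 0.80, 0.46, 0.60, 0.20, 0.47])

def f1_at(threshold):
    """F1 score when probabilities strictly above `threshold` are labelled 1."""
    pred = (proba > threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Same candidate grid as in the notebook: 0.30, 0.33, ..., 0.48.
candidates = np.arange(0.3, 0.5, 0.03)
scores = [f1_at(t) for t in candidates]

# argmax returns the first maximum, so ties go to the lowest threshold.
best_threshold = candidates[int(np.argmax(scores))]
print(round(float(best_threshold), 2), round(max(scores), 3))
```

Choosing the threshold on the same test set that is used for the final report tends to give an optimistic score, so a separate validation split is the safer place to run a search like this.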