
[Hands-on Practice] np.c_, np.append, grid.best_params_, GridSearchCV, classification_report

jasonshin 2021. 11. 25. 17:54

 

STEP #1: PROBLEM STATEMENT

  • Breast cancer diagnosis problem: predict whether a tumor is malignant or benign
  • 30 input features are used, for example:
    - radius
    - texture
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity
    - concave points
    - symmetry
    - fractal dimension ("coastline approximation" - 1)
  • Number of instances: 569
  • Class distribution: 212 malignant, 357 benign
  • Target classes: malignant, benign
  • Dataset: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
 


STEP #2: IMPORTING DATA

# import libraries
import pandas as pd              # pandas for data manipulation with dataframes
import numpy as np               # NumPy for numerical computation
import matplotlib.pyplot as plt  # matplotlib for data visualization
import seaborn as sb             # seaborn for statistical data visualization
# %matplotlib inline

 

from sklearn.datasets import load_breast_cancer

 

cancer = load_breast_cancer()

 

cancer

 

cancer.keys()

 

print(cancer['DESCR'])

 

cancer['target']

 

cancer['target_names']
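load_breast_cancer returns a scikit-learn Bunch, a dict-like object. As a quick sanity check of the pieces used below (a sketch; the expected shapes for this dataset are noted in the comments):

cancer['data'].shape    # (569, 30): 569 samples, 30 numeric features
cancer['target'].shape  # (569,): labels, 0 = malignant, 1 = benign
cancer['target_names']  # array(['malignant', 'benign'], dtype='<U9')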

 

 

# np.c_ and np.append

col_names = np.append(cancer['feature_names'], 'target')

 

df = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']], columns=col_names)
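The two NumPy helpers in the title behave quite differently: np.c_ concatenates along the second axis (stacking 1-D arrays side by side as columns), while np.append by default flattens both inputs and joins them end to end. A minimal sketch, reusing the np alias imported in Step #2:

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

np.c_[a, b]      # shape (3, 2): [[1, 4], [2, 5], [3, 6]]
np.append(a, b)  # shape (6,): [1, 2, 3, 4, 5, 6]

That is why np.c_ is used to attach the target column to the 569 x 30 feature matrix, while np.append simply extends the 1-D feature_names array with the string 'target'.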

 

STEP #3: VISUALIZING THE DATA

sb.pairplot(data=df[['mean radius', 'mean texture', 'mean perimeter']])
plt.show()

sb.pairplot(data=df, vars=['mean radius', 'mean texture', 'mean perimeter'], hue='target')
plt.show()

df1 = df[['mean radius', 'mean texture', 'mean perimeter']].corr()

 

# Show how many rows there are for each value of the target column.

sb.countplot(data=df, x='target')
plt.show()

 

# Plot the relationship between mean area and mean smoothness,
# with the target column mapped to hue.

 

sb.scatterplot(x='mean area', y='mean smoothness', hue='target', data=df)
plt.show()

# Show the correlation coefficients as a heatmap,
# with cmap set to 'coolwarm'.

plt.figure(figsize=(20,10))
sb.heatmap(data=df.corr(),annot=True, fmt='.1f', cmap='coolwarm', vmin=-1, vmax=1, linewidths=0.5)
plt.show()

STEP #4: MODEL TRAINING (FINDING A PROBLEM SOLUTION)

X = df.iloc[:, 0:-1]  # all 30 feature columns, i.e. everything except 'target'

y = df['target']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=5)
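With test_size=0.2, the 569 rows are split into 455 training and 114 test samples; the 114 reappears as the support total in the classification report of Step #6:

X_train.shape, X_test.shape  # ((455, 30), (114, 30))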

 

from sklearn.svm import SVC
classifier = SVC()
classifier.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(X_test)  # predict on the test set before evaluating
confusion_matrix(y_test, y_pred)

 

STEP #5: EVALUATING THE MODEL

# Apply feature scaling and see whether the results improve.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
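One caveat: fit_transform is applied to all of X before the split, so the scaler's mean and standard deviation are computed with the test rows included. A leakage-free variant (a sketch, starting again from the unscaled features) fits the scaler on the training portion only:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit scaling statistics on the training data only
X_test = scaler.transform(X_test)        # reuse those statistics for the test data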

 

from sklearn.svm import SVC

classifier = SVC()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)

accuracy_score(y_test, y_pred)

 

STEP #6: IMPROVING THE MODEL

# grid search

param_grid = {'C': [0.1, 1, 10, 100],       # regularization strength; must be strictly positive
              'kernel': ['rbf', 'linear'],  # radial basis function or linear kernel
              'gamma': [1, 0.1, 0.01]}      # RBF kernel coefficient
# 4 x 2 x 3 = 24 parameter combinations, each fit with the default 5-fold cross-validation

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=4)

grid.fit(X_train, y_train)

best_classifier = grid.best_estimator_  # refit=True retrained the best combination on all of X_train

y_pred = best_classifier.predict(X_test)

confusion_matrix(y_test, y_pred)

array([[46,  2],
       [ 1, 65]], dtype=int64)
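Rows of the confusion matrix are true labels and columns are predictions: 46 of the 48 malignant (class 0) test samples and 65 of the 66 benign (class 1) samples are classified correctly.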

# Which combination of parameters is best?
grid.best_params_

{'C': 1, 'gamma': 1, 'kernel': 'linear'}

# Best score: this is the mean cross-validation accuracy on the training data, not accuracy measured on the test set!
grid.best_score_

0.9714285714285715
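For comparison with best_score_, the accuracy of the refitted best estimator on the held-out test set can be computed directly (accuracy_score was already imported in Step #5):

accuracy_score(y_test, y_pred)  # (46 + 65) / 114 ≈ 0.974, from the confusion matrix above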

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.98      0.96      0.97        48
         1.0       0.97      0.98      0.98        66

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
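For reference, precision = TP / (TP + FP) and recall = TP / (TP + FN) are computed per class, f1-score is their harmonic mean, and the support column (48 + 66 = 114) counts the test samples in each class.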