STEP #1: PROBLEM STATEMENT
- Breast cancer diagnosis: predict whether a tumor is benign or malignant.
- 30 input features are used, for example:
  - radius
  - texture
  - perimeter
  - area
  - smoothness (local variation in radius lengths)
  - compactness (perimeter^2 / area - 1.0)
  - concavity
  - concave points
  - symmetry
  - fractal dimension ("coastline approximation" - 1)
- Number of instances: 569
- Class distribution: 212 malignant, 357 benign
- Target classes:
  - Malignant
  - Benign
- Source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
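As a quick sanity check, the figures listed above can be compared against the copy of the dataset bundled with scikit-learn. This is a minimal sketch; the variable names (`n_samples`, `counts`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
n_samples, n_features = cancer.data.shape

# target_names is ['malignant', 'benign'], so label 0 = malignant and 1 = benign
counts = {str(name): int(c)
          for name, c in zip(cancer.target_names, np.bincount(cancer.target))}

print(n_samples, n_features, counts)
```

Note that in this encoding the malignant class is label 0, which matters when reading the confusion matrix and classification report later on.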
STEP #2: IMPORTING DATA
# import libraries
import pandas as pd # Import Pandas for data manipulation using dataframes
import numpy as np # Import Numpy for data statistical analysis
import matplotlib.pyplot as plt # Import matplotlib for data visualisation
import seaborn as sb # Statistical data visualization
# %matplotlib inline
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
cancer
cancer.keys()
print(cancer['DESCR'])
cancer['target']
cancer['target_names']
# Combine features and target into one DataFrame using np.c_ and np.append
col_names = np.append(cancer['feature_names'], 'target')
df = pd.DataFrame(data=np.c_[cancer['data'], cancer['target']], columns=col_names)
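To make the roles of `np.c_` and `np.append` clearer, here is a small sketch on toy arrays (the arrays `features`, `labels` and the names `'a'`, `'b'` are made up for illustration). It also shows a side effect: because the integer labels are stacked next to float features, they become floats, which is why the target shows up as 0.0 and 1.0 further down.

```python
import numpy as np

features = np.array([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)
labels = np.array([0, 1])                       # shape (2,)

# np.c_ concatenates along the last axis, so the 1-D labels become a new column
table = np.c_[features, labels]                 # shape (2, 3), dtype float64

# np.append returns a new array with the extra value added at the end
names = np.append(np.array(['a', 'b']), 'target')

print(table)
print(names)
```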
STEP #3: VISUALIZING THE DATA
sb.pairplot(data=df[['mean radius', 'mean texture', 'mean perimeter']])
plt.show()
sb.pairplot(data=df, vars=['mean radius', 'mean texture', 'mean perimeter'], hue='target')
plt.show()
df1 = df[['mean radius', 'mean texture', 'mean perimeter']].corr()
# Plot how many samples there are for each value of the target column.
sb.countplot(data=df, x='target')
plt.show()
# Plot the relationship between mean area and mean smoothness,
# using target as the hue.
sb.scatterplot(x='mean area', y='mean smoothness', hue='target', data=df)
plt.show()
# Show the correlation coefficients as a heatmap,
# with cmap set to 'coolwarm'.
plt.figure(figsize=(20,10))
sb.heatmap(data=df.corr(), annot=True, fmt='.1f', cmap='coolwarm', vmin=-1, vmax=1, linewidths=0.5)
plt.show()
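A 31x31 heatmap is hard to read at a glance, so it can help to pull specific numbers out of the correlation matrix. The following is a sketch rather than part of the original notebook: it ranks features by absolute correlation with the target and checks one pair of size-related features. The `key` argument of `sort_values` assumes pandas 1.1 or newer.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

corr = df.corr()

# Correlation of each feature with the target, strongest (by absolute value) first
target_corr = corr['target'].drop('target').sort_values(key=abs, ascending=False)
print(target_corr.head())

# Size-related features such as radius and perimeter are nearly collinear
radius_perimeter = corr.loc['mean radius', 'mean perimeter']
print(radius_perimeter)
```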
STEP #4: MODEL TRAINING (FINDING A PROBLEM SOLUTION)
# Features are every column except the last one (target)
X = df.iloc[:, :-1]
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=5)
from sklearn.svm import SVC
classifier = SVC()
classifier.fit(X_train, y_train)
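# Predict on the test set before building the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(X_test)
confusion_matrix(y_test, y_pred)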
STEP #5: EVALUATING THE MODEL
# Check the result again after feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
from sklearn.svm import SVC
classifier = SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
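The scaling step above fits the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. One common alternative, sketched here as an assumption rather than the original author's method, is to wrap the scaler and the classifier in a scikit-learn pipeline so the scaler is fitted on the training split only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics to transform X_test before predicting
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

On a dataset of this size the difference in accuracy is likely to be small, but the pipeline keeps the test set fully held out.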
STEP #6: IMPROVING THE MODEL
# Grid search over SVC hyperparameters (C must be strictly positive)
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['rbf', 'linear'], 'gamma': [1, 0.1, 0.01]}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=4)
grid.fit(X_train, y_train)
best_classifier = grid.best_estimator_
y_pred = best_classifier.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[46,  2],
       [ 1, 65]], dtype=int64)
# Which parameter combination performed best?
grid.best_params_
{'C': 1, 'gamma': 1, 'kernel': 'linear'}
# Best score: this is the mean cross-validated accuracy on the training data, not accuracy measured on the test data!
grid.best_score_
0.9714285714285715
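To see the difference between `best_score_` and test accuracy side by side, the sketch below runs a reduced grid (an assumption made to keep it fast, not the grid used above) and reports both numbers. `best_score_` is averaged over the validation folds carved out of the training data, while `grid.score` evaluates the refitted best model on the held-out test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)

# Scale using statistics from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Reduced, illustrative grid
small_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
grid = GridSearchCV(SVC(), small_grid, cv=5, refit=True)
grid.fit(X_train_s, y_train)

cv_accuracy = grid.best_score_                  # mean accuracy over the 5 validation folds
test_accuracy = grid.score(X_test_s, y_test)    # accuracy of the refitted best model on the test set
print(grid.best_params_, cv_accuracy, test_accuracy)
```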
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         0.0       0.98      0.96      0.97        48
         1.0       0.97      0.98      0.98        66

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
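The numbers in the report follow directly from the confusion matrix shown earlier. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 from that matrix by hand, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

cm = np.array([[46, 2],
               [1, 65]])   # rows: true class, columns: predicted class

accuracy = np.trace(cm) / cm.sum()            # (46 + 65) / 114
precision = np.diag(cm) / cm.sum(axis=0)      # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)         # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))      # 0.97
print(np.round(precision, 2))  # [0.98 0.97]
print(np.round(recall, 2))     # [0.96 0.98]
print(np.round(f1, 2))         # [0.97 0.98]
```

For class 0.0, for example, precision is 46 / (46 + 1) and recall is 46 / (46 + 2), which round to the 0.98 and 0.96 in the report.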