[TensorFlow] Binary Classification of Wine Data with TensorFlow 2.x

byunghyun23 2021. 3. 5. 16:14

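This post is a worked example of training a deep neural network for classification, building on logistic regression.

We train a model on the wine dataset and use it for binary classification of wines (red vs. white).

The code is as follows.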

import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# 5.1 Load the wine datasets
red = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
white = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
print(red.head())
print(white.head())

# 5.2 Label and combine the wine datasets
red['type'] = 0
white['type'] = 1
print(red.head(2))
print(white.head(2))

wine = pd.concat([red, white])
print(wine.describe())

# 5.3 Histogram of the type column (red vs. white)
plt.hist(wine['type'])
plt.xticks([0, 1])
plt.show()

# 5.4 Check the data summary
print(wine.info())

# 5.5 Normalize the data (min-max scaling)
wine_norm = (wine - wine.min()) / (wine.max() - wine.min())
print(wine_norm.head())
print(wine_norm.describe())

# 5.6 Shuffle the data, then convert to a NumPy array
wine_shuffle = wine_norm.sample(frac=1)
print(wine_shuffle.head())
wine_np = wine_shuffle.to_numpy()
print(wine_np[:5])

# 5.7 Split into train and test data
train_idx = int(len(wine_np) * 0.8)
train_X, train_Y = wine_np[:train_idx, :-1], wine_np[:train_idx, -1]
test_X, test_Y = wine_np[train_idx:, :-1], wine_np[train_idx:, -1]
print(train_X[0])
print(train_Y[0])
print(test_X[0])
print(test_Y[0])
train_Y = tf.keras.utils.to_categorical(train_Y, num_classes=2)
test_Y = tf.keras.utils.to_categorical(test_Y, num_classes=2)
print(train_Y[0])
print(test_Y[0])

# 5.8 Build the wine classification model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=48, activation='relu', input_shape=(12,)),
    tf.keras.layers.Dense(units=24, activation='relu'),
    tf.keras.layers.Dense(units=12, activation='relu'),
    tf.keras.layers.Dense(units=2, activation='softmax')
])

# `lr` is a deprecated alias; `learning_rate` is the supported argument name.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.07), loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

# 5.9 Train the wine classification model
history = model.fit(train_X, train_Y, epochs=25, batch_size=32, validation_split=0.25)

# 5.10 Visualize the training results
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], 'b-', label='loss')
plt.plot(history.history['val_loss'], 'r--', label='val_loss')
plt.xlabel('Epoch')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], 'g-', label='accuracy')
plt.plot(history.history['val_accuracy'], 'k--', label='val_accuracy')
plt.xlabel('Epoch')
plt.ylim(0.7, 1)
plt.legend()

plt.show()

# 5.11 Evaluate the classification model
model.evaluate(test_X, test_Y)

# Compare the test labels with the predictions
pred_Y = model.predict(test_X)
print(test_Y)
print(pred_Y)

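Let's walk through the code.

First, we load the wine data provided by the University of California, Irvine (UCI) as pandas DataFrames.

To mark which kind of wine each row is, we add a column named 'type' (a name we chose) to the red and white wine data and assign it the value 0 or 1 (here 0 means red wine and 1 means white wine).

For training, we use concat() to merge the red and white wine data into a single DataFrame named 'wine' (again, a name we chose).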
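The labeling and merging step can be tried on a small scale. The sketch below uses two made-up tables (`red_toy`, `white_toy` and their values are placeholders, not rows from the UCI files) to show how a scalar assignment fills the whole 'type' column and what concat() does with the row index.

```python
import pandas as pd

# Two tiny stand-in tables; column values are invented for illustration.
red_toy = pd.DataFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20]})
white_toy = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "pH": [3.00, 3.30, 3.26]})

# Assigning a scalar to a new column broadcasts it to every row.
red_toy["type"] = 0    # 0 = red
white_toy["type"] = 1  # 1 = white

# concat() stacks the rows; each frame keeps its own index unless ignore_index=True.
wine_toy = pd.concat([red_toy, white_toy])

print(wine_toy.shape)                             # (5, 3)
print(wine_toy["type"].value_counts().to_dict())  # {1: 3, 0: 2}
print(wine_toy.index.tolist())                    # [0, 1, 0, 1, 2]
```

The repeated index values are harmless in this post because the data is later converted to a NumPy array, but passing `ignore_index=True` to concat() would give a clean 0..n-1 index if one is needed.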

We use a histogram to check the number of red wines (0) and white wines (1).

Fig 1. Comparison of red(0) and white(1) wines

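Before normalizing, we call info() to check what kinds of values the data contains.

Every column is numeric, so min-max normalization can be applied to the whole DataFrame without any extra handling.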
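As a small illustration of the normalization expression, the sketch below applies the same `(x - min) / (max - min)` formula to a made-up three-row table (`toy` and its values are assumptions for the example). pandas evaluates it column by column, so every column ends up in the range 0 to 1.

```python
import pandas as pd

toy = pd.DataFrame({"alcohol": [8.0, 10.0, 14.0], "quality": [3, 6, 9]})

# min() and max() return one value per column; the arithmetic aligns on column names.
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_norm)
print(toy_norm.min().tolist())  # [0.0, 0.0]
print(toy_norm.max().tolist())  # [1.0, 1.0]
```

One thing to watch for: a column whose values are all identical has `max - min == 0`, so this formula would produce NaN for it.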

After shuffling the normalized data randomly, we convert the pandas DataFrame to an ndarray so it can be used with TensorFlow.

sample(frac=1) draws 100% of the wine data, that is, every row, in random order.
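A quick way to see what sample(frac=1) does is to run it on a tiny table. In this sketch, `toy` is a made-up DataFrame and `random_state=0` is an optional seed added only so the shuffle is repeatable; the original code does not set one.

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# frac=1 draws every row exactly once (sampling without replacement), in random order.
shuffled = toy.sample(frac=1, random_state=0)

print(shuffled.index.tolist())         # the original row labels, reordered
print(sorted(shuffled["x"].tolist()))  # [10, 20, 30, 40, 50]
```

No rows are lost or duplicated; only their order changes, which is what makes the later 80/20 slice a random split.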

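Next, we split the wine data into training data and test data.

The code below assigns every column except 'type' to train_X (the first 80% of rows) and test_X (the remaining 20%), and assigns the 'type' column, which is the last column, to train_Y (80%) and test_Y (20%).

For more on NumPy arrays, please refer to this post.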

train_idx = int(len(wine_np) * 0.8)
train_X, train_Y = wine_np[:train_idx, :-1], wine_np[:train_idx, -1]
test_X, test_Y = wine_np[train_idx:, :-1], wine_np[train_idx:, -1]
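The slicing pattern is easier to follow on a small array. The sketch below uses a made-up 10x4 array (`data` is a placeholder, with the last column playing the role of 'type') and the same index expressions as the code above.

```python
import numpy as np

# 10 rows, 4 columns: three feature columns plus a label in the last column.
data = np.arange(40, dtype=float).reshape(10, 4)

split = int(len(data) * 0.8)  # 8 rows go to training

# [:split, :-1] = first 8 rows, all columns but the last; [:split, -1] = last column only.
X_train, y_train = data[:split, :-1], data[:split, -1]
X_test, y_test = data[split:, :-1], data[split:, -1]

print(X_train.shape, y_train.shape)  # (8, 3) (8,)
print(X_test.shape, y_test.shape)    # (2, 3) (2,)
print(y_test)                        # [35. 39.]
```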

After the split, we use to_categorical() to convert the labels to one-hot encoding.

One-hot encoding assigns 1 to the index that corresponds to the correct class and 0 to every other index.

The second parameter of to_categorical(), num_classes, is the number of classes. When num_classes is n, each encoded label is a vector of length n.

Suppose train_Y[0] == 1.0.

After running categorical_train_Y = tf.keras.utils.to_categorical(train_Y, num_classes=2),

the value of categorical_train_Y[0] is [0. 1.].

Each row of categorical_train_Y has num_classes entries. The input value train_Y[0] is treated as an index: the entry at that index is set to 1 and every other entry is set to 0.

Suppose train_Y[0] == 3.0.

After running categorical_train_Y = tf.keras.utils.to_categorical(train_Y, num_classes=5),

the value of categorical_train_Y[0] is [0. 0. 0. 1. 0.].

In other words, the input value is matched to an index, that entry is set to 1, and the rest are set to 0.
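The index-matching rule can be reproduced without TensorFlow. The helper below, `one_hot`, is a hypothetical NumPy-only stand-in written for this illustration. It is not how to_categorical() is implemented, but it should give the same result for the two cases discussed here.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

print(one_hot([1.0], 2))  # [[0. 1.]]
print(one_hot([3.0], 5))  # [[0. 0. 0. 1. 0.]]
```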

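Before training, we define the model.

It is similar to a regression model, but a classification model uses the softmax function as the activation of its final layer.

The outputs of softmax always sum to 1.0, so they can be read as the probability of each class, which makes softmax a natural fit for classification.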
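To make the sum-to-one property concrete, here is a minimal NumPy sketch of softmax. The `softmax` function and the `logits` values are illustrative assumptions, not taken from the trained model; subtracting the row maximum is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(logits):
    # Shift by the row maximum for numerical stability.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5], [-1.0, 3.0]])  # two samples, two classes
probs = softmax(logits)

print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

The larger raw score in each row receives the larger probability, so the predicted class is the position of the row maximum.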

For the loss function, we use categorical crossentropy.
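For a one-hot label y and predicted probabilities p, categorical crossentropy is -sum(y_i * log(p_i)), which reduces to the negative log of the probability assigned to the correct class. The sketch below computes it with NumPy; the function name, the `eps` clipping guard, and the sample values are assumptions made for this illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0) when a predicted probability is exactly 0.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])  # softmax-style outputs

losses = categorical_crossentropy(y_true, y_pred)
print(losses)         # per-sample loss: -log(0.9) and -log(0.8)
print(losses.mean())  # average over the batch
```

A confident correct prediction gives a loss near 0, while a confident wrong prediction is penalized heavily.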

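Classification performance is measured by accuracy, so we pass metrics=['accuracy'] to compile(). Without it, accuracy is not recorded during training, and the accuracy plot in step 5.10 would have nothing to read from history.

fit() sets aside 25% of the training data as validation data (validation_split=0.25) and trains on the rest.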
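As documented for Keras, validation_split takes the held-out samples from the end of the arrays passed to fit(), before any per-epoch shuffling. The NumPy sketch below mimics that slicing on made-up data (`X`, `y`, and the sizes are placeholders) to show which rows end up in the validation set. It is an approximation of the behavior, not Keras code.

```python
import numpy as np

X = np.arange(20).reshape(20, 1)  # 20 toy samples
y = np.arange(20) % 2
validation_split = 0.25

# The first 75% is used for fitting, the last 25% for validation.
split_at = int(len(X) * (1.0 - validation_split))
X_fit, y_fit = X[:split_at], y[:split_at]
X_val, y_val = X[split_at:], y[split_at:]

print(len(X_fit), len(X_val))  # 15 5
print(X_val.ravel().tolist())  # [15, 16, 17, 18, 19]
```

Because the validation rows are simply the tail of the array, shuffling the data beforehand (as sample(frac=1) does in this post) matters for getting a representative validation set.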

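We then plot the loss and the accuracy.

We evaluate the model on the test data with evaluate().

Comparing the test labels with the predicted values shows that the predictions agree closely with the labels.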

Fig 5. Comparison of test and predicted data

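The prediction at index [length-3] is [1.0000000e+00 6.2873126e-18].

1.0000000e+00 = 1 and 6.2873126e-18 is about 6.3 x 10^-18 (effectively 0), so the left value is larger, which means class 0 (red).

The prediction at index [length-2] is [8.3044963e-03 9.9169546e-01].

8.3044963e-03 = 0.0083044963 and 9.9169546e-01 = 0.99169546, so the right value is larger, which means class 1 (white).

Comparing these with the labels, we can see that they match.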
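Picking the larger of the two values by eye can be automated with argmax. The sketch below uses the two predictions quoted above; the matching one-hot labels in `true` are assumed from the statement that the predictions agree with the answers.

```python
import numpy as np

pred = np.array([[1.0000000e+00, 6.2873126e-18],
                 [8.3044963e-03, 9.9169546e-01]])
# Assumed one-hot answers for the same two rows.
true = np.array([[1.0, 0.0],
                 [0.0, 1.0]])

# argmax along axis 1 returns the index of the largest value in each row.
pred_class = pred.argmax(axis=1)
true_class = true.argmax(axis=1)

print(pred_class)                         # [0 1]
print((pred_class == true_class).mean())  # 1.0
```

Applied to the full pred_Y and test_Y arrays, the same comparison would give the fraction of test rows classified correctly.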
