AI DEV GUIDE SERIES

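Deep Learning A to Z: The Complete Guide

Neural Networks · CNN · RNN · Transformer · TensorFlow · PyTorch in Practice

From concepts to deployment in one guide, aimed at getting non-majors ready for production deep learning work within three months.

Audience: learners who have completed ML basics | Duration: 4-6 weeks | Difficulty: ★★★★☆ | Guide 5/7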

AI Dev Guide Series Roadmap

Step	Guide	Core topic	Status
01	AiDevGuide0001	AI Concepts	Done
02	AiDevGuide0002	Python Basics	Done
03	AiDevGuide0003	Working with Data (NumPy·Pandas·Matplotlib)	Done
04	AiDevGuide0004	Machine Learning (Scikit-learn)	Done
05	AiDevGuide0005 (current)	Deep Learning (TensorFlow·PyTorch)	In progress
06	AiDevGuide0006	Generative AI (OpenAI API·LangChain·RAG)	Planned
07	AiDevGuide0007	Capstone Projects (HuggingFace·FastAPI·Docker)	Planned

Full Table of Contents (10 Chapters)

Ch.	Chapter	Key content	Difficulty
01	Deep Learning Overview & Neural Network Basics	Perceptron, activation functions, backpropagation, gradient descent	★★☆☆☆
02	TensorFlow / Keras	Sequential/Functional API, callbacks, saving models	★★★☆☆
03	PyTorch	Tensor, Autograd, nn.Module, DataLoader	★★★☆☆
04	CNN: Image Classification	Conv2D, Pooling, ResNet, VGG, EfficientNet	★★★★☆
05	RNN / LSTM / GRU: Time Series & Text	Recurrent networks, long-term dependencies, text classification	★★★★☆
06	Transformer & the Attention Mechanism	Self-Attention, Multi-Head, BERT and GPT architectures	★★★★★
07	Transfer Learning & Fine-tuning	Using pretrained models, feature extraction, PEFT	★★★★☆
08	Regularization & Optimization in Depth	Dropout, BatchNorm, Adam, learning-rate schedulers	★★★★☆
09	Hands-on Projects (MNIST·CIFAR-10·Sentiment Analysis)	End-to-end pipeline, GPU usage, model deployment	★★★★★
10	Top 10 Interview Q&A	Core deep learning interview questions and model answers	★★★★★

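Ch 01. Deep Learning Overview & Neural Network Basics

From the perceptron to backpropagation: understanding how deep learning actually works

1-1. What Is Deep Learning?

Deep learning is a branch of machine learning built on artificial neural networks, which are loosely inspired by the structure of biological neurons. "Deep" refers to stacking many layers, with each successive layer learning progressively more abstract features of the data.

Aspect	Machine Learning	Deep Learning
Feature extraction	Manual (hand-engineered)	Automatic (learned by the model)
Data required	Small to medium	Large (often GB to TB)
Interpretability	Relatively high	Low (black box)
Compute	CPU is usually enough	GPU/TPU usually needed for practical training
Performance (image/speech)	Limited	Human-level or better on some benchmarks

1-2. The Perceptron: The Basic Unit of a Neural Network

A perceptron is a simplified mathematical model of a biological neuron.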

# Perceptron formula
# Inputs:  x1, x2, ... xn
# Weights: w1, w2, ... wn
# Bias:    b
# Output:  y = activation(w1*x1 + w2*x2 + ... + wn*xn + b)

import numpy as np

class Perceptron:
    def __init__(self, n_inputs, lr=0.01):
        self.w = np.random.randn(n_inputs) * 0.01  # small random initial weights
        self.b = 0.0                                 # initial bias
        self.lr = lr                                 # learning rate

    def activate(self, z):
        return 1 if z >= 0 else 0  # step function

    def predict(self, x):
        z = np.dot(self.w, x) + self.b  # linear combination
        return self.activate(z)

    def fit(self, X, y, epochs=10):
        for epoch in range(epochs):
            for xi, yi in zip(X, y):
                y_pred = self.predict(xi)
                error = yi - y_pred             # error (0 when the prediction is correct)
                self.w += self.lr * error * xi  # weight update
                self.b += self.lr * error       # bias update
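
To see the perceptron learning rule converge, the sketch below re-implements the same update in plain Python and trains it on the AND gate. The function names (`step`, `train_perceptron`), the zero initialization, and the learning rate of 1.0 are illustrative choices, not part of the class above; they keep the arithmetic exact so the run is easy to trace by hand.

```python
# Minimal perceptron-learning sketch on the AND gate (pure Python, no NumPy).
# Assumptions: weights start at zero and lr=1.0 so every value stays an exact integer.

def step(z):
    """Step activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, lr=1.0, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            error = target - pred          # -1, 0, or +1
            w[0] += lr * error * x1        # same update rule as Perceptron.fit
            w[1] += lr * error * x2
            b += lr * error
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in and_gate]
print(predictions)  # [0, 0, 0, 1]
```

AND is linearly separable, so the rule settles after a handful of epochs. XOR is not, which is why a single perceptron cannot learn it and hidden layers are needed.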

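1-3. Activation Functions

Function	Formula	Range	Typical use	Drawback
ReLU	max(0, x)	[0, ∞)	Default for hidden layers	Dying ReLU
Leaky ReLU	x if x >= 0, else 0.01x	(-∞, ∞)	Improved ReLU	alpha needs tuning
Sigmoid	1/(1+e^-x)	(0, 1)	Binary classification output	Vanishing gradient
Tanh	(e^x-e^-x)/(e^x+e^-x)	(-1, 1)	Inside RNN cells	Vanishing gradient
Softmax	e^xi / sum_j(e^xj)	(0, 1), sums to 1	Multi-class output	Numerically unstable if computed naively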
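
The "numerically unstable" note for Softmax refers to overflow in `exp` for large logits. A common remedy, sketched here in plain Python, is to subtract the maximum logit before exponentiating; this leaves the result unchanged mathematically. The helper names are illustrative.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def softmax(logits):
    """Softmax with the max-subtraction trick to avoid overflow in exp()."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]   # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) alone would raise OverflowError; the shifted version is safe.
probs = softmax([1000.0, 1001.0, 1002.0])
print(probs)
```

Deep learning frameworks apply an equivalent stabilization internally, which is one reason to prefer their fused softmax/cross-entropy functions over hand-written versions.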

1-4. Backpropagation & Gradient Descent

Backpropagation propagates the output error backward through the network, applying the chain rule to compute the gradient of the loss with respect to every weight.

# Gradient descent variants
# 1. BGD (Batch Gradient Descent): uses the whole dataset per update -> stable but slow
# 2. SGD (Stochastic): updates on one sample at a time -> fast but noisy
# 3. Mini-batch: updates per batch -> the standard in practice

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative of the sigmoid

# Two-layer network: forward pass + backward pass
class SimpleNN:
    def __init__(self):
        np.random.seed(42)
        self.W1 = np.random.randn(2, 4) * 0.01  # input(2) -> hidden(4)
        self.b1 = np.zeros((1, 4))
        self.W2 = np.random.randn(4, 1) * 0.01  # hidden(4) -> output(1)
        self.b2 = np.zeros((1, 1))

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1          # layer 1 linear combination
        self.a1 = sigmoid(self.z1)                # layer 1 activation
        self.z2 = self.a1 @ self.W2 + self.b2    # layer 2 linear combination
        self.a2 = sigmoid(self.z2)                # output layer
        return self.a2

    def backward(self, X, y, lr=0.01):
        m = X.shape[0]
        dL_da2 = -(y - self.a2)                   # gradient of the loss 0.5*(y - a2)^2
        dL_dz2 = dL_da2 * sigmoid_deriv(self.z2) # backprop through the output layer
        dL_dW2 = self.a1.T @ dL_dz2 / m
        dL_db2 = dL_dz2.mean(axis=0, keepdims=True)

        dL_da1 = dL_dz2 @ self.W2.T
        dL_dz1 = dL_da1 * sigmoid_deriv(self.z1) # backprop through the hidden layer
        dL_dW1 = X.T @ dL_dz1 / m
        dL_db1 = dL_dz1.mean(axis=0, keepdims=True)

        # Parameter update
        self.W2 -= lr * dL_dW2
        self.b2 -= lr * dL_db2
        self.W1 -= lr * dL_dW1
        self.b1 -= lr * dL_db1
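
A standard way to sanity-check hand-written backpropagation is a finite-difference gradient check. The sketch below applies it to the sigmoid derivative only, using scalar math; the helper `numerical_deriv` and the step size `h` are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # analytic derivative used in backprop

def numerical_deriv(f, x, h=1e-5):
    """Central finite difference: (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(sigmoid_deriv(x) - numerical_deriv(sigmoid, x)) for x in points)
print(f"largest analytic-vs-numeric gap: {max_gap:.2e}")
```

The same idea extends to a full network: perturb one weight, recompute the loss, and compare the slope with the gradient that `backward` produced for that weight.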

1-5. Loss Functions

Loss function	Formula	When to use
MSE	mean((y - y_hat)^2)	Regression
Binary Crossentropy	-[y*log(p) + (1-y)*log(1-p)]	Binary classification
Categorical Crossentropy	-sum(y * log(p))	Multi-class classification (one-hot labels)
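
The three formulas in the table can be checked on tiny inputs. This is a plain-Python sketch; the function names and the `eps` clipping constant (which guards against `log(0)`) are illustrative choices rather than any library's API.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)            # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

mse_value = mse([1.0, 2.0], [1.0, 4.0])                         # (0 + 4) / 2
bce_value = binary_cross_entropy([1], [0.5])                    # -log(0.5)
cce_value = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])  # -log(0.7)
print(mse_value, bce_value, cce_value)
```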

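Ch 02. TensorFlow / Keras

Google's deep learning framework: from Sequential to custom layers

2-1. Installing TensorFlow and Setting Up the Environment

# Install TensorFlow (run these in a shell, not inside Python)
pip install tensorflow                # standard package
pip install "tensorflow[and-cuda]"    # Linux GPU install (TF 2.14+), pulls in the CUDA libraries
# Note: the separate tensorflow-gpu package is deprecated; the main
# tensorflow package has included GPU support since TF 2.1.

# Google Colab: already installed
# Check for a GPU
import tensorflow as tf
print("TF version:", tf.__version__)
print("GPUs available:", tf.config.list_physical_devices("GPU"))

# GPU memory setting (helps avoid OOM)
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)  # allocate only as much as needed
print("GPU setup complete")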

2-2. Sequential API: The Fastest Way to Start

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# === Sequential API: stack layers in order ===
model = keras.Sequential([
    # Input + first hidden layer
    keras.Input(shape=(784,)),      # explicit Input layer (preferred over input_shape= in recent Keras)
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),           # reduces overfitting

    # Second hidden layer
    layers.Dense(64, activation="relu"),
    layers.BatchNormalization(),   # batch normalization
    layers.Dropout(0.3),

    # Output layer (10 classes)
    layers.Dense(10, activation="softmax")
])

# Model summary
model.summary()

# Compile: choose the loss, optimizer, and metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # integer labels
    metrics=["accuracy"]
)

# Train on MNIST
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 784) / 255.0  # flatten and scale to [0, 1]
X_test  = X_test.reshape(-1, 784)  / 255.0

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=256,
    validation_split=0.2,
    verbose=1
)

# Evaluate
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {acc:.4f}")

2-3. Functional API: Building More Complex Models

# Functional API: supports multiple inputs/outputs, residual connections, and other non-linear topologies
inputs = keras.Input(shape=(784,))
x = layers.Dense(256, activation="relu")(inputs)
x = layers.Dropout(0.4)(x)
x = layers.Dense(128, activation="relu")(x)
skip = layers.Dense(128, activation="relu")(inputs)  # projected shortcut branch (matches the 128-dim width)
x = layers.Add()([x, skip])                           # element-wise sum of the two branches
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()

2-4. Callbacks: The Key to Controlling Training

# Callbacks that are essential in practice
# (recompile the model with model.compile(...) as in section 2-2 before calling fit)
callbacks = [
    # 1. EarlyStopping: stop when validation loss stops improving
    keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,          # stop after 5 epochs without improvement
        restore_best_weights=True  # restore the best weights seen
    ),

    # 2. ModelCheckpoint: save the best model
    keras.callbacks.ModelCheckpoint(
        filepath="best_model.keras",
        monitor="val_accuracy",
        save_best_only=True
    ),

    # 3. ReduceLROnPlateau: lower the learning rate automatically
    keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.5,    # halve the learning rate
        patience=3,
        min_lr=1e-7
    ),

    # 4. TensorBoard: training visualization
    keras.callbacks.TensorBoard(log_dir="./logs")
]

history = model.fit(X_train, y_train,
    epochs=100, batch_size=256,
    validation_split=0.2,
    callbacks=callbacks
)

2-5. Saving & Loading Models

# === Three ways to save a model ===

# 1. Save the whole model (recommended)
model.save("my_model.keras")          # native Keras format
model.save("my_model.h5")             # HDF5 format (legacy)

# 2. Save weights only
model.save_weights("model.weights.h5")   # Keras 3 expects the .weights.h5 suffix
model.load_weights("model.weights.h5")   # load into a model with the same architecture

# 3. SavedModel format (for TF Serving deployment)
model.export("saved_model_dir")       # TF 2.13+ / Keras 3; on older TF 2.x use model.save("saved_model_dir")

# Load
loaded_model = keras.models.load_model("my_model.keras")
predictions = loaded_model.predict(X_test[:5])
print("Predicted classes:", predictions.argmax(axis=1))

# TF Lite conversion (for mobile/edge deployment)
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print("TFLite conversion complete")

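Ch 03. PyTorch

A favorite framework in both research and industry: from Tensors to custom models

3-1. PyTorch vs TensorFlow

Aspect	PyTorch	TensorFlow/Keras
Philosophy	Dynamic graph (eager execution)	Static graph in 1.x; eager by default in 2.x
Debugging	Works with the standard Python debugger	Care needed inside tf.function
Research popularity	Dominant (used in a large majority of recent papers)	Strong in industry
HuggingFace	First-class support	Supported (with some limitations)
Mobile deployment	TorchScript, ONNX	TF Lite is a strong option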

3-2. Tensor Basics

import torch
import torch.nn as nn

# === Creating tensors ===
t1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # from a list
t2 = torch.zeros(3, 4)          # filled with 0
t3 = torch.ones(3, 4)           # filled with 1
t4 = torch.randn(3, 4)          # standard normal random values
t5 = torch.arange(0, 10, 2)     # like range()
t6 = torch.linspace(0, 1, 5)    # evenly spaced

# NumPy conversion
import numpy as np
arr = np.array([1, 2, 3])
t = torch.from_numpy(arr)       # NumPy -> Tensor (shares memory)
arr2 = t.numpy()                # Tensor -> NumPy (CPU tensors only)

# Moving to a device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = t4.to(device)               # moves to the GPU if one is available, otherwise stays on the CPU

# Tensor attributes
print(t4.shape)    # torch.Size([3, 4])
print(t4.dtype)    # torch.float32
print(t4.device)   # cpu (t4 itself was not moved; .to() returns a new tensor)

# Reshaping
t = torch.randn(2, 3, 4)
print(t.reshape(6, 4))    # reshape
print(t.view(24))         # view (requires contiguous memory)
print(t.permute(2, 0, 1)) # reorder axes
print(t.squeeze())        # remove size-1 dimensions (none here, so the shape is unchanged)
print(t.unsqueeze(0))     # add a dimension

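3-3. Autograd: The Automatic Differentiation Engine

import torch
import torch.nn as nn

# requires_grad=True: track gradients for this tensor
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x[0]**2 + 2*x[1]   # y = x0^2 + 2*x1
y.backward()             # backpropagation
print(x.grad)            # [dy/dx0, dy/dx1] = [4.0, 2.0]

# Gradients accumulate across backward() calls, so reset them every batch.
# A tiny stand-in model and optimizer, for illustration only:
model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer.zero_grad()    # always before the next backward()

# no_grad: skip gradient tracking during evaluation to save memory
test_input = torch.randn(4, 2)
with torch.no_grad():
    output = model(test_input)  # no gradients are computed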

3-4. nn.Module: Building a Custom Model

import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Multi-layer perceptron: a basic template for custom PyTorch models"""
    def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.BatchNorm1d(hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, output_dim)
        )

    def forward(self, x):
        return self.net(x)

model = MLP(784, 256, 10).to(device)
print(model)

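3-5. A Complete Training Loop (Standard PyTorch Pattern)

from torch.utils.data import TensorDataset, DataLoader, random_split

# Build the DataLoaders (X_train, y_train: the flattened, scaled MNIST arrays from section 2-2)
X_tensor = torch.as_tensor(X_train, dtype=torch.float32)
y_tensor = torch.as_tensor(y_train, dtype=torch.long)
dataset = TensorDataset(X_tensor, y_tensor)
n_val = int(0.2 * len(dataset))                       # hold out 20% for validation
train_dataset, val_dataset = random_split(dataset, [len(dataset) - n_val, n_val])
loader     = DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=256, shuffle=False)

# Optimizer, scheduler & loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)  # T_max matches the 50 epochs below
criterion = nn.CrossEntropyLoss()

best_val_loss = float("inf")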

for epoch in range(50):
    # === Training mode ===
    model.train()
    train_loss = 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()           # 1. reset gradients
        outputs = model(X_batch)        # 2. forward pass
        loss = criterion(outputs, y_batch)  # 3. compute the loss
        loss.backward()                 # 4. backward pass
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()                # 5. update the weights
        train_loss += loss.item()

    # === Evaluation mode ===
    model.eval()
    val_loss, correct = 0, 0
    with torch.no_grad():
        for X_batch, y_batch in val_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            out = model(X_batch)
            val_loss += criterion(out, y_batch).item()
            correct += (out.argmax(1) == y_batch).sum().item()

    scheduler.step()                    # once per epoch for CosineAnnealingLR
    avg_val_loss = val_loss / len(val_loader)

    # Save the best model so far
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save(model.state_dict(), "best_model.pt")

    if (epoch+1) % 5 == 0:
        acc = correct / len(val_dataset)
        print(f"Epoch {epoch+1} | Val Loss: {avg_val_loss:.4f} | Val Acc: {acc:.4f}")

# Load the best model
model.load_state_dict(torch.load("best_model.pt", map_location=device))
model.eval()
print("Best model loaded")

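Ch 04. CNN: Image Classification

How computers see images: from Conv2D to ResNet

4-1. Core CNN Concepts: What Is a Convolution?

A CNN (Convolutional Neural Network) learns local patterns in an image efficiently by sliding small shared filters across it.

Component	Role	Parameters
Conv2D	Extracts feature maps with filters	filters, kernel_size, stride, padding
MaxPooling2D	Shrinks spatial size while keeping the strongest responses	pool_size, stride
BatchNorm	Normalizes activations within a layer, stabilizing training	momentum, epsilon
Flatten/GAP	Converts feature maps to a vector	-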
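
The spatial size after a convolution or pooling layer follows floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride, and p the padding. The short sketch below applies that formula to a 28x28 input passing through two 3x3 convolutions with padding 1 and two 2x2 poolings; the function name is illustrative.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                                   # e.g. an MNIST image
size = conv_output_size(size, kernel=3, padding=1)          # 3x3 conv, "same" padding -> 28
size = conv_output_size(size, kernel=2, stride=2)           # 2x2 max pool -> 14
size = conv_output_size(size, kernel=3, padding=1)          # 3x3 conv -> 14
size = conv_output_size(size, kernel=2, stride=2)           # 2x2 max pool -> 7
flattened = 128 * size * size                               # with 128 channels
print(size, flattened)  # 7 6272
```

This arithmetic is what determines the input width of the first fully connected layer when Flatten is used instead of global average pooling.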

4-2. Building a CNN: TensorFlow Version

import tensorflow as tf
from tensorflow.keras import layers, models

# CIFAR-10: 32x32 color images, 10 classes
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train / 255.0
X_test  = X_test  / 255.0

def build_cnn():
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        # Block 1: 32 filters
        layers.Conv2D(32, (3,3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Conv2D(32, (3,3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling2D((2,2)),
        layers.Dropout(0.25),

        # Block 2: 64 filters
        layers.Conv2D(64, (3,3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Conv2D(64, (3,3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling2D((2,2)),
        layers.Dropout(0.25),

        # Block 3: 128 filters
        layers.Conv2D(128, (3,3), padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.GlobalAveragePooling2D(),  # GAP instead of Flatten
        layers.Dropout(0.5),

        # Classification head
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax")
    ])
    return model

model = build_cnn()
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

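# Data augmentation
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)
# datagen.fit() is only needed for feature-wise normalization or ZCA whitening, so it is omitted here.

# Hold out the last 5,000 training images for validation so the test set stays untouched
X_tr,  y_tr  = X_train[:-5000], y_train[:-5000]
X_val, y_val = X_train[-5000:], y_train[-5000:]

history = model.fit(
    datagen.flow(X_tr, y_tr, batch_size=64),
    epochs=50,
    validation_data=(X_val, y_val),
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)]
)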

4-3. ResNet: Training Deep Networks with Residual Connections

# A ResNet residual block in PyTorch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet block: the skip connection eases gradient flow in deep networks"""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        self.relu  = nn.ReLU(inplace=True)

        # Adjust the shortcut when the shapes differ
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        identity = self.shortcut(x)      # skip path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.relu(out + identity)  # residual connection
        return out

# Using a pretrained ResNet50 (transfer learning)
from torchvision import models as tv_models

# torchvision 0.13+; older versions used pretrained=True (now deprecated)
resnet = tv_models.resnet50(weights=tv_models.ResNet50_Weights.DEFAULT)
for param in resnet.parameters():    # freeze the existing parameters
    param.requires_grad = False

# Replace only the classification head (new layers are trainable by default)
num_features = resnet.fc.in_features
resnet.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 10)  # number of classes in your dataset
)

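4-4. Comparing Major Image Architectures

Parameter counts and accuracies are approximate; reported ImageNet numbers vary with the training recipe and evaluation protocol.

Model	Year	Parameters	ImageNet Top-1	Notes
VGG16	2014	138M	74.5%	Simple and easy to understand
ResNet50	2015	25M	76.1%	Residual connections, a standard baseline
EfficientNetB0	2019	5.3M	77.3%	Lightweight with strong accuracy
ConvNeXt-T	2022	28M	82.1%	Applies Transformer design ideas to a CNN
ViT-B/16	2020	86M	81.8%	Transformer-based image model (not a CNN)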

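Ch 05. RNN / LSTM / GRU: Time Series & Text

Processing ordered data: stock forecasting, sentiment analysis, machine translation

5-1. The Core Idea of RNNs: Remembering the Previous State

At each time step, an RNN combines the current input with the hidden state carried over from the previous step.

Model	Long-term memory	Parameter count	Typical use
Vanilla RNN	No (vanishing gradient)	Small	Teaching purposes
LSTM	Yes (cell state)	Large (about 4x a vanilla RNN)	Text, time series
GRU	Yes (simplified gating)	Medium (about 3x a vanilla RNN)	When faster training is needed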
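
The 4x and 3x figures in the table come from the number of gate-like weight sets per cell: one for a vanilla RNN, four for an LSTM, three for a GRU. The sketch below counts parameters for a single layer under the textbook formulation with one bias vector per gate; the function name and sizes are illustrative. PyTorch stores two bias vectors per gate, so its absolute counts are slightly higher, but the ratios are the same.

```python
def recurrent_params(input_size, hidden_size, gates):
    """Parameters in one recurrent layer: per gate, an input matrix,
    a recurrent matrix, and one bias vector (textbook formulation)."""
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return gates * per_gate

rnn_count  = recurrent_params(input_size=1, hidden_size=64, gates=1)
lstm_count = recurrent_params(input_size=1, hidden_size=64, gates=4)
gru_count  = recurrent_params(input_size=1, hidden_size=64, gates=3)
print(rnn_count, lstm_count, gru_count)  # 4224 16896 12672
```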

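5-2. Stock Price Forecasting with an LSTM: Time Series in Practice

import numpy as np
import torch
import torch.nn as nn

class LSTMPredictor(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            batch_first=True,   # (batch, seq, feature) layout
            bidirectional=True  # bidirectional over the (fully past) input window
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_size * 2, 32),  # *2 because of the two directions
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 1)  # predict the next value
        )

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)   # h_n: (num_layers*2, batch, hidden)
        # Final states of the last layer: forward (h_n[-2]) and backward (h_n[-1]).
        # Using out[:, -1, :] instead would give the backward direction only one step of context.
        out = torch.cat([h_n[-2], h_n[-1]], dim=1)   # (batch, hidden*2)
        return self.fc(out)

# Build sliding-window sequences from the series
def create_sequences(data, seq_len=60):
    X, y = [], []
    for i in range(len(data) - seq_len):
        X.append(data[i:i+seq_len])         # 60 days of input
        y.append(data[i+seq_len])            # the following day as the target
    return np.array(X), np.array(y)

# Normalization (MinMaxScaler is common for time series)
# `prices` is assumed to be a 1-D NumPy array of closing prices loaded elsewhere.
from sklearn.preprocessing import MinMaxScaler
split = int(len(prices) * 0.8)                  # chronological 80/20 split
scaler = MinMaxScaler()
scaler.fit(prices[:split].reshape(-1, 1))       # fit on the training span only to avoid leakage
scaled = scaler.transform(prices.reshape(-1, 1))
X, y = create_sequences(scaled, seq_len=60)
X = torch.FloatTensor(X)  # (samples, 60, 1)
y = torch.FloatTensor(y)  # (samples, 1)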

5-3. Text Sentiment Analysis with an LSTM

# Model for token-id sequences (tokenization and vocabulary building happen beforehand)
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, n_layers=2, dropout=0.3):
        super().__init__()
        # Embedding: token id -> vector (id 0 is reserved for padding)
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # LSTM
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers,
                             batch_first=True, dropout=dropout, bidirectional=True)
        # Attention scoring layer
        self.attention = nn.Linear(hidden_dim * 2, 1)
        # Classifier
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_dim * 2, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()  # binary classification; pair with nn.BCELoss
        )

    def forward(self, x):
        embedded = self.embedding(x)           # (batch, seq, embed)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq, hidden*2)
        # Attention pooling; padding positions (token id 0) are masked out.
        # Assumes every sequence contains at least one non-padding token.
        scores = self.attention(lstm_out)      # (batch, seq, 1)
        scores = scores.masked_fill((x == 0).unsqueeze(-1), float("-inf"))
        attn_weights = torch.softmax(scores, dim=1)
        context = (lstm_out * attn_weights).sum(dim=1)  # weighted average over time
        return self.classifier(context)

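Ch 06. Transformer & the Attention Mechanism

The core of modern AI: the architecture behind GPT, BERT, and ChatGPT

6-1. What Is Attention? "Focus on What Matters"

The attention mechanism lets every position in a sequence directly compute its relevance to every other position. In the Korean sentence "나는 은행에 갔다" ("I went to the 은행"), the word "은행" can mean either "bank" or "ginkgo nut"; attention helps resolve such ambiguity by weighting the surrounding context.

# Scaled dot-product attention
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    """Multi-head self-attention"""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)  # Query
        self.W_k = nn.Linear(d_model, d_model)  # Key
        self.W_v = nn.Linear(d_model, d_model)  # Value
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project, then split into heads: (B, n_heads, T, d_k)
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask == 0 marks positions that must not be attended to;
            # finfo.min stays representable in half precision, unlike -1e9
            scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, V).transpose(1, 2).contiguous()
        out = out.view(B, T, C)                 # merge the heads back together
        return self.out(out)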
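
To make the formula softmax(QK^T / sqrt(d_k)) V concrete without any framework, here is a single-head sketch in plain Python on 2x2 matrices. The helper names and the toy values of Q, K, and V are illustrative assumptions.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))                 # one distribution per query
    output = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
              for row in weights]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention(Q, K, V)
print(weights)   # each row sums to 1; the matching key gets the larger weight
print(output)
```

Each query puts more weight on the key it aligns with, so each output row leans toward the corresponding value vector while still mixing in the other one.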

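6-2. BERT vs GPT Architecture

Aspect	BERT (Encoder)	GPT (Decoder)
Attention direction	Bidirectional (left and right context)	Unidirectional (left context only)
Pretraining objective	MLM (fill in masked tokens)	CLM (predict the next token)
Strengths	Classification, NER, QA	Text generation
Representative models	BERT, RoBERTa, ELECTRA	GPT-2/3/4, ChatGPT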
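
The "attention direction" row comes down to which mask is applied to the attention scores. A minimal sketch, using the convention that 1 means "may attend" and 0 means "blocked"; the function names are illustrative.

```python
def causal_mask(n):
    """GPT-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

The lower-triangular pattern is what prevents a decoder from looking at future tokens during next-token prediction.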

6-3. Positional Encoding

# Self-attention is order-agnostic on its own -> positional encoding injects position information
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)      # assumes an even d_model
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sin
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cos
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model); saved with the model but not trained

    def forward(self, x):
        return self.dropout(x + self.pe[:, :x.size(1)])

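Ch 07. Transfer Learning & Fine-tuning

Standing on the shoulders of giants: strong performance from little data with pretrained models

7-1. Three Transfer Learning Strategies

Strategy	What is trained	Data size	Best suited for
Feature Extraction	Head only	Very small	Similar domain, scarce data
Fine-tuning (partial)	Head + top few layers	Medium	Slightly different domain
Full Fine-tuning	Entire model	Ample	Very different domain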

7-2. HuggingFace Transformers: The Standard Library in Practice

pip install transformers datasets accelerate

# === Fine-tuning BERT for text classification ===
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset
import numpy as np

# 1. Load the model & tokenizer
model_name = "klue/bert-base"  # Korean BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

# 2. Prepare the dataset
dataset = load_dataset("nsmc")  # Naver Sentiment Movie Corpus; the Hub id may differ depending on your datasets version

def tokenize_fn(examples):
    return tokenizer(
        examples["document"],
        truncation=True,
        max_length=128,
        padding="max_length"
    )

tokenized = dataset.map(tokenize_fn, batched=True)

# 3. Training configuration
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",       # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_dir="./logs"
)

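# 4. Metric function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = (preds == labels).mean()
    return {"accuracy": acc}

# 5. Run training
# NSMC ships only train/test splits, so the test split doubles as the evaluation set here;
# carve a validation split out of train if you need an unbiased final test score.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics
)
trainer.train()

# 6. Inference
import torch
model.eval()
inputs = tokenizer("이 영화 정말 재미있어요!", return_tensors="pt").to(model.device)  # "This movie is really fun!"
with torch.no_grad():
    outputs = model(**inputs)
pred = outputs.logits.argmax(-1).item()
print("Sentiment:", "positive" if pred == 1 else "negative")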

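7-3. PEFT / LoRA: Lightweight Fine-tuning

LoRA (Low-Rank Adaptation) fine-tunes a large model by adding a very small number of trainable parameters while the original weights stay frozen. Because gradients and optimizer state are kept only for the added parameters, training memory can drop substantially; the exact saving depends on the model and setup.

pip install peft

from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                    # rank (smaller means fewer parameters)
    lora_alpha=16,          # scaling factor (commonly 2*r)
    target_modules=["query", "value"],  # which layers receive adapters
    lora_dropout=0.1,
    bias="none"
)

# Apply LoRA
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Example output (exact counts depend on the model and PEFT version;
# with TaskType.SEQ_CLS the classification head is typically trainable as well):
# trainable params: 294,912 || all params: 110,108,160 || trainable%: 0.27%
# -> only about 0.27% of the parameters are trained; the rest stay frozen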
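
The trainable-parameter figure can be reproduced by hand. A LoRA adapter on a d_in x d_out weight adds two matrices, A (r x d_in) and B (d_out x r), so r * (d_in + d_out) parameters. The sketch below assumes a BERT-base-sized encoder (hidden size 768, 12 layers) with adapters on the query and value projections; those dimensions are assumptions about the model rather than values read from it.

```python
def lora_params(d_in, d_out, r):
    """Extra parameters from one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

hidden = 768            # assumed BERT-base hidden size
layers = 12             # assumed number of encoder layers
modules_per_layer = 2   # query and value projections
r = 8

adapter_total = layers * modules_per_layer * lora_params(hidden, hidden, r)
fraction = adapter_total / 110_108_160   # total parameter count quoted in the example output
print(adapter_total, f"{fraction:.2%}")  # 294912 0.27%
```

Doubling r doubles the adapter size, which is the main knob for trading capacity against memory.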

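Ch 08. Regularization & Optimization in Depth

Helping the model train well: from preventing overfitting to learning-rate scheduling

8-1. Strategies for Preventing Overfitting

Technique	Principle	Code
Dropout	Randomly deactivates neurons during training	nn.Dropout(p=0.5)
L2 regularization	Penalizes large weights	weight_decay=1e-4
Early Stopping	Stops when validation loss stops improving	EarlyStopping(patience=5)
Data Augmentation	Artificially enlarges the training data	ImageDataGenerator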
8-2. Batch Normalization

import torch.nn as nn

# BatchNorm: normalizes the activations of each mini-batch
# Benefits: more stable training, tolerance of higher learning rates, mild regularization

# Convolutional layers: BatchNorm2d
nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1),
    nn.BatchNorm2d(128),   # placed after the convolution
    nn.ReLU()
)

# Linear layers: BatchNorm1d
nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU()
)

# LayerNorm: the usual choice in Transformers
d_model = 512              # example model width
nn.LayerNorm(d_model)

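8-3. Optimizers and How to Choose

Optimizer	Characteristics	Recommended for
Adam	Adaptive learning rates, fast convergence	Default choice (most situations)
AdamW	Adam with decoupled weight decay	Transformers (BERT, GPT)
SGD + momentum	Often generalizes well	Final training of image classifiers
Lion	Google, 2023; memory-efficient	Training large models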

8-4. Learning-Rate Scheduling: Adjusting the Learning Rate During Training

import torch.optim as optim

# Assumes `model`, `train_loader`, and a `compute_loss(batch)` helper are defined elsewhere.
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# The three schedulers below are alternatives: attach only ONE to a given optimizer.

# 1. CosineAnnealingLR: decays along a cosine curve
scheduler1 = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# 2. OneCycleLR: warmup followed by cosine decay (a strong default in practice)
scheduler2 = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,
    epochs=50,
    steps_per_epoch=len(train_loader)
)

# 3. Warmup + linear decay (the Transformer standard)
from transformers import get_linear_schedule_with_warmup
scheduler3 = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,    # the learning rate rises for the first 500 steps
    num_training_steps=5000  # then decays linearly to 0 at step 5000
)

# Usage inside the training loop
for batch in train_loader:
    loss = compute_loss(batch)
    loss.backward()
    optimizer.step()
    scheduler2.step()  # per batch for OneCycleLR
    optimizer.zero_grad()
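
The warmup-then-linear-decay shape can be written as a plain function of the step number, which makes it easy to inspect before wiring it into a trainer. This is a standalone sketch of that shape, not the library implementation; the function name and sample values are illustrative.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr, warmup, total = 1e-3, 500, 5000
schedule = {s: linear_warmup_decay(s, base_lr, warmup, total)
            for s in (0, 250, 500, 2750, 5000)}
for s, lr in schedule.items():
    print(f"step {s:>4}: lr = {lr:.6f}")
```

The rate peaks exactly at the end of warmup (step 500 here) and reaches zero at the final step, so `num_training_steps` should match the real number of optimizer steps.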

Ch 09. Hands-on Projects: MNIST · CIFAR-10 · Sentiment Analysis

An end-to-end deep learning pipeline: from data preparation to model deployment

9-1. Project 1: MNIST Handwritten Digit Recognition (Target: 99%+ Accuracy)

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# === Data preprocessing ===
transform_train = transforms.Compose([
    transforms.RandomRotation(10),           # augmentation: rotation
    transforms.RandomAffine(0, translate=(0.1, 0.1)),  # augmentation: translation
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST("./data", train=True, download=True, transform=transform_train)
test_dataset  = datasets.MNIST("./data", train=False, transform=transform_test)
train_loader  = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
test_loader   = DataLoader(test_dataset,  batch_size=256, shuffle=False)

# === CNN model ===
class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout2d(0.25),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout2d(0.25)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128*7*7, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MNISTNet().to(device)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3,
                                           epochs=15, steps_per_epoch=len(train_loader))
criterion = nn.CrossEntropyLoss()

# Training
for epoch in range(15):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

# Final evaluation
model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        preds = model(images).argmax(1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {100*correct/total:.2f}%")  # target: 99.4%+

9-2. Project 2: Korean Sentiment Analysis (BERT Fine-tuning)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import Dataset, DataLoader
import pandas as pd

class NSMCDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=128):
        self.texts  = df["document"].tolist()
        self.labels = df["label"].tolist()
        self.tokenizer = tokenizer
        self.max_len   = max_len

    def __len__(self): return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            str(self.texts[idx]),
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        return {
            "input_ids":      enc["input_ids"].squeeze(),
            "attention_mask": enc["attention_mask"].squeeze(),
            "labels":         torch.tensor(self.labels[idx], dtype=torch.long)
        }

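# Load the model
# Note: snunlp/KR-FinBert-SC is a finance-domain sentiment model whose label set may not
# match num_labels=2, so klue/bert-base (from section 7-2) is used here instead. Its
# classification head starts randomly initialized: fine-tune on NSMC before trusting predictions.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=2)
model.to(device).eval()

# Inference example
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=128,
                        truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    label = probs.argmax().item()
    conf  = probs.max().item()
    sentiment = "positive" if label == 1 else "negative"
    print(f"[{sentiment}] confidence: {conf:.2%}")

predict_sentiment("오늘 개봉한 영화 완전 강추합니다!")  # "I highly recommend the movie that opened today!"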

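9-3. GPU Usage & Performance Optimization Tips

# === GPU optimization tips used in practice ===
# Assumes model, optimizer, criterion, train_loader, dataset, and device are already defined.

# 1. Mixed precision training (often a large speedup on GPUs with Tensor Cores)
scaler = torch.amp.GradScaler("cuda")   # PyTorch 2.3+; older versions: torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # run eligible ops in FP16
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)          # unscale before clipping so the threshold is meaningful
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

# 2. Multiple GPUs (DataParallel is the quickest to try; DistributedDataParallel is recommended for serious use)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# 3. torch.compile (PyTorch 2.0+; the speedup varies by model and hardware)
model = torch.compile(model)

# 4. Tuning pin_memory + num_workers
loader = DataLoader(dataset, batch_size=256,
    num_workers=4,   # match to the available CPU cores
    pin_memory=True, # faster host-to-GPU transfer
    prefetch_factor=2
)

# 5. Gradient accumulation (large effective batch on a small GPU)
accum_steps = 4  # effective batch = batch_size * 4
optimizer.zero_grad()
for i, (images, labels) in enumerate(train_loader):
    images, labels = images.to(device), labels.to(device)
    loss = criterion(model(images), labels) / accum_steps   # scale so the accumulated gradient is an average
    loss.backward()
    if (i+1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()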

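Ch 10. Top 10 Interview Q&A

Core questions that come up in deep learning developer interviews, with model answers

Q1. What is the vanishing gradient problem, and how do you address it?

With saturating activations such as Sigmoid or Tanh, the derivatives are small (at most 0.25 for Sigmoid), so the gradient shrinks toward 0 as it is multiplied back through many layers. Remedies: ReLU-family activations, Batch Normalization, residual (skip) connections as in ResNet, careful weight initialization (He/Xavier), and gated cells (LSTM/GRU) in recurrent networks. Gradient clipping addresses the opposite problem, exploding gradients.

Q2. What does Batch Normalization do?

It was originally motivated as a fix for internal covariate shift (changes in the distribution of each layer's inputs during training); later work suggests much of its benefit comes from smoothing the optimization landscape. In practice it stabilizes training, permits higher learning rates, adds a mild regularization effect that can complement or partly replace Dropout, and speeds up convergence. At inference time it uses running averages of the mean and variance collected during training.

Q3. How does Dropout prevent overfitting?

During training, each neuron is deactivated at random with probability p, which produces an ensemble-like effect and forces the network not to rely on any particular neuron. At inference time all neurons are used. In the original formulation the outputs are then multiplied by (1-p) to match the expected value; modern frameworks instead use inverted dropout, scaling the surviving activations by 1/(1-p) during training so that inference needs no adjustment.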
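
Frameworks such as PyTorch implement the "inverted" form of dropout: surviving activations are scaled by 1/(1-p) during training, so the expected activation is unchanged and inference uses the layer as an identity. The sketch below checks that property empirically in plain Python; the function name, seed, and sample size are illustrative.

```python
import random

def inverted_dropout(values, p, rng):
    """Zero each value with probability p; scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)                      # fixed seed for repeatability
activations = [1.0] * 100_000
dropped = inverted_dropout(activations, p=0.3, rng=rng)

mean_train = sum(dropped) / len(dropped)    # stays close to the original mean of 1.0
zero_fraction = dropped.count(0.0) / len(dropped)
print(f"mean after dropout: {mean_train:.3f}, fraction zeroed: {zero_fraction:.3f}")
```

Because the mean is preserved during training, switching to evaluation mode simply disables the masking with no further rescaling.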

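Q4. How does the Adam optimizer work?

It combines Momentum (first moment) with RMSProp (second moment). With gradient g: m_t = beta1*m_{t-1} + (1-beta1)*g and v_t = beta2*v_{t-1} + (1-beta2)*g^2. After bias correction, m_hat = m_t / (1 - beta1^t) and v_hat = v_t / (1 - beta2^t), the parameter update is theta <- theta - lr * m_hat / (sqrt(v_hat) + eps).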
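
The update rule can be traced on a one-dimensional problem. This sketch applies Adam by hand to f(x) = x^2, whose gradient is 2x; the function name, starting point, learning rate, and step count are illustrative choices.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

x, m, v = 5.0, 0.0, 0.0
x_after_first = None
for t in range(1, 3001):
    grad = 2.0 * x                                # gradient of f(x) = x^2
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
    if t == 1:
        x_after_first = x
print(x_after_first, x)
```

On the very first step the bias-corrected ratio m_hat / sqrt(v_hat) equals the sign of the gradient, so the parameter moves by almost exactly lr regardless of the gradient's magnitude.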

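Q5. What is the role of pooling in a CNN? Max vs Average?

Pooling has no learnable parameters of its own. It shrinks the spatial size of the feature maps, which reduces computation (and the parameter count of any dense layers that follow) and gives a degree of invariance to small translations. Max pooling keeps the strongest responses (useful for classification); average pooling preserves smoother features and is preferred in its global form. Modern architectures tend to use GAP (Global Average Pooling) before the classifier.

Q6. How does Self-Attention differ from an RNN?

RNN: processes tokens sequentially, so time steps cannot be parallelized, and long-range dependencies are hard to capture. Self-Attention: compares all positions at once, so any two positions are connected by an O(1) path, long-range dependencies are captured directly, and the computation parallelizes well. Its drawback is O(n^2) compute and memory in the sequence length.

Q7. What is the difference between Transfer Learning and Fine-tuning?

Transfer learning is the general idea of reusing a pretrained model for a new task, and fine-tuning is one way of doing it. Feature extraction: freeze the pretrained weights and train only a new head. Fine-tuning: retrain some or all layers with a low learning rate. Choose between them based on data size and domain similarity.

Q8. How do you set the learning rate?

LR Finder: increase the learning rate from a small value and pick one from the region where the loss falls fastest, somewhat below the point where the loss is lowest or starts to diverge. Start from published defaults (Adam: 1e-3, BERT fine-tuning: 2e-5). Grid or random search. Adjust dynamically with a warmup + cosine decay scheduler. OneCycleLR is a strong choice in practice.

Q9. Describe methods for making models lighter.

Pruning: remove low-importance weights. Quantization: convert FP32 to INT8 (about 4x smaller). Knowledge distillation: transfer knowledge from a large model to a small one. LoRA: cuts the cost of fine-tuning through low-rank updates, although the deployed model itself does not get smaller. TFLite/ONNX conversion optimizes the model for deployment.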
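
The 4x figure for quantization follows from storing each weight in 8 bits instead of 32. A minimal sketch of symmetric per-tensor INT8 quantization in plain Python shows the round trip and its error bound; the function names and sample weights are illustrative, and the input is assumed to contain at least one non-zero value.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0   # assumes a non-zero maximum
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The rounding error per weight is at most half the scale, which is why tensors with a few large outliers quantize poorly: the outliers inflate the scale for every other value.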

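Q10. What should you consider when deploying a deep learning model?

The trade-off between inference speed and accuracy. Model serialization (ONNX, TorchScript, TF SavedModel). Version management (MLflow, DVC). Monitoring (detecting data drift and performance degradation). A/B testing. A rollback strategy.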

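Completion Checklist

Ch01~Ch03 Fundamentals

☐ Understand the perceptron and activation functions
☐ Implement backpropagation and gradient descent by hand
☐ Be fluent with the Keras Sequential/Functional APIs
☐ Write a PyTorch training loop from scratch

Ch04~Ch06 Advanced

☐ Classify CIFAR-10 with a CNN at 90%+
☐ Process time series and text with an LSTM
☐ Implement Self-Attention in code yourself
☐ Be able to explain the BERT and GPT architectures

Ch07~Ch08 Practical Techniques

☐ Fine-tune with the HuggingFace Trainer
☐ Apply LoRA for lightweight fine-tuning
☐ Apply mixed precision training
☐ Use the right learning-rate scheduler for the job

Ch09~Ch10 Projects

☐ Reach 99%+ accuracy on MNIST
☐ Build a Korean sentiment analysis service
☐ Answer all 10 interview questions with confidence
☐ Upload a portfolio project to GitHub

Guide 05 Complete!

Deep Learning: from the principles of neural networks to production deployment

If you have completed this guide, you should be able to design, train, and deploy deep learning models yourself.
The next guide covers generative AI development, including the ChatGPT API, LangChain, and RAG.

Next Step

AiDevGuide0006

Generative AI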