
Transformer 1

Attention is all you need

  • No longer uses RNN or CNN modules
  • In RNNs, information from a distant time step has to pass through as many RNN modules as the number of time steps separating it from the current step before it reaches the current hidden state, so long-range information is easily diluted or lost.

Hash Table T

| Key (animal) | Value (number of legs) |
| --- | --- |
| dog | 4 |
|  | 2 |
| octopus | 8 |
| squid | 10 |
| cat | 4 |

T(“dog”) = 4, T(“lion”) = “Error: Key is not found”
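For comparison, a plain Python dict behaves exactly like this hash table: a lookup succeeds only on an exact key match and fails otherwise. A minimal sketch (the entries mirror the table above; the row with the missing key is omitted):

```python
# Exact-match lookup: a hash table returns a value only for keys it has seen.
T = {"dog": 4, "octopus": 8, "squid": 10, "cat": 4}

print(T["dog"])        # 4
print(T.get("lion"))   # None: "lion" is not a key, so there is no partial match
# T["lion"] would raise KeyError: the lookup is all-or-nothing.
```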

Attention is a Hash table with Soft matching

  • Find and utilize similar keys even if they don’t exactly match the query
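A minimal sketch of this idea, assuming a user-supplied similarity function `sim(query, key)` (hypothetical here; actual attention scores keys with vector inner products): instead of returning one value or an error, every value contributes in proportion to how well its key matches the query.

```python
import math

def soft_lookup(query, table, sim):
    """Return a similarity-weighted average of the table's values."""
    keys, values = list(table.keys()), list(table.values())
    # Score every key against the query, even if none matches exactly.
    scores = [sim(query, k) for k in keys]
    # Softmax turns the scores into positive weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # The output mixes all values instead of picking exactly one.
    return sum(w * v for w, v in zip(weights, values))
```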

Self-Attention (Dot-Product Attention)

Input: one query vector q and multiple key-value pairs (k, v)
Output: a weighted average of the value vectors, where each value's weight comes from the inner product of the query with its corresponding key, normalized with softmax
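A minimal NumPy sketch of this computation for a single query (the function name and example dimensions are my own choices, not from the source):

```python
import numpy as np

def dot_product_attention(q, K, V):
    """q: (d_k,) query; K: (n, d_k) keys; V: (n, d_v) values."""
    scores = K @ q                           # inner product of q with every key, shape (n,)
    weights = np.exp(scores - scores.max())  # softmax over the n scores
    weights /= weights.sum()
    return weights @ V                       # weighted average of value vectors, shape (d_v,)

# Example with 3 key-value pairs of dimension 4.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(dot_product_attention(q, K, V))
```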

Scaled Dot-Product Attention

Problem

  • As d_k gets larger, the variance of the dot product q·k increases (if each component of q and k has zero mean and unit variance, the dot product's variance is roughly d_k)
    • Some of the softmax inputs are then likely to be unusually large
    • The softmax output becomes heavily concentrated on one value
    • The gradient through the softmax becomes very small.
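The remedy used in "Attention is all you need" is to divide the scores by √d_k, which keeps their variance roughly constant as d_k grows. A minimal sketch extending the function above (shapes and the variance check are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k) queries; K: (n, d_k) keys; V: (n, d_v) values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # scaling keeps score variance near 1
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (m, d_v)

# With unit-variance inputs, unscaled scores have variance ~ d_k,
# while scaled scores stay near 1 even for large d_k.
rng = np.random.default_rng(0)
d_k = 512
Q, K = rng.normal(size=(1000, d_k)), rng.normal(size=(1000, d_k))
S = Q @ K.T
print(S.var(), (S / np.sqrt(d_k)).var())  # roughly 512 vs roughly 1
```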
