Deep L-layer Neural network
What is a deep neural network?
A neural network with multiple hidden layers is called a deep neural network.
Notation
- Number of layers: $L = 4$
- Number of nodes (units) in layer $l$: $n^{[l]}$
- Activations in layer $l$: $a^{[l]}$
- $a^{[l]}=g^{[l]}(z^{[l]})$
- Weights used to compute $z^{[l]}$: $w^{[l]}$
Forward Propagation in a Deep Network
Forward Propagation
$$z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}
\\ a^{[l]}=g^{[l]}(z^{[l]})$$
- Not vectorized (layer by layer, for $L=4$):
- $z^{[1]}=w^{[1]}x+b^{[1]}=w^{[1]}a^{[0]}+b^{[1]}$
- $a^{[1]}=g^{[1]}(z^{[1]})$
- $z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}$
- $a^{[2]}=g^{[2]}(z^{[2]})$
- ...
- $z^{[4]}=w^{[4]}a^{[3]}+b^{[4]}$
- $\hat y = a^{[4]}=g^{[4]}(z^{[4]})$
- Vectorized (note the dimensions), for $l = 1$ to $4$ (see the sketch below):
- $Z^{[l]}=W^{[l]}A^{[l-1]}+b^{[l]}$
- $A^{[l]}=g^{[l]}(Z^{[l]})$
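A minimal numpy sketch of this vectorized loop, assuming ReLU for the hidden layers and a sigmoid output (the notes leave $g^{[l]}$ generic); the dicts `W` and `b`, the layer sizes `n`, and the helper name `forward_propagation` are illustrative choices, not from the lecture.

```python
import numpy as np

def forward_propagation(X, W, b, L):
    """Vectorized forward pass: A^[0] = X, then Z^[l] = W^[l] A^[l-1] + b^[l], A^[l] = g^[l](Z^[l])."""
    A = X                                                        # A^[0]: shape (n^[0], m)
    for l in range(1, L + 1):
        Z = W[l] @ A + b[l]                                      # b[l] broadcasts over the m columns
        A = np.maximum(0, Z) if l < L else 1 / (1 + np.exp(-Z))  # ReLU hidden layers, sigmoid output
    return A                                                     # A^[L] = Y_hat, shape (n^[L], m)

# Example: layer sizes n = [2, 3, 1] (so L = 2) and m = 4 training examples.
rng = np.random.default_rng(0)
n = [2, 3, 1]
W = {l: rng.standard_normal((n[l], n[l - 1])) * 0.01 for l in (1, 2)}
b = {l: np.zeros((n[l], 1)) for l in (1, 2)}
X = rng.standard_normal((n[0], 4))
Y_hat = forward_propagation(X, W, b, L=2)                        # shape (1, 4)
```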
Getting your matrix dimensions right
Parameters W and b
$$z^{[1]}=w^{[1]}x+b^{[1]}
\\ (3,1)=(3,2)\cdot(2,1)+(3,1)
\\ (n^{[1]},1)=(n^{[1]},n^{[0]})\cdot(n^{[0]},1)+(n^{[1]},1)$$
$w^{[l]}:(n^{[l]},n^{[l-1]})$
$b^{[l]}:(n^{[l]},1)$
$dw^{[l]}:(n^{[l]},n^{[l-1]})$
$db^{[l]}:(n^{[l]},1)$
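As a sketch only: the dimension rules above can be enforced directly when initializing parameters. The name `layer_dims` (the list $[n^{[0]},\dots,n^{[L]}]$) and the scaling factor 0.01 are illustrative assumptions.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """W^[l] gets shape (n^[l], n^[l-1]) and b^[l] gets shape (n^[l], 1), matching the rules above."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_parameters([2, 3, 1])       # n^[0]=2, n^[1]=3, n^[2]=1
assert params["W1"].shape == (3, 2) and params["b1"].shape == (3, 1)
```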
Vectorized implementation
$$Z^{[1]}=W^{[1]}X+b^{[1]}
\\ (n^{[1]},m)=(n^{[1]},n^{[0]})\cdot(n^{[0]},m)+(n^{[1]},m)$$
The training examples are stacked as columns to form the input matrix $X$, which vectorizes the computation.
$b^{[l]}$ has shape $(n^{[l]},1)$; broadcasting automatically replicates its column across all $m$ examples.
$Z^{[l]},A^{[l]} :(n^{[l]},m)$
$dZ^{[l]},dA^{[l]} :(n^{[l]},m)$
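A small shape check of the vectorized step, assuming $n^{[0]}=2$, $n^{[1]}=3$, $m=4$; it confirms that adding the $(n^{[1]},1)$ bias to the $(n^{[1]},m)$ product broadcasts across the columns as described.

```python
import numpy as np

n0, n1, m = 2, 3, 4
X = np.random.randn(n0, m)         # m training examples stacked as columns
W1 = np.random.randn(n1, n0)       # (n^[1], n^[0])
b1 = np.zeros((n1, 1))             # (n^[1], 1); broadcasting copies it across the m columns

Z1 = W1 @ X + b1
print(Z1.shape)                    # (3, 4), i.e. (n^[1], m)
```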
Why deep representations?
Circuit theory and deep learning
There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.
If you use a shallower network instead, the number of hidden units required can grow exponentially. For example, computing the XOR (parity) of $n$ inputs needs only $O(\log n)$ depth with a small tree of XOR units, but a network with a single hidden layer needs on the order of $2^{n}$ units.
Building blocks of Deep Neural networks
- Forward propagation output: $a^{[l]}=g^{[l]}(z^{[l]})$
- $z^{[l]}=w^{[l]}a^{[l-1]}+b^{[l]}$, cache $z^{[l]}$
- Backward propagation output: $da^{[l-1]}, dw^{[l]}, db^{[l]}$
- Inputs: $da^{[l]}$, cached $z^{[l]}$
Forward and backward propagation
Forward propagation
Input : $a^{[l-1]}$
Output : $a^{[l]}, cache \ z^{[l]}$
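A sketch of this forward block, assuming a ReLU activation for concreteness; the cache also keeps $a^{[l-1]}$ and $w^{[l]}$ because the backward block below needs them.

```python
import numpy as np

def forward_block(A_prev, W, b):
    """One layer's forward step: z^[l] = W a^[l-1] + b, a^[l] = g(z^[l]); cache values for backprop."""
    Z = W @ A_prev + b
    A = np.maximum(0, Z)           # ReLU chosen here; the notes keep g^[l] generic
    cache = (A_prev, W, Z)         # cached for the backward block
    return A, cache
```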
Backward propagation
Input : $da^{[l]}$
Output : $da^{[l-1]},dW^{[l]},db^{[l]}$
$$dz^{[l]}=da^{[l]} * g^{[l]'}(z^{[l]})
\\ dw^{[l]}=dz^{[l]} \, a^{[l-1]T}
\\ db^{[l]}=dz^{[l]}
\\ da^{[l-1]}=w^{[l]T} \, dz^{[l]}$$
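A matching sketch of the backward block, implementing the four formulas above; the ReLU derivative `(Z > 0)` matches the activation assumed in the forward block, and the single-example form $db^{[l]}=dz^{[l]}$ is used (a batched version would sum or average over the columns).

```python
def backward_block(dA, cache):
    """Given da^[l] and the cached values, return da^[l-1], dw^[l], db^[l]."""
    A_prev, W, Z = cache
    dZ = dA * (Z > 0)              # dz^[l] = da^[l] * g^[l]'(z^[l]), with ReLU's derivative
    dW = dZ @ A_prev.T             # dw^[l] = dz^[l] a^[l-1]T
    db = dZ                        # db^[l] = dz^[l] (sum over columns when batched)
    dA_prev = W.T @ dZ             # da^[l-1] = w^[l]T dz^[l]
    return dA_prev, dW, db
```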
Parameters vs Hyperparameters
- Hyperparameters are the values that determine the parameters $w$ and $b$:
- number of iterations
- number of hidden units $n^{[l]}$
- choice of activation function
- number of hidden layers $L$
- Learning Rate $\alpha$
- Try various hyperparameter values and pick the setting where the cost function is lowest (a minimal sketch follows below).
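A minimal sketch of that search over one hyperparameter (the learning rate); `train` and `compute_cost`, along with the train/dev arrays, are hypothetical stand-ins for a full training loop.

```python
# Hypothetical helpers: train(X, Y, alpha) returns learned parameters,
# compute_cost(params, X, Y) returns the cost on a held-out set.
candidate_alphas = [0.001, 0.003, 0.01, 0.03, 0.1]

best_alpha, best_cost = None, float("inf")
for alpha in candidate_alphas:
    params = train(X_train, Y_train, alpha)        # hypothetical training routine
    cost = compute_cost(params, X_dev, Y_dev)      # evaluate the cost for this alpha
    if cost < best_cost:
        best_alpha, best_cost = alpha, cost
```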
What does this have to do with the brain?