
[Andrew Ng] Neural Network and Deep Learning : 3. One hidden layer Neural Network

폴밴 2021. 10. 14. 15:20

Neural Networks Overview

What is a Neural Network?

A neural network is similar to logistic regression, but with several hidden layers of units connected together.

Backward propagation (the backward calculation) is used to compute the partial derivative of the cost function with respect to each parameter.
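As a reference point, a single logistic-regression unit computes

$$\hat y = a = \sigma(w^T x + b), \quad \sigma(z) = \frac 1 {1+e^{-z}}$$

and a neural network stacks layers of such units, repeating this computation layer by layer.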

Neural Network Representation

  • The superscript in $a^{[i]}$ denotes the $i$-th layer.
  • The subscript in $a_n$ denotes the $n$-th unit within a layer.

$$a^{[0]} = X, \quad a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ \vdots \\ a_n^{[1]} \end{bmatrix}, \quad a^{[2]} = \hat y$$

$W^{[1]} : (4,3), \quad b^{[1]} : (4,1)$

$W^{[2]} : (1,4), \quad b^{[2]} : (1,1)$
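As a quick check of these shapes, a minimal numpy sketch (the layer sizes $n^{[0]}=3$, $n^{[1]}=4$, $n^{[2]}=1$ and the variable names are illustrative assumptions matching the example network above):

import numpy as np

n0, n1, n2 = 3, 4, 1                    # layer sizes: input, hidden, output
W1 = np.random.randn(n1, n0) * 0.01     # shape (4, 3)
b1 = np.zeros((n1, 1))                  # shape (4, 1)
W2 = np.random.randn(n2, n1) * 0.01     # shape (1, 4)
b2 = np.zeros((n2, 1))                  # shape (1, 1)
print(W1.shape, b1.shape, W2.shape, b2.shape)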

Computing a Neural Network's Output

$$z_1^{[1]}=w_1^{[1]T}x+b_1^{[1]}, \quad a_1^{[1]}=\sigma(z_1^{[1]}) \\
z_2^{[1]}=w_2^{[1]T}x+b_2^{[1]}, \quad a_2^{[1]}=\sigma(z_2^{[1]}) \\
z_3^{[1]}=w_3^{[1]T}x+b_3^{[1]}, \quad a_3^{[1]}=\sigma(z_3^{[1]}) \\
z_4^{[1]}=w_4^{[1]T}x+b_4^{[1]}, \quad a_4^{[1]}=\sigma(z_4^{[1]})$$

These equations can be vectorized by stacking them as follows.

$$z^{[1]}=\begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \\ w_4^{[1]T} \end{bmatrix}
\cdot
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
+
\begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}
=
\begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix}$$

Given an input $x$, the network's output is computed as:

$$z^{[1]}=W^{[1]}a^{[0]}+b^{[1]} \\
a^{[1]}=\sigma(z^{[1]}) \\
z^{[2]}=W^{[2]}a^{[1]}+b^{[2]} \\
a^{[2]}=\sigma(z^{[2]})$$
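A minimal numpy sketch of this single-example forward pass (assuming the 3-4-1 network above with sigmoid activations in both layers; variable names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(3, 1)                                # one input example, shape (3, 1)
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

a0 = x                       # a[0] = x
z1 = W1 @ a0 + b1            # z[1] = W[1] a[0] + b[1], shape (4, 1)
a1 = sigmoid(z1)             # a[1] = sigma(z[1])
z2 = W2 @ a1 + b2            # z[2] = W[2] a[1] + b[2], shape (1, 1)
a2 = sigmoid(z2)             # a[2] = y_hat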

Vectorizing across multiple examples

$a^{[1](i)}$ : the layer-1 activation for the $i$-th training example.

For $m$ training examples, loop over $i = 1$ to $m$:

$$z^{[1](i)}=W^{[1]}a^{[0](i)}+b^{[1]} \\
a^{[1](i)}=\sigma(z^{[1](i)}) \\
z^{[2](i)}=W^{[2]}a^{[1](i)}+b^{[2]} \\
a^{[2](i)}=\sigma(z^{[2](i)})$$

$$X=\begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}, \quad
Z^{[1]}=\begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{bmatrix}, \quad
A^{[1]}=\begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \end{bmatrix}$$

After vectorization, in each of these matrices:
column direction (vertical): moving down a column stays within the same training example (different hidden units)
row direction (horizontal): moving across a row stays within the same hidden unit (different training examples)
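The same forward pass vectorized over $m$ examples, as a sketch (assuming $X$ has shape $(3, m)$ with one example per column; numpy broadcasting adds $b$ to every column):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 5
X = np.random.randn(3, m)                                # m examples stacked as columns
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

Z1 = W1 @ X + b1             # shape (4, m); b1 is broadcast across columns
A1 = sigmoid(Z1)             # shape (4, m)
Z2 = W2 @ A1 + b2            # shape (1, m)
A2 = sigmoid(Z2)             # shape (1, m); one y_hat per column (per example)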

Explanation for vectorized implementation

Justification for vectorized implementation

For simplicity, assume $b = 0$. Then for each example

$z^{[1](i)}=W^{[1]}x^{(i)}$

and stacking the $x^{(i)}$ as columns of $X$ yields the vectorized form $Z^{[1]}=W^{[1]}X$.
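A quick sanity check of this claim, reusing W1, X, and m from the sketch above: computing $W^{[1]}x^{(i)}$ for each example in a loop and stacking the results as columns gives exactly the matrix product $W^{[1]}X$.

Z1_loop = np.column_stack([W1 @ X[:, i] for i in range(m)])   # one column per example
assert np.allclose(Z1_loop, W1 @ X)                           # identical to the vectorized product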

Activation functions

In practice, the sigmoid is rarely used as the activation function; tanh or ReLU (Rectified Linear Unit) is generally used instead.

Why do you need non-linear activation functions?

If a linear activation function is used, then no matter how many hidden layers the input passes through, $a$ remains a linear function of the input, so the network behaves as if it had only a single layer.

  • The purpose of the activation function is to introduce non-linearity into the network.
  • In turn, this allows you to model a response variable (aka target variable, class label, or score) that varies non-linearly with its explanatory variables.
  • Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line; the word for this is affine).
  • Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it had, would behave just like a single-layer perceptron, because summing these layers would give you just another linear function (see the definition just above).

(https://stackoverflow.com/questions/9782071/why-must-a-nonlinear-activation-function-be-used-in-a-backpropagation-neural-net)
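To see this concretely: with a linear (identity) activation, two layers collapse into a single linear transformation. A minimal sketch with small random matrices:

import numpy as np

W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(1, 4), np.random.randn(1, 1)
x = np.random.randn(3, 1)

a1 = W1 @ x + b1                          # "hidden layer" with identity activation
a2 = W2 @ a1 + b2                         # output layer, also linear

W_eq = W2 @ W1                            # equivalent single-layer weights
b_eq = W2 @ b1 + b2                       # equivalent single-layer bias
assert np.allclose(a2, W_eq @ x + b_eq)   # same output from one linear layer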

Derivative of activation functions

  • Sigmoid function

$$g(z) = \frac 1 {1+e^{-z}}, \quad g'(z)=g(z)\,(1-g(z))$$

  • tanh function

$$g(z)=\tanh(z)=\frac {e^{z} - e^{-z}} {e^{z}+e^{-z}}, \quad g'(z)=1-\tanh^2(z)$$

  • ReLU / Leaky ReLU function

$$g(z)=\max(0,z), \quad g'(z)=\begin{cases} 0 & \text{if } z<0 \\ 1 & \text{if } z>0 \end{cases}$$

For Leaky ReLU, $g(z)=\max(0.01z,\,z)$, with derivative $0.01$ for $z<0$ and $1$ for $z>0$.
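The same functions and derivatives written out as a small numpy sketch (for ReLU, the derivative at exactly $z=0$ is taken as 0 here by convention):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):                # g'(z) = g(z)(1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):                   # g'(z) = 1 - tanh(z)^2  (np.tanh is g itself)
    return 1.0 - np.tanh(z) ** 2

def relu(z):                     # g(z) = max(0, z)
    return np.maximum(0.0, z)

def d_relu(z):                   # g'(z) = 1 if z > 0 else 0
    return (z > 0).astype(float)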

Gradient descent for Neural Networks

Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$

Cost function: $J(W^{[1]},b^{[1]},W^{[2]},b^{[2]})=\frac 1 m \sum_{i=1}^m L(\hat y^{(i)},y^{(i)})$

Layer sizes: $n^{[0]} = n_x, \ n^{[1]}, \ n^{[2]} \ (=1)$

  • Forward propagation
  • Back propagation

Backpropagation intuition (Optional)

Summary of gradient descent
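The lecture summarizes one iteration of gradient descent as: forward propagation, backward propagation, then a parameter update. A minimal numpy sketch of that iteration for the 2-layer network (assumptions: tanh hidden layer, sigmoid output, cross-entropy loss, $m$ examples stacked as columns of X; the function name is illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, Y, W1, b1, W2, b2, lr=0.01):
    m = X.shape[1]
    # forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                         # y_hat, shape (1, m)
    # backward propagation (cross-entropy loss with sigmoid output)
    dZ2 = A2 - Y                             # shape (1, m)
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)     # tanh'(Z1) = 1 - A1^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    return W1, b1, W2, b2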

Random Initialization

What happens if you initialize weights to zero?

If the weights are all initialized to zero (symmetric), every hidden unit computes exactly the same function and receives the same gradient update, so the units remain identical. This reduces the accuracy of the NN.

Random initialization

To avoid this, the initial $w$ values are set randomly, which breaks the symmetry and improves the NN's accuracy.

import numpy as np

w = np.random.randn(2, 2) * 0.01    # randn takes the sizes as separate arguments, not a tuple
b = np.zeros((2, 1))                # the bias can safely be initialized to zero

Note that small weights (the factor 0.01) are used so that $z$ does not become too large; with a sigmoid or tanh activation, large $|z|$ lands on the flat parts of the curve, where gradients are tiny and learning is slow.
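A quick numeric illustration of why the 0.01 factor helps (a sketch, assuming a tanh hidden layer): for large $|z|$ the slope of tanh is almost zero, so the gradients and the weight updates become tiny.

import numpy as np

for z in (0.05, 5.0):
    slope = 1.0 - np.tanh(z) ** 2            # derivative of tanh at z
    print(f"z = {z}: tanh'(z) = {slope:.6f}")
# small z  -> slope near 1: gradients flow, learning proceeds
# large z  -> slope near 0: gradients vanish, learning is very slow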

Source

Neural Networks and Deep Learning

 

www.coursera.org