Coding/Coursera

[Andrew Ng] Machine Learning 정리

폴밴 2021. 9. 1. 09:28

* Notion에서 작성한 내용을 마크다운으로 가져와 작성된 글입니다. KaTeX 수식은 제대로 표기되지 않을 수 있습니다. 강의 내용에 대한 저작권은 Stanford University - Andrew Ng 에게 있습니다.

Supervised Learning

already know what our correct output should look like.
having the idea that there is a relationship between the input and the output.

Regression

continuous output ex) age, price

Classification

discrete output ex) tumor is malignant or benign

Unsupervised Learning

little or no idea what our results should look like.
variables in data → find relaionships → cluster data →derive structure
clustering / non-clustering

Model Representation

$x$ : input variables , input features
$y$ : output , target variables
$(x,y)$ : trainig example
$(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)})...(x^{(i)},y^{(i)})$ : training set
($i=1\ to\ m$, "$i$" is simply index into the training set)
$X$ : space of input vales
$Y$ : space of output values (X = Y = R)
function $h : X \rarr Y$

Cost Function

measure the accuracy of our hypothesis function
takes an average difference of all the results of the hypothesis with actual output y.

$$J(\theta_0,\theta_1)=\frac{1}{2m} \sum_{i=1}^{m} (\hat{y_i}-y_i)^2=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2$$

→ Goal : choose parameter $\theta_0 , \theta_1$ to minimize cost function $J$

Cost Function Intuition 1

assume there is one parameter $\theta_0$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled.png

Cost funtion is 0 when hypothesis is identical to actual output y.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%201.png

Cost function becomes bigger value when hypothesis is different from y.

Cost Function Intuition 2

taking two parameters $\theta_0,\theta_1$
$h_\theta(x)=\theta_0+\theta_1x$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%202.png

contour plot / contour figure

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%204.png

choosing different parameters to minimize cost function (contour plot on the right)

Gradient Descent

Start with some $\theta_0,\theta_1$
Keep changing $\theta_0,\theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%206.png

the slope of the tangent is the derivative at that point.
the size of each step is determined by learnig rate alpha.

Gradient Descent Algorithm

repeat until convergence

$$\theta_j := \theta_j-\alpha \frac {\partial} {\partial \theta_j} J(\theta_0,\theta_1) \\\ (\bold{simultaneously} \ update \ j=0 \ and \ j=1)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%207.png

finding theta which is at local optima

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%208.png

learning rate has to be proper value

Gradient Descent for Linear Regression

Linear Regression Model

$$J(\theta_0,\theta_1)=\frac{1}{2m} \sum_{i=1}^{m} (\hat{y_i}-y_i)^2= \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2$$

Gradient Descent Algorithm

$$\theta_j := \theta_j-\alpha \frac {\partial} {\partial \theta_j} J(\theta_0,\theta_1) \\\ (\bold{simultaneously} \ update \ j=0 \ and \ j=1)$$

$$j=0 \ : \ \frac {\partial}{\partial \theta_0} J(\theta_0,\theta_1)=\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})$$

$$j=1 \ : \ \frac {\partial}{\partial \theta_1} J(\theta_0,\theta_1)=\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$$

repeat until convergence

$$\theta_0 := \theta_0-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \
\theta_1 := \theta_1-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$$

"Batch" Gradient Descent

Each step of gradient descent uses all the training examples.

Gradient Descent for Multiple Variables

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%209.png

$n$ = number of features
$m$ = number of training sets
$x_n^{(m)}$
$x^{(i)}$ = input of $i^{th}$ training example
$x_j^{(i)}$ = value of feature $j$ in $i^{th}$ training example

ex ) $x_2^{(3)}=3$ , $x_1^{(2)}=1416$

$$h_\theta(x) = \theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+...\theta_nx_n$$

$$X=
\begin{bmatrix}
x_0 \
x_1 \
x_2 \
... \
x_n
\end{bmatrix}

\Theta=
\begin{bmatrix}
\theta_0 \
\theta_1 \
\theta_2 \
... \
\theta_n
\end{bmatrix}$$

$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+...\theta_nx_n
= \Theta^TX$$

previously $(n=1)$
$$\theta_0 := \theta_0-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \
\theta_1 := \theta_1-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$$

new algorithm $(n\geq1)$ $(x_0^{(i)}=1)$

  theta = theta - (alpha / m * sum((predictions-y).*X))';

$$\theta_0 := \theta_0-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_0^{(i)} \\
\theta_1 := \theta_1-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_1^{(i)} \\
\theta_2 := \theta_2-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_2^{(i)} \\
\vdots \\
\boxed{\theta_j := \theta_j-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)}}$$

Feature Scaling

to make sure features are on similar scale.
get every feature into approximately a $(-1 \le x_i \le 1)$ range.

$$x_1 = size \ (0 \sim 2000 \ ft^2) \
x_2 = number \ of \ bedrooms \ (1 \sim 5)$$

$$\to x_1 = \frac {size(ft^2)} {2000} \
\to x_2 = \frac {number \ of \ bedrooms} {5}$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2010.png

Mean Normalization

get every feature into approximately a $(-0.5 \le x_i \le 0.5)$ range.

$$x_i:=\frac {x_i-\mu_i} {s_i}$$

$\mu_i = average \ value \ of \ x_i$
$s_i = range \ of \ x_i = (max-min)$

Learning Rate

$$\theta_j :=\theta_j-\bold{\alpha} \frac {\partial} {\partial\theta_j} J(\theta) $$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2011.png

For sufficiently small $\alpha$ , $J(\theta)$ should decrease on every iteration.
But if $\alpha$ is too small, gradient descent can be slow to converge.
To choose $\alpha$, try $\boxed{...,0.001,0.003,0.01,0.03,0.1,0.3,1,...}$

Polynomial Regression

We can combine multiple features into one. ( Ex. $x_3 \gets x_1 \ \centerdot \ x_2$ )
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, square root, etc.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2012.png

Feature scaling becomes important when we choose features this way.

Normal Equation

Unlike gradient descent, normal equation is a method to solve for $\theta$ analytically.

$$J(\theta_0,\theta_1,...,\theta_m)=\frac {1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2
\
\frac {\partial}{\partial\theta_j}J(\theta)=\dotsm=0 \ (for \ every \ j)
\
\underline{Solve \ for \ \theta_0,\theta_1,....\theta_n}$$

$$x^{(i)}=
\begin{bmatrix}
x_0^{(i)}\
x_1^{(i)} \
x_2^{(i)} \
... \
x_n^{(i)}
\end{bmatrix}
\ (x^{(i)})^T= [x_0^{(i)} x_1^{(i)} x_2^{(i)} \dots x_n^{(i)} ]
\
X=
\begin{bmatrix}
x_0^{(1)} \ x_1^{(1)} \ x_2^{(1)} \cdots \ x_n^{(1)}\
x_0^{(2)} \ x_1^{(2)} \ x_2^{(2)} \cdots \ x_n^{(2)} \
x_0^{(3)} \ x_1^{(3)} \ x_2^{(3)} \cdots \ x_n^{(3)} \
... \
x_0^{(m)} \ x_1^{(m)} \ x_2^{(m)} \cdots \ x_n^{(m)}
\end{bmatrix}
=\begin{bmatrix}
(x^{(1)})^T \
(x^{(2)})^T \
(x^{(3)})^T \
... \
(x^{(m)})^T
\end{bmatrix}
$$

$$\boxed {\theta = (X^TX)^{-1}X^Ty}$$

Gradient Desent

Need to choose $\alpha$.
Needs many iterations.
Works well even when $n$ (features) is large.

Normal Equation

No need to choose $\alpha$.
Don't need to iterate.
Need to compute $(X^TX)^{-1}$.
Slow if $n$ is very large. ($n>10^4)$

Vectorial Implementation

$$h_\theta(x)=\sum_{j=0}^n \theta_jx_j=\theta^Tx$$

Unvectorized implementation

prediction = 0.0;
for j = 1:n+1;
    prediction = prediction 
                                + theta(j) * x(y)
end;

Vectorized implementation

prediction = theta' * x;

Logistic Regression

Classification

ex. email - spam or not spam / tumor - malignant or benign

$$y \ni \lbrace 0,1 \rbrace$$

0 : Negative Class

1 : Positive Class

Discrete Value

$$if \ h_\theta(x) \geq 0.5, \ y=1
\
if \ h_\theta(x) < 0.5, \ y=0$$

Hypothesis Representation

Using linear regression to classification problem performs very poorly, so we have to chage the form.
Logistic Function

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2013.png

$$h_\theta(x)=g(\theta^Tx)
\
z=\theta^Tx
\
g(z)=\frac{1}{1+e^{-z}}$$

$h_\theta(x)$ = estimated *probability *that y=1 on input x

$$h_\theta(x)=P(y=1|x;\theta)

=1-P(y=0|x;\theta)$$

$$predict \ \text{\textquotedblleft}y=1" \ if \ h_\theta(x)=g(z) \geq0.5
\
\underline{ when \ z=\theta^T \geq 0}$$

$$predict \ \text{\textquotedblleft}y=0" \ if \ h_\theta(x)=g(z) <0.5
\
\underline{when \ z=\theta^T <0}$$

Decision Boundary

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2014.png

$$x=
\begin{bmatrix}
1\
x_1 \
x_2 \
\end{bmatrix}
\
\theta=
\begin{bmatrix}
-3\
1\
1\
\end{bmatrix}$$

$$h_\theta(x)=g(\theta^Tx)=g(-3+x_1+x_2)
\
predict \ \text{\textquotedblleft}y=1" \ if\ \ \bold{z=-3+x_1+x_2\geq0}$$

$$h_\theta(x)=g(\theta^Tx)=g(-3+x_1+x_2)
\
predict \ \text{\textquotedblleft}y=0" \ if\ \ \bold{ z=-3+x_1+x_2<0}$$

$$ Decision \ Boundary \\therefore x_1+x_2=3$$

The input to the sigmoid function g(z) (e.g. $\theta^TX)$ doesn't need to be linear.
e.g.
$z=\theta_0+\theta_1x_1^2+\theta_2x_2^2$ (circle)
or any shape of data

Cost Function

$$J(\theta)=
\frac{1}{m} \sum_{i=1}^{m} \underline{\frac1 2 (h_\theta(x^{(i)})-y^{(i)})^2}
=\frac{1}{m} \sum_{i=1}^{m}\underline{Cost(h_\theta(x^{(i)}),y^{(i)})}$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2015.png

$$Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(h_\theta(x)) \ \underline{ if \ y=1}$$

$$h_\theta(x)=1
\ Cost = 0$$

$$h_\theta(x) \rarr 0
\ Cost \rarr \infin$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2016.png

$$Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(1-h_\theta(x)) \ \underline{ if \ y=0}$$

$$h_\theta(x)=0
\ Cost = 0$$

$$h_\theta(x) \rarr 1
\ Cost \rarr \infin$$

Simplified Cost Function and Gradient Descent

Simplified Cost Function

$$J(\theta)
=\frac{1}{m} \sum_{i=1}^{m}\underline{Cost(h_\theta(x^{(i)}),y^{(i)})}\=- \frac{1}{m} [ \sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))]$$

Vectorized Implementation

$$h=g(X\theta) \
J(\theta)= \frac 1 {m} \cdot (-y^Tlog(h)-(1-y)^Tlog(1-h))$$

Gradient Descent

$$\theta_j \colonequals \theta_j-\alpha \frac{\partial} {\partial \theta_j}J(\theta)$$

$$\theta_j \colonequals \theta_j-\frac \alpha m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$

Vectorized Implementation

$$\theta \colonequals \theta-\frac \alpha m X^T(g(X\theta)-\overrightarrow{y})$$

Algorithm looks identical to linear regression, but $h_\theta(x)$ is Logistic Function.

Optimization Algorithm

Given $\theta$, we have code that can compute $J(\theta) , \ \frac{\partial}{\partial \theta_j}J(\theta) \ (for \ j=0,1,2 \dots,n)$.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2017.png

options = optimset('GradObj','on','MaxIter',100); % optimset() function creates an object containing the options we want to send to fminunc()
initialtheta = zeros(2,1);
[optTheta, functionVal, exitFlag] ...
           = fminunc(@costFunction, initialTheta, options);

Multiclass Classification

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2018.png

$$y \in {\lbrace0,1,...,n\rbrace}
\
h_\theta^{(0)}(x)=P(y=0|x;\theta) \
h_\theta^{(1)}(x)=P(y=1|x;\theta)
\
\cdots
\
h_\theta^{(n)}(x)=P(y=n|x;\theta)
\
prediction=\underset{i}{max}(h_\theta^{(i)}(x))$$

Train a logistic regression classfier $h_\theta(x)$ for each class to predict the probability that $y=i$.
To make a prediction on a new $x$, pick the class that maximizes $h_\theta(x)$.

$$prediction=\underset{i}{max}(h_\theta^{(i)}(x))$$

$h_\theta(x)$ is a probability that $y=i$, by logistic function definition.

Regularization

The Problem of Overfitting

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/0B05AE56-26FA-4254-99C3-81E5E29E83E0.jpeg

If we have too many features, hypothesis may fit the training set very well, but fail to generalize to new examples. →Overfitting
To fix this issue, there are two options : Reducing number of features / Regularization.

Cost Function

Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2020.png

Suppose we penalize and make $\theta_3,\theta_4$ very small.

$$\underset {\theta}{min} \frac1{2m} \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})^2 + \underline{1000\theta_3^2+1000\theta_4^2}$$

The goal is to minimize cost function, therefore by multiplying large value for $\theta_3 \ and \ \theta_4$ makes $\bold{\theta_3 \approx 0 \ and \ \theta_4 \approx 0}$.

Regularization

$$J(\theta)=\frac 1{2m} \sum_{i=1}^{m} [(h(x_\theta^{(i)})-y^{(i)})^2 + \boxed{\lambda \sum_{j=1}^n\theta_j^2}]$$

Small values for parameter $\theta$ makes hypothesis simpler and less prone to overfitting.
If $\lambda$ is extremely large value, algorithm results in underfitting. ($h_\theta(x)=\theta_0$)

Regularized Linear Regression

Gradient Descent

$$\theta_0 \coloneqq \theta_0-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_0^{(i)}$$

$$\theta_j \coloneqq \theta_j-\alpha [(\frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)})+ \frac \lambda m \theta_j]$$

$$\theta_j \coloneqq \theta_j(1-\alpha \frac \lambda m)-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)}
\ (1-\alpha \frac \lambda m )< 1$$

Normal Equation

$$X=\begin{bmatrix}
(x^{(1)})^T
\
\vdots
\
(x^{(m)})^T
\end{bmatrix}
\ (m \times (n+1) \ matrix ) \
y=\begin{bmatrix}
y^{(1)}
\
\vdots
\
y^{(m)}
\end{bmatrix}$$

$$\theta = (X^TX+\underline{\lambda \cdot L})^{-1}X^Ty$$

$$L = \begin{bmatrix}
0&0&0&\cdots&0
\
0&1&0&\cdots&0
\
0&0&1&\cdots&0
\
\vdots& & &\ddots&\vdots
\
0&&&\cdots&1
\end{bmatrix}
\ ((n+1)\times(n+1) \ matrix)$$

Regularization takes care of non-invertible issue.

Regularized Logistic Regression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2021.png

$$J(\theta)
=- \frac{1}{m} [ \sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))] + \boxed{ \frac{\lambda} {2m} \sum_{j=1}^n \theta_j^2}$$

Gradient Descent

$$\theta_0 \coloneqq \theta_0-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_0^{(i)}$$

$$\theta_j \coloneqq \theta_j-\alpha [(\frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)})+ \frac \lambda m \theta_j]$$

Advanced Optimization

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2022.png

function [jVal, gradient] = costFunction(theta);
jVal = [code to compute J(theta)];
gradient(1) = [code to compute partial derivative J(theta)]; 
gradient(2) = [code to compute partial derivative J(theta)];
gradient(3) = [code to compute partial derivative J(theta)];
...
gradient(n+1) = [code to compute partial derivative J(theta)]; % because code starts with 1, while theta_n starts from 0.

Neural Networks : Representation

Neurons are basically computational units that take inputs (Dendrites) that are channeled to outputs (Axons).
Sigmoid Activation Function (same as logistic function)

$$h_\theta(x) =g(\theta^Tx)= \frac 1 {1+e^{-\theta^Tx}}$$

$\theta$ : weights (parameters)

Model Representation 1

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2023.png

Layer 1 : Input Layer
Layer 2 : Hidden Layer
Layer 3 : Output Layer

$$\begin{bmatrix}
x_0
\
x_1
\
x_2
\
x_3
\end{bmatrix}
\rarr
\begin{bmatrix}
a_1^{(2)}
\
a_2^{(2)}
\
a_3^{(2)}
\end{bmatrix}
\rarr
h_\theta(x)$$

$a_i^{(j)}$ : "activation" of unit $i$ in layer $j$.
$\Theta^{(j)}$ : matrix of weights controlling function mapping from layer $j$ to layer $j+1$.

$$a_1^{(2)}=
g(\begin{bmatrix}
\theta_{10}^{(1)}
\
\theta_{11}^{(1)}
\
\theta_{12}^{(1)}
\
\theta_{13}^{(1)}
\end{bmatrix}
\times
\begin{bmatrix}
x_0
\
x_1
\
x_2
\
x_3
\end{bmatrix}
)
=g(\Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3)
\
a_2^{(2)}=g(\Theta_{20}^{(1)}x_0+\Theta_{21}^{(1)}x_1+\Theta_{22}^{(1)}x_2+\Theta_{23}^{(1)}x_3)
\
a_3^{(2)}=g(\Theta_{30}^{(1)}x_0+\Theta_{31}^{(1)}x_1+\Theta_{32}^{(1)}x_2+\Theta_{33}^{(1)}x_3)
\
h_\Theta(x)=a^{(3)}1= g(\Theta{10}^{(2)}a_0^{(2)}+\Theta_{11}^{(2)}a_1^{(2)}+\Theta_{12}^{(2)}a_2^{(2)}+\Theta_{13}^{(2)}a_3^{(2)})
\
(h_\Theta(x) \ can \ be \ viewed \ as \ unit \ 1 \ in \ layer \ 3)$$

If network hasthen $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j+1)$.
( +1 comes from $x_0$ and $\Theta_0^{(j)}$, "bias nodes", ex. $\Theta_{10}^{(1)}x_0$ )
$$s_j \ units \ in \ layer \ j,
\
s_{j+1} \ units \ in \ layer \ j+1$$

Model Representation 2

Forward Propagation (Vectorized Implementation)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2024.png

Let $z_1^{(2)}$ ( unit 1 in layer 2 ) asthen $z_i^{(j)}$ can be vectorized as
$$x=a^{(1)}
=\begin{bmatrix}
x_0
\
x_1
\
x_2
\
x_3
\end{bmatrix}
\ , \
z^{(2)}=
\begin{bmatrix}
z_1^{(2)}
\
z_2^{(2)}
\
z_3^{(2)}
\end{bmatrix}
\
\rarr z^{(2)}=\Theta^{(1)}a^{(1)}
\
\vdots
\
z^{(j+1)}=\Theta^{(j)}a^{(j)}
\
(a_0^{(j)}=1 \ (bias \ unit))$$
$$z_1^{(2)} = \Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3
\
a_1^{(2)}=g(z_1^{(2)})$$
The last parameter matrix $\Theta^{(j)}$ will have only one row (because $jth$ layer has only one unit ($h_\theta(x)$)
which is multiplied by one column $a^{(j)}$ so that our result is a single number.

$$h_\Theta(x) = a^{(j+1)}=g(z^{(j+1)})=g(\Theta^{(j)}a^{(j)})$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2025.png

Examples & Intuitions 1

If first parameter matrix is $\Theta^{(1)}=[-30 \ \ 20 \ \ 20]$
$$h_\Theta(x)=g(-30+20x_1+20x_2)
\
x_1=0 \ and \ x_2=0 \ then \ | \ g(-30) \approx0
\
x_1=0 \ and \ x_2=1 \ then \ | \ g(-10) \approx0
\
x_1=1 \ and \ x_2=0 \ then \ | \ g(-10) \approx0
\
x_1=1 \ and \ x_2=1 \ then \ | \ g(10) \approx1
\
\rarr AND$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2026.png

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2027.png

Examples & Intuitions 2

To make $x_1 \ XNOR \ x_2$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2028.png

Use $\Theta^{(1)}$ matrix that combines the values for AND and NOR.\begin{bmatrix}
-30 & 20 &20
\
10 & -20 & -20
\end{bmatrix}$$
$$\Theta^{(1)}
Use $\Theta^{(2)}$ matrix that uses the value for OR.
$$\Theta^{(2)}=
\begin{bmatrix}
-10 & 20 & 20
\end{bmatrix}$$
Then $a_1^{(3)}(=h_\Theta(x))$ becomes XNOR.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2029.png

Neural Network Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2030.png

Multi-class Classification

Let hypothesis function return a vector of values.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2032.png

$$\begin{bmatrix}
x_0
\
x_1
\
x_2
\
\vdots
\
x_n
\end{bmatrix}
\rarr
\begin{bmatrix}
a_0^{(2)}
\
a_1^{(2)}
\
a_2^{(2)}
\
\vdots
\end{bmatrix}
\rarr
\begin{bmatrix}
a_0^{(3)}
\
a_1^{(3)}
\
a_2^{(3)}
\
\vdots
\end{bmatrix}
\rarr
\cdots
\rarr
\begin{bmatrix}
h_\Theta(x)_1
\
h_\Theta(x)_2
\
h_\Theta(x)_3
\
h_\Theta(x)_4
\end{bmatrix}$$

Neural Networks : Learning

Cost Function

$$L=total \ number \ of \ layers \ in \ network
\
s_l=number \ of \ units \ (not \ including \ bias \ unit) \ in \ layer \ 1
\
K=number \ output \ units/classes
\
h_\Theta(x)_k=hypothesis \ that \ results \ in \ the \ k^{th} \ output$$

$$J(\Theta)=- \frac {1}{m} [ \sum_{i=1}^m \bold{\sum_{k=1}^{K}} y_k^{(i)}log(h_\theta(x^{(i)}))k + (1-y_k^{(i)})log(1-(h_\Theta(x^{(i)}))_k )] + \bold{\frac{\lambda}{2m} \sum{l=1}^{L-1}\sum_{i=1}^{s_l} \sum_{j=1}^{s_l+1}(\Theta_{ji}^{(l)})^2}
\$$

Backpropagation Algorithm

Our goal is to minimize $J(\Theta)$ with parameter $\Theta$.
To achieve the goal, we need to compute $J(\Theta)$ and $\frac {\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
To do so, we use backpropagation algorithm.

$$training \ set : (x^{(1)},y^{(1)}), \cdots,(x^{(m)},y^{(m)})
\
Set \ \Delta_{i,j}^{(l)} :=0 \ for \ all \ (l,i,j)$$

$$For \ t=1 \ to \ m:
\

\ Set \ a^{(1)}:=x^{(t)} \ (unit \ of \ first \ layer \ i s\ same \ as \ x)
\
\ Perform \ Foward \ Propagation \ to \ compute \ a^{(l)} (l=2,3,...,L)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2033.png

$$Using \ y^{(t)}, \
Compute \ \delta^{(L)}=a^{(L)}-y^{(t)}$$

"error values" ($\delta^{(L)}$) for the last layer are simply the differences of our actual results in the last layer ($a^{(L)}$) and correct outputs in y ($y^{(t)}$).

$$Compute \ \delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)} \ using
\
\delta^{(l)}=((\Theta^{(l)})^T \delta^{(l+1)}).a^{(l)}.(1-a^{(l)})
\
(g'(z^{(l)})=a^{(l)}.*(1-a^{(l)}))$$

$$\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_{j}^{(l)}\delta_{i}^{(l+1)}
\
Vectorization \ \rarr \ \Delta^{(l)}:=\Delta^{(l)}+\delta_{i}^{(l+1)}(a^{(l)})^T
$$

Then we update our new $\Delta$ matrix.

$$ D_{i,j}^{(l)}:= \frac 1 m (\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}) \ , if \ j \neq 0.
\
D_{i,j}^{(l)}:= \frac 1 m \Delta_{i,j}^{(l)}\ , if \ j = 0.$$

matrix $D$ is used as an "accumulator" to add up our values and eventually compute our partial derivative of $J(\Theta)$.

$$\frac {\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)=D_{ij}^{(l)}$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2034.png

Backpropagation Intuition

$\delta_j^{(l)}$ is the "error" for $a_j^{(l)}$.
delta values are actually the derivative of the cost function.

$$\delta_j^{(l)}=\frac {\partial} {\partial z_j^{(l)}} \ cost(t)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2035.png

To calculate $\delta_2^{(2)}$, we multiply the weights $\Theta_{12}^{(2)}$ and $\Theta_{22}^{(2)}$ by their respective $\delta$ values.

$$\delta_2^{(2)}=\Theta_{12}^{(2)}\delta_1^{(3)}+\Theta_{22}^{(2)}\delta_2^{(3)}$$

Implementation Note : Unrolling Parameters

To use optimizing function as "fminunc()", we should unroll all elements.

$$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)},...$$

thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
% becomes column vector
deltaVector = [ D1(:); D2(:); D3(:) ];
% becomes column vector

If Theta1 is 10x11, Theta2 is 10x11, Theta3 is 1x1 matrices, we can get back our original matrices from the "unrolled" versions.

Theta1 = reshape (thetaVector (1:110), 10, 11); 
% select 1 to 110 elements, make 10 by 11 matrix
Theta2 = reshape (thetaVector (111:220), 10, 11);
% select 111 to 220 elements, make 10 by 11 matrix
Theta3 = reshape (thetaVector (221:231), 1, 11);
% select 221 to 231 elements, make 1 by 11 row vector

Gradient Checking

To verify whether backpropagation works as intended, we use gradient checking.

$$\frac {\partial} {\partial \Theta_j} J(\Theta) \approx \frac {J(\Theta_1,\cdots,\Theta_j+\epsilon,\cdots,\Theta_n)-J(\Theta_1,\cdots,\Theta_j-\epsilon,\cdots,\Theta_n)} {2 \epsilon}$$

Once you have verified that backpropagation is correct, you should disable gradient checking. (code to compute gradApprox is very slow)

Random Initialization (Breaking Symmetry)

If we choose all zero elements for initial $\Theta$ matrix, all nodes will update to the same value repeatedly while performing backpropagate.
So we should initialize $\Theta$ to a random value in $[-\epsilon,\epsilon]$.

$$-\epsilon \leq \Theta_{ij}^{(l)} \leq \epsilon$$

Theta1 = rand(10,11) * (2 * init_epsilon) - init_epsilon;
Theta2 = rand(1,11) * (2 * init_epsilon) - init_epsilon;

Summary (Putting it together)

Pick a Network Architecture

Number of input units : Dimension of features $x^{(i)}$
Number of output units : Number of classes ($y_k)$
Resonable Default : 1 hidden layer, or if >1 hidden layer, have same number of hidden units in every layer

Training a Neural Network

Randomly initialize wieghts ($\Theta)$
Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
Implement code to compute cost function $J(\Theta)$
Implement backpropagation to compute partial derivatives $\frac {\partial} {\partial \Theta_{jk}^{(l)}} J(\Theta)$
for i = 1:m { Perform forward propagation and backpropagation using example (x_i, y_i) Get activations a^l and delta terms delta^l for l = 2 to L }
Use gradient checking to compare $\frac {\partial} {\partial \Theta_{jk}^{(l)}} J(\Theta)$ computed using backpropagation vs. using numerical estimate of gradient of $J(\Theta)$

Then disable gradient checking code
6. Use gradient descent or advanced optimization method with backpropagation to try to minimze $J(\Theta)$ as a function of parameters $\Theta$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2036.png

Advice for applying Machine Learning

Evaluating a hypothesis

Trouble shooting for errors in our predictions by :

Getting more training examples
Trying smaller sets of features
Trying additional features
Trying polynomial features
Increasing or decreasing $\lambda$

We can split up the data into to sets : training set (70%) / test set (30%)

Learn $\Theta$ and minimize $J_{train} (\Theta)$ using the training set
Compute the test set error $J_{test}(\Theta)$

The Test Set Error

Linear Regression

$$J_{test}(\Theta)= \frac {1}{2m_{test}} \sum_{i=1}^{m_{test}}
(h_{\Theta}(x_{m_{test}})-y_{m_{test}}^{(i)})^2$$

Classification

$$err(h_\Theta(x),y)=1 \ \
(if \ h_\Theta(x) \geq 0.5 \ and \ y=0,
\ or \ h_\Theta(x) < 0.5 \ and \ y=1)
\
err(h_\Theta(x),y)=0 \ (otherwise)$$

$$Test \ Error = \frac 1 {m_{test}}
\sum_{i=1}^{m_test}
err(h_\Theta(x_{test}^{(i)}),y_{test}^{(i)})$$

Logistic Regression

$$J_{test}(\theta)= - \frac 1 {m_{test}}
\sum_{i=1}^{m_{test}}
y_{test}^{(i)} \ log (h_\theta(x_{test}^{(i)}))+(1-y_{test}^{(i)})log(h_\theta(x_{test}^{(i)}))$$

Model Selection and Training / Validation / Test sets

The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any data set.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2037.png

One way to break down our dataset into the three sets is :

Training Set (60%)
Cross Validation Set (20%)
Test Set (20%)

We can now calculate three separate error values :

Optimize the parameters in $\bold\Theta$ using the data set for each polynomial degree.
Find the polynomial degree $\bold d$ with the least error using the cross validation set.
Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2038.png

Diagnosing Bias vs. Variance

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2039.png

Bias (Underfit) - small degree of polynomial d
$J_{train}(\Theta)$ will be high, $J_{cv}(\Theta) \approx J_{train}(\Theta)$
Variance (Overfit) - large degree of polynomial d
$J_{train}(\Theta)$ will be low, $J_{cv}(\Theta) >> J_{train}(\Theta)$

Regularization and Bias / Variance

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2040.png

Small $\lambda$ : high variance (overfit)
Large $\lambda$ : high bias (underfit)
To choose the model & regularization term:

Create list of $\lambda$ ( 0, 0.01, 0.02, 0.04, 0.08, ... , 10.24 ).
Create a set of models with different degrees.
Go through all the models to learn some $\Theta$ for each $\lambda$.
Compute cross validation error on the $J_{cv}(\Theta)$ without regularization.
Select the best combo that produces the lowest error on the cross validation set.
Apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.

Learning Curves

High Bias (Underfitting)
High Variance (Overfitting)

Debugging a Learning Algorithm

Fixes high bias (underfitting)
Try getting additional features
Try adding polynomial features
Try decreasing $\lambda$
Fixes high variance (overfitting)
Get more training examples (training sets)
Try smaller sets of features
Try increasing $\lambda$ (heavier regularization)
Small Neural Network
more prone to underfitting (Computationally cheaper)
Large Neural Netwok
more prone to overfitting (Computationally expensive)
Use regualrization to address overfitting.

Machine Learning System Design

Prioritizing What to Work On

How could we spend our time to improve the accuracy of this classifier (spam classifier)?

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2043.png

Collect lots of data
Develop sophisticated features
Develop algorithms to process your input in different ways

It is difficult to tell which of the options will be most helpful.

Error Analysis

Recommended Approach
1. Start with simple algorithm, implement quickly, and test it early on your cross validation data.
2. Plot learning curves to decide if more data, more features, etc. are likely to help.
3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
It is very important to get error results as a single, numerical value.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2044.png

Error metrics for skewed classes

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2045.png

Precision / Recall

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2046.png

Precision
(Of all patients where we predicted $y=1$, what fraction actually has caner?)\frac { True \ Positive } {\bold{ Predicted \ Positive}}$$
$$\frac { True \ Positive } {\bold{ True \ Positive + False \ Positive}}
Recall
(Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)\frac { True \ Positive } {\bold{ Actual \ Positive}}$$
$$\frac { True \ Positive } {\bold{ True \ Positive + False \ Negative}}

Trading off precision and recall

Suppose we want to predict $y=1$ (cancer) only if very confident.Higher precision, Lower recall
$$h_\theta(x) \geq \bold{ 0.9}
\
h_\theta(x) < \bold{0.9}$$
Suppose we want to avoid missing too many cases of cancer (avoid false negatives).Higher recall, Lower precision
$$h_\theta(x) \geq \bold{0.3}
\
h_\theta(x) < \bold{0.3}$$
More generally : predict 1 if $h_\theta(x) \geq threshold$.

F Score

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2047.png

$$⁍$$

Data for Machine Learning

Useful test

Given the input $x$, can a human expert confidently predict $y$?
Can we get a large training set, and train the learning algorithm with a lots of parameters in the training set?

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2048.png

Support Vector Machines

Optimization Objective

Alternative view of logistic regression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2050.png

$$Cost \ of \ example: \=-(y-\log h_\theta(x)+(1-y)\log(1-h_\theta(x)))
\
=-y\log \frac {1} {1+e^{-\theta^Tx}} - (1-y)\log(1-\frac {1} {1+e^{-\theta^Tx}})$$

If $y=1$ , (want $\theta^Tx >>0)$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2051.png

If $y=0$ , (want $\theta^Tx <<0)$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2052.png

Support Vector Machine

$$\underset{\theta}
{min}
\frac{1}{m} [ \sum_{i=1}^m y^{(i)}(-\log h_\theta(x^{(i)}))+(1-y^{(i)})(-\log(1-h_\theta(x^{(i)}))]+ \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$

$m$ : constant, doesn't affect minimization

$A+\underline \lambda B \rarr \underline C A+B \ (C=\frac 1 {\lambda})$

$$\rarr \underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac 1 2 \sum_{j=1}^n\theta_j^2$$

$$h_\theta(x)=1 \ (if \ \theta^Tx \geq 0) \
h_\theta(x)=0 \ (otherwise)$$

Large Margin Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2053.png

$$\underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac 1 2 \sum_{j=1}^n\theta_j^2$$

SVM Decision Boundary : Linearly Serperable Case
Large Margin Classfier in presence of outliers

The mathematics behind Large Margin Classification (optional)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2056.png

Kernels 1

Given $x$, compute new feature depending on proximity to landmarks $l^{(1)},l^{(2)},l^{(3)}$

$$f_1 = similarity (x,l^{(1)})=\exp(- \frac {|x-l^{(1)}|^2} {2 \sigma^2})
=\exp(- \frac {\sum_{j=1}^n (x_j-l_j^{(1)})^2} {2\sigma^2})$$

$$f_2 = similarity (x,l^{(2)})=\exp(- \frac {|x-l^{(2)}|^2} {2 \sigma^2})$$

$$f_3= similarity (x,l^{(3)})=\exp(- \frac {|x-l^{(3)}|^2} {2 \sigma^2})$$

If $x \approx l^{(1)}$
$$f_1 \approx \exp(- \frac 0 {2\sigma^2})=1$$
If $x$ is far from $l^{(1)}$
$$f_1 = \exp (- \frac {(large \ number)^2} {2 \sigma^2}) \approx 0$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2059.png

Kernels 2

Choosing Landmarks $l^{(i)}$

Given $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})$,
Choose $l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},...,l^{(m)}=x^{(m)}$.

Given example $x$ :

$$f_1=similarity (x,l^{(1)})
\ f_2=similarity(x,l^{(2)})
\ \vdots$$

$$feature \ vector \ f=
\begin {bmatrix}
f_0
\f_1
\ \vdots
\f_m
\end {bmatrix}
,
\ f_0=1$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2061.png

SVM with Kernels

Hypothesis : Given $x$, compute features $f \in \mathbb{R}^{m+1}$
Predict $y=1$ if $\theta^Tf \geq 0$

Training :

$$\underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^T\bold{f^{(i)}})+(1-y^{(i)})cost_0(\theta^T\bold{f^{(i)}})]+\frac 1 2 \sum_{j=1}^m\theta_j^2$$

$$\sum_{j=1}^m \theta_j^2 = \theta^T \theta=\theta^TM\theta \ (M:rescale \ factor)$$

SVM parameters
$C= \frac 1 \lambda$

Large $C$ : Lower bias, high variance, less regularized ( small $\lambda$ )

Small $C$ : Higher bias, lower variance, more regularized ( large $\lambda$ )

$\sigma^2 \ \ (\exp(-\frac {|x-l|^2} {2 \sigma^2})$

Large $\sigma^2$ : Features $f_i$ vary more smoothly. Higher bias, lower variance.

Small $\sigma^2$ : Features $f_i$ vary less smoothly. Lower bias, higher variance.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2062.png

Using SVM

Use SVM software package (e.g. liblinear, libsvm, ...) to solve for parameters $\theta$.

Need to specify :

Choice of parameter C
Choice of kernel (similarity function) :
No kernel (linear kernel), Gaussian kernel, ploynomial kernel, ...
Perform feature scaling before using the Gaussian kernel.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2063.png

Multi-class classification

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2064.png

Many SVM packages already have built-in multi-class classification.

Otherwise, use one-vs-all method.

Logistic regression vs. SVMs
- If $n$ is large (relative to m) (e.g. $n$=10,000 , $m$ = 10~1,000)
- Use Logistic Regression, or SVM without a kernel (linear kernel)
- If $n$ is small, $m$ is intermediate
- Use SVM with Gaussian Kernel
- If $n$ is small, $m$ is largeNeural Network likely to work well for most of these settings, but may be slower to train.
- Add more features, then use Logistic Regression or SVM without a kernel
$n$ : number of features, $m$ : number of training examples

Clustering

Unsupervised Learning

In Unsupervised Learning, you are given an unlabled dataset (without labels $y^{(i)}$) and are asked to find some structure in the data.

K-means Algorithm

Randomly Initialize cluster centroids
Assign cluster
Move cluster centroid to average point
Repeat until it converges

InputTraining Set : $x^{(1)},x^{(2)},...,^{(m)}$
$K$ : number of clusters
Algorithm

Randomly initialize K cluster centroids $\mu_1,\mu_2,...,\mu_K \in \mathbb{R}^n$

Repeat {

Cluster Assignment Step$c^{(i)}$ := index of cluster centroid closest to $x^{(i)}$. (1,2,...,K)
for $i=1 \ to \ m$
Move to Centroid$\mu_k$ := average of points assigned to cluster $k$ }
for $k=1 \ to \ K$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2069.png

Optimization Objective

$c^{(i)}$ = index of cluster to which example $x^{(i)}$ is currently assigned

$\mu_k$ = cluster centroid $k$

$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

(E.g. If $x^{(i)}$ is assigned in cluster 5 then $c^{(i)}=5$, $\mu_{c^{(i)}}=\mu_5$)

Optimization objective (Distortion Cost Function / K-means Algorithm)

$$J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)= \frac 1 m \sum_{i=1}^m |x^{(i)}-\mu_{c^{(i)}}|^2
\
min \ J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$$

Random Initialization

How to randomly initialize K cluster centroids?
1. Should have $K>m$
2. Randomly pick $K$ training examples.
3. Set $\mu_1,\cdots,\mu_K$ equal to these $K$ examples.
4. $\mu_1 = x^{(1)} , \mu_2 = x^{(2)} ,\cdots$
To avoid local optima, initialize K-means lots of times & pick clustering that gave lowest cost $J$.
1. Randomly initialize K-means.
2. Run K-means. Get $c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K$.
3. Compute cost function (distortion) $J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$ }
For $i=1 \ to \ 100$ {

Choosing the number of clusters

Usually chosen by hand or human insight.
Elbow method :

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2070.png

Dimensionality Reduction

Motivation 1 : Data Compression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2072.png

Motivation 2 : Data Visualization

Reduce data to max 3-dimensional data

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2074.png

Principal Component Analysis problem formulation (PCA)

2-dimension to 1-dimension :
Find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2077.png

n-dimension to k-dimension :
Find $k$ vectors $u^{(1)},u^{(2)},u^{(3)},\cdots,u^{(k)}$ onto which to project the data, so as to minimize the projection error.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2078.png

PCA is not linear regression

Principal Component Analysis Algorithm

Data PreprocessingPreprocessing (feature scaling / mean normalization) :Replace each $x_j^{(i)}$ with $x_j-\mu_j$. (ensure every feature has zero mean.)
If different features on different scales, perform feature scaling.
$x_j^{(i)} \larr \frac {x_j^{(i)}-\mu_j} {s_j}$
$$\mu_j= \frac 1 m \sum_{i=1}^m x_j^{(i)}$$
Training Set : $x^{(1)},x^{(2)},...,x^{(m)}$
Compute "Covariance matrix $\Sigma$"
$$\Sigma= \frac 1 m \sum_{i=1}^n (x^{(i)})(x^{(i)})^T$$
Compute "Eigenvectors" of matrix $\Sigma$$U_{reduce}$ is matrix $U$ with $k$ selected columns.
$z^{(i)}$ is projected $x^{(i)}$.
[U, S, V] = svd(Sigma); Ureduce = U(:, 1:k); z = Ureduce' * x;

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2079.png

Reconstruction from compressed representation

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2080.png

$$X \approx X_{approx}^{(i)}=U_{reduce} \cdot z^{(i)}$$

Choosing the number of principal components

Average squared projection error :

$$\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2$$

Total variation in the data :

$$\frac 1 m \sum_{i=1}^m|x^{(i)}|^2$$

Typically, choose $k$ to be smallest value so that

$$\frac { average \ square } {total \ variation} =
\frac {\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2} { \frac 1 m \sum_{i=1}^m|x^{(i)}|^2 }
\leq 0.01 \ (1%)
\
\rarr 99% \ of \ variance \ is \ retained$$

Choosing $k$ (compressed dimension) algorithmCompute $U_{reduce}, z^{(1)},z^{(2)},...,z^{(m)},x_{approx}^{(1)},...,x_{approx}^{(m)}$.
ORPick smallest value of $k$ for which
$$\frac {\sum_{i=1}^k S_{ii}} {\sum_{i=1}^m S_{ii}}
\geq0.99
\
\rarr 99% \ of \ variance \ retained$$
Check if $\frac {\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2} { \frac 1 m \sum_{i=1}^m|x^{(i)}|^2 }
\leq 0.01$, if not, increase $k$.
Try PCA with $k=1$.

Advice for applying PCA

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2082.png

Mapping $x^{(i)} \rarr z^{(i)}$ should be defined by running PCA only on the training set.

This mapping can be applied as well to the examples in the cross validation and test sets.

Application of PCASpeed up learning algorithm (Choose $k$ by % of variance retain)
Visualization ($k=2 \ or \ 3$)
Reduce memory / disk needed to store data
Bad use of PCA : To prevent overfitting
Before implementing PCA, first try running whatever you want to do with the original data $x^{(i)}$.

Anomaly Detection

Problem Motivation

Untitled

$x_i$ : features
Model $p(x) (probability)$ from data.
$p(x_{test})<\epsilon$ → Flag Anomaly
$p(x_{test}) \geq \epsilon$ → OK

Gaussian Distribution

$$X \sim N(\mu,\sigma^2)
\
\mu : mean , \ \sigma^2 : variance$$

$$P(x;\mu,\sigma^2)=\frac 1 {\sqrt{2\pi}\sigma}exp(- \frac {{(x-\mu)}^2} {2 \sigma^2})$$

Untitled

$$\mu= \frac 1 {m} \sum_{i=1}^mx^{(i)}$$

$$\sigma^2=\frac 1 m \sum_{i=1}^m (x^{(i)}-\mu)^2$$

Algorithm

$$x_i \sim N(\mu_i,\sigma_i^2)
\
p(x)=p(x_1;\mu_1,\sigma_1^2) \cdot p(x_2;\mu_2,\sigma_2^2) \cdot p(x_3;\mu_3,\sigma_3^2) \cdots \cdot p(x_n;\mu_n,\sigma_n^2)
\=\Pi_{j=1}^np(x_j;\mu_j,\sigma_j^2)$$

Choose features $x_i$ that you think might be indicative of anomalous examples.
Fit parameters $\mu_1,\mu_2,...,\mu_n, \ \sigma_1^2, \sigma_2^2,...,\sigma_n^2$
$\mu_j = \frac 1 {m} \sum_{i=1}^mx_j^{(i)}$
$\sigma_j^2=\frac 1 m \sum_{i=1}^m (x_j^{(i)}-\mu_j)^2$
Given new example $x$, compute $p(x)$
$p(x)=\Pi_{j=1}^np(x_j;\mu_j,\sigma_j^2)=\Pi_{j=1}^n\frac 1 {\sqrt{2\pi}\sigma}exp(- \frac {{(x-\mu)}^2} {2 \sigma^2})$
Anomaly if $p(x)<\epsilon$

Untitled

Developing and Evaluating an anomaly detection system

Assume we have some labeled data, of anomalous and non-anomalous examples. ($y=0$ if normal, $y=1$ if anomalous)

Training set : $x^{(1)},x^{(2)},...,x^{(m)}$ (assume normal examples not anomalous)

Cross validation set : ($x_{cv}^{(1)},y_{cv}^{(1)}),...,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$

Test set : $(x_{test}^{(1)},y_{test}^{(1)}),...,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$

Untitled

Algorithm Evaluation

Fit model $p(x)$ on training set { $x^{(1)},...,x^{(m)}$ }
(can also use cross validation set to choose parameter $\epsilon$)
On a cross validation / test example $x$, predict
$y=1$ if $p(x) < \epsilon$ (anomaly)
$y=0$ if $p(x) \geq \epsilon$ (normal)
Possible evaluation metrics :

True positive, false positive, false negative, true negative
Precision / Recall
$F_1$ score

Anomaly Detection vs. Supervised Learning

Untitled

Anomaly DetectionLarge number of negative examples (normal)
Very small number of positive examples (anomalous)
Supervised Learning
Large number of positive and negative examples

Choosing what features to use

Untitled

Error analysis for anomaly detection
$y=1$ if $p(x) < \epsilon$ (anomaly)
$y=0$ if $p(x) \geq \epsilon$ (normal)If $p(x)$ is comparable for both normal and anomalous examples, create new feature.
Want $p(x)$ large for normal examples $x$, small for anomalous examples $x$.

Untitled

Multivariate Gaussian Distribution

Untitled

Don't model $p(x_1),p(x_2),...$ separately.

Model $p(x)$ all in one go.

Parameters : $\mu \in R^n$, $\Sigma \in R^{n \times n}$ (covariance matrix)

$$p(x;\mu,\Sigma)=\frac 1 {{(2\pi)^{\frac n 2}} |\Sigma|^{\frac 1 2}}exp(-\frac 1 2 (x-\mu)^T\Sigma^{-1}(x-\mu))$$

Untitled

Anomaly Detection using the multivariate Gaussian Distribution

Parameter Fitting

$\mu = \frac 1 {m} \sum_{i=1}^mx^{(i)}$

$\Sigma=\frac 1 m \sum_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T$

Untitled

Recommender Systems

Problem Formulation

Example : Predicting movie ratings$n_u$ = no. of users$r(i,j)=1$ if user $j$ has rated movie $i$
$y^{(i,j)}$ = rating givren by user $j$ to movie $i$ (defined only if $r(i,j)=1$ )
$n_m$ = no. of movies

Content-based Recommendations

For each user $j$, learn $\theta^{(i)} \in \mathbb{R}^{n+1}$.
Predict user $j$ as rating movie $i$ with $(\theta^{(i)})^Tx^{(i)}$ stars. $(x_0=1)$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20102.png

$$x^{(3)}=
\begin{bmatrix}
1
\ 0.99
\0
\end{bmatrix}
\ \leftrightarrow \
\theta^{(1)}=
\begin{bmatrix}
0
\ 5
\0
\end{bmatrix}
\
\
(\theta^{(1)})^Tx^{(3)} = 5 \times0.99=4.95$$

$r(i,j)=1$ if user $j$ has rated movie $i$

$y^{(i,j)}$ = rating givren by user $j$ to movie $i$ (defined only if $r(i,j)=1$ )

$\theta^{(j)}$ = parameter vector for user $j$

$x^{(i)}$ = feature vector for movie $i$

$m^{(j)}$ = no. of movies rated by user $j$

For user $j$, movie $i$, predicted rating : $(\theta^{(j)})^T(x^{(i)})$

To learn $\theta^{(j)}$ (parameter for user $j$)
$$\underset{\theta^{(j)}}{min}
\frac 1 2 \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{k=1}^n(\theta^{(j)}_k)^2$$
To learn $\theta^{(1)},\theta^{(2)},...,\theta^{(n_u)}$ (Optimization Algorithm)
$$\underset{\theta^{(1)},...,\theta^{(n_u)}}{min}
\frac 1 2 \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2$$
Gradient descent update$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}) \ (k\not=0)$$
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} \ (k=0)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20103.png

Collaborative Filtering

How to choose features ( $x_1,x_2,...$) ($x_0=1$)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20104.png

$$if \ x^{(1)}=
\begin{bmatrix}
1
\ 1.0
\ 0.0
\end{bmatrix}$$

$$\ (\theta^{(1)})^Tx^{(1)} \approx 5
\ (\theta^{(2)})^Tx^{(1)} \approx 5
\ (\theta^{(3)})^Tx^{(1)} \approx 0
\ (\theta^{(4)})^Tx^{(1)} \approx 0$$

$$\rarr \
\theta^{(1)}=\begin{bmatrix}
0
\5
\0
\end{bmatrix}
\ \theta^{(2)}=\begin{bmatrix}
0
\5
\0
\end{bmatrix}
\ \theta^{(3)}=\begin{bmatrix}
0
\0
\5
\end{bmatrix}
\ \theta^{(4)}=\begin{bmatrix}
0
\0
\5
\end{bmatrix}$$

Given parameters $\theta^{(1)},...,\theta^{(n_u)}$, to learn $x^{(1)},...,x^{(n_m)}$
$$\underset{x^{(1)},...,x^{(n_m)}}{min}
\frac 1 2 \sum_{i=1}^{n_m} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2$$
Given $x^{(1)},...,x^{(n_m)}$ (features) and movie ratings, can estimate $\theta^{(1)},...,\theta^{(n_u)}$ (parameters)
Given parameters $\theta^{(1)},...,\theta^{(n_u)}$, can estimate $x^{(1)},...,x^{(n_m)}$ (features)

$$\theta\ (initial \ random \ guess) \rarr x \rarr\theta \rarr x \rarr\theta \rarr x \rarr \cdots$$

Collaborative Filtering Algorithm

Given $x^{(1)},...,x^{(n_m)}$, estimate $\theta^{(1)},\theta^{(2)},...,\theta^{(n_u)}$
$$\underset{\theta^{(1)},...,\theta^{(n_u)}}{min}
\frac 1 2 \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2$$
Given parameters $\theta^{(1)},...,\theta^{(n_u)}$, estimate $x^{(1)},...,x^{(n_m)}$
$$\underset{x^{(1)},...,x^{(n_m)}}{min}
\frac 1 2 \sum_{i=1}^{n_m} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2$$
Minimizing $x^{(1)},...,x^{(n_m)}$ and $\theta^{(1)},...,\theta^{(n_u)}$ simultaneously$$\rarr
\underset{x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}}
{min}
J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$$
$$J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})=
\frac 1 2 \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}k)^2+\frac \lambda 2 \sum{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2$$

Initialize $x^{(1)},...,x^{(n_m)}$ and $\theta^{(1)},...,\theta^{(n_u)}$ to small random values. $(x \in \mathbb{R}^n, \theta \in \mathbb{R}^n)$
Minimize $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$ using gradient descent (or advanced optimization algorithm)$$x_k^{(i)}:=x_k^{(i)}-\alpha(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(i)})$$
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda \theta_k^{(j)})$$
E.g. for every $j=1,...,n_u$ , $i=1,...,n_m$ (not from zero) $( no \ x_0 ,\theta_0)$
For a user with parameters $\theta$ and a movie with learned features $x$, predict a star rating of $\theta^Tx$.

Vectorization : Low rank matrix Factorization

Untitled

$$Y=
\begin{bmatrix}
5 \ 5 \ 0 \ 0
\ 5 \ ?\ ?\ 0
\? \ 4 \ 0 \ ?
\0 \ 0 \ 5 \ 4
\ 0 \ 0 \ 5 \ 0
\end{bmatrix}
=y^{(i,j)}$$

$$n_m=5
\n_u=4$$

$$X=\begin{bmatrix}
(x^{(1)})^T
\(x^{(2)})^T
\ \vdots
\ (x^{(n_m)})^T
\end{bmatrix}$$

$$\Theta=\begin{bmatrix}
(\theta^{(1)})^T
\(\theta^{(2)})^T
\ \vdots
\ (\theta^{(n_u)})^T
\end{bmatrix}$$

Predicted Ratings (Low Rank Matrix Factorization)

$$X\Theta^T=\begin{bmatrix}
(\theta^{(1)})^T(x^{(1)}) \ (\theta^{(2)})^T(x^{(1)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(1)})

\(\theta^{(1)})^T(x^{(2)}) \ (\theta^{(2)})^T(x^{(2)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(2)})
\ \vdots
\ \(\theta^{(1)})^T(x^{(n_m)}) \ (\theta^{(2)})^T(x^{(n_m)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(n_m)})
\end{bmatrix}$$

Finding related movies

For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$

To find movies $j$ related to movie $i$, find movie with the smallest $|x^{(i)}-x^{(j)}|$.

Implementational Detail : Mean Normalization

It's not a good idea to set to zero when user haven't rated any movies.

Untitled

$$Y=
\begin{bmatrix}
5 \ 5 \ 0 \ 0 \ \underline?
\ 5 \ ?\ ?\ 0 \ \underline?
\? \ 4 \ 0 \ ? \ \underline?
\0 \ 0 \ 5 \ 4 \ \underline?
\ 0 \ 0 \ 5 \ 0 \ \underline?
\end{bmatrix}$$

Take average of other ratings and put it into matrix $\mu$.

$$\mu=
\begin{bmatrix}
2.5
\2.5
\2
\2.25
\1.25
\end{bmatrix}$$

Untitled

Then learn $\theta^{(j)},x^{(i)}$ with mean normalized $Y$.
For user $j$, on moive $i$, predict (add $\mu$ back for prediction)
$$(\theta^{(j)})^T(x^{(i)})+\underline{\mu_i}$$

e.x. For user 5 (Eve) : $(\theta^{(5)})^T(x^{(i)})+\underline{\mu_i}$

Large Scale Machine Learning

Learning with Large Datasets

"It's not who has the best algorithm that wins. It's who has the most data."

To check if training algorithm is ok with selected data sets, we can plot the learning curve. (Sanity check)

Untitled

Stochastic Gradient Descent

Batch Gradient Descent (using all training examples in each iteration)
Stochastic Gradient Descent (using single training example in each iteration.)

Randomly shuffle training examples
Repeat {$\theta_j:=\theta_j-\alpha(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$}
(for $j:=1,...,n$) }
for $i:=1,...,m$ {

Untitled

Mini-batch Gradient Descent

Use $b$ examples in each iteration

if $b=10, m=1,000$

Repeat {

for $i=1,11,21,31,...,991$ {

$\theta_j:=\theta_j-\alpha \underline{ \frac 1 {10} \sum_{k=i}^{i+9} (h_\theta(x^{(k)})-y^{(k)})x_j^{(k)}}$

(for every $j=0,...,n$) }

}

Mini-batch Gradient Descent is efficient when examples are vectorized well.

Stochastic Gradient Descent Convergence

Checking for Convergence

Batch Gradient Descent$J_{train}(\theta)= \frac 1 {2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
Plot $J_{train}(\theta)$ as a function of the number of iterations of gradient descent.
Stochastic Gradient DescentDuring learning, compute $cost(\theta,(x^{(i)},y^{(i)}))$ before updating $\theta$ using $(x^{(i)},y^{(i)})$.
Every 1,000 iterations (say), plot $cost(\theta,(x^{(i)},y^{(i)}))$ averaged over the last 1,000 examples processed by algorithm.
$cost(\theta,(x^{(i)},y^{(i)}))= \frac 1 2 (h_\theta(x^{(i)})-y^{(i)})^2$

Untitled

Learning rate

Learning rate $\alpha$ is typically held constant. Can slowly decrease $\alpha$ over time if we want $\theta$ to converge.
$\alpha= \frac {const1} {iterationNumber+const2}$

Untitled

Online Learning

Untitled

Can adopt to changing user preference.
Learn from $(x,y)$ then discard. $\bold{not} \ (x^{(i)},y^{(i)})$.
Repeat Forever {Update $\theta$ using $(x,y)$ :$(j=0,...,n)$
}
$\theta_j:=\theta_j-\alpha(h_\theta(x)-y)x_j$
Get $(x,y)$ corresponding to user.

Other Online Learning Example

Untitled

Map-reduce and Data Parallelism

Many Learning algorithms can be expressed as computing sums of functions over the training set.
- E.g. Gradient descent, Logistic regression, Deep learning, ...
Batch Gradient Descent $(m=400)$

Untitled

Application example : Photo OCR

Problem description and pipeline

Untitled

Photo OCR (Optical Character Recognition) pipline
1. Text detection
2. Character segmentation
3. Character classification

Sliding windows

Supervised learning for pedestrian detection
Move n pixels (step-size) in each steps

Untitled

Text detection
1D Sliding window for character segmentation

Getting lots of data : Artificial data synthesis

Synthesizing data by introducing distortions(E.g. Audio : Background noise, bad cellphone connection)(E.g. intensity (brightness) of pixel $i$ / $x_i \larr x_i+random \ noise$)
Usually does not help to add purely random / meaningless noise to your data.
Distortion introduced should be representation of the type of noise / distortions in the test set.

Discussion on getting more data

Make sure you have a low bias classifier before expanding the effort. (plot learning curve)
Think "How much work would it be to get 10x as much data as we currently have?"

Artificial data synthesis
Collect/label it yourself
"Crowd source"

Ceiling analysis : What part to the pipeline to work on next

Estimating the errors due to each component (Ceiling analysis)

Untitled

저작자표시 비영리 변경금지

'Coding > Coursera' 카테고리의 다른 글

[Andrew Ng] Neural Network and Deep Learning : 3. One hidden layer Neural Network (0)	2021.10.14
[Andrew Ng] Neural Network and Deep Learning : 2. Basics of Neural Network Programming (3) 프로그래밍 과제 정리 (0)	2021.10.12
[Andrew Ng] Neural Network and Deep Learning : 2. Basics of Neural Network Programming (2) (0)	2021.10.08
[Andrew Ng] Neural Network and Deep Learning : 2. Basics of Neural Network Programming (1) (0)	2021.10.06
[Andrew Ng] Neural Network and Deep Learning : 1. Introduction to Deep Learning (0)	2021.09.28

현재글[Andrew Ng] Machine Learning 정리