Coding/Coursera

[Andrew Ng] Machine Learning 정리

폴밴 2021. 9. 1. 09:28

* Notion에서 작성한 내용을 마크다운으로 가져와 작성된 글입니다. KaTeX 수식은 제대로 표기되지 않을 수 있습니다. 강의 내용에 대한 저작권은 Stanford University - Andrew Ng 에게 있습니다.

Supervised Learning

  • already know what our correct output should look like.
  • having the idea that there is a relationship between the input and the output.

Regression

  • continuous output ex) age, price

Classification

  • discrete output ex) tumor is malignant or benign

Unsupervised Learning

  • little or no idea what our results should look like.
  • variables in data → find relaionships → cluster data →derive structure
  • clustering / non-clustering

Model Representation

  • $x$ : input variables , input features
  • $y$ : output , target variables
  • $(x,y)$ : trainig example
  • $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)})...(x^{(i)},y^{(i)})$ : training set
    ($i=1\ to\ m$, "$i$" is simply index into the training set)
  • $X$ : space of input vales
  • $Y$ : space of output values (X = Y = R)
  • function $h : X \rarr Y$

Cost Function

  • measure the accuracy of our hypothesis function
  • takes an average difference of all the results of the hypothesis with actual output y.

$$J(\theta_0,\theta_1)=\frac{1}{2m} \sum_{i=1}^{m} (\hat{y_i}-y_i)^2=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2$$

→ Goal : choose parameter $\theta_0 , \theta_1$ to minimize cost function $J$

Cost Function Intuition 1

assume there is one parameter $\theta_0$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled.png

Cost funtion is 0 when hypothesis is identical to actual output y.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%201.png

Cost function becomes bigger value when hypothesis is different from y.

Cost Function Intuition 2

  • taking two parameters $\theta_0,\theta_1$
  • $h_\theta(x)=\theta_0+\theta_1x$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%202.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%203.png

contour plot / contour figure

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%204.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%205.png

choosing different parameters to minimize cost function (contour plot on the right)

Gradient Descent

  • Start with some $\theta_0,\theta_1$
  • Keep changing $\theta_0,\theta_1$ to reduce $J(\theta_0,\theta_1)$ until we hopefully end up at a minimum.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%206.png

the slope of the tangent is the derivative at that point.
the size of each step is determined by learnig rate alpha.

Gradient Descent Algorithm

  • repeat until convergence

$$\theta_j := \theta_j-\alpha \frac {\partial} {\partial \theta_j} J(\theta_0,\theta_1) \\\ (\bold{simultaneously} \ update \ j=0 \ and \ j=1)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%207.png

finding theta which is at local optima

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%208.png

learning rate has to be proper value

Gradient Descent for Linear Regression

  • Linear Regression Model

$$J(\theta_0,\theta_1)=\frac{1}{2m} \sum_{i=1}^{m} (\hat{y_i}-y_i)^2= \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2$$

  • Gradient Descent Algorithm

$$\theta_j := \theta_j-\alpha \frac {\partial} {\partial \theta_j} J(\theta_0,\theta_1) \\\ (\bold{simultaneously} \ update \ j=0 \ and \ j=1)$$

$$j=0 \ : \ \frac {\partial}{\partial \theta_0} J(\theta_0,\theta_1)=\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})$$

$$j=1 \ : \ \frac {\partial}{\partial \theta_1} J(\theta_0,\theta_1)=\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$$

  • repeat until convergence

$$\theta_0 := \theta_0-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \
\theta_1 := \theta_1-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$$

  • "Batch" Gradient Descent

Each step of gradient descent uses all the training examples.

Gradient Descent for Multiple Variables

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%209.png

  • $n$ = number of features
  • $m$ = number of training sets
  • $x_n^{(m)}$
  • $x^{(i)}$ = input of $i^{th}$ training example
  • $x_j^{(i)}$ = value of feature $j$ in $i^{th}$ training example

ex ) $x_2^{(3)}=3$ , $x_1^{(2)}=1416$

$$h_\theta(x) = \theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+...\theta_nx_n$$

$$X=
\begin{bmatrix}
x_0 \
x_1 \
x_2 \
... \
x_n
\end{bmatrix}

\Theta=
\begin{bmatrix}
\theta_0 \
\theta_1 \
\theta_2 \
... \
\theta_n
\end{bmatrix}$$

$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+...\theta_nx_n
= \Theta^TX$$

  • previously $(n=1)$
  • $$\theta_0 := \theta_0-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \
    \theta_1 := \theta_1-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$$
  • new algorithm $(n\geq1)$ $(x_0^{(i)}=1)$
      theta = theta - (alpha / m * sum((predictions-y).*X))';
  • $$\theta_0 := \theta_0-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_0^{(i)} \\
    \theta_1 := \theta_1-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_1^{(i)} \\
    \theta_2 := \theta_2-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_2^{(i)} \\
    \vdots \\
    \boxed{\theta_j := \theta_j-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)}}$$

Feature Scaling

  • to make sure features are on similar scale.
  • get every feature into approximately a $(-1 \le x_i \le 1)$ range.

$$x_1 = size \ (0 \sim 2000 \ ft^2) \
x_2 = number \ of \ bedrooms \ (1 \sim 5)$$

$$\to x_1 = \frac {size(ft^2)} {2000} \
\to x_2 = \frac {number \ of \ bedrooms} {5}$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2010.png

Mean Normalization

  • get every feature into approximately a $(-0.5 \le x_i \le 0.5)$ range.

$$x_i:=\frac {x_i-\mu_i} {s_i}$$

  • $\mu_i = average \ value \ of \ x_i$
  • $s_i = range \ of \ x_i = (max-min)$

Learning Rate

$$\theta_j :=\theta_j-\bold{\alpha} \frac {\partial} {\partial\theta_j} J(\theta) $$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2011.png

  • For sufficiently small $\alpha$ , $J(\theta)$ should decrease on every iteration.
  • But if $\alpha$ is too small, gradient descent can be slow to converge.
  • To choose $\alpha$, try $\boxed{...,0.001,0.003,0.01,0.03,0.1,0.3,1,...}$

Polynomial Regression

  • We can combine multiple features into one. ( Ex. $x_3 \gets x_1 \ \centerdot \ x_2$ )
  • We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, square root, etc.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2012.png

Feature scaling becomes important when we choose features this way.

Normal Equation

  • Unlike gradient descent, normal equation is a method to solve for $\theta$ analytically.

$$J(\theta_0,\theta_1,...,\theta_m)=\frac {1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2
\
\frac {\partial}{\partial\theta_j}J(\theta)=\dotsm=0 \ (for \ every \ j)
\
\underline{Solve \ for \ \theta_0,\theta_1,....\theta_n}$$

$$x^{(i)}=
\begin{bmatrix}
x_0^{(i)}\
x_1^{(i)} \
x_2^{(i)} \
... \
x_n^{(i)}
\end{bmatrix}
\ (x^{(i)})^T= [x_0^{(i)} x_1^{(i)} x_2^{(i)} \dots x_n^{(i)} ]
\
X=
\begin{bmatrix}
x_0^{(1)} \ x_1^{(1)} \ x_2^{(1)} \cdots \ x_n^{(1)}\
x_0^{(2)} \ x_1^{(2)} \ x_2^{(2)} \cdots \ x_n^{(2)} \
x_0^{(3)} \ x_1^{(3)} \ x_2^{(3)} \cdots \ x_n^{(3)} \
... \
x_0^{(m)} \ x_1^{(m)} \ x_2^{(m)} \cdots \ x_n^{(m)}
\end{bmatrix}
=\begin{bmatrix}
(x^{(1)})^T \
(x^{(2)})^T \
(x^{(3)})^T \
... \
(x^{(m)})^T
\end{bmatrix}
$$

$$\boxed {\theta = (X^TX)^{-1}X^Ty}$$

Gradient Desent

  • Need to choose $\alpha$.
  • Needs many iterations.
  • Works well even when $n$ (features) is large.

Normal Equation

  • No need to choose $\alpha$.
  • Don't need to iterate.
  • Need to compute $(X^TX)^{-1}$.
  • Slow if $n$ is very large. ($n>10^4)$

Vectorial Implementation

$$h_\theta(x)=\sum_{j=0}^n \theta_jx_j=\theta^Tx$$

Unvectorized implementation

prediction = 0.0;
for j = 1:n+1;
    prediction = prediction 
                                + theta(j) * x(y)
end;

Vectorized implementation

prediction = theta' * x;

Logistic Regression

Classification

  • ex. email - spam or not spam / tumor - malignant or benign

$$y \ni \lbrace 0,1 \rbrace$$

0 : Negative Class

1 : Positive Class

  • Discrete Value

$$if \ h_\theta(x) \geq 0.5, \ y=1
\
if \ h_\theta(x) < 0.5, \ y=0$$

Hypothesis Representation

  • Using linear regression to classification problem performs very poorly, so we have to chage the form.
  • Logistic Function

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2013.png

$$h_\theta(x)=g(\theta^Tx)
\
z=\theta^Tx
\
g(z)=\frac{1}{1+e^{-z}}$$

  • $h_\theta(x)$ = estimated *probability *that y=1 on input x

$$h_\theta(x)=P(y=1|x;\theta)

=1-P(y=0|x;\theta)$$

$$predict \ \text{\textquotedblleft}y=1" \ if \ h_\theta(x)=g(z) \geq0.5
\
\underline{ when \ z=\theta^T \geq 0}$$

$$predict \ \text{\textquotedblleft}y=0" \ if \ h_\theta(x)=g(z) <0.5
\
\underline{when \ z=\theta^T <0}$$

Decision Boundary

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2014.png

$$x=
\begin{bmatrix}
1\
x_1 \
x_2 \
\end{bmatrix}
\
\theta=
\begin{bmatrix}
-3\
1\
1\
\end{bmatrix}$$

$$h_\theta(x)=g(\theta^Tx)=g(-3+x_1+x_2)
\
predict \ \text{\textquotedblleft}y=1" \ if\ \ \bold{z=-3+x_1+x_2\geq0}$$

$$h_\theta(x)=g(\theta^Tx)=g(-3+x_1+x_2)
\
predict \ \text{\textquotedblleft}y=0" \ if\ \ \bold{ z=-3+x_1+x_2<0}$$

$$ Decision \ Boundary \\therefore x_1+x_2=3$$

  • The input to the sigmoid function g(z) (e.g. $\theta^TX)$ doesn't need to be linear.
  • e.g.
    $z=\theta_0+\theta_1x_1^2+\theta_2x_2^2$ (circle)
    or any shape of data

Cost Function

$$J(\theta)=
\frac{1}{m} \sum_{i=1}^{m} \underline{\frac1 2 (h_\theta(x^{(i)})-y^{(i)})^2}
=\frac{1}{m} \sum_{i=1}^{m}\underline{Cost(h_\theta(x^{(i)}),y^{(i)})}$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2015.png

$$Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(h_\theta(x)) \ \underline{ if \ y=1}$$

$$h_\theta(x)=1
\ Cost = 0$$

$$h_\theta(x) \rarr 0
\ Cost \rarr \infin$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2016.png

$$Cost(h_\theta(x^{(i)}),y^{(i)}) = -log(1-h_\theta(x)) \ \underline{ if \ y=0}$$

$$h_\theta(x)=0
\ Cost = 0$$

$$h_\theta(x) \rarr 1
\ Cost \rarr \infin$$

Simplified Cost Function and Gradient Descent

  • Simplified Cost Function

$$J(\theta)
=\frac{1}{m} \sum_{i=1}^{m}\underline{Cost(h_\theta(x^{(i)}),y^{(i)})}\=- \frac{1}{m} [ \sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))]$$

  • Vectorized Implementation

$$h=g(X\theta) \
J(\theta)= \frac 1 {m} \cdot (-y^Tlog(h)-(1-y)^Tlog(1-h))$$

  • Gradient Descent

$$\theta_j \colonequals \theta_j-\alpha \frac{\partial} {\partial \theta_j}J(\theta)$$

$$\theta_j \colonequals \theta_j-\frac \alpha m \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$

  • Vectorized Implementation

$$\theta \colonequals \theta-\frac \alpha m X^T(g(X\theta)-\overrightarrow{y})$$

Algorithm looks identical to linear regression, but $h_\theta(x)$ is Logistic Function.

Optimization Algorithm

  • Given $\theta$, we have code that can compute $J(\theta) , \ \frac{\partial}{\partial \theta_j}J(\theta) \ (for \ j=0,1,2 \dots,n)$.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2017.png

options = optimset('GradObj','on','MaxIter',100); % optimset() function creates an object containing the options we want to send to fminunc()
initialtheta = zeros(2,1);
[optTheta, functionVal, exitFlag] ...
           = fminunc(@costFunction, initialTheta, options);

Multiclass Classification

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2018.png

$$y \in {\lbrace0,1,...,n\rbrace}
\
h_\theta^{(0)}(x)=P(y=0|x;\theta) \
h_\theta^{(1)}(x)=P(y=1|x;\theta)
\
\cdots
\
h_\theta^{(n)}(x)=P(y=n|x;\theta)
\
prediction=\underset{i}{max}(h_\theta^{(i)}(x))$$

  • Train a logistic regression classfier $h_\theta(x)$ for each class to predict the probability that $y=i$.
  • To make a prediction on a new $x$, pick the class that maximizes $h_\theta(x)$.

$$prediction=\underset{i}{max}(h_\theta^{(i)}(x))$$

  • $h_\theta(x)$ is a probability that $y=i$, by logistic function definition.

Regularization

The Problem of Overfitting

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/0B05AE56-26FA-4254-99C3-81E5E29E83E0.jpegMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2019.png

  • If we have too many features, hypothesis may fit the training set very well, but fail to generalize to new examples. →Overfitting
  • To fix this issue, there are two options : Reducing number of features / Regularization.

Cost Function

Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2020.png

  • Suppose we penalize and make $\theta_3,\theta_4$ very small.

$$\underset {\theta}{min} \frac1{2m} \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})^2 + \underline{1000\theta_3^2+1000\theta_4^2}$$

  • The goal is to minimize cost function, therefore by multiplying large value for $\theta_3 \ and \ \theta_4$ makes $\bold{\theta_3 \approx 0 \ and \ \theta_4 \approx 0}$.

Regularization

$$J(\theta)=\frac 1{2m} \sum_{i=1}^{m} [(h(x_\theta^{(i)})-y^{(i)})^2 + \boxed{\lambda \sum_{j=1}^n\theta_j^2}]$$

  • Small values for parameter $\theta$ makes hypothesis simpler and less prone to overfitting.
  • If $\lambda$ is extremely large value, algorithm results in underfitting. ($h_\theta(x)=\theta_0$)

Regularized Linear Regression

Gradient Descent

$$\theta_0 \coloneqq \theta_0-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_0^{(i)}$$

$$\theta_j \coloneqq \theta_j-\alpha [(\frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)})+ \frac \lambda m \theta_j]$$

$$\theta_j \coloneqq \theta_j(1-\alpha \frac \lambda m)-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)}
\ (1-\alpha \frac \lambda m )< 1$$

Normal Equation

$$X=\begin{bmatrix}
(x^{(1)})^T
\
\vdots
\
(x^{(m)})^T
\end{bmatrix}
\ (m \times (n+1) \ matrix ) \
y=\begin{bmatrix}
y^{(1)}
\
\vdots
\
y^{(m)}
\end{bmatrix}$$

$$\theta = (X^TX+\underline{\lambda \cdot L})^{-1}X^Ty$$

$$L = \begin{bmatrix}
0&0&0&\cdots&0
\
0&1&0&\cdots&0
\
0&0&1&\cdots&0
\
\vdots& & &\ddots&\vdots
\
0&&&\cdots&1
\end{bmatrix}
\ ((n+1)\times(n+1) \ matrix)$$

  • Regularization takes care of non-invertible issue.

Regularized Logistic Regression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2021.png

$$J(\theta)
=- \frac{1}{m} [ \sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))] + \boxed{ \frac{\lambda} {2m} \sum_{j=1}^n \theta_j^2}$$

Gradient Descent

$$\theta_0 \coloneqq \theta_0-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_0^{(i)}$$

$$\theta_j \coloneqq \theta_j-\alpha [(\frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)})+ \frac \lambda m \theta_j]$$

Advanced Optimization

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2022.png

function [jVal, gradient] = costFunction(theta);
jVal = [code to compute J(theta)];
gradient(1) = [code to compute partial derivative J(theta)]; 
gradient(2) = [code to compute partial derivative J(theta)];
gradient(3) = [code to compute partial derivative J(theta)];
...
gradient(n+1) = [code to compute partial derivative J(theta)]; % because code starts with 1, while theta_n starts from 0.

Neural Networks : Representation

  • Neurons are basically computational units that take inputs (Dendrites) that are channeled to outputs (Axons).
  • Sigmoid Activation Function (same as logistic function)

$$h_\theta(x) =g(\theta^Tx)= \frac 1 {1+e^{-\theta^Tx}}$$

  • $\theta$ : weights (parameters)

Model Representation 1

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2023.png

  • Layer 1 : Input Layer
    Layer 2 : Hidden Layer
    Layer 3 : Output Layer

$$\begin{bmatrix}
x_0
\
x_1
\
x_2
\
x_3
\end{bmatrix}
\rarr
\begin{bmatrix}
a_1^{(2)}
\
a_2^{(2)}
\
a_3^{(2)}
\end{bmatrix}
\rarr
h_\theta(x)$$

  • $a_i^{(j)}$ : "activation" of unit $i$ in layer $j$.
  • $\Theta^{(j)}$ : matrix of weights controlling function mapping from layer $j$ to layer $j+1$.

$$a_1^{(2)}=
g(\begin{bmatrix}
\theta_{10}^{(1)}
\
\theta_{11}^{(1)}
\
\theta_{12}^{(1)}
\
\theta_{13}^{(1)}
\end{bmatrix}
\times
\begin{bmatrix}
x_0
\
x_1
\
x_2
\
x_3
\end{bmatrix}
)
=g(\Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3)
\
a_2^{(2)}=g(\Theta_{20}^{(1)}x_0+\Theta_{21}^{(1)}x_1+\Theta_{22}^{(1)}x_2+\Theta_{23}^{(1)}x_3)
\
a_3^{(2)}=g(\Theta_{30}^{(1)}x_0+\Theta_{31}^{(1)}x_1+\Theta_{32}^{(1)}x_2+\Theta_{33}^{(1)}x_3)
\
h_\Theta(x)=a^{(3)}1= g(\Theta{10}^{(2)}a_0^{(2)}+\Theta_{11}^{(2)}a_1^{(2)}+\Theta_{12}^{(2)}a_2^{(2)}+\Theta_{13}^{(2)}a_3^{(2)})
\
(h_\Theta(x) \ can \ be \ viewed \ as \ unit \ 1 \ in \ layer \ 3)$$

  • If network hasthen $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j+1)$.
    ( +1 comes from $x_0$ and $\Theta_0^{(j)}$, "bias nodes", ex. $\Theta_{10}^{(1)}x_0$ )
  • $$s_j \ units \ in \ layer \ j,
    \
    s_{j+1} \ units \ in \ layer \ j+1$$

Model Representation 2

Forward Propagation (Vectorized Implementation)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2024.png

  • Let $z_1^{(2)}$ ( unit 1 in layer 2 ) asthen $z_i^{(j)}$ can be vectorized as
  • $$x=a^{(1)}
    =\begin{bmatrix}
    x_0
    \
    x_1
    \
    x_2
    \
    x_3
    \end{bmatrix}
    \ , \
    z^{(2)}=
    \begin{bmatrix}
    z_1^{(2)}
    \
    z_2^{(2)}
    \
    z_3^{(2)}
    \end{bmatrix}
    \
    \rarr z^{(2)}=\Theta^{(1)}a^{(1)}
    \
    \vdots
    \
    z^{(j+1)}=\Theta^{(j)}a^{(j)}
    \
    (a_0^{(j)}=1 \ (bias \ unit))$$
  • $$z_1^{(2)} = \Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3
    \
    a_1^{(2)}=g(z_1^{(2)})$$
  • The last parameter matrix $\Theta^{(j)}$ will have only one row (because $jth$ layer has only one unit ($h_\theta(x)$)
    which is multiplied by one column $a^{(j)}$ so that our result is a single number.

$$h_\Theta(x) = a^{(j+1)}=g(z^{(j+1)})=g(\Theta^{(j)}a^{(j)})$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2025.png

Examples & Intuitions 1

  • If first parameter matrix is $\Theta^{(1)}=[-30 \ \ 20 \ \ 20]$
  • $$h_\Theta(x)=g(-30+20x_1+20x_2)
    \
    x_1=0 \ and \ x_2=0 \ then \ | \ g(-30) \approx0
    \
    x_1=0 \ and \ x_2=1 \ then \ | \ g(-10) \approx0
    \
    x_1=1 \ and \ x_2=0 \ then \ | \ g(-10) \approx0
    \
    x_1=1 \ and \ x_2=1 \ then \ | \ g(10) \approx1
    \
    \rarr AND$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2026.png

  • OR

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2027.png

Examples & Intuitions 2

To make $x_1 \ XNOR \ x_2$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2028.png

  • Use $\Theta^{(1)}$ matrix that combines the values for AND and NOR.\begin{bmatrix}
    -30 & 20 &20
    \
    10 & -20 & -20
    \end{bmatrix}$$
  • $$\Theta^{(1)}
  • Use $\Theta^{(2)}$ matrix that uses the value for OR.
  • $$\Theta^{(2)}=
    \begin{bmatrix}
    -10 & 20 & 20
    \end{bmatrix}$$
  • Then $a_1^{(3)}(=h_\Theta(x))$ becomes XNOR.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2029.png

  • Neural Network Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2030.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2031.png

Multi-class Classification

Let hypothesis function return a vector of values.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2032.png

$$\begin{bmatrix}
x_0
\
x_1
\
x_2
\
\vdots
\
x_n
\end{bmatrix}
\rarr
\begin{bmatrix}
a_0^{(2)}
\
a_1^{(2)}
\
a_2^{(2)}
\
\vdots
\end{bmatrix}
\rarr
\begin{bmatrix}
a_0^{(3)}
\
a_1^{(3)}
\
a_2^{(3)}
\
\vdots
\end{bmatrix}
\rarr
\cdots
\rarr
\begin{bmatrix}
h_\Theta(x)_1
\
h_\Theta(x)_2
\
h_\Theta(x)_3
\
h_\Theta(x)_4
\end{bmatrix}$$

Neural Networks : Learning

Cost Function

$$L=total \ number \ of \ layers \ in \ network
\
s_l=number \ of \ units \ (not \ including \ bias \ unit) \ in \ layer \ 1
\
K=number \ output \ units/classes
\
h_\Theta(x)_k=hypothesis \ that \ results \ in \ the \ k^{th} \ output$$

$$J(\Theta)=- \frac {1}{m} [ \sum_{i=1}^m \bold{\sum_{k=1}^{K}} y_k^{(i)}log(h_\theta(x^{(i)}))k + (1-y_k^{(i)})log(1-(h_\Theta(x^{(i)}))_k )] + \bold{\frac{\lambda}{2m} \sum{l=1}^{L-1}\sum_{i=1}^{s_l} \sum_{j=1}^{s_l+1}(\Theta_{ji}^{(l)})^2}
\$$

Backpropagation Algorithm

  • Our goal is to minimize $J(\Theta)$ with parameter $\Theta$.
  • To achieve the goal, we need to compute $J(\Theta)$ and $\frac {\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$
  • To do so, we use backpropagation algorithm.

$$training \ set : (x^{(1)},y^{(1)}), \cdots,(x^{(m)},y^{(m)})
\
Set \ \Delta_{i,j}^{(l)} :=0 \ for \ all \ (l,i,j)$$

$$For \ t=1 \ to \ m:
\

  1. \ Set \ a^{(1)}:=x^{(t)} \ (unit \ of \ first \ layer \ i s\ same \ as \ x)
    \
  2. \ Perform \ Foward \ Propagation \ to \ compute \ a^{(l)} (l=2,3,...,L)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2033.png

$$Using \ y^{(t)}, \
Compute \ \delta^{(L)}=a^{(L)}-y^{(t)}$$

"error values" ($\delta^{(L)}$) for the last layer are simply the differences of our actual results in the last layer ($a^{(L)}$) and correct outputs in y ($y^{(t)}$).

$$Compute \ \delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)} \ using
\
\delta^{(l)}=((\Theta^{(l)})^T \delta^{(l+1)}).a^{(l)}.(1-a^{(l)})
\
(g'(z^{(l)})=a^{(l)}.*(1-a^{(l)}))$$

$$\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_{j}^{(l)}\delta_{i}^{(l+1)}
\
Vectorization \ \rarr \ \Delta^{(l)}:=\Delta^{(l)}+\delta_{i}^{(l+1)}(a^{(l)})^T
$$

Then we update our new $\Delta$ matrix.

$$ D_{i,j}^{(l)}:= \frac 1 m (\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}) \ , if \ j \neq 0.
\
D_{i,j}^{(l)}:= \frac 1 m \Delta_{i,j}^{(l)}\ , if \ j = 0.$$

matrix $D$ is used as an "accumulator" to add up our values and eventually compute our partial derivative of $J(\Theta)$.

$$\frac {\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)=D_{ij}^{(l)}$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2034.png

Backpropagation Intuition

  • $\delta_j^{(l)}$ is the "error" for $a_j^{(l)}$.
  • delta values are actually the derivative of the cost function.

$$\delta_j^{(l)}=\frac {\partial} {\partial z_j^{(l)}} \ cost(t)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2035.png

  • To calculate $\delta_2^{(2)}$, we multiply the weights $\Theta_{12}^{(2)}$ and $\Theta_{22}^{(2)}$ by their respective $\delta$ values.

$$\delta_2^{(2)}=\Theta_{12}^{(2)}\delta_1^{(3)}+\Theta_{22}^{(2)}\delta_2^{(3)}$$

Implementation Note : Unrolling Parameters

  • To use optimizing function as "fminunc()", we should unroll all elements.

$$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)},...$$

thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
% becomes column vector
deltaVector = [ D1(:); D2(:); D3(:) ];
% becomes column vector 
  • If Theta1 is 10x11, Theta2 is 10x11, Theta3 is 1x1 matrices, we can get back our original matrices from the "unrolled" versions.
Theta1 = reshape (thetaVector (1:110), 10, 11); 
% select 1 to 110 elements, make 10 by 11 matrix
Theta2 = reshape (thetaVector (111:220), 10, 11);
% select 111 to 220 elements, make 10 by 11 matrix
Theta3 = reshape (thetaVector (221:231), 1, 11);
% select 221 to 231 elements, make 1 by 11 row vector

Gradient Checking

  • To verify whether backpropagation works as intended, we use gradient checking.

$$\frac {\partial} {\partial \Theta_j} J(\Theta) \approx \frac {J(\Theta_1,\cdots,\Theta_j+\epsilon,\cdots,\Theta_n)-J(\Theta_1,\cdots,\Theta_j-\epsilon,\cdots,\Theta_n)} {2 \epsilon}$$

  • Once you have verified that backpropagation is correct, you should disable gradient checking. (code to compute gradApprox is very slow)

Random Initialization (Breaking Symmetry)

  • If we choose all zero elements for initial $\Theta$ matrix, all nodes will update to the same value repeatedly while performing backpropagate.
  • So we should initialize $\Theta$ to a random value in $[-\epsilon,\epsilon]$.

$$-\epsilon \leq \Theta_{ij}^{(l)} \leq \epsilon$$

Theta1 = rand(10,11) * (2 * init_epsilon) - init_epsilon;
Theta2 = rand(1,11) * (2 * init_epsilon) - init_epsilon;

Summary (Putting it together)

Pick a Network Architecture

  • Number of input units : Dimension of features $x^{(i)}$
  • Number of output units : Number of classes ($y_k)$
  • Resonable Default : 1 hidden layer, or if >1 hidden layer, have same number of hidden units in every layer

Training a Neural Network

  1. Randomly initialize wieghts ($\Theta)$
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement code to compute cost function $J(\Theta)$
  4. Implement backpropagation to compute partial derivatives $\frac {\partial} {\partial \Theta_{jk}^{(l)}} J(\Theta)$
  5. for i = 1:m { Perform forward propagation and backpropagation using example (x_i, y_i) Get activations a^l and delta terms delta^l for l = 2 to L }
  6. Use gradient checking to compare $\frac {\partial} {\partial \Theta_{jk}^{(l)}} J(\Theta)$ computed using backpropagation vs. using numerical estimate of gradient of $J(\Theta)$

Then disable gradient checking code
6. Use gradient descent or advanced optimization method with backpropagation to try to minimze $J(\Theta)$ as a function of parameters $\Theta$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2036.png

Advice for applying Machine Learning

Evaluating a hypothesis

Trouble shooting for errors in our predictions by :

  • Getting more training examples
  • Trying smaller sets of features
  • Trying additional features
  • Trying polynomial features
  • Increasing or decreasing $\lambda$

We can split up the data into to sets : training set (70%) / test set (30%)

  1. Learn $\Theta$ and minimize $J_{train} (\Theta)$ using the training set
  2. Compute the test set error $J_{test}(\Theta)$

The Test Set Error

  • Linear Regression

$$J_{test}(\Theta)= \frac {1}{2m_{test}} \sum_{i=1}^{m_{test}}
(h_{\Theta}(x_{m_{test}})-y_{m_{test}}^{(i)})^2$$

  • Classification

$$err(h_\Theta(x),y)=1 \ \
(if \ h_\Theta(x) \geq 0.5 \ and \ y=0,
\ or \ h_\Theta(x) < 0.5 \ and \ y=1)
\
err(h_\Theta(x),y)=0 \ (otherwise)$$

$$Test \ Error = \frac 1 {m_{test}}
\sum_{i=1}^{m_test}
err(h_\Theta(x_{test}^{(i)}),y_{test}^{(i)})$$

  • Logistic Regression

$$J_{test}(\theta)= - \frac 1 {m_{test}}
\sum_{i=1}^{m_{test}}
y_{test}^{(i)} \ log (h_\theta(x_{test}^{(i)}))+(1-y_{test}^{(i)})log(h_\theta(x_{test}^{(i)}))$$

Model Selection and Training / Validation / Test sets

The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any data set.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2037.png

One way to break down our dataset into the three sets is :

  • Training Set (60%)
  • Cross Validation Set (20%)
  • Test Set (20%)

We can now calculate three separate error values :

  1. Optimize the parameters in $\bold\Theta$ using the data set for each polynomial degree.
  2. Find the polynomial degree $\bold d$ with the least error using the cross validation set.
  3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2038.png

Diagnosing Bias vs. Variance

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2039.png

  • Bias (Underfit) - small degree of polynomial d
    $J_{train}(\Theta)$ will be high, $J_{cv}(\Theta) \approx J_{train}(\Theta)$
  • Variance (Overfit) - large degree of polynomial d
    $J_{train}(\Theta)$ will be low, $J_{cv}(\Theta) >> J_{train}(\Theta)$

Regularization and Bias / Variance

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2040.png

  • Small $\lambda$ : high variance (overfit)
  • Large $\lambda$ : high bias (underfit)
  • To choose the model & regularization term:
  1. Create list of $\lambda$ ( 0, 0.01, 0.02, 0.04, 0.08, ... , 10.24 ).
  2. Create a set of models with different degrees.
  3. Go through all the models to learn some $\Theta$ for each $\lambda$.
  4. Compute cross validation error on the $J_{cv}(\Theta)$ without regularization.
  5. Select the best combo that produces the lowest error on the cross validation set.
  6. Apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.

Learning Curves

  • High Bias (Underfitting)
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2041.png
  • High Variance (Overfitting)
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2042.png

Debugging a Learning Algorithm

  • Fixes high bias (underfitting)
    Try getting additional features
    Try adding polynomial features
    Try decreasing $\lambda$
  • Fixes high variance (overfitting)
    Get more training examples (training sets)
    Try smaller sets of features
    Try increasing $\lambda$ (heavier regularization)
  • Small Neural Network
    more prone to underfitting (Computationally cheaper)
  • Large Neural Netwok
    more prone to overfitting (Computationally expensive)
    Use regualrization to address overfitting.

Machine Learning System Design

Prioritizing What to Work On

  • How could we spend our time to improve the accuracy of this classifier (spam classifier)?

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2043.png

  • Collect lots of data
  • Develop sophisticated features
  • Develop algorithms to process your input in different ways

It is difficult to tell which of the options will be most helpful.

Error Analysis

  • Recommended Approach
    1. Start with simple algorithm, implement quickly, and test it early on your cross validation data.
    2. Plot learning curves to decide if more data, more features, etc. are likely to help.
    3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
  • It is very important to get error results as a single, numerical value.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2044.png

Error metrics for skewed classes

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2045.png

Precision / Recall

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2046.png

  • Precision
    (Of all patients where we predicted $y=1$, what fraction actually has caner?)\frac { True \ Positive } {\bold{ Predicted \ Positive}}$$
  • $$\frac { True \ Positive } {\bold{ True \ Positive + False \ Positive}}
  • Recall
    (Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)\frac { True \ Positive } {\bold{ Actual \ Positive}}$$
  • $$\frac { True \ Positive } {\bold{ True \ Positive + False \ Negative}}

Trading off precision and recall

  • Suppose we want to predict $y=1$ (cancer) only if very confident.Higher precision, Lower recall
  • $$h_\theta(x) \geq \bold{ 0.9}
    \
    h_\theta(x) < \bold{0.9}$$
  • Suppose we want to avoid missing too many cases of cancer (avoid false negatives).Higher recall, Lower precision
  • $$h_\theta(x) \geq \bold{0.3}
    \
    h_\theta(x) < \bold{0.3}$$
  • More generally : predict 1 if $h_\theta(x) \geq threshold$.

F Score

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2047.png

$$⁍$$

Data for Machine Learning

Useful test

  1. Given the input $x$, can a human expert confidently predict $y$?
  2. Can we get a large training set, and train the learning algorithm with a lots of parameters in the training set?

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2048.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2049.png

Support Vector Machines

Optimization Objective

  • Alternative view of logistic regression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2050.png

$$Cost \ of \ example: \=-(y-\log h_\theta(x)+(1-y)\log(1-h_\theta(x)))
\
=-y\log \frac {1} {1+e^{-\theta^Tx}} - (1-y)\log(1-\frac {1} {1+e^{-\theta^Tx}})$$

If $y=1$ , (want $\theta^Tx >>0)$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2051.png

If $y=0$ , (want $\theta^Tx <<0)$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2052.png

  • Support Vector Machine

$$\underset{\theta}
{min}
\frac{1}{m} [ \sum_{i=1}^m y^{(i)}(-\log h_\theta(x^{(i)}))+(1-y^{(i)})(-\log(1-h_\theta(x^{(i)}))]+ \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$

$m$ : constant, doesn't affect minimization

$A+\underline \lambda B \rarr \underline C A+B \ (C=\frac 1 {\lambda})$

$$\rarr \underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac 1 2 \sum_{j=1}^n\theta_j^2$$

$$h_\theta(x)=1 \ (if \ \theta^Tx \geq 0) \
h_\theta(x)=0 \ (otherwise)$$

Large Margin Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2053.png

$$\underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac 1 2 \sum_{j=1}^n\theta_j^2$$

  • SVM Decision Boundary : Linearly Serperable Case
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2054.png
  • Large Margin Classfier in presence of outliers
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2055.png

The mathematics behind Large Margin Classification (optional)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2056.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2057.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2058.png

Kernels 1

  • Given $x$, compute new feature depending on proximity to landmarks $l^{(1)},l^{(2)},l^{(3)}$

$$f_1 = similarity (x,l^{(1)})=\exp(- \frac {|x-l^{(1)}|^2} {2 \sigma^2})
=\exp(- \frac {\sum_{j=1}^n (x_j-l_j^{(1)})^2} {2\sigma^2})$$

$$f_2 = similarity (x,l^{(2)})=\exp(- \frac {|x-l^{(2)}|^2} {2 \sigma^2})$$

$$f_3= similarity (x,l^{(3)})=\exp(- \frac {|x-l^{(3)}|^2} {2 \sigma^2})$$

  • If $x \approx l^{(1)}$
  • $$f_1 \approx \exp(- \frac 0 {2\sigma^2})=1$$
  • If $x$ is far from $l^{(1)}$
  • $$f_1 = \exp (- \frac {(large \ number)^2} {2 \sigma^2}) \approx 0$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2059.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2060.png

Kernels 2

  • Choosing Landmarks $l^{(i)}$

Given $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})$,
Choose $l^{(1)}=x^{(1)},l^{(2)}=x^{(2)},...,l^{(m)}=x^{(m)}$.

Given example $x$ :

$$f_1=similarity (x,l^{(1)})
\ f_2=similarity(x,l^{(2)})
\ \vdots$$

$$feature \ vector \ f=
\begin {bmatrix}
f_0
\f_1
\ \vdots
\f_m
\end {bmatrix}
,
\ f_0=1$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2061.png

  • SVM with Kernels

Hypothesis : Given $x$, compute features $f \in \mathbb{R}^{m+1}$
Predict $y=1$ if $\theta^Tf \geq 0$

Training :

$$\underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^T\bold{f^{(i)}})+(1-y^{(i)})cost_0(\theta^T\bold{f^{(i)}})]+\frac 1 2 \sum_{j=1}^m\theta_j^2$$

$$\sum_{j=1}^m \theta_j^2 = \theta^T \theta=\theta^TM\theta \ (M:rescale \ factor)$$

  • SVM parameters
  • $C= \frac 1 \lambda$

Large $C$ : Lower bias, high variance, less regularized ( small $\lambda$ )

Small $C$ : Higher bias, lower variance, more regularized ( large $\lambda$ )

  • $\sigma^2 \ \ (\exp(-\frac {|x-l|^2} {2 \sigma^2})$

Large $\sigma^2$ : Features $f_i$ vary more smoothly. Higher bias, lower variance.

Small $\sigma^2$ : Features $f_i$ vary less smoothly. Lower bias, higher variance.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2062.png

Using SVM

Use SVM software package (e.g. liblinear, libsvm, ...) to solve for parameters $\theta$.

Need to specify :

  • Choice of parameter C
  • Choice of kernel (similarity function) :
    No kernel (linear kernel), Gaussian kernel, ploynomial kernel, ...
  • Perform feature scaling before using the Gaussian kernel.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2063.png

  • Multi-class classification

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2064.png

Many SVM packages already have built-in multi-class classification.

Otherwise, use one-vs-all method.

  • Logistic regression vs. SVMs
    • If $n$ is large (relative to m) (e.g. $n$=10,000 , $m$ = 10~1,000)
    • Use Logistic Regression, or SVM without a kernel (linear kernel)
    • If $n$ is small, $m$ is intermediate
    • Use SVM with Gaussian Kernel
    • If $n$ is small, $m$ is largeNeural Network likely to work well for most of these settings, but may be slower to train.
    • Add more features, then use Logistic Regression or SVM without a kernel
  • $n$ : number of features, $m$ : number of training examples

Clustering

Unsupervised Learning

In Unsupervised Learning, you are given an unlabled dataset (without labels $y^{(i)}$) and are asked to find some structure in the data.

K-means Algorithm

  1. Randomly Initialize cluster centroids
  2. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2065.png
  3. Assign cluster
  4. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2066.png
  5. Move cluster centroid to average point
  6. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2067.png
  7. Repeat until it converges
  8. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2068.png
  • InputTraining Set : $x^{(1)},x^{(2)},...,^{(m)}$
  • $K$ : number of clusters
  • Algorithm

Randomly initialize K cluster centroids $\mu_1,\mu_2,...,\mu_K \in \mathbb{R}^n$

Repeat {

  1. Cluster Assignment Step$c^{(i)}$ := index of cluster centroid closest to $x^{(i)}$. (1,2,...,K)
  2. for $i=1 \ to \ m$
  3. Move to Centroid$\mu_k$ := average of points assigned to cluster $k$ }
  4. for $k=1 \ to \ K$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2069.png

Optimization Objective

$c^{(i)}$ = index of cluster to which example $x^{(i)}$ is currently assigned

$\mu_k$ = cluster centroid $k$

$\mu_{c^{(i)}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned

(E.g. If $x^{(i)}$ is assigned in cluster 5 then $c^{(i)}=5$, $\mu_{c^{(i)}}=\mu_5$)

  • Optimization objective (Distortion Cost Function / K-means Algorithm)

$$J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)= \frac 1 m \sum_{i=1}^m |x^{(i)}-\mu_{c^{(i)}}|^2
\
min \ J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$$

Random Initialization

  • How to randomly initialize K cluster centroids?
    1. Should have $K>m$
    2. Randomly pick $K$ training examples.
    3. Set $\mu_1,\cdots,\mu_K$ equal to these $K$ examples.
    4. $\mu_1 = x^{(1)} , \mu_2 = x^{(2)} ,\cdots$
  • To avoid local optima, initialize K-means lots of times & pick clustering that gave lowest cost $J$.
    1. Randomly initialize K-means.
    2. Run K-means. Get $c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K$.
    3. Compute cost function (distortion) $J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)$ }
  • For $i=1 \ to \ 100$ {

Choosing the number of clusters

  • Usually chosen by hand or human insight.
  • Elbow method :

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2070.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2071.png

Dimensionality Reduction

Motivation 1 : Data Compression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2072.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2073.png

Motivation 2 : Data Visualization

  • Reduce data to max 3-dimensional data

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2074.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2075.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2076.png

Principal Component Analysis problem formulation (PCA)

  • 2-dimension to 1-dimension :
    Find a direction (a vector $u^{(1)} \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2077.png

  • n-dimension to k-dimension :
    Find $k$ vectors $u^{(1)},u^{(2)},u^{(3)},\cdots,u^{(k)}$ onto which to project the data, so as to minimize the projection error.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2078.png

  • PCA is not linear regression

Principal Component Analysis Algorithm

  • Data PreprocessingPreprocessing (feature scaling / mean normalization) :Replace each $x_j^{(i)}$ with $x_j-\mu_j$. (ensure every feature has zero mean.)
  • If different features on different scales, perform feature scaling.
    $x_j^{(i)} \larr \frac {x_j^{(i)}-\mu_j} {s_j}$
  • $$\mu_j= \frac 1 m \sum_{i=1}^m x_j^{(i)}$$
  • Training Set : $x^{(1)},x^{(2)},...,x^{(m)}$
  • Compute "Covariance matrix $\Sigma$"
  • $$\Sigma= \frac 1 m \sum_{i=1}^n (x^{(i)})(x^{(i)})^T$$
  • Compute "Eigenvectors" of matrix $\Sigma$$U_{reduce}$ is matrix $U$ with $k$ selected columns.
  • $z^{(i)}$ is projected $x^{(i)}$.
  • [U, S, V] = svd(Sigma); Ureduce = U(:, 1:k); z = Ureduce' * x;

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2079.png

Reconstruction from compressed representation

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2080.png

$$X \approx X_{approx}^{(i)}=U_{reduce} \cdot z^{(i)}$$

Choosing the number of principal components

Average squared projection error :

$$\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2$$

Total variation in the data :

$$\frac 1 m \sum_{i=1}^m|x^{(i)}|^2$$

Typically, choose $k$ to be smallest value so that

$$\frac { average \ square } {total \ variation} =
\frac {\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2} { \frac 1 m \sum_{i=1}^m|x^{(i)}|^2 }
\leq 0.01 \ (1%)
\
\rarr 99% \ of \ variance \ is \ retained$$

  • Choosing $k$ (compressed dimension) algorithmCompute $U_{reduce}, z^{(1)},z^{(2)},...,z^{(m)},x_{approx}^{(1)},...,x_{approx}^{(m)}$.
    ORPick smallest value of $k$ for which
  • $$\frac {\sum_{i=1}^k S_{ii}} {\sum_{i=1}^m S_{ii}}
    \geq0.99
    \
    \rarr 99% \ of \ variance \ retained$$
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2081.png
  • Check if $\frac {\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2} { \frac 1 m \sum_{i=1}^m|x^{(i)}|^2 }
    \leq 0.01$, if not, increase $k$.
  • Try PCA with $k=1$.

Advice for applying PCA

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2082.png

Mapping $x^{(i)} \rarr z^{(i)}$ should be defined by running PCA only on the training set.

This mapping can be applied as well to the examples in the cross validation and test sets.

  • Application of PCASpeed up learning algorithm (Choose $k$ by % of variance retain)
  • Visualization ($k=2 \ or \ 3$)
  • Reduce memory / disk needed to store data
  • Bad use of PCA : To prevent overfitting
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2083.png
  • Before implementing PCA, first try running whatever you want to do with the original data $x^{(i)}$.
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2084.png

Anomaly Detection

Problem Motivation

Untitled

  1. $x_i$ : features
  2. Model $p(x) (probability)$ from data.
  3. $p(x_{test})<\epsilon$ → Flag Anomaly
    $p(x_{test}) \geq \epsilon$ → OK

Gaussian Distribution

$$X \sim N(\mu,\sigma^2)
\
\mu : mean , \ \sigma^2 : variance$$

$$P(x;\mu,\sigma^2)=\frac 1 {\sqrt{2\pi}\sigma}exp(- \frac {{(x-\mu)}^2} {2 \sigma^2})$$

Untitled

$$\mu= \frac 1 {m} \sum_{i=1}^mx^{(i)}$$

$$\sigma^2=\frac 1 m \sum_{i=1}^m (x^{(i)}-\mu)^2$$

Algorithm

$$x_i \sim N(\mu_i,\sigma_i^2)
\
p(x)=p(x_1;\mu_1,\sigma_1^2) \cdot p(x_2;\mu_2,\sigma_2^2) \cdot p(x_3;\mu_3,\sigma_3^2) \cdots \cdot p(x_n;\mu_n,\sigma_n^2)
\=\Pi_{j=1}^np(x_j;\mu_j,\sigma_j^2)$$

  1. Choose features $x_i$ that you think might be indicative of anomalous examples.
  2. Fit parameters $\mu_1,\mu_2,...,\mu_n, \ \sigma_1^2, \sigma_2^2,...,\sigma_n^2$
    $\mu_j = \frac 1 {m} \sum_{i=1}^mx_j^{(i)}$
    $\sigma_j^2=\frac 1 m \sum_{i=1}^m (x_j^{(i)}-\mu_j)^2$
  3. Given new example $x$, compute $p(x)$
    $p(x)=\Pi_{j=1}^np(x_j;\mu_j,\sigma_j^2)=\Pi_{j=1}^n\frac 1 {\sqrt{2\pi}\sigma}exp(- \frac {{(x-\mu)}^2} {2 \sigma^2})$
  4. Anomaly if $p(x)<\epsilon$

Untitled

Developing and Evaluating an anomaly detection system

Assume we have some labeled data, of anomalous and non-anomalous examples. ($y=0$ if normal, $y=1$ if anomalous)

Training set : $x^{(1)},x^{(2)},...,x^{(m)}$ (assume normal examples not anomalous)

Cross validation set : ($x_{cv}^{(1)},y_{cv}^{(1)}),...,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$

Test set : $(x_{test}^{(1)},y_{test}^{(1)}),...,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$

Untitled

  • Algorithm Evaluation
  1. Fit model $p(x)$ on training set { $x^{(1)},...,x^{(m)}$ }
    (can also use cross validation set to choose parameter $\epsilon$)
  2. On a cross validation / test example $x$, predict
    $y=1$ if $p(x) < \epsilon$ (anomaly)
    $y=0$ if $p(x) \geq \epsilon$ (normal)
  3. Possible evaluation metrics :
  • True positive, false positive, false negative, true negative
  • Precision / Recall
  • $F_1$ score

Anomaly Detection vs. Supervised Learning

UntitledUntitled

  • Anomaly DetectionLarge number of negative examples (normal)
  • Very small number of positive examples (anomalous)
  • Supervised Learning
  • Large number of positive and negative examples

Choosing what features to use

Untitled

  • Error analysis for anomaly detection
    $y=1$ if $p(x) < \epsilon$ (anomaly)
    $y=0$ if $p(x) \geq \epsilon$ (normal)If $p(x)$ is comparable for both normal and anomalous examples, create new feature.
  • Want $p(x)$ large for normal examples $x$, small for anomalous examples $x$.

Untitled

Multivariate Gaussian Distribution

Untitled

Don't model $p(x_1),p(x_2),...$ separately.

Model $p(x)$ all in one go.

Parameters : $\mu \in R^n$, $\Sigma \in R^{n \times n}$ (covariance matrix)

$$p(x;\mu,\Sigma)=\frac 1 {{(2\pi)^{\frac n 2}} |\Sigma|^{\frac 1 2}}exp(-\frac 1 2 (x-\mu)^T\Sigma^{-1}(x-\mu))$$

UntitledUntitledUntitledUntitled

Anomaly Detection using the multivariate Gaussian Distribution

  • Parameter Fitting

$\mu = \frac 1 {m} \sum_{i=1}^mx^{(i)}$

$\Sigma=\frac 1 m \sum_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T$

UntitledUntitledUntitled

Recommender Systems

Problem Formulation

  • Example : Predicting movie ratings$n_u$ = no. of users$r(i,j)=1$ if user $j$ has rated movie $i$
  • $y^{(i,j)}$ = rating givren by user $j$ to movie $i$ (defined only if $r(i,j)=1$ )
  • $n_m$ = no. of movies
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20101.png

Content-based Recommendations

  • For each user $j$, learn $\theta^{(i)} \in \mathbb{R}^{n+1}$.
  • Predict user $j$ as rating movie $i$ with $(\theta^{(i)})^Tx^{(i)}$ stars. $(x_0=1)$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20102.png

$$x^{(3)}=
\begin{bmatrix}
1
\ 0.99
\0
\end{bmatrix}
\ \leftrightarrow \
\theta^{(1)}=
\begin{bmatrix}
0
\ 5
\0
\end{bmatrix}
\
\
(\theta^{(1)})^Tx^{(3)} = 5 \times0.99=4.95$$

$r(i,j)=1$ if user $j$ has rated movie $i$

$y^{(i,j)}$ = rating givren by user $j$ to movie $i$ (defined only if $r(i,j)=1$ )

$\theta^{(j)}$ = parameter vector for user $j$

$x^{(i)}$ = feature vector for movie $i$

$m^{(j)}$ = no. of movies rated by user $j$

For user $j$, movie $i$, predicted rating : $(\theta^{(j)})^T(x^{(i)})$

  • To learn $\theta^{(j)}$ (parameter for user $j$)
  • $$\underset{\theta^{(j)}}{min}
    \frac 1 2 \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{k=1}^n(\theta^{(j)}_k)^2$$
  • To learn $\theta^{(1)},\theta^{(2)},...,\theta^{(n_u)}$ (Optimization Algorithm)
  • $$\underset{\theta^{(1)},...,\theta^{(n_u)}}{min}
    \frac 1 2 \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2$$
  • Gradient descent update$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}) \ (k\not=0)$$
  • $$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} \ (k=0)$$

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20103.png

Collaborative Filtering

  • How to choose features ( $x_1,x_2,...$) ($x_0=1$)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20104.png

$$if \ x^{(1)}=
\begin{bmatrix}
1
\ 1.0
\ 0.0
\end{bmatrix}$$

$$\ (\theta^{(1)})^Tx^{(1)} \approx 5
\ (\theta^{(2)})^Tx^{(1)} \approx 5
\ (\theta^{(3)})^Tx^{(1)} \approx 0
\ (\theta^{(4)})^Tx^{(1)} \approx 0$$

$$\rarr \
\theta^{(1)}=\begin{bmatrix}
0
\5
\0
\end{bmatrix}
\ \theta^{(2)}=\begin{bmatrix}
0
\5
\0
\end{bmatrix}
\ \theta^{(3)}=\begin{bmatrix}
0
\0
\5
\end{bmatrix}
\ \theta^{(4)}=\begin{bmatrix}
0
\0
\5
\end{bmatrix}$$

  • Given parameters $\theta^{(1)},...,\theta^{(n_u)}$, to learn $x^{(1)},...,x^{(n_m)}$
  • $$\underset{x^{(1)},...,x^{(n_m)}}{min}
    \frac 1 2 \sum_{i=1}^{n_m} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2$$
  • Given $x^{(1)},...,x^{(n_m)}$ (features) and movie ratings, can estimate $\theta^{(1)},...,\theta^{(n_u)}$ (parameters)
  • Given parameters $\theta^{(1)},...,\theta^{(n_u)}$, can estimate $x^{(1)},...,x^{(n_m)}$ (features)

$$\theta\ (initial \ random \ guess) \rarr x \rarr\theta \rarr x \rarr\theta \rarr x \rarr \cdots$$

Collaborative Filtering Algorithm

  • Given $x^{(1)},...,x^{(n_m)}$, estimate $\theta^{(1)},\theta^{(2)},...,\theta^{(n_u)}$
  • $$\underset{\theta^{(1)},...,\theta^{(n_u)}}{min}
    \frac 1 2 \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2$$
  • Given parameters $\theta^{(1)},...,\theta^{(n_u)}$, estimate $x^{(1)},...,x^{(n_m)}$
  • $$\underset{x^{(1)},...,x^{(n_m)}}{min}
    \frac 1 2 \sum_{i=1}^{n_m} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2$$
  • Minimizing $x^{(1)},...,x^{(n_m)}$ and $\theta^{(1)},...,\theta^{(n_u)}$ simultaneously$$\rarr
    \underset{x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}}
    {min}
    J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$$
  • $$J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})=
    \frac 1 2 \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}k)^2+\frac \lambda 2 \sum{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2$$
  1. Initialize $x^{(1)},...,x^{(n_m)}$ and $\theta^{(1)},...,\theta^{(n_u)}$ to small random values. $(x \in \mathbb{R}^n, \theta \in \mathbb{R}^n)$
  2. Minimize $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$ using gradient descent (or advanced optimization algorithm)$$x_k^{(i)}:=x_k^{(i)}-\alpha(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(i)})$$
  3. $$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda \theta_k^{(j)})$$
  4. E.g. for every $j=1,...,n_u$ , $i=1,...,n_m$ (not from zero) $( no \ x_0 ,\theta_0)$
  5. For a user with parameters $\theta$ and a movie with learned features $x$, predict a star rating of $\theta^Tx$.

Vectorization : Low rank matrix Factorization

Untitled

$$Y=
\begin{bmatrix}
5 \ 5 \ 0 \ 0
\ 5 \ ?\ ?\ 0
\? \ 4 \ 0 \ ?
\0 \ 0 \ 5 \ 4
\ 0 \ 0 \ 5 \ 0
\end{bmatrix}
=y^{(i,j)}$$

$$n_m=5
\n_u=4$$

$$X=\begin{bmatrix}
(x^{(1)})^T
\(x^{(2)})^T
\ \vdots
\ (x^{(n_m)})^T
\end{bmatrix}$$

$$\Theta=\begin{bmatrix}
(\theta^{(1)})^T
\(\theta^{(2)})^T
\ \vdots
\ (\theta^{(n_u)})^T
\end{bmatrix}$$

  • Predicted Ratings (Low Rank Matrix Factorization)

$$X\Theta^T=\begin{bmatrix}
(\theta^{(1)})^T(x^{(1)}) \ (\theta^{(2)})^T(x^{(1)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(1)})

\(\theta^{(1)})^T(x^{(2)}) \ (\theta^{(2)})^T(x^{(2)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(2)})
\ \vdots
\ \(\theta^{(1)})^T(x^{(n_m)}) \ (\theta^{(2)})^T(x^{(n_m)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(n_m)})
\end{bmatrix}$$

  • Finding related movies

For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$

To find movies $j$ related to movie $i$, find movie with the smallest $|x^{(i)}-x^{(j)}|$.

Implementational Detail : Mean Normalization

  • It's not a good idea to set to zero when user haven't rated any movies.

Untitled

$$Y=
\begin{bmatrix}
5 \ 5 \ 0 \ 0 \ \underline?
\ 5 \ ?\ ?\ 0 \ \underline?
\? \ 4 \ 0 \ ? \ \underline?
\0 \ 0 \ 5 \ 4 \ \underline?
\ 0 \ 0 \ 5 \ 0 \ \underline?
\end{bmatrix}$$

  • Take average of other ratings and put it into matrix $\mu$.

$$\mu=
\begin{bmatrix}
2.5
\2.5
\2
\2.25
\1.25
\end{bmatrix}$$

Untitled

  • Then learn $\theta^{(j)},x^{(i)}$ with mean normalized $Y$.
  • For user $j$, on moive $i$, predict (add $\mu$ back for prediction)
  • $$(\theta^{(j)})^T(x^{(i)})+\underline{\mu_i}$$

e.x. For user 5 (Eve) : $(\theta^{(5)})^T(x^{(i)})+\underline{\mu_i}$

Large Scale Machine Learning

Learning with Large Datasets

"It's not who has the best algorithm that wins. It's who has the most data."

  • To check if training algorithm is ok with selected data sets, we can plot the learning curve. (Sanity check)

Untitled

Stochastic Gradient Descent

  • Batch Gradient Descent (using all training examples in each iteration)
  • Untitled
  • Stochastic Gradient Descent (using single training example in each iteration.)
  1. Randomly shuffle training examples
  2. Repeat {$\theta_j:=\theta_j-\alpha(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$}
  3. (for $j:=1,...,n$) }
  4. for $i:=1,...,m$ {

Untitled

Mini-batch Gradient Descent

  • Use $b$ examples in each iteration

if $b=10, m=1,000$

Repeat {

for $i=1,11,21,31,...,991$ {

$\theta_j:=\theta_j-\alpha \underline{ \frac 1 {10} \sum_{k=i}^{i+9} (h_\theta(x^{(k)})-y^{(k)})x_j^{(k)}}$

(for every $j=0,...,n$) }

}

  • Mini-batch Gradient Descent is efficient when examples are vectorized well.

Stochastic Gradient Descent Convergence

Checking for Convergence

  • Batch Gradient Descent$J_{train}(\theta)= \frac 1 {2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$
  • Plot $J_{train}(\theta)$ as a function of the number of iterations of gradient descent.
  • Stochastic Gradient DescentDuring learning, compute $cost(\theta,(x^{(i)},y^{(i)}))$ before updating $\theta$ using $(x^{(i)},y^{(i)})$.
  • Every 1,000 iterations (say), plot $cost(\theta,(x^{(i)},y^{(i)}))$ averaged over the last 1,000 examples processed by algorithm.
  • $cost(\theta,(x^{(i)},y^{(i)}))= \frac 1 2 (h_\theta(x^{(i)})-y^{(i)})^2$

Untitled

Learning rate

  • Learning rate $\alpha$ is typically held constant. Can slowly decrease $\alpha$ over time if we want $\theta$ to converge.
  • $\alpha= \frac {const1} {iterationNumber+const2}$

Untitled

Online Learning

Untitled

  • Can adopt to changing user preference.
  • Learn from $(x,y)$ then discard. $\bold{not} \ (x^{(i)},y^{(i)})$.
  • Repeat Forever {Update $\theta$ using $(x,y)$ :$(j=0,...,n)$
  • }
  • $\theta_j:=\theta_j-\alpha(h_\theta(x)-y)x_j$
  • Get $(x,y)$ corresponding to user.

Other Online Learning Example

Untitled

Map-reduce and Data Parallelism

  • Many Learning algorithms can be expressed as computing sums of functions over the training set.
    • E.g. Gradient descent, Logistic regression, Deep learning, ...
  • Batch Gradient Descent $(m=400)$
  • Untitled

UntitledUntitled

Application example : Photo OCR

Problem description and pipeline

Untitled

  • Photo OCR (Optical Character Recognition) pipline
    1. Text detection
    2. Character segmentation
    3. Character classification

Sliding windows

  • Supervised learning for pedestrian detection
  • Move n pixels (step-size) in each steps

UntitledUntitledUntitled

  • Text detection
  • Untitled
  • 1D Sliding window for character segmentation
  • Untitled

Getting lots of data : Artificial data synthesis

  • Synthesizing data by introducing distortions(E.g. Audio : Background noise, bad cellphone connection)(E.g. intensity (brightness) of pixel $i$ / $x_i \larr x_i+random \ noise$)
  • Usually does not help to add purely random / meaningless noise to your data.
  • Distortion introduced should be representation of the type of noise / distortions in the test set.

Discussion on getting more data

  1. Make sure you have a low bias classifier before expanding the effort. (plot learning curve)
  2. Think "How much work would it be to get 10x as much data as we currently have?"
  • Artificial data synthesis
  • Collect/label it yourself
  • "Crowd source"

Ceiling analysis : What part to the pipeline to work on next

  • Estimating the errors due to each component (Ceiling analysis)

UntitledUntitled