Processing math: 100%

Coding/Coursera

[Andrew Ng] Machine Learning 정리

폴밴 2021. 9. 1. 09:28

* Notion에서 작성한 내용을 마크다운으로 가져와 작성된 글입니다. KaTeX 수식은 제대로 표기되지 않을 수 있습니다. 강의 내용에 대한 저작권은 Stanford University - Andrew Ng 에게 있습니다.

Supervised Learning

  • already know what our correct output should look like.
  • having the idea that there is a relationship between the input and the output.

Regression

  • continuous output ex) age, price

Classification

  • discrete output ex) tumor is malignant or benign

Unsupervised Learning

  • little or no idea what our results should look like.
  • variables in data → find relaionships → cluster data →derive structure
  • clustering / non-clustering

Model Representation

  • x : input variables , input features
  • y : output , target variables
  • (x,y) : trainig example
  • (x(1),y(1)),(x(2),y(2))...(x(i),y(i)) : training set
    (i=1 to m, "i" is simply index into the training set)
  • X : space of input vales
  • Y : space of output values (X = Y = R)
  • function h:XY

Cost Function

  • measure the accuracy of our hypothesis function
  • takes an average difference of all the results of the hypothesis with actual output y.

J(θ0,θ1)=12mmi=1(^yiyi)2=12mmi=1(hθ(xi)yi)2

→ Goal : choose parameter θ0,θ1 to minimize cost function J

Cost Function Intuition 1

assume there is one parameter θ0

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled.png

Cost funtion is 0 when hypothesis is identical to actual output y.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%201.png

Cost function becomes bigger value when hypothesis is different from y.

Cost Function Intuition 2

  • taking two parameters θ0,θ1
  • hθ(x)=θ0+θ1x

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%202.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%203.png

contour plot / contour figure

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%204.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%205.png

choosing different parameters to minimize cost function (contour plot on the right)

Gradient Descent

  • Start with some θ0,θ1
  • Keep changing θ0,θ1 to reduce J(θ0,θ1) until we hopefully end up at a minimum.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%206.png

the slope of the tangent is the derivative at that point.
the size of each step is determined by learnig rate alpha.

Gradient Descent Algorithm

  • repeat until convergence

θj:=θjαθjJ(θ0,θ1) (simultaneously update j=0 and j=1)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%207.png

finding theta which is at local optima

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%208.png

learning rate has to be proper value

Gradient Descent for Linear Regression

  • Linear Regression Model

J(θ0,θ1)=12mmi=1(^yiyi)2=12mmi=1(hθ(xi)yi)2

  • Gradient Descent Algorithm

θj:=θjαθjJ(θ0,θ1) (simultaneously update j=0 and j=1)

j=0 \ : \ \frac {\partial}{\partial \theta_0} J(\theta_0,\theta_1)=\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})

j=1 \ : \ \frac {\partial}{\partial \theta_1} J(\theta_0,\theta_1)=\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}

  • repeat until convergence

\theta_0 := \theta_0-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \
\theta_1 := \theta_1-\alpha\frac 1 m
\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}

  • "Batch" Gradient Descent

Each step of gradient descent uses all the training examples.

Gradient Descent for Multiple Variables

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%209.png

  • n = number of features
  • m = number of training sets
  • x(m)n
  • x(i) = input of ith training example
  • x(i)j = value of feature j in ith training example

ex ) x(3)2=3 , x(2)1=1416

hθ(x)=θ0+θ1x1+θ2x2+θ3x3+...θnxn

X=
[x0 x1 x2 ... xn]

\Theta=
[θ0 θ1 θ2 ... θn]

h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+...\theta_nx_n
= \Theta^TX

  • previously (n=1)
  • \theta_0 := \theta_0-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) \
    \theta_1 := \theta_1-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}
  • new algorithm (n1) (x(i)0=1)
      theta = theta - (alpha / m * sum((predictions-y).*X))';
  • \theta_0 := \theta_0-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_0^{(i)} \\
    \theta_1 := \theta_1-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_1^{(i)} \\
    \theta_2 := \theta_2-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_2^{(i)} \\
    \vdots \\
    \boxed{\theta_j := \theta_j-\alpha\frac 1 m
    \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)}}

Feature Scaling

  • to make sure features are on similar scale.
  • get every feature into approximately a (1xi1) range.

x_1 = size \ (0 \sim 2000 \ ft^2) \
x_2 = number \ of \ bedrooms \ (1 \sim 5)

\to x_1 = \frac {size(ft^2)} {2000} \
\to x_2 = \frac {number \ of \ bedrooms} {5}

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2010.png

Mean Normalization

  • get every feature into approximately a (0.5xi0.5) range.

xi:=xiμisi

  • μi=average value of xi
  • si=range of xi=(maxmin)

Learning Rate

θj:=θjαθjJ(θ)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2011.png

  • For sufficiently small α , J(θ) should decrease on every iteration.
  • But if α is too small, gradient descent can be slow to converge.
  • To choose α, try ...,0.001,0.003,0.01,0.03,0.1,0.3,1,...

Polynomial Regression

  • We can combine multiple features into one. ( Ex. x3x1  x2 )
  • We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic, square root, etc.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2012.png

Feature scaling becomes important when we choose features this way.

Normal Equation

  • Unlike gradient descent, normal equation is a method to solve for θ analytically.

J(\theta_0,\theta_1,...,\theta_m)=\frac {1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2
\
\frac {\partial}{\partial\theta_j}J(\theta)=\dotsm=0 \ (for \ every \ j)
\
\underline{Solve \ for \ \theta_0,\theta_1,....\theta_n}

x^{(i)}=
[x(i)0 x(i)1 x(i)2 ... x(i)n]
\ (x^{(i)})^T= [x_0^{(i)} x_1^{(i)} x_2^{(i)} \dots x_n^{(i)} ]
\
X=
[x(1)0 x(1)1 x(1)2 x(1)n x(2)0 x(2)1 x(2)2 x(2)n x(3)0 x(3)1 x(3)2 x(3)n ... x(m)0 x(m)1 x(m)2 x(m)n]
=[(x(1))T (x(2))T (x(3))T ... (x(m))T]

θ=(XTX)1XTy

Gradient Desent

  • Need to choose α.
  • Needs many iterations.
  • Works well even when n (features) is large.

Normal Equation

  • No need to choose α.
  • Don't need to iterate.
  • Need to compute (XTX)1.
  • Slow if n is very large. (n>104)

Vectorial Implementation

hθ(x)=nj=0θjxj=θTx

Unvectorized implementation

prediction = 0.0;
for j = 1:n+1;
    prediction = prediction 
                                + theta(j) * x(y)
end;

Vectorized implementation

prediction = theta' * x;

Logistic Regression

Classification

  • ex. email - spam or not spam / tumor - malignant or benign

y{0,1}

0 : Negative Class

1 : Positive Class

  • Discrete Value

if \ h_\theta(x) \geq 0.5, \ y=1
\
if \ h_\theta(x) < 0.5, \ y=0

Hypothesis Representation

  • Using linear regression to classification problem performs very poorly, so we have to chage the form.
  • Logistic Function

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2013.png

h_\theta(x)=g(\theta^Tx)
\
z=\theta^Tx
\
g(z)=\frac{1}{1+e^{-z}}

  • hθ(x) = estimated *probability *that y=1 on input x

h_\theta(x)=P(y=1|x;\theta)

=1-P(y=0|x;\theta)

predict \ \text{\textquotedblleft}y=1" \ if \ h_\theta(x)=g(z) \geq0.5
\
\underline{ when \ z=\theta^T \geq 0}

predict \ \text{\textquotedblleft}y=0" \ if \ h_\theta(x)=g(z) <0.5
\
\underline{when \ z=\theta^T <0}

Decision Boundary

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2014.png

x=
[1 x1 x2 ]
\
\theta=
[3 1 1 ]

h_\theta(x)=g(\theta^Tx)=g(-3+x_1+x_2)
\
predict \ \text{\textquotedblleft}y=1" \ if\ \ \bold{z=-3+x_1+x_2\geq0}

h_\theta(x)=g(\theta^Tx)=g(-3+x_1+x_2)
\
predict \ \text{\textquotedblleft}y=0" \ if\ \ \bold{ z=-3+x_1+x_2<0}

Decision Boundarythereforex1+x2=3

  • The input to the sigmoid function g(z) (e.g. θTX) doesn't need to be linear.
  • e.g.
    z=θ0+θ1x21+θ2x22 (circle)
    or any shape of data

Cost Function

J(\theta)=
\frac{1}{m} \sum_{i=1}^{m} \underline{\frac1 2 (h_\theta(x^{(i)})-y^{(i)})^2}
=\frac{1}{m} \sum_{i=1}^{m}\underline{Cost(h_\theta(x^{(i)}),y^{(i)})}

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2015.png

Cost(hθ(x(i)),y(i))=log(hθ(x)) if y=1

h_\theta(x)=1
\ Cost = 0

h_\theta(x) \rarr 0
\ Cost \rarr \infin

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2016.png

Cost(hθ(x(i)),y(i))=log(1hθ(x)) if y=0

h_\theta(x)=0
\ Cost = 0

h_\theta(x) \rarr 1
\ Cost \rarr \infin

Simplified Cost Function and Gradient Descent

  • Simplified Cost Function

J(\theta)
=\frac{1}{m} \sum_{i=1}^{m}\underline{Cost(h_\theta(x^{(i)}),y^{(i)})}\=- \frac{1}{m} [ \sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))]

  • Vectorized Implementation

h=g(X\theta) \
J(\theta)= \frac 1 {m} \cdot (-y^Tlog(h)-(1-y)^Tlog(1-h))

  • Gradient Descent

θjθjαθjJ(θ)

θjθjαmmi=1(hθ(x(i))y(i))x(i)j

  • Vectorized Implementation

θθαmXT(g(Xθ)y)

Algorithm looks identical to linear regression, but hθ(x) is Logistic Function.

Optimization Algorithm

  • Given θ, we have code that can compute J(θ), θjJ(θ) (for j=0,1,2,n).

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2017.png

options = optimset('GradObj','on','MaxIter',100); % optimset() function creates an object containing the options we want to send to fminunc()
initialtheta = zeros(2,1);
[optTheta, functionVal, exitFlag] ...
           = fminunc(@costFunction, initialTheta, options);

Multiclass Classification

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2018.png

y \in {\lbrace0,1,...,n\rbrace}
\
h_\theta^{(0)}(x)=P(y=0|x;\theta) \
h_\theta^{(1)}(x)=P(y=1|x;\theta)
\
\cdots
\
h_\theta^{(n)}(x)=P(y=n|x;\theta)
\
prediction=\underset{i}{max}(h_\theta^{(i)}(x))

  • Train a logistic regression classfier hθ(x) for each class to predict the probability that y=i.
  • To make a prediction on a new x, pick the class that maximizes hθ(x).

prediction=maxi(h(i)θ(x))

  • hθ(x) is a probability that y=i, by logistic function definition.

Regularization

The Problem of Overfitting

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/0B05AE56-26FA-4254-99C3-81E5E29E83E0.jpegMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2019.png

  • If we have too many features, hypothesis may fit the training set very well, but fail to generalize to new examples. →Overfitting
  • To fix this issue, there are two options : Reducing number of features / Regularization.

Cost Function

Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2020.png

  • Suppose we penalize and make θ3,θ4 very small.

minθ12mmi=1(h(x(i)θ)y(i))2+1000θ23+1000θ24

  • The goal is to minimize cost function, therefore by multiplying large value for θ3 and θ4 makes θ30 and θ40.

Regularization

J(θ)=12mmi=1[(h(x(i)θ)y(i))2+λnj=1θ2j]

  • Small values for parameter θ makes hypothesis simpler and less prone to overfitting.
  • If λ is extremely large value, algorithm results in underfitting. (hθ(x)=θ0)

Regularized Linear Regression

Gradient Descent

θ0θ0α1mmi=1(h(x(i)θ)y(i))x(i)0

θjθjα[(1mmi=1(h(x(i)θ)y(i))x(i)j)+λmθj]

\theta_j \coloneqq \theta_j(1-\alpha \frac \lambda m)-\alpha \frac 1 m \sum_{i=1}^m (h(x_\theta^{(i)})-y^{(i)})x_j^{(i)}
\ (1-\alpha \frac \lambda m )< 1

Normal Equation

X=[(x(1))T  (x(m))T]
\ (m \times (n+1) \ matrix ) \
y=[y(1)  y(m)]

θ=(XTX+λL)1XTy

L = [0000 0100 0010  01]
\ ((n+1)\times(n+1) \ matrix)

  • Regularization takes care of non-invertible issue.

Regularized Logistic Regression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2021.png

J(\theta)
=- \frac{1}{m} [ \sum_{i=1}^m y^{(i)}log(h_\theta(x^{(i)}))+(1-y^{(i)})log(1-h_\theta(x^{(i)}))] + \boxed{ \frac{\lambda} {2m} \sum_{j=1}^n \theta_j^2}

Gradient Descent

θ0θ0α1mmi=1(h(x(i)θ)y(i))x(i)0

θjθjα[(1mmi=1(h(x(i)θ)y(i))x(i)j)+λmθj]

Advanced Optimization

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2022.png

function [jVal, gradient] = costFunction(theta);
jVal = [code to compute J(theta)];
gradient(1) = [code to compute partial derivative J(theta)]; 
gradient(2) = [code to compute partial derivative J(theta)];
gradient(3) = [code to compute partial derivative J(theta)];
...
gradient(n+1) = [code to compute partial derivative J(theta)]; % because code starts with 1, while theta_n starts from 0.

Neural Networks : Representation

  • Neurons are basically computational units that take inputs (Dendrites) that are channeled to outputs (Axons).
  • Sigmoid Activation Function (same as logistic function)

hθ(x)=g(θTx)=11+eθTx

  • θ : weights (parameters)

Model Representation 1

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2023.png

  • Layer 1 : Input Layer
    Layer 2 : Hidden Layer
    Layer 3 : Output Layer

[x0 x1 x2 x3]
\rarr
[a(2)1 a(2)2 a(2)3]
\rarr
h_\theta(x)

  • a(j)i : "activation" of unit i in layer j.
  • Θ(j) : matrix of weights controlling function mapping from layer j to layer j+1.

a_1^{(2)}=
g([θ(1)10 θ(1)11 θ(1)12 θ(1)13]
\times
[x0 x1 x2 x3]
)
=g(\Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3)
\
a_2^{(2)}=g(\Theta_{20}^{(1)}x_0+\Theta_{21}^{(1)}x_1+\Theta_{22}^{(1)}x_2+\Theta_{23}^{(1)}x_3)
\
a_3^{(2)}=g(\Theta_{30}^{(1)}x_0+\Theta_{31}^{(1)}x_1+\Theta_{32}^{(1)}x_2+\Theta_{33}^{(1)}x_3)
\
h_\Theta(x)=a^{(3)}1= g(\Theta{10}^{(2)}a_0^{(2)}+\Theta_{11}^{(2)}a_1^{(2)}+\Theta_{12}^{(2)}a_2^{(2)}+\Theta_{13}^{(2)}a_3^{(2)})
\
(h_\Theta(x) \ can \ be \ viewed \ as \ unit \ 1 \ in \ layer \ 3)

  • If network hasthen Θ(j) will be of dimension sj+1×(sj+1).
    ( +1 comes from x0 and Θ(j)0, "bias nodes", ex. Θ(1)10x0 )
  • s_j \ units \ in \ layer \ j,
    \
    s_{j+1} \ units \ in \ layer \ j+1

Model Representation 2

Forward Propagation (Vectorized Implementation)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2024.png

  • Let z(2)1 ( unit 1 in layer 2 ) asthen z(j)i can be vectorized as
  • x=a^{(1)}
    =[x0 x1 x2 x3]
    \ , \
    z^{(2)}=
    [z(2)1 z(2)2 z(2)3]
    \
    \rarr z^{(2)}=\Theta^{(1)}a^{(1)}
    \
    \vdots
    \
    z^{(j+1)}=\Theta^{(j)}a^{(j)}
    \
    (a_0^{(j)}=1 \ (bias \ unit))
  • z_1^{(2)} = \Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3
    \
    a_1^{(2)}=g(z_1^{(2)})
  • The last parameter matrix Θ(j) will have only one row (because jth layer has only one unit (hθ(x))
    which is multiplied by one column a(j) so that our result is a single number.

hΘ(x)=a(j+1)=g(z(j+1))=g(Θ(j)a(j))

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2025.png

Examples & Intuitions 1

  • If first parameter matrix is Θ(1)=[30  20  20]
  • h_\Theta(x)=g(-30+20x_1+20x_2)
    \
    x_1=0 \ and \ x_2=0 \ then \ | \ g(-30) \approx0
    \
    x_1=0 \ and \ x_2=1 \ then \ | \ g(-10) \approx0
    \
    x_1=1 \ and \ x_2=0 \ then \ | \ g(-10) \approx0
    \
    x_1=1 \ and \ x_2=1 \ then \ | \ g(10) \approx1
    \
    \rarr AND

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2026.png

  • OR

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2027.png

Examples & Intuitions 2

To make x1 XNOR x2

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2028.png

  • Use Θ(1) matrix that combines the values for AND and NOR.[302020 102020]
  • \Theta^{(1)}
  • Use Θ(2) matrix that uses the value for OR.
  • \Theta^{(2)}=
    [102020]
  • Then a(3)1(=hΘ(x)) becomes XNOR.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2029.png

  • Neural Network Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2030.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2031.png

Multi-class Classification

Let hypothesis function return a vector of values.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2032.png

[x0 x1 x2  xn]
\rarr
[a(2)0 a(2)1 a(2)2 ]
\rarr
[a(3)0 a(3)1 a(3)2 ]
\rarr
\cdots
\rarr
[hΘ(x)1 hΘ(x)2 hΘ(x)3 hΘ(x)4]

Neural Networks : Learning

Cost Function

L=total \ number \ of \ layers \ in \ network
\
s_l=number \ of \ units \ (not \ including \ bias \ unit) \ in \ layer \ 1
\
K=number \ output \ units/classes
\
h_\Theta(x)_k=hypothesis \ that \ results \ in \ the \ k^{th} \ output

J(\Theta)=- \frac {1}{m} [ \sum_{i=1}^m \bold{\sum_{k=1}^{K}} y_k^{(i)}log(h_\theta(x^{(i)}))k + (1-y_k^{(i)})log(1-(h_\Theta(x^{(i)}))_k )] + \bold{\frac{\lambda}{2m} \sum{l=1}^{L-1}\sum_{i=1}^{s_l} \sum_{j=1}^{s_l+1}(\Theta_{ji}^{(l)})^2}
\

Backpropagation Algorithm

  • Our goal is to minimize J(Θ) with parameter Θ.
  • To achieve the goal, we need to compute J(Θ) and Θ(l)ijJ(Θ)
  • To do so, we use backpropagation algorithm.

training \ set : (x^{(1)},y^{(1)}), \cdots,(x^{(m)},y^{(m)})
\
Set \ \Delta_{i,j}^{(l)} :=0 \ for \ all \ (l,i,j)

For \ t=1 \ to \ m:
\

  1. \ Set \ a^{(1)}:=x^{(t)} \ (unit \ of \ first \ layer \ i s\ same \ as \ x)
    \
  2. \ Perform \ Foward \ Propagation \ to \ compute \ a^{(l)} (l=2,3,...,L)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2033.png

Using \ y^{(t)}, \
Compute \ \delta^{(L)}=a^{(L)}-y^{(t)}

"error values" (δ(L)) for the last layer are simply the differences of our actual results in the last layer (a(L)) and correct outputs in y (y(t)).

Compute \ \delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)} \ using
\
\delta^{(l)}=((\Theta^{(l)})^T \delta^{(l+1)}).a^{(l)}.(1-a^{(l)})
\
(g'(z^{(l)})=a^{(l)}.*(1-a^{(l)}))

\Delta_{i,j}^{(l)}:=\Delta_{i,j}^{(l)}+a_{j}^{(l)}\delta_{i}^{(l+1)}
\
Vectorization \ \rarr \ \Delta^{(l)}:=\Delta^{(l)}+\delta_{i}^{(l+1)}(a^{(l)})^T

Then we update our new Δ matrix.

D_{i,j}^{(l)}:= \frac 1 m (\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}) \ , if \ j \neq 0.
\
D_{i,j}^{(l)}:= \frac 1 m \Delta_{i,j}^{(l)}\ , if \ j = 0.

matrix D is used as an "accumulator" to add up our values and eventually compute our partial derivative of J(Θ).

Θ(l)ijJ(Θ)=D(l)ij

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2034.png

Backpropagation Intuition

  • δ(l)j is the "error" for a(l)j.
  • delta values are actually the derivative of the cost function.

δ(l)j=z(l)j cost(t)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2035.png

  • To calculate δ(2)2, we multiply the weights Θ(2)12 and Θ(2)22 by their respective δ values.

\delta_2^{(2)}=\Theta_{12}^{(2)}\delta_1^{(3)}+\Theta_{22}^{(2)}\delta_2^{(3)}

Implementation Note : Unrolling Parameters

  • To use optimizing function as "fminunc()", we should unroll all elements.

Θ(1),Θ(2),Θ(3),...

thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
% becomes column vector
deltaVector = [ D1(:); D2(:); D3(:) ];
% becomes column vector 
  • If Theta1 is 10x11, Theta2 is 10x11, Theta3 is 1x1 matrices, we can get back our original matrices from the "unrolled" versions.
Theta1 = reshape (thetaVector (1:110), 10, 11); 
% select 1 to 110 elements, make 10 by 11 matrix
Theta2 = reshape (thetaVector (111:220), 10, 11);
% select 111 to 220 elements, make 10 by 11 matrix
Theta3 = reshape (thetaVector (221:231), 1, 11);
% select 221 to 231 elements, make 1 by 11 row vector

Gradient Checking

  • To verify whether backpropagation works as intended, we use gradient checking.

ΘjJ(Θ)J(Θ1,,Θj+ϵ,,Θn)J(Θ1,,Θjϵ,,Θn)2ϵ

  • Once you have verified that backpropagation is correct, you should disable gradient checking. (code to compute gradApprox is very slow)

Random Initialization (Breaking Symmetry)

  • If we choose all zero elements for initial Θ matrix, all nodes will update to the same value repeatedly while performing backpropagate.
  • So we should initialize Θ to a random value in [ϵ,ϵ].

ϵΘ(l)ijϵ

Theta1 = rand(10,11) * (2 * init_epsilon) - init_epsilon;
Theta2 = rand(1,11) * (2 * init_epsilon) - init_epsilon;

Summary (Putting it together)

Pick a Network Architecture

  • Number of input units : Dimension of features x(i)
  • Number of output units : Number of classes (yk)
  • Resonable Default : 1 hidden layer, or if >1 hidden layer, have same number of hidden units in every layer

Training a Neural Network

  1. Randomly initialize wieghts (Θ)
  2. Implement forward propagation to get hΘ(x(i)) for any x(i)
  3. Implement code to compute cost function J(Θ)
  4. Implement backpropagation to compute partial derivatives Θ(l)jkJ(Θ)
  5. for i = 1:m { Perform forward propagation and backpropagation using example (x_i, y_i) Get activations a^l and delta terms delta^l for l = 2 to L }
  6. Use gradient checking to compare Θ(l)jkJ(Θ) computed using backpropagation vs. using numerical estimate of gradient of J(Θ)

Then disable gradient checking code
6. Use gradient descent or advanced optimization method with backpropagation to try to minimze J(Θ) as a function of parameters Θ

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2036.png

Advice for applying Machine Learning

Evaluating a hypothesis

Trouble shooting for errors in our predictions by :

  • Getting more training examples
  • Trying smaller sets of features
  • Trying additional features
  • Trying polynomial features
  • Increasing or decreasing λ

We can split up the data into to sets : training set (70%) / test set (30%)

  1. Learn Θ and minimize Jtrain(Θ) using the training set
  2. Compute the test set error Jtest(Θ)

The Test Set Error

  • Linear Regression

J_{test}(\Theta)= \frac {1}{2m_{test}} \sum_{i=1}^{m_{test}}
(h_{\Theta}(x_{m_{test}})-y_{m_{test}}^{(i)})^2

  • Classification

err(h_\Theta(x),y)=1 \ \
(if \ h_\Theta(x) \geq 0.5 \ and \ y=0,
\ or \ h_\Theta(x) < 0.5 \ and \ y=1)
\
err(h_\Theta(x),y)=0 \ (otherwise)

Test \ Error = \frac 1 {m_{test}}
\sum_{i=1}^{m_test}
err(h_\Theta(x_{test}^{(i)}),y_{test}^{(i)})

  • Logistic Regression

J_{test}(\theta)= - \frac 1 {m_{test}}
\sum_{i=1}^{m_{test}}
y_{test}^{(i)} \ log (h_\theta(x_{test}^{(i)}))+(1-y_{test}^{(i)})log(h_\theta(x_{test}^{(i)}))

Model Selection and Training / Validation / Test sets

The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any data set.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2037.png

One way to break down our dataset into the three sets is :

  • Training Set (60%)
  • Cross Validation Set (20%)
  • Test Set (20%)

We can now calculate three separate error values :

  1. Optimize the parameters in Θ using the data set for each polynomial degree.
  2. Find the polynomial degree d with the least error using the cross validation set.
  3. Estimate the generalization error using the test set with Jtest(Θ(d)).

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2038.png

Diagnosing Bias vs. Variance

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2039.png

  • Bias (Underfit) - small degree of polynomial d
    Jtrain(Θ) will be high, Jcv(Θ)Jtrain(Θ)
  • Variance (Overfit) - large degree of polynomial d
    Jtrain(Θ) will be low, Jcv(Θ)>>Jtrain(Θ)

Regularization and Bias / Variance

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2040.png

  • Small λ : high variance (overfit)
  • Large λ : high bias (underfit)
  • To choose the model & regularization term:
  1. Create list of λ ( 0, 0.01, 0.02, 0.04, 0.08, ... , 10.24 ).
  2. Create a set of models with different degrees.
  3. Go through all the models to learn some Θ for each λ.
  4. Compute cross validation error on the Jcv(Θ) without regularization.
  5. Select the best combo that produces the lowest error on the cross validation set.
  6. Apply it on Jtest(Θ) to see if it has a good generalization of the problem.

Learning Curves

  • High Bias (Underfitting)
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2041.png
  • High Variance (Overfitting)
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2042.png

Debugging a Learning Algorithm

  • Fixes high bias (underfitting)
    Try getting additional features
    Try adding polynomial features
    Try decreasing λ
  • Fixes high variance (overfitting)
    Get more training examples (training sets)
    Try smaller sets of features
    Try increasing λ (heavier regularization)
  • Small Neural Network
    more prone to underfitting (Computationally cheaper)
  • Large Neural Netwok
    more prone to overfitting (Computationally expensive)
    Use regualrization to address overfitting.

Machine Learning System Design

Prioritizing What to Work On

  • How could we spend our time to improve the accuracy of this classifier (spam classifier)?

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2043.png

  • Collect lots of data
  • Develop sophisticated features
  • Develop algorithms to process your input in different ways

It is difficult to tell which of the options will be most helpful.

Error Analysis

  • Recommended Approach
    1. Start with simple algorithm, implement quickly, and test it early on your cross validation data.
    2. Plot learning curves to decide if more data, more features, etc. are likely to help.
    3. Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
  • It is very important to get error results as a single, numerical value.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2044.png

Error metrics for skewed classes

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2045.png

Precision / Recall

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2046.png

  • Precision
    (Of all patients where we predicted y=1, what fraction actually has caner?)\frac { True \ Positive } {\bold{ Predicted \ Positive}}
  • \frac { True \ Positive } {\bold{ True \ Positive + False \ Positive}}
  • Recall
    (Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)\frac { True \ Positive } {\bold{ Actual \ Positive}}
  • \frac { True \ Positive } {\bold{ True \ Positive + False \ Negative}}

Trading off precision and recall

  • Suppose we want to predict y=1 (cancer) only if very confident.Higher precision, Lower recall
  • h_\theta(x) \geq \bold{ 0.9}
    \
    h_\theta(x) < \bold{0.9}
  • Suppose we want to avoid missing too many cases of cancer (avoid false negatives).Higher recall, Lower precision
  • h_\theta(x) \geq \bold{0.3}
    \
    h_\theta(x) < \bold{0.3}
  • More generally : predict 1 if hθ(x)threshold.

F Score

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2047.png

Data for Machine Learning

Useful test

  1. Given the input x, can a human expert confidently predict y?
  2. Can we get a large training set, and train the learning algorithm with a lots of parameters in the training set?

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2048.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2049.png

Support Vector Machines

Optimization Objective

  • Alternative view of logistic regression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2050.png

Cost \ of \ example: \=-(y-\log h_\theta(x)+(1-y)\log(1-h_\theta(x)))
\
=-y\log \frac {1} {1+e^{-\theta^Tx}} - (1-y)\log(1-\frac {1} {1+e^{-\theta^Tx}})

If y=1 , (want θTx>>0)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2051.png

If y=0 , (want θTx<<0)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2052.png

  • Support Vector Machine

\underset{\theta}
{min}
\frac{1}{m} [ \sum_{i=1}^m y^{(i)}(-\log h_\theta(x^{(i)}))+(1-y^{(i)})(-\log(1-h_\theta(x^{(i)}))]+ \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2

m : constant, doesn't affect minimization

A+λBCA+B (C=1λ)

\rarr \underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac 1 2 \sum_{j=1}^n\theta_j^2

h_\theta(x)=1 \ (if \ \theta^Tx \geq 0) \
h_\theta(x)=0 \ (otherwise)

Large Margin Intuition

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2053.png

\underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})]+\frac 1 2 \sum_{j=1}^n\theta_j^2

  • SVM Decision Boundary : Linearly Serperable Case
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2054.png
  • Large Margin Classfier in presence of outliers
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2055.png

The mathematics behind Large Margin Classification (optional)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2056.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2057.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2058.png

Kernels 1

  • Given x, compute new feature depending on proximity to landmarks l(1),l(2),l(3)

f_1 = similarity (x,l^{(1)})=\exp(- \frac {|x-l^{(1)}|^2} {2 \sigma^2})
=\exp(- \frac {\sum_{j=1}^n (x_j-l_j^{(1)})^2} {2\sigma^2})

f2=similarity(x,l(2))=exp(xl(2)22σ2)

f3=similarity(x,l(3))=exp(xl(3)22σ2)

  • If xl(1)
  • f1exp(02σ2)=1
  • If x is far from l(1)
  • f1=exp((large number)22σ2)0

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2059.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2060.png

Kernels 2

  • Choosing Landmarks l(i)

Given (x(1),y(1)),(x(2),y(2)),...,(x(m),y(m)),
Choose l(1)=x(1),l(2)=x(2),...,l(m)=x(m).

Given example x :

f_1=similarity (x,l^{(1)})
\ f_2=similarity(x,l^{(2)})
\ \vdots

feature \ vector \ f=
\begin {bmatrix}
f_0
\f_1
\ \vdots
\f_m
\end {bmatrix}
,
\ f_0=1

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2061.png

  • SVM with Kernels

Hypothesis : Given x, compute features fRm+1
Predict y=1 if θTf0

Training :

\underset{\theta}
{min} \ C \sum_{i=1}^m[y^{(i)}cost_1(\theta^T\bold{f^{(i)}})+(1-y^{(i)})cost_0(\theta^T\bold{f^{(i)}})]+\frac 1 2 \sum_{j=1}^m\theta_j^2

mj=1θ2j=θTθ=θTMθ (M:rescale factor)

  • SVM parameters
  • C=1λ

Large C : Lower bias, high variance, less regularized ( small λ )

Small C : Higher bias, lower variance, more regularized ( large λ )

  • σ2  (exp(xl22σ2)

Large σ2 : Features fi vary more smoothly. Higher bias, lower variance.

Small σ2 : Features fi vary less smoothly. Lower bias, higher variance.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2062.png

Using SVM

Use SVM software package (e.g. liblinear, libsvm, ...) to solve for parameters θ.

Need to specify :

  • Choice of parameter C
  • Choice of kernel (similarity function) :
    No kernel (linear kernel), Gaussian kernel, ploynomial kernel, ...
  • Perform feature scaling before using the Gaussian kernel.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2063.png

  • Multi-class classification

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2064.png

Many SVM packages already have built-in multi-class classification.

Otherwise, use one-vs-all method.

  • Logistic regression vs. SVMs
    • If n is large (relative to m) (e.g. n=10,000 , m = 10~1,000)
    • Use Logistic Regression, or SVM without a kernel (linear kernel)
    • If n is small, m is intermediate
    • Use SVM with Gaussian Kernel
    • If n is small, m is largeNeural Network likely to work well for most of these settings, but may be slower to train.
    • Add more features, then use Logistic Regression or SVM without a kernel
  • n : number of features, m : number of training examples

Clustering

Unsupervised Learning

In Unsupervised Learning, you are given an unlabled dataset (without labels y(i)) and are asked to find some structure in the data.

K-means Algorithm

  1. Randomly Initialize cluster centroids
  2. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2065.png
  3. Assign cluster
  4. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2066.png
  5. Move cluster centroid to average point
  6. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2067.png
  7. Repeat until it converges
  8. Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2068.png
  • InputTraining Set : x(1),x(2),...,(m)
  • K : number of clusters
  • Algorithm

Randomly initialize K cluster centroids μ1,μ2,...,μKRn

Repeat {

  1. Cluster Assignment Stepc(i) := index of cluster centroid closest to x(i). (1,2,...,K)
  2. for i=1 to m
  3. Move to Centroidμk := average of points assigned to cluster k }
  4. for k=1 to K

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2069.png

Optimization Objective

c(i) = index of cluster to which example x(i) is currently assigned

μk = cluster centroid k

μc(i) = cluster centroid of cluster to which example x(i) has been assigned

(E.g. If x(i) is assigned in cluster 5 then c(i)=5, μc(i)=μ5)

  • Optimization objective (Distortion Cost Function / K-means Algorithm)

J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)= \frac 1 m \sum_{i=1}^m |x^{(i)}-\mu_{c^{(i)}}|^2
\
min \ J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)

Random Initialization

  • How to randomly initialize K cluster centroids?
    1. Should have K>m
    2. Randomly pick K training examples.
    3. Set μ1,,μK equal to these K examples.
    4. μ1=x(1),μ2=x(2),
  • To avoid local optima, initialize K-means lots of times & pick clustering that gave lowest cost J.
    1. Randomly initialize K-means.
    2. Run K-means. Get c(1),,c(m),μ1,,μK.
    3. Compute cost function (distortion) J(c(1),,c(m),μ1,,μK) }
  • For i=1 to 100 {

Choosing the number of clusters

  • Usually chosen by hand or human insight.
  • Elbow method :

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2070.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2071.png

Dimensionality Reduction

Motivation 1 : Data Compression

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2072.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2073.png

Motivation 2 : Data Visualization

  • Reduce data to max 3-dimensional data

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2074.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2075.pngMachine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2076.png

Principal Component Analysis problem formulation (PCA)

  • 2-dimension to 1-dimension :
    Find a direction (a vector u(1)Rn) onto which to project the data so as to minimize the projection error.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2077.png

  • n-dimension to k-dimension :
    Find k vectors u(1),u(2),u(3),,u(k) onto which to project the data, so as to minimize the projection error.

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2078.png

  • PCA is not linear regression

Principal Component Analysis Algorithm

  • Data PreprocessingPreprocessing (feature scaling / mean normalization) :Replace each x(i)j with xjμj. (ensure every feature has zero mean.)
  • If different features on different scales, perform feature scaling.
    x(i)jx(i)jμjsj
  • μj=1mmi=1x(i)j
  • Training Set : x(1),x(2),...,x(m)
  • Compute "Covariance matrix Σ"
  • Σ=1mni=1(x(i))(x(i))T
  • Compute "Eigenvectors" of matrix $\SigmaU_{reduce}ismatrixUwithk$ selected columns.
  • z(i) is projected x(i).
  • [U, S, V] = svd(Sigma); Ureduce = U(:, 1:k); z = Ureduce' * x;

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2079.png

Reconstruction from compressed representation

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2080.png

XX(i)approx=Ureducez(i)

Choosing the number of principal components

Average squared projection error :

1mmi=1x(i)x(i)approx2

Total variation in the data :

1mmi=1x(i)2

Typically, choose k to be smallest value so that

\frac { average \ square } {total \ variation} =
\frac {\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2} { \frac 1 m \sum_{i=1}^m|x^{(i)}|^2 }
\leq 0.01 \ (1%)
\
\rarr 99% \ of \ variance \ is \ retained

  • Choosing k (compressed dimension) algorithmCompute Ureduce,z(1),z(2),...,z(m),x(1)approx,...,x(m)approx.
    ORPick smallest value of k for which
  • \frac {\sum_{i=1}^k S_{ii}} {\sum_{i=1}^m S_{ii}}
    \geq0.99
    \
    \rarr 99% \ of \ variance \ retained
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2081.png
  • Check if $\frac {\frac 1 m \sum_{i=1}^m |x^{(i)}-x_{approx}^{(i)}|^2} { \frac 1 m \sum_{i=1}^m|x^{(i)}|^2 }
    \leq 0.01,ifnot,increasek$.
  • Try PCA with k=1.

Advice for applying PCA

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2082.png

Mapping x(i)z(i) should be defined by running PCA only on the training set.

This mapping can be applied as well to the examples in the cross validation and test sets.

  • Application of PCASpeed up learning algorithm (Choose k by % of variance retain)
  • Visualization (k=2 or 3)
  • Reduce memory / disk needed to store data
  • Bad use of PCA : To prevent overfitting
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2083.png
  • Before implementing PCA, first try running whatever you want to do with the original data x(i).
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%2084.png

Anomaly Detection

Problem Motivation

Untitled

  1. xi : features
  2. Model p(x)(probability) from data.
  3. p(xtest)<ϵ → Flag Anomaly
    p(xtest)ϵ → OK

Gaussian Distribution

X \sim N(\mu,\sigma^2)
\
\mu : mean , \ \sigma^2 : variance

P(x;μ,σ2)=12πσexp((xμ)22σ2)

Untitled

μ=1mmi=1x(i)

σ2=1mmi=1(x(i)μ)2

Algorithm

x_i \sim N(\mu_i,\sigma_i^2)
\
p(x)=p(x_1;\mu_1,\sigma_1^2) \cdot p(x_2;\mu_2,\sigma_2^2) \cdot p(x_3;\mu_3,\sigma_3^2) \cdots \cdot p(x_n;\mu_n,\sigma_n^2)
\=\Pi_{j=1}^np(x_j;\mu_j,\sigma_j^2)

  1. Choose features xi that you think might be indicative of anomalous examples.
  2. Fit parameters μ1,μ2,...,μn, σ21,σ22,...,σ2n
    μj=1mmi=1x(i)j
    σ2j=1mmi=1(x(i)jμj)2
  3. Given new example x, compute p(x)
    p(x)=Πnj=1p(xj;μj,σ2j)=Πnj=112πσexp((xμ)22σ2)
  4. Anomaly if p(x)<ϵ

Untitled

Developing and Evaluating an anomaly detection system

Assume we have some labeled data, of anomalous and non-anomalous examples. (y=0 if normal, y=1 if anomalous)

Training set : x(1),x(2),...,x(m) (assume normal examples not anomalous)

Cross validation set : (x(1)cv,y(1)cv),...,(x(mcv)cv,y(mcv)cv)

Test set : (x(1)test,y(1)test),...,(x(mtest)test,y(mtest)test)

Untitled

  • Algorithm Evaluation
  1. Fit model p(x) on training set { x(1),...,x(m) }
    (can also use cross validation set to choose parameter ϵ)
  2. On a cross validation / test example x, predict
    y=1 if p(x)<ϵ (anomaly)
    y=0 if p(x)ϵ (normal)
  3. Possible evaluation metrics :
  • True positive, false positive, false negative, true negative
  • Precision / Recall
  • F1 score

Anomaly Detection vs. Supervised Learning

UntitledUntitled

  • Anomaly DetectionLarge number of negative examples (normal)
  • Very small number of positive examples (anomalous)
  • Supervised Learning
  • Large number of positive and negative examples

Choosing what features to use

Untitled

  • Error analysis for anomaly detection
    y=1 if p(x)<ϵ (anomaly)
    y=0 if p(x)ϵ (normal)If p(x) is comparable for both normal and anomalous examples, create new feature.
  • Want p(x) large for normal examples x, small for anomalous examples x.

Untitled

Multivariate Gaussian Distribution

Untitled

Don't model p(x1),p(x2),... separately.

Model p(x) all in one go.

Parameters : μRn, ΣRn×n (covariance matrix)

p(x;μ,Σ)=1(2π)n2Σ12exp(12(xμ)TΣ1(xμ))

UntitledUntitledUntitledUntitled

Anomaly Detection using the multivariate Gaussian Distribution

  • Parameter Fitting

μ=1mmi=1x(i)

Σ=1mmi=1(x(i)μ)(x(i)μ)T

UntitledUntitledUntitled

Recommender Systems

Problem Formulation

  • Example : Predicting movie ratingsnu = no. of usersr(i,j)=1 if user j has rated movie i
  • y(i,j) = rating givren by user j to movie i (defined only if r(i,j)=1 )
  • nm = no. of movies
  • Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20101.png

Content-based Recommendations

  • For each user j, learn θ(i)Rn+1.
  • Predict user j as rating movie i with (θ(i))Tx(i) stars. (x0=1)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20102.png

x^{(3)}=
[1 0.99\0]
\ \leftrightarrow \
\theta^{(1)}=
[0 5\0]
\
\
(\theta^{(1)})^Tx^{(3)} = 5 \times0.99=4.95

r(i,j)=1 if user j has rated movie i

y(i,j) = rating givren by user j to movie i (defined only if r(i,j)=1 )

θ(j) = parameter vector for user j

x(i) = feature vector for movie i

m(j) = no. of movies rated by user j

For user j, movie i, predicted rating : (θ(j))T(x(i))

  • To learn θ(j) (parameter for user j)
  • \underset{\theta^{(j)}}{min}
    \frac 1 2 \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{k=1}^n(\theta^{(j)}_k)^2
  • To learn θ(1),θ(2),...,θ(nu) (Optimization Algorithm)
  • \underset{\theta^{(1)},...,\theta^{(n_u)}}{min}
    \frac 1 2 \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2
  • Gradient descent updateθ(j)k:=θ(j)kα(i:r(i,j)=1((θ(j))Tx(i)y(i,j))x(i)k+λθ(j)k) (k0)
  • θ(j)k:=θ(j)kαi:r(i,j)=1((θ(j))Tx(i)y(i,j))x(i)k (k=0)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20103.png

Collaborative Filtering

  • How to choose features ( x1,x2,...) (x0=1)

Machine%20Learning%20c2998368a7ad464e87a39a2aa592963a/Untitled%20104.png

if \ x^{(1)}=
[1 1.0 0.0]

\ (\theta^{(1)})^Tx^{(1)} \approx 5
\ (\theta^{(2)})^Tx^{(1)} \approx 5
\ (\theta^{(3)})^Tx^{(1)} \approx 0
\ (\theta^{(4)})^Tx^{(1)} \approx 0

\rarr \
\theta^{(1)}=[0\5\0]
\ \theta^{(2)}=[0\5\0]
\ \theta^{(3)}=[0\0\5]
\ \theta^{(4)}=[0\0\5]

  • Given parameters θ(1),...,θ(nu), to learn x(1),...,x(nm)
  • \underset{x^{(1)},...,x^{(n_m)}}{min}
    \frac 1 2 \sum_{i=1}^{n_m} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2
  • Given x(1),...,x(nm) (features) and movie ratings, can estimate θ(1),...,θ(nu) (parameters)
  • Given parameters θ(1),...,θ(nu), can estimate x(1),...,x(nm) (features)

θ (initial random guess)xθxθx

Collaborative Filtering Algorithm

  • Given x(1),...,x(nm), estimate θ(1),θ(2),...,θ(nu)
  • \underset{\theta^{(1)},...,\theta^{(n_u)}}{min}
    \frac 1 2 \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2
  • Given parameters θ(1),...,θ(nu), estimate x(1),...,x(nm)
  • \underset{x^{(1)},...,x^{(n_m)}}{min}
    \frac 1 2 \sum_{i=1}^{n_m} \sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}_k)^2
  • Minimizing x(1),...,x(nm) and θ(1),...,θ(nu) simultaneously\rarr
    \underset{x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}}
    {min}
    J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})
  • J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})=
    \frac 1 2 \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac \lambda 2 \sum_{i=1}^{n_m}\sum_{k=1}^n(x^{(i)}k)^2+\frac \lambda 2 \sum{j=1}^{n_u}\sum_{k=1}^n(\theta^{(j)}_k)^2
  1. Initialize x(1),...,x(nm) and θ(1),...,θ(nu) to small random values. (xRn,θRn)
  2. Minimize J(x(1),...,x(nm),θ(1),...,θ(nu)) using gradient descent (or advanced optimization algorithm)x(i)k:=x(i)kα(j:r(i,j)=1((θ(j))Tx(i)y(i,j))θ(j)k+λx(i)k)
  3. θ(j)k:=θ(j)kα(i:r(i,j)=1((θ(j))Tx(i)y(i,j))x(i)k+λθ(j)k)
  4. E.g. for every j=1,...,nu , i=1,...,nm (not from zero) (no x0,θ0)
  5. For a user with parameters θ and a movie with learned features x, predict a star rating of θTx.

Vectorization : Low rank matrix Factorization

Untitled

Y=
[5 5 0 0 5 ? ? 0\? 4 0 ?\0 0 5 4 0 0 5 0]
=y^{(i,j)}

n_m=5
\n_u=4

X=[(x(1))T\(x(2))T  (x(nm))T]

\Theta=[(θ(1))T\(θ(2))T  (θ(nu))T]

  • Predicted Ratings (Low Rank Matrix Factorization)

X\Theta^T=\begin{bmatrix}
(\theta^{(1)})^T(x^{(1)}) \ (\theta^{(2)})^T(x^{(1)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(1)})

\(\theta^{(1)})^T(x^{(2)}) \ (\theta^{(2)})^T(x^{(2)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(2)})
\ \vdots
\ \(\theta^{(1)})^T(x^{(n_m)}) \ (\theta^{(2)})^T(x^{(n_m)}) \ \cdots \ (\theta^{(n_u)})^T(x^{(n_m)})
\end{bmatrix}

  • Finding related movies

For each product i, we learn a feature vector x(i)Rn

To find movies j related to movie i, find movie with the smallest x(i)x(j).

Implementational Detail : Mean Normalization

  • It's not a good idea to set to zero when user haven't rated any movies.

Untitled

Y=
[5 5 0 0 ?_ 5 ? ? 0 ?_\? 4 0 ? ?_\0 0 5 4 ?_ 0 0 5 0 ?_]

  • Take average of other ratings and put it into matrix μ.

\mu=
[2.5\2.5\2\2.25\1.25]

Untitled

  • Then learn θ(j),x(i) with mean normalized Y.
  • For user j, on moive i, predict (add μ back for prediction)
  • (θ(j))T(x(i))+μi

e.x. For user 5 (Eve) : (θ(5))T(x(i))+μi

Large Scale Machine Learning

Learning with Large Datasets

"It's not who has the best algorithm that wins. It's who has the most data."

  • To check if training algorithm is ok with selected data sets, we can plot the learning curve. (Sanity check)

Untitled

Stochastic Gradient Descent

  • Batch Gradient Descent (using all training examples in each iteration)
  • Untitled
  • Stochastic Gradient Descent (using single training example in each iteration.)
  1. Randomly shuffle training examples
  2. Repeat {θj:=θjα(hθ(x(i))y(i))x(i)j}
  3. (for j:=1,...,n) }
  4. for i:=1,...,m {

Untitled

Mini-batch Gradient Descent

  • Use b examples in each iteration

if b=10,m=1,000

Repeat {

for i=1,11,21,31,...,991 {

θj:=θjα110i+9k=i(hθ(x(k))y(k))x(k)j

(for every j=0,...,n) }

}

  • Mini-batch Gradient Descent is efficient when examples are vectorized well.

Stochastic Gradient Descent Convergence

Checking for Convergence

  • Batch Gradient DescentJtrain(θ)=12mmi=1(hθ(x(i))y(i))2
  • Plot Jtrain(θ) as a function of the number of iterations of gradient descent.
  • Stochastic Gradient DescentDuring learning, compute cost(θ,(x(i),y(i))) before updating θ using (x(i),y(i)).
  • Every 1,000 iterations (say), plot cost(θ,(x(i),y(i))) averaged over the last 1,000 examples processed by algorithm.
  • cost(θ,(x(i),y(i)))=12(hθ(x(i))y(i))2

Untitled

Learning rate

  • Learning rate α is typically held constant. Can slowly decrease α over time if we want θ to converge.
  • α=const1iterationNumber+const2

Untitled

Online Learning

Untitled

  • Can adopt to changing user preference.
  • Learn from (x,y) then discard. not (x(i),y(i)).
  • Repeat Forever {Update θ using (x,y) :(j=0,...,n)
  • }
  • θj:=θjα(hθ(x)y)xj
  • Get (x,y) corresponding to user.

Other Online Learning Example

Untitled

Map-reduce and Data Parallelism

  • Many Learning algorithms can be expressed as computing sums of functions over the training set.
    • E.g. Gradient descent, Logistic regression, Deep learning, ...
  • Batch Gradient Descent (m=400)
  • Untitled

UntitledUntitled

Application example : Photo OCR

Problem description and pipeline

Untitled

  • Photo OCR (Optical Character Recognition) pipline
    1. Text detection
    2. Character segmentation
    3. Character classification

Sliding windows

  • Supervised learning for pedestrian detection
  • Move n pixels (step-size) in each steps

UntitledUntitledUntitled

  • Text detection
  • Untitled
  • 1D Sliding window for character segmentation
  • Untitled

Getting lots of data : Artificial data synthesis

  • Synthesizing data by introducing distortions(E.g. Audio : Background noise, bad cellphone connection)(E.g. intensity (brightness) of pixel i / xixi+random noise)
  • Usually does not help to add purely random / meaningless noise to your data.
  • Distortion introduced should be representation of the type of noise / distortions in the test set.

Discussion on getting more data

  1. Make sure you have a low bias classifier before expanding the effort. (plot learning curve)
  2. Think "How much work would it be to get 10x as much data as we currently have?"
  • Artificial data synthesis
  • Collect/label it yourself
  • "Crowd source"

Ceiling analysis : What part to the pipeline to work on next

  • Estimating the errors due to each component (Ceiling analysis)

UntitledUntitled