Machine learning notes

Linear Regression with one variable

Notation

  • m = Number of training examples
  • x’s = “input” variable / features
  • y’s = “output” variable / “target” variable

Hypothesis

\[ h_{\theta}(x) = \theta_0 + \theta_1x \]

Cost function

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \sum_{i=1}^m (\theta_0+\theta_1x^{(i)} - y^{(i)})^2 \]
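A minimal NumPy sketch of this cost function (the helper name `compute_cost` and the toy data are illustrative, not from the course):

```python
import numpy as np

# Cost for one-variable linear regression:
# J = 1/(2m) * sum((theta0 + theta1*x - y)^2)
def compute_cost(x, y, theta0, theta1):
    m = len(y)
    predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])                  # toy inputs
y = np.array([1.0, 2.5, 3.5])                  # toy targets
print(compute_cost(x, y, 0.0, 1.0))            # J at theta0 = 0, theta1 = 1
```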

Gradient descent

\[\begin{align} \theta_j &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\Theta) \\ &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} \left(\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2\right) \\ &= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j \end{align} \]
  • j = 0
\[ \theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m(\theta_0+\theta_1x^{(i)}-y^{(i)}) \]
  • j = 1
\[ \theta_1 = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m(\theta_0+\theta_1x^{(i)}-y^{(i)})x^{(i)} \]

(simultaneously update \(\theta_j\) for all j)
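A sketch of this update loop in NumPy; the toy data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Batch gradient descent for h(x) = theta0 + theta1 * x.
# theta0 and theta1 are updated simultaneously, as noted above.
def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        error = (theta0 + theta1 * x) - y               # h_theta(x^(i)) - y^(i)
        temp0 = theta0 - alpha / m * np.sum(error)      # j = 0 update
        temp1 = theta1 - alpha / m * np.sum(error * x)  # j = 1 update
        theta0, theta1 = temp0, temp1                   # simultaneous update
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
print(gradient_descent(x, y))
```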

Linear Regression with multiple variables

Notation

  • \(n\) = number of features
  • \(x^{(i)}\) = input of \(i^{th}\) training example
  • \(x^{(i)}_j\) = value of the \(j^{th}\) feature in the \(i^{th}\) training example.
  • e.g. \(x^{(2)}_3 = 2\)

Hypothesis derivation

  • Previously: \(h_{\theta}(x) = \theta_0 + \theta_1x\)
  • New algorithm: \(h_{\theta}(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3\)

\[ X = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in R^{n+1} \]

\[ \Theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \in R^{n+1} \]

  • For convenience of notation, define \(x_0=1\)

\[ h_{\theta}(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 = \Theta^TX \]

Hypothesis

\[\begin{align} h_{\theta}(x) &= \Theta^TX \\ &= \theta_0x_0 + \theta_1x_1 + \theta_2x_2+ \cdots + \theta_nx_n \end{align} \]
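A small sketch of computing \(h_\theta(x) = \Theta^Tx\) with \(x_0 = 1\) prepended (the numbers are arbitrary):

```python
import numpy as np

theta = np.array([1.0, 0.5, 2.0])   # theta_0, theta_1, theta_2
x_raw = np.array([3.0, 4.0])        # raw features x_1, x_2
x = np.concatenate(([1.0], x_raw))  # define x_0 = 1
h = theta @ x                       # Theta^T x
print(h)                            # 1.0 + 0.5*3 + 2.0*4 = 10.5
```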

Cost function

\[\begin{align} J(\theta) &= J(\theta_0, \theta_1, \cdots, \theta_n) \\ &= \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 \end{align} \]

Gradient descent

\[\begin{align} \theta_j &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\Theta) \\ &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} (\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2) \\ &= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})(x^{(i)}_j) \end{align} \]
  • j = 0
\[ \theta_0 = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_0 \]
  • j = 1
\[ \theta_1 = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x^{(i)}_1 \]

\(\cdots\) (and similarly for \(j = 2, \ldots, n\))
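The same update can be written for all \(\theta_j\) at once; a sketch, assuming `X` is the \(m \times (n+1)\) design matrix with the \(x_0 = 1\) column already added:

```python
import numpy as np

# One gradient-descent step over every theta_j simultaneously:
# theta := theta - (alpha/m) * X^T (X theta - y)
def gradient_step(X, y, theta, alpha):
    m = len(y)
    error = X @ theta - y                       # h_theta(x^(i)) - y^(i) for all i
    return theta - (alpha / m) * (X.T @ error)
```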

Feature Scaling

  • Make sure features are on a similar scale.
  • Get every feature into approximately a \(-1 \leq x_i \leq 1\) range.

Mean normalization

Replace \(x_i\) with \(x_i-\mu_i\) to make features have approximately zero mean
(do not apply to \(x_0 = 1\)).

E.g.

\[ x_1 = \frac{\text{size}-1000}{2000} \] \[ x_2 = \frac{\text{bedrooms}-2}{5} \] \[ x_1 = \frac{x_1-\mu_1}{\sigma_1} \] \[ x_2 = \frac{x_2-\mu_2}{\sigma_2} \]
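A sketch of mean normalization in NumPy; `X` holds the raw features (without the \(x_0 = 1\) column) and the values are made up:

```python
import numpy as np

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])       # columns: size, bedrooms

mu = X.mean(axis=0)                 # mu_j for each feature
sigma = X.std(axis=0)               # sigma_j for each feature
X_norm = (X - mu) / sigma           # each feature now has ~zero mean, unit scale
```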

Learning rate

While doing gradient descent

\[\theta_j = \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\Theta) \]

“Debugging”: How to make sure gradient descent is working correctly.

How to choose learning rate

Summary:

  • If \(\alpha\) too small: slow convergence
  • If \(\alpha\) too large: \(J(\theta)\) may not decrease on every iteration; may not converge.

To choose \(\alpha\), try

\(\cdots, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, \cdots\)
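One way to see both effects is to record \(J(\theta)\) after every iteration for each candidate \(\alpha\) and check that it keeps decreasing; a self-contained sketch with made-up data (`X` already contains the \(x_0 = 1\) column):

```python
import numpy as np

X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.5, 3.5])
m = len(y)

def cost_history(alpha, num_iters=100):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))    # gradient step
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))   # J(theta)
    return history

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    J = cost_history(alpha)
    trend = "decreasing" if all(a >= b for a, b in zip(J, J[1:])) else "NOT decreasing"
    print(f"alpha={alpha}: J {trend}, final J={J[-1]:.4g}")
```

Plotting the recorded history against the iteration number gives the usual convergence curve.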

Normal Equation

Normal equation: a method to solve for \(\theta\) analytically.

\[ \theta = (X^TX)^{-1}X^Ty \]
Gradient Descent
  • Need to choose \(\alpha\)
  • Needs many iterations
  • Works well even when \(n\) is large

Normal Equation
  • No need to choose \(\alpha\)
  • Does not need to iterate
  • Need to compute \((X^TX)^{-1}\), time complexity \(O(n^3)\)
  • Slow if \(n\) is very large
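A sketch of the normal equation in NumPy with made-up housing data; `np.linalg.pinv` is used so the sketch also handles a non-invertible \(X^TX\):

```python
import numpy as np

# X includes the x_0 = 1 column; y are the target prices (toy values).
X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1600.0, 3.0],
              [1.0, 2400.0, 4.0],
              [1.0, 1416.0, 2.0]])
y = np.array([400.0, 330.0, 369.0, 232.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # theta = (X^T X)^{-1} X^T y
print(theta)                                # solved in one step: no alpha, no iterations
```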

Linear Algebra:

Matrices and vectors

  • Matrix:
  • Dimension of matrix:
  • Matrix Elements
  • Vector

Addition and scalar multiplication

Matrix Addition

Scalar Multiplication

Matrix-vector multiplication

Matrix-matrix multiplication

Matrix multiplication properties

Inverse and transpose

  • Inverse
  • Transpose
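A quick NumPy sketch of the operations listed above (the matrices are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])    # 2x2 matrix
B = np.array([[0.0, 1.0], [1.0, 0.0]])    # 2x2 matrix
v = np.array([1.0, 2.0])                  # vector in R^2

print(A + B)                      # matrix addition (element-wise)
print(3 * A)                      # scalar multiplication
print(A @ v)                      # matrix-vector multiplication
print(A @ B)                      # matrix-matrix multiplication
print(np.allclose(A @ B, B @ A))  # False: matrix multiplication is not commutative
print(A.T)                        # transpose
print(np.linalg.inv(A))           # inverse (square, non-singular matrices only)
```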

Octave Tutorial

Basic operations

Moving data around

Computing on data

Plotting data

Control statements: for, while, if statements

Vectorized implementation

\[\begin{split} \theta_j &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\Theta) \\ &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} \left(\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2\right) \\ &= \theta_j - \alpha \frac{\partial}{\partial\theta_j} \left(\frac{1}{2m} \sum_{i=1}^m (\theta_0+\theta_1x^{(i)} - y^{(i)})^2\right) \\ &= \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j \end{split} \]
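A sketch contrasting the per-parameter summation above with the fully vectorized update \(\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)\); the data and \(\alpha\) are made up, and `X` includes the \(x_0 = 1\) column:

```python
import numpy as np

X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
y = np.array([1.0, 2.0, 2.5])
theta = np.array([0.1, 0.2])
alpha, m = 0.1, len(y)

error = X @ theta - y                      # h_theta(x^(i)) - y^(i)

# Loop version: each theta_j from the summation form
theta_loop = np.array([theta[j] - alpha / m * np.sum(error * X[:, j])
                       for j in range(len(theta))])

# Vectorized version: one matrix expression, no explicit loop
theta_vec = theta - (alpha / m) * (X.T @ error)

assert np.allclose(theta_loop, theta_vec)  # both give the same update
print(theta_vec)
```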