# Coursera Week 2 - Linear Regression with Multiple Variables (1)

## 1. Multiple Features

\begin{align}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the column vector of all the feature inputs of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \left| x^{(i)} \right| ; \text{(the number of features)} \end{align}


### 1.1 Hypothesis Function

Now define the multivariable form of the hypothesis function as follows, accommodating these multiple features:

$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$


Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

\begin{align} h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} … \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x \end{align}

The training examples are stored row-wise in $X$, like so:

\begin{align} X = \begin{bmatrix}x^{(1)}_0 & x^{(1)}_1 \newline x^{(2)}_0 & x^{(2)}_1 \newline x^{(3)}_0 & x^{(3)}_1 \end{bmatrix}&,\theta = \begin{bmatrix}\theta_0 \newline \theta_1 \newline \end{bmatrix} \end{align}

You can calculate the hypothesis as a column vector of size $m \times 1$ with:

$h_\theta(X) = X \theta$
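A minimal NumPy sketch of this vectorized hypothesis (the tiny dataset and $\theta$ values here are my own, purely for illustration):

```python
import numpy as np

# Hypothetical 3-example, 1-feature dataset; the first column is the
# intercept term x_0 = 1 added to every training example.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])   # shape (m, n+1)
theta = np.array([[1.0],
                  [2.0]])    # shape (n+1, 1)

# h_theta(X) = X @ theta computes all m predictions at once as an (m, 1) column.
h = X @ theta
print(h.ravel())  # [5. 7. 9.]
```

Each row of `X @ theta` is exactly $\theta^T x^{(i)}$ for one training example, so no loop over examples is needed.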

For the rest of these notes, $X$ will represent a matrix of training examples $x^{(i)}$ stored row-wise.

## 2. Cost function

For the parameter vector $\theta$ (of type $\mathbb{R}^{n+1}$, or equivalently $\mathbb{R}^{(n+1) \times 1}$), the cost function is:

$J(\theta) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$

The vectorized version is:

$J(\theta) = \dfrac {1}{2m} (X\theta - \vec{y})^{T} (X\theta - \vec{y})$

The vectorized version is very convenient: it computes the cost with matrix operations alone, with no explicit loop over the $m$ examples.
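The vectorized cost can be written directly as a NumPy function (the toy data below is hypothetical):

```python
import numpy as np

def compute_cost(X, theta, y):
    """Vectorized cost: J(theta) = 1/(2m) * (X@theta - y)^T (X@theta - y)."""
    m = len(y)
    residual = X @ theta - y          # (m, 1) column of prediction errors
    return float(residual.T @ residual) / (2 * m)

# Hypothetical data: 3 examples, one feature plus the intercept column.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[1.0], [2.0], [3.0]])
theta = np.zeros((2, 1))

# With theta = 0 the residual is -y, so J = (1 + 4 + 9) / (2*3).
print(compute_cost(X, theta, y))  # 2.333...
```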

## 3. Gradient Descent for Multiple Variables

**Matrix Notation**

The Gradient Descent rule can be expressed as:

$\theta := \theta - \alpha \nabla J(\theta)$

Where $\nabla J(\theta)$ is a column vector of the form:

$\nabla J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_0} \newline \frac{\partial J(\theta)}{\partial \theta_1} \newline \vdots \newline \frac{\partial J(\theta)}{\partial \theta_n} \end{bmatrix}$

The j-th component of the gradient is the summation of the product of two terms:

\begin{align} \; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac{1}{m} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)} \newline \; & &=& \frac{1}{m} \sum\limits_{i=1}^{m} x_j^{(i)} \cdot \left(h_\theta(x^{(i)}) - y^{(i)} \right) \end{align}

Such a summation of products can be expressed as an inner product of two vectors; stacking these inner products over $j$ vectorizes the whole gradient.

\begin{align}\; &\frac{\partial J(\theta)}{\partial \theta_j} = \frac1m \vec{x_j}^{T} (X\theta - \vec{y}) \newline &\nabla J(\theta) = \frac 1m X^{T} (X\theta - \vec{y}) \newline \end{align}

Finally, the matrix notation (vectorized) of the Gradient Descent rule is:

$\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})$
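This one-line update rule translates almost verbatim into NumPy. A sketch, again with my own toy data (generated from $y = 1 + 2x$, so $\theta$ should approach $[1, 2]^T$):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat theta := theta - (alpha/m) * X^T (X@theta - y) for num_iters steps."""
    m = len(y)
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Hypothetical data drawn from y = 1 + 2x (intercept column x_0 = 1 included).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
theta = gradient_descent(X, y, np.zeros((2, 1)), alpha=0.1, num_iters=2000)
print(theta.ravel())  # close to [1. 2.]
```

Note that the entire update for all $n+1$ parameters happens simultaneously in one matrix expression, which is exactly what the "repeat for our $n$ features" loop below spells out component by component.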

The gradient descent equation itself is generally the same form; we just have to repeat it for our ‘n’ features:

\begin{align} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline \; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline \; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline & \cdots \newline \rbrace \end{align}

In other words:

\begin{align} & \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0..n} \newline \rbrace \end{align}

### 3.1 Feature Scaling

Idea: make sure features are on a similar scale (feature scaling). Get every feature into approximately a $-1 \leq x_i \leq 1$ range.

Replace $x_i$ with $x_i - \mu_i$, where $\mu_i$ is the mean of feature $i$, to give features approximately zero mean (do not apply this to $x_0 = 1$). Dividing additionally by the range or standard deviation $s_i$, i.e. $x_i := (x_i - \mu_i)/s_i$, also puts the features on a similar scale.
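A sketch of mean normalization in NumPy, using standard deviation as the scale; the two-feature housing-style data (size, number of bedrooms) is hypothetical:

```python
import numpy as np

def feature_normalize(X):
    """Mean normalization: give each column roughly zero mean and unit spread.
    (Do not apply this to the intercept column x_0 = 1.)"""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation as the scale s_i
    return (X - mu) / sigma, mu, sigma

# Hypothetical raw features on very different scales: size in ft^2, #bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
X_norm, mu, sigma = feature_normalize(X)
print(X_norm.mean(axis=0))  # approximately [0, 0]
```

The returned `mu` and `sigma` must be kept, since any new example has to be normalized with the same parameters before making a prediction.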

### 3.2 Learning Rate

\begin{align} \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \end{align}

- Debugging: how to make sure gradient descent is working correctly
- How to choose the learning rate $\alpha$

Summary:

- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration; it may not converge.

To choose $\alpha$, try

…, 0.001, 0.01, 0.1, 1, …

## 4. Polynomial Regression

### 4.1 Polynomial Regression

Feature normalization is very important here, because the powers of a feature (e.g. $x$, $x^2$, $x^3$) take on very different ranges.
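A short sketch of why normalization matters for polynomial features (the single feature `x` is hypothetical):

```python
import numpy as np

# Hypothetical single feature x; build polynomial features x, x^2, x^3
# so linear regression can fit a cubic curve.
x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = np.column_stack([x, x**2, x**3])

# Without scaling, the three columns span wildly different ranges:
print(X_poly.max(axis=0))  # [ 4. 16. 64.]

# Normalizing each column puts them back on a comparable scale,
# which keeps gradient descent well behaved.
X_scaled = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)
```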

### 4.2 Choice of Features

@2017-02-10 review done

## 5. Normal Equation

$\theta = (X^T X)^{-1}X^T y$

### 5.1 Cost and Gradient in Vector Form

$J(\theta) = \dfrac {1}{2m} (X\theta - \vec{y})^{T} (X\theta - \vec{y})$

\begin{align}\; &\frac{\partial J(\theta)}{\partial \theta_j} = \frac1m \vec{x_j}^{T} (X\theta - \vec{y}) \newline &\nabla J(\theta) = \frac 1m X^{T} (X\theta - \vec{y}) \newline \end{align}

### 5.2 House Price Example

\begin{align} \nabla J(\theta) = \frac 1m X^{T} (X\theta - \vec{y}) \newline \end{align}

Setting $\nabla J(\theta) = 0$ and solving for $\theta$ gives $\theta = (X^T X)^{-1}X^T y$.
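The normal equation is one line of NumPy. A sketch on the same kind of toy data used above (generated from $y = 1 + 2x$, so the exact solution is $[1, 2]^T$):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^{-1} X^T y; pinv (the pseudo-inverse)
    still gives a sensible answer even if X^T X is singular."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Hypothetical data from y = 1 + 2x; no learning rate, no iterations needed.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
theta = normal_equation(X, y)
print(theta.ravel())  # [1. 2.]
```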

### 5.3 Gradient Descent vs. Normal Equation ($m$ training examples, $n$ features)

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | Don't need to iterate |
| Works well even when $n$ is large | Slow if $n$ is very large (computing $(X^T X)^{-1}$ is expensive) |

As a rule of thumb, it is usually around $n = 10{,}000$ that one might start to consider switching from the normal equation to gradient descent, or to some of the other algorithms discussed later in the class.

### 5.4 $X^T X$ is non-invertible

$\theta = (X^T X)^{-1}X^T y$

What if $X^T X$ is non-invertible (singular / degenerate)?

This happens rarely. The common causes are:

- Redundant features (linearly dependent features, e.g. the same size expressed in ft² and in m²): delete one of them.
- Too many features ($m \leq n$): delete some features, or use regularization.

In practice, using a pseudo-inverse (`pinv` in Octave or NumPy) instead of a plain inverse still yields a usable $\theta$ even when $X^T X$ is non-invertible.