🧐Deep learning Guide 12: Gradient Descent 梯度下降
2024-6-28
2024-7-1

Gradient Descent

In this section we are going to introduce the basic concepts underlying gradient descent. Although it is rarely used directly in deep learning, an understanding of gradient descent is key to understanding stochastic gradient descent algorithms. For instance, the optimization problem might diverge due to an overly large learning rate. This phenomenon can already be seen in gradient descent. Likewise, preconditioning is a common technique in gradient descent and carries over to more advanced algorithms. Let's start with a simple special case.

One-Dimensional Gradient Descent

Gradient descent in one dimension is an excellent example to explain why the gradient descent algorithm may reduce the value of the objective function. Consider some continuously differentiable real-valued function $f: \mathbb{R} \rightarrow \mathbb{R}$. Using a Taylor expansion we obtain

$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2).$$
  • In the Taylor expansion, $\mathcal{O}(\epsilon^2)$ is called Big O notation or Landau notation. It is a mathematical notation that describes the asymptotic behavior of a function, in this case the remainder term of the Taylor expansion.
  • Informally, $\mathcal{O}(\epsilon^2)$ represents a term that grows at most quadratically with $\epsilon$, i.e., as $\epsilon$ approaches 0, the term vanishes like $\epsilon^2$. In other words, the remainder term is bounded by a constant multiple of $\epsilon^2$.
  • Formally, $\mathcal{O}(\epsilon^2)$ is defined as
    $$\mathcal{O}(\epsilon^2) = \left\{ g(\epsilon) : \exists\, C > 0,\ \delta > 0 \ \text{such that}\ |g(\epsilon)| \leq C \epsilon^2 \ \text{for all}\ \epsilon \in (-\delta, \delta) \right\}.$$
  • This definition says that $\mathcal{O}(\epsilon^2)$ is a set of functions $g$ that satisfy the following condition:
    • There exist positive constants $C$ and $\delta$ such that
    • For all $\epsilon$ in the interval $(-\delta, \delta)$, the absolute value of $g(\epsilon)$ is bounded by $C$ times the square of $\epsilon$: $|g(\epsilon)| \leq C \epsilon^2$.
To keep things simple, pick a fixed step size $\eta > 0$ and choose $\epsilon = -\eta f'(x)$. Plugging this into the Taylor expansion yields

$$f(x - \eta f'(x)) = f(x) - \eta f'(x)^2 + \mathcal{O}(\eta^2 f'(x)^2).$$

If the derivative $f'(x) \neq 0$ does not vanish we make progress since $\eta f'(x)^2 > 0$. Moreover, we can always choose $\eta$ small enough for the higher-order terms to become irrelevant. Hence we arrive at

$$f(x - \eta f'(x)) \lessapprox f(x).$$

This means that if we use $x \leftarrow x - \eta f'(x)$ to iterate $x$, the value of $f(x)$ might decline.
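The one-dimensional update rule $x \leftarrow x - \eta f'(x)$ can be sketched in a few lines of Python. The objective $f(x) = x^2$ (with gradient $2x$) and the step size $\eta = 0.2$ below are illustrative choices, not taken from the text:

```python
# One-dimensional gradient descent on f(x) = x**2, whose derivative is 2x.
def gradient_descent_1d(grad, x0, eta=0.2, num_steps=10):
    """Iterate x <- x - eta * f'(x) and return the trajectory of x values."""
    x = x0
    trajectory = [x]
    for _ in range(num_steps):
        x -= eta * grad(x)  # the update rule derived above
        trajectory.append(x)
    return trajectory

traj = gradient_descent_1d(lambda x: 2 * x, x0=10.0)
print(traj[-1])  # close to the minimizer x* = 0
```

With $\eta = 0.2$ each step multiplies $x$ by $1 - 2\eta = 0.6$, so the iterates shrink geometrically toward the minimum; an overly large $\eta$ (here, anything above 1) would instead make the iterates diverge, illustrating the learning-rate caveat from the introduction.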

Multivariate Gradient Descent

Now that we have a better intuition of the univariate case, let's consider the situation where $f: \mathbb{R}^d \rightarrow \mathbb{R}$. That is, the objective function maps vectors into scalars. Correspondingly its gradient is multivariate, too. It is a vector consisting of $d$ partial derivatives:

$$\nabla f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top.$$
Each partial derivative element $\partial f(\mathbf{x}) / \partial x_i$ in the gradient indicates the rate of change of $f$ at $\mathbf{x}$ with respect to the input $x_i$. As before in the univariate case we can use the corresponding Taylor approximation for multivariate functions to get some idea of what we should do. In particular, we have that

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2).$$

In other words, up to second-order terms in $\boldsymbol{\epsilon}$, the direction of steepest descent is the negative gradient $-\nabla f(\mathbf{x})$. Choosing a suitable learning rate $\eta > 0$ yields the gradient descent update

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}).$$
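A minimal multivariate sketch follows the same pattern, just with vectors. The objective $f(\mathbf{x}) = x_1^2 + 2x_2^2$, its gradient $[2x_1, 4x_2]^\top$, and the settings $\eta = 0.1$, 50 steps are illustrative assumptions:

```python
import numpy as np

# Multivariate gradient descent on f(x) = x1**2 + 2 * x2**2,
# whose gradient is [2 * x1, 4 * x2].
def gradient_descent(grad, x0, eta=0.1, num_steps=50):
    """Iterate x <- x - eta * grad(x) and return the final point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad(x)  # move against the gradient
    return x

x_final = gradient_descent(lambda x: np.array([2 * x[0], 4 * x[1]]),
                           x0=[-5.0, -2.0])
print(x_final)  # both coordinates approach the minimizer (0, 0)
```

Note that the two coordinates converge at different rates (the $x_2$ direction has the larger curvature), which is exactly the kind of ill-conditioning that preconditioning, mentioned in the introduction, is designed to address.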

Stochastic Gradient Updates

In deep learning, the objective function is usually the average of the loss functions for each example in the training dataset. Given a training dataset of $n$ examples, we assume that $f_i(\mathbf{x})$ is the loss function with respect to the training example of index $i$, where $\mathbf{x}$ is the parameter vector. Then we arrive at the objective function

$$f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^n f_i(\mathbf{x}).$$

The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(\mathbf{x}).$$

If gradient descent is used, the computational cost of each independent variable iteration is $\mathcal{O}(n)$, which grows linearly with $n$. Therefore, when the training dataset is larger, the cost of gradient descent for each iteration will be higher. Stochastic gradient descent (SGD) reduces the computational cost at each iteration. At each iteration of stochastic gradient descent, we uniformly sample an index $i \in \{1, \ldots, n\}$ at random and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}).$$

The computational cost per iteration drops from $\mathcal{O}(n)$ to $\mathcal{O}(1)$. Moreover, because the index $i$ is sampled uniformly, the stochastic gradient $\nabla f_i(\mathbf{x})$ is an unbiased estimate of the full gradient $\nabla f(\mathbf{x})$.
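The stochastic update above can be sketched on a toy problem. Here the per-example loss $f_i(x) = \frac{1}{2}(a_i x - b_i)^2$, the three data points, and the settings $\eta = 0.05$, 500 steps are all illustrative assumptions, not from the text:

```python
import random

# SGD on a tiny least-squares problem: f_i(x) = 0.5 * (a_i * x - b_i)**2,
# so the per-example gradient is (a_i * x - b_i) * a_i.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (a_i, b_i) with b_i = 2 * a_i

def sgd(data, x0=0.0, eta=0.05, num_steps=500, seed=0):
    """At each step, sample one example uniformly and take a gradient step."""
    rng = random.Random(seed)
    x = x0
    for _ in range(num_steps):
        a, b = rng.choice(data)   # uniformly sampled example i
        grad_i = (a * x - b) * a  # gradient of f_i only: O(1) per step
        x -= eta * grad_i         # x <- x - eta * grad f_i(x)
    return x

x_hat = sgd(data)
print(x_hat)  # approaches the least-squares solution x* = 2
```

Each step touches a single example rather than all $n$ of them, which is the whole point: the cost per update no longer depends on the dataset size.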