Solver Overview

The solver orchestrates model optimization: it coordinates the network's forward inference and backward gradient computation to form parameter updates that attempt to reduce the loss. The responsibilities of learning are divided between the solver and the net: the solver supervises the optimization and generates the parameter updates, while the net yields the loss and the gradients.

Caffe ships with the following solvers:

  • Stochastic Gradient Descent (type: “SGD”),
  • AdaDelta (type: “AdaDelta”),
  • Adaptive Gradient (type: “AdaGrad”),
  • Adam (type: “Adam”),
  • Nesterov’s Accelerated Gradient (type: “Nesterov”) and
  • RMSprop (type: “RMSProp”)

The solver:

  1. scaffolds the bookkeeping for the whole optimization and creates the training network for learning and the test network(s) for evaluation.
  2. iteratively optimizes by calling forward/backward and updating the parameters.
  3. (periodically) evaluates the test network(s) to monitor the state of learning.
  4. snapshots the model and solver state throughout the optimization.

In each iteration, the solver:

  1. calls network forward to compute the output and the loss
  2. calls network backward to compute the gradients
  3. incorporates the gradients into a parameter update according to the solver method
  4. updates the solver state according to the learning rate, history, and method

The solver carries the weights from initialization all the way to the learned model. Like Caffe models, Caffe solvers run in CPU or GPU mode.
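To make these per-iteration steps concrete, here is a minimal pycaffe sketch of driving a solver from Python, assuming the Python bindings are built and the MNIST example solver exists at the path shown; solver.step(1) runs the forward pass, the backward pass, and the parameter update for one iteration.

import caffe

caffe.set_mode_cpu()  # solvers, like nets, also run in GPU mode (caffe.set_mode_gpu())
solver = caffe.get_solver('examples/mnist/lenet_solver.prototxt')

for _ in range(100):
    solver.step(1)  # forward + backward + parameter update for one iteration
    print(solver.iter, float(solver.net.blobs['loss'].data))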


Solver Methods

The solver methods address the general optimization problem of loss minimization. For a dataset $D$, the optimization objective is the average loss over all $|D|$ data instances throughout the dataset:
$$L(W) = \frac{1}{|D|} \sum_i^{|D|} f_W\left(X^{(i)}\right) + \lambda r(W)$$
where $f_W\left(X^{(i)}\right)$ is the loss on data instance $X^{(i)}$ and $r(W)$ is a regularization term with weight $\lambda$. In practice the dataset size $|D|$ can be very large, so in each solver iteration we use a stochastic approximation of this objective, drawing a mini-batch of $N \ll |D|$ instances:

$$L(W) \approx \frac{1}{N} \sum_i^N f_W\left(X^{(i)}\right) + \lambda r(W)$$

The model computes $f_W$ in the forward pass and the gradient $\nabla f_W$ in the backward pass.
The parameter update $\Delta W$ is formed by the solver from the error gradient $\nabla f_W$, the regularization gradient $\nabla r(W)$, and other particulars of each method.
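As an illustration of the mini-batch approximation (not Caffe code), here is a minimal NumPy sketch; the toy linear model, squared-error loss, and L2 regularizer $r(W) = \lVert W \rVert^2$ are assumptions chosen only to make the formula executable.

import numpy as np

def minibatch_objective(W, X, y, lam, batch_size, rng):
    """Approximate L(W) on a random mini-batch of N << |D| instances."""
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw the mini-batch
    preds = X[idx] @ W                                        # toy linear model
    data_loss = np.mean((preds - y[idx]) ** 2)                # average f_W over the batch
    reg = lam * np.sum(W ** 2)                                # lambda * r(W), here L2
    return data_loss + reg

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))   # |D| = 10000 instances
y = X @ rng.normal(size=5)
W = np.zeros(5)
print(minibatch_objective(W, X, y, lam=5e-4, batch_size=64, rng=rng))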


SGD

Stochastic gradient descent (type: “SGD”) updates the weights $W$ by a linear combination of the negative gradient $\nabla L(W)$ and the previous weight update $V_t$. The learning rate $\alpha$ is the weight of the negative gradient. The momentum $\mu$ is the weight of the previous update.

Formally, we use the following formulas to compute the update value $V_{t+1}$ and the updated weights $W_{t+1}$ at iteration $t+1$, given the previous weight update $V_t$ and the current weights $W_t$:
$$V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$$
$$W_{t+1} = W_t + V_{t+1}$$
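A minimal NumPy sketch of this update rule follows; grad_L stands in for the mini-batch gradient that Caffe would obtain from the backward pass, and the toy objective is an assumption for the sake of a runnable example.

import numpy as np

def sgd_momentum_step(W, V, grad_L, alpha=0.01, mu=0.9):
    """One step: V_{t+1} = mu*V_t - alpha*grad_L, W_{t+1} = W_t + V_{t+1}."""
    V_next = mu * V - alpha * grad_L
    return W + V_next, V_next

# Toy usage: minimize f(W) = 0.5*||W||^2, whose gradient is W itself.
W = np.array([1.0, -2.0, 3.0])
V = np.zeros_like(W)
for _ in range(200):
    W, V = sgd_momentum_step(W, V, grad_L=W)
print(W)  # shrinks toward the minimizer at the origin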

The learning hyperparameters ($\alpha$ and $\mu$) may require a bit of tuning for best results. If you're not sure where to begin, take a look at the “rules of thumb” below, and for further information you might refer to Leon Bottou's Stochastic Gradient Descent Tricks.

Rules of thumb for setting the learning rate $\alpha$ and momentum $\mu$

A good strategy for deep learning with SGD is to initialize the learning rate $\alpha$ to a value around $\alpha \approx 0.01 = 10^{-2}$, and drop it by a constant factor (e.g., 10) throughout training whenever the loss begins to reach an apparent “plateau”, repeating this several times. Generally, you probably want to use a momentum $\mu = 0.9$ or a similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both more stable and faster.
  
  
This was the strategy used by Krizhevsky et al.[^1] in their famous ILSVRC-2012-winning CNN entry, and Caffe makes this strategy easy to implement, as in the reproduction of that model (./examples/imagenet/alexnet_solver.prototxt).

To use a learning rate policy like this, you can put the following lines somewhere in your solver prototxt file:

base_lr: 0.01 # begin training at a learning rate of 0.01 = 1e-2
lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
# by a factor of gamma every stepsize iterations
gamma: 0.1 # drop the learning rate by a factor of 10
# (i.e., multiply it by a factor of gamma = 0.1)
stepsize: 100000 # drop the learning rate every 100K iterations
max_iter: 350000 # train for 350K iterations total
momentum: 0.9

Under the above settings, we always use momentum $\mu = 0.9$. We begin training at a base_lr of $\alpha = 0.01 = 10^{-2}$ for the first 100,000 iterations, then multiply the learning rate by gamma ($\gamma$) and train at $\alpha' = \alpha \gamma = (0.01)(0.1) = 0.001 = 10^{-3}$ for iterations 100K–200K, then at $\alpha'' = 10^{-4}$ for iterations 200K–300K, and finally train until iteration 350K at $\alpha''' = 10^{-5}$.
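This schedule can be checked with a few lines of Python; the sketch below reproduces Caffe's “step” policy, which computes lr = base_lr * gamma^floor(iter / stepsize).

def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100000):
    # drop the rate by a factor of gamma every stepsize iterations
    return base_lr * gamma ** (iteration // stepsize)

for it in (0, 100000, 200000, 300000, 349999):
    print(it, step_lr(it))  # 1e-2, 1e-3, 1e-4, 1e-5, 1e-5 (up to rounding)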

  
Note that the momentum setting $\mu$ effectively multiplies the size of your updates by a factor of $\frac{1}{1 - \mu}$ after many iterations of training, so if you increase $\mu$, it may be a good idea to decrease $\alpha$ accordingly (and vice versa).
  
  
For example, with momentum $\mu = 0.9$, we have an effective update size multiplier of $\frac{1}{1 - 0.9} = 10$. If we increased the momentum to $\mu = 0.99$, we'd have increased our update size multiplier to 100, so we should drop $\alpha$ by a factor of 10.

Note also that the above settings are merely guidelines; they're definitely not guaranteed to be optimal (or even to work at all!). If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., to 0.001) and re-training, repeating this until you find a base_lr value that works.

[^1]: A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.


AdaDelta

The AdaDelta method (type: “AdaDelta”) is a “robust learning rate method”[^2]. Like SGD, it is a gradient-based optimization method, with the following update rules:

$$ \begin{align}
(v_t)_i &= \frac{\operatorname{RMS}((v_{t-1})_i)}{\operatorname{RMS}\left( \nabla L(W_t) \right)_{i}} \left( \nabla L(W_{t'}) \right)_i \\
\operatorname{RMS}\left( \nabla L(W_t) \right)_{i} &= \sqrt{E[g^2] + \varepsilon} \\
E[g^2]_t &= \delta E[g^2]_{t-1} + (1-\delta) g_{t}^2
\end{align} $$
$$(W_{t+1})_i = (W_t)_i - \alpha (v_t)_i.$$
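As an illustration (not Caffe's implementation), a minimal NumPy sketch of this rule follows, keeping running averages of squared gradients and squared updates; the toy gradient, decay delta, and epsilon values are assumptions.

import numpy as np

def adadelta_step(W, grad, Eg2, Ev2, alpha=1.0, delta=0.95, eps=1e-6):
    Eg2 = delta * Eg2 + (1 - delta) * grad ** 2          # E[g^2]_t
    v = np.sqrt(Ev2 + eps) / np.sqrt(Eg2 + eps) * grad   # RMS of past updates / RMS of gradients
    Ev2 = delta * Ev2 + (1 - delta) * v ** 2             # running average of squared updates
    return W - alpha * v, Eg2, Ev2

# Toy usage on f(W) = 0.5*||W||^2 (gradient is W itself).
W = np.array([1.0, -2.0, 3.0])
Eg2 = np.zeros_like(W)
Ev2 = np.zeros_like(W)
for _ in range(1000):
    W, Eg2, Ev2 = adadelta_step(W, W, Eg2, Ev2)
print(W)  # drifts toward the minimizer at the origin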


AdaGrad

The adaptive gradient method (type: “AdaGrad”) (J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.)
is a gradient-based optimization method (like SGD) that attempts to “find needles in haystacks in the form of very predictive but rarely seen features,” as Duchi et al. put it. Given the update information from all previous iterations $\left( \nabla L(W) \right)_{t'}$ for $t' \in \{1, 2, \ldots, t\}$, the update proposed in the paper is as follows, specified for each component $i$ of the weights $W$:

$$
(W_{t+1})_i =
(W_t)_i - \alpha
\frac{\left( \nabla L(W_t) \right)_{i}}{
\sqrt{\sum_{t'=1}^{t} \left( \nabla L(W_{t'}) \right)_i^2}
}
$$

Note that in practice, for weights $W \in \mathcal{R}^d$, AdaGrad implementations use only $\mathcal{O}(d)$ extra storage for the historical gradient information, rather than the $\mathcal{O}(dt)$ storage that would be needed to store each historical gradient individually.
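As an illustration, a minimal NumPy sketch of the AdaGrad rule follows; only the running sum of squared gradients is kept, which is the $\mathcal{O}(d)$ history mentioned above. The small eps added for numerical safety is an assumption, not part of the formula.

import numpy as np

def adagrad_step(W, grad, hist, alpha=0.01, eps=1e-8):
    hist = hist + grad ** 2                        # accumulate squared gradients per component
    W = W - alpha * grad / (np.sqrt(hist) + eps)   # per-component scaled step
    return W, hist

# Toy usage on f(W) = 0.5*||W||^2 (gradient is W itself).
W = np.array([1.0, -2.0, 3.0])
hist = np.zeros_like(W)
for _ in range(1000):
    W, hist = adagrad_step(W, W, hist)
print(W)  # shrinks toward the origin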


Scaffolding

The solver scaffolding prepares the optimization method and initializes the model to be learned; this is done in Solver::Presolve().

> caffe train -solver examples/mnist/lenet_solver.prototxt
I0902 13:35:56.474978 16020 caffe.cpp:90] Starting Optimization
I0902 13:35:56.475190 16020 solver.cpp:32] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
solver_mode: GPU
net: "examples/mnist/lenet_train_test.prototxt"

Net initialization

I0902 13:35:56.655681 16020 solver.cpp:72] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
[...]
I0902 13:35:56.656740 16020 net.cpp:56] Memory required for data: 0
I0902 13:35:56.656791 16020 net.cpp:67] Creating Layer mnist
I0902 13:35:56.656811 16020 net.cpp:356] mnist -> data
I0902 13:35:56.656846 16020 net.cpp:356] mnist -> label
I0902 13:35:56.656874 16020 net.cpp:96] Setting up mnist
I0902 13:35:56.694052 16020 data_layer.cpp:135] Opening lmdb examples/mnist/mnist_train_lmdb
I0902 13:35:56.701062 16020 data_layer.cpp:195] output data size: 64,1,28,28
I0902 13:35:56.701146 16020 data_layer.cpp:236] Initializing prefetch
I0902 13:35:56.701196 16020 data_layer.cpp:238] Prefetch initialized.
I0902 13:35:56.701212 16020 net.cpp:103] Top shape: 64 1 28 28 (50176)
I0902 13:35:56.701230 16020 net.cpp:103] Top shape: 64 1 1 1 (64)
[...]
I0902 13:35:56.703737 16020 net.cpp:67] Creating Layer ip1
I0902 13:35:56.703753 16020 net.cpp:394] ip1 <- pool2
I0902 13:35:56.703778 16020 net.cpp:356] ip1 -> ip1
I0902 13:35:56.703797 16020 net.cpp:96] Setting up ip1
I0902 13:35:56.728127 16020 net.cpp:103] Top shape: 64 500 1 1 (32000)
I0902 13:35:56.728142 16020 net.cpp:113] Memory required for data: 5039360
I0902 13:35:56.728175 16020 net.cpp:67] Creating Layer relu1
I0902 13:35:56.728194 16020 net.cpp:394] relu1 <- ip1
I0902 13:35:56.728219 16020 net.cpp:345] relu1 -> ip1 (in-place)
I0902 13:35:56.728240 16020 net.cpp:96] Setting up relu1
I0902 13:35:56.728256 16020 net.cpp:103] Top shape: 64 500 1 1 (32000)
I0902 13:35:56.728270 16020 net.cpp:113] Memory required for data: 5167360
I0902 13:35:56.728287 16020 net.cpp:67] Creating Layer ip2
I0902 13:35:56.728304 16020 net.cpp:394] ip2 <- ip1
I0902 13:35:56.728333 16020 net.cpp:356] ip2 -> ip2
I0902 13:35:56.728356 16020 net.cpp:96] Setting up ip2
I0902 13:35:56.728690 16020 net.cpp:103] Top shape: 64 10 1 1 (640)
I0902 13:35:56.728705 16020 net.cpp:113] Memory required for data: 5169920
I0902 13:35:56.728734 16020 net.cpp:67] Creating Layer loss
I0902 13:35:56.728747 16020 net.cpp:394] loss <- ip2
I0902 13:35:56.728767 16020 net.cpp:394] loss <- label
I0902 13:35:56.728786 16020 net.cpp:356] loss -> loss
I0902 13:35:56.728811 16020 net.cpp:96] Setting up loss
I0902 13:35:56.728837 16020 net.cpp:103] Top shape: 1 1 1 1 (1)
I0902 13:35:56.728849 16020 net.cpp:109] with loss weight 1
I0902 13:35:56.728878 16020 net.cpp:113] Memory required for data: 5169924

Loss

I0902 13:35:56.728893 16020 net.cpp:170] loss needs backward computation.
I0902 13:35:56.728909 16020 net.cpp:170] ip2 needs backward computation.
I0902 13:35:56.728924 16020 net.cpp:170] relu1 needs backward computation.
I0902 13:35:56.728938 16020 net.cpp:170] ip1 needs backward computation.
I0902 13:35:56.728953 16020 net.cpp:170] pool2 needs backward computation.
I0902 13:35:56.728970 16020 net.cpp:170] conv2 needs backward computation.
I0902 13:35:56.728984 16020 net.cpp:170] pool1 needs backward computation.
I0902 13:35:56.728998 16020 net.cpp:170] conv1 needs backward computation.
I0902 13:35:56.729014 16020 net.cpp:172] mnist does not need backward computation.
I0902 13:35:56.729027 16020 net.cpp:208] This network produces output loss
I0902 13:35:56.729053 16020 net.cpp:467] Collecting Learning Rate and Weight Decay.
I0902 13:35:56.729071 16020 net.cpp:219] Network initialization done.
I0902 13:35:56.729085 16020 net.cpp:220] Memory required for data: 5169924
I0902 13:35:56.729277 16020 solver.cpp:156] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt

Completion

I0902 13:35:56.806970 16020 solver.cpp:46] Solver scaffolding done.
I0902 13:35:56.806984 16020 solver.cpp:165] Solving LeNet
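The same scaffolding can be driven from Python. Here is a minimal pycaffe sketch, assuming the Python bindings are built and the MNIST example paths exist, that creates the solver and inspects the train and test nets it has set up:

import caffe

caffe.set_mode_cpu()  # or caffe.set_mode_gpu(), matching solver_mode above
solver = caffe.get_solver('examples/mnist/lenet_solver.prototxt')

# Scaffolding has already created the training net and the test net(s):
print(solver.net)                     # training net
print(solver.test_nets[0])            # test net #0
print(list(solver.net.blobs.keys()))  # data, label, ..., loss, as in the log above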


Updating Parameters

The actual weight update is made by the solver and then applied to the net parameters in Solver::ComputeUpdateValue(). This method incorporates any weight decay $r(W)$ into the weight gradients (which at that point contain only the error gradients) to get the final gradient with respect to each network weight. These gradients are then scaled by the learning rate $\alpha$, and the update to subtract is stored in each parameter Blob's diff field. Finally, the Blob::Update method is called on each parameter blob, which performs the final update (subtracting the Blob's diff from its data).
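The following NumPy sketch mirrors that bookkeeping for plain SGD without momentum: the value to subtract is staged in a diff buffer and then applied by subtracting it from data, as Blob::Update does. The ToyBlob class, the L2 weight-decay term, and the numbers are illustrative assumptions, not Caffe code.

import numpy as np

class ToyBlob:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.diff = np.zeros_like(self.data)

    def update(self):
        self.data -= self.diff                      # what Blob::Update does

def compute_update_value(blob, error_grad, lr=0.01, weight_decay=0.0005):
    grad = error_grad + weight_decay * blob.data    # fold weight decay into the gradient
    blob.diff = lr * grad                           # scale by the learning rate, store in diff

blob = ToyBlob([1.0, -2.0, 3.0])
compute_update_value(blob, error_grad=np.array([0.5, 0.5, 0.5]))
blob.update()
print(blob.data)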


Snapshotting and Resuming

The solver snapshots the weights and its own state during training in Solver::Snapshot() and Solver::SnapshotSolverState(). The weight snapshots export the learned model, while the solver snapshots act as checkpoints that allow training to be resumed from that point. Resuming is done by Solver::Restore() and Solver::RestoreSolverState().

Weights are saved without extension, while solver states are saved with a .solverstate extension. Both files have an _iter_N suffix indicating the snapshot's iteration number.

Snapshotting is configured in the solver definition prototxt, as in this example:

# The snapshot interval in iterations.
snapshot: 5000
# File path prefix for snapshotting model weights and solver state.
# Note: this is relative to the invocation of the `caffe` utility, not the
# solver definition file.
snapshot_prefix: "/path/to/model"
# Snapshot the diff along with the weights. This can help debugging training
# but takes more storage.
snapshot_diff: false
# A final snapshot is saved at the end of training unless
# this flag is set to false. The default is true.
snapshot_after_train: true
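
Resuming from a saved .solverstate can also be done from Python. Below is a minimal pycaffe sketch; the MNIST solver path and the lenet_iter_5000.solverstate filename are assumptions based on the example shown earlier (the actual name depends on snapshot_prefix and the snapshot iteration).

import caffe

solver = caffe.get_solver('examples/mnist/lenet_solver.prototxt')
solver.restore('examples/mnist/lenet_iter_5000.solverstate')  # reload weights and solver state
solver.solve()                                                # continue training up to max_iter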

[^2]: M. Zeiler ADADELTA: AN ADAPTIVE LEARNING RATE METHOD. arXiv preprint, 2012.
