[COURSE DL#3] Multilayer Neural Networks

1. Feedforward Operation

1.1 History of neural networks

  • Mathematical model of neural networks:

McCulloch and Pitts 1943

Included both recurrent (networks “with circles”) and non-recurrent networks

Used a thresholding function as the nonlinear activation

No learning

  • Early work on neural networks

Starting from Rosenblatt 1958

Using a thresholding function as the nonlinear activation prevented computing derivatives with the chain rule, so errors could not be propagated back to guide gradient computation

  • Development of backpropagation since the 1960s

The key idea is to use the chain rule to compute derivatives
It appeared in multiple works, the earliest from the field of control theory

  • Standard backpropagation for neural networks

Rumelhart, Hinton, and Williams, Nature 1986. They clearly appreciated the power of backpropagation, demonstrated it on key tasks, and applied it to pattern recognition generally.

In 1985, Yann LeCun independently developed a learning algorithm for three-layer networks in which target values, rather than derivatives, were propagated. In 1986, he proved that it was equivalent to standard backpropagation.

  • Proof of the universal representation capability of three-layer neural networks:

Hecht-Nielsen 1989

  • Convolutional neural networks

Introduced by Kunihiko Fukushima in 1980

Improved by LeCun, Bottou, Bengio, and Haffner in 1998

  • Deep belief networks

Hinton, Osindero, and Teh 2006

  • Autoencoders

Hinton and Salakhutdinov 2006 (Science)

  • Deep learning

Hinton. Learning multiple layers of representation. Trends in Cognitive Sciences, 2007.

Unsupervised multilayer pre-training + supervised fine-tuning (BP)

  • Large-scale deep learning for speech recognition

Geoff Hinton and Li Deng started this research at Microsoft Research Redmond in late 2009.

Generative DBN pre-training was not necessary

Success was achieved by large-scale training data + large deep neural network (DNN) with large, context-dependent output layers

  • Unsupervised learning on large-scale images

Andrew Ng et al. 2011

Unsupervised feature learning

16,000 CPUs

  • Supervised learning on large-scale images

Krizhevsky, Sutskever, and Hinton 2012

Supervised learning with convolutional neural network

No unsupervised pre-training

1.2 Two-layer neural networks model linear classifiers

image-20200217085109235

However, a two-layer neural network cannot solve nonlinear problems.

image-20200217085714004

1.3 Add a hidden layer to model nonlinear classifiers

image-20200217090723061
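As an illustration, here is a minimal NumPy sketch of this idea on XOR, the classic nonlinear problem. All weights are hand-chosen for the example (not learned); with one hidden layer of threshold units the problem becomes linearly separable, which no single linear unit can achieve.

```python
import numpy as np

def step(x):
    # Hard threshold activation, as in early perceptron-style units
    return (x > 0).astype(float)

# Hand-chosen (illustrative) weights: h1 fires for x1 OR x2,
# h2 fires for x1 AND x2; the output fires for h1 AND NOT h2.
W_hidden = np.array([[1.0, 1.0],    # weights of hidden unit h1
                     [1.0, 1.0]])   # weights of hidden unit h2
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -2.0])
b_out = -0.5

def predict(x):
    h = step(W_hidden @ x + b_hidden)   # hidden layer
    return step(w_out @ h + b_out)      # output unit

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(predict(np.array(x, dtype=float))))
# -> 0, 1, 1, 0  (XOR)
```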

1.4 Three-layer neural networks

image-20200217091302793
  • Net activation: each hidden unit computes a weighted sum of its inputs:
image-20200217091427959
  • Activation function:
image-20200217091635730
  • The network outputs are equivalent to a set of discriminant functions (the full forward pass is sketched after this list):
image-20200217091739712
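Putting the three bullets together, here is a minimal NumPy sketch of the forward pass of a three-layer network. The weight names (`W_ji`, `w_j0`, `W_kj`, `w_k0`), layer sizes, and the choice of tanh are assumptions for illustration.

```python
import numpy as np

def forward(x, W_ji, w_j0, W_kj, w_k0, f=np.tanh):
    """Forward pass of a three-layer (input -> hidden -> output) network.

    net_j = sum_i W_ji[j, i] * x[i] + w_j0[j]   (net activation of hidden unit j)
    y_j   = f(net_j)                            (hidden output)
    net_k = sum_j W_kj[k, j] * y[j] + w_k0[k]
    g_k   = f(net_k)                            (discriminant function for class k)
    """
    y = f(W_ji @ x + w_j0)
    return f(W_kj @ y + w_k0)

# Tiny example with assumed sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=3)
g = forward(x,
            rng.normal(size=(4, 3)), rng.normal(size=4),
            rng.normal(size=(2, 4)), rng.normal(size=2))
print(g)  # one value g_k per output class
```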

1.5 Representation capability of three-layer neural networks

  • It can represent any discriminant function.

  • However, the number of hidden units required can be very large.

  • Pattern recognition models such as KNN and SVM can be approximated by neural networks; they are called shallow models.

  • Deep structures: the number of neural units required decreases exponentially with the number of layers.

image-20200217093118917

2. Backpropagation

2.1 Backpropagation

  • The most common supervised method for training multilayer neural networks.

  • Given an input, change the network parameters so that the output approaches the target value.

  • However, there is no explicit teacher signal specifying what the hidden units should be.

2.2 A three-layer network as an illustration

image-20200217093632047

2.3 Training error

image-20200217093713376
  • Differentiable

  • Many other loss functions can also be used, e.g., cross-entropy (both are sketched below).

image-20200217093809299
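A minimal sketch of the two criteria mentioned above: the squared training error and cross-entropy. The vector shapes and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def squared_error(t, z):
    # J(w) = 1/2 * sum_k (t_k - z_k)^2  -- differentiable in the weights
    return 0.5 * np.sum((t - z) ** 2)

def cross_entropy(t, z, eps=1e-12):
    # Cross-entropy between one-hot targets t and predicted probabilities z
    return -np.sum(t * np.log(z + eps))

t = np.array([0.0, 1.0, 0.0])   # target
z = np.array([0.1, 0.7, 0.2])   # network output (probabilities)
print(squared_error(t, z), cross_entropy(t, z))
```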

2.4 Gradient descent

  • The weights are randomly initialized and then changed in the direction that reduces the error.
image-20200217113236793 image-20200217113300299
  • Iterative update (sketched below):
image-20200217113405450
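A minimal sketch of the iterative update w ← w − η ∂J/∂w on a toy quadratic objective. The function names, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

def gradient_descent(w, grad_J, eta=0.1, n_iters=100):
    """Iterative update w <- w + Δw, with Δw = -eta * ∂J/∂w."""
    for _ in range(n_iters):
        w = w - eta * grad_J(w)
    return w

# Toy objective J(w) = ||w - w_star||^2 with known minimum w_star
w_star = np.array([1.0, -2.0])
grad = lambda w: 2.0 * (w - w_star)
w0 = np.zeros(2)                   # (here) zero initialization
print(gradient_descent(w0, grad))  # converges toward w_star
```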

2.5 Hidden-to-output weights

image-20200217093858580 image-20200217094237765
  • Sensitivity of unit j:
image-20200217094913399
  • Determines how the overall output error changes with the unit's activation.

  • Weight update rule (sketched below):

image-20200217095044913
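A minimal sketch of the sensitivity δ_k and the resulting hidden-to-output update, assuming the squared error criterion and f = tanh; the specific values and shapes are chosen only for illustration.

```python
import numpy as np

def hidden_to_output_update(t, z, net_k, y, eta=0.1,
                            fprime=lambda u: 1.0 - np.tanh(u) ** 2):
    """Update for hidden-to-output weights w_kj (squared error, f = tanh).

    delta_k = (t_k - z_k) * f'(net_k)   (sensitivity of output unit k)
    Δw_kj   = eta * delta_k * y_j
    """
    delta = (t - z) * fprime(net_k)   # shape: (n_outputs,)
    return eta * np.outer(delta, y)   # shape: (n_outputs, n_hidden)

# Toy shapes: 2 outputs, 3 hidden units (values assumed for illustration)
t = np.array([1.0, 0.0])
net_k = np.array([0.3, -0.2])
z = np.tanh(net_k)
y = np.array([0.5, -0.1, 0.9])
print(hidden_to_output_update(t, z, net_k, y))
```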

2.6 Activation functions

The sign function is not a good choice.

image-20200217095539411

Common choices (a small sketch of each follows the list):

  • Sigmoid function
image-20200217095428909
  • Tanh function
image-20200217095442031
  • Hard tanh
image-20200217095559497 image-20200217095453927
  • Rectified linear unit
image-20200217095505876
  • Softplus: smooth version of ReLU
image-20200217095521159
  • Softmax: predicts discrete probabilities
image-20200217095717870
  • Maxout:
image-20200217095732158
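A minimal NumPy sketch of the activation functions listed above; the explicit `W`/`b` signature for Maxout is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # Smooth version of ReLU
    return np.log1p(np.exp(x))

def softmax(x):
    # Predicts a discrete probability distribution over the outputs
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

def maxout(x, W, b):
    # Maximum over several linear pieces W[i] @ x + b[i]  (hypothetical signature)
    return np.max(W @ x + b)

x = np.linspace(-2.0, 2.0, 5)
print(sigmoid(x), np.tanh(x), relu(x), softmax(x), sep="\n")
```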

3. Discussions