In this blog post, we will go through several classic CNN architectures that form the backbone of computer vision.

Source Code: https://github.com/BC-Li/deep_learning_playground

Environment

  • NVIDIA GeForce GTX 1080Ti 12GiB * 1

LeNet

First appeared in "Gradient-Based Learning Applied to Document Recognition" (LeCun et al., 1998).

Structure

../_images/lenet.svg

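Channels: anyone studying deep learning algorithms runs into the concept of channels. In the conv2d of most deep learning frameworks, such as TensorFlow and MXNet, channels is a required argument.

How should channels be understood?

An ordinary RGB image has 3 channels (red, green, blue), while a monochrome image has 1.

For the output of a convolutional layer, the number of channels equals the number of convolution kernels (filters) in that layer. To see why, consider the following example:

Suppose we have a 6×6×3 image sample and convolve it with a 3×3×3 kernel (filter). The input image has 3 channels, and the kernel's in_channels must match the channels of the data being convolved (here the image sample, so 3).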
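The relationship between in_channels, out_channels and the kernel shape can be checked directly in PyTorch. The sketch below is not from the original post: the choice of 2 output channels and the random input are assumptions made purely for illustration.

```python
import torch
from torch import nn

# A batch with one 6x6 RGB image; PyTorch uses (batch, channels, height, width).
x = torch.randn(1, 3, 6, 6)

# in_channels must match the input's channel count (3).
# out_channels is the number of kernels; 2 is an arbitrary choice for this demo.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)

# Each of the 2 kernels has shape 3x3x3 (in_channels x height x width).
print(conv.weight.shape)  # torch.Size([2, 3, 3, 3])

# With no padding and stride 1, a 6x6 input and a 3x3 kernel give a 4x4 output,
# and the output has one channel per kernel.
y = conv(x)
print(y.shape)  # torch.Size([1, 2, 4, 4])
```

Each kernel spans all input channels at once, which is why the output channel count depends only on how many kernels the layer has.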

Network structure:

import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),  # 28x28 -> 28x28
    nn.AvgPool2d(kernel_size=2, stride=2),  # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),  # 14x14 -> 10x10
    nn.AvgPool2d(kernel_size=2, stride=2),  # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)
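The spatial sizes noted in the comments of the listing can be reproduced with the usual output-size formula, floor((n + 2p - k) / s) + 1. The helper below is a small sketch; the name conv_out is hypothetical and the 28×28 input is taken from the comments in the listing.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                   # 28x28 input, as in the listing's comments
size = conv_out(size, kernel=5, padding=2)  # Conv2d(1, 6, 5, padding=2)  -> 28
size = conv_out(size, kernel=2, stride=2)   # AvgPool2d(2, stride=2)      -> 14
size = conv_out(size, kernel=5)             # Conv2d(6, 16, 5)            -> 10
size = conv_out(size, kernel=2, stride=2)   # AvgPool2d(2, stride=2)      -> 5

flattened = 16 * size * size  # 16 channels of 5x5 feature maps
print(size, flattened)  # 5 400
```

The result, 400, is the 16 * 5 * 5 input size of the first nn.Linear layer.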

Training results on GPU:

(colanora_conda_env) [colanora@colanora learning]$ python -u "/home/colanora/learning/lenet.py"
training on cuda:0
/home/colanora/learning/lenet.py:28: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  display.set_matplotlib_formats('svg')
<Figure size 350x250 with 1 Axes>
<Figure size 350x250 with 1 Axes>
<Figure size 350x250 with 1 Axes>
...
loss 0.482, train acc 0.817, test acc 0.791
48381.2 examples/sec on cuda:0

AlexNet

Structure

Left: LeNet, Right: AlexNet

../_images/alexnet.svg

alexnet = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096),
    nn.ReLU(),
    nn.Dropout(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 10),
)
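The 6400 in nn.Linear(6400, 4096) depends on the input resolution. The sketch below traces the convolutional part of the listing layer by layer. It assumes a single-channel image resized to 224×224, which the post does not state explicitly, and the name features is introduced here only for the demo.

```python
import torch
from torch import nn

# Convolutional part of the AlexNet listing (everything before nn.Flatten).
features = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# Assumption: one grayscale image resized to 224x224.
x = torch.randn(1, 1, 224, 224)
with torch.no_grad():
    for layer in features:
        x = layer(x)
        print(f"{layer.__class__.__name__:<10} {tuple(x.shape)}")

flattened = x.flatten(start_dim=1).shape[1]
print(flattened)  # 6400 = 256 channels * 5 * 5
```

With a different input size the flattened length changes, and the first nn.Linear layer would have to be adjusted to match.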

Training results on GPU:

(colanora_conda_env) [colanora@colanora learning]$ python -u "/home/colanora/learning/lenet.py"
training on cuda:0
/home/colanora/learning/lenet.py:28: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  display.set_matplotlib_formats("svg")
<Figure size 350x250 with 1 Axes>
<Figure size 350x250 with 1 Axes>
<Figure size 350x250 with 1 Axes>
...
loss 0.323, train acc 0.881, test acc 0.884
1503.4 examples/sec on cuda:0

NIN

Structure

../_images/nin.svg

Code

def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(),
    )


nin_net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2),
    nn.Dropout(0.5),
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)
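The nin_block above relies on 1×1 convolutions. One way to read them is as a fully connected layer applied independently to the channel vector at every pixel. The sketch below checks this equivalence numerically; the layer sizes (4 input channels, 3 output channels) are arbitrary choices for the demo and are not taken from the post.

```python
import torch
from torch import nn

torch.manual_seed(0)

conv = nn.Conv2d(4, 3, kernel_size=1)  # 1x1 convolution: 4 -> 3 channels
linear = nn.Linear(4, 3)               # dense layer with the same in/out sizes

# Copy the convolution's parameters into the linear layer.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(3, 4))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 4, 5, 5)  # (batch, channels, height, width)

with torch.no_grad():
    y_conv = conv(x)
    # Move channels last so the linear layer acts on each pixel's channel vector,
    # then move them back to the (batch, channels, height, width) layout.
    y_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_linear, atol=1e-5))  # True
```

This is why the last nin_block can map to 10 channels and be followed by global average pooling in place of a stack of nn.Linear layers.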

Train on GPU

(colanora_conda_env) [colanora@colanora learning]$ python -u "/home/colanora/learning/lenet.py"
training on cuda:0
/home/colanora/learning/lenet.py:28: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  display.set_matplotlib_formats("svg")
<Figure size 350x250 with 1 Axes>
<Figure size 350x250 with 1 Axes>
<Figure size 350x250 with 1 Axes>
...
loss 0.491, train acc 0.819, test acc 0.804
1374.1 examples/sec on cuda:0

Inception-Net

Structure

inception block

../_images/inception.svg

network structure

../_images/inception-full.svg

Code

# inception-net
class inception_block(nn.Module):
    def __init__(self, in_channels, c1, c2, c3, c4, **kwargs):
        super(inception_block, self).__init__(**kwargs)
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Path 4: 3x3 max pooling followed by a 1x1 convolution
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        p1 = torch.nn.functional.relu(self.p1_1(x))
        p2 = torch.nn.functional.relu(self.p2_2(torch.nn.functional.relu(self.p2_1(x))))
        p3 = torch.nn.functional.relu(self.p3_2(torch.nn.functional.relu(self.p3_1(x))))
        p4 = torch.nn.functional.relu(self.p4_2(self.p4_1(x)))
        # Concatenate the outputs along the channel dimension
        return torch.cat((p1, p2, p3, p4), dim=1)

b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)

b2 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 192, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
b3 = nn.Sequential(
    inception_block(192, 64, (96, 128), (16, 32), 32),
    inception_block(256, 128, (128, 192), (32, 96), 64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
b4 = nn.Sequential(
    inception_block(480, 192, (96, 208), (16, 48), 64),
    inception_block(512, 160, (112, 224), (24, 64), 64),
    inception_block(512, 128, (128, 256), (24, 64), 64),
    inception_block(512, 112, (144, 288), (32, 64), 64),
    inception_block(528, 256, (160, 320), (32, 128), 128),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
b5 = nn.Sequential(
    inception_block(832, 256, (160, 320), (32, 128), 128),
    inception_block(832, 384, (192, 384), (48, 128), 128),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)
inception_net = nn.Sequential(b1, b2, b3, b4, b5, nn.Linear(1024, 10))
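Because each inception block concatenates its four paths, its output channel count is c1 + c2[1] + c3[1] + c4, and that value must equal the in_channels of the next block. The plain-Python check below walks through the configuration used in b3, b4 and b5; the helper name inception_out_channels is hypothetical.

```python
def inception_out_channels(c1, c2, c3, c4):
    """Channels after concatenating the four paths of an inception block."""
    return c1 + c2[1] + c3[1] + c4


# (in_channels, c1, c2, c3, c4) for every block in b3, b4 and b5.
blocks = [
    (192, 64, (96, 128), (16, 32), 32),
    (256, 128, (128, 192), (32, 96), 64),
    (480, 192, (96, 208), (16, 48), 64),
    (512, 160, (112, 224), (24, 64), 64),
    (512, 128, (128, 256), (24, 64), 64),
    (512, 112, (144, 288), (32, 64), 64),
    (528, 256, (160, 320), (32, 128), 128),
    (832, 256, (160, 320), (32, 128), 128),
    (832, 384, (192, 384), (48, 128), 128),
]

channels = 192  # output channels of b2
for in_channels, *config in blocks:
    # Max pooling between stages does not change the channel count.
    assert in_channels == channels, f"expected {channels}, got {in_channels}"
    channels = inception_out_channels(*config)

print(channels)  # 1024, matching nn.Linear(1024, 10)
```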

Train on GPU

(colanora_conda_env) [colanora@colanora learning]$ python -u "/home/colanora/learning/lenet.py"
training on cuda:0
/home/colanora/learning/lenet.py:28: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  display.set_matplotlib_formats("svg")
loss 0.240, train acc 0.908, test acc 0.896
1669.3 examples/sec on cuda:0

ResNet

../_images/residual-block.svg

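The key idea is a cross-layer data path added around the convolutional block, which carries \(\mathbf{x}\) straight past it.

Let us focus on a local part of a neural network. As the figure above shows, suppose the original input is \(\mathbf{x}\) and the ideal mapping we want to learn is \(f(\mathbf{x})\), which is fed to the activation function at the top of the figure. The dashed box in the left panel has to fit \(f(\mathbf{x})\) directly, whereas the dashed box in the right panel only has to fit the residual mapping \(f(\mathbf{x}) - \mathbf{x}\). In practice the residual mapping is often easier to optimize. Take the identity mapping as the ideal \(f(\mathbf{x})\): we only need to set the weights and biases of the upper weighted operation (for example, an affine layer) inside the right-hand dashed box to zero, and \(f(\mathbf{x})\) becomes the identity mapping. When the ideal mapping \(f(\mathbf{x})\) is very close to the identity, the residual mapping also makes it easy to capture small deviations from the identity. The right panel shows the basic building block of ResNet, the residual block. In a residual block, the input can propagate forward faster through the cross-layer data path.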
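The post does not list its ResNet code, so the following is only a minimal sketch of a residual block, loosely following the common two-convolution design. The class name Residual and the exact layer layout are assumptions. The demo zeroes the scale of the last affine operation (the second batch norm) to illustrate how the block then reduces to the identity on non-negative inputs.

```python
import torch
from torch import nn
import torch.nn.functional as F


class Residual(nn.Module):
    """Minimal residual block: relu(x + g(x)), g = conv -> BN -> ReLU -> conv -> BN."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        g = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(x + g)  # the skip connection adds x back before the activation


blk = Residual(channels=3)

# Zero the last affine operation: bn2's scale (its bias is already zero by default),
# so the residual branch g(x) outputs exactly zero.
nn.init.zeros_(blk.bn2.weight)

x = torch.rand(2, 3, 8, 8)  # non-negative input, so the final ReLU leaves it unchanged
with torch.no_grad():
    y = blk(x)

print(y.shape)               # torch.Size([2, 3, 8, 8])
print(torch.allclose(y, x))  # True: the block acts as the identity mapping
```

A block that changes the channel count or the resolution would also need a 1×1 convolution on the skip path; that variant is left out to keep the sketch short.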

Train on GPU

Hyperparameters:

batch_size = 256
resize = 96
lr, num_epochs = 0.1, 10
(colanora_conda_env) [colanora@colanora learning]$ python -u "/home/colanora/learning/lenet.py"
training on cuda:0
/home/colanora/learning/lenet.py:28: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  display.set_matplotlib_formats("svg")
loss 0.012, train acc 0.997, test acc 0.906
2215.3 examples/sec on cuda:0

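I seem to have forgotten to write down the hyperparameters and environment for some of the runs; I'll fill them in when I have time.

Such is the humble life of a student at the start of term.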

DenseNet

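  • ResNet splits (or expands) the function being fitted into two parts: a simple linear term and a more complex nonlinear term.

    \(f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}).\)

  • DenseNet goes one step further and uses concatenation to decompose the function into an expansion:

    \(\mathbf{x} \to \left[ \mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})]), f_3([\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})])]), \ldots\right].\)

    The terms of this expansion are then combined by a multilayer perceptron. In the implementation, the feature maps are simply concatenated along the channel dimension and the result is passed to a fully connected layer.

    ../_images/densenet.svg

    A dense network mainly consists of two parts: dense blocks and transition layers. The former define how inputs and outputs are concatenated, while the latter control the number of channels so that the model does not become too complex.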
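The difference between the two formulas, addition in ResNet versus concatenation in DenseNet, shows up directly in the tensor shapes. The toy layers below (g, f1, f2 and their channel counts) are assumptions chosen only to keep the example small.

```python
import torch
from torch import nn

x = torch.randn(1, 8, 16, 16)

# ResNet-style: g must keep the channel count so that x + g(x) is defined.
g = nn.Conv2d(8, 8, kernel_size=3, padding=1)
res_out = x + g(x)

# DenseNet-style: each f_i sees everything produced so far and appends its output.
f1 = nn.Conv2d(8, 4, kernel_size=3, padding=1)
f2 = nn.Conv2d(8 + 4, 4, kernel_size=3, padding=1)
h1 = torch.cat((x, f1(x)), dim=1)    # [x, f1(x)]
h2 = torch.cat((h1, f2(h1)), dim=1)  # [x, f1(x), f2([x, f1(x)])]

print(res_out.shape)  # torch.Size([1, 8, 16, 16])
print(h2.shape)       # torch.Size([1, 16, 16, 16])
```

Addition leaves the channel count at 8, while concatenation grows it by 4 with every added function, which is the growth that transition layers later compensate for.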

Code

# DenseNet
def conv_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels), nn.ReLU(), nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1)
    )


class DenseBlock(nn.Module):
    def __init__(self, num_convs, input_channels, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels * i + input_channels, num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate each block's input and output along the channel dimension
            X = torch.cat((X, Y), dim=1)
        return X


blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
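print(Y.shape)  # torch.Size([4, 23, 8, 8]): 3 input channels + 2 convs * 10 channels


# Every dense block increases the number of channels, so stacking too many makes
# the model overly complex. A transition layer keeps this in check: a 1x1
# convolution reduces the channel count, and an average pooling layer with
# stride 2 halves the height and width.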
def transition_block(input_channels, num_channels):
    return nn.Sequential(
        nn.BatchNorm2d(input_channels),
        nn.ReLU(),
        nn.Conv2d(input_channels, num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )


blk = transition_block(23, 10)
print(blk(Y).shape)  # torch.Size([4, 10, 4, 4])

# Stem: the same as in ResNet
b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blks = []

for i, num_convs in enumerate(num_convs_in_dense_blocks):
    blks.append(DenseBlock(num_convs, num_channels, growth_rate))
    num_channels += num_convs * growth_rate
    if i != len(num_convs_in_dense_blocks) - 1:
        blks.append(transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2


densenet = nn.Sequential(
    b1,
    *blks,
    nn.BatchNorm2d(num_channels),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(num_channels, 10),
)
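The channel count that reaches the final nn.Linear layer can be traced without building the network. The loop below repeats only the bookkeeping from the listing: each dense block adds num_convs * growth_rate channels and each transition layer halves the total.

```python
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]

for i, num_convs in enumerate(num_convs_in_dense_blocks):
    # A dense block appends growth_rate channels per convolution.
    num_channels += num_convs * growth_rate
    print(f"after dense block {i + 1}: {num_channels}")
    if i != len(num_convs_in_dense_blocks) - 1:
        # A transition layer halves the channel count.
        num_channels //= 2
        print(f"after transition {i + 1}: {num_channels}")

print(num_channels)  # 248
```

The sequence is 192, 96, 224, 112, 240, 120, 248, so the classifier at the end of the network is effectively nn.Linear(248, 10).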

Train on GPU

Hyperparameters:

batch_size = 256
resize = 96
lr, num_epochs = 0.1, 10
(colanora_conda_env) [colanora@colanora learning]$ python -u "/home/colanora/learning/lenet.py"
training on cuda:0
/home/colanora/learning/lenet.py:28: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
  display.set_matplotlib_formats("svg")
loss 0.147, train acc 0.947, test acc 0.910
2561.6 examples/sec on cuda:0

APPENDIX

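1. Backbone: literally the main trunk of a network. As the name suggests, it is one part of the network, but which part? In most cases it refers to the feature-extraction network, whose job is to extract information from the image for the later parts of the network to use. Common choices are ResNet, VGG and similar networks rather than custom designs, because they have already proven to be strong feature extractors on tasks such as classification. When one of them is used as a backbone, the officially released pretrained weights are usually loaded directly and our own network is attached after it. Both parts are then trained together: since the loaded backbone can already extract features, training only fine-tunes it so that it better fits our own task.

2. Neck: sits between the backbone and the head, and exists to make better use of the features extracted by the backbone.

3. Bottleneck: usually describes a part of the network whose output dimension is much smaller than its input dimension, so the data path narrows like the neck of a bottle. A commonly seen setting, bottle_num=256, means that the output has 256 dimensions even though the input may have 1024.

4. Head: the part of the network that produces the final output. It takes the features extracted earlier and uses them to make predictions.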
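The backbone and head split can be sketched in a few lines of PyTorch. Everything here is illustrative: a tiny randomly initialised network stands in for a pretrained ResNet or VGG backbone, and giving the backbone a smaller learning rate than the head is a common fine-tuning habit rather than something the post prescribes.

```python
import torch
from torch import nn

# Toy stand-in for a pretrained backbone such as ResNet or VGG (assumption).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)

# Task-specific head that turns the 32 extracted features into 10 class scores.
head = nn.Linear(32, 10)

model = nn.Sequential(backbone, head)

# Train both parts together, but fine-tune the backbone with a smaller
# learning rate than the freshly initialised head (values are illustrative).
optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters()},           # uses the default lr below
        {"params": head.parameters(), "lr": 1e-2},   # overrides the default lr
    ],
    lr=1e-3,
    momentum=0.9,
)

x = torch.randn(4, 3, 32, 32)
logits = model(x)
print(logits.shape)  # torch.Size([4, 10])
```

In a real setup the backbone would be created with its pretrained weights loaded before the optimizer is built; the rest of the structure stays the same.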
