MXNet (35): Semantic Segmentation with a Fully Convolutional Network (FCN)

Tech, 2022-07-16

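1. Transposed convolution

A transposed convolution layer is used to increase the width and height of its input.

Consider a basic case: one input channel, one output channel, padding 0, and stride 1. With a $2 \times 2$ kernel and a $2 \times 2$ input matrix, the transposed convolution produces a $3 \times 3$ output: each input element scales the kernel, and the scaled copies are added into the output at the offset given by that element's position.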

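The procedure above translates into the following code, where K is the kernel and X is the input:

    # Imports assumed by the code in this post
    from mxnet import autograd, gluon, image, init, np, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l
    import plotly.express as px
    import plotly.graph_objects as go

    npx.set_np()

    def trans_conv(X, K):
        h, w = K.shape
        Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                # Add the kernel, scaled by X[i, j], at offset (i, j)
                Y[i: i + h, j: j + w] += X[i, j] * K
        return Y

    X = np.array([[0, 1], [2, 3]])
    K = np.array([[0, 1], [2, 3]])
    trans_conv(X, K)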

Gluon's nn.Conv2DTranspose gives the same result. As with nn.Conv2D, both the input and the kernel must be 4-D tensors.

    X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
    tconv = nn.Conv2DTranspose(1, kernel_size=2)
    tconv.initialize(init.Constant(K))
    tconv(X)

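1.1 Padding, strides, and channels

In a convolution, padding elements are applied to the input; in a transposed convolution they are applied to the output. A $1 \times 1$ padding means that we first compute the output as usual and then remove the first and last rows and columns.

    tconv = nn.Conv2DTranspose(1, kernel_size=2, padding=1)
    tconv.initialize(init.Constant(K))
    tconv(X)  # array([[[[4.]]]])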

Strides likewise apply to the output rather than the input.

    tconv = nn.Conv2DTranspose(1, kernel_size=2, strides=2)
    tconv.initialize(init.Constant(K))
    tconv(X)
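How padding and strides act on the output can be made explicit by extending the loop-based version. The following is a minimal sketch in plain NumPy (not MXNet), for a single channel only; the name `trans_conv2d` and its arguments are chosen here for illustration and are not part of Gluon.

```python
import numpy as np

def trans_conv2d(X, K, stride=1, padding=0):
    """Single-channel transposed convolution (illustrative NumPy sketch)."""
    h, w = K.shape
    # Size of the full output before any padding is removed.
    H = (X.shape[0] - 1) * stride + h
    W = (X.shape[1] - 1) * stride + w
    Y = np.zeros((H, W))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Each input element pastes a scaled kernel, shifted by the stride.
            Y[i * stride:i * stride + h, j * stride:j * stride + w] += X[i, j] * K
    # Padding acts on the output: drop `padding` rows/columns from every side.
    return Y[padding:H - padding, padding:W - padding]

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
print(trans_conv2d(X, K, padding=1))  # only the centre of the 3x3 result remains
print(trans_conv2d(X, K, stride=2))   # 4x4 result; the pasted kernels do not overlap
```

With stride 2 and a $2 \times 2$ kernel the pasted copies sit side by side, so each $2 \times 2$ block of the output is simply the kernel scaled by one input element.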

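A transposed convolution can also restore the number of channels, here reducing it from 20 back to 10. With the same kernel size, padding, and strides, the transposed convolution below changes the shape in exactly the opposite way to the convolution that precedes it.

    X = np.random.uniform(size=(1, 10, 16, 16))
    conv = nn.Conv2D(20, kernel_size=5, padding=2, strides=3)
    tconv = nn.Conv2DTranspose(10, kernel_size=5, padding=2, strides=3)
    conv.initialize()
    tconv.initialize()
    tconv(conv(X)).shape == X.shape  # True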
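The shape round trip can be checked with the standard output-size formulas along one axis. This is a small sketch assuming no dilation and no output padding; `conv_out` and `tconv_out` are helper names introduced here, not library functions.

```python
def conv_out(n, k, p, s):
    """Output length of a convolution along one axis."""
    return (n + 2 * p - k) // s + 1

def tconv_out(n, k, p, s):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * s - 2 * p + k

n = 16
m = conv_out(n, k=5, p=2, s=3)
print(m, tconv_out(m, k=5, p=2, s=3))  # 6 16
```

Because of the floor division in `conv_out`, the round trip is exact only when `n + 2 * p - k` is divisible by the stride, as it is here (15 is divisible by 3).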

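2. Fully convolutional network (FCN)

A fully convolutional network uses a convolutional neural network to map image pixels to pixel classes. Unlike the convolutional neural networks introduced earlier, an FCN uses a transposed convolution layer to bring the height and width of an intermediate feature map back to the size of the input image, so that the predictions correspond one-to-one with the input image in the spatial dimensions (height and width). At a given spatial position, the output along the channel dimension is the class prediction for the pixel at that position.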

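2.1 Creating the model

A fully convolutional network first uses a convolutional neural network to extract image features, then uses a $1 \times 1$ convolution layer to convert the number of channels to the number of classes, and finally uses a transposed convolution layer to bring the height and width of the feature map back to the size of the input image. The model output has the same height and width as the input image and corresponds to it one-to-one in spatial position. The final output channels hold the class predictions for the pixel at each spatial position.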

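Below we fine-tune a ResNet-18 model pretrained on ImageNet. The last two layers of the model's features member are the global average pooling layer GlobalAvgPool2D and the flattening layer Flatten. The output module contains the fully connected layer used for output. A fully convolutional network needs none of these layers.

    pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
    pretrained_net.features[-4:], pretrained_net.output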

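Next, create the fully convolutional network instance net. It reuses every layer of the features member of pretrained_net except the last two, together with their pretrained parameters.

    net = nn.HybridSequential()
    for layer in pretrained_net.features[:-2]:
        net.add(layer)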

Given an input with height 320 and width 480, the forward computation reduces the height and width to 1/32 of the original, that is, to 10 and 15.

    X = np.random.uniform(size=(1, 3, 320, 480))
    net(X).shape  # (1, 512, 10, 15)

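Next, a $1 \times 1$ convolution layer maps the number of channels to the number of classes in the dataset, which is 21 for Pascal VOC2012. A transposed convolution layer then enlarges the height and width by a factor of 32: set the stride to 32, the padding to $32 / 2 = 16$, and the kernel size to $64 \times 64$.

    num_classes = 21
    net.add(
        nn.Conv2D(num_classes, kernel_size=1),
        nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16, strides=32)
    )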
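Why these three settings give exactly a 32-fold enlargement follows from the usual output-size formula for a transposed convolution, stated here under the assumption of no dilation and no output padding. For input size $n$, kernel size $k$, padding $p$, and stride $s$:

$$
n_{\text{out}} = (n - 1)\,s - 2p + k
$$

Substituting $k = 2s$ and $p = s/2$:

$$
n_{\text{out}} = (n - 1)\,s - s + 2s = n\,s
$$

With $s = 32$, the feature-map sizes 10 and 15 become $10 \times 32 = 320$ and $15 \times 32 = 480$, the size of the input image.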

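2.2 Initializing the transposed convolution layer

We already know that a transposed convolution layer can enlarge a feature map. In image processing we sometimes need to enlarge an image, that is, to upsample it. There are many upsampling methods; a common one is bilinear interpolation. Put simply, to obtain the pixel of the output image at coordinate $(x, y)$, we first map that coordinate to the coordinate $(x^{'}, y^{'})$ in the input image. We then find the four pixels of the input image closest to $(x^{'}, y^{'})$ and compute the output pixel at $(x, y)$ from those four pixels, weighted by their relative distances to $(x^{'}, y^{'})$. The function below builds a kernel with which a transposed convolution layer performs bilinear-interpolation upsampling.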

    def bilinear_kernel(in_channels, out_channels, kernel_size):
        factor = (kernel_size + 1) // 2
        if kernel_size % 2 == 1:
            center = factor - 1
        else:
            center = factor - 0.5
        og = (np.arange(kernel_size).reshape(-1, 1),
              np.arange(kernel_size).reshape(1, -1))
        # Separable triangular filter: weights fall off linearly from the centre
        filt = (1 - np.abs(og[0] - center) / factor) * \
               (1 - np.abs(og[1] - center) / factor)
        weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
        weight[range(in_channels), range(out_channels), :, :] = filt
        return np.array(weight)
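The interpolation rule described in prose above can also be written out directly for a single point. The following is a minimal NumPy sketch, separate from the kernel-based approach of this post; the function name `bilinear_sample` and the border handling (clamping to the last valid cell) are assumptions made for illustration, and the input is assumed to be at least $2 \times 2$.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Interpolate a 2-D array at fractional position (x, y); x is the column, y the row."""
    h, w = img.shape
    # Top-left corner of the cell containing (x, y), clamped to stay inside the image.
    x0 = int(np.clip(np.floor(x), 0, w - 2))
    y0 = int(np.clip(np.floor(y), 0, h - 2))
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them.
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x0 + 1]
    bottom = (1 - dx) * img[y0 + 1, x0] + dx * img[y0 + 1, x0 + 1]
    return (1 - dy) * top + dy * bottom

img = np.array([[0.0, 10.0], [20.0, 30.0]])
print(bilinear_sample(img, 0.5, 0.5))  # 15.0, the mean of the four neighbours
```

At the cell centre all four neighbours get weight 1/4; moving towards one pixel shifts the weight linearly onto it.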

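Now we experiment with bilinear-interpolation upsampling implemented by a transposed convolution layer. Construct a transposed convolution layer that doubles the height and width of its input, and initialize its kernel with the bilinear_kernel function.

    conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
    conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))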
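To see why a layer with this kernel interpolates, the same construction can be tried in one dimension with plain NumPy. The sketch below is illustrative only: `bilinear_profile` and `upsample_1d` are hypothetical helpers, and the kernel size 4, stride 2, and padding 1 are taken from the layer above.

```python
import numpy as np

def bilinear_profile(kernel_size):
    """1-D weights using the same centre/factor rule as bilinear_kernel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    return 1 - np.abs(np.arange(kernel_size) - center) / factor

def upsample_1d(x, w, stride, padding):
    """1-D transposed convolution: paste scaled copies of w, then crop the ends."""
    full = np.zeros((len(x) - 1) * stride + len(w))
    for i, v in enumerate(x):
        full[i * stride:i * stride + len(w)] += v * w
    return full[padding:len(full) - padding]

w = bilinear_profile(4)
y = upsample_1d(np.array([0.0, 2.0, 4.0]), w, stride=2, padding=1)
print(w.tolist())  # [0.25, 0.75, 0.75, 0.25]
print(y.tolist())  # [0.0, 0.5, 1.5, 2.5, 3.5, 3.0]
```

The three samples become six, and the interior values 0.5, 1.5, 2.5, 3.5 lie on the straight line through the inputs. The last value drops to 3.0 because the neighbour beyond the border is missing and contributes nothing.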

Read the image X and record the upsampled result as Y. To display the image, we need to move the channel dimension to the last axis.

    img = image.imread('img/catdog.jpg')
    X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255
    Y = conv_trans(X)
    out_img = Y[0].transpose(1, 2, 0)
    print('input image shape:', img.shape)
    print('output image shape:', out_img.shape)
    px.imshow(out_img.asnumpy(), width=img.shape[1]/2, height=img.shape[0]/2)

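Initialize the transposed convolution layer with the bilinear kernel and the $1 \times 1$ convolution layer with Xavier initialization:

    W = bilinear_kernel(num_classes, num_classes, 64)
    net[-1].initialize(init.Constant(W))
    net[-2].initialize(init=init.Xavier())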

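3. Training

The loss function and accuracy computation here do not differ substantially from those used in image classification. Because we use the channels of the transposed convolution layer to predict pixel classes, we specify axis=1 (the channel dimension) in SoftmaxCrossEntropyLoss. In addition, the model computes accuracy from whether the predicted class of each pixel is correct.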
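What "softmax over the channel axis" and "per-pixel accuracy" mean can be spelled out on a tiny array. This is a plain NumPy sketch, not Gluon's implementation: the function names are made up for illustration, and averaging the loss over all pixels is an assumption made here for simplicity.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits are (N, C, H, W), labels are (N, H, W)."""
    z = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    picked = np.take_along_axis(log_prob, labels[:, None, :, :], axis=1)
    return float(-picked.mean())

def pixel_accuracy(logits, labels):
    """Fraction of pixels whose arg-max class over the channel axis matches the label."""
    return float((logits.argmax(axis=1) == labels).mean())

labels = np.array([[[0, 1], [2, 0]]])  # one 2x2 image, three classes
logits = np.zeros((1, 3, 2, 2))
logits[0, 0, 0, 0] = 5.0  # pixel (0, 0): predicts class 0, correct
logits[0, 1, 0, 1] = 5.0  # pixel (0, 1): predicts class 1, correct
logits[0, 1, 1, 0] = 5.0  # pixel (1, 0): predicts class 1, but the label is 2
logits[0, 0, 1, 1] = 5.0  # pixel (1, 1): predicts class 0, correct
print(pixel_accuracy(logits, labels))  # 0.75
print(pixel_cross_entropy(logits, labels))
```

The class dimension is axis 1, so both the softmax normalisation and the arg-max run over that axis while the spatial axes are left untouched.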

    def accuracy(y_hat, y):
        # Number of correctly predicted pixels
        if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
            y_hat = y_hat.argmax(axis=1)
        cmp = y_hat.astype(y.dtype) == y
        return float(cmp.sum())

    def train_batch(net, features, labels, loss, trainer, devices,
                    split_f=d2l.split_batch):
        X_shards, y_shards = split_f(features, labels, devices)
        with autograd.record():
            pred_shards = [net(X_shard) for X_shard in X_shards]
            ls = [loss(pred_shard, y_shard)
                  for pred_shard, y_shard in zip(pred_shards, y_shards)]
        for l in ls:
            l.backward()
        # ignore_stale_grad=True lets the step proceed with stale gradients
        trainer.step(labels.shape[0], ignore_stale_grad=True)
        train_loss_sum = sum([float(l.sum()) for l in ls])
        train_acc_sum = sum(accuracy(pred_shard, y_shard)
                            for pred_shard, y_shard in zip(pred_shards, y_shards))
        return train_loss_sum, train_acc_sum

    def train(net, train_iter, test_iter, loss, trainer, num_epochs,
              devices=d2l.try_all_gpus(), split_f=d2l.split_batch):
        num_batches, timer = len(train_iter), d2l.Timer()
        # Record metrics about five times per epoch; guard against a zero interval
        interval = max(1, num_batches // 5)
        epochs_lst, loss_lst, train_acc_lst, test_acc_lst = [], [], [], []
        for epoch in range(num_epochs):
            # Sums of: loss, correct pixels, examples, pixels
            metric = d2l.Accumulator(4)
            for i, (features, labels) in enumerate(train_iter):
                timer.start()
                l, acc = train_batch(
                    net, features, labels, loss, trainer, devices, split_f)
                metric.add(l, acc, labels.shape[0], labels.size)
                timer.stop()
                if (i + 1) % interval == 0:
                    epochs_lst.append(epoch + i / num_batches)
                    loss_lst.append(metric[0] / metric[2])
                    train_acc_lst.append(metric[1] / metric[3])
                    test_acc_lst.append(
                        d2l.evaluate_accuracy_gpus(net, test_iter, split_f))
            print(f"[epoch {epoch+1}] train loss: {metric[0] / metric[2]:.3f} "
                  f"train acc: {metric[1] / metric[3]:.3f} "
                  f"test acc: {test_acc_lst[-1]:.3f}")
        print(f'loss {metric[0] / metric[2]:.3f}, train acc '
              f'{metric[1] / metric[3]:.3f}, test acc {test_acc_lst[-1]:.3f}')
        print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
              f'{str(devices)}')
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=epochs_lst, y=loss_lst, name='train loss'))
        fig.add_trace(go.Scatter(x=epochs_lst, y=train_acc_lst, name='train acc'))
        # Test accuracy is recorded at the same points as the training metrics
        fig.add_trace(go.Scatter(x=epochs_lst, y=test_acc_lst, name='test acc'))
        fig.update_layout(width=800, height=480, xaxis_title='epoch',
                          yaxis_range=[0, 1])
        fig.show()

Load the data. This is fairly memory-hungry, so we use a batch size of 16:

    batch_size = 16
    train_iter, test_iter = load_data_voc(batch_size, crop_size)

The images are fairly large and are all loaded into memory, so if you run out of memory, consider reducing the amount of data.

    num_epochs, lr, wd, devices = 5, 0.1, 1e-3, [npx.gpu()]
    loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
    net.collect_params().reset_ctx(devices)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': lr, 'wd': wd})
    train(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

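4. Prediction

During prediction, we need to normalize the input image in each channel and convert it to the four-dimensional input format that the convolutional neural network expects.

    def predict(img):
        X = test_iter._dataset.normalize_image(img)
        X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
        pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
        return pred.reshape(pred.shape[1], pred.shape[2])

    def label2image(pred):
        # Map each predicted class index to its colour in the dataset's colormap
        colormap = VOC_COLORMAP.as_in_ctx(devices[0])
        X = pred.astype('int32')
        return colormap[X, :]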
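The indexing step in label2image, where an array of class indices selects rows of a colour table, can be illustrated with plain NumPy. The three-colour palette below is a made-up example for illustration, not the actual VOC_COLORMAP.

```python
import numpy as np

# Hypothetical palette: one RGB row per class index.
palette = np.array([[0, 0, 0], [128, 0, 0], [0, 128, 0]], dtype=np.uint8)

# A 2x2 map of predicted class indices.
pred = np.array([[0, 1], [2, 1]])

# Integer-array indexing looks up one palette row per pixel.
rgb = palette[pred]
print(rgb.shape)  # (2, 2, 3)
```

The result has the spatial shape of the prediction with a trailing colour axis, which is the layout an image display function expects.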

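Fetch the test data and run prediction. Because the model uses a transposed convolution layer with stride 32, when the height or width of the input image is not divisible by 32, the height or width of that layer's output deviates from the size of the input image. To solve this, we can crop several rectangular regions from the image whose heights and widths are integer multiples of 32 and run the forward computation on the pixels in those regions. Taken together, these regions must completely cover the input image. When a pixel is covered by several regions, the average of the transposed convolution layer's outputs from the forward computations on the different regions can be used as the input to the softmax operation to predict the class.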
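One possible way to choose such covering regions is sketched below in plain NumPy. The helper names and the tiling strategy (fixed-size windows, with a final window shifted back to the image edge) are assumptions made for illustration; the post itself does not implement this step.

```python
import numpy as np

def window_starts(length, window):
    """Start offsets of windows of size `window` that together cover [0, length)."""
    if length < window:
        raise ValueError("image side is smaller than the window")
    starts = list(range(0, length - window + 1, window))
    if starts[-1] + window < length:
        starts.append(length - window)  # the final window overlaps the previous one
    return starts

def coverage(height, width, win_h, win_w):
    """Count how many windows cover each pixel."""
    count = np.zeros((height, width), dtype=int)
    for top in window_starts(height, win_h):
        for left in window_starts(width, win_w):
            count[top:top + win_h, left:left + win_w] += 1
    return count

# A 375 x 500 image tiled by 320 x 480 windows (both sides are multiples of 32).
count = coverage(375, 500, 320, 480)
print(count.min(), count.max())  # 1 4
```

Every pixel is covered at least once. Summing the per-window network outputs into an array of the same spatial size and dividing by `count` gives the average that is then passed to softmax.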

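    test_images, test_labels = d2l.read_voc_images(voc_dir, False)
    n, imgs = 4, []
    for i in range(n):
        # For simplicity, crop one 320 x 480 (height x width) region
        # from the top-left corner of each image
        crop_rect = (0, 0, 480, 320)
        X = image.fixed_crop(test_images[i], *crop_rect)
        pred = label2image(predict(X))
        imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
    Image(show_imgs(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=1.5))

The first row shows the original images, the second row the predictions, and the third row the labels.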

5. References

https://d2l.ai/chapter_computer-vision/fcn.html

6. Code

GitHub
