Analyzing How a Convolutional Neural Network Recognizes Images, and Using Augmentation to Address Overfitting in CNN Image Recognition: Kaggle Dogs vs. Cats as an Example

Technology, 2026-01-29

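This article is based on Weeks 1-2 of the Coursera course Convolutional Neural Networks in TensorFlow, offered by deeplearning.ai and taught by Laurence Moroney. Case: recognizing cat and dog images automatically, Dogs vs. Cats (the competition has ended).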

Code on GitHub: basic model, Augmentation

Code on Google Colab: basic model, Augmentation

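Google Colab is similar to an online IPython Notebook, so no separate data preparation is needed, though access from mainland China probably requires a VPN. The GitHub version provides the data paths and the complete code. As mentioned in the previous article, I recently chose the DeepLearning.AI TensorFlow Developer Professional Certificate as my next learning project after Machine Learning. TensorFlow focuses on Deep Learning, so it differs somewhat from the scikit-learn tools I used before, and it shows up more and more often in the current AI wave. So I want to share some of the new, approachable projects I run into while studying, to help interested readers understand and use these tools quickly, and also to help a future me with amnesia, hhh.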

Basic model

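Unlike earlier Kaggle projects such as Titanic or House Prices, the data here comes as images, so it is labeled in a different way. Keras labels training data by storing each category in its own directory. For Dogs vs. Cats this needs two levels of directories. The first level splits all data into training and testing: the former is used to train the model, the latter for validation on held-out images. The second level sorts the images inside training and testing into cats and dogs, which is what labels them. During preprocessing, if the source data is only divided into cats and dogs, it has to be randomly distributed into the training set and the testing set at a chosen ratio. The code below does exactly this (it comes from the course assignment):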

    import os

    # Assumes the source data is two folders, one holding the cat images
    # and one holding the dog images
    # create directories
    for path in [
        '/tmp/cats-v-dogs/training/cats',
        '/tmp/cats-v-dogs/training/dogs',
        '/tmp/cats-v-dogs/testing/cats',
        '/tmp/cats-v-dogs/testing/dogs',
    ]:
        # makedirs also creates the parent directories; exist_ok=True skips
        # directories that already exist instead of aborting the whole step
        os.makedirs(path, exist_ok=True)

Zero-length files have to be filtered out first, so that they do not cause trouble later when the model processes the data:

    import os
    import random
    from shutil import copyfile

    # First drop invalid (zero-length) files, then randomly split the source
    # data into training and testing sets according to SPLIT_SIZE
    def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE):
        # SOURCE is the directory holding the original files
        # TRAINING and TESTING are the directories that receive the split files
        files = []
        for filename in os.listdir(SOURCE):
            file = SOURCE + filename
            if os.path.getsize(file) > 0:
                files.append(filename)
            else:
                print(filename + " is zero length, so ignoring it!")

        training_length = int(len(files) * SPLIT_SIZE)
        shuffled_set = random.sample(files, len(files))
        training_set = shuffled_set[:training_length]
        # Slicing from training_length stays correct even if the testing set is empty
        testing_set = shuffled_set[training_length:]

        for filename in training_set:
            copyfile(SOURCE + filename, TRAINING + filename)
        for filename in testing_set:
            copyfile(SOURCE + filename, TESTING + filename)

    # fill the directories
    CAT_SOURCE_DIR = "/tmp/PetImages/Cat/"
    TRAINING_CATS_DIR = "/tmp/cats-v-dogs/training/cats/"
    TESTING_CATS_DIR = "/tmp/cats-v-dogs/testing/cats/"
    DOG_SOURCE_DIR = "/tmp/PetImages/Dog/"
    TRAINING_DOGS_DIR = "/tmp/cats-v-dogs/training/dogs/"
    TESTING_DOGS_DIR = "/tmp/cats-v-dogs/testing/dogs/"

    split_size = .9
    split_data(CAT_SOURCE_DIR, TRAINING_CATS_DIR, TESTING_CATS_DIR, split_size)
    split_data(DOG_SOURCE_DIR, TRAINING_DOGS_DIR, TESTING_DOGS_DIR, split_size)
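The shuffle-then-cut logic can be tried without touching the file system. The following is only a minimal sketch: `split_names` and the file names are hypothetical and not part of the course code, and the optional `seed` is an added assumption to make a run repeatable.

```python
import random


def split_names(names, split_size, seed=None):
    """Shuffle file names and cut them into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = rng.sample(names, len(names))  # shuffled copy, input untouched
    cut = int(len(shuffled) * split_size)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, used only for illustration
names = [f"{i}.jpg" for i in range(10)]
training, testing = split_names(names, 0.9, seed=0)
print(len(training), len(testing))
```

Every name ends up in exactly one of the two lists, which is the property the directory-based split relies on.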

Note that the data used by the code linked at the top of this article has already been split this way, so the steps above can be skipped when training on it.

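The basic model is a CNN (Convolutional Neural Network). The idea is that matrix operations (convolutions) highlight and select the features relevant to the model's goal, and feeding those features into the final Neurons improves the model. A more visual walk-through of this process follows later.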

Pooling in a CNN simply reduces the number of pixels passed on to the next layer, that is, it reduces dimensionality.
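To make the two operations concrete, here is a small pure-Python sketch of a convolution without padding followed by max pooling. The function names, the 6x6 image and the vertical-edge filter are all made up for illustration; Keras learns its filters during training and also applies the ReLU activation, which is left out here.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over the image with stride 1 and no padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(image, n):
    """Keep the maximum of each non-overlapping n x n block."""
    return [
        [
            max(image[r + i][c + j] for i in range(n) for j in range(n))
            for c in range(0, len(image[0]) - n + 1, n)
        ]
        for r in range(0, len(image) - n + 1, n)
    ]


# Illustrative 6x6 image: dark left half (0), bright right half (9)
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
# Hand-written vertical-edge filter (an assumption, not a learned filter)
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

features = conv2d_valid(image, kernel)  # 4x4: large values where the edge is
pooled = max_pool(features, 2)          # 2x2: same information, fewer pixels
print(features[0], pooled)
```

The convolution output is large only around the dark-to-bright boundary, which is the sense in which a filter highlights a feature; pooling then keeps that response while shrinking the map.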

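First, matplotlib can be used to look at a few input files. In the course, Laurence stresses how important it is to understand the image characteristics of the training and testing sets. If the images in the training set are too uniform (uniform here refers to the target, for example the dog: its position, angle, size, orientation, pose and so on), the model will overfit. And if the testing set is just as uniform, in the same way as the training set, validation will look unusually perfect while the model remains unable to judge new images correctly. So when building the training and testing sets, both should contain source data that is as varied as possible.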

    %matplotlib inline

    import os

    import matplotlib.image as mpimg
    import matplotlib.pyplot as plt

    # train_cats_dir / train_dogs_dir (directory paths) and
    # train_cat_fnames / train_dog_fnames (file name lists) are assumed
    # to be defined earlier in the notebook

    # Set up a 4x4 grid of images
    nrows = 4
    ncols = 4

    pic_index = 0  # Index for iterating over images

    fig = plt.gcf()
    fig.set_size_inches(ncols*4, nrows*4)

    pic_index += 8

    next_cat_pix = [os.path.join(train_cats_dir, fname)
                    for fname in train_cat_fnames[pic_index-8:pic_index]]
    next_dog_pix = [os.path.join(train_dogs_dir, fname)
                    for fname in train_dog_fnames[pic_index-8:pic_index]]

    for i, img_path in enumerate(next_cat_pix + next_dog_pix):
        # Set up subplot; subplot indices start at 1
        sp = plt.subplot(nrows, ncols, i + 1)
        sp.axis('Off')  # Don't show axes (or gridlines)

        img = mpimg.imread(img_path)
        plt.imshow(img)

    plt.show()

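This displays 8 cat images and 8 dog images from the training set (the code takes the first 8 file names of each list), and the images clearly differ from one another.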

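Next we build the CNN model. Apart from the Convolutional and Pooling layers, it is a three-layer Neural Network. Flatten() first lines up all values of the last pooling output in a single row as the first layer, the second layer (the hidden layer) has 512 nodes, and the third layer (the output layer) needs only one node because there are only two possible results.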

    import tensorflow as tf

    model = tf.keras.models.Sequential([
        # Three Convolutional + Pooling stages
        # The input files are 150x150 color images, so shape=(150, 150, 3)
        tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(150, 150, 3)),
        tf.keras.layers.MaxPooling2D(2,2),
        tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2,2),
        tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2,2),
        # Flatten the results to feed into a DNN
        tf.keras.layers.Flatten(),
        # 512 neuron hidden layer
        tf.keras.layers.Dense(512, activation='relu'),
        # The output is binary (only 0 or 1), so a single output neuron is enough
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

The output size of each layer can be inspected with model.summary():

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    conv2d (Conv2D)              (None, 148, 148, 16)      448
    _________________________________________________________________
    max_pooling2d (MaxPooling2D) (None, 74, 74, 16)        0
    _________________________________________________________________
    conv2d_1 (Conv2D)            (None, 72, 72, 32)        4640
    _________________________________________________________________
    max_pooling2d_1 (MaxPooling2 (None, 36, 36, 32)        0
    _________________________________________________________________
    conv2d_2 (Conv2D)            (None, 34, 34, 64)        18496
    _________________________________________________________________
    max_pooling2d_2 (MaxPooling2 (None, 17, 17, 64)        0
    _________________________________________________________________
    flatten (Flatten)            (None, 18496)             0
    _________________________________________________________________
    dense (Dense)                (None, 512)               9470464
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 513
    =================================================================
    Total params: 9,494,561
    Trainable params: 9,494,561
    Non-trainable params: 0
    _________________________________________________________________

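Each n x n convolution (no padding, stride 1) shortens each side of the image by n-1 pixels, and each n x n Pooling shrinks each side by a factor of n, rounding down.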
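The sizes and parameter counts in the summary can be reproduced by hand from these two rules. The sketch below assumes the layer settings of the model above (3x3 filters, 2x2 pooling, no padding, one bias per filter or neuron); the helper names are made up for this example.

```python
def conv_output(size, channels_in, filters, k):
    """Side length and parameter count after a k x k convolution (no padding)."""
    params = (k * k * channels_in + 1) * filters  # +1 is the bias of each filter
    return size - (k - 1), params


def pool_output(size, n):
    """Side length after n x n pooling (rounded down)."""
    return size // n


size, channels = 150, 3  # 150x150 color input
total_params = 0
for filters in (16, 32, 64):
    size, params = conv_output(size, channels, filters, 3)
    total_params += params
    size = pool_output(size, 2)
    channels = filters

flat = size * size * channels          # length of the Flatten() output
total_params += (flat + 1) * 512       # Dense(512): weights plus biases
total_params += (512 + 1) * 1          # Dense(1)
print(size, flat, total_params)
```

The dense layer after Flatten() accounts for almost all of the parameters, which is one reason the pooling stages matter: they shrink the flattened vector before it reaches that layer.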

Next, define the model's loss function and optimizer:

    from tensorflow.keras.optimizers import RMSprop

    # The learning rate can be tuned via learning_rate
    # (older TensorFlow versions spelled this argument lr)
    model.compile(optimizer=RMSprop(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
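For intuition about what 'binary_crossentropy' measures, the loss can be written out by hand for a single sigmoid output. This is a sketch of the textbook formula, not the Keras implementation; the clipping value `eps` and the sample predictions are assumptions chosen for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Illustrative predictions for two images with labels 1 and 0
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

A confident wrong prediction is penalized far more heavily than a confident right one is rewarded, which is what pushes the single output neuron toward 0 for one class and 1 for the other.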

Generate the data and feed it in:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # train_dir and validation_dir are assumed to be defined earlier
    # (the training and validation directories)

    # Only rescale the raw data by 1./255, since each pixel value lies in 0-255
    train_datagen = ImageDataGenerator(rescale=1.0/255.)
    test_datagen = ImageDataGenerator(rescale=1.0/255.)

    # batch_size affects model quality and also has a clear impact on speed
    train_generator = train_datagen.flow_from_directory(train_dir,
                                                        batch_size=20,
                                                        class_mode='binary',
                                                        target_size=(150, 150))
    validation_generator = test_datagen.flow_from_directory(validation_dir,
                                                            batch_size=20,
                                                            class_mode='binary',
                                                            target_size=(150, 150))

Training:

    history = model.fit(train_generator,
                        validation_data=validation_generator,
                        steps_per_epoch=100,
                        epochs=15,
                        validation_steps=50,
                        verbose=2)
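The step counts are tied to the batch size: one step consumes one batch, so steps multiplied by batch size is the number of images seen per epoch. A quick arithmetic check, using only the numbers from the calls above (the helper `steps_for` is a hypothetical name):

```python
import math


def steps_for(n_images, batch_size):
    """Steps needed to pass over n_images once, one batch per step."""
    return math.ceil(n_images / batch_size)


batch_size = 20
train_images_per_epoch = 100 * batch_size  # steps_per_epoch * batch_size
val_images_per_epoch = 50 * batch_size     # validation_steps * batch_size
print(train_images_per_epoch, val_images_per_epoch)
print(steps_for(train_images_per_epoch, batch_size))
```

If the batch size is changed, the step counts should be adjusted with it so that each epoch still covers the data set once.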

The code below visualizes how the convolutions in the CNN model work:

    import os
    import random

    import matplotlib.pyplot as plt
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import img_to_array, load_img

    # Define a new Model that takes an image as input and outputs the
    # intermediate representations of all layers after the first one.
    successive_outputs = [layer.output for layer in model.layers[1:]]
    visualization_model = tf.keras.models.Model(inputs=model.input, outputs=successive_outputs)

    # Prepare a random input image of a cat or dog from the training set.
    cat_img_files = [os.path.join(train_cats_dir, f) for f in train_cat_fnames]
    dog_img_files = [os.path.join(train_dogs_dir, f) for f in train_dog_fnames]
    img_path = random.choice(cat_img_files + dog_img_files)

    img = load_img(img_path, target_size=(150, 150))  # this is a PIL image
    x = img_to_array(img)          # Numpy array with shape (150, 150, 3)
    x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 150, 150, 3)
    x /= 255.0                     # Rescale by 1/255

    # Run the image through the network to obtain all intermediate
    # representations for this image.
    successive_feature_maps = visualization_model.predict(x)

    # Layer names for the plot titles; skip the first layer so that the
    # names line up with successive_outputs
    layer_names = [layer.name for layer in model.layers[1:]]

    # Display the representations
    for layer_name, feature_map in zip(layer_names, successive_feature_maps):
        if len(feature_map.shape) == 4:
            # Only the conv / maxpool layers, not the fully-connected layers
            n_features = feature_map.shape[-1]  # number of features in the feature map
            size = feature_map.shape[1]         # feature map shape (1, size, size, n_features)

            # Tile the images in this matrix
            display_grid = np.zeros((size, size * n_features))

            # Postprocess each feature to be visually palatable
            for i in range(n_features):
                x = feature_map[0, :, :, i]
                x -= x.mean()
                std = x.std()
                if std > 0:  # avoid dividing by zero for a constant feature map
                    x /= std
                x *= 64
                x += 128
                x = np.clip(x, 0, 255).astype('uint8')
                # Tile each filter into a horizontal grid
                display_grid[:, i * size:(i + 1) * size] = x

            # Display the grid
            scale = 20. / n_features
            plt.figure(figsize=(scale * n_features, scale))
            plt.title(layer_name)
            plt.grid(False)
            plt.imshow(display_grid, aspect='auto', cmap='viridis')

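Judging from the result of each convolution and Pooling step, while analyzing this photo of two cats the model gradually picks out the body shape and the ears of the cat on the right and discards the other elements of the image. So after the stacked convolution and Pooling layers, what the model judges is not all the information in the image but the selected, highlighted information.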

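In addition, history.history['accuracy'] gives the model's accuracy for each epoch, which can be plotted against the epoch number as in the figure below. The validation accuracy has already leveled off by the second epoch; further training only makes the model more familiar with the training set, so this is a typical overfitting problem.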
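The same reading can be done numerically on the history dictionary. The sketch below assumes the keys 'accuracy' and 'val_accuracy' (the names Keras uses when the metric is 'accuracy'); the numbers are made up to mimic the pattern described and are not results from this model.

```python
def generalization_gap(history):
    """Training accuracy minus validation accuracy, per epoch."""
    return [t - v for t, v in zip(history["accuracy"], history["val_accuracy"])]


def best_epoch(history):
    """1-based epoch at which validation accuracy first reaches its maximum."""
    val = history["val_accuracy"]
    return max(range(len(val)), key=val.__getitem__) + 1


# Made-up values for illustration only, not measured results
example = {
    "accuracy":     [0.60, 0.75, 0.85, 0.93, 0.98],
    "val_accuracy": [0.62, 0.71, 0.72, 0.71, 0.72],
}

gaps = generalization_gap(example)
print([round(g, 2) for g in gaps], best_epoch(example))
```

A gap that keeps widening while the validation accuracy stays flat is the numerical signature of the overfitting seen in the plot. With a real run, `history.history` would be passed in place of `example`.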

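There is a good example for understanding overfitting here. Think of ourselves as an image recognition model: when we first come into the world, we need to observe many different people before we can judge their occupations correctly. Take telling soldiers apart: from pictures of soldiers and pictures of people on the street we conclude that whoever wears a helmet is usually a soldier, so we classify everyone with a helmet as a soldier. But if one day we see people wearing helmets on a construction site, we will wrongly classify them as soldiers too. Helmeted construction workers never appeared during our learning, so we never found the features that separate soldiers from construction workers.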

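In the same way, if the model has never seen in the training set the features that appear in the testing set, it will misjudge the testing set. Since the goal of training is a model that judges correctly in a wide range of cases, we need to keep increasing the diversity of the samples in the training set so that it covers the vast majority of real-world situations. This is also the main idea behind Augmentation.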

Augmentation

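As described above, the core of Augmentation is to increase the diversity of the training set by transforming the existing data, for example by shifting, shearing, zooming and rotating it. Implementing it in Keras is very straightforward: just add the corresponding transformation arguments to ImageDataGenerator() when generating the data. For example: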

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(
        rescale=1./255,
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

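Besides the scaling factor rescale used before, the other arguments mean:
rotation_range: random rotation around the center, in degrees
width_shift_range: random horizontal shift, as a fraction of the width
height_shift_range: random vertical shift, as a fraction of the height
shear_range: random shear
zoom_range: random zoom in or out
horizontal_flip: random horizontal mirroring
fill_mode: how to fill the empty pixels created by these transformations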
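Two of these transformations are easy to show on a tiny array. The sketch below imitates a horizontal flip and a horizontal shift with nearest-value filling in pure Python; the function names and the 2x4 "image" are made up, and Keras applies its own randomized versions of these operations internally.

```python
def horizontal_flip(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels):
    """Shift every row to the right, filling the gap with the nearest edge value."""
    shifted = []
    for row in image:
        p = min(max(pixels, 0), len(row))  # clamp to the row width
        shifted.append([row[0]] * p + row[:len(row) - p])
    return shifted


# Illustrative 2x4 single-channel image
image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]

flipped = horizontal_flip(image)
shifted = shift_right(image, 1)
print(flipped)
print(shifted)
```

Each transformed copy still shows the same subject with the same label, only seen differently, which is how Augmentation adds variety without adding new source images.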

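With Augmentation, training for 100 epochs effectively resolves the overfitting problem, as the figure below shows (dots are the training set, the solid line is the testing set). Because the augmented training set is more complex, the training accuracy after the same number of epochs is slightly lower, but it gradually recovers to its usual level as training continues. On the testing set, by contrast, augmentation keeps the prediction accuracy rising steadily, which avoids the overfitting problem.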

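It is worth noting, however, that augmentation is a somewhat tricky way to avoid overfitting and not the ideal one. The proper approach is still to keep collecting more data for training, so that the model covers real-world data as broadly as possible.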
