Image Data Analysis

Technology 2023-11-29

Exploratory data analysis consists of brief analyses that describe a dataset, guide the modeling process, and answer preliminary questions. For classification problems, this might include looking at the distributions of variables or checking for meaningful patterns in the predictors across classes. The same applies to the classification of image data: we want to see what meaningful information simple operations can give us. Here, I outline a few methods for doing this, using the Chest X-Rays data [source]. This dataset consists of X-ray images of pneumonia patients and healthy controls.

Raw Comparison

First, we can start by simply looking at a few randomly sampled images.

import os
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image
%matplotlib inline

train_dir = 'DATA/train'  # image folder

# get the list of jpegs from sub image class folders
normal_imgs = [fn for fn in os.listdir(f'{train_dir}/NORMAL') if fn.endswith('.jpeg')]
pneumo_imgs = [fn for fn in os.listdir(f'{train_dir}/PNEUMONIA') if fn.endswith('.jpeg')]

# randomly select 3 of each
select_norm = np.random.choice(normal_imgs, 3, replace = False)
select_pneu = np.random.choice(pneumo_imgs, 3, replace = False)

# plotting 2 x 3 image matrix
fig = plt.figure(figsize = (8,6))
for i in range(6):
    if i < 3:
        fp = f'{train_dir}/NORMAL/{select_norm[i]}'
        label = 'NORMAL'
    else:
        fp = f'{train_dir}/PNEUMONIA/{select_pneu[i-3]}'
        label = 'PNEUMONIA'
    ax = fig.add_subplot(2, 3, i+1)
    # to plot without rescaling, remove target_size
    fn = image.load_img(fp, target_size = (100,100), color_mode='grayscale')
    plt.imshow(fn, cmap='Greys_r')
    plt.title(label)
    plt.axis('off')
plt.show()

# also check the number of files here
len(normal_imgs), len(pneumo_imgs)

This step pulls random images from each sub-folder and displays them.

Images as Matrix

For the next few steps, we will work directly with the pixel values of each image so we can do operations on them. We can accomplish this by converting our images into NumPy arrays.

# making n X m matrix
def img2np(path, list_of_filename, size = (64, 64)):
    rows = []
    # iterating through each file
    for fn in list_of_filename:
        fp = path + fn
        current_image = image.load_img(fp, target_size = size, color_mode = 'grayscale')
        # convert image to a matrix
        img_ts = image.img_to_array(current_image)
        # turn that into a vector / 1D array
        rows.append(img_ts.ravel())
    # stack the vectors into one (n, m) matrix
    full_mat = np.stack(rows)
    return full_mat

# run it on our folders
normal_images = img2np(f'{train_dir}/NORMAL/', normal_imgs)
pnemonia_images = img2np(f'{train_dir}/PNEUMONIA/', pneumo_imgs)

This function iterates through the files and stacks them into an (n, m) matrix, where n is the number of observations (images) and m is the number of pixels per image (64 x 64 = 4,096 at the default size).
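
To make the (n, m) layout concrete, here is a minimal sketch that uses small made-up arrays instead of the X-ray files. The helper `stack_as_rows` and the toy images are assumptions for illustration only; they are not part of the original code.

```python
import numpy as np

def stack_as_rows(images):
    """Flatten each 2D image into a row and stack the rows into an (n, m) matrix."""
    return np.stack([np.asarray(img, dtype=np.float32).ravel() for img in images])

# three hypothetical 4 x 4 "images" filled with 0, 1 and 2
toy_images = [np.full((4, 4), v) for v in range(3)]
toy_mat = stack_as_rows(toy_images)
print(toy_mat.shape)  # (3, 16): 3 observations, 16 pixels each

# reshaping a row restores the original 2D layout
restored = toy_mat[1].reshape(4, 4)
```

Each row is one image, so any column-wise statistic (mean, standard deviation, and so on) is a per-pixel statistic across images.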

Average Image

Now let’s see what the average image looks like for each class. To compute the average image, we can take the average value of each pixel across all observations.

def find_mean_img(full_mat, title, size = (64, 64)):
    # calculate the average
    mean_img = np.mean(full_mat, axis = 0)
    # reshape it back to a matrix
    mean_img = mean_img.reshape(size)
    plt.imshow(mean_img, vmin=0, vmax=255, cmap='Greys_r')
    plt.title(f'Average {title}')
    plt.axis('off')
    plt.show()
    return mean_img

norm_mean = find_mean_img(normal_images, 'NORMAL')
pneu_mean = find_mean_img(pnemonia_images, 'PNEUMONIA')

We can see from the average images that pneumonia X-rays tend to show more obstruction around the chest area.

Contrast Between Average Images

Using the average images, we can also compute the difference between the two classes.

contrast_mean = norm_mean - pneu_mean
plt.imshow(contrast_mean, cmap='bwr')
plt.title('Difference Between Normal & Pneumonia Average')
plt.axis('off')
plt.show()
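
Beyond plotting the difference, one could also list where the two average images differ most. The sketch below is one possible way to do that; the helper `largest_differences` and the toy arrays are hypothetical and stand in for `norm_mean` and `pneu_mean`.

```python
import numpy as np

def largest_differences(mean_a, mean_b, k=3):
    """Return (row, col) positions of the k pixels where two mean images differ most."""
    diff = np.abs(mean_a - mean_b)
    # sort the flattened differences and keep the k largest
    flat_idx = np.argsort(diff, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_idx, diff.shape)
    return list(zip(rows.tolist(), cols.tolist()))

# toy stand-ins for the two average images
toy_norm = np.zeros((4, 4))
toy_pneu = np.zeros((4, 4))
toy_pneu[2, 1] = 50.0
toy_pneu[0, 3] = 20.0
print(largest_differences(toy_norm, toy_pneu, k=2))  # [(2, 1), (0, 3)]
```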

Variability

Similarly, we can look at which areas are most variable within each class by computing the variance or standard deviation instead of the mean. Here, lighter areas indicate higher variability. Again, we can see that pneumonia X-rays show more variability within the lungs.
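
A minimal sketch of the standard deviation image, assuming the same (n, m) matrix layout as before. The function name `find_std_img` is my own, and the random `toy_mat` is only a stand-in for `normal_images` or `pnemonia_images`.

```python
import numpy as np

def find_std_img(full_mat, size=(64, 64)):
    """Per-pixel standard deviation across all images of one class."""
    std_img = np.std(full_mat, axis=0)
    # reshape the vector of per-pixel values back to a 2D image
    return std_img.reshape(size)

rng = np.random.default_rng(0)
toy_mat = rng.uniform(0, 255, size=(10, 64 * 64))  # stand-in for a class matrix
toy_mat[:, :64] = 128.0  # make the top pixel row identical in every image
std_img = find_std_img(toy_mat)
print(std_img.shape)  # (64, 64)
```

The result can be displayed with `plt.imshow(std_img, cmap='Greys_r')`, the same way as the mean image, so that pixels that never change show up dark.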

Eigenimages

Lastly, we can use a dimensionality reduction technique such as principal component analysis (PCA) to visualize the components that best describe each class. The eigenimages, which are essentially the eigenvectors (components) of the PCA of our image matrix, can be reshaped into matrices and plotted. They are also called eigenfaces, as this approach was first used in facial recognition research. Here we will visualize the principal components that describe 70% of the variability in each class.
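
To show what "components that describe 70% of the variability" means, here is a rough sketch that counts how many components are needed to reach a given share of the variance. It uses NumPy's SVD directly rather than scikit-learn's PCA, and the helper `n_components_for_variance` and the toy matrix are assumptions for illustration; a library implementation may handle edge cases differently.

```python
import numpy as np

def n_components_for_variance(full_mat, threshold=0.7):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    centered = full_mat - full_mat.mean(axis=0)
    # squared singular values of the centered data are proportional
    # to the variance along each principal component
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratio)
    n_comp = int(np.searchsorted(cumulative, threshold) + 1)
    return n_comp, cumulative

# toy data with variance 9 : 4 : 1 along three orthogonal directions
toy_mat = np.array([
    [3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
    [0.0, 2.0, 0.0], [0.0, -2.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
])
n_comp, cumulative = n_components_for_variance(toy_mat, threshold=0.7)
print(n_comp)  # 2, since the first component explains only 9/14 of the variance
```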

from sklearn.decomposition import PCA
from math import ceil

def eigenimages(full_mat, title, n_comp = 0.7, size = (64, 64)):
    # fit PCA to describe n_comp * variability in the class
    pca = PCA(n_components = n_comp, whiten = True)
    pca.fit(full_mat)
    print('Number of PC: ', pca.n_components_)
    return pca

def plot_pca(pca, size = (64, 64)):
    # plot eigenimages in a grid
    n = pca.n_components_
    fig = plt.figure(figsize=(8, 8))
    r = int(n**.5)
    c = ceil(n / r)
    for i in range(n):
        ax = fig.add_subplot(r, c, i + 1, xticks = [], yticks = [])
        ax.imshow(pca.components_[i].reshape(size), cmap='Greys_r')
    plt.axis('off')
    plt.show()

plot_pca(eigenimages(normal_images, 'NORMAL'))
plot_pca(eigenimages(pnemonia_images, 'PNEUMONIA'))

You can see that the eigenimages of healthy X-ray images show much more edge definition around the rib cage and organs than those of the pneumonia class.

Today, I briefly showed a few quick and easy methods for finding patterns in a simple image dataset. These methods work well on images with somewhat regular compositions. In addition to the above methods, we can also look at fisherfaces and the correlation matrix across pixels in our exploratory analysis of image data.
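
As a rough sketch of the pixel correlation idea, the snippet below computes the Pearson correlation between every pair of pixel columns. The helper `pixel_correlation` and the random toy data are assumptions for illustration. Note that the result has m x m entries, so it is worth using small images (8 x 8 here), and pixels that are constant across all images would produce NaN values.

```python
import numpy as np

def pixel_correlation(full_mat):
    """Pearson correlation between every pair of pixel columns in an (n, m) matrix."""
    # rowvar=False treats each column (pixel) as a variable
    return np.corrcoef(full_mat, rowvar=False)

rng = np.random.default_rng(0)
toy_mat = rng.uniform(0, 255, size=(50, 8 * 8))  # 50 made-up 8 x 8 images
toy_mat[:, 1] = toy_mat[:, 0]  # make pixel 1 an exact copy of pixel 0
corr = pixel_correlation(toy_mat)
print(corr.shape)  # (64, 64)
```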

Data Source

Translated from: https://towardsdatascience.com/exploratory-data-analysis-ideas-for-image-classification-d3fc6bbfb2d2
