Python os: Batch Check for Abnormal File Sizes (Runnable Code)

Technology 2025-09-17

Introduction

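Files produced in batches (for example, dataset samples) usually carry sequentially increasing indexes. For various reasons, such as a corrupted source file, some of these samples are unusable. The broken samples often differ greatly in size from the normal ones (see the figure below). If these abnormal samples are left untreated, later steps such as deep learning training may fail, for example by producing NaN values.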

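In earlier data processing work I used a batch file-size check to deal with abnormally sized files among consecutively numbered files.
1. Check whether each file's size lies within the normal threshold range. If it does not, overwrite the file with a normal file that has a different index.
2. Several folders may contain files with the same index sequence, and files with the same index in different folders may correspond to each other, so the same replacement is applied in every folder.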
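Before overwriting anything, it can help to list which indexes fall outside the threshold range. The following is a minimal sketch of step 1 only, under some assumptions: files are named prefix + index + ".mat", sizes are measured in bytes with os.path.getsize, and a missing file counts as abnormal. The function name find_abnormal_ids and its parameters are hypothetical and not part of the original script.

```python
import os


def find_abnormal_ids(prefix, start_id, end_id, lower, upper, ext=".mat"):
    """Return the indexes in [start_id, end_id] whose file size in bytes
    is outside [lower, upper]. A missing file is reported as abnormal too."""
    abnormal = []
    for i in range(start_id, end_id + 1):
        path = f"{prefix}{i}{ext}"
        try:
            # os.path.getsize returns the size in bytes and raises OSError
            # if the file does not exist or cannot be accessed.
            size = os.path.getsize(path)
        except OSError:
            abnormal.append(i)
            continue
        if size < lower or size > upper:
            abnormal.append(i)
    return abnormal
```

Running this as a dry run first gives a list that can be inspected by hand before any file is replaced.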

Code

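import os
import time


def check_size(rangeId, thresholdSize, filePathCheckSize, filePathAllToProcess):
    startId, endId = rangeId
    lowerThresholdSize, upperThresholdSize = thresholdSize
    numFile = endId - startId + 1
    numFileHalf = numFile // 2  # jump distance used when looking for a replacement
    for i in range(startId, endId + 1):
        srcId = i  # index of the file that will replace file i
        tmpSize = getSize(srcId, filePathCheckSize)
        tried = {srcId}
        while tmpSize < lowerThresholdSize or tmpSize > upperThresholdSize:
            # Jump half the range forward, or backward if that would pass endId.
            if srcId + numFileHalf <= endId:
                srcId += numFileHalf
            else:
                srcId -= numFileHalf
            if srcId in tried:
                # Every candidate is abnormal: stop instead of looping forever.
                raise RuntimeError("no normal replacement found for file " + str(i))
            tried.add(srcId)
            tmpSize = getSize(srcId, filePathCheckSize)
        if i != srcId:
            print("file " + str(i) + " (size: " + str(getSize(i, filePathCheckSize))
                  + "kb) will be substituted by file " + str(srcId) + ". ")
            # Apply the same replacement in every folder.
            for file in filePathAllToProcess:
                cmd = "cp " + file + str(srcId) + ".mat " + file + str(i) + ".mat"
                os.system(cmd)
                print(cmd)


def getSize(i, filePathCheckSize):
    # `du -s` prints the disk usage followed by the path; keep only the number.
    content = os.popen('du -s ' + filePathCheckSize + str(i) + '.mat').read()
    return int(content.split()[0])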
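The script above shells out to du and cp, so it only runs where those commands exist. Below is a portable sketch of the same idea that uses os.path.getsize and shutil.copyfile instead. It rests on assumptions that differ from the original: sizes are in bytes rather than du blocks (whose unit depends on the platform), so thresholds must be chosen accordingly, and the replacement is the nearest normal file rather than one half a range away. The names is_normal, nearest_normal_id and replace_abnormal are hypothetical.

```python
import os
import shutil


def is_normal(path, lower, upper):
    """True if the file exists and its size in bytes lies in [lower, upper]."""
    try:
        return lower <= os.path.getsize(path) <= upper
    except OSError:
        return False


def nearest_normal_id(i, start_id, end_id, prefix, lower, upper, ext=".mat"):
    """Search outward from i (i-1, i+1, i-2, ...) for a normal file.
    Returns None if no file in the range is normal."""
    for offset in range(1, end_id - start_id + 1):
        for cand in (i - offset, i + offset):
            if start_id <= cand <= end_id and is_normal(f"{prefix}{cand}{ext}", lower, upper):
                return cand
    return None


def replace_abnormal(start_id, end_id, lower, upper, check_prefix, all_prefixes, ext=".mat"):
    """Overwrite each abnormal file with a normal one, in every folder.
    Returns a dict mapping each replaced index to the index it was copied from."""
    replaced = {}
    for i in range(start_id, end_id + 1):
        if is_normal(f"{check_prefix}{i}{ext}", lower, upper):
            continue
        src = nearest_normal_id(i, start_id, end_id, check_prefix, lower, upper, ext)
        if src is None:
            raise RuntimeError(f"no normal replacement found for file {i}")
        # Copy the same source index in every folder so that files with
        # equal indexes stay in correspondence across folders.
        for prefix in all_prefixes:
            shutil.copyfile(f"{prefix}{src}{ext}", f"{prefix}{i}{ext}")
        replaced[i] = src
    return replaced
```

Because the search is bounded by the size of the range, it always terminates, and it raises an error when no normal file exists instead of guessing.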

Usage example

def main():
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    rangeId = [24260, 25500]  # range of consecutive file indexes (inclusive)
    thresholdSize = [400, float('inf')]  # [minimum normal size, maximum normal size]
    filePathCheckSize = 'pathA/A_'  # folder (with filename prefix) whose file sizes are checked
    filePathAllToProcess = ['pathA/A_', 'pathB/B_']  # folders that receive the same replacement
    check_size(rangeId, thresholdSize, filePathCheckSize, filePathAllToProcess)
    end = time.perf_counter()
    print("time used: " + str(end - start))


if __name__ == "__main__":
    main()

Results
