Machine Learning in Action: The k-Nearest Neighbors Algorithm

Technology 2025-06-03

In this chapter

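- Overview of the k-nearest neighbors algorithm
- Implementing the kNN classification algorithm
  - Pseudocode
  - Python code
- Example 1: improving matches on a dating site
  - Preparing the data: parsing data from a text file
  - Analyzing the data: creating scatter plots with Matplotlib
  - Preparing the data: normalizing values
  - Testing the algorithm: verifying the classifier as a complete program
  - Using the algorithm: building a complete, usable program
- Example 2: a handwriting recognition system
  - Preparing the data: converting images into test vectors
  - Testing the algorithm: recognizing handwritten digits with kNN
- Chapter summary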

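- The k-nearest neighbors classification algorithm
- Parsing and importing data from a text file
- Creating scatter plots with Matplotlib
- Normalizing numeric values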

Overview of the k-nearest neighbors algorithm

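In short, the k-nearest neighbors algorithm classifies a sample by measuring the distance between its feature values and those of samples whose class is already known.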

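- Pros: high accuracy, insensitive to outliers, no assumptions about the input data
- Cons: computationally expensive, requires a lot of memory
- Works with: numeric and nominal values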

General approach to the k-nearest neighbors algorithm

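1. Collect: any method can be used.
2. Prepare: numeric values are needed for the distance calculation, preferably in a structured data format.
3. Analyze: any method can be used.
4. Train: does not apply to the k-nearest neighbors algorithm.
5. Test: calculate the error rate.
6. Use: first supply the sample data and the structured output, then run the k-nearest neighbors algorithm to decide which class the input belongs to, and finally apply any follow-up processing to the computed class.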

Implementing the kNN classification algorithm

Pseudocode and actual Python code for the k-nearest neighbors algorithm

Pseudocode

For every point in the data set whose class is unknown, do the following:

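1. Calculate the distance between the current point and every point in the data set of known classes.
2. Sort the distances in increasing order.
3. Take the k points with the smallest distance to the current point.
4. Count how often each class occurs among these k points.
5. Return the most frequent class among the k points as the predicted class of the current point.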

Python code

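import operator

from numpy import tile


def classify0(inX, dataSet, labels, k):
    '''
    k-nearest neighbors classifier
    inX     - input vector to classify
    dataSet - training set, one sample per row
    labels  - label vector (same number of elements as dataSet has rows)
    k       - number of nearest neighbors to use
    '''
    # Euclidean distance from inX to every training sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the training samples, sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the labels by vote count, highest first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]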
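As a quick check of the steps above, here is a minimal self-contained sketch of the same idea, written with NumPy broadcasting and collections.Counter instead of tile and operator.itemgetter. The function name knn_classify, the four toy points and the labels 'A' and 'B' are assumptions made for illustration; they are not part of the original code.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training points nearest to in_x."""
    # Broadcasting subtracts in_x from every row, so tile() is not needed
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    # Count how often each label appears among those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points per class
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

result_near_origin = knn_classify(np.array([0.0, 0.0]), group, group_labels, 3)
result_near_ones = knn_classify(np.array([1.0, 1.0]), group, group_labels, 3)
print(result_near_origin, result_near_ones)
```

With k=3, each query point has two neighbors from its own cluster and one from the other, so the majority vote picks the nearby cluster.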

Example 1: improving matches on a dating site

Using the k-nearest neighbors algorithm on a dating site

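1. Collect: a text file is provided.
2. Prepare: parse the text file in Python.
3. Analyze: draw two-dimensional scatter plots with Matplotlib.
4. Train: does not apply here.
5. Test: use part of the data provided by the user as test samples.
6. Use: build a simple command-line program in which the user enters some feature values to find out whether the other person is a type they would like.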

Preparing the data: parsing data from a text file

Each sample has the following three features:

- Number of frequent flyer miles earned per year
- Percentage of time spent playing video games
- Liters of ice cream consumed per week

The text file shows that the data has three labels: didntLike, smallDoses, largeDoses

The file2matrix function takes a file name string as input and returns the training sample matrix (returnMat) and the class label vector (classLabelVector).

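from numpy import zeros


# Parse the text records into a NumPy matrix and a label list
def file2matrix(filename):
    # Read all lines and count them
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # Create the NumPy matrix to return: one row per line, three features
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    # Parse each line: the first three fields are features, the last is the label
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector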

returnMat:
[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 …
 [2.6575000e+04 1.0650102e+01 8.6662700e-01]
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]
classLabelVector[0:20]:
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
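file2matrix expects the last column of the file to hold an integer label. If a file stored the label names instead, one possible approach would be a lookup table. The sketch below assumes the mapping didntLike = 1, smallDoses = 2, largeDoses = 3 (the numbering implied by the legend in the plotting code) and parses in-memory lines rather than a real file. The names parse_lines and LABEL_TO_INT are hypothetical, and the sample lines reuse the first three rows of the output above with the label written as a name.

```python
import numpy as np

# Assumed mapping from label names to the integer classes used in this chapter
LABEL_TO_INT = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}


def parse_lines(lines):
    """Turn tab-separated text lines into a feature matrix and a label list."""
    features = np.zeros((len(lines), 3))
    class_labels = []
    for index, line in enumerate(lines):
        fields = line.strip().split('\t')
        # The first three fields are the numeric features
        features[index, :] = [float(value) for value in fields[0:3]]
        # The last field is the label name
        class_labels.append(LABEL_TO_INT[fields[-1]])
    return features, class_labels


# Illustrative input: string-labelled versions of the first three rows
sample_lines = [
    '40920\t8.326976\t0.953952\tlargeDoses\n',
    '14488\t7.153469\t1.673904\tsmallDoses\n',
    '26052\t1.441871\t0.805124\tdidntLike\n',
]
features, class_labels = parse_lines(sample_lines)
print(features)
print(class_labels)
```

An unknown label name raises a KeyError here, which is usually preferable to silently assigning a wrong class.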

Analyzing the data: creating scatter plots with Matplotlib

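Showing the data graphically gives a much clearer view of its structure. The scatter plot uses the second and third columns of the datingDataMat matrix, which hold the feature values "percentage of time spent playing video games" and "liters of ice cream consumed per week"; the function below also plots the other two pairs of features.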

import matplotlib.lines as mlines
import matplotlib.pyplot as plt


def showdatas(datingDataMat, datingLabels):
    # 2x2 grid of axes; three panels are used, one per pair of features
    fig, axs = plt.subplots(nrows=2, ncols=2, sharex=False, sharey=False, figsize=(13, 8))
    # One color per class label: 1 = didntLike, 2 = smallDoses, 3 = largeDoses
    colorMap = {1: 'black', 2: 'orange', 3: 'red'}
    LabelsColors = [colorMap[i] for i in datingLabels]
    # Names of the three feature columns, used for titles and axis labels
    featureNames = ['Frequent flyer miles earned per year',
                    'Percentage of time spent playing video games',
                    'Liters of ice cream consumed per week']
    # Legend entries, one per class
    didntLike = mlines.Line2D([], [], color='black', marker='.', markersize=6, label='didntLike')
    smallDoses = mlines.Line2D([], [], color='orange', marker='.', markersize=6, label='smallDoses')
    largeDoses = mlines.Line2D([], [], color='red', marker='.', markersize=6, label='largeDoses')
    # (axes, x column, y column) for each panel
    panels = [(axs[0][0], 0, 1), (axs[0][1], 0, 2), (axs[1][0], 1, 2)]
    for ax, xCol, yCol in panels:
        # Scatter plot with marker size 15 and opacity 0.5
        ax.scatter(x=datingDataMat[:, xCol], y=datingDataMat[:, yCol],
                   color=LabelsColors, s=15, alpha=.5)
        # Title, x-axis label and y-axis label
        titleText = ax.set_title('{} vs. {}'.format(featureNames[xCol], featureNames[yCol]))
        xlabelText = ax.set_xlabel(featureNames[xCol])
        ylabelText = ax.set_ylabel(featureNames[yCol])
        plt.setp(titleText, size=9, weight='bold', color='red')
        plt.setp(xlabelText, size=7, weight='bold', color='black')
        plt.setp(ylabelText, size=7, weight='bold', color='black')
        # Add the legend
        ax.legend(handles=[didntLike, smallDoses, largeDoses])
    plt.show()

Preparing the data: normalizing values

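The table of four extracted samples gives the values used here. To compute the distance between sample 3, with feature values (0, 20000, 1.1), and sample 4, with feature values (67, 32000, 0.1), we can use:

$d=\sqrt{(0-67)^2+(20000-32000)^2+(1.1-0.1)^2}$

It is easy to see that the attribute with the largest numeric differences dominates the result. When features have such different ranges, the values are therefore usually normalized, for example to the range 0 to 1 or -1 to 1. The following formula rescales a feature to the range 0 to 1, where min and max are the smallest and largest values of that feature in the data set:

$newValue = (oldValue - min) / (max - min)$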
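To make the imbalance concrete, the sketch below evaluates the distance formula above for samples 3 and 4 and reports how much each feature contributes to the squared distance. The variable names are illustrative only.

```python
import numpy as np

# Feature values of sample 3 and sample 4, taken from the distance formula above
sample3 = np.array([0.0, 20000.0, 1.1])
sample4 = np.array([67.0, 32000.0, 0.1])

# Squared difference per feature, then the Euclidean distance
squared_diffs = (sample3 - sample4) ** 2
distance = np.sqrt(squared_diffs.sum())

# Share of the squared distance contributed by each feature
shares = squared_diffs / squared_diffs.sum()

print('distance: {:.3f}'.format(distance))
for i, share in enumerate(shares):
    print('feature {}: {:.6%} of the squared distance'.format(i + 1, share))
```

The distance comes out at roughly 12000.19, and more than 99.99% of the squared distance is due to the second feature alone, which is why the features are rescaled before they are compared.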

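from numpy import tile


# Preparing the data: normalizing values
def autoNorm(dataSet):
    # The argument 0 makes min() and max() work column by column
    minvals = dataSet.min(0)
    maxvals = dataSet.max(0)
    ranges = maxvals - minvals
    m = dataSet.shape[0]
    # tile() repeats minvals m times so that it matches the shape of dataSet
    normDataSet = dataSet - tile(minvals, (m, 1))
    # Element-wise division of each feature value by the range of its column
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minvals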

Result of running the function:

normDataSet:
[[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 …
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
ranges:
[9.1273000e+04 2.0919349e+01 1.6943610e+00]
minvals:
[0.       0.       0.001156]
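The same min-max scaling can also be written with NumPy broadcasting, which makes the tile() calls unnecessary. The sketch below is one possible formulation and does not replace autoNorm. The function name min_max_normalize, the three-row data set (the first three rows of returnMat shown earlier) and the new sample are assumptions for illustration.

```python
import numpy as np


def min_max_normalize(data_set):
    """Scale every column of data_set to the range [0, 1]."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the per-column values to every row
    return (data_set - min_vals) / ranges, ranges, min_vals


# Small illustrative data set: three rows, three features
data = np.array([[40920.0, 8.326976, 0.953952],
                 [14488.0, 7.153469, 1.673904],
                 [26052.0, 1.441871, 0.805124]])
norm_data, ranges, min_vals = min_max_normalize(data)
print(norm_data)

# A new sample has to be scaled with the same min_vals and ranges
new_sample = np.array([20000.0, 5.0, 1.0])
norm_sample = (new_sample - min_vals) / ranges
print(norm_sample)
```

A column in which every value is identical has a range of zero and would cause a division by zero; neither this sketch nor autoNorm guards against that case.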

Testing the algorithm: verifying the classifier as a complete program

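The error rate is used to measure the performance of the classifier: the number of wrong results the classifier gives divided by the total number of test samples.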

# Test code for the classifier on the dating site data
def datingClassTest():
    # Fraction of the data held out for testing
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples, the rest are training samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with:{},the real answer is:{}'.format(classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is :{}%'.format(errorCount / float(numTestVecs) * 100))

Output:

the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
the classifier came back with:3,the real answer is:3
the classifier came back with:3,the real answer is:3
the classifier came back with:2,the real answer is:2
the classifier came back with:1,the real answer is:1
the classifier came back with:3,the real answer is:1
the total error rate is :5.0%

The classifier's error rate on the dating data set is 5.0%, which is a good result. A user can now enter the attributes of an unknown person and let the classifier help judge how well that person is likely to suit them.
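The two ingredients of this test, holding out the first part of the data and counting wrong predictions, can be isolated in a short sketch. The function names, the data set size of 1000 rows and the five predictions below are assumptions for illustration and are not taken from the data file.

```python
def holdout_split(n_samples, ho_ratio):
    """Return index ranges for the test part (first rows) and the training part (the rest)."""
    n_test = int(n_samples * ho_ratio)
    return range(0, n_test), range(n_test, n_samples)


def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)


# 1000 rows is an assumed data set size
test_idx, train_idx = holdout_split(1000, 0.10)
print(len(test_idx), len(train_idx))

# Hypothetical predictions with one mistake out of five
rate = error_rate([2, 1, 3, 3, 3], [2, 1, 3, 3, 1])
print('error rate: {:.1%}'.format(rate))
```

Taking the first rows as the test set only gives a fair estimate if the rows are in no particular order; if the file were sorted by class, shuffling before the split would be the safer choice.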

Using the algorithm: building a complete, usable program

The user enters the three feature values described above, and the program predicts how much they will like the unknown person.

from numpy import array


def classifPerson():
    # Text for class labels 1, 2 and 3
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input('percentage of time spent playing video games?'))
    ffmiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Features in the same column order as the training data
    inArr = array([ffmiles, percentTats, iceCream])
    # Normalize the input with the training set's minimum values and ranges
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print('You will probably like this person:{}'.format(resultList[classifierResult - 1]))

Test run:

percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person:in small doses

Example 2: a handwriting recognition system

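1. Collect: text files are provided.
2. Prepare: write the function img2vector() to convert the image format into the vector format used by the classifier.
3. Analyze: inspect the data at the Python prompt.
4. Train: not needed.
5. Test: use part of the data as test samples.
6. Use the algorithm.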

Preparing the data: converting images into test vectors

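The samples are stored in two directories, trainingDigits and testDigits. Each 32×32 binary image matrix can be converted into a 1×1024 vector, so that the classifier from the previous example can process the digit images.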

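from numpy import zeros


# Convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            # Read one row of the image
            lineStr = fr.readline()
            for j in range(32):
                # Store the j-th character of row i at position 32*i+j
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect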

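The function creates a 1×1024 NumPy array, opens the given file, reads its first 32 lines in a loop, stores the first 32 characters of each line in the array, and finally returns the array. Test result:

returnVect[0,0:32]:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]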
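The flattening step can be tried without any data files. The sketch below is an alternative formulation that builds the 32×32 grid from a string and lets reshape produce the 1×1024 vector. The function name text_image_to_vector and the test image are hypothetical: the first row copies the 32 values shown above, and the remaining rows are simply set to zero.

```python
import numpy as np


def text_image_to_vector(text, size=32):
    """Convert size lines of size '0'/'1' characters into a 1 x size*size vector."""
    rows = text.splitlines()[:size]
    # One list of digits per row, using only the first size characters
    grid = np.array([[int(ch) for ch in row[:size]] for row in rows], dtype=float)
    # Row-major flattening puts pixel (i, j) at position size*i + j
    return grid.reshape(1, size * size)


# Hypothetical image: a three-pixel stroke in the first row, zeros elsewhere
first_row = '0' * 15 + '111' + '0' * 14
image_text = '\n'.join([first_row] + ['0' * 32] * 31)

vector = text_image_to_vector(image_text)
print(vector.shape)
print(vector[0, 0:32])
```

Row-major flattening gives the same index layout as the 32*i+j arithmetic in img2vector.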

Testing the algorithm: recognizing handwritten digits with kNN

from os import listdir

from numpy import zeros


def handwritingClassTest():
    hwLabels = []
    # List the contents of the training directory
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainMat = zeros((m, 1024))
    # Parse the class digit from each file name (the part before the underscore)
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainMat[i, :] = img2vector('trainingDigits/{}'.format(fileNameStr))
    # Classify every file in the test directory and count the errors
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/{}'.format(fileNameStr))
        classifierResult = classify0(vectorUnderTest, trainMat, hwLabels, 3)
        print('the classifier came back with:{},the real answer is:{}'.format(classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1
    print('the total number of errors is :{}'.format(errorCount))
    print('the total error rate is :{}'.format(errorCount / float(mTest)))

Test result:

the classifier came back with:5,the real answer is:9
the classifier came back with:1,the real answer is:9
the classifier came back with:5,the real answer is:9
the classifier came back with:5,the real answer is:9
the classifier came back with:3,the real answer is:9
the classifier came back with:3,the real answer is:9
the classifier came back with:4,the real answer is:9
the total number of errors is :322.0
the total error rate is :0.3403805496828753
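With an error rate like this, it may help to see which digits are confused with which before looking for the cause. The sketch below tallies (actual, predicted) pairs with collections.Counter. The seven pairs are copied from the output lines shown above; in a real run they would have to be collected inside the test loop, which the original code does not do.

```python
from collections import Counter

# (actual, predicted) pairs copied from the seven output lines shown above
pairs = [(9, 5), (9, 1), (9, 5), (9, 5), (9, 3), (9, 3), (9, 4)]

# Keep only the misclassified pairs and count each combination
confusions = Counter(pair for pair in pairs if pair[0] != pair[1])

for (actual, predicted), count in confusions.most_common():
    print('actual {} predicted as {}: {} time(s)'.format(actual, predicted, count))
```

A tally like this shows whether the errors are spread evenly or concentrated on a few digits, which can hint at a problem in the data or in its preparation rather than in the classifier itself.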

Chapter summary

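The k-nearest neighbors algorithm is one of the simplest and most effective algorithms for classifying data, but it has to keep the entire data set. A large data set therefore needs a lot of storage, and because a distance must be computed to every stored sample, classification can be slow. The algorithm also reveals nothing about the underlying structure of the data, so there is no way to know what an average or typical sample of each class looks like.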
