Let's start by introducing the algorithm itself. Suppose we want to classify things by their features: for example, the recently popular films Jiang Ziya (姜子牙) and My People, My Homeland (我和我的家乡) are both very good movies, but how do we decide which genre each belongs to? We need a systematic method, so we use the k-nearest neighbors algorithm to divide movies into genres automatically and classify them.
First, let's understand what the k-nearest neighbors algorithm is. Machine learning material can feel abstract and hard to follow at first, so we start with this relatively simple algorithm. The core idea is straightforward: we classify items by comparing their feature values.
Its advantages: high accuracy, insensitivity to outliers, and no assumptions about the input data. Its drawbacks: high computational complexity and high space complexity.
How it works: we start from a training set in which every sample already carries a label, so the correspondence between features and categories is known. When a new, unlabeled sample arrives, we compare each of its features against the features of every sample in the training set and extract the labels of the most similar samples.
More precisely, we take the k most similar samples from the training set, and the label that appears most often among those k neighbors becomes the class assigned to the new sample.
To summarize: the k-nearest neighbors algorithm selects the k most similar samples from the given data and classifies the new sample by majority vote among them.
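As a toy illustration of that majority vote (my own sketch, not yet the book's code): with k = 3, if the three nearest neighbors carry the labels A, A, and B, the new sample is assigned the majority label A:

from collections import Counter

nearest_labels = ['A', 'A', 'B']  # labels of the k = 3 nearest neighbors
print(Counter(nearest_labels).most_common(1)[0][0])  # prints 'A'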
Now let's work through a hands-on example:
'''
Created on Sep 16, 2010
kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)
            dataSet: size m data set of known vectors (NxM)
            labels: data set labels (1xM vector)
            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label

@author: pbharrin
'''
from numpy import *
import operator
from os import listdir


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),  # items(), not iteritems(), under Python 3
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())  # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # prepare matrix to return
    classLabelVector = []  # prepare labels return
    fr = open(filename)  # re-open to read again from the start
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector


def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals


def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)


def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))


group, labels = createDataSet()
print(group, labels)

It's fine if the code is hard to follow at first; just skim through it for now. Running the file produces no classification result yet: the driver lines at the bottom only build the small sample dataset and print it.
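As a quick sanity check (my own driver snippet, not from the book), we can feed the toy dataset to classify0 and classify the point [0, 0], which sits right next to the two 'B' samples:

group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # prints 'B'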
Next, let's analyze the algorithm piece by piece:
The classification function:
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

The distance computation:
diffMat = tile(inX, (dataSetSize, 1)) - dataSet
sqDiffMat = diffMat ** 2
sqDistances = sqDiffMat.sum(axis=1)
distances = sqDistances ** 0.5

Note: the book's original version raises an error here under Python 3, because it calls classCount.iteritems() and Python 3 dictionaries no longer have an iteritems() method; changing it to items(), as in the listing above, fixes the error.
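To see what these four lines do, here is a small standalone demo (assuming only NumPy) of the Euclidean distance computation against the toy dataset from createDataSet. tile repeats inX once per training row so the subtraction is element-wise:

from numpy import array, tile

dataSet = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
inX = array([0, 0])
diffMat = tile(inX, (4, 1)) - dataSet        # repeat inX into a 4x2 matrix, then subtract
distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
print(distances)            # [1.48660687 1.41421356 0.         0.1       ]
print(distances.argsort())  # [2 3 1 0]: row indices sorted from nearest to farthest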
We can change the input vector's values and verify that the predicted label changes accordingly.
We can also load data from a text file. First, we define a function for that in the same file:
def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

The opening steps are exactly like ordinary file reading. We first create a matrix of zeros sized to the number of lines in the file, and then fill it with the corresponding values from the file.
line.strip() removes the trailing newline; we then split the string on '\t' into a list, take the first three elements, and append them row by row into the feature matrix. Finally, we use a negative index to grab the last element (the class label) and store it in the label vector. The label must be explicitly converted with int(), otherwise it would be handled as a string.
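To make the parsing concrete, here is one line in the tab-delimited format the book's datingTestSet2.txt uses (the specific values shown are illustrative):

line = '40920\t8.326976\t0.953952\t3\n'  # three features plus a class label
fields = line.strip().split('\t')
print(fields)           # ['40920', '8.326976', '0.953952', '3']
print(fields[0:3])      # the three feature columns for one matrix row
print(int(fields[-1]))  # 3 -- the label, converted from string to int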
To visualize the data, we can use the matplotlib library, which we need to import first.
We can do it like this:
from numpy import *
import matplotlib
import matplotlib.pyplot as plt


def file0matrix(filename):
    with open(filename) as f:
        total = f.readlines()
    numberLen = len(total)
    returnMat = zeros((numberLen, 3))
    classVector = []
    index = 0
    for line in total:
        line = line.strip()
        item = line.split('\t')
        returnMat[index, :] = item[0:3]
        classVector.append(item[-1])
        index += 1
    return returnMat, classVector


dataMat, dataVector = file0matrix('datingTestSet.txt')
print(dataMat[:, 1])
print(dataMat[:, 2])

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(dataMat[:, 1], dataMat[:, 2])
plt.show()

As we can see, the scatter function marks each data point on the plot. But every point is drawn in the same color, which makes the classes hard to tell apart, so we handle it as follows:
ax = fig.add_subplot(111)  # subplots within one figure: arg 1 = total rows, arg 2 = total columns, arg 3 = this subplot's position
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))  # scatter plot; marker size and color scaled by class label
plt.show()

Now the classes are visually distinguishable.
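The snippet above assumes datingDataMat and datingLabels already exist. A self-contained version (assuming file2matrix is defined as earlier and datingTestSet2.txt, whose labels are numeric, sits in the working directory) might look like this:

import matplotlib.pyplot as plt
from numpy import array

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # numeric labels: 1, 2, 3
fig = plt.figure()
ax = fig.add_subplot(111)
# marker size and color both scale with the class label, so the classes separate visually
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()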
Sometimes we find that features with much larger numeric ranges in the training set dominate the distance computation and skew our judgments. To reduce this effect, we normalize the data, mapping every column into the range [0, 1] via newValue = (oldValue - min) / (max - min). We can do it like this:
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

The column-wise minima are stored in minVals and the maxima in maxVals; the argument 0 to min() and max() makes NumPy take the extremum of each column rather than of the whole matrix. The function then subtracts the minima and divides by the ranges column by column, returning the new normalized matrix along with ranges and minVals.
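A quick check of autoNorm on a tiny made-up matrix (the values are hypothetical) shows each column being mapped into [0, 1]:

from numpy import array

raw = array([[1000.0, 0.5],
             [2000.0, 1.0],
             [1500.0, 0.0]])
normed, ranges, minVals = autoNorm(raw)
print(minVals)  # [1000.    0.]  -- column-wise minima
print(ranges)   # [1000.    1.]  -- column-wise max minus min
print(normed)   # [[0.  0.5] [1.  1. ] [0.5 0. ]]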
Finally, we test the algorithm:
def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

Now we want to classify a new person:
def classifyPerson():  # classify a new, unseen sample from user input
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])
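The key detail here is that the new sample must be normalized with the same minVals and ranges learned from the training data before it is handed to classify0. A minimal sketch of that one step, with made-up numbers:

from numpy import array

minVals = array([0.0, 0.0, 0.0])      # hypothetical column minima from autoNorm
ranges = array([90000.0, 20.0, 2.0])  # hypothetical column ranges from autoNorm
inArr = array([10000.0, 10.0, 0.5])   # a new person's three raw feature values
print((inArr - minVals) / ranges)     # [0.11111111 0.5        0.25      ] -- same scale as normMat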