Web scraping, 2020.10.4: XPath parsing examples

Technology, 2022-07-14

Case 1: Fetching images with XPath parsing

Note: dealing with garbled Chinese text.

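After receiving the response, set its encoding manually: response.encoding = 'utf-8'

Where text still comes out garbled, re-encode the affected string: img_name = img_name.encode('iso-8859-1').decode('gbk'). This works when the text was decoded as ISO-8859-1 although the page is actually GBK-encoded.

Also note that when xpath is called on an element that has already been selected (local parsing), the path must start with ./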
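The encode/decode round trip can be checked offline, without touching the site. The sketch below assumes the server sent GBK bytes and that they were decoded as ISO-8859-1, which is the fallback requests uses for text responses that declare no charset. The sample string is made up for illustration.

```python
# Hypothetical sample title; any Chinese string behaves the same way.
original = '动漫壁纸'

# What the server sends: the title encoded as GBK bytes.
raw_bytes = original.encode('gbk')

# What response.text would contain if those bytes were decoded as ISO-8859-1.
garbled = raw_bytes.decode('iso-8859-1')

# The fix from the note: recover the raw bytes, then decode them as GBK.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

ISO-8859-1 maps every byte to exactly one character, so encoding the garbled string back to ISO-8859-1 recovers the original bytes without loss. That is why the repair only works when the wrong decoding was ISO-8859-1.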

Scraping a single page:

    import os
    import requests
    from lxml import etree

    url = 'http://pic.netbian.com/4kdongman/'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    # To avoid garbled Chinese text, the response encoding can be set manually
    # response.encoding = 'utf-8'
    page_text = response.text
    tree = etree.HTML(page_text)
    # Each <a> in the list wraps one thumbnail <img>
    a_list = tree.xpath('//ul[@class="clearfix"]/li/a')
    if not os.path.exists('./picLib'):
        os.makedirs('./picLib')
    for a in a_list:
        # Local parsing: paths start with ./ so they are relative to this <a>
        img_url = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        img_name = a.xpath('./img/@alt')[0] + '.jpg'
        # General fix for garbled Chinese text
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_data = requests.get(img_url, headers=headers).content
        img_path = 'picLib/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded successfully!')
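To see why the local paths in the loop start with ./, here is a minimal sketch on inline markup. The HTML is hypothetical and only mirrors the ul/li/a/img structure the scraper expects. No network access is needed.

```python
from lxml import etree

# Hypothetical markup mirroring the structure the scraper relies on.
html = '''
<ul class="clearfix">
  <li><a href="/a.html"><img src="/a.jpg" alt="first"></a></li>
  <li><a href="/b.html"><img src="/b.jpg" alt="second"></a></li>
</ul>
'''
tree = etree.HTML(html)
a_list = tree.xpath('//ul[@class="clearfix"]/li/a')

# './' is relative to the current <a>, so each call sees only its own <img>.
relative = [a.xpath('./img/@alt')[0] for a in a_list]

# '//' starts at the document root even when called on an element,
# so the second <a> still gets every alt attribute in the page.
absolute = a_list[1].xpath('//img/@alt')

print(relative)
print(absolute)
```

With the absolute form, taking index [0] inside the loop would return the first image's attributes on every iteration, so every download would target the same file.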

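Scraping multiple pages. Note that on this site the URL of the first page does not follow the same pattern as the later pages, so the first page cannot be fetched with the index_{page}.html pattern used here.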

    import os
    import requests
    from lxml import etree

    if not os.path.exists('./picLib'):
        os.makedirs('./picLib')
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    }
    # Page 1 does not follow this URL pattern on this site, so it yields no images
    for page in range(1, 4):
        url = f'http://pic.netbian.com/4kdongman/index_{page}.html'
        response = requests.get(url, headers=headers)
        # To avoid garbled Chinese text, the response encoding can be set manually
        # response.encoding = 'utf-8'
        page_text = response.text
        tree = etree.HTML(page_text)
        a_list = tree.xpath('//ul[@class="clearfix"]/li/a')
        for a in a_list:
            # Local parsing: paths start with ./ so they are relative to this <a>
            img_url = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
            img_name = a.xpath('./img/@alt')[0] + '.jpg'
            # General fix for garbled Chinese text
            img_name = img_name.encode('iso-8859-1').decode('gbk')
            img_data = requests.get(img_url, headers=headers).content
            img_path = 'picLib/' + img_name
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded successfully!')
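One possible way around the first-page mismatch is to build the URL per page. This is a sketch, assuming page 1 lives at the section root URL used in the single-page example. The helper name page_url and the constant BASE_URL are hypothetical and not part of the original code.

```python
# Section root, taken from the single-page example.
BASE_URL = 'http://pic.netbian.com/4kdongman/'


def page_url(page: int) -> str:
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    if page == 1:
        # Assumption: the first page is the section root itself.
        return BASE_URL
    # Later pages follow the index_{page}.html pattern from the loop above.
    return f'{BASE_URL}index_{page}.html'


for page in range(1, 4):
    print(page_url(page))
```

In the multi-page loop, replacing the f-string with url = page_url(page) would leave the rest of the code unchanged.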

Case 2: Scraping the names of cities nationwide

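Note: xpath() accepts the "|" (or) operator, which lets several expressions be combined in a single call. Put a space on each side of the "|" between expressions.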
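The original gives no code for this case, so the following is only a sketch of the "|" operator on made-up markup. The class names hot and all, and the city names, are assumptions chosen for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical page fragment: one block of popular cities, one of all cities.
html = '''
<div class="hot"><ul><li><a>Beijing</a></li><li><a>Shanghai</a></li></ul></div>
<div class="all"><ul><li><a>Chengdu</a></li><li><a>Wuhan</a></li></ul></div>
'''
tree = etree.HTML(html)

# Two expressions joined with " | "; a node is selected if either one matches.
a_list = tree.xpath('//div[@class="hot"]//a | //div[@class="all"]//a')

# The union comes back as a single list in document order.
city_names = [a.text for a in a_list]
print(city_names)
```

A single union query avoids running two separate xpath() calls and merging the lists by hand.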
