Crawling the First Ten Pages of Gushiwen.cn (古诗文网) with Scrapy


    Overview

    This article walks through crawling the first ten pages of Gushiwen.cn with Scrapy: creating the Scrapy project, configuring the project settings, writing the spider class, defining the fields to scrape, saving the data, setting up multi-page crawling (in gsww_spider.py), and looking at the program's output.


    Creating the Scrapy project

    Use cmd to create a new crawler project:

    scrapy startproject gsww  # create a new project

    Then change into the project directory and generate the spider:

    cd gsww
    scrapy genspider gsww_spider www.gushiwen.cn
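
    After these two commands, Scrapy generates the usual project skeleton, roughly like this (a sketch of the default layout; __init__.py files omitted). Only settings.py, items.py, pipelines.py, and the generated spider file are edited in this article:

    gsww/
        scrapy.cfg
        gsww/
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                gsww_spider.py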

    Configuring the Scrapy project

    Add the following in settings.py:

    ROBOTSTXT_OBEY = False  # do not obey the robots.txt protocol

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/5',
        'Accept-Language': 'en',
    }  # set the default request headers

    ITEM_PIPELINES = {
        'gsww.pipelines.GswwPipeline': 300,
    }
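
    These are the only settings the article changes. If you want the crawl to be gentler on the site, Scrapy also lets you throttle requests in settings.py; the values below are purely illustrative and not part of the original tutorial:

    DOWNLOAD_DELAY = 1                    # wait one second between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 4    # limit parallel requests to the same domain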

    Writing the spider class

    import scrapy

    from gsww.items import GswwItem


    class GswwSpiderSpider(scrapy.Spider):
        name = 'gsww_spider'
        allowed_domains = ['www.gushiwen.cn']
        start_urls = ['https://www.gushiwen.cn/default_1.aspx']
        page = 1

        def myprint(self, value):
            print('=' * 30)
            print(value)
            print('=' * 30)

        def parse(self, response):
            gsw_divs = response.xpath("//div[@class='left']/div[@class='sons']")
            for gsw_div in gsw_divs:
                # self.myprint(type(response))  # response.xpath returns a SelectorList
                title = gsw_div.xpath('.//b/text()').getall()
                title = ''.join(title)
                # self.myprint(title)
                dynasty = gsw_div.xpath('.//p[@class="source"]/a[1]/text()').getall()
                dynasty = ''.join(dynasty)
                author = gsw_div.xpath('.//p[@class="source"]/a[2]/text()').getall()
                author = ''.join(author)
                # //text() selects every descendant text node under class='contson'
                content_list = gsw_div.xpath(".//div[@class='contson']//text()").getall()
                # self.myprint(content_list)
                content = "".join(content_list).strip()
                self.myprint(content)
                item = GswwItem(title=title, dynasty=dynasty, author=author, content=content)
                yield item
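
    If any of the selectors above return an empty list, it helps to test the XPath expressions interactively before running the full spider. The Scrapy shell is convenient for this (the site's markup may have changed since this article was written, so treat these expressions as a starting point):

    scrapy shell "https://www.gushiwen.cn/default_1.aspx"
    >>> response.xpath("//div[@class='left']/div[@class='sons']")
    >>> response.xpath("//div[@class='left']/div[@class='sons']//b/text()").getall()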

    Defining the fields to scrape

    import scrapy


    class GswwItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        dynasty = scrapy.Field()
        author = scrapy.Field()
        content = scrapy.Field()

    Saving the data

    import json


    class GswwPipeline:
        def open_spider(self, spider):
            # open the output file once when the spider starts
            self.fp = open("古诗文.txt", 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # write each item as one JSON object per line
            self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

        def close_spider(self, spider):
            self.fp.close()
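
    The custom pipeline above is what the article uses. As an aside, Scrapy's built-in feed exports can also dump items without any pipeline code, for example (the output filename is arbitrary):

    scrapy crawl gsww_spider -o poems.jl   # one JSON object per line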

    Setting up multi-page crawling (in gsww_spider.py)

    # Append this to the end of parse() in gsww_spider.py
    next_url = response.xpath('//a[@id="amore"]/@href').get()
    print(next_url)
    if not next_url:
        return
    else:
        # The url passed to scrapy.Request must be a str, which is why get() is used
        # above rather than getall() (getall() returns a list).
        yield scrapy.Request('https://www.gushiwen.cn' + next_url, callback=self.parse)
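
    As written, this keeps following the next-page link until the site runs out of pages. To stop after the first ten pages, as the title suggests, you can reuse the page counter already declared on the spider; a minimal sketch, assuming the attribute keeps the name page from above:

    next_url = response.xpath('//a[@id="amore"]/@href').get()
    if next_url and self.page < 10:
        self.page += 1   # count the pages that have been scheduled
        yield scrapy.Request('https://www.gushiwen.cn' + next_url, callback=self.parse)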

    Running the program

    To make the spider easier to run, we can write a small launcher script and put it in the project's root directory:

    from scrapy import cmdline  # import the cmdline module from scrapy

    # call cmdline.execute to run the crawl command
    cmdline.execute("scrapy crawl gsww_spider".split(' '))
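
    cmdline.execute is the approach the article takes. If you would rather not go through the command line at all, Scrapy's CrawlerProcess API can start the spider from plain Python; this is a sketch and not part of the original post:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # load settings.py so the pipeline and request headers are applied
    process = CrawlerProcess(get_project_settings())
    process.crawl('gsww_spider')   # refer to the spider by its name
    process.start()                # blocks until the crawl finishes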

    Sample output:
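
    The spider prints each poem to the console via myprint, and the pipeline writes one JSON object per line to 古诗文.txt, so each line looks roughly like this (the field values are placeholders, not real scraped data):

    {"title": "<poem title>", "dynasty": "<dynasty>", "author": "<author name>", "content": "<poem text>"}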

    Finally, here is the link to the source code: https://download.csdn.net/download/qiaoenshi/12913580
