Crawling the First Ten Pages of Gushiwen.cn (古诗文网) with Scrapy


    Overview

    This article walks through crawling the first ten pages of Gushiwen.cn with Scrapy: creating the Scrapy project, configuring the project settings, writing the spider class, defining the fields to scrape, saving the data, setting up multi-page crawling (in gsww_spider.py), and looking at the program's output.


    Creating the Scrapy project

    Use cmd to create a new crawler project:

    scrapy startproject gsww  # create a new project

    Then change into the project directory and generate the spider:

    cd gsww
    scrapy genspider gsww_spider www.gushiwen.cn
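
    After these two commands, Scrapy generates the usual project skeleton, roughly like this (a sketch of the default layout; __init__.py files omitted). Only settings.py, items.py, pipelines.py, and the generated spider file are edited in this article:

    gsww/
        scrapy.cfg
        gsww/
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                gsww_spider.py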

    Configuring the Scrapy project

    Add the following in settings.py:

    ROBOTSTXT_OBEY = False  # do not obey the robots.txt protocol

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/5',
        'Accept-Language': 'en',
    }  # set the default request headers

    ITEM_PIPELINES = {
        'gsww.pipelines.GswwPipeline': 300,
    }
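
    These are the only settings the article changes. If you want the crawl to be gentler on the site, Scrapy also lets you throttle requests in settings.py; the values below are purely illustrative and not part of the original tutorial:

    DOWNLOAD_DELAY = 1                    # wait one second between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 4    # limit parallel requests to the same domain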

    Writing the spider class

    import scrapy

    from gsww.items import GswwItem


    class GswwSpiderSpider(scrapy.Spider):
        name = 'gsww_spider'
        allowed_domains = ['www.gushiwen.cn']
        start_urls = ['https://www.gushiwen.cn/default_1.aspx']
        page = 1

        def myprint(self, value):
            print('=' * 30)
            print(value)
            print('=' * 30)

        def parse(self, response):
            gsw_divs = response.xpath("//div[@class='left']/div[@class='sons']")
            for gsw_div in gsw_divs:
                # self.myprint(type(response))  # response.xpath returns a SelectorList
                title = gsw_div.xpath('.//b/text()').getall()
                title = ''.join(title)
                # self.myprint(title)
                dynasty = gsw_div.xpath('.//p[@class="source"]/a[1]/text()').getall()
                dynasty = ''.join(dynasty)
                author = gsw_div.xpath('.//p[@class="source"]/a[2]/text()').getall()
                author = ''.join(author)
                # //text() selects every descendant text node under class='contson'
                content_list = gsw_div.xpath(".//div[@class='contson']//text()").getall()
                # self.myprint(content_list)
                content = "".join(content_list).strip()
                self.myprint(content)
                item = GswwItem(title=title, dynasty=dynasty, author=author, content=content)
                yield item
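
    If any of the selectors above return an empty list, it helps to test the XPath expressions interactively before running the full spider. The Scrapy shell is convenient for this (the site's markup may have changed since this article was written, so treat these expressions as a starting point):

    scrapy shell "https://www.gushiwen.cn/default_1.aspx"
    >>> response.xpath("//div[@class='left']/div[@class='sons']")
    >>> response.xpath("//div[@class='left']/div[@class='sons']//b/text()").getall()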

    Defining the fields to scrape

    import scrapy


    class GswwItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        dynasty = scrapy.Field()
        author = scrapy.Field()
        content = scrapy.Field()

    Saving the data

    import json


    class GswwPipeline:
        def open_spider(self, spider):
            # open the output file once when the spider starts
            self.fp = open("古诗文.txt", 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # write each item as one JSON object per line
            self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

        def close_spider(self, spider):
            self.fp.close()
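
    The custom pipeline above is what the article uses. As an aside, Scrapy's built-in feed exports can also dump items without any pipeline code, for example (the output filename is arbitrary):

    scrapy crawl gsww_spider -o poems.jl   # one JSON object per line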

    Setting up multi-page crawling (in gsww_spider.py)

    # Append this to the end of parse() in gsww_spider.py
    next_url = response.xpath('//a[@id="amore"]/@href').get()
    print(next_url)
    if not next_url:
        return
    else:
        # The url passed to scrapy.Request must be a str, which is why get() is used
        # above rather than getall() (getall() returns a list).
        yield scrapy.Request('https://www.gushiwen.cn' + next_url, callback=self.parse)
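
    As written, this keeps following the next-page link until the site runs out of pages. To stop after the first ten pages, as the title suggests, you can reuse the page counter already declared on the spider; a minimal sketch, assuming the attribute keeps the name page from above:

    next_url = response.xpath('//a[@id="amore"]/@href').get()
    if next_url and self.page < 10:
        self.page += 1   # count the pages that have been scheduled
        yield scrapy.Request('https://www.gushiwen.cn' + next_url, callback=self.parse)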

    Running the program

    To make the spider easier to run, we can write a small launcher script and put it in the project's root directory:

    from scrapy import cmdline  # import the cmdline module from scrapy

    # call cmdline.execute to run the crawl command
    cmdline.execute("scrapy crawl gsww_spider".split(' '))
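
    cmdline.execute is the approach the article takes. If you would rather not go through the command line at all, Scrapy's CrawlerProcess API can start the spider from plain Python; this is a sketch and not part of the original post:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # load settings.py so the pipeline and request headers are applied
    process = CrawlerProcess(get_project_settings())
    process.crawl('gsww_spider')   # refer to the spider by its name
    process.start()                # blocks until the crawl finishes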

    Sample output:
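
    The spider prints each poem to the console via myprint, and the pipeline writes one JSON object per line to 古诗文.txt, so each line looks roughly like this (the field values are placeholders, not real scraped data):

    {"title": "<poem title>", "dynasty": "<dynasty>", "author": "<author name>", "content": "<poem text>"}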

    Finally, here is the link to the source code: https://download.csdn.net/download/qiaoenshi/12913580
