Overview
Using Scrapy to scrape the first ten pages of gushiwen.cn: create the Scrapy project, configure it, write the spider class, define the scraped fields, save the data, set up multi-page crawling (in gsww_spider.py), and run the program.
Scraping the first ten pages of gushiwen.cn with Scrapy
Creating the Scrapy project
Create a crawler project from the command line:
scrapy startproject gsww
Then enter the project directory and generate the spider:
cd gsww
scrapy genspider gsww_spider www.gushiwen.cn
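For reference, scrapy startproject generates the standard Scrapy layout below; the spider file appears after the genspider step:

gsww/
    scrapy.cfg
    gsww/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            gsww_spider.py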
Configuring the Scrapy project
Set the following in settings.py:
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'gsww.pipelines.GswwPipeline': 300,
}
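These are not part of the original setup, but if the site throttles your requests, Scrapy's standard rate-limiting options can be added in the same file; DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED are built-in Scrapy settings:

# Optional politeness settings (my addition, not part of the original project)
DOWNLOAD_DELAY = 1           # wait one second between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server load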
Writing the spider class
import scrapy
from gsww.items import GswwItem


class GswwSpiderSpider(scrapy.Spider):
    name = 'gsww_spider'
    allowed_domains = ['www.gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']
    page = 1

    def myprint(self, value):
        # Helper that prints a value between '=' separators for readability.
        print('=' * 30)
        print(value)
        print('=' * 30)

    def parse(self, response):
        # Each poem sits in a div.sons block inside the left column.
        gsw_divs = response.xpath("//div[@class='left']/div[@class='sons']")
        for gsw_div in gsw_divs:
            title = ''.join(gsw_div.xpath('.//b/text()').getall())
            dynasty = ''.join(gsw_div.xpath('.//p[@class="source"]/a[1]/text()').getall())
            author = ''.join(gsw_div.xpath('.//p[@class="source"]/a[2]/text()').getall())
            # The poem body may be split across several text nodes, so join them.
            content_list = gsw_div.xpath(".//div[@class='contson']//text()").getall()
            content = ''.join(content_list).strip()
            self.myprint(content)
            item = GswwItem(title=title, dynasty=dynasty, author=author, content=content)
            yield item
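If the XPath expressions come back empty (the site's markup can change), it helps to verify them interactively with scrapy shell before running the full crawl; a quick check might look like this:

scrapy shell https://www.gushiwen.cn/default_1.aspx
>>> response.xpath("//div[@class='left']/div[@class='sons']//b/text()").getall()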
Defining the scraped fields
import scrapy


class GswwItem(scrapy.Item):
    title = scrapy.Field()
    dynasty = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
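A scrapy.Item behaves like a dict, which is what the dict(item) call in the pipeline below relies on; a minimal sketch (the sample values are mine):

item = GswwItem(title='静夜思', dynasty='唐代', author='李白', content='床前明月光…')
print(dict(item))  # {'title': '静夜思', 'dynasty': '唐代', ...}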
Saving the data
In pipelines.py:
import json


class GswwPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts.
        self.fp = open("古诗文.txt", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line (JSON lines).
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
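As an aside, for simple JSON-lines output like this you could also skip the custom pipeline and use Scrapy's built-in feed export from the command line (poems.jl is a filename of my choosing; set FEED_EXPORT_ENCODING = 'utf-8' in settings.py to keep the Chinese text unescaped):

scrapy crawl gsww_spider -o poems.jl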
Setting up multi-page crawling (in gsww_spider.py)
Append the following at the end of the parse() method:
next_url = response.xpath('//a[@id="amore"]/@href').get()
print(next_url)
if not next_url:
    # No "more" link means we are on the last page.
    return
else:
    # The href is relative, so prepend the site root.
    yield scrapy.Request('https://www.gushiwen.cn' + next_url, callback=self.parse)
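An equivalent variant (my rewrite, not the original code) uses response.follow, which resolves the relative href against the current page URL automatically:

if next_url:
    # response.follow accepts relative URLs, so no manual concatenation is needed
    yield response.follow(next_url, callback=self.parse)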
Running the program
To make running the crawler more convenient, you can write a small launcher script and put it in the project root:
from scrapy import cmdline

cmdline.execute("scrapy crawl gsww_spider".split(' '))
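Assuming the script is saved as start.py (the filename is my choice), running python start.py from the project root is equivalent to typing the scrapy crawl command yourself.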
Sample run: the spider prints each poem's content between rows of '=' characters and writes one JSON object per line to 古诗文.txt.
Finally, here is the link to the source code: https://download.csdn.net/download/qiaoenshi/12913580