Scraping image links with Scrapy and saving them to MySQL

Tech, 2025-12-27

Background

Approach

Code

item file

spider file

pipelines file

settings file

Summary

Background

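I didn't go travelling this National Day holiday. After running my errands I stayed home, watched some Scrapy tutorials on Bilibili, and tried crawling a few sites with Scrapy myself. Here I share my approach and code; if anything is wrong, corrections are welcome. This article is for learning and discussion only and must not be used for any illegal or commercial purpose.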

Approach

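The site crawled here is 尤果网 (www.ugirl.com). As shown in the figure, the page consists of 9 cover images, each with a title, a model name, a photo style and an upload date. Pagination links on the page switch between pages. These fields are what we scrape, and afterwards the cover images themselves are downloaded to local disk.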

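The next question is how to get at this information. Open the browser's developer tools and click the element-picker arrow in the top-left corner of the panel. Hovering over any cover image shows that all of its information sits under a div tag. So the job is to extract the data from the current page and, once that is done, move on to the next page. The overall flow is shown in the figure below.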
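The div-based extraction described above can be tried offline before writing the spider. The following is a minimal sketch that uses only the standard library; the markup in SAMPLE is a hypothetical, simplified stand-in for the real page, and the helper names text_of and extract_entries are made up for illustration. The paths mirror the ones used in the spider further down.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the real page is not reproduced here.
SAMPLE = """
<html><body>
<div id="gallery-box">
  <div>
    <img src="https://example.com/cover-1.jpg"/>
    <aside>
      <h3>Sample title</h3>
      <p>Sample name</p>
      <p><span> Sample style </span></p>
      <p>2020-10-01</p>
    </aside>
  </div>
</div>
</body></html>
"""


def text_of(node, path):
    """Return the stripped text of the first match, or None if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def extract_entries(markup):
    """Collect one dict per child div of the gallery container."""
    root = ET.fromstring(markup)
    entries = []
    for div in root.findall(".//div[@id='gallery-box']/div"):
        img = div.find("./img")
        entries.append({
            "url": img.get("src") if img is not None else None,
            "title": text_of(div, "./aside/h3"),
            "name": text_of(div, "./aside/p[1]"),
            "style": text_of(div, "./aside/p[2]/span"),
            "date": text_of(div, "./aside/p[3]"),
        })
    return entries


entries = extract_entries(SAMPLE)
```

ElementTree only accepts well-formed XML and a subset of XPath, so this is a way to check the path logic, not a replacement for Scrapy's selectors on real HTML.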

Code

item file

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YouguowangItem(scrapy.Item):
    title = scrapy.Field()  # title
    date = scrapy.Field()   # upload date
    name = scrapy.Field()   # model name
    style = scrapy.Field()  # style
    url = scrapy.Field()    # image URL

spider file

This part fetches the page data and handles pagination.

# -*- coding: utf-8 -*-
import scrapy

from youguowang.items import YouguowangItem


class YouguoSpider(scrapy.Spider):
    name = 'youguo'
    allowed_domains = ['www.ugirl.com']
    start_urls = ['https://www.ugirl.com/meinvtupian/']

    def parse(self, response):
        # Each child div of #gallery-box holds one cover image and its metadata
        div_list = response.xpath('//div[@id="gallery-box"]/div')
        for div in div_list:
            item = YouguowangItem()
            item['url'] = div.xpath('./img/@src').extract_first()
            item['title'] = div.xpath('./aside/h3/text()').extract_first()
            item['name'] = div.xpath('./aside/p/text()').extract_first()
            item['style'] = div.xpath('./aside/p[2]/span/text()').extract_first()
            if item['style'] is not None:
                item['style'] = item['style'].strip()
            item['date'] = div.xpath('./aside/p[3]/text()').extract_first()
            yield item

        # Pagination: request pages 2 to 11. No meta is passed, because
        # parse() builds a fresh item for every div and never reads it.
        for i in range(2, 12):
            next_url = 'https://www.ugirl.com/meinvtupian/p-{}.html'.format(i)
            yield scrapy.Request(next_url, callback=self.parse)
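Because parse is also the callback for every later page, each page asks for pages 2 to 11 again. Scrapy filters duplicate requests by default, so each URL is still fetched only once. The sketch below is a toy model of that behaviour, not Scrapy's actual filter (which works on request fingerprints); page_urls and schedule are hypothetical helper names.

```python
BASE_URL = "https://www.ugirl.com/meinvtupian/"


def page_urls(first_page=2, last_page=11):
    """Build listing URLs for pages first_page..last_page (inclusive)."""
    return [
        "{}p-{}.html".format(BASE_URL, page)
        for page in range(first_page, last_page + 1)
    ]


def schedule(urls, seen):
    """Return only unseen URLs and record them, like a duplicate filter."""
    fresh = [u for u in urls if u not in seen]
    seen.update(fresh)
    return fresh


seen = set()
first_batch = schedule(page_urls(), seen)   # requested from the first page
second_batch = schedule(page_urls(), seen)  # requested again from a later page
```

The first call schedules ten URLs and the second schedules none, which is why the repeated loop in parse does not cause an endless crawl.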

pipelines file

This file downloads the images and saves the data. MySQL is used for storage, so the database and the table must be created beforehand.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# Download the images
class YouguowangPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # item['url'] is a single URL, so one request is enough
        yield scrapy.Request(item['url'])

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item


# Save to the database
class YouguowangPipelineB(object):
    def open_spider(self, spider):
        # Use your own user name and password; youguowang is the database name.
        # One connection is opened per crawl instead of one per item.
        self.db = pymysql.connect(host='localhost', user='root',
                                  password='sl-1006', database='youguowang')
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()

    def process_item(self, item, spider):
        # Parameterized MySQL insert statement
        insert_sql = ("INSERT INTO bole (title, url, date, name, style) "
                      "VALUES (%s, %s, %s, %s, %s)")
        try:
            self.cursor.execute(insert_sql, (
                item['title'], item['url'], item['date'],
                item['name'], item['style']))
            self.db.commit()
            print('Inserted')
        except Exception as e:
            print('Skipping bad row:', e)
            self.db.rollback()
        # Return the item so any later pipeline still receives it
        return item
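The table has to exist before the crawl starts, and its definition is not shown in this post. As a rough sketch of the table layout and of the insert, commit and rollback pattern, the example below uses the standard library's sqlite3 as a stand-in for MySQL so that it runs without a server. The column types are assumptions, save_row is a hypothetical helper, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# Column types are assumptions; the original table definition is not shown.
CREATE_SQL = """
CREATE TABLE bole (
    title TEXT,
    url   TEXT,
    date  TEXT,
    name  TEXT,
    style TEXT
)
"""
# sqlite3 uses "?" placeholders; pymysql uses "%s" for the same purpose.
INSERT_SQL = (
    "INSERT INTO bole (title, url, date, name, style) "
    "VALUES (?, ?, ?, ?, ?)"
)


def save_row(conn, item):
    """Insert one item; roll back and return False if the insert fails."""
    try:
        conn.execute(INSERT_SQL, (item["title"], item["url"], item["date"],
                                  item["name"], item["style"]))
        conn.commit()
        return True
    except (sqlite3.Error, KeyError):
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)
ok = save_row(conn, {
    "title": "Sample title",
    "url": "https://example.com/cover-1.jpg",
    "date": "2020-10-01",
    "name": "Sample name",
    "style": "Sample style",
})
bad = save_row(conn, {"title": "missing the other fields"})
rows = conn.execute("SELECT title, style FROM bole").fetchall()
```

Passing the values as parameters rather than formatting them into the SQL string keeps quotes in titles from breaking the statement.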

settings file

Project configuration: set the request headers and enable the item pipelines.

BOT_NAME = 'youguowang'

SPIDER_MODULES = ['youguowang.spiders']
NEWSPIDER_MODULE = 'youguowang.spiders'

LOG_LEVEL = "WARNING"

ROBOTSTXT_OBEY = True

# Image filter: images below this minimum height or width are not downloaded
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
}

# Directory where downloaded images are stored
IMAGES_STORE = 'D:\\ImageSpider'

# ITEM_PIPELINES is assigned only once: an earlier assignment listing the
# built-in ImagesPipeline and FilesPipeline was overwritten by this one
# and had no effect, so it has been removed.
ITEM_PIPELINES = {
    'youguowang.pipelines.YouguowangPipeline': 300,
    'youguowang.pipelines.YouguowangPipelineB': 301,
}
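The numbers in ITEM_PIPELINES set the order: Scrapy passes each item through the pipelines from the lowest value to the highest. With 300 and 301, the image pipeline runs first, and an item it drops for having no image never reaches the database pipeline. The sketch below only illustrates that ordering rule; run_order is a hypothetical helper, not part of Scrapy.

```python
ITEM_PIPELINES = {
    'youguowang.pipelines.YouguowangPipeline': 300,
    'youguowang.pipelines.YouguowangPipelineB': 301,
}


def run_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(ITEM_PIPELINES)
```

Swapping the two values would write every item to MySQL before its image download is attempted.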

Results:

Terminal output

MySQL data

Downloaded images:

Summary

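Scrapy is a very powerful crawling framework. When I first started, I wasn't familiar with the different files and what each one is for, but the deeper I went, the more useful Scrapy turned out to be, as if a door to a new world had opened. I'll leave you with a saying I have come to appreciate recently: "The more you know, the less you know."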
