The reference code is as follows:
# imports used by the crawling code below
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup
from fake_useragent import UserAgent


def main():
    job_list = []
    key = "数据挖掘"  # search keyword ("data mining")
    # dqs: the region codes liepin.com uses in its query string
    dqs = ["010", "020", "050020", "050090", "030", "060080", "040",
           "060020", "070020", "210040", "280020", "170020"]
    new_key = urllib.parse.quote(key, 'utf-8')
    for item in dqs:
        url = "https://www.liepin.com/zhaopin/?key=" + new_key + "&dqs=" + item
        print(url)
        # fetch the job-list page
        job_html = get_job_html(url)
        # parse the page and extract the job links
        link_list = get_job_link(job_html)
        # collect the links into one list
        for i in link_list:
            job_list.append(i)
    # save the job links to a spreadsheet
    save_link(job_list)

Reference code:
def get_job_html(url):
    print("-------爬取job网页-------")  # crawling the job page
    html = ""
    # head: request headers that imitate a real browser;
    # "User-Agent" identifies the browser (randomized on every request)
    head = {
        "User-Agent": UserAgent().random
    }
    request = urllib.request.Request(url=url, headers=head)
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except Exception as e:
        return None
    return html

Analyzing the page elements to extract the data
Since each page holds only 40 records, the crawler has to obtain the link to the next page automatically. By locating the "next page" element in the page and extracting its link, the pages can be crawled recursively.
The reference code is as follows:
def get_job_link(html):
    job_link = []
    # parse the page to extract the links
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all('h3'):
        if item.has_attr("title"):
            # pull the link out of the first <a> tag
            link = item.find_all("a")[0]["href"]
            job_link.append(link)
            print(link)
    try:
        find_next_link = soup.select(".pager > div.pagerbar > a")[7]['href']
        # "javascript:" means there is no next page
        if find_next_link == "javascript:":
            return job_link
        # prepend the domain (and repair a mis-encoded character)
        find_next_link = "https://www.liepin.com" + str(find_next_link).replace('°', '0')
        print(find_next_link)
        # fetch the next page
        next_html = get_job_html(find_next_link)
        # and parse it recursively
        if next_html is not None:
            next_link = get_job_link(next_html)
            for link in next_link:
                job_link.append(link)
    except Exception as e:
        print(e)
    finally:
        return job_link

After the job links are saved to a spreadsheet, the next step is to visit each of these links, crawl the detail pages, and store the results in a database.
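The spreadsheet helpers save_link (called in main above) and read_excel_get_link (called in the next step) are not shown in this section. A minimal sketch using openpyxl; the library choice is an assumption, any Excel library would do:

# hypothetical sketch of the spreadsheet helpers; the original post
# does not show them, and openpyxl is just one reasonable choice
from openpyxl import Workbook, load_workbook

def save_link(job_list, path="job_links.xlsx"):
    wb = Workbook()
    ws = wb.active
    for link in job_list:
        ws.append([link])  # one link per row, first column
    wb.save(path)

def read_excel_get_link(path="job_links.xlsx"):
    ws = load_workbook(path).active
    # collect the first-column cell values back into a list
    return [row[0] for row in ws.iter_rows(values_only=True) if row[0]]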
Right at the start of crawling, a pattern shows up in the links: there are two kinds. The first kind is a normal absolute URL that can be visited directly; the second kind is missing the domain, so we have to prepend it ourselves.
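The framework below handles this with a simple first-character check. An alternative sketch using urllib.parse.urljoin, which treats both kinds of links uniformly (the example path is illustrative):

from urllib.parse import urljoin

# urljoin leaves absolute URLs untouched and prepends the base otherwise
def normalize_link(link, base="https://www.liepin.com"):
    return urljoin(base, link)

# normalize_link("/job/123.shtml") -> "https://www.liepin.com/job/123.shtml"
# normalize_link("https://www.liepin.com/job/123.shtml") stays unchanged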
Building the basic framework:
def main():
    # read the links back from the spreadsheet
    links = read_excel_get_link()
    # fetch each linked page
    for i in range(0, len(links)):
        if links[i][0] != 'h':
            links[i] = "https://www.liepin.com" + links[i]
        print(links[i])
        # fetch the page (getLink is the module from the previous step)
        message_html = getLink.get_job_html(links[i])
        if message_html is not None:
            # parse the data
            message_data = get_message_data(message_html)
        else:
            continue
        # save one record
        try:
            save_datas_sql(message_data)
        except Exception as e:
            continue

message_html = getLink.get_job_html(links[i])
This line calls get_job_html, the same function used above when collecting the job links. Now look at the page elements:
Tag selectors are used to locate the elements. Note that escape characters can occasionally cause problems during crawling. Reference code:
import bs4  # needed for the NavigableString type check below


def get_message_data(html):
    data = []
    soup = BeautifulSoup(html, "html.parser")
    try:
        # job title
        title = soup.select(".title-info > h1")[0]['title']
        data.append(title)
        # company
        company = soup.select(".title-info > h3 > a")
        if len(company) != 0:
            company = company[0]['title']
        else:
            company = " "
        data.append(company)
        # salary
        salary = soup.select(".job-title-left > p")
        if len(salary) != 0:
            salary = salary[0].contents[0]
        else:
            salary = " "
        salary = salary \
            .replace('\n', '') \
            .replace('\t', '') \
            .replace('\r', '') \
            .replace(' ', '') \
            .replace('"', '')
        data.append(salary)
        # description: concatenate the plain-text nodes only
        description = soup.select(".content.content-word")
        if len(description) != 0:
            all_des = description[0].contents
            description = " "
            for item in all_des:
                if type(item) == bs4.element.NavigableString:
                    description = description + item
        else:
            description = " "
        description = description \
            .replace('\n', '') \
            .replace('\t', '') \
            .replace('\r', '') \
            .replace(' ', '') \
            .replace('"', '')
        data.append(description)
    except Exception as e:
        print(e)
    finally:
        print(data)
        return data

Code to create the database:
import sqlite3


# table-creation statement
def init_job_sqlite():
    connet = sqlite3.connect("job_message.db")  # open (or create) the database file
    c = connet.cursor()  # get a cursor
    sql = '''
        create table if not exists job_message(
            id integer not null primary key autoincrement,
            title text not null,
            company text,
            salary text,
            description text
        )
    '''
    c.execute(sql)  # execute the SQL statement
    connet.commit()  # commit
    connet.close()  # close the database

Insert the data into the database to store it:
def save_datas_sql(data):
    init_job_sqlite()  # initialize the database
    # insert the record
    connet = sqlite3.connect("job_message.db")
    c = connet.cursor()
    # wrap each field in double quotes so it can be spliced into the SQL
    for index in range(0, 4):
        data[index] = '"' + data[index] + '"'
    sql = '''insert into job_message(title,company,salary,description)
             values(%s)''' % ",".join(data)
    c.execute(sql)
    connet.commit()
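One caveat on save_datas_sql: splicing values into the SQL string breaks as soon as a field still contains a double quote, which is why get_message_data strips quotes beforehand. A safer sketch using sqlite3 parameter binding, which makes the manual quoting unnecessary:

import sqlite3

def save_datas_sql_safe(data):
    # sketch: same insert, but with "?" placeholders so sqlite3
    # escapes the values itself (no manual quoting needed)
    connet = sqlite3.connect("job_message.db")
    c = connet.cursor()
    c.execute(
        "insert into job_message(title, company, salary, description) "
        "values (?, ?, ?, ?)",
        data[:4],
    )
    connet.commit()
    connet.close()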
As for how the data in the database is presented on a static web page, I covered that in my previous study post on crawling Douban (爬取豆瓣笔记). Since there is far too much data, only the first 100 records are selected for display here.
For the Python side, refer to the code in app.py below.
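app.py itself is not reproduced in this section. A minimal sketch of what it might look like with Flask; the route and template file name are assumptions:

import sqlite3
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/")
def index():
    connet = sqlite3.connect("job_message.db")
    c = connet.cursor()
    # only the first 100 records are shown, as noted above
    c.execute("select id, title, company, salary, description "
              "from job_message limit 100")
    jobs = c.fetchall()
    connet.close()
    # "jobs" feeds the table template below
    return render_template("index.html", jobs=jobs)

if __name__ == "__main__":
    app.run()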
The key front-end code is as follows:
<!-- columns: id, position, company, salary, job description -->
<table class="table table-hover table-light">
    <tr>
        <td>id</td>
        <td>职位</td>
        <td>公司</td>
        <td>工资</td>
        <td>职位描述</td>
    </tr>
    {% for job in jobs %}
    <tr>
        <td>{{ job[0] }}</td>
        <td>{{ job[1] }}</td>
        <td>{{ job[2] }}</td>
        <td>{{ job[3] }}</td>
        <td>{{ job[4] }}</td>
    </tr>
    {% endfor %}
</table>

Key front-end code for the chart:
<div id="main" style="width: 100%;height:450px;margin: 0 auto;"></div>
<script type="text/javascript">
    // initialize the echarts instance on the prepared DOM node
    var myChart = echarts.init(document.getElementById('main'));
    var data = {{ data }};
    option = {
        xAxis: {
            type: 'value',
            splitLine: {
                lineStyle: {
                    type: 'dashed'
                }
            },
            name: "年薪/万",  // x axis: annual salary, in units of 10k
            splitNumber: 10
        },
        yAxis: {
            type: 'value',
            name: "统计/个",  // y axis: number of occurrences
            splitLine: {
                lineStyle: {
                    type: 'dashed'
                }
            }
        },
        series: [{
            symbolSize: 10,
            data: data,
            type: 'scatter'
        }]
    };
    // render the chart with the option defined above
    myChart.setOption(option);
</script>
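The template expects {{ data }} to be a list of [x, y] pairs: an annual salary against how often it occurs. A sketch of how such pairs might be built on the Python side; the salary-parsing rule is an assumption, since the raw strings scraped from the site vary in format:

import re
import sqlite3
from collections import Counter

def build_salary_data():
    connet = sqlite3.connect("job_message.db")
    c = connet.cursor()
    c.execute("select salary from job_message")
    counter = Counter()
    for (salary,) in c.fetchall():
        # assumed format such as "15-25万"; take the first number found
        match = re.search(r"\d+", salary or "")
        if match:
            counter[int(match.group())] += 1
    connet.close()
    # list of [annual salary, count] pairs for the scatter series
    return [[value, count] for value, count in counter.items()]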