Regular Expressions
| Pattern | Meaning |
| --- | --- |
| `abc` | matches the literal string `abc` |
| `[…]` | matches any one character listed inside the brackets; `[0-9]` matches any single digit |
| `(abc\|李四\|小红)` | matches `abc`, `李四`, or `小红` |
| `(abc\|李四\|小红){2,3}` | matches the alternation 2 or 3 times |
| `^abc` | anchors the match at the start: the string must begin with `abc` |
| `abc$` | anchors the match at the end: the string must end with `abc` |
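A quick check of these patterns with Python's built-in `re` module (introduced below); the sample strings are made up for illustration:

```python
import re

# Character class: [0-9] matches any single digit
print(re.findall(r'[0-9]', 'a1b2c3'))                 # ['1', '2', '3']

# Alternation: matches any one of the listed alternatives
print(re.findall(r'(?:abc|李四|小红)', 'abc和李四'))   # ['abc', '李四']

# Anchors: ^ pins the match to the start, $ to the end
print(bool(re.match(r'^abc', 'abcdef')))              # True
print(bool(re.search(r'abc$', 'xxabc')))              # True
```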
Metacharacters
| Symbol | Meaning |
| --- | --- |
| `.` | any single character |
| `\d` | a digit |
| `\s` | a whitespace character |
| `\b` | a word boundary (the edge of a word, where it meets a space or the ends of the string) |
| `\w` | a word character: a letter, digit, or underscore, and in Python even a Chinese character |
| `\D` | any character that is *not* a digit; in general, the uppercase form matches the opposite of the lowercase one |
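These metacharacters in action (sample strings are made up for illustration):

```python
import re

print(re.findall(r'\d', 'a1 b2'))             # ['1', '2']
print(re.findall(r'\s', 'a1 b2'))             # [' ']
print(re.findall(r'\w', 'a_中!'))             # ['a', '_', '中'] — \w covers Chinese characters too
print(re.findall(r'\D', 'a1b2'))              # ['a', 'b']

# \b keeps "cat" from also matching inside "catalog"
print(re.findall(r'\bcat\b', 'cat catalog'))  # ['cat']
```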
Quantifiers
| Symbol | Meaning |
| --- | --- |
| `a{3}` | matches `aaa` (exactly 3 repetitions) |
| `d{2,4}` | matches `dd`, `ddd`, or `dddd` |
| `{n,m}` | the preceding character must match n to m times |
| `+` | shorthand for `{1,}`: one or more times |
| `*` | shorthand for `{0,}`: any number of times, including zero |
| `?` | shorthand for `{0,1}`: zero or one time |
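Each quantifier on a small made-up sample:

```python
import re

print(re.findall(r'a{3}', 'aa aaa aaaa'))    # ['aaa', 'aaa'] — 'aa' is too short, 'aaaa' yields one match
print(re.findall(r'd{2,4}', 'd dd ddddd'))   # ['dd', 'dddd']
print(re.findall(r'ab+', 'a ab abbb'))       # ['ab', 'abbb'] — + needs at least one b
print(re.findall(r'ab*', 'a ab abbb'))       # ['a', 'ab', 'abbb'] — * allows zero b's
print(re.findall(r'ab?', 'a ab abbb'))       # ['a', 'ab', 'ab'] — ? takes at most one b
```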
The re module
```python
import re

s = "a123b456b789b"  # renamed from str to avoid shadowing the built-in

# One letter followed by one digit; note [a-zA-Z], not [a-zA-z]
reg = re.compile(r'[a-zA-Z]\d')
result = re.findall(reg, s)
print(result)  # ['a1', 'b4', 'b7']

# Greedy: .+ grabs as much as possible, up to the last b
reg2 = re.compile(r'a.+b')
print(re.findall(reg2, s))  # ['a123b456b789b']

# Non-greedy: .+? stops at the first b
reg3 = re.compile(r'a.+?b')
print(re.findall(reg3, s))  # ['a123b']
```
By default the re module matches greedily, consuming as many characters as possible; appending `?` to a quantifier (as in `.+?` above) makes it non-greedy, matching as few characters as possible.
The requests module
```python
import random  # needed for random.choice below

import requests
import chardet

url = "http://www.baidu.com"
user_agent = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
]

# Pick a random User-Agent so the request looks more like a browser
res = requests.get(url, headers={"User-Agent": random.choice(user_agent)})

# Detect the encoding from the raw bytes, then decode res.text with it
encode = chardet.detect(res.content)
res.encoding = encode.get('encoding')
print(res.text)
```
Common parameters of get():
url: the address of the page to request
headers: the browser header information sent with the request; must be a dict; optional
proxies: sets proxy IPs for the request; must be a dict; optional
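A sketch of how these parameters fit together; the proxy addresses below are placeholders, not real servers, so the actual request is left commented out:

```python
# Hypothetical proxy addresses — replace with real, working proxies before use
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# All three optional parameters are passed as keyword arguments:
# res = requests.get(url, headers=headers, proxies=proxies)
print(sorted(proxies))  # ['http', 'https']
```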
The BeautifulSoup module
```python
from bs4 import BeautifulSoup

html = """
<html>
<head> <title>The Dormouse's story</title> </head>
<body>
<h1><b>123456</b></h1>
<p class="title" name="dromouse">
<b>The Dormouse's story</b>
aaaaa
</p>
<p class="title" name="dromouse" title='new'><b>The Dormouse's story</b>a</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
<a href="http://example.com/tillie" class="siterr" id="link4">Tillie</a>;
<a href="http://example.com/tillie" class="siterr" id="link5">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
<ul id="ulone">
<li>01</li>
<li>02</li>
<li>03</li>
<li>04</li>
<li>05</li>
</ul>
<div class='div11'>
<ul id="ultwo">
<li> 0001 </li>
<li>0002</li>
<li>0003</li>
<li>0004</li>
<li>0005</li>
</ul>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# class is a Python keyword, so bs4 uses the class_ parameter instead
a = soup.find(class_='story').find('a')
print(a.attrs)  # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
```
find() returns only the first tag that matches; it never looks further.
To get every match, use find_all(), which returns a list of all matching tags.
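The difference is easy to see on a small fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>01</li><li>02</li><li>03</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li').text)                     # 01 — only the first match
print([li.text for li in soup.find_all('li')])  # ['01', '02', '03'] — all matches
```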
Scraping image links from 彼岸壁纸
Use requests + bs4 to scrape the wallpaper image links from the 彼岸壁纸 site and save them to a CSV file.
```python
import csv
import random

import requests
import chardet
from bs4 import BeautifulSoup


def get_html(url):
    user_agent = [
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    ]
    res = requests.get(url, headers={"User-Agent": random.choice(user_agent)})
    # Detect the encoding from the raw bytes so res.text decodes correctly
    encode = chardet.detect(res.content)
    res.encoding = encode.get('encoding')
    return res


if __name__ == "__main__":
    url = "http://www.netbian.com/"
    html = get_html(url).text
    soup = BeautifulSoup(html, "html.parser")
    # The thumbnails all live under <div class="list">
    img_list = soup.find('div', class_='list').find_all('img')
    # Renamed from list to avoid shadowing the built-in;
    # each row is a one-element list so csv writes one link per line
    links = [[img.attrs['src']] for img in img_list]
    # newline="" prevents blank lines between rows on Windows
    with open("壁纸链接.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(links)
```