1. Fetching images
We want to download every image from https://www.jianshu.com/p/1376959c3679.
1.1 Set the URL and request headers
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.jianshu.com/p/1376959c3679'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get(url, headers=headers)
html = r.text.encode(r.encoding).decode()
The figure below shows where to find the headers value.
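The expression `r.text.encode(r.encoding).decode()` in the code above can look odd at first. Here is a minimal sketch of what it does, assuming the server sent UTF-8 bytes but the response was decoded as ISO-8859-1. The sample string and the wrong encoding are assumptions for illustration, and no network request is made.

```python
# Bytes as a server might send them: UTF-8 encoded text.
raw_bytes = '户外风景独好'.encode('utf-8')

# Simulate a client that guessed the wrong encoding (assumed: ISO-8859-1).
guessed_encoding = 'ISO-8859-1'
garbled_text = raw_bytes.decode(guessed_encoding)

# Re-encode with the wrong guess to recover the original bytes,
# then decode them as UTF-8 (the default for bytes.decode()).
repaired_text = garbled_text.encode(guessed_encoding).decode()

print(repaired_text == '户外风景独好')
```

If the guessed encoding was already UTF-8, the round trip leaves the text unchanged, so the line is harmless in that case. It does assume the page really is UTF-8.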
1.2 Find the image tags and attributes
soup = BeautifulSoup(html, 'lxml')
imgs = soup.find_all(lambda tag: tag.name == 'img' and tag.has_attr('data-original-src'))
print(imgs)
srcs = [i.attrs['data-original-src'] for i in imgs]
print('{:*^110}'.format(''))
print(srcs)
sources = ['https:' + src for src in srcs]
print('{:*^110}'.format(''))
for i in sources:
    print(i)
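The same idea can be tried offline on a small piece of markup. This sketch uses only the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and it uses `urljoin` rather than string concatenation to turn protocol-relative addresses into full URLs. The HTML, the `LazyImageCollector` class, and the host name `img.example.com` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LazyImageCollector(HTMLParser):
    """Collect the data-original-src value of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == 'img' and 'data-original-src' in attr_map:
            self.srcs.append(attr_map['data-original-src'])


# Made-up markup; the host name is a placeholder, not the real image host.
sample_html = '''
<div><img data-original-src="//img.example.com/a.jpg"></div>
<div><img src="/static/logo.png"></div>
<div><img data-original-src="//img.example.com/b.png"></div>
'''

collector = LazyImageCollector()
collector.feed(sample_html)

# urljoin borrows the scheme from the page URL for "//host/path" addresses.
page_url = 'https://www.jianshu.com/p/1376959c3679'
sources = [urljoin(page_url, src) for src in collector.srcs]
for source in sources:
    print(source)
```

The second image is skipped because it has no `data-original-src` attribute, which mirrors the lambda filter used with BeautifulSoup.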
1.3 Save the images to a target folder
import os

# Join with os.path.join so the folder is created inside the working directory.
firedir = os.path.join(os.getcwd(), '户外风景独好')
if not os.path.exists(firedir):
    os.mkdir(firedir)
for i in range(len(sources)):
    rpi = requests.get(sources[i], headers=headers)
    if rpi.status_code == 200:
        # Name each file by its index and write the raw bytes.
        with open(firedir + '/%s.jpg' % i, mode='wb') as f:
            f.write(rpi.content)
        print('Downloading image %d ...' % i)
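The saving step always uses a `.jpg` suffix. A possible variation, sketched here under assumptions, takes the extension from the image URL and falls back to `.jpg` when there is none. The helper `build_image_path`, the placeholder URLs, and the fake byte strings are all hypothetical, and a temporary directory stands in for the real download folder so the example runs without network access.

```python
import os
import tempfile
from urllib.parse import urlparse


def build_image_path(folder, index, image_url, default_ext='.jpg'):
    """Return folder/<index><ext>, taking the extension from the URL path."""
    ext = os.path.splitext(urlparse(image_url).path)[1] or default_ext
    return os.path.join(folder, '%d%s' % (index, ext))


# A temporary directory stands in for the real download folder.
with tempfile.TemporaryDirectory() as base_dir:
    folder = os.path.join(base_dir, 'images')
    os.makedirs(folder, exist_ok=True)  # no error if it already exists

    # Placeholder URLs and bytes; a real run would write rpi.content instead.
    fake_downloads = [
        ('https://img.example.com/a.png?size=large', b'png-bytes'),
        ('https://img.example.com/b', b'jpg-bytes'),
    ]
    for index, (image_url, content) in enumerate(fake_downloads):
        path = build_image_path(folder, index, image_url)
        with open(path, mode='wb') as f:
            f.write(content)

    saved_files = sorted(os.listdir(folder))

print(saved_files)
```

`os.makedirs(..., exist_ok=True)` replaces the separate existence check, and `os.path.join` avoids building paths by hand.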
2. Fetching a table
We want to scrape the financial data table from 汇通网 (FX678) at https://rl.fx678.com/date/20201007.html, as shown in the figure below.
2.1 Set the URL and request headers
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
import numpy as np

url = 'http://rl.fx678.com/date/20201007.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get(url, headers=headers)
html = r.text.encode(r.encoding).decode()
soup = BeautifulSoup(html, 'lxml')
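2.2 Find the table tags and attributes
The figure above shows the table's tags and attributes. A tr is one row, a td is one cell (column) within a row, and a th is a header cell, which can be treated the same way as a td.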
<tr>
  <th>Header</th>
  <th>Header</th>
  <th>Header</th>
</tr>
<tr>
  <td>Content</td>
  <td>Content</td>
  <td>Content</td>
</tr>
table = soup.find('table', id='current_data')
table
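To see how tr, th, and td relate in practice, the following sketch splits a tiny table into header rows and data rows. It uses only the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages. The `TableRowParser` class and the sample table contents are made up for illustration; only the `current_data` id comes from the code above.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Split a table into header rows (th cells) and data rows (td cells)."""

    def __init__(self):
        super().__init__()
        self.header_rows = []
        self.data_rows = []
        self._row = None       # cells of the row being read
        self._kind = None      # 'th' or 'td', set by the first cell of the row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row, self._kind = [], None
        elif tag in ('th', 'td') and self._row is not None:
            self._kind = self._kind or tag
            self._row.append('')
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            if self._row:  # skip rows without any cells
                target = self.header_rows if self._kind == 'th' else self.data_rows
                target.append(self._row)
            self._row = None


# Made-up table contents for illustration.
sample_html = '''
<table id="current_data">
  <tr><th>Time</th><th>Indicator</th><th>Actual</th></tr>
  <tr><td>09:30</td><td>Sample index</td><td>1.2</td></tr>
  <tr><td>10:00</td><td>Another index</td><td>3.4</td></tr>
</table>
'''

parser = TableRowParser()
parser.feed(sample_html)
print(parser.header_rows)
print(parser.data_rows)
```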
2.3 Check the row and column counts and prepare the table
height = len(table.find_all(lambda tag: tag.name == 'tr' and len(tag.find_all('td')) >= 1))
print(height)
for row in table.find_all('tr'):
    print(len(row.find_all('td')), end='\t')
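Note: 57 rows contain data cells, but column counts are printed for 59 rows. The first two counts are 0 because those rows are header rows made of th cells. A count of 9 means the row has 9 td cells, and a count of 7 means it has only 7, which indicates that some cells are merged across rows. In Figure 1 the first two data rows share merged cells, so their counts are 9 and 7.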
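The reasoning in the note above can be written out as a few list operations. In this sketch the leading counts 0, 0, 9, 7 follow the note, while the remaining counts are made up for illustration. The names `header_rows` and `continuation_rows` are hypothetical.

```python
# Number of <td> cells in each <tr>, in page order. The leading 0, 0, 9, 7
# follow the note above; the remaining values are made up for illustration.
td_counts = [0, 0, 9, 7, 9, 9, 7]
width = 9  # number of header columns

# Rows with no td cells are header rows built from th cells.
header_rows = [i for i, n in enumerate(td_counts) if n == 0]

# Rows with fewer td cells than columns inherit merged cells from above.
continuation_rows = [i for i, n in enumerate(td_counts) if 0 < n < width]

# Rows with at least one td cell each need a line in the DataFrame.
height = sum(1 for n in td_counts if n >= 1)

print(header_rows)
print(continuation_rows)
print(height)
```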
columns = [x.text for x in table.tr.find_all('th')]
print(columns)
columns = [i.replace('\xa0', ' ') for i in columns]
print(columns)
width = len(columns)
df = pd.DataFrame(data=np.full((height, width), ' ', dtype='U'), columns=columns)
df
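The `'\xa0'` character replaced above is a non-breaking space, which is what an `&nbsp;` entity in the HTML source turns into once decoded. The sketch below shows that on made-up header strings and then builds an empty grid of the right shape. A plain list of lists stands in for the `np.full` array here so the example needs no extra packages, and the header names and `height` value are assumptions.

```python
import html

# Made-up header cells as they might appear in the raw HTML source.
raw_headers = ['Time', 'Prior&nbsp;value', 'Actual&nbsp;value']

# Decoding the entity yields the non-breaking space character '\xa0'.
decoded = [html.unescape(h) for h in raw_headers]

# Replace it with a normal space so the column names are easy to type.
columns = [h.replace('\xa0', ' ') for h in decoded]

width = len(columns)
height = 3  # assumed number of data rows

# Build each row separately; [[' '] * width] * height would repeat one
# shared row object, so a change to one row would show up in all of them.
grid = [[' '] * width for _ in range(height)]
grid[0][1] = 'x'

print(columns)
print(grid)
```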
2.4 Parse the table contents
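# Keep only rows that contain at least one td cell.
rows = [row for row in table.find_all('tr') if row.find('td') is not None]
for i in range(len(rows)):
    cells = rows[i].find_all('td')
    if len(cells) == width:
        # Full row: fill every column.
        df.iloc[i] = [cell.text.strip() for cell in cells]
        for j in range(len(cells)):
            if cells[j].has_attr('rowspan'):
                # Copy a merged cell down into every row it spans.
                z = int(cells[j].attrs['rowspan'])
                df.iloc[i:i + z, j] = [cells[j].text.strip()] * z
    else:
        # Short row: the leading columns were filled by the merged cells above,
        # so place the remaining cells in the rightmost columns.
        w = len(cells)
        df.iloc[i, width - w:] = [cell.text.strip() for cell in cells]
df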
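The filling logic is easier to follow on a small example without pandas. This sketch applies the same two branches to plain lists, where each cell is a `(text, rowspan)` tuple. The function `fill_grid` and the sample rows are made up for illustration. Like the code above, it assumes that merged cells sit in the leading columns.

```python
def fill_grid(rows, width):
    """Expand rows of (text, rowspan) tuples into a full height x width grid."""
    grid = [[''] * width for _ in rows]
    for i, cells in enumerate(rows):
        if len(cells) == width:
            for j, (text, rowspan) in enumerate(cells):
                # Copy the text down into every row the cell spans.
                for k in range(i, min(i + rowspan, len(rows))):
                    grid[k][j] = text
        else:
            # Short row: the missing leading columns were already filled
            # from the merged cells above, so right-align what is left.
            offset = width - len(cells)
            for j, (text, _) in enumerate(cells):
                grid[i][offset + j] = text
    return grid


# Made-up rows: the first row merges its first two cells over two rows.
sample_rows = [
    [('09:30', 2), ('USD', 2), ('Index A', 1), ('1.2', 1)],
    [('Index B', 1), ('3.4', 1)],
    [('10:00', 1), ('EUR', 1), ('Index C', 1), ('5.6', 1)],
]

for line in fill_grid(sample_rows, width=4):
    print(line)
```

A rowspan on a middle column would break the right-alignment in the short-row branch, so a table laid out that way would need a different approach.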