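A coming-of-age drama, 《风犬少年的天空》, has been very popular on Bilibili lately. I have watched up to episode 9 (and am impatiently waiting for more), and it is really good, so I scraped its reviews: more than 10,000 so far, and surely far more than that once the series finishes airing. Never mind that for now; let's look at the ATTITUDE of the viewers on the site. If you want the source code, leave a comment with your email address. Let's get started!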
Import the required libraries
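import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import jieba
import re
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# Use a font that can render the Chinese plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']
# Raw string, so the backslashes in the Windows path are not read as escapes
os.chdir(r'D:\学习笔记\python\爬虫\风犬少年的天空热评')
# `encoding` is dropped here: newer pandas versions no longer accept it in read_excel
data = pd.read_excel('shortviews.xlsx', sheet_name='views')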
Rating distribution
# Count the reviews per score, then add each score's share of the total
remark = data.groupby(data['score'])[['score']].count()
remark['rate'] = round(remark['score'] / remark['score'].sum(), 3)
remark.columns = ['freq', 'rate']
remark
score   freq   rate
2        992  0.094
4        187  0.018
6        198  0.019
8        591  0.056
10      8619  0.814
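The same frequency/share table can also be built with `value_counts`. The sketch below is only an illustration: `toy` is a made-up stand-in for the `score` column of `shortviews.xlsx`, and `rating_table` is a hypothetical helper name, not part of the original script.

```python
import pandas as pd

# Hypothetical toy data standing in for the `score` column of shortviews.xlsx
toy = pd.DataFrame({"score": [10, 10, 10, 8, 2, 10, 6, 10, 4, 10]})


def rating_table(df, col="score"):
    """Return a table of counts and shares per score, sorted by score."""
    freq = df[col].value_counts().sort_index()   # number of reviews per score
    rate = (freq / freq.sum()).round(3)          # share of all reviews
    table = pd.DataFrame({"freq": freq, "rate": rate})
    table.index.name = col
    return table


summary = rating_table(toy)
print(summary)
```

On the real data, `rating_table(data)` should give the same numbers as `remark`, provided the `score` column has no missing values.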
plt.figure(figsize=(12, 6))
plt.style.use('ggplot')
# One bar per score; the 10-point bar is highlighted in brown
plt.bar(x=remark.index, height=remark['freq'], bottom=0,
        color=['grey', 'grey', 'grey', 'grey', 'brown'])
plt.grid(False)
plt.title('评分情况', fontdict=dict(fontsize=30))  # "Rating distribution"
plt.xlabel('评分', fontsize=18)  # "Score"
plt.ylabel('计数', fontsize=18)  # "Count"
plt.tick_params(labelsize=16)
plt.show()
As the chart shows, 10-point ratings alone account for 81% of the reviews: Bilibili users really do love this show!
Word cloud of all reviews
# Join all reviews into one string and strip punctuation and a few filler characters
txt = ''.join(data['content'].values.tolist())
txt = re.sub('[,‘“”;’()()?!。Bb【】的了 是看也就]', '', txt)
# Segment into words and count them
segments = jieba.lcut(txt)
count = Counter(segments)
res = sorted(count.items(), key=lambda x: x[1], reverse=True)
# Use the background image as the word-cloud mask
image = Image.open('bg.jpg')
img = np.array(image)
wc = WordCloud(
    background_color="#fff",
    width=990,
    height=440,
    margin=10,
    max_font_size=100,
    random_state=30,
    font_path='C:/Windows/Fonts/simkai.ttf',
    mask=img
).generate_from_frequencies(count)
wc.to_file('wc.png')
plt.figure(figsize=(25, 25))
plt.imshow(wc)
plt.axis('off')
plt.show()
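In my view, Bilibili users show varying degrees of affection for this drama. A large part of the reason is probably that particular plot lines bring back memories and strike a chord. After all, it is 2020, and given how much the country has emphasized and broadened access to education in recent years, the vast majority of people have been through high school.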
# Keep only the words longer than one character
graphara = {}
for _key, _value in count.items():
    if len(_key) > 1:
        graphara.update({_key: _value})
wc = WordCloud(
    background_color="#fff",
    width=990,
    height=440,
    margin=10,
    max_font_size=100,
    random_state=30,
    font_path='C:/Windows/Fonts/simkai.ttf',
).generate_from_frequencies(graphara)
plt.figure(figsize=(12, 8))
plt.imshow(wc)
plt.axis('off')
plt.show()
Youth (青春), moved (感动), real (真实), funny (搞笑), reality (现实), regret (遗憾)… make of it what you will!
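One thing to watch for in the cleaning step: deleting characters such as 的 or 看 before segmentation also removes them from inside longer words. A minimal sketch of an alternative is to segment first and filter whole tokens afterwards. Everything named here is an assumption for illustration: `FILLER_CHARS` is only a subset of the character class used in the script, `STOPWORDS` and `filter_counts` are hypothetical names, and `tokens` is a hand-written stand-in for the output of `jieba.lcut`.

```python
import re
from collections import Counter

# Subset of the character class used in the script (illustration only)
FILLER_CHARS = '[的了是看也就]'

# Character-level deletion also hits characters inside longer words:
stripped = re.sub(FILLER_CHARS, '', '真的很好看')
print(stripped)  # the 的 and the 看 of 好看 are both gone

# Assumed stop-word set, applied to whole tokens instead of characters
STOPWORDS = {"的", "了", "是", "看", "也", "就"}


def filter_counts(tokens, stopwords=STOPWORDS, min_len=2):
    """Count tokens, skipping stop words and tokens shorter than min_len."""
    return Counter(
        t for t in tokens
        if len(t) >= min_len and t not in stopwords
    )


# Hand-written stand-in for jieba.lcut(...) output
tokens = ["青春", "的", "感动", "好看", "了", "青春", "!"]
freqs = filter_counts(tokens)
print(freqs.most_common())
```

With the real data, `filter_counts(jieba.lcut(txt))` could be passed straight to `generate_from_frequencies`, since a `Counter` is a dict. Whether this changes the picture noticeably depends on how often those characters occur inside longer words.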
Negative reviews
# Reviews that scored below 5
dislike = data[data['score'] < 5]
txt1 = ''.join(dislike['content'].values.tolist())
txt1 = re.sub('[,‘“”;’()()?!。Bb【】的了 是看也就]', '', txt1)
segments = jieba.lcut(txt1)
count = Counter(segments)
# Keep only the words longer than one character
dis_graphara = {}
for _key, _value in count.items():
    if len(_key) > 1:
        dis_graphara.update({_key: _value})
wc = WordCloud(
    background_color="#fff",
    width=990,
    height=440,
    margin=10,
    max_font_size=100,
    random_state=30,
    font_path='C:/Windows/Fonts/simkai.ttf',
).generate_from_frequencies(dis_graphara)
plt.figure(figsize=(18, 8))
plt.imshow(wc)
plt.axis('off')
plt.show()
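The words 天天 ("every day") and 一直 ("all the time") in this word cloud show just how heavily Bilibili has been promoting the drama; some users clearly cannot take the bombardment any more. Words related to the plot itself are rare, though. About all you can spot are 尴尬 ("awkward") and 不感兴趣 ("not interested"), which I guess may be down to the use of dialect. So these low scores may mostly reflect users' dislike of Bilibili's blanket recommendations, which comes down to Bilibili's algorithm, and that is beyond me...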
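A word cloud only hints at this, so one rough way to check the guess is to measure what share of the low-score reviews mention promotion-related words versus plot-related words. The sketch below assumes a lot: `keyword_share` is a hypothetical helper, the two keyword lists are my own picks based on the words named above, and `low_reviews` is a handful of invented sentences rather than real scraped data.

```python
def keyword_share(reviews, keywords):
    """Fraction of reviews containing at least one of the keywords."""
    reviews = [str(r) for r in reviews]
    if not reviews:
        return 0.0
    hits = sum(any(k in r for k in keywords) for r in reviews)
    return hits / len(reviews)


# Assumed keyword lists (illustrative, not from the original script)
PUSH_WORDS = ["天天", "一直", "推送"]
PLOT_WORDS = ["尴尬", "不感兴趣"]

# Invented stand-ins for dislike['content']; not real data
low_reviews = ["天天推,烦", "一直在首页", "有点尴尬", "不喜欢"]

push_share = keyword_share(low_reviews, PUSH_WORDS)
plot_share = keyword_share(low_reviews, PLOT_WORDS)
print(push_share, plot_share)
```

On the real data the call would be `keyword_share(dislike['content'], PUSH_WORDS)`. Substring matching is crude (一直 also appears in sentences that have nothing to do with recommendations), so the result is at best a sanity check on the reading of the word cloud.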