How to Scrape Bilibili's Cute Sticker Packs with Python

0x01 Preface

While browsing Bilibili recently I came across a cat-person sticker pack I had been hunting for ages:

QQ20220517-090317-HD

But when I clicked in and saw the grayed-out download button, murderous intent slowly rose in me:

image-20220517090548240

Put simply: you are not allowed to download them. You can only use them by installing the BiJian (必剪) app, and annoyingly that editing app does not support macOS! Infuriating.

But that does not stop me from downloading. Just press the righteous F12, inspect the elements, find the link, then right-click and save:

image-20220517091128168

image-20220517091150031

After clicking this direct link, a right-click downloads it locally:

image-20220517091316618

But! Although this works, every single sticker means opening it and pressing F12, and there are 228 stickers in total. Am I really going to repeat that 228 times?

So we'll just have to write a righteous crawler to download them for me.

0x02 Approach 1

Since this page is loaded dynamically, extra steps are needed.

Use selenium plus BeautifulSoup4 to simulate browser actions and grab all the data we need from the page:

from selenium.webdriver.common.keys import Keys  # simulates keyboard input (to work the scroll bar)
from bs4 import BeautifulSoup  # parses the HTML
from selenium import webdriver  # drives the browser

I use Chrome, so I need to download the matching browser driver that selenium requires. Download address:
https://chromedriver.chromium.org/downloads

My Chrome version is 101, so I click the spot shown below to download:

image-20220517092500840

If the driver throws an error, read the error message. Mine failed precisely because of a version mismatch, hence this extra note 😢

The concrete implementation:

from selenium.webdriver.common.keys import Keys  # simulates keyboard input (to work the scroll bar)
from bs4 import BeautifulSoup  # parses the HTML
from selenium import webdriver  # drives the browser

driver = webdriver.Chrome(executable_path="/Users/q1jun/chromedriver")
driver.get('https://cool.bilibili.com/x/co-create/user/material_list?ps=15&pn=1&mid=1445996044&up_from=1&biz_from=3&material_type=7&t=1652707801269')
driver.maximize_window()  # maximize the window
# drag the scroll bar down...
driver.find_element_by_xpath('//input[@class="readerImg"]').send_keys(Keys.DOWN)

# driver.page_source returns the page content; parse the HTML with BeautifulSoup (remember the import)
html = BeautifulSoup(driver.page_source, 'html.parser')
print(html)

Then comes the problem: as the screenshot below shows, driver.find_element_by_xpath is deprecated, and the data we ended up with was incomplete.

image-20220516214249267

The output:

image-20220516214318031

Just a handful of entries, definitely not the 200-plus stickers we need! Curses. Launch Plan B!

0x03 Approach 2

By analyzing the page's request URLs we can see how Bilibili requests its data, and then we simply imitate those requests. The theory exists; let practice begin!

First, the righteous F12 again: open the Network tab and filter by XHR.

image-20220517093257540

Then keep scrolling down so the front end keeps firing requests to pull data from the backend:

image-20220516214516357

image-20220516214542568

Notice the two parameters I marked, pn and t; they are the only ones that keep changing (the other, fixed parameters can be ignored for now).

Clearly pn is short for PageNumber and identifies the requested page. So what do t and that number, 1652708159004, mean?

Anyone who knows Unix timestamps will spot it at a glance: it is a timestamp. We can verify with Python's time module:

image-20220516214442786

(The values differ slightly because they were not captured at the same moment.)
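A small standard-library check of that claim. Note the captured t is in epoch milliseconds, while time.time() returns seconds as a float (1652708159004 is the value from the captured request):

```python
import time
from datetime import datetime, timezone

captured_ms = 1652708159004  # the t parameter captured in the browser (milliseconds)
captured = datetime.fromtimestamp(captured_ms / 1000, tz=timezone.utc)
print(captured)  # → 2022-05-16 13:35:59.004000+00:00

# To mimic the browser exactly, send milliseconds:
t = int(time.time() * 1000)
print(t > captured_ms)  # → True (for any run after May 2022)
```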

Once the request parameters are known it's easy: fetch the data with requests and print it out. The code:

Here I set only pn=2 to handle a small slice of the data first; pulling too much at once makes it hard to analyze.

import requests
import time

headers = {
    'Accept': 'application/json'
}
pn = 2  # test with just one page first
url = f'https://cool.bilibili.com/x/co-create/user/material_list?ps=15&pn={pn}&mid=1445996044&up_from=1&biz_from=3&material_type=7&t={time.time()}'
html = requests.get(url, headers=headers)
if html.status_code == 200:
    html_bytes = html.content
    html_str = html_bytes.decode()
    print(html_str)

The result:

{"code":0,"message":"0","ttl":1,"data":{"material_list":[{"material_id":822876,"material_type":7,"sid":0,"title":"一猫人-真相.gif","cover":"https://i0.hdslb.com/bfs/material_up/7165468d675fa4df192ec861403e7997e9303c34.gif","duration":0,"musicians":"","categories":"","used_count":233,"videopre_url":""},{"material_id":961278,"material_type":7,"sid":0,"title":"一猫人-世俗的欲望1","cover":"https://i0.hdslb.com/bfs/material_up/0b0b9aaa63617ef602f33cea0a83f5e730f4402f.jpg","duration":0,"musicians":"","categories":"","used_count":223,"videopre_url":""},{"material_id":832071,"material_type":7,"sid":0,"title":"一猫人-上班.gif","cover":"https://i0.hdslb.com/bfs/material_up/46376cc652e9113945f9213c2a3a561287d1851b.gif","duration":0,"musicians":"","categories":"","used_count":216,"videopre_url":""},{"material_id":940356,"material_type":7,"sid":0,"title":"一猫人-中秋快乐.j","cover":"https://i0.hdslb.com/bfs/material_up/70e1ca54f83da61532f333b0bf5abfe81ac51686.jpg","duration":0,"musicians":"","categories":"","used_count":205,"videopre_url":""},{"material_id":891899,"material_type":7,"sid":0,"title":"一猫人-棒.gif","cover":"https://i0.hdslb.com/bfs/material_up/a49e0c6c8c61bd0db039264e1301d7d53e83534e.gif","duration":0,"musicians":"","categories":"","used_count":172,"videopre_url":""},{"material_id":894317,"material_type":7,"sid":0,"title":"一猫人-飞机.gif","cover":"https://i0.hdslb.com/bfs/material_up/85a2ff51879f8b6c89938145a28f0d599336fe0a.gif","duration":0,"musicians":"","categories":"","used_count":162,"videopre_url":""},{"material_id":931391,"material_type":7,"sid":0,"title":"一猫人-小丑.png","cover":"https://i0.hdslb.com/bfs/material_up/b68b50d6d739d713c152e493924363f3b72f6f6e.png","duration":0,"musicians":"","categories":"","used_count":157,"videopre_url":""},{"material_id":972898,"material_type":7,"sid":0,"title":"一猫人-斯莱特林.p","cover":"https://i0.hdslb.com/bfs/material_up/3d78bd013a0a820d01985dc29ff832a39b0d8944.png","duration":0,"musicians":"","categories":"","used_count":153,"videopre_url":""},{"material_id":830663,"material_type":7,"sid":0,"title":"一猫人-帅.gif","cover":"https://i0.hdslb.com/bfs/material_up/644bf254c17b733653425db528b3b39553599e1a.gif","duration":0,"musicians":"","categories":"","used_count":140,"videopre_url":""},{"material_id":822879,"material_type":7,"sid":0,"title":"一猫人-冲.gif","cover":"https://i0.hdslb.com/bfs/material_up/7be99b8651dd69cc0664bed583a3256f873e20cb.gif","duration":0,"musicians":"","categories":"","used_count":123,"videopre_url":""},{"material_id":867832,"material_type":7,"sid":0,"title":"一猫人-帅醒.png","cover":"https://i0.hdslb.com/bfs/material_up/61dbe0163a0c545892f195d53d8ba001d394751e.png","duration":0,"musicians":"","categories":"","used_count":122,"videopre_url":""},{"material_id":961783,"material_type":7,"sid":0,"title":"一猫人-JOJO承太","cover":"https://i0.hdslb.com/bfs/material_up/451c9d93008bf50dd0005443c74e45dd72b71128.png","duration":0,"musicians":"","categories":"","used_count":121,"videopre_url":""},{"material_id":860578,"material_type":7,"sid":0,"title":"一猫人-跳舞.gif","cover":"https://i0.hdslb.com/bfs/material_up/d4072bef4bd47bedc6051916d93beb256f7c8c6a.gif","duration":0,"musicians":"","categories":"","used_count":121,"videopre_url":""},{"material_id":942980,"material_type":7,"sid":0,"title":"一猫人-赫奇帕奇.j","cover":"https://i0.hdslb.com/bfs/material_up/88ce5ab9891d8a42133c41bdab538e2414b31f2b.jpg","duration":0,"musicians":"","categories":"","used_count":119,"videopre_url":""},{"material_id":972899,"material_type":7,"sid":0,"title":"一猫人-拉文克劳.p","cover":"https://i0.hdslb.com/bfs/material_up/92ed49143c4301c9e8e5af86b85458c0b6675ac5.png","duration":0,"musicians":"","categories":"","used_count":115,"videopre_url":""}],"pager":{"total":228,"pn":2,"ps":15}}}

Analyzing the returned data, we find what we need:

image-20220516215701036

From the Accept: application/json seen earlier in the browser debugger, we know the response is JSON data.

Now process it with the json package (the parsing starts right after the status-code check):

import requests
import json
import time

headers = {
    'Accept': 'application/json'
}
pn = 2
url = f'https://cool.bilibili.com/x/co-create/user/material_list?ps=15&pn={pn}&mid=1445996044&up_from=1&biz_from=3&material_type=7&t={time.time()}'
html = requests.get(url, headers=headers)
if html.status_code == 200:
    html_bytes = html.content
    html_str = html_bytes.decode()
    # print(html_str)
    data = json.loads(html_str)
    all_items = data['data']['material_list']
    pic_content = []
    for item in all_items:
        title = item['title']
        material_id = item['material_id']
        uri = item['cover']
        used_count = item['used_count']
        pic_content.append({'名称': title, 'ID': material_id, '使用次数': used_count, 'URI': uri})
    for pic in pic_content:
        print(pic)

The result:

/usr/local/bin/python3.9 /Users/q1jun/Desktop/pythonProject/scraper.py
{'名称': '一猫人-真相.gif', 'ID': 822876, '使用次数': 233, 'URI': 'https://i0.hdslb.com/bfs/material_up/7165468d675fa4df192ec861403e7997e9303c34.gif'}
{'名称': '一猫人-世俗的欲望1', 'ID': 961278, '使用次数': 223, 'URI': 'https://i0.hdslb.com/bfs/material_up/0b0b9aaa63617ef602f33cea0a83f5e730f4402f.jpg'}
{'名称': '一猫人-上班.gif', 'ID': 832071, '使用次数': 216, 'URI': 'https://i0.hdslb.com/bfs/material_up/46376cc652e9113945f9213c2a3a561287d1851b.gif'}
{'名称': '一猫人-中秋快乐.j', 'ID': 940356, '使用次数': 205, 'URI': 'https://i0.hdslb.com/bfs/material_up/70e1ca54f83da61532f333b0bf5abfe81ac51686.jpg'}
{'名称': '一猫人-棒.gif', 'ID': 891899, '使用次数': 172, 'URI': 'https://i0.hdslb.com/bfs/material_up/a49e0c6c8c61bd0db039264e1301d7d53e83534e.gif'}
{'名称': '一猫人-飞机.gif', 'ID': 894317, '使用次数': 162, 'URI': 'https://i0.hdslb.com/bfs/material_up/85a2ff51879f8b6c89938145a28f0d599336fe0a.gif'}
{'名称': '一猫人-小丑.png', 'ID': 931391, '使用次数': 157, 'URI': 'https://i0.hdslb.com/bfs/material_up/b68b50d6d739d713c152e493924363f3b72f6f6e.png'}
{'名称': '一猫人-斯莱特林.p', 'ID': 972898, '使用次数': 153, 'URI': 'https://i0.hdslb.com/bfs/material_up/3d78bd013a0a820d01985dc29ff832a39b0d8944.png'}
{'名称': '一猫人-帅.gif', 'ID': 830663, '使用次数': 140, 'URI': 'https://i0.hdslb.com/bfs/material_up/644bf254c17b733653425db528b3b39553599e1a.gif'}
{'名称': '一猫人-冲.gif', 'ID': 822879, '使用次数': 123, 'URI': 'https://i0.hdslb.com/bfs/material_up/7be99b8651dd69cc0664bed583a3256f873e20cb.gif'}
{'名称': '一猫人-帅醒.png', 'ID': 867832, '使用次数': 122, 'URI': 'https://i0.hdslb.com/bfs/material_up/61dbe0163a0c545892f195d53d8ba001d394751e.png'}
{'名称': '一猫人-JOJO承太', 'ID': 961783, '使用次数': 121, 'URI': 'https://i0.hdslb.com/bfs/material_up/451c9d93008bf50dd0005443c74e45dd72b71128.png'}
{'名称': '一猫人-跳舞.gif', 'ID': 860578, '使用次数': 121, 'URI': 'https://i0.hdslb.com/bfs/material_up/d4072bef4bd47bedc6051916d93beb256f7c8c6a.gif'}
{'名称': '一猫人-赫奇帕奇.j', 'ID': 942980, '使用次数': 119, 'URI': 'https://i0.hdslb.com/bfs/material_up/88ce5ab9891d8a42133c41bdab538e2414b31f2b.jpg'}
{'名称': '一猫人-拉文克劳.p', 'ID': 972899, '使用次数': 115, 'URI': 'https://i0.hdslb.com/bfs/material_up/92ed49143c4301c9e8e5af86b85458c0b6675ac5.png'}

Great, no problems at all!

Scroll straight to the bottom in the browser to see how many pages there are.

Looking at the last entry: there are 16 pages in total, and the last sticker is 火箭队.pn.
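Actually, the scrolling isn't strictly needed to learn this: the pager object in the earlier JSON response already reports total=228 and ps=15, so the page count follows directly:

```python
import math

total = 228  # "total" from the pager field in the response
ps = 15      # page size, the "ps" query parameter
print(math.ceil(total / ps))  # → 16
```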

image-20220516223755327

At this point someone is surely wondering: why is it 火箭队.pn and not 火箭队.png??

That is a separate issue. Looking through the rest of the data in the browser, the sticker formats are .png, .jpg and .gif.

But! Possibly due to a quirk in Bilibili's own API, the file suffix in the title field sometimes comes back short,

producing truncated suffixes like these:

.j, .jp, .jpg, a bare dot, or nothing at all!

To keep the downloaded stickers recognizable the filename matters, so the title needs cleaning up.

The suffix handling looks like this:

name = title.replace('一猫人-', '').replace('.png', '').replace('.gif', '').replace('.jpg', '').replace('.pn', '').replace('.gi', '').replace('.jp', '').replace('.p', '').replace('.g', '').replace('.j', '').replace('.', '')

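As an aside, that chain of replace calls is order-sensitive (each full suffix must be removed before its shorter prefixes) and easy to get wrong. A single regular expression covers every truncated form in one go; a sketch, where clean_title is a hypothetical helper name:

```python
import re

def clean_title(title: str) -> str:
    # drop the "一猫人-" prefix, then a trailing "." plus zero or more letters
    # (covers .png, .pn, .p, .gif, .gi, .g, .jpg, .jp, .j, and a bare ".")
    return re.sub(r'\.[a-z]*$', '', title.replace('一猫人-', ''))

print(clean_title('一猫人-斯莱特林.p'))  # → 斯莱特林
print(clean_title('一猫人-中秋快乐.j'))  # → 中秋快乐
print(clean_title('一猫人-JOJO承太'))    # → JOJO承太
```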
Done. All that is left is to scrape every sticker.

Code to scrape all pages, pn = 0 to 16 (there may not be a page 0, but that does no harm):

import requests
import json
import time

headers = {
    'Accept': 'application/json'
}
pic_content = []
for pn in range(0, 17):
    url = f'https://cool.bilibili.com/x/co-create/user/material_list?ps=15&pn={pn}&mid=1445996044&up_from=1&biz_from=3&material_type=7&t={time.time()}'
    html = requests.get(url, headers=headers)
    # print(f'start={time.ctime()}')
    # time.sleep(2)
    # print(f'end={time.ctime()}')
    print(f'uri = {url}')
    if html.status_code == 200:
        html_bytes = html.content
        html_str = html_bytes.decode()
        # print(html_str)
        data = json.loads(html_str)
        all_items = data['data']['material_list']
        for item in all_items:
            title = item['title']
            material_id = item['material_id']
            uri = item['cover']
            used_count = item['used_count']
            pic_content.append({'名称': title, 'ID': material_id, '使用次数': used_count, 'URI': uri})
for pic in pic_content:
    print(pic)

Finally, use the os package to download everything into a source folder in the current directory.

The final code:

import requests
import json
import time
import os


def download(url: str, dst_dir: str, title: str):
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)

    # strip the "一猫人-" prefix and any (possibly truncated) extension;
    # full suffixes must be removed before their shorter prefixes
    name = (title.replace('一猫人-', '')
            .replace('.png', '').replace('.gif', '').replace('.jpg', '')
            .replace('.pn', '').replace('.gi', '').replace('.jp', '')
            .replace('.p', '').replace('.g', '').replace('.j', '')
            .replace('.', ''))
    filename = url.split('/')[-1].replace(" ", "_")
    if '.png' in filename:
        filename = name + ".png"
    if '.gif' in filename:
        filename = name + ".gif"
    if '.jpg' in filename:
        filename = name + ".jpg"
    file_path = os.path.join(dst_dir, filename)
    r = requests.get(url, stream=True)

    if r.ok:
        print("saving to", os.path.abspath(file_path))

        with open(file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 8):
                if chunk:
                    f.write(chunk)
                    f.flush()
                    os.fsync(f.fileno())
    else:  # HTTP status code 4XX/5XX
        print("Download failed: status code {}\n{}".format(r.status_code, r.text))


headers = {
    'Accept': 'application/json'
}
url_list = []
pic_content = []
for pn in range(0, 17):
    url = f'https://cool.bilibili.com/x/co-create/user/material_list?ps=15&pn={pn}&mid=1445996044&up_from=1&biz_from=3&material_type=7&t={time.time()}'
    html = requests.get(url, headers=headers)
    # print(f'start={time.ctime()}')
    # time.sleep(2)
    # print(f'end={time.ctime()}')
    print(f'uri = {url}')
    if html.status_code == 200:
        html_bytes = html.content
        html_str = html_bytes.decode()
        # print(html_str)
        data = json.loads(html_str)
        all_items = data['data']['material_list']
        for item in all_items:
            title = item['title']
            material_id = item['material_id']
            uri = item['cover']
            used_count = item['used_count']
            pic_content.append({'名称': title, 'ID': material_id, '使用次数': used_count, 'URI': uri})
            url_list.append(uri)
            download(uri, dst_dir="source", title=title)
for pic in pic_content:
    print(pic)

Screenshots of the final result:

image-20220516233421550

image-20220516233544162

227 cat-person stickers in all! Time to spam the group chats good and hard! 🥰

Group chatting on company time

The downloaded stickers