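The endpoint https://api.bilibili.com/x/v1/dm/list.so?oid=xxxx downloads the danmaku (bullet comments) of a Bilibili video, but how do we get a video's oid?
Parsing the video link
The following code parses a Bilibili video link (av, BV, or b23.tv short link) and looks up the video's information: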
import re
import requests

# Request headers that mimic a desktop browser.
headers = {
    "authority": "api.bilibili.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "accept": "application/json, text/plain, */*",
}
def get_real_url(url):
    # Short b23.tv links redirect to the full URL; read the redirect target.
    r = requests.head(url, headers=headers)
    return r.headers['Location']

def get_avbvid(url):
    if "b23.tv" in url:
        url = get_real_url(url)
    # p selects one part of a multi-part video; 0 means no part was given.
    m_obj = re.search(r"[?&]p=(\d+)", url)
    p = 0
    if m_obj:
        p = int(m_obj.group(1))
    # Drop the query string and any trailing slash, then take the last path segment.
    path = url.split("?", 1)[0].rstrip("/")
    avbvid = path[path.rfind("/") + 1:]
    if avbvid.startswith("av"):
        return "aid", avbvid[2:], p
    else:
        return "bvid", avbvid, p
def get_cid(url, all_cid=False):
    t, avbvid, p = get_avbvid(url)
    res = requests.get(
        f"https://api.bilibili.com/x/web-interface/view?{t}={avbvid}", headers=headers)
    res.encoding = "u8"
    data = res.json()['data']
    # Map page number -> (part title, cid) for multi-part videos.
    cids = {row["page"]: (row['part'], row["cid"]) for row in data["pages"]}
    if all_cid:
        return cids
    elif p == 0:
        return data["title"], data["cid"]
    else:
        return cids[p]
url = "https://www.bilibili.com/video/BV1hL411N79E?spm_id_from=333.999.0.0"
title, cid = get_cid(url)
print(title, cid)
985毕业,裸辞……不卷不是中国人 544521953
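In testing, this endpoint returned detailed information about the given video, including its cid, which is the value the danmaku endpoint expects as oid.
The get_cid code above also works for multi-part videos: when the link carries a p parameter, it resolves the cid of the corresponding part.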
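As a rough illustration of the link handling described above, here is a minimal sketch that uses only the standard library and makes no network request. The helper names parse_video_url and select_cid are hypothetical, and the sample dictionary is a hand-written stand-in containing only the fields the code above reads (title, cid, pages); it is not a real API response.

```python
from urllib.parse import urlparse, parse_qs


def parse_video_url(url):
    """Return (id_type, video_id, page) for a full bilibili.com video URL."""
    parts = urlparse(url)
    # The video ID is the last non-empty path segment, e.g. "BV1hL411N79E".
    video_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    # "p" selects the part of a multi-part video; 0 means "not specified".
    page = int(parse_qs(parts.query).get("p", ["0"])[0])
    if video_id.startswith("av"):
        return "aid", video_id[2:], page
    return "bvid", video_id, page


def select_cid(data, page=0):
    """Pick (name, cid) from the `data` object of the view endpoint."""
    if page == 0:
        return data["title"], data["cid"]
    for row in data["pages"]:
        if row["page"] == page:
            return row["part"], row["cid"]
    raise KeyError(f"no part numbered {page}")


# Hypothetical, hand-written stand-in for the endpoint's `data` object.
sample = {
    "title": "demo video",
    "cid": 111,
    "pages": [
        {"page": 1, "part": "intro", "cid": 111},
        {"page": 2, "part": "main", "cid": 222},
    ],
}

print(parse_video_url("https://www.bilibili.com/video/BV1hL411N79E?spm_id_from=333.999.0.0"))
print(select_cid(sample, page=2))
```

Parsing the query string with parse_qs rather than a regular expression is just one option; it also tolerates a trailing slash before the query string.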
Downloading the danmaku
With the cid in hand, we can download the danmaku for the video.
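To make the shape of the response concrete, here is a small sketch that parses a hand-written sample with the standard library's xml.etree.ElementTree. The layout (a root i element containing d elements, each with a p attribute and the danmaku text) is inferred from how the parsing code in this article reads the data; the text "example danmaku" and the helper name iter_danmaku are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape the parsing code expects: a root <i>
# element containing <d> elements. The text "example danmaku" is made up.
SAMPLE_XML = (
    "<i>"
    '<d p="345.12100,1,25,16777215,1649119856,0,b0f43983,1023650567903644160,11">'
    "example danmaku</d>"
    "</i>"
)


def iter_danmaku(xml_text):
    """Yield (p_attribute, text) for every <d> element in the document."""
    root = ET.fromstring(xml_text)
    for d in root.iter("d"):
        yield d.get("p"), d.text or ""


rows = list(iter_danmaku(SAMPLE_XML))
print(rows[0][1])
```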
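The fields of the p attribute have the following meanings:
345.12100,1,25,16777215,1649119856,0,b0f43983,1023650567903644160,11 are, in order: appearance time, mode, font size, color, send timestamp, danmaku pool, user hash, database ID, page.
The appearance time is the playback position in seconds. The user hash is the user ID hashed with the CRC32 algorithm; for the theory, see www.itu.int/rec/T-REC-I…
The database ID can be used later to look up information about this danmaku, such as its number of likes.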
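The field list above can be turned into a small typed record. The following is only a sketch: the DanmakuMeta class and parse_p function are hypothetical names, the color is padded to six hex digits (an assumption that it is a 24-bit RGB value), and the timestamp is interpreted as UTC so the result does not depend on the local time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DanmakuMeta:
    appear_seconds: float   # playback position in seconds
    mode: int
    font_size: int
    color_hex: str          # color as six hex digits (assumes 24-bit RGB)
    sent_at: datetime       # send time as an aware UTC datetime
    pool: int
    user_hash: str
    db_id: str
    extra: list             # any fields after the first eight, kept as-is


def parse_p(p):
    fields = p.split(",")
    appear, mode, size, color, ts, pool, user_hash, db_id = fields[:8]
    return DanmakuMeta(
        appear_seconds=float(appear),
        mode=int(mode),
        font_size=int(size),
        color_hex=f"{int(color):06X}",
        sent_at=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        pool=int(pool),
        user_hash=user_hash,
        db_id=db_id,
        extra=fields[8:],
    )


meta = parse_p("345.12100,1,25,16777215,1649119856,0,b0f43983,1023650567903644160,11")
print(meta.color_hex, meta.sent_at.isoformat(), meta.extra)
```

Keeping the trailing fields in a list rather than unpacking exactly nine values means the sketch does not fail if a response carries a different number of fields.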
The code to download and parse the danmaku is as follows:
import requests
import xmltodict
import datetime
import pandas as pd
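def download_dm(cid):
    # Reuses the headers dict defined in the earlier snippet.
    res = requests.get(
        f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}", headers=headers)
    res.encoding = "u8"
    data = []
    for row in xmltodict.parse(res.text)["i"]["d"]:
        text = row["#text"]  # danmaku text
        # appearance time, mode, font size, color, send timestamp, pool, user hash, database ID, page
        time_diff, mode, font_size, color, timestamp, pool, id_hash, ids, page = row["@p"].split(",")
        time_diff = str(datetime.timedelta(seconds=round(float(time_diff), 2)))
        # Trim trailing zeros only from a fractional part, so "0:05:40" does not become "5:4".
        if "." in time_diff:
            time_diff = time_diff.rstrip("0")
        time_diff = time_diff.lstrip("0:")
        time = str(datetime.datetime.fromtimestamp(int(timestamp)))
        color = hex(int(color))[2:].upper()
        data.append([time, time_diff, font_size, color, id_hash, ids, text])
    # Columns: send time, appearance time, font size, color, user CRC, db_id, danmaku text
    dm_df = pd.DataFrame(
        data, columns=["发送时间", "弹幕出现时间", "字体大小", "颜色", "用户CRC", "db_id", "弹幕内容"])
    return dm_df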
dm_df = download_dm(cid)
dm_df
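Reversing the user hash to a user ID
Someone has implemented this reverse lookup in JavaScript, converting a user hash back to the original user ID:
Reference: 用crc彩虹表反向B站弹幕“匿名”?我不想浪费内存,但是要和彩虹表一样快! (roughly: "Using a CRC rainbow table to reverse Bilibili danmaku 'anonymity'? I don't want to waste memory, but it has to be as fast as a rainbow table!")
Based on that JavaScript code, we can port it to Python: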
CRCPOLYNOMIAL = 0xEDB88320
# Precompute the 256-entry CRC-32 lookup table.
crctable = [0 for x in range(256)]
for i in range(256):
    crcreg = i
    for _ in range(8):
        if (crcreg & 1) != 0:
            crcreg = CRCPOLYNOMIAL ^ (crcreg >> 1)
        else:
            crcreg = crcreg >> 1
    crctable[i] = crcreg
def crc32(text):
    # CRC register after feeding str(text); the final XOR is not applied here.
    crcstart = 0xFFFFFFFF
    for i in range(len(str(text))):
        index = (crcstart ^ ord(str(text)[i])) & 255
        crcstart = (crcstart >> 8) ^ crctable[index]
    return crcstart

def crc32_last_index(text):
    # Table index used when processing the last character of str(text).
    crcstart = 0xFFFFFFFF
    for i in range(len(str(text))):
        index = (crcstart ^ ord(str(text)[i])) & 255
        crcstart = (crcstart >> 8) ^ crctable[index]
    return index

def get_crc_index(t):
    # Find the table entry whose top byte equals t.
    for i in range(256):
        if crctable[i] >> 24 == t:
            return i
    return -1
def deep_check(i, index):
    # Try to recover the last three decimal digits; returns [1, digits] or [0].
    text = ""
    tc = 0x00
    hashcode = crc32(i)
    tc = hashcode & 0xff ^ index[2]
    if not (tc <= 57 and tc >= 48):  # not an ASCII digit "0".."9"
        return [0]
    text += str(tc - 48)
    hashcode = crctable[index[2]] ^ (hashcode >> 8)
    tc = hashcode & 0xff ^ index[1]
    if not (tc <= 57 and tc >= 48):
        return [0]
    text += str(tc - 48)
    hashcode = crctable[index[1]] ^ (hashcode >> 8)
    tc = hashcode & 0xff ^ index[0]
    if not (tc <= 57 and tc >= 48):
        return [0]
    text += str(tc - 48)
    hashcode = crctable[index[0]] ^ (hashcode >> 8)
    return [1, text]
def crack(text):
    index = [0 for x in range(4)]
    i = 0
    ht = int(f"0x{text}", 16) ^ 0xffffffff
    # Recover the four table indexes used for the last four characters.
    for i in range(3, -1, -1):
        index[3-i] = get_crc_index(ht >> (i*8))
        snum = crctable[index[3-i]]
        ht ^= snum >> ((3-i)*8)
    # Search for a prefix i whose last table index matches, then check the last three digits.
    for i in range(100000000):
        lastindex = crc32_last_index(i)
        if lastindex == index[3]:
            deepCheckData = deep_check(i, index)
            if deepCheckData[0]:
                break
    else:
        # The loop finished without finding a match.
        return -1
    return f"{i}{deepCheckData[1]}"
Save the code above as crc32.py.
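One way to sanity-check a recovered ID is to hash it forward and compare with the original hash. The sketch below uses the standard library's zlib.crc32 and assumes, as the port above implies (same polynomial, initial value 0xFFFFFFFF, final XOR with 0xFFFFFFFF), that the user hash is the standard CRC-32 of the user ID written as a decimal string. The helper names user_hash and matches are hypothetical.

```python
import zlib


def user_hash(uid):
    """CRC-32 of the decimal user ID, as an unsigned 32-bit integer."""
    return zlib.crc32(str(uid).encode("ascii")) & 0xFFFFFFFF


def matches(uid, hash_hex):
    """True if `uid` hashes to the hexadecimal string `hash_hex`."""
    return user_hash(uid) == int(hash_hex, 16)


# "123456789" is the conventional CRC-32 check input (check value CBF43926);
# it is used here only as a known test vector, not as a real user ID.
print(f"{user_hash(123456789):08x}")
print(matches(123456789, "cbf43926"))
```

Comparing as integers rather than strings avoids depending on whether a hash is written with leading zeros. Since CRC-32 is only 32 bits wide, different IDs can share a hash, so a match does not prove that the recovered ID is the actual sender.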
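In testing, reversing a single hash took about 1 second, so reversing all 600-odd danmaku one by one would take nearly 10 minutes. We can speed this up with multiple processes: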
from crc32 import crack
from multiprocessing import Pool
with Pool(8) as p:
    # 用户ID = user ID, 用户CRC = user hash column
    dm_df["用户ID"] = p.map(crack, dm_df.用户CRC)
dm_df
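In testing, reversing the 600-odd entries this way took a little over 1 minute 30 seconds.
Note: tune the number of processes to the actual number of CPU cores to get the best speedup; more processes is not always better.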
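As a sketch of that advice, the pool size can be derived from os.cpu_count() instead of being hard-coded. The helper names default_workers and parallel_map are hypothetical, and "one process per core" is a starting assumption to measure against, not a guaranteed optimum.

```python
import os
from multiprocessing import Pool


def default_workers(n_tasks, cpu_count=None):
    """Use at most one process per core, and never more than there are tasks."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(cores, n_tasks))


def parallel_map(func, items, workers=None):
    """Map a picklable, module-level `func` over `items` in a process pool."""
    items = list(items)
    workers = workers or default_workers(len(items))
    with Pool(workers) as pool:
        return pool.map(func, items)


print(default_workers(600, cpu_count=8))
print(default_workers(3, cpu_count=8))
```

With this in place, the call would look like dm_df["用户ID"] = parallel_map(crack, dm_df.用户CRC). When run as a script on platforms that start worker processes with spawn (Windows, and macOS by default), that call needs to sit under an if __name__ == "__main__": guard.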
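Looking up each user's nickname
With the user ID, we can look up the corresponding nickname. The endpoint used is https://api.bilibili.com/x/space/acc/info?mid=xxxx
The code is as follows: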
def get_nick_name(mid):
    res = requests.get(
        f"https://api.bilibili.com/x/space/acc/info?mid={mid}",
        headers=headers
    )
    r = res.json()
    if "data" in r and r["data"]:
        return r["data"]['name']
    else:
        # No data (for example, the request was blocked): show the raw response.
        print(r)
dm_df["用户昵称"] = dm_df.用户ID.apply(get_nick_name)
To speed this up with multiple threads, you can use the following code:
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=8) as executor:
    dm_df["用户昵称"] = list(executor.map(get_nick_name, dm_df.用户ID))
Use this speedup with caution: after I hit Bilibili's endpoints with multiple threads several times, every request kept returning:
{'code': -412, 'message': '请求被拦截', 'ttl': 1, 'data': None}
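A gentler approach is to issue requests one at a time and back off when a -412 response appears. The sketch below is only an illustration: the helper call_with_backoff is hypothetical, the delay values are arbitrary, and whether waiting actually lifts the block is an assumption that the tests described here do not confirm. The fetcher is passed in as a function so the retry logic can be demonstrated without touching the network.

```python
import time


def call_with_backoff(fetch, arg, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(arg); on a -412 response wait and retry with doubling delays."""
    delay = base_delay
    for attempt in range(retries + 1):
        result = fetch(arg)
        if result.get("code") != -412:
            return result
        if attempt < retries:
            sleep(delay)
            delay *= 2
    return result  # still blocked after all retries


# Demonstration with a fake fetcher that is "blocked" twice, then succeeds.
responses = iter([
    {"code": -412, "data": None},
    {"code": -412, "data": None},
    {"code": 0, "data": {"name": "demo"}},
])
waits = []
result = call_with_backoff(lambda mid: next(responses), 1, sleep=waits.append)
print(result["data"]["name"], waits)
```

In real use, fetch would be a function that performs the HTTP request and returns the decoded JSON dictionary.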
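Downloading danmaku like counts
Whenever the mouse hovers over a danmaku, the browser's developer tools show a request to an endpoint. Passing the video's cid and the danmaku's ID to that endpoint returns the like count of a single danmaku: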
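def get_dm_likes(ids):
    # Uses the module-level cid of the video being analysed.
    res = requests.get(
        f"https://api.bilibili.com/x/v2/dm/thumbup/stats?oid={cid}&ids={ids}", headers=headers)
    r = res.json()
    # "data" can be present but None (for example, when the request is blocked).
    if r.get("data"):
        return r["data"][ids]["likes"]
    else:
        print(r)
with ThreadPoolExecutor(max_workers=8) as executor:
    # 弹幕点赞数 = danmaku like count
    dm_df["弹幕点赞数"] = list(executor.map(get_dm_likes, dm_df.db_id))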
dm_df.sort_values(["弹幕点赞数", "发送时间"], ascending=False, inplace=True)
dm_df
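Since the response is keyed by danmaku ID, one possible way to reduce the number of requests is to ask for several IDs at once. This is an untested assumption: the article only shows one ID per request, and the comma-separated ids form, the batch size, and the helper names below are all hypothetical. The sketch covers only the batching and merging logic and performs no network request.

```python
def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_stats_url(cid, ids):
    # ASSUMPTION: several IDs can be joined with commas in the ids parameter.
    joined = ",".join(str(i) for i in ids)
    return f"https://api.bilibili.com/x/v2/dm/thumbup/stats?oid={cid}&ids={joined}"


def merge_likes(responses):
    """Collect {danmaku_id: likes} from response dicts, skipping ones without data."""
    likes = {}
    for r in responses:
        for dm_id, stats in (r.get("data") or {}).items():
            likes[dm_id] = stats["likes"]
    return likes


# Hand-written stand-ins for responses; the IDs and counts are made up.
fake_responses = [
    {"code": 0, "data": {"101": {"likes": 3}, "102": {"likes": 0}}},
    {"code": -412, "data": None},
]
print(list(chunked(["101", "102", "103"], 2)))
print(build_stats_url(544521953, ["101", "102"]))
print(merge_likes(fake_responses))
```

The merged dictionary could then be mapped onto the db_id column, for example with dm_df.db_id.map(likes).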