Parsing Bilibili Video Danmaku


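The endpoint https://api.bilibili.com/x/v1/dm/list.so?oid=xxxx returns the danmaku (bullet comments) of a Bilibili video. The oid it expects is the video's cid, so the question is how to obtain that value for a given video.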

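Parsing the video link

The following code extracts the video ID (av or BV) and the part number from a Bilibili video link, including b23.tv short links, and looks up the corresponding cid: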

import re
import requests

headers = {
    "authority": "api.bilibili.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "accept": "application/json, text/plain, */*",
}


def get_real_url(url):
    r = requests.head(url, headers=headers)
    return r.headers['Location']


def get_avbvid(url):
    if "b23.tv" in url:
        url = get_real_url(url)
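    # Split off the query string first, so that a "/" right before "?"
    # (as in ".../BV1xx/?p=2") or inside the query does not break the ID.
    path, _, query = url.partition("?")
    m_obj = re.search(r"(?:^|&)p=(\d+)", query)
    # p stays 0 when the link does not name a part
    p = int(m_obj.group(1)) if m_obj else 0
    # The ID is the last path segment, e.g. "av170001" or "BV1hL411N79E"
    avbvid = path.rstrip("/").rsplit("/", 1)[-1]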
    if avbvid.startswith("av"):
        return "aid", avbvid[2:], p
    else:
        return "bvid", avbvid, p


def get_cid(url, all_cid=False):
    t, avbvid, p = get_avbvid(url)
    res = requests.get(
        f"https://api.bilibili.com/x/web-interface/view?{t}={avbvid}", headers=headers)
    res.encoding = "u8"
    data = res.json()['data']
    cids = {row["page"]: (row['part'], row["cid"]) for row in data["pages"]}
    if all_cid:
        return cids
    elif p == 0:
        return data["title"], data["cid"]
    else:
        return cids[p]


url = "https://www.bilibili.com/video/BV1hL411N79E?spm_id_from=333.999.0.0"
title, cid = get_cid(url)
print(title, cid)
985毕业,裸辞……不卷不是中国人 544521953

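Testing confirms that this endpoint returns detailed information about the given video, including its cid.

The code above also works for multi-part videos: when the link carries a p parameter, it resolves the cid of the corresponding part.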
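The same link parsing can also be written with the standard library's urllib.parse, which separates the path from the query string. This is only a minimal sketch: parse_video_url is a hypothetical helper name that does not appear in the original code, and it assumes an already resolved bilibili.com link rather than a b23.tv short link.

```python
from urllib.parse import urlparse, parse_qs


def parse_video_url(url):
    """Return (id_type, video_id, page) for a bilibili.com video link.

    Hypothetical helper; page is 0 when the link has no p parameter.
    """
    parsed = urlparse(url)
    # The video ID is the last non-empty path segment, e.g. "BV1hL411N79E".
    segments = [s for s in parsed.path.split("/") if s]
    video_id = segments[-1]
    # parse_qs maps every query key to a list of values.
    page = int(parse_qs(parsed.query).get("p", ["0"])[0])
    if video_id.startswith("av"):
        return "aid", video_id[2:], page
    return "bvid", video_id, page


print(parse_video_url("https://www.bilibili.com/video/BV1hL411N79E/?p=2"))
```

The returned tuple has the same shape as the one produced by get_avbvid, so it could be dropped into get_cid unchanged.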

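Downloading the danmaku

With the cid, we can download the video's danmaku. The response is XML in which each comment is a d element: the element text is the comment itself, and its p attribute carries the metadata.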

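The p attribute is a comma-separated list of fields. For example, in

345.12100,1,25,16777215,1649119856,0,b0f43983,1023650567903644160,11

the fields are, in order: appearance time, mode, font size, color, send timestamp, danmaku pool, user hash, database ID, page.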
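To make the field layout concrete, here is a small sketch that splits a p string into named fields. DanmakuMeta and parse_p are hypothetical names introduced only for illustration, and the field meanings are taken from the list above, not from any official specification.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class DanmakuMeta(NamedTuple):
    offset: float      # appearance time, seconds into the video
    mode: int
    font_size: int
    color: str         # RGB as six uppercase hex digits
    sent_at: datetime  # send time, converted from the Unix timestamp (UTC)
    pool: int
    user_hash: str
    db_id: str
    page: str


def parse_p(p):
    """Split the comma-separated p attribute into typed, named fields."""
    offset, mode, size, color, ts, pool, user_hash, db_id, page = p.split(",")
    return DanmakuMeta(
        offset=float(offset),
        mode=int(mode),
        font_size=int(size),
        color=f"{int(color):06X}",
        sent_at=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        pool=int(pool),
        user_hash=user_hash,
        db_id=db_id,
        page=page,
    )


meta = parse_p("345.12100,1,25,16777215,1649119856,0,b0f43983,1023650567903644160,11")
print(meta.offset, meta.color, meta.user_hash)
```

Zero-padding the color to six digits keeps dark colors such as 255 readable as 0000FF instead of FF.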

The appearance time is the playback position in seconds. The user hash is the user ID hashed with the CRC32 algorithm; for the underlying theory see www.itu.int/rec/T-REC-I…
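The forward direction of this hash is easy to reproduce, which is useful for checking a candidate user ID against a hash. The sketch below assumes that the hash is the standard CRC-32 (as implemented by zlib) of the decimal ID string, written as lowercase hex. That assumption is consistent with the polynomial 0xEDB88320 used by the reverse lookup code later in this article, but it is not documented by Bilibili. The names mid_to_hash and matches are hypothetical.

```python
import zlib


def mid_to_hash(mid):
    """CRC-32 of the decimal user ID, as lowercase hex without zero padding."""
    return format(zlib.crc32(str(mid).encode("ascii")), "x")


def matches(mid, user_hash):
    """Check a candidate user ID against a hash from a danmaku record."""
    # Compare as integers so that case and leading zeros do not matter.
    return zlib.crc32(str(mid).encode("ascii")) == int(user_hash, 16)


print(mid_to_hash(123456789))  # → cbf43926
```

Because CRC-32 has only 2^32 possible values, different user IDs can share one hash, so a match confirms a candidate without proving it is the only one.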

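The database ID can be used later to query details such as the number of likes on that comment.

The code for downloading and parsing the danmaku is as follows: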

import requests
import xmltodict
import datetime
import pandas as pd


def download_dm(cid):
    res = requests.get(
        f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}", headers=headers)
    res.encoding = "u8"
    data = []
    for row in xmltodict.parse(res.text)["i"]["d"]:
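        text = row["#text"]  # danmaku text
        # Fields of p: appearance time, mode, font size, color, send timestamp,
        # danmaku pool, user hash, database ID, page
        time_diff, mode, font_size, color, timestamp, pool, id_hash, ids, page = row["@p"].split(",")
        # Format the offset into the video as H:MM:SS.ss
        minutes, seconds = divmod(round(float(time_diff), 2), 60)
        hours, minutes = divmod(int(minutes), 60)
        time_diff = f"{hours}:{minutes:02d}:{seconds:05.2f}"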
        time = str(datetime.datetime.fromtimestamp(int(timestamp)))
        color = hex(int(color))[2:].upper()
        data.append([time, time_diff, font_size, color, id_hash, ids, text])
    dm_df = pd.DataFrame(
        data, columns=["发送时间", "弹幕出现时间", "字体大小", "颜色", "用户CRC", "db_id", "弹幕内容"])
    return dm_df


dm_df = download_dm(cid)
dm_df

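Reverse lookup from user hash to user ID

Someone has implemented this reverse lookup in JavaScript, converting a user hash back to the original user ID:

github.com/esterTion/B…

Reference: 用crc彩虹表反向B站弹幕“匿名”?我不想浪费内存,但是要和彩虹表一样快! (roughly: "Using a CRC rainbow table to de-anonymize Bilibili danmaku? I don't want to waste memory, but it has to be as fast as a rainbow table!")

Based on that JavaScript code, we can port it to Python: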

# Reversed (LSB-first) CRC-32 polynomial
CRCPOLYNOMIAL = 0xEDB88320
# Lookup table indexed by byte value, filled in by the loop below
crctable = [0 for x in range(256)]

for i in range(256):
    crcreg = i
    for _ in range(8):
        if (crcreg & 1) != 0:
            crcreg = CRCPOLYNOMIAL ^ (crcreg >> 1)
        else:
            crcreg = crcreg >> 1
    crctable[i] = crcreg

def crc32(text):
    crcstart = 0xFFFFFFFF
    for i in range(len(str(text))):
        index = (crcstart ^ ord(str(text)[i])) & 255
        crcstart = (crcstart >> 8) ^ crctable[index]
    return crcstart

def crc32_last_index(text):
    crcstart = 0xFFFFFFFF
    for i in range(len(str(text))):
        index = (crcstart ^ ord(str(text)[i])) & 255
        crcstart = (crcstart >> 8) ^ crctable[index]
    return index

def get_crc_index(t):
    for i in range(256):
        if crctable[i] >> 24 == t:
            return i
    return -1

def deep_check(i, index):
    text = ""
    tc=0x00
    hashcode = crc32(i)
    tc = hashcode & 0xff ^ index[2]
    if not (tc <= 57 and tc >= 48):
        return [0]
    text += str(tc - 48)
    hashcode = crctable[index[2]] ^ (hashcode >>8)
    tc = hashcode & 0xff ^ index[1]
    if not (tc <= 57 and tc >= 48):
        return [0]
    text += str(tc - 48)
    hashcode = crctable[index[1]] ^ (hashcode >> 8)
    tc = hashcode & 0xff ^ index[0]
    if not (tc <= 57 and tc >= 48):
        return [0]
    text += str(tc - 48)
    hashcode = crctable[index[0]] ^ (hashcode >> 8)
    return [1, text]

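def crack(text):
    """Return a user ID whose CRC-32 is the hex string `text`, or -1 if none is found."""
    index = [0 for x in range(4)]
    # Undo the final XOR, then recover the table indexes used for the last four bytes.
    ht = int(text, 16) ^ 0xffffffff
    for i in range(3, -1, -1):
        index[3-i] = get_crc_index(ht >> (i*8))
        snum = crctable[index[3-i]]
        ht ^= snum >> ((3-i)*8)
    # Enumerate the leading digits; deep_check derives the remaining three digits.
    for i in range(100000000):
        lastindex = crc32_last_index(i)
        if lastindex == index[3]:
            deepCheckData = deep_check(i, index)
            if deepCheckData[0]:
                return f"{i}{deepCheckData[1]}"
    # No candidate below the search limit matched.
    return -1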

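Save the code above as crc32.py.

In testing, reversing a single hash takes about 1 second, so reversing all 600-plus comments would take close to 10 minutes. We can speed this up with multiple processes: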

from crc32 import crack
from multiprocessing import Pool

with Pool(8) as p:
    dm_df["用户ID"] = p.map(crack, dm_df.用户CRC)
dm_df

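In testing, the reverse lookup of the 600-plus records finished in a little over 1 minute 30 seconds.

Note: tune the number of processes to the actual number of CPU cores for the best speedup; more processes is not always better.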
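Since one user often sends several comments, the same hash can appear many times in the table. A complementary way to save time, independent of the process count, is to crack each distinct hash only once. The sketch below is an assumption-laden illustration: map_unique is a hypothetical helper, and how much it saves depends entirely on how many hashes repeat in a given video.

```python
def map_unique(func, values, mapper=map):
    """Apply func once per distinct value, then expand back to full length.

    mapper is any map-like callable, e.g. the built-in map or Pool.map.
    """
    values = list(values)
    unique = list(dict.fromkeys(values))  # distinct values in first-seen order
    results = dict(zip(unique, mapper(func, unique)))
    return [results[v] for v in values]


# Possible use with the article's names (not executed here):
#     with Pool(8) as p:
#         dm_df["用户ID"] = map_unique(crack, dm_df.用户CRC, mapper=p.map)
print(map_unique(str.upper, ["b0f43983", "1a2b3c4d", "b0f43983"]))
```

When the code runs as a script on a platform that starts worker processes with spawn (Windows, and macOS by default), the Pool block also needs to sit under an `if __name__ == "__main__":` guard.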

Looking up each user's nickname

With the user IDs, we can look up the corresponding nicknames through the x/space/acc/info endpoint. The code is as follows:

def get_nick_name(mid):
    res = requests.get(
        f"https://api.bilibili.com/x/space/acc/info?mid={mid}",
        headers=headers
    )
    r = res.json()
    if "data" in r and r["data"]:
        return r["data"]['name']
    else:
        print(r)

dm_df["用户昵称"] = dm_df.用户ID.apply(get_nick_name)

For a multithreaded speedup, the following code can be used:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as executor:
    dm_df["用户昵称"] = list(executor.map(get_nick_name, dm_df.用户ID))

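Use this speedup with caution: after I hit Bilibili's API with multiple threads several times, it kept returning {'code': -412, 'message': '请求被拦截', 'ttl': 1, 'data': None} (the message means "request blocked").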
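One common way to be gentler on an API is to pause and retry when a request is rejected, with a delay that doubles each time. The following is a minimal sketch of that idea, not something the original code does: call_with_backoff is a hypothetical helper, fetch is assumed to return the parsed JSON as a dict, and whether a short pause is actually enough to lift a -412 block is not established here.

```python
import time


def call_with_backoff(fetch, *args, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(*args); while the reply has code -412, wait and retry.

    The delay doubles after every blocked attempt. The last reply is
    returned even if it is still blocked, so the caller can inspect it.
    """
    for attempt in range(retries + 1):
        payload = fetch(*args)
        if payload.get("code") != -412:
            return payload
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)
    return payload


# Demo with a fake fetch that is blocked twice before succeeding.
replies = iter([{"code": -412}, {"code": -412}, {"code": 0, "data": {"name": "demo"}}])
waits = []
print(call_with_backoff(lambda mid: next(replies), 1, sleep=waits.append), waits)
```

Passing sleep as a parameter keeps the demo instant; in real use the default time.sleep applies, and lowering max_workers is likely to matter at least as much as retrying.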

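Downloading danmaku like counts

Hovering the mouse over a comment triggers an API request that is visible in the browser's developer tools. Passing the video's cid and the comment's database ID to the x/v2/dm/thumbup/stats endpoint returns the like count of a single comment: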

def get_dm_likes(ids):
    res = requests.get(
        f"https://api.bilibili.com/x/v2/dm/thumbup/stats?oid={cid}&ids={ids}", headers=headers)
    r = res.json()
    if r.get("data"):  # data is None when the request is rejected
        return r["data"][ids]["likes"]
    else:
        print(r)

with ThreadPoolExecutor(max_workers=8) as executor:
    dm_df["弹幕点赞数"] = list(executor.map(get_dm_likes, dm_df.db_id))
dm_df.sort_values(["弹幕点赞数", "发送时间"], ascending=False, inplace=True)
dm_df