A First Look at WebGPU
Problems Encountered in Real Work
While developing Meta2d, I ran into quite a few performance bottlenecks, for example:
- Janky batch operations: when batch-moving thousands of selected shapes on the canvas, the per-shape computation runs serially and becomes very slow, so the page stutters. (Admittedly the core problem lies elsewhere: history snapshots are deep-copied, which is extremely costly in both time and memory. Still, the serial computation is a worthwhile optimization target.)
- Poor canvas filter performance: shape filters go through the canvas `filter` property by default, and there is currently no reliable way to know which filters are GPU-accelerated and which are not. Whether a filter runs on the GPU depends on the browser implementation, the filter type, the rendering path, and whether a software fallback is triggered; this is all scheduled by the browser, and I have found no documentation that spells it out. In practice, canvas `filter` performance is underwhelming: applying complex filters to shapes causes visible jank, which blocks the broad, efficient use of filters in real projects.
- Limited filter flexibility: canvas filters mirror CSS filters, a fixed set of browser-provided parameterized effects. Functionally they are limited: there is no pixel-level access and no way to implement custom filter algorithms, so arbitrary image-processing results are out of reach. SVG filters seem to fill part of this gap, but they are still limited, and they are unavailable in an OffscreenCanvas (an OffscreenCanvas in a worker has no DOM; outside the DOM tree it cannot reference SVG elements, so SVG filters cannot be applied).
Given these problems, what capability would actually solve them? It must support parallel computation without blocking the JS main thread; it must accept an image and allow arbitrary pixel-level processing; and it must be programmable.
Candidate Solutions
WASM
My first idea was to use WASM to speed up the computation. WASM supports SIMD, which seems to offer some parallelism, but the capability is limited: it still runs on the CPU and is bounded by the CPU's core count. And although WASM approaches native performance, it still talks to the page through a JS glue layer, and the frequent context switches carry overhead. From a memory standpoint, WASM exchanges data with JS through linear memory, so multi-dimensional data such as images must be flattened before processing, which is inconvenient.
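As a concrete illustration of that last point, here is a minimal sketch of the index arithmetic that flattening forces on you once image pixels live in one linear buffer (the function name and the 3×2 image are made up for illustration, not taken from Meta2d):

```javascript
// Sketch: image pixels as WASM sees them — one flat typed array in linear
// memory. A 2D (x, y) pixel access becomes manual offset arithmetic.
function pixelOffset(x, y, width) {
  return (y * width + x) * 4; // 4 bytes per pixel: R, G, B, A
}

// Example: a 3x2 RGBA image stored as a flat Uint8Array.
const width = 3, height = 2;
const pixels = new Uint8Array(width * height * 4);

// Write pixel (2, 1) as opaque red via manual offset math.
const off = pixelOffset(2, 1, width);
pixels[off + 0] = 255; // R
pixels[off + 3] = 255; // A
```

Every 2D operation has to be rewritten in this one-dimensional form before WASM can touch it, which is exactly the inconvenience described above.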
WebGPU
I had heard about WebGPU long before this, but I always assumed that without WebGL experience I couldn't pick it up. Then, by chance, I came across a blog post by Surma, and it opened my eyes to how powerful WebGPU is. It was my first hands-on demo of general-purpose GPU computing, and I realized that if I could put WebGPU's compute performance to work, many web performance problems would be solved outright, and far more impressive effects would become possible (FFT-based audio denoising, real-time video filters, 2D particle systems...). It felt as if a Pandora's box had opened in my head.
Let's look at a few real examples first.
Real Examples
First, a caveat: the GPU capability exposed in the browser is a cut-down version and cannot unleash the GPU's full power. There are several reasons: the browser sandbox restricts what a page may do; the browser will not let a single page monopolize the GPU; and the cross-platform API abstraction cannot expose every low-level capability. Actual WebGPU performance also depends on the GPU model and its memory, just as with CPUs. My machine is a base-model M2 running macOS; let's see what this cut-down GPU can do.
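Before running anything heavy, it can be useful to check what the adapter actually grants the page. A minimal sketch (in a real page you would read `adapter.limits` from `await navigator.gpu.requestAdapter()`; the numeric values below are illustrative placeholders, not measurements from my M2):

```javascript
// Format a few fields of a GPUSupportedLimits-shaped object. Kept as a pure
// function over a plain object so it can run outside a browser too.
function describeLimits(limits) {
  return [
    `max workgroup invocations: ${limits.maxComputeInvocationsPerWorkgroup}`,
    `max workgroups per dimension: ${limits.maxComputeWorkgroupsPerDimension}`,
    `max storage buffer binding: ${limits.maxStorageBufferBindingSize} bytes`,
  ].join("\n");
}

// Illustrative values only — real numbers vary per device and browser.
const report = describeLimits({
  maxComputeInvocationsPerWorkgroup: 256,
  maxComputeWorkgroupsPerDimension: 65535,
  maxStorageBufferBindingSize: 134217728,
});
console.log(report);
```

These limits are one concrete face of the "cut-down" nature described above: the spec guarantees only modest minimums, and the browser may grant less than the hardware could do.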
Although I have some rough knowledge of graphics fundamentals, I am far from able to hand-roll a renderer, so I won't show WebGPU used for graphics rendering here; I'll only show a few general-purpose compute examples.
Tens of billions of operations per second: 30,000 elastically colliding balls rendered with canvas 2D
First we implement 30,000 balls bouncing off each other elastically. This example uses no optimization tricks at all: a brute-force pass checks every ball against every other ball, then the frame is rendered. Since we are not using WebGPU for rendering here, canvas 2D serves as the rendering backend: the GPU does the core collision computation and hands the results back to the canvas for drawing. Source code and the running result:
<!DOCTYPE html>
<html lang="zh">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Ball collision simulation</title>
<style>
body {
margin: 0;
padding: 20px;
font-family: Arial, sans-serif;
background: #1a1a1a;
color: white;
}
#container {
display: flex;
flex-direction: column;
align-items: center;
gap: 10px;
}
canvas {
border: 2px solid #333;
background: #000;
}
#fps {
font-size: 24px;
font-weight: bold;
color: #0f0;
}
#controls {
display: flex;
gap: 20px;
margin-top: 10px;
}
label {
display: flex;
align-items: center;
gap: 5px;
}
input {
padding: 5px;
}
</style>
</head>
<body>
<div id="container">
<div>Ball count: <span id="number">0</span></div>
<canvas></canvas>
</div>
<script type="module">
function random(a, b) {
return Math.random() * (b - a) + a;
}
const NUM_BALLS = 30000;
const BUFFER_SIZE = NUM_BALLS * 6 * Float32Array.BYTES_PER_ELEMENT;
const minRadius = 2
const maxRadius = 2
document.querySelector("#number").textContent = NUM_BALLS
const ctx = document.querySelector("canvas").getContext("2d");
ctx.canvas.width = 1500
ctx.canvas.height = 900
function fatal(msg) {
document.body.innerHTML = `<pre>${msg}</pre>`;
throw Error(msg);
}
if (!("gpu" in navigator)) fatal("WebGPU not supported");
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) fatal("Couldn’t request WebGPU adapter");
const device = await adapter.requestDevice();
if (!device) fatal("Couldn’t request WebGPU device");
// shader
const module = device.createShaderModule({
code: `
struct Ball {
radius: f32,
position: vec2<f32>,
velocity: vec2<f32>,
}
struct Scene {
width: f32,
height: f32,
}
@group(0) @binding(0)
var<storage, read> input: array<Ball>; // binding 0: previous frame state (read-only input)
@group(0) @binding(1)
var<storage, read_write> output: array<Ball>; // binding 1: next frame state (output)
@group(0) @binding(2)
var<storage, read> scene: Scene; // scene width/height
const PI: f32 = 3.14159; // PI constant
const TIME_STEP: f32 = 0.016;
// the compute shader
@compute @workgroup_size(256) // workgroup size; 256 is a common default
fn main(
@builtin(global_invocation_id) // global invocation id
global_id : vec3<u32>,
) {
let num_balls = arrayLength(&output); // number of balls in the array
if(global_id.x >= num_balls) {
return;
}
var src_ball = input[global_id.x]; // by-value copy of this invocation's ball
let dst_ball = &output[global_id.x]; // pointer to the output ball's storage
(*dst_ball) = src_ball; // dereference: start the output ball from the current state
// elastic ball-ball collisions, brute force: O(N^2)
for(var i = 0u; i < num_balls; i = i + 1u) {
if(i == global_id.x) { // don't collide with ourselves
continue;
}
var other_ball = input[i]; // the other ball's state
let n = src_ball.position - other_ball.position;
let distance = length(n);
if(distance >= src_ball.radius + other_ball.radius) { // not touching: no collision, skip
continue;
}
// post-collision velocities
let overlap = src_ball.radius + other_ball.radius - distance;
(*dst_ball).position = src_ball.position + normalize(n) * overlap/2.;
let src_mass = pow(src_ball.radius, 2.0) * PI;
let other_mass = pow(other_ball.radius, 2.0) * PI;
let c = 2.*dot(n, (other_ball.velocity - src_ball.velocity)) / (dot(n, n) * (1./src_mass + 1./other_mass));
(*dst_ball).velocity = src_ball.velocity + c/src_mass * n;
}
(*dst_ball).position = (*dst_ball).position + (*dst_ball).velocity * TIME_STEP;
// ball vs. wall collisions
if((*dst_ball).position.x - (*dst_ball).radius < 0.) {
(*dst_ball).position.x = (*dst_ball).radius;
(*dst_ball).velocity.x = -(*dst_ball).velocity.x;
}
if((*dst_ball).position.y - (*dst_ball).radius < 0.) {
(*dst_ball).position.y = (*dst_ball).radius;
(*dst_ball).velocity.y = -(*dst_ball).velocity.y;
}
if((*dst_ball).position.x + (*dst_ball).radius >= scene.width) {
(*dst_ball).position.x = scene.width - (*dst_ball).radius;
(*dst_ball).velocity.x = -(*dst_ball).velocity.x;
}
if((*dst_ball).position.y + (*dst_ball).radius >= scene.height) {
(*dst_ball).position.y = scene.height - (*dst_ball).radius;
(*dst_ball).velocity.y = -(*dst_ball).velocity.y;
}
}
`,
});
const bindGroupLayout = device.createBindGroupLayout({ // bind group layout: describes the buffers to the GPU
entries: [
{
binding: 0,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "read-only-storage",
},
},
{
binding: 1,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "storage",
},
},
{
binding: 2,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "read-only-storage",
},
},
],
});
const pipeline = device.createComputePipeline({
layout: device.createPipelineLayout({
bindGroupLayouts: [bindGroupLayout],
}),
compute: {
module,
entryPoint: "main",
},
});
const scene = device.createBuffer({
size: 2 * Float32Array.BYTES_PER_ELEMENT,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
const input = device.createBuffer({
size: BUFFER_SIZE,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
const output = device.createBuffer({
size: BUFFER_SIZE,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});
const stagingBuffer = device.createBuffer({
size: BUFFER_SIZE,
usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
const bindGroup = device.createBindGroup({
layout: bindGroupLayout,
entries: [
{
binding: 0,
resource: {
buffer: input,
},
},
{
binding: 1,
resource: {
buffer: output,
},
},
{
binding: 2,
resource: {
buffer: scene,
},
},
],
});
function raf() {
return new Promise((resolve) => requestAnimationFrame(resolve));
}
let inputBalls = new Float32Array(new ArrayBuffer(BUFFER_SIZE));
for (let i = 0; i < NUM_BALLS; i++) {
inputBalls[i * 6 + 0] = random(minRadius, maxRadius);
inputBalls[i * 6 + 2] = random(0, ctx.canvas.width);
inputBalls[i * 6 + 3] = random(0, ctx.canvas.height);
inputBalls[i * 6 + 4] = random(-100, 100);
inputBalls[i * 6 + 5] = random(-100, 100);
}
let outputBalls;
device.queue.writeBuffer(
scene,
0,
new Float32Array([ctx.canvas.width, ctx.canvas.height])
);
while (true) {
device.queue.writeBuffer(input, 0, inputBalls);
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
const dispatchSize = Math.ceil(NUM_BALLS / 256);
passEncoder.dispatchWorkgroups(dispatchSize);
passEncoder.end();
commandEncoder.copyBufferToBuffer(output, 0, stagingBuffer, 0, BUFFER_SIZE);
const commands = commandEncoder.finish();
device.queue.submit([commands]);
await stagingBuffer.mapAsync(GPUMapMode.READ, 0, BUFFER_SIZE);
const copyArrayBuffer = stagingBuffer.getMappedRange(0, BUFFER_SIZE);
const data = copyArrayBuffer.slice();
outputBalls = new Float32Array(data);
stagingBuffer.unmap();
drawScene(outputBalls);
inputBalls = outputBalls;
await raf();
}
// render with canvas 2d
function drawScene(balls) {
ctx.save();
ctx.scale(1, -1);
ctx.translate(0, -ctx.canvas.height);
ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
ctx.fillStyle = "red";
for (let i = 0; i < balls.length; i += 6) {
const r = balls[i + 0];
const px = balls[i + 2];
const py = balls[i + 3];
const vx = balls[i + 4];
const vy = balls[i + 5];
let angle = Math.atan(vy / (vx === 0 ? Number.EPSILON : vx));
if (vx < 0) angle += Math.PI;
const ex = px + Math.cos(angle) * Math.sqrt(2) * r;
const ey = py + Math.sin(angle) * Math.sqrt(2) * r;
ctx.beginPath();
ctx.arc(px, py, r, 0, 2 * Math.PI, true);
ctx.moveTo(ex, ey);
ctx.arc(px, py, r, angle - Math.PI / 4, angle + Math.PI / 4, true);
ctx.lineTo(ex, ey);
ctx.closePath();
ctx.fill();
}
ctx.restore();
}
</script>
</body>
</html>
Figure 1. 30,000 balls in elastic collision
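A detail in the listing above worth calling out: because `vec2<f32>` has 8-byte alignment in WGSL, the `Ball` struct (`radius: f32`, `position: vec2<f32>`, `velocity: vec2<f32>`) occupies 24 bytes, with one unused f32 of padding after `radius`. That is why the JS side packs each ball into 6 floats and never writes index 1. A small sketch of the arithmetic (assuming standard WGSL host-shareable layout rules):

```javascript
// Byte layout of the WGSL Ball struct under WGSL alignment rules (sketch).
// vec2<f32> has align 8, so `position` cannot start at byte 4; a 4-byte
// pad follows `radius`.
function ballLayout() {
  const offsets = {
    radius: 0,    // f32, 4 bytes
    // bytes 4..7: padding so position lands on an 8-byte boundary
    position: 8,  // vec2<f32>, 8 bytes
    velocity: 16, // vec2<f32>, 8 bytes
  };
  const strideBytes = 24; // struct size, rounded up to its alignment (8)
  return { offsets, strideBytes, strideFloats: strideBytes / 4 };
}

const layout = ballLayout();
// Matches the JS packing: inputBalls[i * 6 + 0] = radius, index i * 6 + 1
// is the padding slot, i * 6 + 2/3 = position, i * 6 + 4/5 = velocity.
console.log(layout.strideFloats); // 6
```

Getting this stride wrong is one of the most common WebGPU beginner bugs: the shader silently reads shifted garbage rather than throwing an error.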
Because of the GIF's frame rate the recording looks slightly choppy, but the frame counter shows the scene holding around 40 fps with no optimization at all. That is impressively fast.
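These numbers also let us sanity-check the heading's claim of tens-of-billions-scale computation with back-of-envelope arithmetic (counting only the pairwise distance tests, nothing else):

```javascript
// Back-of-envelope: pairwise collision checks per second at 30,000 balls
// and roughly 40 fps, counting only the O(N^2) distance tests.
const N = 30000;
const fps = 40;
const checksPerFrame = N * (N - 1);        // each ball scans every other ball
const checksPerSecond = checksPerFrame * fps;
console.log(checksPerSecond.toExponential(2)); // ~3.6e10 checks per second
```

Each check itself involves several floating-point operations, so the actual arithmetic throughput is a multiple of this figure — comfortably in the tens of billions.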
For comparison, let's see how the CPU fares. Since the workload is enormous, we reduce the ball count to ten thousand to avoid freezing the page. Code and footage:
<!DOCTYPE html>
<html lang="zh">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Ball collision simulation</title>
<style>
body {
margin: 0;
padding: 20px;
font-family: Arial, sans-serif;
background: #1a1a1a;
color: white;
}
#container {
display: flex;
flex-direction: column;
align-items: center;
gap: 10px;
}
canvas {
border: 2px solid #333;
background: #000;
}
label {
display: flex;
align-items: center;
gap: 5px;
}
input {
padding: 5px;
}
</style>
</head>
<body>
<div id="container">
<div>Ball count: <span id="number">0</span></div>
<canvas></canvas>
</div>

<script>
const params = new URLSearchParams(location.search);
function parameter(name, def) {
if (!params.has(name)) return def;
return parseFloat(params.get(name));
}
let NUM_BALLS = parameter("balls", 10000);
const minRadius = parameter("min_radius", 2);
const maxRadius = parameter("max_radius", 2);
const render = parameter("render", 1);
const ctx = document.querySelector("canvas").getContext("2d");
ctx.canvas.width = parameter("width", 1500);
ctx.canvas.height = parameter("height", 900);
const ballCountInput = document.getElementById("number");
ballCountInput.textContent = NUM_BALLS;
// ball data structure
class Ball {
constructor(radius, x, y, vx, vy) {
this.radius = radius;
this.x = x;
this.y = y;
this.vx = vx;
this.vy = vy;
}
}
let balls = [];
const TIME_STEP = 0.016;
const PI = Math.PI;
function random(a, b) {
return Math.random() * (b - a) + a;
}
function initBalls() {
balls = [];
for (let i = 0; i < NUM_BALLS; i++) {
balls.push(new Ball(
random(minRadius, maxRadius),
random(0, ctx.canvas.width),
random(0, ctx.canvas.height),
random(-100, 100),
random(-100, 100)
));
}
}
function updatePhysics() {
const newBalls = balls.map(ball => new Ball(
ball.radius,
ball.x,
ball.y,
ball.vx,
ball.vy
));
// ball-ball collisions
for (let i = 0; i < balls.length; i++) {
const ball = balls[i];
const newBall = newBalls[i];
for (let j = 0; j < balls.length; j++) {
if (i === j) continue;
const other = balls[j];
const dx = ball.x - other.x;
const dy = ball.y - other.y;
const distance = Math.sqrt(dx * dx + dy * dy);
// collision test
if (distance < ball.radius + other.radius) {
const overlap = ball.radius + other.radius - distance;
// separate overlapping balls
const nx = dx / distance;
const ny = dy / distance;
newBall.x = ball.x + nx * overlap / 2;
newBall.y = ball.y + ny * overlap / 2;
// post-collision velocities (elastic collision)
const ballMass = Math.pow(ball.radius, 2) * PI;
const otherMass = Math.pow(other.radius, 2) * PI;
const dvx = other.vx - ball.vx;
const dvy = other.vy - ball.vy;
const dotProduct = dx * dvx + dy * dvy;
const normSquared = dx * dx + dy * dy;
const c = 2 * dotProduct / (normSquared * (1 / ballMass + 1 / otherMass));
newBall.vx = ball.vx + (c / ballMass) * dx;
newBall.vy = ball.vy + (c / ballMass) * dy;
}
}
// integrate velocity
newBall.x += newBall.vx * TIME_STEP;
newBall.y += newBall.vy * TIME_STEP;
// wall collisions
if (newBall.x - newBall.radius < 0) {
newBall.x = newBall.radius;
newBall.vx = -newBall.vx;
}
if (newBall.y - newBall.radius < 0) {
newBall.y = newBall.radius;
newBall.vy = -newBall.vy;
}
if (newBall.x + newBall.radius >= ctx.canvas.width) {
newBall.x = ctx.canvas.width - newBall.radius;
newBall.vx = -newBall.vx;
}
if (newBall.y + newBall.radius >= ctx.canvas.height) {
newBall.y = ctx.canvas.height - newBall.radius;
newBall.vy = -newBall.vy;
}
}
balls = newBalls;
}
function drawScene() {
ctx.save();
ctx.scale(1, -1);
ctx.translate(0, -ctx.canvas.height);
ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
ctx.fillStyle = "red";
for (const ball of balls) {
const angle = Math.atan(ball.vy / (ball.vx === 0 ? Number.EPSILON : ball.vx));
const correctedAngle = ball.vx < 0 ? angle + Math.PI : angle;
const ex = ball.x + Math.cos(correctedAngle) * Math.sqrt(2) * ball.radius;
const ey = ball.y + Math.sin(correctedAngle) * Math.sqrt(2) * ball.radius;
ctx.beginPath();
ctx.arc(ball.x, ball.y, ball.radius, 0, 2 * Math.PI, true);
ctx.moveTo(ex, ey);
ctx.arc(ball.x, ball.y, ball.radius, correctedAngle - Math.PI / 4, correctedAngle + Math.PI / 4, true);
ctx.lineTo(ex, ey);
ctx.closePath();
ctx.fill();
}
ctx.restore();
}
function animate() {
updatePhysics();
if (render !== 0) {
drawScene();
}
requestAnimationFrame(animate);
}
// initialize and start the animation
initBalls();
animate();
</script>
</body>
</html>
Figure 2. CPU computing 10,000 balls in elastic collision
Even with the ball count cut to ten thousand, the CPU manages only about 4 frames per second, roughly a 30-fold performance gap versus the GPU version (and since the pairwise work grows as O(N²), the gap at equal ball counts would be larger still).
Real-time video filters
Another major use case for WebGPU is real-time filters and effects on video streams. This scenario is extremely performance-sensitive: every frame requires a large amount of per-pixel computation, and only by fully exploiting the GPU's massive parallelism can we keep latency low while maintaining smooth real-time playback.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<style>
button {
display: block;
}
canvas {
width: 1000px;
margin-top: 24px;
object-fit: contain;
}
.error {
color: red;
}
.fullscreen {
background: black;
margin: 0;
width: 100vw;
object-fit: contain;
position: fixed;
top: 0;
left: 0;
}
</style>
</head>
<body>
<h2>WebGPU video filter demo</h2>
<button id="button">Play video</button>
<pre id="logs"></pre>
<video id="video" style="width: 1000px"></video>
<canvas id="canvas"></canvas>
<script>
window.onunhandledrejection = (event) => {
let error = document.createElement("p");
error.textContent = event.reason;
error.classList.add('error');
logs.appendChild(error);
};
window.onerror = (err) => {
let error = document.createElement("p");
error.textContent = err;
error.classList.add('error');
logs.appendChild(error);
};
</script>
<script type="module">
const video = document.querySelector("video");
video.addEventListener('loadedmetadata', () => {
canvas.width = video.videoWidth;
canvas.height = video.videoHeight;
});
button.onclick = async () => {
video.src = './视频.MOV';
await video.play();
(function render() {
const videoFrame = new VideoFrame(video);
applyFilter(videoFrame);
requestAnimationFrame(render);
})();
};
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const format = navigator.gpu.getPreferredCanvasFormat();
const context = document.querySelector("canvas").getContext("webgpu");
context.configure({device, format});
const module = device.createShaderModule({
code: `
@vertex
fn vertexMain(@builtin(vertex_index) i : u32) -> @builtin(position) vec4f {
const quadPos = array(vec2f(-1, 1), vec2f(-1, -1), vec2f(1, 1), vec2f(1, -1));
return vec4f(quadPos[i], 0, 1);
}
@group(0) @binding(0) var myTexture: texture_external;
@fragment
fn fragmentMain(@builtin(position) position : vec4f) -> @location(0) vec4f {
let result = textureLoad(myTexture, vec2u(position.xy));
if (position.x > f32(textureDimensions(myTexture).x / 2)) {
return result;
}
let gray = dot(result.xyz, vec3f(1, 0.71, 0.07));
return vec4f(gray, gray, gray, 1);
}
`
});
const pipeline = device.createRenderPipeline({
layout: "auto",
vertex: {module},
fragment: {module, targets: [{format}]},
primitive: {topology: "triangle-strip"}
});
function applyFilter(videoFrame) {
const texture = device.importExternalTexture({source: videoFrame});
const bindgroup = device.createBindGroup({
layout: pipeline.getBindGroupLayout(0),
entries: [{binding: 0, resource: texture}]
});
const commandEncoder = device.createCommandEncoder();
const colorAttachments = [
{
view: context.getCurrentTexture().createView(),
loadOp: "clear",
storeOp: "store"
}
];
const passEncoder = commandEncoder.beginRenderPass({colorAttachments});
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindgroup);
passEncoder.draw(4);
passEncoder.end();
device.queue.submit([commandEncoder.finish()]);
videoFrame.close();
}
</script>
</body>
</html>
Figure 3. Real-time video processing with WebGPU + WebCodecs
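The fragment shader above converts the left half of each frame to grayscale with a weighted dot product. Note that the demo's weights (1, 0.71, 0.07) sum to well above 1, so the gray half renders noticeably brighter than a standard luma conversion would. For reference, here is the same per-pixel computation on the CPU side using the commonly used Rec. 709 coefficients (a sketch for comparison, not a change to the demo):

```javascript
// Reference grayscale conversion with Rec. 709 luma coefficients,
// mirroring the shader's `dot(result.xyz, weights)` per pixel.
function luma709(r, g, b) {
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// The coefficients sum to 1, so white stays white.
console.log(luma709(1, 1, 1)); // ≈ 1
```

On the GPU this single multiply-add runs once per pixel per frame, which is exactly the kind of embarrassingly parallel work fragment shaders are built for.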
Beyond this, WebGPU can also deliver substantial performance for audio processing. Tasks such as spectral analysis (FFT), convolution reverb, and real-time filtering are all inherently data-parallel, compute-intensive workloads. In the traditional web stack they could only rely on the CPU or on the Web Audio API's fixed capabilities, with clear ceilings in both extensibility and performance. WebGPU lets developers map large volumes of audio data onto the GPU for parallel computation, making audio algorithms feasible in the browser that previously could not run in real time.
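To see why spectral analysis parallelizes so well, consider the naive DFT: every output bin is an independent sum over the input, so each bin can be one GPU invocation — the same shape as the compute shader in the ball demo, where each invocation owned one ball. A CPU reference sketch (illustrative only; a real implementation would use an FFT and move the inner loop into WGSL):

```javascript
// Naive DFT: X[k] = sum_n x[n] * e^(-2*pi*i*k*n/N).
// Each bin k is independent of every other bin — on the GPU,
// one invocation per bin.
function dft(signal) {
  const N = signal.length;
  const re = new Float64Array(N);
  const im = new Float64Array(N);
  for (let k = 0; k < N; k++) {   // this outer loop is the parallel axis
    for (let n = 0; n < N; n++) {
      const phase = (-2 * Math.PI * k * n) / N;
      re[k] += signal[n] * Math.cos(phase);
      im[k] += signal[n] * Math.sin(phase);
    }
  }
  return { re, im };
}

// A constant signal has all its energy in bin 0 (the DC bin).
const { re } = dft([1, 1, 1, 1]);
console.log(re[0]); // 4
```

Mapping the outer loop onto `global_invocation_id` turns this O(N²) CPU job into N independent GPU threads, just as the ball demo did for collision pairs.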
In a real sense, WebGPU unlocks the browser's computational productivity. It breaks the old assumption that "browsers only do UI and light logic" and gives the web platform compute power approaching native applications. This means more than a performance boost: it expands what web applications can be, as more and more algorithms and effects that once lived only in desktop software, native apps, or even professional tools become feasible on the web.
I am a front-end developer with no WebGL background, and WebGPU's extreme performance captivated me. Mastering it means commanding the browser's computational army, and I believe it will be transformative for future development and for my career, so I decided to learn WebGPU from the ground up. Unfortunately, tutorials are scarce and WebGPU itself is still evolving; the only book I could find on the subject is "The WebGPU Sourcebook", entirely in English, which I am using as my introductory text. With this column I will record my learning process from zero, along with my own thoughts. If you are interested too, I hope these notes can help you.