🚀🚀🚀 The Limitless Potential of the Frontend: FlyCut Caption, a Pure-Web Subtitle Video Tool

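FlyCut Caption is a project I recently open-sourced: a video editing tool implemented entirely in the frontend, with no backend dependency. It lets you cut a video by editing its subtitles, and it supports subtitle generation, subtitle-based video editing, and exporting a composed video with subtitles. Because nothing runs on a server, the project shows how far the frontend can now go in AI and multimedia.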

The project is open source. You can browse the code or try it directly on the official site.

GitHub repo: github.com/x007xyz/fly…
Official site: caption.flycut.co/

📋 Usage Guide

This guide briefly walks through the workflow. If you are not interested, skip ahead to the technical implementation.

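1. Upload a video file

Supported video formats: MP4, WebM, AVI, MOV
Supported audio formats: MP3, WAV, OGG
Drag a file onto the upload area, or click to choose a file

After the upload finishes, you are taken to the ASR configuration screen.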

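2. Generate subtitles

Choose the recognition language (Chinese, English, and other languages are supported)
Click Start Recognition and the AI generates timestamped subtitles automatically
Recognition runs in the background and does not block the interface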

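3. Edit subtitles

Select segments: pick the segments to delete from the subtitle list
Batch operations: select all, batch delete, and restore deleted segments
Live preview: click a subtitle segment to jump to its timestamp
History: undo and redo are supported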
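
For readers curious how undo/redo over subtitle edits can work, here is a minimal sketch of a two-stack history. The class name `EditHistory` and its methods are hypothetical and are not taken from the FlyCut Caption source; the state type could be, for example, the list of deleted segment ids.

```typescript
// Hypothetical two-stack undo/redo history (not the project's actual store).
class EditHistory<T> {
  private past: T[] = [];
  private future: T[] = [];
  private present: T;

  constructor(initial: T) {
    this.present = initial;
  }

  get current(): T {
    return this.present;
  }

  // Record a new state; a fresh edit invalidates everything on the redo stack.
  commit(next: T): void {
    this.past.push(this.present);
    this.present = next;
    this.future = [];
  }

  // Step back one state. Returns false when there is nothing to undo.
  undo(): boolean {
    if (this.past.length === 0) return false;
    this.future.push(this.present);
    this.present = this.past.pop() as T;
    return true;
  }

  // Step forward one state. Returns false when there is nothing to redo.
  redo(): boolean {
    if (this.future.length === 0) return false;
    this.past.push(this.present);
    this.present = this.future.pop() as T;
    return true;
  }
}
```

Because only lightweight state such as segment ids is stored, undoing an edit never has to touch the video itself.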

4. Video preview

Preview mode: deleted segments are skipped automatically so you see the final result

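5. Subtitle style

Font settings: size, weight, color
Position: subtitle placement and alignment
Background: background color, opacity, border
Live preview: WYSIWYG style adjustment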

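6. Export

Subtitle export: SRT (the common subtitle format) and JSON
Video export:
- Keeps only the segments that were not deleted
- Optionally burns the subtitles into the video
- Supports different quality settings
- Multiple output formats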
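
As an illustration of the SRT export, the sketch below serializes a list of cues into SRT text. The `Cue` shape and the function names are assumptions made for this example, not the project's actual types; the timestamp format `HH:MM:SS,mmm` and the numbered-block layout are standard SRT.

```typescript
// Hypothetical cue shape; start and end are in seconds.
interface Cue {
  start: number;
  end: number;
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n: number, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Each SRT block is: index, time range, text, then a blank line.
function toSrt(cues: Cue[]): string {
  return cues
    .map(
      (cue, i) =>
        `${i + 1}\n${toSrtTime(cue.start)} --> ${toSrtTime(cue.end)}\n${cue.text.trim()}\n`,
    )
    .join("\n");
}
```

The resulting string can be wrapped in a `Blob` and offered as a download entirely in the browser.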

Technical Implementation

Overall Architecture Diagram

graph TB
    subgraph "Frontend Layer"
        UI[FlyCutCaption Main Component]
        FileUpload[File Upload Component]
        VideoPlayer[Video Player]
        SubtitleEditor[Subtitle Editor]
        ExportDialog[Export Dialog]
    end

    subgraph "Service Layer"
        ASR[ASRService Speech Recognition]
        VideoProcessor[Unified Video Processor]
        MessageCenter[Message Center]
    end

    subgraph "Engine Layer"
        WebAVEngine[WebAV Engine]
        FFmpegEngine[FFmpeg Engine]
        TransformersJS[transformers]
    end

    subgraph "Store Layer"
        AppStore[App State]
        HistoryStore[Subtitle History]
        ThemeStore[Theme Settings]
        MessageStore[Message State]
    end

    subgraph "Web Workers"
        ASRWorker[ASR Worker]
    end

    UI --> FileUpload
    UI --> VideoPlayer
    UI --> SubtitleEditor
    UI --> ExportDialog

    FileUpload --> AppStore
    VideoPlayer --> AppStore
    SubtitleEditor --> HistoryStore

    ASR --> ASRWorker
    ASRWorker --> TransformersJS

    VideoProcessor --> WebAVEngine
    VideoProcessor --> FFmpegEngine

    AppStore -.-> MessageStore
    HistoryStore -.-> MessageStore

    MessageCenter --> MessageStore

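The two most important parts of the architecture are:

Using transformers.js to load a Whisper model and run ASR to obtain the subtitle text
Using WebAV to edit and export the video

The rest is mostly UI and data handling, so I will not go into it here.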

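`transformers.js` Brings AI Capabilities to the Frontend

Transformers.js lets you run pretrained Transformer models (text / image / audio / multimodal) directly in the browser. Its API closely mirrors Python's transformers library, but inference runs on the client through backends such as WASM or WebGPU, which gives a "serverless / local inference" experience.

Whisper is an ASR model that produces subtitle text from audio.

Loading Whisper with transformers.js is enough to extract subtitles from audio or video. The process is fairly time-consuming, though, so we run it in a Web Worker.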
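
To make the worker side more concrete, here is a minimal sketch of a request handler. The message shapes (`AsrRequest`, `AsrResponse`) and function names are hypothetical, not the project's real protocol. The transcriber is injected so the handler can be exercised without downloading a model; the option names (`return_timestamps`, `chunk_length_s`, `stride_length_s`, `language`, `task`) are ones the Transformers.js speech recognition pipeline accepts, and the values chosen here are just an assumption.

```typescript
// Chunk shape returned by the ASR pipeline when timestamps are requested.
type WhisperChunk = { timestamp: [number, number | null]; text: string };
type AsrOutput = { text: string; chunks?: WhisperChunk[] };
type Transcriber = (
  audio: Float32Array, // Whisper expects 16 kHz mono samples
  options: Record<string, unknown>,
) => Promise<AsrOutput>;

// Hypothetical message protocol between the main thread and the worker.
type AsrRequest = { type: "transcribe"; audio: Float32Array; language: string };
type AsrResponse =
  | { type: "status"; status: "running" }
  | { type: "result"; chunks: WhisperChunk[] }
  | { type: "error"; message: string };

function buildWhisperOptions(language: string): Record<string, unknown> {
  return {
    language,
    task: "transcribe",
    return_timestamps: true, // needed to get per-chunk start/end times
    chunk_length_s: 30, // process long audio in 30 s windows
    stride_length_s: 5, // overlap between windows
  };
}

async function handleAsrRequest(
  req: AsrRequest,
  transcriber: Transcriber,
  post: (msg: AsrResponse) => void,
): Promise<void> {
  post({ type: "status", status: "running" });
  try {
    const output = await transcriber(req.audio, buildWhisperOptions(req.language));
    post({ type: "result", chunks: output.chunks ?? [] });
  } catch (err) {
    post({ type: "error", message: err instanceof Error ? err.message : String(err) });
  }
}

// Inside a real worker file the wiring could look like this (not executed here):
//   import { pipeline } from "@huggingface/transformers";
//   const transcriber = await pipeline("automatic-speech-recognition", "<whisper model id>");
//   self.onmessage = (e) => handleAsrRequest(e.data, transcriber, (m) => self.postMessage(m));
```

Keeping the handler separate from the worker wiring means the heavy model load happens once, while each request only passes audio in and chunks out.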

The diagram below shows the call flow in detail:

sequenceDiagram
    participant UI as User Interface
    participant ASRService as ASR Service
    participant ASRWorker as ASR Worker
    participant TransformersJS as Transformers.js
    participant WhisperModel as Whisper Model

    UI->>ASRService: Start speech recognition
    ASRService->>ASRWorker: Create Worker
    ASRWorker->>TransformersJS: Load pipeline
    TransformersJS->>WhisperModel: Download model
    WhisperModel-->>TransformersJS: Model loaded
    TransformersJS-->>ASRWorker: Pipeline ready
    ASRWorker-->>ASRService: Model loaded
    ASRService-->>UI: Update loading status

    UI->>ASRService: Send audio data
    ASRService->>ASRWorker: Process audio
    ASRWorker->>TransformersJS: Run inference
    TransformersJS->>WhisperModel: Speech recognition
    WhisperModel-->>TransformersJS: Return transcription
    TransformersJS-->>ASRWorker: Return chunks
    ASRWorker-->>ASRService: Return SubtitleTranscript
    ASRService-->>UI: Update subtitle data

The main UI communicates with the Worker and receives the generated subtitle data, which completes the AI part. Next comes video editing and composition.
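
The chunks returned by the pipeline still need to be turned into subtitle segments the editor can work with. The sketch below shows one way to do that. The `SubtitleSegment` shape is an assumption (the real `SubtitleTranscript` type is not shown in this article), and the handling of a missing end timestamp is a guess at a reasonable fallback, since the last chunk's end time can be `null`.

```typescript
type WhisperChunk = { timestamp: [number, number | null]; text: string };

// Hypothetical segment shape used by the subtitle editor.
interface SubtitleSegment {
  id: string;
  start: number; // seconds
  end: number; // seconds
  text: string;
}

function chunksToSegments(chunks: WhisperChunk[], mediaDuration: number): SubtitleSegment[] {
  const segments: SubtitleSegment[] = [];
  chunks.forEach((chunk, index) => {
    const text = chunk.text.trim();
    if (!text) return; // drop empty chunks

    const [start, rawEnd] = chunk.timestamp;
    // If the end time is missing, fall back to the next chunk's start,
    // and finally to the media duration.
    const nextStart = index + 1 < chunks.length ? chunks[index + 1].timestamp[0] : undefined;
    const end = rawEnd ?? nextStart ?? mediaDuration;
    if (end <= start) return; // drop zero-length or inverted ranges

    segments.push({ id: `seg-${index}`, start, end, text });
  });
  return segments;
}
```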

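WebAV Gives the Frontend Video Editing Capabilities

WebAV is a pure-frontend video processing solution built on the WebCodecs API. Without any server support, it can trim and generate video.

During editing we do not actually use WebAV to cut the video. Instead, the video player implements a preview mode that skips the removed content to simulate the cut. Video editing is a relatively expensive operation, and because we support undo/redo history, re-processing the video on every undo would be impractical.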
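
The skip logic behind such a preview mode can be sketched as a small pure function. The names `TimeRange` and `resolvePlayableTime` are hypothetical; the idea is simply that whenever the playhead lands inside a deleted range, playback jumps to the end of that range.

```typescript
interface TimeRange {
  start: number; // seconds
  end: number; // seconds
}

// If `time` falls inside a deleted range, return the time playback should
// jump to; otherwise return `time` unchanged.
function resolvePlayableTime(time: number, deleted: TimeRange[]): number {
  let t = time;
  // Sorting by start lets one pass chain through adjacent or overlapping ranges.
  const sorted = [...deleted].sort((a, b) => a.start - b.start);
  for (const range of sorted) {
    if (t >= range.start && t < range.end) {
      t = range.end;
    }
  }
  return t;
}

// In the player this could be hooked to the <video> element (not executed here):
//   video.addEventListener("timeupdate", () => {
//     const target = resolvePlayableTime(video.currentTime, deletedRanges);
//     if (target !== video.currentTime) video.currentTime = target;
//   });
```

Since the deleted ranges are plain data, undo and redo only change the list passed to this function, and the preview updates immediately.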

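WebAV is used mainly when composing and exporting the video.

Video Trimming

Its first job is trimming the video. Since the video is cut by subtitles, the final video may have many parts that need to be removed. WebAV has no built-in way to delete a segment directly, so we wrote a thin wrapper around MP4Clip.split to implement the cut: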

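private async splitVideoByDeletedSegments(clip: MP4Clip, deletedSegments: VideoSegment[]): Promise<MP4Clip[]> {
  const resultClips: MP4Clip[] = [];
  let currentClip: MP4Clip | null = clip;
  let currentTime = 0; // source-timeline position (seconds) where currentClip starts

  // `mergedDeleted` was not defined in the original excerpt; it is assumed here to be
  // the deleted segments sorted by start time, with overlapping ranges merged.
  const mergedDeleted: VideoSegment[] = [];
  for (const seg of [...deletedSegments].sort((a, b) => a.start - b.start)) {
    const last = mergedDeleted[mergedDeleted.length - 1];
    if (last && seg.start <= last.end) {
      last.end = Math.max(last.end, seg.end);
    } else {
      mergedDeleted.push({ ...seg });
    }
  }

  for (const deleteSegment of mergedDeleted) {
    if (!currentClip) break;

    // Keep the content before the deleted segment
    const keepDuration = deleteSegment.start - currentTime;
    if (keepDuration > 0.01) {
      const splitTime = keepDuration * 1e6; // seconds -> microseconds
      const [keepPart, remaining] = await currentClip.split(splitTime);
      if (keepPart && keepPart.meta.duration > 0) {
        resultClips.push(keepPart);
      }
      currentClip = remaining;
      currentTime = deleteSegment.start;
    }

    // Skip the deleted segment. Measuring from currentTime also drops any
    // sliver shorter than 0.01 s that was not kept above.
    const deleteTime = (deleteSegment.end - currentTime) * 1e6;
    if (deleteTime > 0) {
      if (currentClip.meta.duration > deleteTime) {
        const [, remaining] = await currentClip.split(deleteTime);
        currentClip = remaining;
      } else {
        currentClip = null; // the deletion runs to the end of the video
      }
    }

    currentTime = deleteSegment.end;
  }

  // Keep whatever remains after the last deleted segment
  if (currentClip && currentClip.meta.duration > 0) {
    resultClips.push(currentClip);
  }

  return resultClips;
}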
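
The export step also needs to know which time ranges survive the cut. As a minimal sketch, the kept ranges can be computed as the complement of the deleted ranges over the video duration. The function name `computeKeptSegments` and the `TimeRange` type are assumptions for illustration, not the project's code.

```typescript
interface TimeRange {
  start: number; // seconds
  end: number; // seconds
}

// Return the ranges of [0, duration] that are NOT covered by any deleted range.
function computeKeptSegments(duration: number, deleted: TimeRange[]): TimeRange[] {
  const kept: TimeRange[] = [];
  let cursor = 0; // end of the region processed so far
  const sorted = [...deleted].sort((a, b) => a.start - b.start);

  for (const range of sorted) {
    // Clamp each deleted range to the video bounds.
    const start = Math.min(duration, Math.max(0, range.start));
    const end = Math.min(duration, Math.max(0, range.end));
    if (start > cursor) {
      kept.push({ start: cursor, end: start });
    }
    cursor = Math.max(cursor, end); // handles overlapping deletions
  }

  if (cursor < duration) {
    kept.push({ start: cursor, end: duration });
  }
  return kept;
}
```

Summing `end - start` over the result gives the duration of the exported video.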

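Subtitle and Video Composition

The second WebAV feature we use is subtitle composition; exporting a video with subtitles is the core feature of the project. For subtitles, WebAV provides EmbedSubtitlesClip, which is simple to use: we pass the generated SRT content and the subtitle style configuration to EmbedSubtitlesClip and add it to the combinator. The last step is composition, which is just a call to the compose method.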

const subtitleSprite = new OffscreenSprite(
  new EmbedSubtitlesClip(subtitleStructs, {
    videoWidth: this.videoClip.meta.width,
    videoHeight: this.videoClip.meta.height,
    fontSize: effectiveStyle.fontSize,
    fontFamily: effectiveStyle.fontFamily,
    fontWeight: effectiveStyle.fontWeight,
    fontStyle: effectiveStyle.fontStyle,
    color: effectiveStyle.color,
    // Append the opacity as a two-digit hex alpha channel, e.g. #000000 + 80
    textBgColor: effectiveStyle.backgroundOpacity > 0
      ? `${effectiveStyle.backgroundColor}${Math.round(effectiveStyle.backgroundOpacity * 255).toString(16).padStart(2, '0')}`
      : undefined,
    strokeStyle: effectiveStyle.borderWidth > 0 ? effectiveStyle.borderColor : undefined,
    lineWidth: effectiveStyle.borderWidth,
    lineJoin: 'round',
    lineCap: 'round',
    letterSpacing: effectiveStyle.letterSpacing.toString(),
    bottomOffset: effectiveStyle.bottomOffset,
    textShadow: effectiveStyle.shadowBlur > 0 ? {
      offsetX: effectiveStyle.shadowOffsetX,
      offsetY: effectiveStyle.shadowOffsetY,
      blur: effectiveStyle.shadowBlur,
      color: effectiveStyle.shadowColor,
    } : undefined,
  })
);

// Set the subtitle time range (covers the whole output video)
const totalDuration = keptSegments.reduce((total, seg) => {
  return total + (seg.end - seg.start);
}, 0) * 1e6; // seconds -> microseconds

subtitleSprite.time = {
  offset: 0,
  duration: totalDuration,
};

// Set z-index so the subtitles render above the video
subtitleSprite.zIndex = 10;

// Add to the combinator
await this.combinator.addSprite(subtitleSprite);
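
One detail worth spelling out: once segments are removed, subtitle times from the source video no longer line up with the output timeline. The article does not show how the project handles this, so the following is only a sketch under the assumption that each cue is clipped to the kept ranges and shifted by the total time removed before it. The names `Cue`, `TimeRange`, and `remapCuesToKeptTimeline` are hypothetical.

```typescript
interface TimeRange {
  start: number; // seconds
  end: number; // seconds
}

interface Cue extends TimeRange {
  text: string;
}

// Map cues from the source timeline onto the trimmed output timeline.
// Assumes `kept` is sorted by start and contains no overlapping ranges.
function remapCuesToKeptTimeline(cues: Cue[], kept: TimeRange[]): Cue[] {
  const result: Cue[] = [];
  let offset = 0; // where the current kept range begins on the output timeline

  for (const seg of kept) {
    for (const cue of cues) {
      // Clip the cue to the kept range; cues entirely outside are skipped.
      const start = Math.max(cue.start, seg.start);
      const end = Math.min(cue.end, seg.end);
      if (end > start) {
        result.push({
          text: cue.text,
          start: start - seg.start + offset,
          end: end - seg.start + offset,
        });
      }
    }
    offset += seg.end - seg.start;
  }
  return result;
}
```

For example, with kept ranges 0 to 2 s and 5 to 8 s, a cue spanning 6 to 7 s in the source ends up at 3 to 4 s in the output.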

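Summary

When people think of the frontend, most think of UI. But as the technology evolves, the frontend has gained new possibilities in many areas. Beyond traditional UI and interaction work, I think we can explore much more of what the frontend can do.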
