Large File Upload



Pain points of uploading large files

  • The upload takes a long time
  • If anything goes wrong midway, the whole upload has to start over
  • Services usually impose a limit on file size

The solution to these problems: chunked upload

Reading the File

import './App.css';
import Upload from './components/Upload';

import type { ChangeEvent } from 'react';

function App() {
  const handleFileChange = (e: ChangeEvent<HTMLInputElement>) => {
    console.log('File changed:');
    console.log(e.target.files);
  };

  return (
    <div className="App">
      <Upload />
      <input type="file" onChange={handleFileChange} />
    </div>
  );
}
}

export default App;

Splitting the File into Chunks

The core is the Blob object's slice method. The file obtained in the previous step is a File object, which inherits from Blob, so slice can be used to split it into pieces.

Usage: Blob.slice() method - Web API | MDN

slice()
slice(start)
slice(start, end)
slice(start, end, contentType)
// Chunk size
const CHUNK_SIZE = 1024 * 1024; // 1MB
// Split the file into chunks
const createFileChunk = (file: File) => {
  const chunks: Blob[] = [];
  let cur = 0;

  while (cur < file.size) {
    const chunk = file.slice(cur, cur + CHUNK_SIZE);
    chunks.push(chunk);
    cur += CHUNK_SIZE;
  }

  return chunks;
};
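As a quick sanity check, the same slicing logic can be exercised in Node, which ships a built-in Blob since Node 18 (a small sketch, not part of the upload code itself):

```javascript
// Minimal sketch of the slicing logic, using Node's built-in Blob
// (available since Node 18) so it can run outside the browser.
const CHUNK_SIZE = 1024 * 1024; // 1 MB

function createFileChunk(blob, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  let cur = 0;
  while (cur < blob.size) {
    chunks.push(blob.slice(cur, cur + chunkSize));
    cur += chunkSize;
  }
  return chunks;
}

// A 2.5 MB blob yields 3 chunks: 1 MB, 1 MB, and 0.5 MB
const blob = new Blob([new Uint8Array(2.5 * 1024 * 1024)]);
const chunks = createFileChunk(blob);
console.log(chunks.length, chunks[2].size); // 3 524288
```

In general a file of size S produces Math.ceil(S / CHUNK_SIZE) chunks, with only the last one smaller than CHUNK_SIZE.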

Hash Calculation

// Compute the hash
const calculateHash = (chunks: Blob[]) => {
  return new Promise((resolve) => {
    // 1. The first and last chunks are hashed in full
    // 2. Middle chunks contribute only their first two bytes,
    //    middle two bytes, and last two bytes

    // All slices that take part in the hash
    const targetChunks: Blob[] = [];

    const spark = new SparkMD5.ArrayBuffer();
    const fileReader = new FileReader();

    chunks.forEach((chunk, index) => {
      if (index === 0 || index === chunks.length - 1) {
        // First and last chunks are hashed in full
        targetChunks.push(chunk);
      } else {
        // Middle chunks contribute only six sampled bytes
        const firstChunks = chunk.slice(0, 2);
        const middleChunks = chunk.slice(CHUNK_SIZE / 2, CHUNK_SIZE / 2 + 2);
        const lastChunks = chunk.slice(CHUNK_SIZE - 2, CHUNK_SIZE);
        targetChunks.push(firstChunks, middleChunks, lastChunks);
      }
    });

    fileReader.readAsArrayBuffer(new Blob(targetChunks));
    fileReader.onload = (e) => {
      spark.append(e.target?.result as ArrayBuffer);
      resolve(spark.end());
    };
  });
};
const hash = await calculateHash(chunks);
console.log(hash);
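The sampling is what keeps hashing fast: for an n-chunk file, only the first and last chunks plus six bytes from each middle chunk are fed to SparkMD5. A rough sketch of the savings (sampledBytes is an illustrative helper, not part of the upload code):

```javascript
const CHUNK_SIZE = 1024 * 1024; // 1 MB, matching the upload code

// Approximate number of bytes actually hashed under the sampling scheme
function sampledBytes(fileSize, chunkSize = CHUNK_SIZE) {
  const n = Math.ceil(fileSize / chunkSize);
  if (n <= 2) return fileSize; // first/last chunks are hashed in full
  const lastChunkSize = fileSize - (n - 1) * chunkSize;
  // full first chunk + full last chunk + 2+2+2 bytes per middle chunk
  return chunkSize + lastChunkSize + (n - 2) * 6;
}

// A 1 GB file hashes only ~2 MB of data instead of the full gigabyte
console.log(sampledBytes(1024 * 1024 * 1024)); // 2103284
```

Note the trade-off: because most bytes are skipped, two different files can in principle produce the same sampled hash; this scheme trades exactness for speed.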

Uploading the Chunks

Suppose you upload a 1 GB file with 1 MB chunks: that is 1024 chunks in total. The browser cannot reasonably fire a request for every chunk at once; creating that many requests is unnecessary. Chrome, for instance, allows at most 6 concurrent requests per origin by default, so piling on more requests does not speed up the upload, it just puts a huge burden on the browser. The number of in-flight requests therefore has to be limited.

How do we solve this?

Create requests up to a maximum concurrency, say 6, so that at any moment the browser has at most 6 requests in flight. Whenever one request returns, the next one is started, and so on until all requests have been sent.
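The pattern above can be sketched as a standalone helper, independent of fetch (runWithConcurrency is a name made up here; the upload code below inlines the same idea):

```javascript
// Keep a pool of in-flight promises, start a new task whenever the pool
// has a free slot, and wait on Promise.race when it is full.
async function runWithConcurrency(taskFactories, maxConcurrent = 6) {
  const results = [];
  const pool = new Set();

  for (const factory of taskFactories) {
    const p = Promise.resolve().then(factory); // start the task
    results.push(p);

    // Track the task and remove it from the pool once it settles
    const tracked = p.finally(() => pool.delete(tracked));
    pool.add(tracked);

    if (pool.size >= maxConcurrent) {
      await Promise.race(pool); // block until a slot frees up
    }
  }

  // Results come back in submission order, like Promise.all
  return Promise.all(results);
}
```

The key detail is that each task removes itself from the pool when it settles; Promise.race alone only tells you that some task finished, not which one.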

File uploads typically use a FormData object: the chunk itself plus any extra metadata is placed into it.

Frontend implementation
// Upload the chunks
const uploadChunks = async (chunks: Blob[]) => {
  const data = chunks.map((chunk, index) => {
    return {
      fileName,
      chunk,
      chunkHash: `${fileHash}-${index}`,
      fileHash,
    };
  });

  const formData = data.map((item) => {
    const _formData = new FormData();
    _formData.append("fileName", item.fileName);
    _formData.append("chunk", item.chunk);
    _formData.append("chunkHash", item.chunkHash);
    _formData.append("fileHash", item.fileHash);
    return _formData;
  });

  const MAX_CONCURRENT = 6; // maximum number of concurrent requests
  let index = 0;
  const taskPool: Promise<void>[] = [];

  while (index < formData.length) {
    const task: Promise<void> = fetch('/upload', {
      method: 'POST',
      body: formData[index],
    }).then(() => {
      // Once a request settles, remove it from the pool to free a slot
      taskPool.splice(taskPool.indexOf(task), 1);
    });

    taskPool.push(task);

    if (taskPool.length >= MAX_CONCURRENT) {
      // Wait for at least one in-flight request to finish
      await Promise.race(taskPool);
    }

    index++;
  }

  // Await the remaining tasks (fewer than MAX_CONCURRENT)
  await Promise.all(taskPool);
}
Server-side implementation

The server uses the multiparty package to handle the uploaded files. multiparty is a Node.js library dedicated to parsing multipart/form-data payloads in HTTP requests.

When handling each uploaded chunk, the server should first store it in a temporary location so the chunks can be read back for merging later. To distinguish chunks belonging to different files, the file's hash is used to name a directory, and all of that file's chunks are stored inside it.

const express = require('express');
const cors = require('cors');
const multiparty = require('multiparty');
const path = require('path');
const fse = require('fs-extra');

const app = express();

app.use(cors());

const UPLOAD_DIR = path.resolve(__dirname, 'uploads');


app.post(`/upload`, (req, res) => {
    const form = new multiparty.Form();

    form.parse(req, async (err, fields, files) => {
        if(err) {
            return res.status(500).send('Error parsing form data');
        }

        console.log('Fields:', fields);
        console.log('Files:', files);
        // Temporary storage directory for this chunk
        const fileHash = fields['fileHash'][0];
        const chunkHash = fields['chunkHash'][0];

        if(!fse.existsSync(UPLOAD_DIR)) {
            await fse.mkdir(UPLOAD_DIR)
        }

        const chunkPath = path.resolve(UPLOAD_DIR, fileHash);


        if(!fse.existsSync(chunkPath)) {
            await fse.mkdir(chunkPath)
        }

        const oldPath = files.chunk[0].path
        console.log('Old Path:', oldPath);
        await fse.move(oldPath, path.resolve(chunkPath, chunkHash), { overwrite: true })
        res.status(200).send('Files uploaded successfully');
    })
})

app.listen(3000, () => {
    console.log('Server is running on http://localhost:3000');
})

Merging the Chunks

Once every chunk has been uploaded to the server, the chunks must be merged back into a complete file.

Frontend implementation

The frontend sends a merge request to the server, passing the file's hash along so the server knows which file to merge.

// Merge request
const mergeRequest = () => {
  fetch("http://localhost:3000/merge", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      fileName: fileName.current,
      fileHash: fileHash.current,
      size: CHUNK_SIZE,
    }),
  }).then(() => {
    alert("Merge succeeded");
  });
};
Server-side implementation
app.post("/merge", async (req, res) => {
  const { fileName, fileHash, size } = req.body;

  // Directory holding this file's chunks
  const chunkDir = path.resolve(UPLOAD_DIR, fileHash);

  // If the chunk directory does not exist, return an error right away
  if (!fse.existsSync(chunkDir)) {
    return res.status(400).json({
      code: 400,
      message: "Merge failed, please re-upload",
    });
  }

  //  If the merged file already exists, there is nothing to do
  const filePath = path.resolve(UPLOAD_DIR, fileHash + fileExtension(fileName));
  if (fse.existsSync(filePath)) {
    res.status(200).json({
      code: 200,
      message: "Merge succeeded",
    });
    return;
  }

  //  Otherwise, merge the chunks
  const chunkPaths = await fse.readdir(chunkDir);
  // Sort chunks by index (readdir returns names in lexicographic order)
  chunkPaths.sort((a, b) => {
    return a.split("-")[1] - b.split("-")[1];
  });

  const list = chunkPaths.map((chunkName, index) => {
    return new Promise((resolve) => {
      const chunkPath = path.resolve(chunkDir, chunkName);
      const readStream = fse.createReadStream(chunkPath);
      // Each chunk is written at its own offset in the target file
      const writeStream = fse.createWriteStream(filePath, {
        start: index * size,
      });

      readStream.pipe(writeStream);

      // Delete the chunk only after its data has been flushed
      writeStream.on("finish", async () => {
        await fse.unlink(chunkPath);
        resolve();
      });
    });
  });

  await Promise.all(list);

  await fse.remove(chunkDir);
  res.status(200).send("Merge request received");
});
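One detail worth calling out in the merge code: fse.readdir returns names in lexicographic order, so without the numeric sort, chunk 10 would be merged before chunk 2 (hashes shortened to "abc" for illustration):

```javascript
const names = ["abc-10", "abc-1", "abc-2"];

// Default (lexicographic) sort puts "abc-10" before "abc-2"
console.log([...names].sort()); // [ 'abc-1', 'abc-10', 'abc-2' ]

// Sorting by the numeric index after "-" restores the merge order
const byIndex = [...names].sort((a, b) => a.split("-")[1] - b.split("-")[1]);
console.log(byIndex); // [ 'abc-1', 'abc-2', 'abc-10' ]
```

This works because an MD5 hex hash contains no "-", so split("-")[1] is always the chunk index; the subtraction coerces it to a number.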

At this point the basic large-file upload is working, but it does not handle uploading the same file twice, and if the network drops partway through, every chunk has to be uploaded again. Solving these two problems calls for instant upload and resumable upload.

Instant Upload & Resumable Upload

When the same file is hashed, the resulting hash is identical, and the server names uploaded files by their hash. So before uploading, we can check whether the file already exists on the server; if it does, there is no need to upload it again, and we simply tell the user the upload succeeded. To the user, it feels as if the file was uploaded instantly.

Frontend implementation

Before uploading, the frontend sends the file's hash to the server. The server checks for a matching file; if one exists, it responds immediately and the chunked upload is skipped.

// Check whether the file has already been uploaded
const verify = () => {
  return fetch("http://localhost:3000/verify", {
    method: "POST",
    headers: {
      "content-type": "application/json",
    },
    body: JSON.stringify({
      fileHash: fileHash.current,
      fileName: fileName.current,
    }),
  })
    .then((res) => res.json())
    .then((data) => {
      return data;
    });
};
Server-side implementation
app.post("/verify", async (req, res) => {
  const { fileHash, fileName } = req.body;
  const filePath = path.resolve(UPLOAD_DIR, fileHash + fileExtension(fileName));

  if (fse.existsSync(filePath)) {
    // File already exists; tell the frontend no upload is needed
    return res.status(200).json({
      ok: true,
      data: {
        shouldUpload: false,
      },
    });
  } else {
    // File does not exist; tell the frontend to upload
    return res.status(200).json({
      ok: true,
      data: {
        shouldUpload: true,
      },
    });
  }
});

Resumable Upload

With the steps above in place, re-uploading the same file, even under a different name, immediately reports success, because the server already has it. This solves the duplicate-upload problem.

But what should happen if the network is interrupted partway through an upload?

If some chunks were uploaded before the disconnect, we just need to fetch the list of those chunks before uploading and filter them out, so they are not uploaded again. In other words, only the chunks that never made it to the server are uploaded.

Frontend implementation
// Upload the chunks
const uploadChunks = async (chunks: Blob[], existsChunks: string[]) => {
  // ...

  // Filter out chunks that already exist on the server
  const formData = data
    .filter((item) => !existsChunks.includes(item.chunkHash))
    .map((item) => {
      const _formData = new FormData();
      _formData.append("fileName", item.fileName);
      _formData.append("chunk", item.chunk);
      _formData.append("chunkHash", item.chunkHash);
      _formData.append("fileHash", item.fileHash);
      return _formData;
    });

  // ...

}


const handleFileChange = async (e: ChangeEvent<HTMLInputElement>) => {
  // ...
  // Upload the chunks
  uploadChunks(chunks, data.data.existsChunks);
};
Server-side implementation
app.post("/verify", async (req, res) => {
  const { fileHash, fileName } = req.body;
  const filePath = path.resolve(UPLOAD_DIR, fileHash + fileExtension(fileName));

  // All chunks that have been uploaded successfully
  const chunkDir = path.resolve(UPLOAD_DIR, fileHash);

  let chunkPaths = [];
  if (fse.existsSync(chunkDir)) {
    chunkPaths = await fse.readdir(chunkDir);
    console.log("🚀 ~ chunkPaths:", chunkPaths);
  }

  if (fse.existsSync(filePath)) {
    // File already exists; tell the frontend no upload is needed
    return res.status(200).json({
      ok: true,
      data: {
        shouldUpload: false,
      },
    });
  } else {
    // File does not exist; tell the frontend to upload
    return res.status(200).json({
      ok: true,
      data: {
        shouldUpload: true,
        existsChunks: chunkPaths,
      },
    });
  }
});

You can see that the first chunk has already been uploaded successfully.

After filtering, only the chunks that are missing from the server need to be uploaded.

Full Code

Frontend
import SparkMD5 from "spark-md5";
import { useEffect, useRef } from "react";

import type { ChangeEvent } from "react";

function Upload() {
  // File name
  const fileName = useRef<string>("");
  // File hash
  const fileHash = useRef<string>("");

  // Chunk size
  const CHUNK_SIZE = 1024 * 1024; // 1MB
  // Split the file into chunks
  const createFileChunk = (file: File) => {
    const chunks: Blob[] = [];
    let cur = 0;

    while (cur < file.size) {
      const chunk = file.slice(cur, cur + CHUNK_SIZE);
      chunks.push(chunk);
      cur += CHUNK_SIZE;
    }

    return chunks;
  };

  // Compute the hash
  const calculateHash = (chunks: Blob[]) => {
    return new Promise((resolve) => {
      // 1. The first and last chunks are hashed in full
      // 2. Middle chunks contribute only their first two bytes,
      //    middle two bytes, and last two bytes

      // All slices that take part in the hash
      const targetChunks: Blob[] = [];

      const spark = new SparkMD5.ArrayBuffer();
      const fileReader = new FileReader();

      chunks.forEach((chunk, index) => {
        if (index === 0 || index === chunks.length - 1) {
          // First and last chunks are hashed in full
          targetChunks.push(chunk);
        } else {
          // Middle chunks contribute only six sampled bytes
          const firstChunks = chunk.slice(0, 2);
          const middleChunks = chunk.slice(CHUNK_SIZE / 2, CHUNK_SIZE / 2 + 2);
          const lastChunks = chunk.slice(CHUNK_SIZE - 2, CHUNK_SIZE);
          targetChunks.push(firstChunks, middleChunks, lastChunks);
        }
      });

      fileReader.readAsArrayBuffer(new Blob(targetChunks));
      fileReader.onload = (e) => {
        spark.append(e.target?.result as ArrayBuffer);
        resolve(spark.end());
      };
    });
  };

  // Merge request
  const mergeRequest = () => {
    fetch("http://localhost:3000/merge", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        fileName: fileName.current,
        fileHash: fileHash.current,
        size: CHUNK_SIZE,
      }),
    }).then(() => {
      alert("Merge succeeded");
    });
  };

  // Upload the chunks
  const uploadChunks = async (chunks: Blob[], existsChunks: string[]) => {
    const data = chunks.map((chunk, index) => {
      return {
        fileName: fileName.current,
        chunk,
        chunkHash: `${fileHash.current}-${index}`,
        fileHash: fileHash.current,
      };
    });

    // Filter out chunks that already exist on the server
    const formData = data
      .filter((item) => !existsChunks.includes(item.chunkHash))
      .map((item) => {
        const _formData = new FormData();
        _formData.append("fileName", item.fileName);
        _formData.append("chunk", item.chunk);
        _formData.append("chunkHash", item.chunkHash);
        _formData.append("fileHash", item.fileHash);
        return _formData;
      });

    const MAX_CONCURRENT = 6; // maximum number of concurrent requests
    let index = 0;
    const taskPool: Promise<void>[] = [];

    while (index < formData.length) {
      const task: Promise<void> = fetch("http://localhost:3000/upload", {
        method: "POST",
        body: formData[index],
      }).then(() => {
        // Once a request settles, remove it from the pool to free a slot
        taskPool.splice(taskPool.indexOf(task), 1);
      });

      taskPool.push(task);

      if (taskPool.length >= MAX_CONCURRENT) {
        // Wait for at least one in-flight request to finish
        await Promise.race(taskPool);
      }

      index++;
    }

    // Await the remaining tasks (fewer than MAX_CONCURRENT)
    await Promise.all(taskPool);

    mergeRequest();
  };

  // Check whether the file has already been uploaded
  const verify = () => {
    return fetch("http://localhost:3000/verify", {
      method: "POST",
      headers: {
        "content-type": "application/json",
      },
      body: JSON.stringify({
        fileHash: fileHash.current,
        fileName: fileName.current,
      }),
    })
      .then((res) => res.json())
      .then((data) => {
        return data;
      });
  };

  const handleFileChange = async (e: ChangeEvent<HTMLInputElement>) => {
    if (!e.target.files || e.target.files.length === 0) return;

    // 1. Read the file
    const file = e.target.files[0];
    fileName.current = file.name;

    // 2. Split the file into chunks
    const chunks = createFileChunk(file);

    // 3. Compute the hash
    const hash = await calculateHash(chunks);
    fileHash.current = hash as string;

    // Check whether the file was already uploaded
    const data = await verify();
    console.log("data", data);
    if (!data.data.shouldUpload) {
      alert("This file has already been uploaded");
      return;
    }

    // Upload the chunks
    uploadChunks(chunks, data.data.existsChunks);
  };

  return (
    <div>
      <input type="file" onChange={handleFileChange} />
    </div>
  );
}

export default Upload;
Server
const express = require("express");
const cors = require("cors");
const multiparty = require("multiparty");
const path = require("path");
const fse = require("fs-extra");

const app = express();

app.use(express.json());
app.use(cors());

const UPLOAD_DIR = path.resolve(__dirname, "uploads");

// Extract the file extension
const fileExtension = (fileName) => {
  return fileName.slice(fileName.lastIndexOf("."));
};

app.post(`/upload`, (req, res) => {
  const form = new multiparty.Form();

  form.parse(req, async (err, fields, files) => {
    if (err) {
      return res.status(500).send("Error parsing form data");
    }

    // Temporary storage directory for this chunk
    const fileHash = fields["fileHash"][0];
    const chunkHash = fields["chunkHash"][0];

    if (!fse.existsSync(UPLOAD_DIR)) {
      await fse.mkdir(UPLOAD_DIR);
    }

    const chunkPath = path.resolve(UPLOAD_DIR, fileHash);

    if (!fse.existsSync(chunkPath)) {
      await fse.mkdir(chunkPath);
    }

    const oldPath = files.chunk[0].path;
    await fse.move(oldPath, path.resolve(chunkPath, chunkHash), {
      overwrite: true,
    });
    res.status(200).json({
      code: 200,
      message: "Chunk uploaded successfully",
    });
  });
});

app.post("/merge", async (req, res) => {
  const { fileName, fileHash, size } = req.body;

  // Directory holding this file's chunks
  const chunkDir = path.resolve(UPLOAD_DIR, fileHash);

  // If the chunk directory does not exist, return an error right away
  if (!fse.existsSync(chunkDir)) {
    return res.status(400).json({
      code: 400,
      message: "Merge failed, please re-upload",
    });
  }

  //  If the merged file already exists, there is nothing to do
  const filePath = path.resolve(UPLOAD_DIR, fileHash + fileExtension(fileName));
  if (fse.existsSync(filePath)) {
    res.status(200).json({
      code: 200,
      message: "Merge succeeded",
    });
    return;
  }

  //  Otherwise, merge the chunks
  const chunkPaths = await fse.readdir(chunkDir);
  // Sort chunks by index (readdir returns names in lexicographic order)
  chunkPaths.sort((a, b) => {
    return a.split("-")[1] - b.split("-")[1];
  });

  const list = chunkPaths.map((chunkName, index) => {
    return new Promise((resolve) => {
      const chunkPath = path.resolve(chunkDir, chunkName);
      const readStream = fse.createReadStream(chunkPath);
      // Each chunk is written at its own offset in the target file
      const writeStream = fse.createWriteStream(filePath, {
        start: index * size,
      });

      readStream.pipe(writeStream);

      // Delete the chunk only after its data has been flushed
      writeStream.on("finish", async () => {
        await fse.unlink(chunkPath);
        resolve();
      });
    });
  });

  await Promise.all(list);
  await fse.remove(chunkDir);
  res.status(200).send("Merge request received");
});

app.post("/verify", async (req, res) => {
  const { fileHash, fileName } = req.body;
  const filePath = path.resolve(UPLOAD_DIR, fileHash + fileExtension(fileName));

  // All chunks that have been uploaded successfully
  const chunkDir = path.resolve(UPLOAD_DIR, fileHash);

  let chunkPaths = [];
  if (fse.existsSync(chunkDir)) {
    chunkPaths = await fse.readdir(chunkDir);
    console.log("🚀 ~ chunkPaths:", chunkPaths);
  }

  if (fse.existsSync(filePath)) {
    // File already exists; tell the frontend no upload is needed
    return res.status(200).json({
      ok: true,
      data: {
        shouldUpload: false,
      },
    });
  } else {
    // File does not exist; tell the frontend to upload
    return res.status(200).json({
      ok: true,
      data: {
        shouldUpload: true,
        existsChunks: chunkPaths,
      },
    });
  }
});

app.listen(3000, () => {
  console.log("Server is running on http://localhost:3000");
});