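ProChat + Remix: Connecting to Ollama's Streaming Service

This post shares how to connect a local Ollama model to @ant-design/pro-chat to build your own local chat service, along with some exploration of a few newer technologies.

1. Introduction

A local, streaming chat service built with Remix + @ant-design/pro-chat + ollama.

2. Preparation

If you are not familiar with ollama yet, install [Ollama](https://ollama.com/) first.

3. Initializing the Remix project and installing dependencies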

npx create-remix <your_project>

cd <your_project>

pnpm install

pnpm add ollama remix-utils antd antd-style @ant-design/pro-chat

3.1) Why Remix?

A Next.js-based solution already exists, and I am also fairly proficient with Remix.

4. Ways to call ollama

The ollama RESTful API
ollamajs

ollamajs wraps the RESTful API and is more convenient to use. Here is a chat example:

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'llama2',
  messages: [{ role: 'user', content: 'Why is the sky blue?' }],
})
console.log(response.message.content)

By default, pass model to the chat method to choose the model and a messages array to start chatting 🙃; there is no need to wrap the API by hand 😁.
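
For comparison, the first option, calling the RESTful API directly, might look like the sketch below. It assumes Ollama is listening on its default local address (http://localhost:11434) and that the model has already been pulled. buildChatRequest and chatViaRest are hypothetical helper names, not part of any library.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Assumption: Ollama's default local address.
const OLLAMA_URL = "http://localhost:11434/api/chat";

// Hypothetical helper: builds the fetch arguments for a non-streaming chat call.
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return {
    url: OLLAMA_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for a single JSON object instead of a stream of chunks
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

// Hypothetical helper: sends the request and returns the assistant's reply text.
async function chatViaRest(model: string, messages: ChatMessage[]): Promise<string> {
  const { url, init } = buildChatRequest(model, messages);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`ollama responded with HTTP ${res.status}`);
  const data = (await res.json()) as { message: { content: string } };
  return data.message.content;
}
```

Calling chatViaRest("llama2", [{ role: "user", content: "Why is the sky blue?" }]) would then resolve to the same text that the ollamajs example logs, at the cost of handling the HTTP details yourself.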

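5. Problems you will run into

Because we are using Remix, which defaults to Vite + SSR, problems show up when it is combined with the Antd ecosystem.

The antd icons package does not support ESM. The workarounds are ClientOnly plus Vite's ssr.noExternal and optimizeDeps options.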
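
As a rough illustration of the Vite side of those workarounds, a vite.config.ts might look like the sketch below. The package lists are assumptions; which packages actually need to be listed depends on the errors your own project produces.

```typescript
// vite.config.ts
import { vitePlugin as remix } from "@remix-run/dev";
import { defineConfig } from "vite";

export default defineConfig({
  plugins: [remix()],
  ssr: {
    // Assumption: bundle these packages into the server build instead of
    // leaving them as external imports for Node to resolve on its own.
    noExternal: ["@ant-design/icons", "@ant-design/pro-chat"],
  },
  optimizeDeps: {
    // Assumption: pre-bundle the icons package for the dev server.
    include: ["@ant-design/icons"],
  },
});
```

On the component side, ClientOnly from remix-utils (imported from "remix-utils/client-only") takes a function as its child, so wrapping the chat component as <ClientOnly fallback={null}>{() => <ProChat />}</ClientOnly> keeps it out of the server render entirely.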

6. Basic ProChat usage

ProChat is designed to be very simple to use; for basic usage we only need to care about the request prop.

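import { ProChat } from '@ant-design/pro-chat';
import { useTheme } from 'antd-style';

export default () => {
  // theme supplies the layout background color used below
  const theme = useTheme();
  return (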
    <div style={{ background: theme.colorBgLayout }}>
      <ProChat
        request={async (messages) => {
          const mockedData: string = `这是一段模拟的对话数据。本次会话传入了${messages.length}条消息`;
          return new Response(mockedData);
        }}
      />
    </div>
  );
};

The request prop is a function that returns a Response instance; Response comes from the Web standard library. Response supports streaming, but the stream ollama returns is not the same kind of object.

7. Ollama stream

chat(request: ChatRequest & {
        stream: true;
    }): Promise<AsyncGenerator<ChatResponse>>;

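As the signature shows, the result is an AsyncGenerator. Generators are iterable, so the ollama stream can be consumed with the iteration API, and that lets us convert the async generator into a Web standard stream. Let's implement it: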

import ollama from 'ollama'

const chat = async (messages) => {
    const response = await ollama.chat({
      model: "qwen",
      messages,
      stream: true,
    });

    const stream = new ReadableStream({
      // start may be async; errors inside it are then caught by the try/catch below
      async start(controller) {
        const encoder = new TextEncoder();
        const reader = response[Symbol.asyncIterator](); // get the async iterator

        try {
          // eslint-disable-next-line no-constant-condition
          while (true) {
            const { value, done } = await reader.next(); // fetch the next chunk
            if (done) break; // the generator is exhausted

            controller.enqueue(encoder.encode(value.message.content)); // write the text into the stream
            if (value.done) break; // ollama flags its final chunk with done: true
          }
          controller.close();
        } catch (error) {
          console.log(error);
          controller.error(error); // propagate the failure to whoever reads the stream
        }
      },
    });

    return stream;
  }

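The implementation relies on a few more Web APIs:

TextEncoder
ReadableStream
Symbol.asyncIterator

They cover text encoding, stream creation, and async iteration. The iteration also has to take ollama's data structure into account.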
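
The same three APIs can also be combined into a generic helper. The sketch below is a variation rather than the code above: it uses the stream's pull callback instead of start, so the iterator is only advanced while the stream has room for more data. iterableToStream and fakeChunks are hypothetical names, and the generator merely stands in for ollama's output; with the real stream, each ChatResponse would first have to be mapped to its message.content.

```typescript
// Hypothetical helper: turns any async iterable of strings into a web ReadableStream of bytes.
function iterableToStream(source: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = source[Symbol.asyncIterator]();

  return new ReadableStream<Uint8Array>({
    // pull is called only while the stream's internal queue has room,
    // so the iterator is never advanced faster than the data is consumed
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(encoder.encode(value));
    },
    // cancel runs when the consumer gives up; let the iterator clean up
    async cancel() {
      await iterator.return?.();
    },
  });
}

// Stand-in for the ollama stream: an async generator yielding text pieces.
async function* fakeChunks() {
  yield "Hello";
  yield ", ";
  yield "world";
}
```

Reading the result with new Response(iterableToStream(fakeChunks())).text() resolves to "Hello, world".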

8. Rendering a complete string

const response = await ollama.chat({
  model: "qwen",
  messages,
});

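With a complete string, pro-chat falls back to its default loading state: nothing is shown until all characters have been generated. If the text is long, the loading time will be correspondingly long.

9. Streaming rendering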

const response = await ollama.chat({
  model: "qwen",
  messages,
  stream: true,
});

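Because full-string rendering makes the user wait for the entire reply, streaming rendering is popular in LLM applications for its much better experience. Enabling streaming in ollama is simple: add stream: true to the chat options.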
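
To get such a stream to ProChat, the Remix side needs a route that returns it as a Response body. The following is only a sketch under assumptions: createChatAction is a hypothetical helper, the route path /api/chat is made up, and the chat argument stands for a function like the one in section 7 that resolves to a ReadableStream.

```typescript
type ChatMessage = { role: string; content: string };
type ChatFn = (messages: ChatMessage[]) => Promise<ReadableStream<Uint8Array>>;

// Hypothetical factory: wraps a chat function in a Remix-style action handler.
function createChatAction(chat: ChatFn) {
  return async ({ request }: { request: Request }): Promise<Response> => {
    // Assumption: the client posts a JSON body of the form { messages: [...] }
    const { messages } = (await request.json()) as { messages: ChatMessage[] };
    const stream = await chat(messages);
    // A web Response accepts a ReadableStream body and forwards chunks as they arrive
    return new Response(stream, {
      headers: { "Content-Type": "text/plain; charset=utf-8" },
    });
  };
}

// In a route module (for example app/routes/api.chat.ts, a made-up path):
//   export const action = createChatAction(chat);
// And on the ProChat side:
//   request={(messages) =>
//     fetch("/api/chat", { method: "POST", body: JSON.stringify({ messages }) })}
```

Passing the chat function in as an argument keeps the handler independent of ollama, which also makes it easy to try out with a fake stream.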

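10. Server-Sent Events

ChatGPT uses SSE to implement streaming rendering, and the technique became widely known after ChatGPT took off. Implementing SSE requires setting these HTTP headers:

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

Using SSE involves the following:

Set the response headers above
Provide an endpoint on the server that writes the response data
On the client, use EventSource to listen to the server and read the data it sends in the onmessage handler

When debugging Server-Sent Events in the browser, the request gains an extra EventStream tab in the network panel:

It has four fields: ID, Type, message, and Time.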
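
Those fields map onto the SSE wire format, where each event is a block of "field: value" lines terminated by an empty line. A small sketch of a formatter, with formatSseEvent being a hypothetical helper name:

```typescript
type SseEvent = { id?: string; event?: string; data: string };

// Hypothetical helper: serializes one event into the text/event-stream format.
function formatSseEvent({ id, event, data }: SseEvent): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // multi-line payloads need one "data:" line per line of text
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  // the empty line marks the end of the event
  return lines.join("\n") + "\n\n";
}
```

When the event field is omitted, the browser dispatches the event with the default type "message", which is what an onmessage handler receives.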

Here is a simple example:

10.1) Client

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
</head>
<body>
  <div id="root"></div>
  <script type="module">
    window.onload = function() {
      const source = new EventSource("http://localhost:5656/event")
      source.onmessage = function(e) {
        if(JSON.parse(e.data).num >= 5) {
          console.log("dropped it")
          source.close()
        }
      }
    }
  </script>
</body>
</html>

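Once num >= 5 in the response data, the client closes the connection; otherwise EventSource would reconnect automatically and the events would be sent all over again.

10.2) Server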

import express from "express";
import cors from "cors";

const app = express();

app.use(cors());

app.get("/event", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  let counter = 0;
  const id = setInterval(() => {
    counter++;
    // write before checking the limit so the client actually receives num = 5
    res.write(`data: ${JSON.stringify({ num: counter })}\n\n`);
    if (counter >= 5) {
      clearInterval(id);
      res.end();
    }
  }, 1000);

  res.on("close", () => {
    console.log("client dropped me");
    clearInterval(id);
    res.end();
  });
});

app.listen(5656, () => {
  console.log("server on: http://localhost:5656");
});

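This is a simple express app that uses a counter on a timer to simulate producing data and reaching the end state. As you can see, SSE without any wrapper takes a fair amount of code.

However, ollama does not appear to use SSE; it uses ndjson instead. What follows is an exploration of ndjson.

11. ndjson

Full name: Newline Delimited JSON

The MIME type for ndjson is application/x-ndjson. An ndjson response looks roughly like this: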

{
    "model": "qwen",
    "created_at": "2024-05-06T17:11:38.1485541Z",
    "message": {
        "role": "assistant",
        "content": "In"
    },
    "done": false
}{
    "model": "qwen",
    "created_at": "2024-05-06T17:11:38.3499615Z",
    "message": {
        "role": "assistant",
        "content": " the"
    },
    "done": false
}

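The above is what the Preview tab of a chat request shows in the browser's Network panel for an ollama response: a series of objects, one after another. (They are pretty-printed here; on the wire each object sits on a single line terminated by \n.) Next, let's look at ollama's open source code. Here is an excerpt:

// server/routes.go streamResponse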
func streamResponse(c *gin.Context, ch chan any) {
	c.Header("Content-Type", "application/x-ndjson")
	c.Stream(func(w io.Writer) bool {
		val, ok := <-ch
		if !ok {
			return false
		}

		bts, err := json.Marshal(val)
		if err != nil {
			slog.Info(fmt.Sprintf("streamResponse: json.Marshal failed with %s", err))
			return false
		}

		// Delineate chunks with new-line delimiter
		bts = append(bts, '\n')
		if _, err := w.Write(bts); err != nil {
			slog.Info(fmt.Sprintf("streamResponse: w.Write failed with %s", err))
			return false
		}

		return true
	})
}

The header is set to application/x-ndjson. The handler reads a value from ch with <-ch, passes it to json.Marshal to produce bts, appends a \n newline, and writes bts to the io.Writer.
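
The same idea takes only a few lines in TypeScript. This sketch mirrors what the Go handler does for each value and adds the reverse direction; encodeNdjson and parseNdjson are hypothetical helper names.

```typescript
// One JSON document plus a trailing newline, like the Go handler.
function encodeNdjson(value: unknown): string {
  // JSON.stringify escapes line breaks inside strings, so the only raw "\n" is the delimiter
  return JSON.stringify(value) + "\n";
}

// Splits a complete NDJSON text into parsed values, ignoring empty lines.
function parseNdjson(text: string): unknown[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```

Because the delimiter can never appear inside a serialized value, splitting on "\n" is enough to recover the individual objects.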

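12. Implementation approach with express in Node

Key points:

Set the response header (observe it in the browser's network panel)
Split the data into chunks
Write the chunks to the response
End the response

12.1) Server implementation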

import express from "express";
import cors from "cors";
import { PassThrough } from 'node:stream'

const app = express();
app.use(cors())

app.get("/ndjson", (_, res) => {
  res.setHeader("Content-Type", "application/x-ndjson");
  const readableStream = new PassThrough()

  let count = 5;

  const id = setInterval(() => {
    // one valid JSON document per line
    readableStream.write(`${JSON.stringify({ num: count })}\n`)
    count--
    if(count <= 0) {
      clearInterval(id)
      readableStream.end() // fires the "end" handler below, which ends the response
    }
  }, 1000)

  readableStream.on('data', (chunk) => {
    res.write(chunk)
  })

  readableStream.on("end", () => {
    console.log("end");
    res.end();
  });

  readableStream.on("error", (err) => {
    console.error("Error reading the stream:", err);
    // headers have already been sent at this point, so just terminate the response
    res.end();
  });
});
app.listen(8866, () => {
  console.log("Server is running on http://localhost:8866");
});

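PassThrough gives us a readable stream; we write one line to it every second, listen for its data event, and forward each chunk to the client. The stream ends when count reaches 0.

12.2) Client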

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <div id="root">sdf</div>
    <script type="module">
      window.onload = function () {
        const el = document.body.querySelector('#root')
        // create the decoder once; stream: true keeps multi-byte characters split across chunks intact
        const decoder = new TextDecoder('utf-8')

        fetch("http://localhost:8866/ndjson").then((res) => {
          if (!res.ok) {
            throw new Error("Network response was not ok");
          }
          const reader = res.body.getReader()

          // read recursively until the stream reports done
          return reader.read().then(function processText({ done, value }) {
            if (done) {
              console.log("Stream complete");
              return;
            }

            el.innerHTML += decoder.decode(value, { stream: true })

            return reader.read().then(processText);
          })
        }).catch((error) => {
          console.log(error)
        });
      };
    </script>
  </body>
</html>

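Because the server returns a data stream, the client has to process it as a stream, which the fetch API makes easy. The two less commonly used APIs to note are the reader and the decoder: the stream is read recursively, chunk by chunk, until done is true.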
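
The page above only appends raw text. To get parsed objects instead, the reader also has to cope with network chunks that do not line up with line boundaries. A minimal sketch of that, with readNdjson being a hypothetical helper name:

```typescript
// Hypothetical helper: yields one parsed object per NDJSON line from a byte stream.
async function* readNdjson(stream: ReadableStream<Uint8Array>): AsyncGenerator<unknown> {
  const reader = stream.getReader();
  const decoder = new TextDecoder("utf-8");
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps a multi-byte character split across chunks intact
    buffer += decoder.decode(value, { stream: true });

    // a chunk may hold several lines or only part of one, so parse the
    // complete lines and keep the remainder for the next chunk
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (line.trim() !== "") yield JSON.parse(line);
    }
  }

  // flush the decoder and handle a final line without a trailing newline
  buffer += decoder.decode();
  if (buffer.trim() !== "") yield JSON.parse(buffer);
}
```

With a fetch response this could be consumed as for await (const item of readNdjson(res.body)) { ... }, replacing the recursive read with a plain loop.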

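13. Summary

This post covered the knowledge and practice needed to connect to an ollama model, using ProChat + Remix to talk to a local Ollama streaming service. It touched on SSE and streaming with express, as well as a hands-on exploration of express streams + ndjson. I hope it is useful to readers.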
