Counting Tokens for ChatGPT

First, the official documentation:

Source: platform.openai.com/docs/guides…

Official example: cookbook.openai.com/examples/ho…

tiktoken on GitHub: github.com/openai/tikt…

Examples

(1) The official function for counting the tokens in a list of messages

import tiktoken


def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
  """Returns the number of tokens used by a list of messages."""
  try:
      encoding = tiktoken.encoding_for_model(model)
  except KeyError:
      encoding = tiktoken.get_encoding("cl100k_base")
  if model == "gpt-3.5-turbo-0613":  # note: future models may deviate from this
      num_tokens = 0
      for message in messages:
          num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
          for key, value in message.items():
              num_tokens += len(encoding.encode(value))
              if key == "name":  # if there's a name, the role is omitted
                  num_tokens += -1  # role is always required and always 1 token
      num_tokens += 2  # every reply is primed with <im_start>assistant
      return num_tokens
  else:
      raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
  See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")
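To make the bookkeeping in num_tokens_from_messages concrete, the sketch below repeats the same arithmetic with a stand-in encoder that simply splits on whitespace. The names `count_message_tokens` and `whitespace_encode` are hypothetical, and the whitespace split is only an assumption for illustration; real counts come from tiktoken.

```python
def whitespace_encode(text):
    """Stand-in encoder (an assumption for illustration): one 'token' per word."""
    return text.split()


def count_message_tokens(messages, encode=whitespace_encode):
    """Apply the same accounting as num_tokens_from_messages with a pluggable encoder."""
    num_tokens = 0
    for message in messages:
        num_tokens += 4  # fixed per-message overhead
        for key, value in message.items():
            num_tokens += len(encode(value))
            if key == "name":
                num_tokens -= 1  # with a name, the 1-token role is omitted
    num_tokens += 2  # the reply is primed with the assistant header
    return num_tokens


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Count my tokens please"},
]
total = count_message_tokens(messages)
print(total)
```

With the stand-in encoder, the two messages cost (4 + 1 + 5) + (4 + 1 + 4) + 2 = 21 tokens.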

(2) Counting the tokens in a given string

Encoding

encoding = tiktoken.get_encoding("cl100k_base")
encoding.encode("tiktoken is great!")
[83, 1609, 5963, 374, 2294, 0]

Counting tokens

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
num_tokens_from_string("tiktoken is great!", "cl100k_base")

The result is 6.

Decoding

Decode a sequence of tokens:

encoding.decode([83, 1609, 5963, 374, 2294, 0])

The decoded output is:

'tiktoken is great!'

Application

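Being able to count a message's tokens in advance is useful. For example, to translate an article we split it into paragraphs and translate each one, but a single paragraph can still exceed the limit. In that case we count the tokens first and, if the limit is exceeded, split the paragraph further. Example code: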

def translate_paragraph(paragraph):
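    # Translate using a global client instance; `client` (an OpenAI client)
    # and `gpt_model` (a model name) are assumed to be defined elsewhere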
    role_content = "You are an assistant capable of translating Chinese written content into English directly. The content is written in Markdown, and you should maintain the formatting while translating."
    messages = [
        {"role": "system", "content": role_content},
        {"role": "user", "content": paragraph}
    ]

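    # Check whether the messages exceed the token limit.
    # A single-line paragraph cannot be split further, so it is sent as-is
    # rather than recursing forever.
    lines = paragraph.split('\n')
    if num_tokens_from_messages(messages) > 2048 and len(lines) > 1:
        half_point = len(lines) // 2
        first_half = '\n'.join(lines[:half_point])
        second_half = '\n'.join(lines[half_point:])

        # Recursively translate each half
        print("Over the limit: splitting in half")
        translated_first_half = translate_paragraph(first_half)
        print("Finished translating the first part")
        translated_second_half = translate_paragraph(second_half)
        print("Finished translating the second part")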

        if translated_first_half is not None and translated_second_half is not None:
            return translated_first_half + '\n' + translated_second_half
        else:
            return None
    try:
        completion = client.chat.completions.create(
            model=gpt_model,
            messages=messages
        )

        # Extract the translated text
        translated_text = completion.choices[0].message.content
        return translated_text

    except Exception as e:
        print(f"An error occurred: {e}")
        print(f"Failed to translate paragraph: {paragraph}")
        return None
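
Recursive halving may produce more requests than strictly necessary. As an alternative sketch, lines can be packed greedily into chunks that stay under a token budget before any request is sent. The function name `pack_lines`, the `count_tokens` parameter, and the word-count stand-in used in the demo are assumptions for illustration; in practice `count_tokens` could wrap num_tokens_from_string with the cl100k_base encoding.

```python
from typing import Callable, List


def pack_lines(text: str, max_tokens: int, count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily group consecutive lines into chunks of at most max_tokens tokens.

    A single line that is over budget on its own is kept as its own chunk,
    because this sketch never splits inside a line.
    """
    chunks: List[str] = []
    current: List[str] = []
    for line in text.split("\n"):
        candidate = "\n".join(current + [line])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this line would exceed the budget: close the current chunk.
            chunks.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def count_words(s: str) -> int:
    """Stand-in counter (an assumption for the demo): one token per word."""
    return len(s.split())


doc = "one two three\nfour five\nsix seven eight nine\nten"
chunks = pack_lines(doc, max_tokens=5, count_tokens=count_words)
print(chunks)
```

Each candidate chunk is measured as a whole rather than by summing per-line counts, since joining lines can change how a real tokenizer splits the text. Joining the returned chunks with '\n' reproduces the original input.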