Go Internals: Strings

The nature of strings

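A string is an important data structure, usually made up of a sequence of characters. Strings generally come in two kinds: one has a length fixed at compile time and cannot be modified; the other has a dynamic length and can be modified.

In Go, however, a string cannot be modified, only read.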

Underlying structure:

type StringHeader struct {
   Data uintptr
   Len  int
}

Data points to the underlying byte array.
Len is the length of the string in bytes.
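
A minimal sketch of what the two-field header implies: a string variable always occupies two machine words (pointer plus length), no matter how long its contents are. The helper name headerWords is made up for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords reports how many machine words a string value occupies.
// (Hypothetical helper, introduced only for this illustration.)
func headerWords(s string) uintptr {
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
	short := "hi"
	long := "a much longer string value"

	// The variable holds only the header (Data, Len); the bytes live elsewhere,
	// so both values have the same size.
	fmt.Println(headerWords(short), headerWords(long))

	// len reads the Len field; it does not walk the data.
	fmt.Println(len(short), len(long))
}
```

Because only the header is copied, passing a string to a function or assigning it to another variable is cheap regardless of the string's length.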

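A string is essentially a sequence of bytes. Each character is stored as one or more integers (bytes), and how many is determined by the character encoding.

All Go source files are encoded in UTF-8, and character constants use the UTF-8 encoding as well. An English letter generally takes 1 byte, while a Chinese character typically takes 3 bytes.

Go uses the rune type to represent and distinguish the individual "characters" in a string; rune is simply an alias for int32.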
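
The difference between bytes and runes can be sketched as follows. The helper byteAndRuneCount is a name invented for this example; len, range and the unicode/utf8 functions are standard.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the byte length and the rune count of s.
// (Hypothetical helper, introduced only for this illustration.)
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Go语言"
	b, r := byteAndRuneCount(s)
	fmt.Println(b, r) // 8 4: two 1-byte letters plus two 3-byte characters

	// range decodes UTF-8: i is the byte offset, c is a rune (int32).
	for i, c := range s {
		fmt.Printf("offset %d: %c (%d bytes)\n", i, c, utf8.RuneLen(c))
	}
}
```

Indexing with s[i] yields a single byte, so iterating with range is the usual way to walk a string character by character.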

How strings work under the hood

String literals have special delimiters, and there are two ways to write them.

var a string = `hello world`
var b string = "hello world"
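
The two forms differ in how escapes are treated, which the following small sketch shows. The function literalLengths is a made-up name used only here.

```go
package main

import "fmt"

// literalLengths returns the lengths of a raw and an interpreted literal
// that look the same in source. (Hypothetical helper for this illustration.)
func literalLengths() (raw, interpreted int) {
	r := `a\nb` // raw: backslash and 'n' stay as two separate characters
	i := "a\nb" // interpreted: \n becomes a single newline character
	return len(r), len(i)
}

func main() {
	raw, interpreted := literalLengths()
	fmt.Println(raw, interpreted) // 4 3
	fmt.Printf("%q\n%q\n", `a\nb`, "a\nb")
}
```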

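String literals are recognized during lexical analysis and handed to the next stage as tokens of kind StringLit. The scanner, which feeds Go's recursive-descent parser, reads the source as UTF-8 characters; the backtick and the double quote are the delimiters that mark a string.

The scanning logic (shown here as it appears in go/scanner, where the token is named token.STRING):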

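func (s *Scanner) Scan() (pos token.Pos, tok token.Token, lit string) {
   // Abridged from go/scanner: only the cases for quoted literals are shown.
   // ch is the character that was just read; insertSemi is a local flag of Scan.
   switch ch {
   case '"':
      // interpreted string literal
      insertSemi = true
      tok = token.STRING
      lit = s.scanString()
   case '\'':
      // rune literal
      insertSemi = true
      tok = token.CHAR
      lit = s.scanRune()
   case '`':
      // raw string literal
      insertSemi = true
      tok = token.STRING
      lit = s.scanRawString()
   // ... other cases omitted
   }
   return
}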

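When a double quote (") is recognized, scanString() is called.

For a double-quoted string, scanning ends when the closing double quote appears; if a backslash (\) is encountered, the character after it is treated as an escape. A double-quoted string therefore cannot contain a raw newline, which is enforced by checking ch == '\n' for every character.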

func (s *Scanner) scanString() string {
   // '"' opening already consumed
   offs := s.offset - 1

   for {
      ch := s.ch
      if ch == '\n' || ch < 0 {
         s.error(offs, "string literal not terminated")
         break
      }
      s.next()
      if ch == '"' {
         break
      }
      if ch == '\\' {
         s.scanEscape('"')
      }
   }

   return string(s.src[offs:s.offset])
}

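When a backtick (`) is recognized, scanRawString() is called.

Backticks are handled more simply: the scanner keeps reading until it reaches the matching backtick and then stops. Newlines are allowed, and any carriage returns ('\r') are stripped from the result.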

func (s *Scanner) scanRawString() string {
   // '`' opening already consumed
   offs := s.offset - 1

   hasCR := false
   for {
      ch := s.ch
      if ch < 0 {
         s.error(offs, "raw string literal not terminated")
         break
      }
      s.next()
      if ch == '`' {
         break
      }
      if ch == '\r' {
         hasCR = true
      }
   }

   lit := s.src[offs:s.offset]
   if hasCR {
      lit = stripCR(lit, false)
   }

   return string(lit)
}
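
The behavior of these scan functions can be observed from user code through the public go/scanner package. The sketch below is only an illustration; firstToken is a hypothetical helper, and passing a nil error handler means scan errors are counted rather than reported.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// firstToken scans src and returns the first token and its literal text.
// (Hypothetical helper, introduced only for this illustration.)
func firstToken(src string) (token.Token, string) {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // nil error handler, default mode
	_, tok, lit := s.Scan()
	return tok, lit
}

func main() {
	sources := []string{
		`"hello\n"`, // interpreted string containing an escape sequence
		"`hello\n`", // raw string containing a real newline
		`'x'`,       // rune literal
	}
	for _, src := range sources {
		tok, lit := firstToken(src)
		fmt.Printf("%-6s %q\n", tok, lit)
	}
}
```

Both string forms come back as token.STRING and the rune literal as token.CHAR; the returned literal text still includes the surrounding quotes.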

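String concatenation

In Go, strings are concatenated with the plus sign.

When both operands of + are strings, the compiler resolves the operation's Op to OADDSTR while building the abstract syntax tree. When string constants are concatenated, the noder.sum function is called during parsing.

For example, for "a"+"b"+"c", noder.sum collects all the string constants into a string slice and then calls strings.Join to concatenate them.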

noder.go

// sum efficiently handles very large summation expressions (such as
// in issue #16394). In particular, it avoids left recursion and
// collapses string literals.
func (p *noder) sum(x syntax.Expr) ir.Node {
   // While we need to handle long sums with asymptotic
   // efficiency, the vast majority of sums are very small: ~95%
   // have only 2 or 3 operands, and ~99% of string literals are
   // never concatenated.

   adds := make([]*syntax.Operation, 0, 2)
   for {
      add, ok := x.(*syntax.Operation)
      if !ok || add.Op != syntax.Add || add.Y == nil {
         break
      }
      adds = append(adds, add)
      x = add.X
   }

   // nstr is the current rightmost string literal in the
   // summation (if any), and chunks holds its accumulated
   // substrings.
   //
   // Consider the expression x + "a" + "b" + "c" + y. When we
   // reach the string literal "a", we assign nstr to point to
   // its corresponding Node and initialize chunks to {"a"}.
   // Visiting the subsequent string literals "b" and "c", we
   // simply append their values to chunks. Finally, when we
   // reach the non-constant operand y, we'll join chunks to form
   // "abc" and reassign the "a" string literal's value.
   //
   // N.B., we need to be careful about named string constants
   // (indicated by Sym != nil) because 1) we can't modify their
   // value, as doing so would affect other uses of the string
   // constant, and 2) they may have types, which we need to
   // handle correctly. For now, we avoid these problems by
   // treating named string constants the same as non-constant
   // operands.
   var nstr ir.Node
   chunks := make([]string, 0, 1)

   n := p.expr(x)
   if ir.IsConst(n, constant.String) && n.Sym() == nil {
      nstr = n
      chunks = append(chunks, ir.StringVal(nstr))
   }

   for i := len(adds) - 1; i >= 0; i-- {
      add := adds[i]

      r := p.expr(add.Y)
      if ir.IsConst(r, constant.String) && r.Sym() == nil {
         if nstr != nil {
            // Collapse r into nstr instead of adding to n.
            chunks = append(chunks, ir.StringVal(r))
            continue
         }

         nstr = r
         chunks = append(chunks, ir.StringVal(nstr))
      } else {
         if len(chunks) > 1 {
            nstr.SetVal(constant.MakeString(strings.Join(chunks, "")))
         }
         nstr = nil
         chunks = chunks[:0]
      }
      n = ir.NewBinaryExpr(p.pos(add), ir.OADD, n, r)
   }
   if len(chunks) > 1 {
      nstr.SetVal(constant.MakeString(strings.Join(chunks, "")))
   }

   return n
}
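
The collapsing idea in noder.sum can be sketched outside the compiler. The following is a simplified model, not the compiler's code: the operand type and the collapse function are invented for this illustration, and named constants are not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// operand is a hypothetical stand-in for a compiler node: either a string
// literal (lit == true) or a non-constant expression identified by val.
type operand struct {
	lit bool
	val string
}

// collapse merges every run of adjacent string literals into one literal,
// joining the accumulated chunks once per run.
func collapse(ops []operand) []operand {
	var out []operand
	var chunks []string

	// flush emits the accumulated literal chunks as a single operand.
	flush := func() {
		if len(chunks) > 0 {
			out = append(out, operand{lit: true, val: strings.Join(chunks, "")})
			chunks = chunks[:0]
		}
	}

	for _, op := range ops {
		if op.lit {
			chunks = append(chunks, op.val)
			continue
		}
		flush() // a non-constant operand ends the current run
		out = append(out, op)
	}
	flush()
	return out
}

func main() {
	// Models the expression x + "a" + "b" + "c" + y.
	ops := []operand{{false, "x"}, {true, "a"}, {true, "b"}, {true, "c"}, {false, "y"}}
	fmt.Println(collapse(ops)) // [{false x} {true abc} {false y}]
}
```

Joining each run once, rather than concatenating pairwise, keeps the work linear in the total length of the literals.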

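If string variables are involved, the concatenation is ultimately performed at runtime.

At runtime, concatenation does not simply append one string onto another. Instead, a larger block of memory is found and the strings are copied into it.

Concatenation calls the runtime.concatstrings function, which first iterates over the slice it is given, skipping empty strings and computing the length of the concatenated result.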

// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
   idx := 0
   l := 0
   count := 0
   for i, x := range a {
      n := len(x)
      if n == 0 {
         continue
      }
      if l+n < l {
         throw("string concatenation too long")
      }
      l += n
      count++
      idx = i
   }
   if count == 0 {
      return ""
   }

   // If there is just one string and either it is not on the stack
   // or our result does not escape the calling frame (buf != nil),
   // then we can return that string directly.
   if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
      return a[idx]
   }
   s, b := rawstringtmp(buf, l)
   for _, x := range a {
      copy(b, x)
      b = b[len(x):]
   }
   return s
}

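The memory for the result is obtained in rawstringtmp. When the concatenated string is at most 32 bytes and the compiler has supplied a temporary buffer (buf != nil), that buffer is used. Otherwise a sufficiently large block is allocated on the heap. In both cases the operands are then copied into that memory one after another.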

func rawstringtmp(buf *tmpBuf, l int) (s string, b []byte) {
   if buf != nil && l <= len(buf) {
      b = buf[:l]
      s = slicebytetostringtmp(&b[0], len(b))
   } else {
      s, b = rawstring(l)
   }
   return
}
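
The measure, allocate once, then copy pattern used by the runtime can be imitated in ordinary Go. This is a rough sketch under that assumption, not the runtime's implementation; concat is a name chosen for the example, and unlike the runtime it pays one extra copy when converting the buffer to a string.

```go
package main

import "fmt"

// concat imitates the shape of runtime concatenation: compute the total
// length, allocate one buffer, and copy every operand into it.
// (Hypothetical helper, introduced only for this illustration.)
func concat(parts ...string) string {
	total := 0
	for _, p := range parts {
		total += len(p) // empty strings simply contribute nothing
	}
	if total == 0 {
		return ""
	}

	buf := make([]byte, total)
	rest := buf
	for _, p := range parts {
		copy(rest, p)        // copy the operand into the remaining space
		rest = rest[len(p):] // advance past what was just written
	}
	return string(buf) // user code copies once more here
}

func main() {
	fmt.Println(concat("foo", "", "bar", "baz")) // foobarbaz
}
```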

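Converting between strings and byte slices

Converting a byte slice to a string calls the slicebytetostring function at runtime. Note that conversions between byte slices and strings are not simple pointer reassignments: they copy the data. When the string is larger than 32 bytes, heap memory must also be allocated.

Converting a string to a byte slice calls the stringtoslicebyte function at runtime. It is very similar to slicebytetostring and needs new memory of sufficient size. When the string is at most 32 bytes, the temporary buffer buf can be used directly. When it is larger than 32 bytes, rawbyteslice allocates enough memory on the heap. Finally, the copy function performs the memory copy.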
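
That these conversions copy can be seen from user code: changing the byte slice never affects the string it came from or a string made from it. The helper replaceFirst is invented for this sketch.

```go
package main

import "fmt"

// replaceFirst returns a copy of s whose first byte is replaced by c.
// (Hypothetical helper, introduced only for this illustration.)
func replaceFirst(s string, c byte) string {
	if s == "" {
		return s
	}
	b := []byte(s) // copies the string's bytes into a new slice
	b[0] = c       // modifies the copy only
	return string(b)
}

func main() {
	s := "hello"
	t := replaceFirst(s, 'H')
	fmt.Println(s, t) // hello Hello

	b := []byte("abc")
	u := string(b) // u gets its own copy of the bytes
	b[0] = 'x'
	fmt.Println(u, string(b)) // abc xbc
}
```

This is also the usual way to "modify" a string in Go: convert it to a byte slice (or a rune slice), change that, and convert back.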