Transmission Control Protocol
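Three-way handshake and four-way teardown
How reliable delivery is ensured
TCP achieves reliable delivery mainly through sequence numbers with acknowledgments, retransmission, and checksums.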
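Why a three-way handshake
The point of the three-way handshake is to confirm that both sides can send and receive.
In the first handshake the client sends a connection request to the server. On receiving it, the server knows that the client can reach it, but the client does not yet know whether its request arrived, so the server must reply. Once the client receives that reply, it knows that it can reach the server and be reached by it; this is the second handshake. The server's reply, in turn, has to be acknowledged by the client before the server can treat the connection as valid; this is the third handshake.
Without the third handshake, suppose the client's first connection request is held up at some network node and only reaches the server after the connection has already been released. The segment is long expired, but the server still treats it as a fresh connection request (first handshake) and replies to the client (second handshake). With no third handshake the connection would count as established at this point, yet the client has nothing to send, so the server would wait in vain and waste resources.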
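The exchange described above can be sketched by tracking only the sequence and acknowledgment numbers. This is a minimal illustration, not a protocol implementation: the `Segment` class and `three_way_handshake` function are hypothetical names, and the initial sequence numbers are passed in rather than chosen randomly.

```python
from dataclasses import dataclass
from typing import Optional

SEQ_SPACE = 2 ** 32  # TCP sequence numbers wrap around at 32 bits


@dataclass(frozen=True)
class Segment:
    flags: str
    seq: int
    ack: Optional[int] = None


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three segments exchanged when a connection is opened."""
    # 1st handshake: client -> server, carries the client's initial sequence number.
    syn = Segment("SYN", seq=client_isn)
    # 2nd handshake: server -> client, acknowledges the SYN and carries the server's ISN.
    syn_ack = Segment("SYN+ACK", seq=server_isn, ack=(syn.seq + 1) % SEQ_SPACE)
    # 3rd handshake: client -> server, acknowledges the server's SYN.
    ack = Segment("ACK", seq=(syn.seq + 1) % SEQ_SPACE, ack=(syn_ack.seq + 1) % SEQ_SPACE)
    return [syn, syn_ack, ack]
```

Only after the third segment does the server know that its own SYN was received, which is what lets it tell a live request from a stale one.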
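TIME_WAIT
Why the TIME_WAIT state exists
- Allow old segments to expire in the network
  - Suppose we establish a TCP connection between 192.168.1.1:5000 and 39.106.170.184:6000, close it after a while, and then open a new TCP connection on the same socket pair. The new connection is called an incarnation of the previous one. A segment from the old connection may arrive late, and the new connection could mistake it for its own data and corrupt the stream. To prevent this, TCP keeps the connection in TIME_WAIT for 2MSL. During that time no new incarnation can be created on the same socket pair, which gives late segments enough time to be discarded.
- Ensure that the full-duplex TCP connection closes correctly
  - If the final ACK from the active closer is lost, the passive closer (the server here) retransmits its FIN so that the active closer can resend the ACK. If the active closer did not keep the 2MSL state, it would have to answer the retransmitted FIN with an RST, which the server would interpret as an error. The connection could then not complete its full-duplex close and would be left half-closed.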
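Why TIME_WAIT lasts 2MSL
- One MSL ensures that the active closer's final ACK can reach the peer.
- One MSL ensures that a FIN retransmitted by the passive closer can reach the active closer.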
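How excessive TIME_WAIT arises and how to guard against it
- Harm caused by too many TIME_WAIT sockets
  - After a normal TCP client closes a connection, the connection enters TIME_WAIT, typically for 1 to 4 minutes. With a low connection rate that is not long and has no real effect on the system. But if a large number of short-lived connections are opened in a short time (for example within 1 s), the client's operating system can run out of local ports and other socket resources, and the system can no longer initiate new connections.
- Solutions
  - The server can set the SO_REUSEADDR socket option to tell the kernel that a busy port may be reused when the TCP connection on it is in TIME_WAIT.
  - Replace short-lived connections with long-lived ones.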
tcp_tw_recycle (Boolean; default: disabled; Linux 2.4 to Linux
4.11)
Enable fast recycling of TIME_WAIT sockets. Enabling this
option is not recommended as the remote IP may not use
monotonically increasing timestamps (devices behind NAT,
devices with per-connection timestamp offsets). See RFC
1323 (PAWS) and RFC 6191.
tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)
Allow to reuse TIME_WAIT sockets for new connections when
it is safe from protocol viewpoint. It should not be
changed without advice/request of technical experts.
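As an illustration of the SO_REUSEADDR option mentioned above, here is a minimal sketch of a listening socket that sets it before binding. The helper name `make_listener` and the loopback address are assumptions made for the example; port 0 asks the OS to pick a free port.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR enabled."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(): lets the server rebind a port whose
    # previous connections are still sitting in TIME_WAIT.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    return srv
```

This mainly helps a restarted server rebind its listening port. It does not shorten TIME_WAIT itself.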
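Sticky packets
TCP "sticky packets" means that several packets sent by the sender arrive at the receiver stuck together as one: seen from the receive buffer, the head of one packet immediately follows the tail of the previous one. The cause can lie with the sender or with the receiver.
(1) Sender side. TCP uses the Nagle algorithm by default (its main purpose is to reduce the number of segments in the network). The Nagle algorithm does two main things:
- The next small packet is sent only after the previous one has been acknowledged
- Small packets that accumulate in the meantime are collected and sent together when an acknowledgment arrives
As a result, the Nagle algorithm can cause sticking on the sender side.
(2) Receiver side. TCP stores received packets in the receive buffer, and the application reads them from that buffer. If TCP places packets into the buffer faster than the application reads them out, several packets accumulate, and the application may read several of them joined end to end.
To solve the sticky-packet problem, upper-layer protocols usually re-wrap the data to be sent in a header + body format.
The header usually contains the length of the body, and that length is used to cut out the actual message body.
In addition, the sender may add check fields (a checksum, or a CRC computed over the complete data) after the flag field. Once the receiver has the whole message, it verifies these fields to make sure it holds exactly the complete data the sender sent.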
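The header + body approach can be sketched as a length-prefixed framing layer. This is one possible layout, assumed for illustration: a 4-byte big-endian length followed by the body. The names `encode_frame` and `FrameDecoder` are hypothetical.

```python
import struct

HEADER = struct.Struct("!I")  # assumed header: 4-byte big-endian body length


def encode_frame(payload: bytes) -> bytes:
    """Prefix the body with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassemble complete messages from arbitrarily split or merged chunks."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        self._buf += chunk
        frames = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # body not fully received yet, wait for more bytes
            frames.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return frames
```

Each chunk returned by `recv()` would be passed to `feed()`, which returns zero or more complete messages however the byte stream happened to be split.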
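SYN flood attack
In the first step of the TCP three-way handshake, the client sends a SYN segment to the server. On receiving it, the server allocates buffers for that TCP connection. If an attacker sends SYN segments to the server in bulk, the server's connection resources are eventually exhausted and it can no longer serve legitimate clients.
Mitigation (SYN cookies): when the server receives a SYN segment, it does not allocate resources for the connection right away and keeps no half-open connection state. Instead, it computes a cookie from the SYN segment's source IP, destination IP, and port numbers with a secret function known only to the server, and sends the cookie back to the client as its initial sequence number.
A legitimate client replies with a segment whose acknowledgment field is cookie + 1. The server then recomputes the function over the source IP, destination IP, and port numbers of that ACK. If the result plus 1 equals the acknowledgment field, the ACK belongs to a client that really requested the connection, and only then does the server allocate resources for it.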
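The cookie check can be sketched as follows. This is a simplified illustration only: HMAC-SHA256 stands in for the "secret function", the function names are hypothetical, and real SYN cookie implementations also fold in other information (such as a time counter) that is omitted here.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(16)  # known only to the server
SEQ_SPACE = 2 ** 32


def make_cookie(src_ip, src_port, dst_ip, dst_port, secret=SECRET) -> int:
    """Derive a 32-bit initial sequence number from the connection 4-tuple."""
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def ack_is_valid(ack_number, src_ip, src_port, dst_ip, dst_port, secret=SECRET) -> bool:
    """Recompute the cookie from the ACK's addresses; no per-SYN state is stored."""
    expected = (make_cookie(src_ip, src_port, dst_ip, dst_port, secret) + 1) % SEQ_SPACE
    return ack_number == expected
```

Because the server can recompute the cookie from the ACK alone, a flood of SYNs that never complete the handshake costs it no memory.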
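When TCP packet loss can occur
- TCP three-way handshake
  - On the server, after the first handshake a half-open connection is created first, and then the second handshake is sent. These half-open connections need to be kept somewhere; that place is the half-open (SYN) queue.
  - When the third handshake arrives, the half-open connection is promoted to a full connection and parked in another place, the accept queue, where it waits for the program to take it with accept().
  - A queue has a length, and anything with a length can fill up. If either queue is full, newly arriving packets are dropped.
  - The following commands show whether this kind of loss is happening.
    - netstat -s | grep overflowed (number of accept queue overflows)
    - netstat -s | grep -i "SYNs to LISTEN sockets dropped" (number of SYN queue overflows)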
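- Transmit Queue Length (txqueuelen)
  - txqueuelen determines how many packets can wait in the queue before being handed to the network device driver. If the queue is full, new packets are dropped.
  - The value appears in the output of ifconfig: the number after txqueuelen (1000 here) is the length of this transmit queue.
  - When data is sent too fast and txqueuelen is not large enough, packets are easily dropped.
  - Run ifconfig eth0 and check the dropped field under TX; a value greater than 0 suggests that packets may have been dropped at the transmit queue.
  - In that case we can try changing the queue length, for example raising it on eth0 from 1000 to 1500:
    - ifconfig eth0 txqueuelen 1500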
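- Packet loss because the NIC ring buffer is too small
  - Run ifconfig and check the overruns counter, which records how many overflows were caused by an undersized ring buffer.
  - Run ethtool -g eth0 to see the NIC's current ring buffer configuration.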
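- Insufficient NIC performance
  - A NIC is hardware, and its transfer rate has an upper limit. When traffic reaches that limit, packets are dropped. This is typically seen in load tests.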
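- Receive buffer loss
  - When we program with a TCP socket, the kernel allocates a send buffer and a receive buffer for it.
  - When the receive buffer is full, its TCP receive window drops to 0, the so-called zero window, and the receiver advertises win=0 in its segments to tell the sender, "please, I can't keep up, stop sending". Normally the sender should then stop sending, but if data still arrives at this point, it is dropped.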
- Packet loss on the network between the two ends
  - Use ping to check for loss
  - mtr
    - mtr shows the loss at each hop between your machine and the destination.
    - mtr -r baidu.com
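The listen-queue counters reported by netstat -s in the handshake item above can also be read programmatically. The sketch below assumes a Linux system where /proc/net/netstat exposes a pair of "TcpExt:" lines (names, then values), and assumes that the counters named ListenOverflows and ListenDrops correspond to the two netstat -s messages. The function names are hypothetical.

```python
def parse_tcpext(text: str) -> dict:
    """Parse the 'TcpExt:' name line and value line into a dict of counters."""
    rows = [line.split() for line in text.splitlines() if line.startswith("TcpExt:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def listen_queue_counters(path: str = "/proc/net/netstat") -> dict:
    """Return the accept-queue overflow and SYN drop counters (Linux only)."""
    with open(path) as f:
        stats = parse_tcpext(f.read())
    return {key: stats.get(key) for key in ("ListenOverflows", "ListenDrops")}
```

Sampling these counters twice and comparing the values shows whether the queues are overflowing right now, as opposed to at some point since boot.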
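When a packet is lost, TCP retransmits it.
TCP only guarantees reliable delivery at the transport layer, not at the application layer. If we also need reliable delivery of application-level messages, the application layer has to implement that logic itself.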
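TCP segment header fields
TCP checksum
The TCP checksum is an end-to-end checksum, computed by the sender and verified by the receiver. Its purpose is to detect any change to the TCP header and data between sender and receiver. If the receiver detects a checksum error, the TCP segment is discarded. The TCP checksum covers the TCP header and the TCP data, whereas the checksum in the IP header covers only the IP header and none of the data in the IP datagram. The TCP checksum is mandatory, while the UDP checksum is optional (over IPv4). When computing the checksum, both TCP and UDP prepend a 12-byte pseudo-header.
The pseudo-header is 12 bytes long and contains: source IP address, destination IP address, a reserved byte (set to 0), the transport-layer protocol number (6 for TCP), and the TCP length (header + data).
The pseudo-header strengthens the checksum's error detection: it can reveal, for example, that a TCP segment was delivered to the wrong host (destination IP address) or handed to the wrong transport protocol (protocol number).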
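The checksum with its pseudo-header can be sketched as below, for IPv4 only. The function names are illustrative. The segment passed in is assumed to be the TCP header plus data, with the checksum field (bytes 16 and 17 of the header) zeroed by the sender before computing.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """12 bytes: source IP, destination IP, zero byte, protocol 6, TCP length."""
    return struct.pack("!4s4sBBH", socket.inet_aton(src_ip),
                       socket.inet_aton(dst_ip), 0, 6, tcp_length)


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over pseudo-header + TCP header + data."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF
```

A receiver can run the same computation over the segment with the checksum field filled in; an undamaged segment yields 0.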
Retransmission
Timeout retransmission
event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer
- Scenario 1
  - Host A sends one segment to Host B. Although the segment from A is received at B, the acknowledgment from B to A gets lost. In this case, the timeout event occurs, and Host A retransmits the same segment. Of course, when Host B receives the retransmission, it observes from the sequence number that the segment contains data that has already been received. Thus, TCP in Host B will discard the bytes in the retransmitted segment.
- Scenario 2
  - Host A sends two segments back to back: the first has sequence number 92 and 8 bytes of data, the second has sequence number 100 and 20 bytes of data. Suppose now that neither of the acknowledgments arrives at Host A before the timeout. When the timeout event occurs, Host A resends the first segment with sequence number 92 and restarts the timer. As long as the ACK for the second segment arrives before the new timeout, the second segment will not be retransmitted.
- Scenario 3
  - Suppose Host A sends the two segments, exactly as in the second example. The acknowledgment of the first segment is lost in the network, but just before the timeout event, Host A receives an acknowledgment with acknowledgment number 120. Host A therefore knows that Host B has received everything up through byte 119, so Host A does not resend either of the two segments.
fast retransmit
In the case that three duplicate ACKs are received, the TCP sender performs a fast retransmit [RFC 5681], retransmitting the missing segment before that segment’s timer expires.
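The duplicate-ACK rule can be sketched as a small counter on the sender side. This is a simplified illustration: the class name `DupAckTracker` is hypothetical, and details such as stale ACKs or ACKs that carry data are ignored.

```python
class DupAckTracker:
    """Count duplicate ACKs and signal when a fast retransmit is due."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_number: int) -> bool:
        """Return True exactly when the threshold-th duplicate ACK arrives."""
        if ack_number == self.last_ack:
            self.dup_count += 1
            # The segment starting at ack_number is presumed lost.
            return self.dup_count == self.threshold
        # A new cumulative ACK: remember it and reset the duplicate counter.
        self.last_ack = ack_number
        self.dup_count = 0
        return False
```

The first ACK for a given number is the original one, so the trigger fires on the fourth ACK carrying the same number, that is, the third duplicate.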
Go-Back-N or Selective Repeat?
A proposed modification to TCP, the so-called selective acknowledgment [RFC 2018], allows a TCP receiver to acknowledge out-of-order segments selectively rather than just cumulatively acknowledging the last correctly received, in-order segment. When combined with selective retransmission—skipping the retransmission of segments that have already been selectively acknowledged by the receiver— TCP looks a lot like our generic SR protocol.
Thus, TCP’s error-recovery mechanism is probably best categorized as a hybrid of GBN and SR protocols.
Congestion control
TCP Congestion Window
• Each TCP sender maintains a congestion window
- Max number of bytes to have in transit (not yet ACK’d)
Phases
ssthresh
- “slow start threshold”
- half the value of cwnd when congestion was last detected
- when cwnd equals ssthresh, TCP switches to congestion avoidance mode
Slow Start Phase
- the value of cwnd begins at 1 MSS and increases by 1 MSS every time a transmitted segment is first acknowledged
- results in a doubling of the sending rate every RTT (multiplicative increase)
- increases rate exponentially until the first loss or when cwnd == ssthresh
Congestion Avoidance Phase
After cwnd reaches ssthresh, it grows linearly (additive increase): after each RTT, cwnd = cwnd + 1 MSS.
Fast Recovery Phase
If three duplicate ACKs are detected, TCP performs a fast retransmit and enters the fast recovery state.
In fast recovery, the value of cwnd is increased by 1 MSS for every duplicate ACK received for the missing segment that caused TCP to enter the fast-recovery state. Eventually, when an ACK arrives for the missing segment, TCP enters the congestion-avoidance state after deflating cwnd.
Congestion Detection
Multiplicative decrease: if congestion occurs, the congestion window is reduced. The only way a sender can infer that congestion has happened is from the need to retransmit a segment, on the assumption that the missing packet was dropped by a router due to congestion. Retransmission can occur in one of two cases: when the RTO (retransmission timeout, 500 msec is typical) timer times out, or when three duplicate ACKs are received.
Case 1: Retransmission due to timeout. Congestion is likely to be severe.
- set ssthresh to half the value of cwnd
- set cwnd to 1 MSS and start again from the slow start phase
- on reaching ssthresh, enter the Congestion Avoidance Phase
Case 2: Retransmission due to three duplicate ACKs. Congestion is likely to be milder.
- TCP performs a fast retransmit and enters the fast recovery state
- ssthresh = cwnd/2
- cwnd = ssthresh + 3•MSS
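The phases and the two loss reactions can be tied together in a small state machine. This is a simplified sketch with cwnd measured in units of MSS. The class name `RenoSender` and the default ssthresh of 64 are assumptions, and the per-duplicate-ACK window inflation during fast recovery is left out.

```python
class RenoSender:
    """Toy model of slow start, congestion avoidance, and fast recovery."""

    def __init__(self, ssthresh: float = 64.0) -> None:
        self.cwnd = 1.0  # congestion window, in MSS
        self.ssthresh = ssthresh
        self.state = "slow_start"

    def on_new_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh  # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += 1  # +1 MSS per ACK, doubling per RTT
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd  # about +1 MSS per RTT

    def on_triple_dup_ack(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.ssthresh + 3
        self.state = "fast_recovery"

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.state = "slow_start"
```

Feeding the model a sequence of ACK and loss events and recording cwnd after each one reproduces the familiar sawtooth pattern.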
flow control
TCP provides flow control by having the sender maintain a variable called the receive window
Because TCP is not permitted to overflow the allocated buffer, we must have
LastByteRcvd – LastByteRead <= RcvBuffer
The receive window, denoted rwnd is set to the amount of spare room in the buffer:
rwnd = RcvBuffer – [LastByteRcvd – LastByteRead]
Host B tells Host A how much spare room it has in the connection buffer
by placing its current value of rwnd in the receive window field of every segment it
sends to A. Initially, Host B sets rwnd = RcvBuffer.
Host A in turn keeps track of two variables, LastByteSent and LastByteAcked.
Their difference, LastByteSent – LastByteAcked, is the amount of unacknowledged data that A has sent into the connection. By keeping the amount of unacknowledged data no greater than the value of rwnd, Host A is assured that it is not overflowing the receive buffer at Host B.
When B's receive window is zero, the TCP specification requires Host A to continue to send segments with one data byte. Otherwise A could stay blocked forever, because B only reports a new rwnd in segments it sends to A. These segments will be acknowledged by the receiver. Eventually the buffer will begin to empty and the acknowledgments will contain a nonzero rwnd value.
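The receive window formula can be checked with a few numbers. A minimal sketch, in which the function name and the 4096-byte buffer used in the usage note are illustrative assumptions:

```python
def receive_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """rwnd = RcvBuffer - (LastByteRcvd - LastByteRead)."""
    buffered = last_byte_rcvd - last_byte_read  # received but not yet read by the app
    if not 0 <= buffered <= rcv_buffer:
        raise ValueError("buffer invariant violated")
    return rcv_buffer - buffered
```

With a 4096-byte buffer, 3000 bytes received and 1000 read, the advertised window is 2096 bytes. Once 4096 unread bytes are buffered it falls to 0, which is the zero-window case described above.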
Receiver Window vs. Congestion Window
• Flow control – Keep a fast sender from overwhelming a slow receiver
• Congestion control – Keep a set of senders from overloading the network
• Different concepts, but similar mechanisms
- TCP flow control: receiver window
- TCP congestion control: congestion window
- Sender TCP window = min { congestion window, receiver window }
Specifically, the amount of unacknowledged data at a sender may not exceed the minimum of cwnd and rwnd, that is:
LastByteSent – LastByteAcked <= min{cwnd, rwnd}
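The combined limit can be written as a helper that tells the sender how many more bytes it may put in flight. A minimal sketch, with a hypothetical function name:

```python
def bytes_allowed(last_byte_sent: int, last_byte_acked: int, cwnd: int, rwnd: int) -> int:
    """How many more bytes may be sent without exceeding min(cwnd, rwnd)."""
    in_flight = last_byte_sent - last_byte_acked  # unacknowledged data
    return max(0, min(cwnd, rwnd) - in_flight)
```

Whichever window is smaller is the one that binds: a slow receiver limits the sender through rwnd, a congested path through cwnd.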
transport-layer multiplexing and demultiplexing
Transport-layer multiplexing and demultiplexing extend the host-to-host delivery service provided by the network layer to a process-to-process delivery service for applications running on the hosts.
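Demultiplexing can be pictured as a lookup table from header fields to sockets. The sketch below relies on the usual textbook description, which these notes do not spell out: a UDP socket is identified by destination IP and port, a TCP connection by the full 4-tuple. The class and method names are hypothetical, and the fallback to a listening TCP socket is omitted.

```python
class DemuxTable:
    """Map incoming segments to the socket (here just a name) that owns them."""

    def __init__(self) -> None:
        self.udp = {}  # (dst_ip, dst_port) -> socket
        self.tcp = {}  # (src_ip, src_port, dst_ip, dst_port) -> socket

    def bind_udp(self, ip: str, port: int, sock: str) -> None:
        self.udp[(ip, port)] = sock

    def add_tcp(self, src_ip: str, src_port: int, dst_ip: str, dst_port: int, sock: str) -> None:
        self.tcp[(src_ip, src_port, dst_ip, dst_port)] = sock

    def deliver(self, proto: str, src_ip: str, src_port: int, dst_ip: str, dst_port: int):
        if proto == "udp":
            # The source address does not matter: all senders share one socket.
            return self.udp.get((dst_ip, dst_port))
        return self.tcp.get((src_ip, src_port, dst_ip, dst_port))
```

Two TCP clients talking to the same server port therefore end up at different sockets, while two UDP senders to the same port share one.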
udp
Aside from the multiplexing/demultiplexing function and some light error checking, UDP adds nothing to IP.
tcp vs udp
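TCP: connection-oriented, reliable, byte-stream based
- uses more resources, lower efficiency
UDP: connectionless, unreliable, datagram based
- uses fewer resources, higher efficiency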
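1. Connection
- TCP is a connection-oriented transport-layer protocol; a connection must be established before data is transferred.
- UDP needs no connection and sends data immediately.
2. Reliability
- TCP delivers data reliably: without errors, without loss, without duplication, and in order.
- UDP makes a best-effort delivery and does not guarantee reliable delivery.
3. Service model
- TCP is a one-to-one, point-to-point service: a connection has exactly two endpoints.
- UDP supports one-to-one, one-to-many, and many-to-many communication.
4. Transfer mode
- TCP is a stream transfer with no message boundaries, but it guarantees ordering and reliability.
- UDP sends one packet at a time and preserves boundaries, but packets may be lost or reordered.
5. Congestion control and flow control
- TCP has congestion control and flow control mechanisms that keep data transfer reliable.
- UDP has neither; even when the network is heavily congested, the UDP sending rate is not affected.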
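6. Fragmentation
- If TCP data is larger than the MSS, it is split into segments at the transport layer, and the destination host reassembles the TCP data at the transport layer as well. If one segment is lost along the way, only that segment has to be retransmitted.
- If UDP data is larger than the MTU, it is fragmented at the IP layer. The destination host reassembles the data at the IP layer and then passes it up to the transport layer.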
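Transport-layer segmentation by MSS can be sketched in a few lines. The function name, the MSS of 1460 bytes, and the starting sequence number in the usage note are illustrative assumptions.

```python
def segment(data: bytes, mss: int, isn: int = 0) -> list:
    """Split application data into (sequence_number, payload) pairs of at most mss bytes."""
    return [(isn + off, data[off:off + mss]) for off in range(0, len(data), mss)]
```

For 3000 bytes with an MSS of 1460 and a first sequence number of 1000, this yields segments starting at 1000, 2460, and 3920. Because each segment carries its own sequence number, losing the middle one only requires resending the bytes that start at 2460.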
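TCP and UDP use cases:
Because TCP is connection-oriented and guarantees reliable delivery, it is commonly used for:
- FTP file transfer;
- HTTP / HTTPS;
Because UDP needs no connection and its processing is simple and efficient, it is commonly used for:
- communication with a small total number of packets, such as DNS and SNMP;
- multimedia communication such as video and audio;
- broadcast communication;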