[paper]
Problem: TCP incast collapse in data centers occurs when (1) the workload is highly parallel & barrier-synchronized, (2) cheap switches have small buffers on a high-bandwidth network, and (3) servers return small amounts of data per request. Bursty traffic fills the switch buffers, causing packet loss & TCP timeouts. Timeouts last hundreds of milliseconds -- enormous in a DC, where round-trip times are on the order of tens to hundreds of microseconds thanks to #2 and #3. Protocols that require synchronization block on these timeouts before issuing any new requests, so the timeouts + the subsequent idle time severely reduce application throughput.
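A back-of-envelope sketch (mine, not the paper's) of why a single timeout is so costly. The block size and link speed are assumed ballpark numbers; the point is that a 200 ms stall against a ~2 ms transfer cuts goodput by roughly two orders of magnitude:

    # Back-of-envelope: goodput of one barrier-synchronized read round
    # when a single server's response is lost and must wait out the
    # minimum RTO. Ballpark numbers, illustrative only.
    RTT = 100e-6           # ~100 us datacenter round-trip time
    RTO_MIN = 0.200        # Linux's default minimum RTO: 200 ms
    BLOCK = 256 * 1024     # bytes fetched per synchronized round (assumed)
    LINK = 1e9 / 8         # 1 Gbps link, in bytes/sec

    transfer = BLOCK / LINK                          # ~2 ms to move the data
    no_loss = BLOCK / (transfer + RTT)               # goodput, no timeout
    one_rto = BLOCK / (transfer + RTT + RTO_MIN)     # barrier stalls on one RTO

    print(f"no loss: {no_loss / 1e6:.1f} MB/s")      # ~119 MB/s
    print(f"one RTO: {one_rto / 1e6:.1f} MB/s")      # ~1.3 MB/s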
Previous solutions: Get bigger buffers ($$). Ethernet flow control ("dangerous across inter-switch trunks because of head-of-line blocking"). Reducing the minimum retransmission timeout had been suggested but not thoroughly explored -- so that is what this paper tries.
Their solution: Reduce the granularity of timeouts by modifying the TCP implementation: decrease or eliminate the minimum RTO, and desynchronize retransmissions (via randomized timeouts) so the fix holds up as the number of senders scales -- sketched below. The funny thing is that they need retransmission granularity finer than the standard Linux kernel timer, which ticks in milliseconds, so they implement the retransmission timeouts with high-resolution timers.
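A minimal sketch of the idea, reconstructed from the summary above rather than from the authors' code; the standard-style RTO formula, the 200 us floor, and the 0.5 randomization spread are my assumptions:

    import random

    def fine_grained_rto(srtt, rttvar, granularity=1e-6, rto_min=200e-6):
        """Standard-style RTO (srtt + 4 * rttvar), but floored near the
        measured RTT instead of at 200 ms. Units are seconds."""
        return max(rto_min, srtt + max(granularity, 4 * rttvar))

    def desynchronized_timeout(rto):
        """Add a random component so flows that time out together do not
        all retransmit at the same instant. The 0.5 spread is my guess,
        not the paper's constant."""
        return rto * (1 + random.uniform(0.0, 0.5))

    rto = fine_grained_rto(srtt=100e-6, rttvar=10e-6)   # 200 us: the floor binds
    print(f"fire retransmit after {desynchronized_timeout(rto) * 1e6:.0f} us")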
Evaluation: Making the RTO very small (as small as the average RTT) works well on both their simulated & real cluster data, although it becomes less effective as the number of concurrent senders scales up. They find the cause is many flows timing out & retransmitting simultaneously; they fix this by desynchronizing the retransmissions.
-- Question: won't memory become cheap enough soon that this stops being a problem? Or will communication speeds keep increasing faster than memory gets cheaper?
-- Reply: More buffering has implications for network latency too, so you want as little buffering in the network as you can get away with. Also, Ethernet top-of-rack switches need to be very inexpensive -- hence the minimal amounts of memory.
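The latency cost is easy to quantify: a full output buffer adds up to buffer_size / link_rate of queueing delay. A quick illustration with made-up but plausible per-port buffer sizes:

    # Worst-case queueing delay a full output buffer can add:
    # delay = buffer_bytes / link_rate. Illustrative sizes only.
    def queueing_delay_ms(buffer_bytes, gbps):
        return buffer_bytes / (gbps * 1e9 / 8) * 1e3

    for kb in (32, 128, 1024, 16 * 1024):
        print(f"{kb:6d} KB at 1 Gbps -> up to {queueing_delay_ms(kb * 1024, 1):6.2f} ms queueing")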