Making things go fast on a network, part 3

06/13/2006

Now I want to pop the stack up a bit and talk about messages. At their heart, connection oriented protocols are about establishing pipes between senders and receivers - they don't NEED to have semantics beyond that.

But applications are different. They typically fall into some stereotypical use patterns, the most common of which is the client/server usage pattern. A client sends data to the server, the server responds. The role of sender and receiver alternate between the client and the server.

In the client/server pattern, the client sends a "message" to the server, the server responds with its own "message". A "message" may be formed from multiple packets (send()s) but not always.

Every "message" is self-describing: there needs to be a mechanism that allows the server to know it has received all the data that the client sent. That mechanism may be a length prepended to each message, or it might be a magic "terminating sequence" (CR/LF is a very common magic terminator). Often the semantics of a message are defined by the API being used to send the data. For example, the NetBIOS networking API includes the length of the data being sent, and the receiver of the message is guaranteed to receive a block of the same length that was sent, regardless of fragmentation (RFC1001/RFC1002 define how NetBIOS API semantics are implemented over TCP/IP if anyone cares). In other cases, the semantics of the message are defined by the protocol being implemented. For example, POP3, IMAP4, NNTP and SMTP define their messages as CR/LF delimited strings, while LDAP uses ASN.1's semantics for defining messages.
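To make the two framing mechanisms concrete, here's a minimal Python sketch of a length-prefixed reader and a CR/LF-delimited reader. The function names and the 4-byte big-endian length header are assumptions made for illustration; they aren't taken from NetBIOS, SMTP, or any other real protocol. Each reader takes whatever bytes have arrived so far and returns a complete message only once all of it is there.

```python
import struct

def frame_with_length(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte big-endian length (assumed header format)."""
    return struct.pack("!I", len(payload)) + payload

def read_length_prefixed(buffer: bytes):
    """Return (message, remaining) or (None, buffer) if the message is incomplete."""
    if len(buffer) < 4:
        return None, buffer  # the length header itself hasn't fully arrived
    (length,) = struct.unpack("!I", buffer[:4])
    if len(buffer) < 4 + length:
        return None, buffer  # header is here, but part of the body is still in flight
    return buffer[4:4 + length], buffer[4 + length:]

def read_crlf_delimited(buffer: bytes):
    """Return (line, remaining) or (None, buffer) if no CR/LF terminator has arrived."""
    line, separator, rest = buffer.partition(b"\r\n")
    if not separator:
        return None, buffer
    return line, rest
```

Either way, the receiver keeps appending incoming data to its buffer and calling the reader until it gets a message back, which is how one "message" can span several packets.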

But however a message is defined, there is still a request/response semantic associated with client/server protocols.

From the application's point of view, here's what happens on the wire during a client/server interaction:

Client	Server
Send message request "A"
	Send message with response "A"

But as we discussed earlier, that's not REALLY what happens. Each of those messages is a packet, and that means there has to be an acknowledgment between the two.

Client	Server
Send message request "A"
	Acknowledge receipt of request "A"
	Send message with response "A"
Acknowledge receipt of response "A"

Now it gets REALLY interesting when you string lots of client/server request/response sequences together:

Client	Server
Send message request "A"
	Acknowledge receipt of request "A"
	Send message with response "A"
Acknowledge receipt of response "A"
Send message request "B"
	Acknowledge receipt of request "B"
	Send message with response "B"
Acknowledge receipt of response "B"
Send message request "C"
	Acknowledge receipt of request "C"
	Send message with response "C"
Acknowledge receipt of response "C"
Etc.

Remember that on local area networks, the time to send a given packet is the same, regardless of the payload size. It would be unbelievably cool if there was a way of combining the acknowledgement with the response to an operation - that would double your bandwidth since you'd be sending half the number of packets.

And it turns out that several people independently came to the same solution. In the NetBEUI protocol the feature's called "piggy-back acks"; in TCP, it's called the "Nagle Algorithm" (after the person who invented it). When you turn on piggy-back acks (or nagling), the sequence above becomes:

Client	Server
Send message request "A"
	Acknowledge receipt of request "A" and send message with response "A"
Acknowledge receipt of response "A" and send message request "B"
	Acknowledge receipt of request "B" and send message with response "B"
Acknowledge receipt of response "B" and send message request "C"
	Acknowledge receipt of request "C" and send message with response "C"
Acknowledge receipt of response "C"
Etc.

It halves the number of frames. But there's a tricky bit here - it only works if the application's going to send a response to the client - if not, it needs to send the acknowledgement. And that's where things get "interesting". In order to give the server time to send the response to the client, the transport holds off on sending the ack for a short amount of time (somewhere around 200 ms typically). If the server responds to the client within 200 milliseconds, everything's copasetic. If the server doesn't respond, the receiver sends the acknowledgement and nothing happens.
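If you want to check the frame counts, here's a small Python sketch that generates both traces. It's a toy model, not a TCP implementation: `wire_trace` and its frame labels are made up for illustration, and it assumes the server always responds in time and the client always has its next request ready, so every ack except the last one gets a ride.

```python
def wire_trace(requests, piggyback):
    """List the (sender, frame) pairs for a run of request/response exchanges."""
    frames = []
    unacked_response = None  # response the client has received but not yet acked
    for name in requests:
        if not piggyback:
            frames.append(("client", f"request {name}"))
            frames.append(("server", f"ack request {name}"))
            frames.append(("server", f"response {name}"))
            frames.append(("client", f"ack response {name}"))
            continue
        if unacked_response is None:
            frames.append(("client", f"request {name}"))
        else:
            # The ack for the previous response rides along with the next request.
            frames.append(("client", f"ack response {unacked_response} + request {name}"))
        # The ack for the request rides along with the response.
        frames.append(("server", f"ack request {name} + response {name}"))
        unacked_response = name
    if piggyback and unacked_response is not None:
        # Nothing left to piggyback on, so the last ack goes out alone.
        frames.append(("client", f"ack response {unacked_response}"))
    return frames

plain = wire_trace(["A", "B", "C"], piggyback=False)
combined = wire_trace(["A", "B", "C"], piggyback=True)
print(len(plain), len(combined))
```

For the three exchanges shown above, the model produces 12 frames without piggybacking and 7 with it, so "halves" is approximate: the final ack still travels by itself.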

But what happens if you DON'T use the request/response pattern? What happens if your protocol involves multiple messages from the client to the server?

This isn't as silly an idea as you might think - for example, in CIFS, the client and server negotiate a common message size - the client is prohibited from sending a block larger than the server's block size and the server is prohibited from sending a block larger than the client's buffer size. If a CIFS client needs to send tons of data to the server, it would make sense to break the requests up into server block size chunks and shotgun them to the server - the client issues async sends for all the requests to the server and waits on the transport to deliver them as best as it can.

Client	Server
Send message "A.1"
Send message "A.2"
Send message "A.3"
	Respond to A.1..A.3
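The chunking step is simple enough to sketch in Python. The `split_into_blocks` helper and the 4096-byte block size below are hypothetical; a real CIFS client would use whatever size it negotiated with the server and wrap each chunk in a proper request.

```python
from typing import List

def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Break data into chunks no larger than block_size, preserving order."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Hypothetical example: 10,000 bytes against an assumed 4096-byte server block size.
blocks = split_into_blocks(b"x" * 10000, 4096)
print([len(block) for block in blocks])
```

Each element of `blocks` would become one of the A.1, A.2, A.3 messages in the trace above.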

On the surface, it seems like a great idea, except for the fact that (as I mentioned in my first article) the sender can't start sending message A.2 before it's gotten the acknowledgment for A.1.

Client	Server
Send message request "A.1"
	Acknowledge receipt of request "A.1"
Send message request "A.2"
	Acknowledge receipt of request "A.2"
Send message request "A.3"
	Acknowledge receipt of request "A.3"
	Send response to A.1..A.3
Acknowledge receipt of response to A.1..A.3
Etc.

But if nagling is involved, things get REALLY ugly:

Client	Server
Send message request "A.1"
	Wait 200 ms waiting for the server to respond
	Acknowledge receipt of request "A.1"
Send message request "A.2"
	Wait 200 ms waiting for the server to respond
	Acknowledge receipt of request "A.2"
Send message request "A.3"
	Acknowledge receipt of request "A.3" and send response to A.1..A.3
Acknowledge receipt of response to A.1..A.3
Etc.

All of a sudden, the performance "optimization" of nagling has totally killed your performance.
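To put a rough number on it, here's a back-of-the-envelope Python model of the trace above. It follows this article's simplified picture (only one unacknowledged message in flight, a 200 ms ack delay, and a server that responds immediately after the last message so that ack piggybacks on the response). The function name and the 1 ms round trip are assumptions for illustration, not measurements.

```python
ACK_DELAY_MS = 200  # the "somewhere around 200 ms" delay described above

def time_until_response_ms(messages: int, round_trip_ms: int, delayed_ack: bool) -> int:
    """Estimate the time from the first send until the response arrives."""
    total = 0
    for index in range(messages):
        is_last = index == messages - 1
        total += round_trip_ms  # message goes out, ack (or response) comes back
        if delayed_ack and not is_last:
            # The receiver sits on the ack, hoping for a response to piggyback on.
            total += ACK_DELAY_MS
    return total

print(time_until_response_ms(3, round_trip_ms=1, delayed_ack=False))
print(time_until_response_ms(3, round_trip_ms=1, delayed_ack=True))
```

With three messages and a 1 ms round trip, the model gives 3 ms without the delay and 403 ms with it: two full 200 ms stalls for an exchange that otherwise takes almost no time.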

This leads to Larry's "Making things go fast on the network" rule number 2:

You can't design your application protocol in a vacuum. You need to understand how the layers below your application work before you deploy it.

In this case, the designers of TCP/IP realized that the Nagle algorithm could cause more harm than good in certain circumstances, so they built in an opt-out. The opt-out has two forms: First, if you're sending "large" packets (typically more than 16K) nagling is disabled. Second, you can disable nagling on an individual socket basis by calling setsockopt(socket, IPPROTO_TCP, TCP_NODELAY, ...). But your better choice is to understand how the underlying network works and design your application protocol to match. Yes, that's a leaky abstraction, but this article is about making things faster - when you're trying to make things fast, you have to understand all the layers.
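For reference, here's roughly what the per-socket opt-out looks like from Python, whose `socket` module wraps the same `setsockopt` call. In the standard sockets API the option is named `TCP_NODELAY` and it lives at the `IPPROTO_TCP` level. The helper name `make_no_delay_socket` is made up for this sketch.

```python
import socket

def make_no_delay_socket() -> socket.socket:
    """Create a TCP socket with the no-delay option turned on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value enables TCP_NODELAY, which turns nagling off for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

sock = make_no_delay_socket()
# Reading the option back should report a non-zero value.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

The option only affects the socket it's set on, so it has to be applied to each connection that needs it.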

The CIFS protocol has a couple of protocol elements that use the "send multiple messages" pattern mentioned above. When we were doing NT 3.1, it became clear that nagling/piggyback acks were killing our performance. I wrote about that two years ago here (I thought I'd written this stuff out before).

Comments

Anonymous
June 13, 2006
> I wrote about that two years ago here

If it takes two years to ack yourself, your Nagling needs adjusting ^_^

Anonymous
June 14, 2006
I think you have your Nagle description slightly off. Nagle's algorithm says (loosely) that a sending station should buffer data until it receives an ACK. The idea is that if the app makes lots of small send() calls, TCP could wind up sending tons of small segs, which is bad for network utilization (40 bytes minimum per seg) - if it's telnet, for example, you would have a 41-byte seg for each character typed. (Yes, you may have that anyway if the user is a slow typist, but you get the idea.)

Nagle should disable itself if TCP has a full segment's worth of data to send; it is only invoked if there is pending send data but not enough to fill up a seg.

TCP implementations typically send an ACK for every two received segments, IIRC. ACK piggybacking is built into TCP from the ground up; if TCP happens to have data to send, it'll send it with the ACK; if not, it'll send an empty ACK (after a bit of a delay - the 2 segment thing).

Then again, it's been a while since I've done much of this so my memory may be failing...

Anonymous
June 14, 2006
Dispensa is correct.

The Nagle algorithm says that there can be at most one unacknowledged packet that is smaller than the maximum size, around 1500 bytes for Ethernet as you pointed out.

That means that sending A.1, A.2, A.3 will work because each of those messages is larger than the 1500 byte Ethernet frame size. I'm assuming that all the messages are being sent through the same TCP connection.

Nagle's algorithm fails badly when it coincides with delayed ACKs, which assume that any incoming data will soon be followed by outgoing data. Thus the TCP implementation waits up to 200 ms to send the ACK, if there's no outgoing data. If the first packet was small, Nagle on the sender side will wait the 200 ms for the ACK because the only allowable small packet is already on the network.

Anonymous
June 14, 2006
Ryan, I agree - nagling and delayed acks won't hit on > 1500 byte packets. But they will with smaller ones than that.

That's why the rule is as general as it is - I thought about making a "always make your protocols request/response" but that might not apply for every lower level protocol.

But "understand the lower layers, they're going to hurt you" is more general.

Anonymous
June 19, 2006
Your article is pretty nice. It's a pity that I didn't see it sooner.

Anonymous
June 26, 2006
Oh, Larry, Larry, Larry...
Articles 1 and 2 were great - really necessary reading to a lot of would-be...
