views:

98

answers:

1

We are facing random spikes in high throughput transaction processing system using sockets for IPC.

Below is the setup used for the run:

  1. The client opens and closes new connection for every transaction, and there are 4 exchanges between the server and the client.
  2. We have disabled the TIME_WAIT, by setting the socket linger (SO_LINGER) option via getsockopt as we thought that the spikes were caused due to the sockets waiting in TIME_WAIT.
  3. There is no processing done for the transaction. Only messages are passed.
  4. OS used Centos 5.4

The average round trip time is around 3 milli seconds, but some times the round trip time ranges from 100 milli seconds to couple of seconds.

Steps used for Execution and Measurement and output

  1. Starting the server

    $ python sockServerLinger.py > /dev/null &

  2. Starting the client to post 1 million transactions to the server. And logs the time for a transaction in the client.log file.

    $ python sockClient.py 1000000 > client.log

  3. Once the execution finishes the following command will show the execution time greater than 100 milliseconds in the format <line_number>:<execution_time>.

    $ grep -n "0.[1-9]" client.log | less

Below is the example code for Server and Client.

Server

# File: sockServerLinger.py
import socket, traceback,time
import struct
host = ''
port = 9999

l_onoff = 1
l_linger = 0
lingeropt = struct.pack('ii', l_onoff, l_linger)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, lingeropt)
s.bind((host, port))
s.listen(1)

while 1:
    try:
        clientsock, clientaddr = s.accept()
        print "Got connection from", clientsock.getpeername()
        data = clientsock.recv(1024*1024*10)
        #print "asdasd",data
        numsent=clientsock.send(data)
        data1 = clientsock.recv(1024*1024*10)
        numsent=clientsock.send(data)
        ret = 1
        while(ret>0):
            data1 = clientsock.recv(1024*1024*10)
            ret = len(data)
        clientsock.close()
    except KeyboardInterrupt:
        raise
    except:
        print traceback.print_exc()
        continue

Client

# File: sockClient.py

import socket, traceback,sys
import time
i = 0
while 1:
    try:
        st = time.time()
        s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
        while (s.connect_ex(('127.0.0.1',9999)) != 0):
            continue
        numsent=s.send("asd"*1000)
        response = s.recv(6000)
        numsent=s.send("asd"*1000)
        response = s.recv(6000)
        i+=1
        if i == int(sys.argv[1]):
            break
    except KeyboardInterrupt:
        raise
    except:
        print "in exec:::::::::::::",traceback.print_exc()
        continue
    print time.time() -st
+1  A: 

Here is one possibility that comes to mind:

1) Because you are using SOCK_STREAM, you are using the TCP protocol 2) As a reliable protocol, TCP will resend packets that have timed out, to ensure that everything eventually arrives. 3) TCP uses a dynamic timeout value that is calculated based on what the current round-trip time (RTT) is estimated to be 4) When a TCP connection first starts up, it doesn't know what the RTT is, so it uses a very large initial timeout value, sometimes on the order of seconds.

So... if one of the early TCP handshake packets is dropped, your socket could be waiting around a long time before it decides that the packet didn't get there and it resends it. This would happen randomly, relatively rarely, but certainly many times in a million connections.

Try using socket.settimeout() with a relatively short value, and retry it immediately if the connection times out. This way you fake having a shorter initial RTT estimate.

smehmood
Thanks for the possibility. Tried setting the timeout to a value of 0.5 seconds and when the exception is raised, the operation (connect or send or recv) was retried in a while loop. This also behaved the same way (5 seconds was the maximum time). What we noticed was that, if the communication dealy is about 5 seconds then it raised exception 10 times and the 11th time the operation succeeded. Please let me know if you have any other suggestions.
Harun Prasad