views:

9444

answers:

12

We have an application that has a WCF service (*.svc) running on IIS7 and various clients querying the service. The server is running Win 2008 Server. The clients are running either Windows 2008 Server or Windows 2003 server. I am getting the following exception, which I have seen can in fact be related to a large number of potential WCF issues.

System.TimeoutException: The request channel timed out while waiting for a reply after 00:00:59.9320000. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: The HTTP request to 'http://www.domain.com/WebServices/myservice.svc/gzip' has exceeded the allotted timeout of 00:01:00. The time allotted to this operation may have been a portion of a longer timeout. 

I have increased the timeout to 30min and the error still occurred. This tells me that something else is at play, because the quantity of data could never take 30min to upload or download.

The error comes and goes. At the moment, it is more frequent. It does not seem to matter if I have 3 clients running simultaneously or 100, it still occurs once in a while. Most of the time, there are no timeouts but I still get a few per hour. The error comes from any of the methods that are invoked. One of these methods does not have parameters and returns a bit of data. Another takes in lots of data as a parameter but executes asynchronously. The errors always originate from the client and never reference any code on the server in the stack trace. It always ends with:

 at System.Net.HttpWebRequest.GetResponse()
  at System.ServiceModel.Channels.HttpChannelFactory.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout)

On the server: I've tried (and currently have) the following binding settings:

maxBufferSize="2147483647" maxReceivedMessageSize="2147483647" maxBufferPoolSize="2147483647"

It does not seem to have an impact.

I've tried (and currently have) the following throttling settings:

<serviceThrottling maxConcurrentCalls="1500"   maxConcurrentInstances="1500"    maxConcurrentSessions="1500"/>

It does not seem to have an impact.

I currently have the following settings for the WCF service.

[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Single)]

I ran with ConcurrencyMode.Multiple for a while, and the error still occurred.

I've tried restarting IIS, restarting my underlying SQL Server, restarting the machine. All of these don't seem to have an impact.

I've tried disabling the Windows firewall. It does not seem to have an impact.

On the client, I have these settings:

maxReceivedMessageSize="2147483647"

<system.net>
    <connectionManagement>
    <add address="*" maxconnection="16"/>
</connectionManagement> 
</system.net>

My client closes its connections:

var client = new MyClient();

try
{
    return client.GetConfigurationOptions();
}
finally
{
    client.Close();
}

I have changed the registry settings to allow more outgoing connections:

MaxConnectionsPerServer=24, MaxConnectionsPer1_0Server=32.

I have now just recently tried SvcTraceViewer.exe. I managed to catch one exception on the client end. I see that its duration is 1 minute. Looking at the server side trace, I can see that the server is not aware of this exception. The maximum duration I can see is 10 seconds.

I have looked at active database connections using exec sp_who on the server. I only have a few (2-3). I have looked at TCP connections from one client using TCPview. It usually is around 2-3 and I have seen up to 5 or 6.

Simply put, I am stumped. I have tried everything I could find, and must be missing something very simple that a WCF expert would be able to see. It is my gut feeling that something is blocking my clients at the low-level (TCP), before the server actually receives the message and/or that something is queuing the messages at the server level and never letting them process.

If you have any performance counters I should look at, please let me know. (please indicate what values are bad, as some of these counters are hard to decypher). Also, how could I log the WCF message size? Finally, are there any tools our there that would allow me to test how many connections I can establish between my client and server (independently from my application)

Thanks for your time!

Extra information added June 20th:

My WCF application does something similar to the following.

while (true)
{
   Step1GetConfigurationSettingsFromServerViaWCF(); // can change between calls
   Step2GetWorkUnitFromServerViaWCF();
   DoWorkLocally(); // takes 5-15minutes. 
   Step3SendBackResultsToServerViaWCF();
}

Using WireShark, I did see that when the error occurs, I have a five TCP retransmissions followed by a TCP reset later on. My guess is the RST is coming from WCF killing the connection. The exception report I get is from Step3 timing out.

I discovered this by looking at the tcp stream "tcp.stream eq 192". I then expanded my filter to "tcp.stream eq 192 and http and http.request.method eq POST" and saw 6 POSTs during this stream. This seemed odd, so I checked with another stream such as tcp.stream eq 100. I had three POSTs, which seems a bit more normal because I am doing three calls. However, I do close my connection after every WCF call, so I would have expected one call per stream (but I don't know much about TCP).

Investigating a bit more, I dumped the http packet load to disk to look at what these six calls where.

1) Step3
2) Step1
3) Step2
4) Step3 - corrupted
5) Step1
6) Step2

My guess is two concurrent clients are using the same connection, that is why I saw duplicates. However, I still have a few more issues that I can't comprehend:

a) Why is the packet corrupted? Random network fluke - maybe? The load is gzipped using this sample code: http://msdn.microsoft.com/en-us/library/ms751458.aspx - Could the code be buggy once in a while when used concurrently? I should test without the gzip library.

b) Why would I see step 1 & step 2 running AFTER the corrupted operation timed out? It seems to me as if these operations should not have occurred. Maybe I am not looking at the right stream because my understanding of TCP is flawed. I have other streams that occur at the same time. I should investigate other streams - a quick glance at streams 190-194 show that the Step3 POST have proper payload data (not corrupted). Pushing me to look at the gzip library again.

A: 

from: http://www.codeproject.com/KB/WCF/WCF_Operation_Timeout_.aspx

To avoid this timeout error, we need to configure the OperationTimeout property for Proxy in the WCF client code. This configuration is something new unlike other configurations such as Send Timeout, Receive Timeout etc., which I discussed early in the article. To set this operation timeout property configuration, we have to cast our proxy to IContextChannel in WCF client application before calling the operation contract methods.

Joel Martinez
I have tried this. Regardless of the timeout I put, it still times out but this makes no sense because the operation is not that long and because all the other clients that do the same queries function during this time.
Jason Kealey
A: 

Please try doing this and let me know your result

var client = new MyClient(); var config = null;

        try
        {
            config  = client.GetConfigurationOptions();
        }

        finally
        {
            client.Close();
        }

return config;

What kind of binding are you using? Is WS-Security in the picturee?

Regards, Diwakar

If that is yes, then I have few more questions for you.
I don't need to test this. See related post. http://stackoverflow.com/questions/345091/will-code-in-a-finally-statement-fire-if-i-return-a-value-in-a-try-block
Jason Kealey
We don't have WS-Security in place. We are using a gzip message encoding taken from the sample listed here: http://msdn.microsoft.com/en-us/library/ms751458.aspx(Microsoft.ServiceModel.Samples.GZipMessageEncodingElement)
Jason Kealey
A: 

Did you try using clientVia to see the message sent, using SOAP toolkit or something like that? This could help to see if the error is coming from the client itself or from somewhere else.

Philippe
Do you know of any tools more recent than the deprecated SOAP toolkit that would make it easier for me to log this information in WCF calls?
Jason Kealey
You could try soapUI http://www.soapui.org/.
Philippe
A: 

I'm not a WCF export but I'm wondering if you aren't running into a DDOS protection on IIS. I know from experience that if you run a bunch of simultaneous connections from a single client to a server at some point the server stops responding to the calls as it suspects a DDOS attack. It will also hold the connections open until they time-out in order to slow the client down in his attacks.

Multiple connection coming from different machines/IP's should not be a problem however.

There's more info in this MSDN post:

http://msdn.microsoft.com/en-us/library/bb463275.aspx

Check out the MaxConcurrentSession sproperty.

I feel that this is what is happening, from everything I have seen, however I have (on the server):<serviceThrottling maxConcurrentCalls="150" maxConcurrentInstances="150" maxConcurrentSessions="150"/><serviceDebug includeExceptionDetailInFaults="true" />Would there be any performance monitor or IIS log that I could monitor to see if this is happening?
Jason Kealey
A: 

Did you check the WCF traces? WCF has a tendency to swallow exceptions and only return the last exception, which is the timeout that you're getting, since the end point didn't return anything meaningful.

Miki Watts
I tried SvcTraceViewer and the only exception it reported was the timeout (on the client). Nothing was reported on the server.
Jason Kealey
Open all the options on the trace, you might not have all the trace options open. Also, check both the event trace and message trace files.
Miki Watts
+1  A: 

If you havn't tried it already - encapsulate your Server-side WCF Operations in try/finally blocks, and add logging to ensure they are actually returning.

If those show that the Operations are completing, then my next step would be to go to a lower level, and look at the actual transport layer.

Wireshark or another similar packet capturing tool can be quite helpful at this point. I'm assuming this is running over HTTP on standard port 80.

Run Wireshark on the client. In the Options when you start the capture, set the capture filter to tcp http and host service.example.com - this will reduce the amount of irrelevant traffic.

If you can, modify your client to notify you the exact start time of the call, and the time when the timeout occurred. Or just monitor it closely.

When you get an error, then you can trawl through the Wireshark logs to find the start of the call. Right click on the first packet that has your client calling out on it (Should be something like GET /service.svc or POST /service.svc) and select Follow TCP Stream.

Wireshark will decode the entire HTTP Conversation, so you can ensure that WCF is actually sending back responses.

Will Hughes
I have logging on the server - there is no error on that end. I am running WireShark right now to see what I can find. Given the high volume of traffic, it is going to be a pain to analyze but I will report back if I can find anything.
Jason Kealey
I ran WireShark over the past six hours and collected some 60k frames. Only one exception was reported by this client today. I did see a TCP connection marked as RST (reset), apparently after sending the error email, which is probably WCF that is terminating the connection. I saved the payload (525k) to disk. I verified that there were 87 other invocations with similar sized payloads.I did see a few TCP retransmissions, but saw some in other calls as well (that did not fail). Starting to wonder about my networking hardware + cables.
Jason Kealey
Even on a local network, the presence of a TCP Retransmits isn't necessarily bad. If it's possible to physically connect two of the end points to a single switch, then that might be worth a shot, but I wouldn't hold out hope that will fix it. If you can - create a very basic client application that just passes some traffic back and forth to your server, and nothing else. This can help eliminate any issue in your application that might be causing timeouts.
Will Hughes
Also, you mention seeing the TCP Reset packet -- had the server delivered any kind of response at that point (or perhaps was it waiting for more data)? Was there an appreciable delay between the RST and the previous packet?
Will Hughes
The server is remote. I am planning on creating a test environment locally to see if that helps. As for the RST, it was sent 34 seconds after the last of five TCP Retransmission. (1 to 8 second intervals between retransmissions). Does that give you any clues?
Jason Kealey
So, if the server is sending a RST packet after 5x consecutive retransmissions with no Acknowledgements - then it seems your client is (for some reason) dropping packets. I'd definitely create a stand-alone client, even if it talks directly to the regular server. There could be any number of issues causing the clients to drop packets (other threads blocking execution, other apps hogging resources, failing hardware, network, firewalls, etc).
Will Hughes
+1  A: 

I'm having a very similar problem. In the past, this has been related to serialization problems. If you are still having this problem, can you verify that you can correctly serialize the objects you are returning. Specifically, if you are using Linq-To-Sql objects that have relationships, there are known serialization problems if you put a back reference on a child object to the parent object and mark that back reference as a DataMember.

You can verify serialization by writing a console app that serializes and deserializes your objects using the DataContractSerializer on the server side and whatever serialization methods your client uses. For example, in our current application, we have both WPF and Compact Framework clients. I wrote a console app to verify that I can serialize using a DataContractSerializer and deserialize using an XmlDesserializer. You might try that.

Also, if you are returning Linq-To-Sql objects that have child collections, you might try to ensure that you have eagerly loaded them on the server side. Sometimes, because of lazy loading, the objects being returned are not populated and may cause the behavior you are seeing where the request is sent to the service method multiple times.

If you have solved this problem, I'd love to hear how because I'm stuck with it too. I have verified that my issue is not serialization so I'm at a loss.

UPDATE: I'm not sure if it will help you any but the Service Trace Viewer Tool just solved my problem after 5 days of very similar experience to yours. By setting up tracing and then looking at the raw XML, I found the exceptions that were causing my serialization problems. It was related to Linq-to-SQL objects that occasionally had more child objects than could be successfully serialized. Adding the following to your web.config file should enable tracing:

<sharedListeners>
    <add name="sharedListener"
         type="System.Diagnostics.XmlWriterTraceListener"
         initializeData="c:\Temp\servicetrace.svclog" />
  </sharedListeners>
  <sources>
    <source name="System.ServiceModel" switchValue="Verbose, ActivityTracing" >
      <listeners>
        <add name="sharedListener" />
      </listeners>
    </source>
    <source name="System.ServiceModel.MessageLogging" switchValue="Verbose">
      <listeners>
        <add name="sharedListener" />
      </listeners>
    </source>
  </sources>

The resulting file can be opened with the Service Trace Viewer Tool or just in IE to examine the results.

Brett Bim
A: 

Are you using reliable sessions?

Steve Campbell
I don't think I am.
Jason Kealey
ok. Reason I asked is that when using reliable sessions, many problems that are not timeouts exhibit as timeouts.
Steve Campbell
A: 

We encountered the same problem and this only happens once we host the site and the service under IIS. I was able to debug and see that the request is not reaching the service. I tried all the things mentioned in the article but it did not help. I then changed the binding to basicHttpBinding from wsHttpBinding and have not faced the problem again. I could not figure out what is causing this issue when we use wsHttpBinding. Any pointers please?

Mithun
+4  A: 

If you are using .Net client then you may not have set

//This says how many outgoing connection you can make to a single endpoint. Default Value is 2
System.Net.ServicePointManager.DefaultConnectionLimit = 200;

here is the original question and answer http://stackoverflow.com/questions/2454362/wcf-service-throttling

Mubashar Ahmad
This looks promising. I've included this to be tested during my next scalability test. It looks exactly like the kind of random setting that would make it crash :) Thanks for the pointer.
Jason Kealey
This solved my problem !! I wanna vote it up 100 times !!!
Mustafa A. Jabbar
@Jason: If you are server programmer you know how much important to maintain the scalability of server in your hands and also one who is currently suffering the concurrency problem even after using above. Please if you can look into the following question http://stackoverflow.com/questions/2637175/wcf-network-cost in short i am suffering with 31ms latency between client and server and need to reduce it.
Mubashar Ahmad
A: 

Jason,
I'm working on this right now. I feel that n3wjack is on the right track. my problem arose when the Server guys deployed our application to a different IIS server. Is there some timeout setting in IIS that can be configured? Did you ever get to the bottom of this? thanks Ray.

Raymo L
I've been running at low volume (about 20% of what I was running when I got these errors) for over a year now and don't encounter this issue anymore. I would need to run more formal tests to see if I can make it occur again. However, I was getting this issue when I additional machines stress testing my server - the volume from a single IP has not changed. Under high stress, maybe IIS becomes more sensitive to each individual connection.
Jason Kealey