Hi, we have a BizTalk server (a single virtual one!) at our company, and a SQL Server where the data is kept. We have a lot of data traffic; I'm talking about hundreds of thousands. I'm not even sure a single server is safe for that, but our company is not easy to convince.

Now recently we have a lot of problems.

Let me describe the situation in detail, so I don't miss anything:

Our server has 5 applications:

  • One with 3 orchestrations, 12 send ports, 16 receive locations.
  • One with 4 orchestrations, 32 send ports, 20 receive locations.
  • One with 4 orchestrations, 24 send ports, 20 receive locations.
  • One with 47 (yes 47) orchestrations, 37 send ports, 6 receive locations.
  • One common application with a couple of resources.

Our problems have occurred since we deployed the application with the 47 orchestrations. A lot of these orchestrations use assign shapes that call C# code to do the mapping. We did this because we use HL7 extensions, which are somewhat special: many of these schemas look alike, so mapping with C# code and XPath was a lot easier. The C# code reads XmlNodes obtained through XPath and returns XmlNodes, which are then assigned to BizTalk messages again. I'm not sure whether this could be the cause, but I thought I'd mention it.

The send and receive ports are of many different types: File, MQSeries, SQL, MLLP, FTP. Each of these types has its own host instance, to balance out the load. Our orchestrations use the BiztalkApplication host.

A couple of scripts also run on this server, mostly FTP upload scripts plus a zipper script. The zipper script runs every half hour, adds files to a daily zip, and deletes the zip files after a month. We use it on our backup files (we back up a lot, and the backups are also on this server). We introduced it because the server had problems sending files to a location that already held a very large number of files; once the files were reduced to zips, things went better.

Recently we have had two major problems:

  • Our most important problem is the following. We kept a receive location with a lot of messages on a queue for testing. After we start this receive location, which feeds the 47 orchestrations, the number of running service instances skyrockets, which is normal. Say it reaches about 10,000; we then stop the receive location to see how BizTalk handles those 10,000 instances. Normally the count should drop quickly, and sometimes it does, but after a while BizTalk starts to "throttle": processing all but stops and the instance count barely moves. For example, it drops from 10,000 to 4,000 in 30 seconds, then stays around 4,000 and falls very slowly, by something like 30 in 5 minutes. As a result, the service instances of all the other applications are stuck as well and are not processed either.

We noticed that after restarting our host instances, the instance count went down fast again. So we selectively restarted different host instances to locate the problem, and found that restarting the file send/receive host instance did the trick. We therefore suspected file sends, considering that we make a lot of backups, and replaced the File-type backups with MQSeries backups. The same problem occurred, and, oddly, restarting the file send/receive host still fixes it.

No errors can be found in the event viewer either.

  • The second problem is that sometimes, at around 6 am, all or some of the host instances are stopped.

In the event viewer we noticed the following errors (there are several):

The receive location "MdnBericht SQL" with URL "SQL://ZNACDBPEG/mdnd0001/" is shutting down. Details:"The error threshold has been exceeded. The receive location is shutting down.".

The Messaging Engine failed to add a receive location "M2m Othello Export Start Bestand" with URL "\m2mservices\Othello_import$\DataFilter Start*.xml" to the adapter "FILE". Reason: "The FILE adapter cannot access the folder \m2mservices\Othello_import$\DataFilter Start. Verify this folder exists. Error: Logon failure: unknown user name or bad password. ".

The FILE adapter cannot access the folder \m2mservices\Othello_import$\DataFilter Start. Verify this folder exists. Error: Logon failure: unknown user name or bad password.

An attempt to connect to "BizTalkMsgBoxDb" SQL Server database on server "ZNACDBBTS" failed. Error: "Login failed for user ''. The user is not associated with a trusted SQL Server connection."

It would seem that there is a login failure at this time, that other services run into problems because of it, and that they are eventually shut down.

The thing is, our user is an admin, and it's impossible that its password is wrong only "sometimes". We have considered that this could be an infrastructure problem, but that's not really our department.

I know it's a long post, but we're no longer sure what to do. Would adding another server and balancing the load solve our problems? Is there a way to measure our load balance and know where to start splitting? What are normal load numbers?

I appreciate any answers because these issues are getting worse and we're also on a deadline.

Thanks a lot for replies!

+2  A: 

Your immediate problem is BizTalk's throttling feature. It is supposed to help BizTalk survive temporary overload conditions. One of its many problems is that you can see throttling kick in only in Performance Monitor, not in the event log.

What you should do:

  1. Move the new application to a different host from the rest of the applications. Throttling is done at the host level, so the problematic application won't affect the others.
  2. Read about how to disable throttling in the link above.
  3. What we have done is implement an external throttling service that feeds the BizTalk receive location in small, digestible batches. It's ugly, but the problem is ugly.
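
As a rough illustration of point 3 (not the actual service we run), here is a minimal Python sketch of a feeder that releases files in small batches. It assumes a file-based receive location and a separate staging folder; the name `feed_batch` and the parameters `batch_size` and `max_pending` are made up for this example. For a queue-based receive location the same idea would apply to queue depth instead of file count.

```python
import shutil
from pathlib import Path


def feed_batch(staging: Path, inbox: Path, batch_size: int = 50, max_pending: int = 100) -> int:
    """Move up to batch_size files from staging into inbox; return how many were moved.

    staging: hypothetical folder where upstream systems drop messages.
    inbox:   hypothetical folder watched by the receive location.
    """
    # Files still sitting in the inbox have not been picked up yet.
    pending = sum(1 for p in inbox.iterdir() if p.is_file())
    # Never exceed the batch size, and never push the inbox above max_pending.
    room = max(0, min(batch_size, max_pending - pending))
    # Oldest files first, so messages are released roughly in arrival order.
    candidates = sorted(
        (p for p in staging.iterdir() if p.is_file()),
        key=lambda p: (p.stat().st_mtime, p.name),
    )
    moved = 0
    for src in candidates[:room]:
        shutil.move(str(src), str(inbox / src.name))
        moved += 1
    return moved
```

Run on a timer (for example every few seconds), this only lets new work in as fast as the receive location drains the folder, so the number of in-flight instances stays bounded.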

Update in response to the comment: you have enough host instances, so ignore that advice. You can redistribute the applications among the hosts, but there are no clear guidelines for doing so, so it's just shuffling and guessing.
As for whether disabling throttling is safe: this feature doesn't make much sense in many scenarios. You have to study it. Check which of the throttling parameters you are hitting (this can be seen in Performance Monitor), then decide how to change the thresholds.

Igal Serban
Is disabling the throttling not unsafe? I notice that when it's throttling our CPU is at like 10-20%, which is of course ridiculous; when we restart and it's working fine it's at 100%, so that's normal. I can see that there are like 6 different ways of throttling. Should I just disable all of them? And is this safe? It's there for a reason, right? And about splitting the host instances: should I just give every application its own host instance? We have like 20 host instances now, so if I use one host instance per application we'd only have like 4 host instances instead of 20.
WtFudgE
+1  A: 

How many host instances do you have?

From the line:

The send and receive ports have a lot of different types: File, MQSeries, SQL, MLLP, FTP. Each of these types have a different host instances, to balance out the load. Our orchestrations use the BiztalkApplication host

It sounds like you have a lot. I recently audited a system where BizTalk was throttling itself, and the issue was due in part to too many host instances. Each host instance places its own load on the BizTalk MessageBox and also uses a minimum of about 200 MB of memory.

Reading your comment, you have 20 - this is too many and would be a big part of your problems.

A good starting host setup would be:

  • A dedicated tracking host
  • One host that contains all receive handlers for adapters
  • One host that contains all orchestrations
  • One host that contains all send handlers for adapters
  • One host for adapters that need to be clustered (like FTP and MSMQ)

You can then also consider things like introducing "real time" hosts and batched hosts, so you can tune the real time hosts for low latency.

You can also have hosts for specific applications if they are known to be unstable, but in general this should not be done.

David Hall
We have about 20 host instances. Should we then have 1 host instance for each application? I remember we once had a problem where we created an extra host instance to solve it. I'm not sure anymore what it was, so maybe separating the host instances per application could still fix it. Am I correct in this?
WtFudgE
20 host instances is far too many in my experience. I've added more detail to my answer, outlining a sound host setup.
David Hall
A: 

Hi WtFudgE, I run a BizTalk system that has similar problems and can empathize with what you are seeing. I don't know if it's the same cause, but I thought I'd share my experience just in case.

In my case, too, restarting the send/receive host seems to fix the problem. I found a direct correlation with memory usage by the host processes, and used performance counters to see when a given host was throttled for memory. By creating extra hosts and moving orchestrations and ports between them, I was able to narrow down which business sets were causing the problem. Basically, restarting the hosts was the equivalent of the ultimate "garbage collection" to free up memory, at least until enough instances came through to gobble it up again.

I'm afraid I have not solved the issue yet, but a few things I found to alleviate the issue:

  1. Raise the memory available to a given process so that throttling does not occur, or occurs later
  2. Each host instance, while informative, adds overhead. Try combining the hosts that are not your problem children to reduce the memory footprint.
  3. Throw hardware at the problem; RAM is cheap
  4. I measure the following every few minutes in perfmon so I can diagnose where the problem is:

    BizTalk:MessageAgent(*)\Process memory usage (MB)

    BizTalk:MessageAgent(*)\Process memory usage threshold

    Memory\Available MBytes

A few other things to take a look at. Make sure any custom pipelines use good BizTalk memory practices (i.e. no XML DOM manipulation hiding somewhere, etc). Also theoretically reducing the number of threads for a given host should lower the amount of memory it can seize at one time. I did not seem to have much luck with this one. Maybe the BizTalk throttling overrode it as others have mentioned, I don't know. Also, on a final note, if you dump the perfmon results to a csv, with Excel you can make some pretty memory usage graphs. These might be useful for talking to management about buying more hardware. That's assuming your issue fits this scenario as well.
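
If you would rather script the CSV analysis than do it in Excel, something like the following Python sketch could work. It assumes the perfmon log was exported as CSV with one column per counter, that the column headers end in the counter paths listed above, and that usage and threshold are reported in the same unit; the function name `peak_memory_ratio` is made up for this example.

```python
import csv
import io
import re

# Matches perfmon column headers such as
#   \\SERVER\BizTalk:MessageAgent(FileHost)\Process memory usage (MB)
# and captures the host instance name plus the counter name.
HEADER = re.compile(r"\((?P<host>[^()]+)\)\\(?P<counter>Process memory usage.*)$")


def peak_memory_ratio(csv_text: str) -> dict:
    """Return, per host instance, the highest usage/threshold ratio in the log."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    usage_col, limit_col = {}, {}
    for index, name in enumerate(header):
        match = HEADER.search(name)
        if not match:
            continue  # timestamp column or an unrelated counter
        target = limit_col if "threshold" in match["counter"] else usage_col
        target[match["host"]] = index
    peaks = {}
    for host, u in usage_col.items():
        t = limit_col.get(host)
        if t is None:
            continue  # no threshold column logged for this host
        ratios = []
        for row in samples:
            try:
                used, limit = float(row[u]), float(row[t])
            except (ValueError, IndexError):
                continue  # skip blank or malformed samples
            if limit > 0:
                ratios.append(used / limit)
        if ratios:
            peaks[host] = max(ratios)
    return peaks
```

A host whose ratio gets close to 1.0 was at or near its process memory threshold at some point in the log, which makes it a candidate for the memory throttling described above.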

Andrew Dunaway
A: 

We fixed the problem temporarily, thanks to a combination of all your answers.

We set the process memory usage throttling parameters of some hosts higher.

We rebalanced the host instances after I analyzed the memory usage of all hosts, using performance counters and a tool called MsgBoxViewer.

And now we're trying to get more physical memory and, hopefully, an extra server or a 64-bit server.

Thanks for all replies!

WtFudgE
A: 

We recently installed a 64-bit server in a cluster with our older server. Thanks to this we can balance the memory even better, which solved a lot of problems.

The 64-bit server itself didn't give us much improvement, though (apart from a bit more memory), since it can't run in 64-bit mode for IBM MQ, MLLP, HL7 pipelines, etc.

WtFudgE