Well after many attempts and many hours of testing we have finally been able to demonstrate that the solution at this client can support 15000 busy hour call attempts. Here is the evidence courtesy of Empirix Hammer on Call (HOC) reporting (the blip at midnight can be ignored as we went closed for 1 minute):
Great job team (you know who you are!)
The final hurdle we had to get over in the last few weeks was driving the Avaya S8730 Media Server into an overload condition. This can be seen quite clearly during performance testing after 8PM:
For information, processor occupancy is defined as the percentage of time the configuration’s processor is busy performing call processing tasks, maintenance tasks, administration tasks, and operating system tasks. Occupancy is further divided into:
- Static Occupancy (Static Occ) which is the percentage of occupancy used by high priority background processes in support of call processing, maintenance, and administration functions
- Call Processing Occupancy (CP Occ) which is the percentage of occupancy used by call processing-level processes
- System Management Occupancy (SM Occ) which is the amount of time taken by lower priority activities such as administration and maintenance command processing
- Idle Occupancy (Idle Occ) which is the amount of time the processor is unused. There are several factors that drive down this number. These factors may reduce the idle occupancy to almost 0 percent during several 3-minute intervals. On a heavily-loaded configuration with frequent demand testing, the idle occupancy may drop to low levels for longer periods (perhaps 1-2 hours). These situations are normal and do not indicate a problem with the configuration.
It is not desirable for any system to function at 100 percent processor occupancy. Rather, the Static and Call Processing Occupancy should total no more than a maximum of 75%. By maintaining this 75% maximum limit, other system functions can be performed and bursts of caller activity can also be accommodated.
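As a rough illustration of the 75% guideline, here is a minimal Python sketch that flags any 3-minute interval where Static plus Call Processing occupancy goes over the limit. The class and function names are hypothetical, and the sample figures in the usage note are made up for illustration (only the 75% limit comes from the text above).

```python
from dataclasses import dataclass

# Guideline from the text: Static + Call Processing occupancy
# should total no more than 75%.
MAX_STATIC_PLUS_CP = 75.0


@dataclass
class OccupancySample:
    """One 3-minute interval. Field names are hypothetical; values are percentages."""
    static_occ: float
    cp_occ: float
    sm_occ: float
    idle_occ: float


def exceeds_guideline(sample, limit=MAX_STATIC_PLUS_CP):
    """Return True when Static + CP occupancy is above the limit."""
    return sample.static_occ + sample.cp_occ > limit


def overloaded_intervals(samples, limit=MAX_STATIC_PLUS_CP):
    """Return the indexes of the intervals that break the guideline."""
    return [i for i, s in enumerate(samples) if exceeds_guideline(s, limit)]
```

For example, with two made-up intervals, `overloaded_intervals([OccupancySample(10, 35, 5, 50), OccupancySample(10, 70, 5, 15)])` returns `[1]`, because only the second interval has Static + CP above 75%.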
The Occupancy report below clearly shows the call processing (CP) occupancy rising to 81% in one 3 minute interval!
In the end the fix was quite simple!
Previously we had been injecting test calls in directly over SIP trunks. However at the end of the day this was producing too many SIP messages for ACM to handle. Therefore, for the final test above we went to (expensive) test injection over the PSTN and all worked OK.
The Occupancy report below shows the call processing (CP) occupancy rising to a maximum of 35% in one 3 minute interval which is perfectly acceptable:
For future reference this is what we learnt during our diagnostic efforts ….
Doubled Calls = Double Call Processing
A single test call shows 4 connections in total per customer call. Hence with Genesys treatments there are an additional 2 connections (as expected). Thus it can be reasonably expected that the call processing load with Genesys treatments will be doubled:
Note: Tandem calls are those calls into Genesys which then come back out e.g. tromboned calls
Look Ahead Routing (LAR)
MST traces showed a lot of denial events 5008/1191. This means the outgoing SIP INVITE (to Genesys) did not get a response within the period set in the Alternate Route Timer on the routing pattern.
We had this timeout set to 2 seconds (rather than the default of 6 seconds) to fix an OAT defect. Therefore, if there is no response to a SIP INVITE after 2 seconds, ACM cancels the call and tries another Trunk Group. A lower Alternate Route Timer causes more LAR retries, and therefore higher CPU load, than a higher value would.
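To see why a shorter timer means more retries, here is a small Python sketch. It is a simplified model, not how ACM is implemented: it assumes each INVITE that gets no response within the timer causes exactly one LAR retry, and the response times in the usage note are made up rather than measured.

```python
def lar_retries(response_times_s, alternate_route_timer_s):
    """Count INVITEs whose first response arrived after the timer expired.

    Simplifying assumption: each such INVITE triggers exactly one
    LAR retry on another trunk group.
    """
    return sum(1 for t in response_times_s if t > alternate_route_timer_s)
```

With the made-up response times `[0.2, 0.5, 1.1, 2.4, 3.0, 4.5, 6.5]` seconds, a 2 second timer gives 4 retries while the default 6 second timer gives only 1.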
When multiple System Access Terminal (SAT) administration and maintenance commands are performed per second via the Communication Manager (CM) Operations Support Systems Interface (OSSI), system management processor occupancy can increase very rapidly, thus causing overall CPU occupancy to spike. In some instances this can drive the system into CPU overload.
Great care must be exercised when running CPU intensive SAT administration and maintenance commands. These commands should only be run when the system is processing low call volumes (off hours) and never during busy call traffic periods.
Avaya are a bit coy about stating what the SIP message processing throughput of Communication Manager 5.2.1 SP4 actually is.
The document “Avaya Aura™ Communication Manager System Capacities Table” describes the IP endpoint capacity of this system but not in the context of call attempts and connections.
The document “Avaya Aura™ Communication Manager 5.2.1 SP#5 Release Notes” shows that there are quite a few “SIP issues” which are fixed in every release.
The effect of duplication on SIP message processing should be considered e.g. PSN002232u – “H.323 and SIP station capacities and SIP trunk capacities for S8xx0 Servers running Avaya Aura™ Communication Manager 5.2.1” states that the Software Duplication feature is not optimised for use with SIP endpoints. Fortunately, at this client we are using hardware (DAL 2) duplication.
The following comments in the Avaya Aura™ Communication Manager 6.0 SP#1 Release Notes should not go unread!
“However, note that the capacities specified in that document pertain to general business configurations and may not be valid or recommended for Call Center (CC) solutions. Simultaneously achieving the upper bounds for multiple capacities including SIP trunks may not be possible for real-world CC systems. Call rates and other operational aspects of these CC systems may preclude realizing the maximum limits”
“*** IMPORTANT: All Call Center designs should be reviewed by the Sales Factory Design Center. Call Center designs that involve SIP trunking *must* go through the Sales Factory. ***”
We never got the chance to re-test this but we suspect that when an overload condition occurs, Genesys SIP server causes further overload by resending REFER messages without backing off “for several seconds” as it should do according to the SIP specification.
Under load conditions Avaya CM sends back status code 503 (Service Unavailable). The behaviour we observe is that the SIP message (REFER in this case) gets resent multiple times causing additional load.
For reference, overload occurs in the Session Initiation Protocol (SIP) when SIP servers have insufficient resources to process all SIP messages they receive. The SIP protocol specified in RFC 3261 provides the 503 (Service Unavailable) response code as a remedy for servers under overload. However, the current definition of 503 (Service Unavailable) has problems and can in fact amplify an overload condition. There is an Essential Correction to RFC 3261 which relates to this. Please see http://tools.ietf.org/html/draft-hilt-sip-correction-503-01
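For illustration, this is roughly what client-side back-off on a 503 could look like. It is a minimal Python sketch and not how Genesys SIP Server or Communication Manager is implemented: the class name, the destination key and the 5 second default are all assumptions. RFC 3261 does say that a client receiving a 503 should not send further requests to that server for the duration given in the Retry-After header, if one is present.

```python
import time


class BackoffGate:
    """Tracks, per destination, the earliest time the next request may be sent."""

    def __init__(self, default_backoff_s=5.0, clock=time.monotonic):
        # default_backoff_s is an assumed value, used when the 503
        # carries no Retry-After header.
        self._default = default_backoff_s
        self._clock = clock
        self._blocked_until = {}

    def on_response(self, destination, status_code, retry_after_s=None):
        """Record a response; a 503 blocks the destination for a while."""
        if status_code == 503:
            delay = retry_after_s if retry_after_s is not None else self._default
            self._blocked_until[destination] = self._clock() + delay

    def may_send(self, destination):
        """Return True once any back-off period for the destination has passed."""
        return self._clock() >= self._blocked_until.get(destination, float("-inf"))
```

A sender would call `may_send()` before resending a REFER and either hold the message or try an alternate server while it returns False, instead of resending straight away and adding to the overload.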
The fix may be in SIP Server 8.0.400.25:
Release Number 8.0.400.25
SIP Server now correctly releases a call when it receives a 503 Service Unavailable message in response to a re-INVITE request that it sent to the call originator. (ER# 248405320)