SIP Server failover using Windows Network Load Balancing (NLB)

As I mentioned in an earlier posting, we have deployed a HA pair of SIP T-Servers at this client and are using Windows Network Load Balancing between them.

Even though this is a Genesys-recommended solution, there are a number of known issues, and apparently Genesys are developing a stateless proxy to fix the problem.

In the interim we have been using an alarm reaction to enable or disable port 5060 on the relevant host during a switchover.

We have finally managed to make the switchover process reliable by using the following sequence:

  • Switch over to the backup component
  • Stop and then start the component that was previously in PRIMARY mode. This seems to be the secret!
  • Enable the cluster host where the backup component is running and disable the cluster host where the previous primary component was running

Image
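The order of those three steps is the important part, so here is a minimal Python sketch of it. This is not the alarm reaction script itself: the `act` callback, the step names and the host names are all hypothetical stand-ins for whatever the Management Layer and the NLB tooling actually need to be told.

```python
from typing import Callable, List


def run_switchover(old_primary: str, new_primary: str,
                   act: Callable[[str, str], None]) -> None:
    """Run the switchover steps in the order that proved reliable.

    `act(step, host)` is a hypothetical callback; in a real deployment it
    would drive the Management Layer or the NLB tooling.
    """
    # 1. Switch over to the backup component.
    act("switchover_to", new_primary)
    # 2. Stop and then start the component that was previously primary.
    act("stop_component", old_primary)
    act("start_component", old_primary)
    # 3. Enable the cluster host of the new primary, then disable the
    #    cluster host of the previous primary.
    act("nlb_enable", new_primary)
    act("nlb_disable", old_primary)


def dry_run(old_primary: str, new_primary: str) -> List[str]:
    """Record the steps instead of executing them."""
    log: List[str] = []
    run_switchover(old_primary, new_primary,
                   lambda step, host: log.append(f"{step}:{host}"))
    return log
```

Calling `dry_run("sip-a", "sip-b")` returns the five steps in order, which is a cheap way to check the sequence before wiring in real actions.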

I have configured a pair of alarm reactions using Log Event 5150 to automate the process. Note that the cancel timeout is set to 5 seconds to ensure that the alarm reactions fire again on fail-back.

Image

Image

Image
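To show why the short cancel timeout matters, here is a toy Python model. It rests on an assumption I have not verified against the Management Layer documentation: a detect event that arrives while the alarm is still active does not fire the reaction again, and the alarm clears once the cancel timeout has elapsed. The function name and the timings are made up for illustration.

```python
from typing import List


def reaction_times(event_times: List[float], cancel_timeout: float) -> List[float]:
    """Return the times (in seconds) at which the alarm reaction would fire.

    Assumption: a detect event arriving while the alarm is still active is
    ignored, and the alarm clears `cancel_timeout` seconds after it was raised.
    """
    fired: List[float] = []
    active_until = float("-inf")
    for t in sorted(event_times):
        if t >= active_until:
            fired.append(t)
            active_until = t + cancel_timeout
    return fired
```

With a failover at 0 seconds and a fail-back at 60 seconds, `reaction_times([0, 60], 5)` fires for both events, whereas `reaction_times([0, 60], 600)` fires only for the first because the alarm is still active when the fail-back happens.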

During normal running on the primary, the NLB status is as follows:

Image

On failover, the NLB status is as follows:

Image

Router Self Awareness

For this client we are using 3 Routing Server (URS) High Availability pairs with load balancing through Load Distribution Server (LDS).

During volume testing this week we noticed a problem where calls were being delivered twice to the same advisor, using both line 1 and line 2 on their station.

The URS 7.6 deployment guide describes this possibility in a section on “Router Self-Awareness”:

Starting with 7.6, you can configure URSs, such as those in a loading balancing scenario, to share routing information between themselves by setting a mode called Router Self-Awareness. If Router Self-Awareness is activated for the participating URSs, then the number of calls in transition sent to this target will include calls in transition sent by all URSs participating in the same Router Self-Awareness group.

Using Router Self-Awareness, URSs deployed in a load sharing mode can communicate with each other regarding selected targets and target statistics. This addresses potential load balancing issues across multiple URSs. It also addresses certain race (timing) conditions that can occur in agent-based routing.

When Router Self-Awareness mode is configured, URSs can exchange internal data in order to have more real time information on their working environment. In addition to information about any URS decision to send a call to some destination, information in the communication channel between URSs can also be used for:

  1. Agent blocking. If Router Self-Awareness is on, then every URS will block the agent for routing as soon it receive notification that an agent is selected by some other URS. The assumption here is that other URS’s notification can arrive much sooner than the agent will be reported as busy by Stat Server. This can save URS from the necessity to doing a reserving request that will unconditionally fail.
  2. Preventing other URSs from selecting the same targets (Agents, Places) during the early phase of routing before the agent reservation mechanism detects the call.
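The race described in the guide can be illustrated with a toy Python model. This is only a sketch of the idea, not how URS works internally: the `Router` class is hypothetical, and peers are notified by a direct method call rather than through a Message Server.

```python
from typing import List, Optional, Set


class Router:
    """Toy router that optionally blocks agents selected by its peers."""

    def __init__(self, name: str, self_aware: bool) -> None:
        self.name = name
        self.self_aware = self_aware
        self.blocked: Set[str] = set()    # agents known to be selected already
        self.peers: List["Router"] = []

    def notify(self, agent: str) -> None:
        """Receive a peer's 'agent selected' notification."""
        if self.self_aware:
            self.blocked.add(agent)

    def route(self, ready_agents: List[str]) -> Optional[str]:
        """Pick the first ready agent that this router has not blocked."""
        for agent in ready_agents:
            if agent not in self.blocked:
                self.blocked.add(agent)
                for peer in self.peers:
                    peer.notify(agent)
                return agent
        return None


def simulate(self_aware: bool) -> List[Optional[str]]:
    """Two routers route one call each before any agent is reported busy."""
    a, b = Router("urs1", self_aware), Router("urs2", self_aware)
    a.peers, b.peers = [b], [a]
    stale_view = ["agent1", "agent2"]     # both agents still look ready
    return [a.route(stale_view), b.route(stale_view)]
```

Without self-awareness, `simulate(False)` sends both calls to `agent1`, which is the double delivery we saw in volume testing. With it, `simulate(True)` sends the second call to `agent2`.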

To configure “Router Self-Awareness” the following steps are required:

  • Add a dedicated HA “Notification” Message Server Pair
  • On the Annex of the Message Server application add a “__ROUTER__” section with the options set as shown below

Image

  • On each URS add a connection to the primary “Notification” Message Server

Image

Static IP Routing

I had a problem yesterday caused by the reboot of an AES server after the installation of a service pack. After the server was rebooted we lost the Avaya T-Server link to it.

The problem was caused by a static IP route which had been added a couple of months ago as a quick fix and then forgotten about. The route did not survive the reboot.

The moral of the story: always fix the root cause of a problem there and then, otherwise it will come back and bite you at some time in the future.

Custom Alarm Conditions and Log Event Message IDs

This week I needed to implement some custom alarm conditions in preparation for Operational Acceptance Testing (OAT). Finding non-standard Log Event Message IDs turned out to be more difficult than you might think. After trawling through various component reference guides, I stumbled upon the .lms file within the install folder of each Genesys application. This is a plain text file which defines the specific alarm / log messages used by the component. Standard / common messages are contained in the common.lms file within the same folder.
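Since the .lms files are plain text, searching them for a message is easy to script. The Python sketch below assumes one entry per line with fields separated by `|`, the message ID in the first field and the message text in the last; check that against your own files before relying on it. The sample lines are invented for illustration and are not copied from a real .lms file.

```python
from typing import Iterable, List, Tuple

# Hypothetical sample, not copied from a real .lms file.
SAMPLE = [
    "# comment or header line",
    "9001|STANDARD|EXAMPLE_LINK_DOWN|Link to switch lost",
    "9002|TRACE|EXAMPLE_LINK_UP|Link to switch restored",
]


def find_messages(lines: Iterable[str], keyword: str) -> List[Tuple[str, str]]:
    """Return (id, text) pairs for entries whose text mentions `keyword`.

    Assumption: fields are separated by '|', with the message ID first and
    the message text last.
    """
    hits: List[Tuple[str, str]] = []
    needle = keyword.lower()
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0][:1].isdigit():
            continue  # skip comments, headers and blank lines
        if needle in fields[-1].lower():
            hits.append((fields[0], fields[-1]))
    return hits
```

Because the function accepts any iterable of lines, an open file handle for a component's .lms file can be passed in place of `SAMPLE`.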

You learn something new every day!

Image

Image

GVP 7.6 and the Management Layer

To integrate GVP 7.6 (aka the old Telera product) with the Genesys Management Layer, Genesys uses a Windows service named “CMEInterface” which, in theory, allows the status of an IPCS server to be monitored and controlled through Solution Control Interface (SCI).

I say in theory because testing I have performed indicates that starting and stopping IPCS components through SCI is somewhat hit and miss. This is especially true in an environment where Genesys SIP server is deployed in a HA configuration using Windows Load Balancing to present a Virtual IP Address (VIP) for the SIP Server pair.

Therefore, for this client I have changed the configuration as follows:

For each GVP IPCS installation, I created a Third Party Application for the WatchDog. On the Annex tab I added a “start_stop” section using the standard Windows Service Control command to start or stop the “WatchDog” Windows service:

Image

Image
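For anyone who cannot see the screenshots, the commands are the ordinary `sc start` and `sc stop` against the service name. Here is a small Python sketch that builds the same command lines; the function names are my own, and `WatchDog` is the service name used on this installation.

```python
import subprocess
from typing import List


def service_command(action: str, service: str = "WatchDog") -> List[str]:
    """Build the Windows Service Control command line for `action`."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return ["sc", action, service]


def run_service_command(action: str, service: str = "WatchDog") -> int:
    """Run the command and return its exit code (Windows only)."""
    return subprocess.run(service_command(action, service)).returncode
```

`" ".join(service_command("stop"))` gives `sc stop WatchDog`, which is the kind of string that goes into the stop command of the Third Party Application.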

For each GVP IPCS installation, I created two alarm reaction scripts – one to start the IPCS WatchDog service and one to stop it:

Image

I created an alarm condition to detect Hot Standby primary mode being activated (Log Event 4563) on the backup SIP T-Server:

Image

And added alarm reactions to stop and restart each GVP IPCS Server:

Image

Note that in the example shown above I have also configured an additional Alarm Reaction which starts another Third Party application. This is actually a batch script which uses Windows Load Balancing to disable port 5060 on the server where the primary SIP server is installed and enable port 5060 on the server where the backup SIP Server (which has now been promoted to PRIMARY) is running. By using Windows Load Balancing in this way I can configure a single Virtual IP Address (VIP) in the host mappings in Avaya SIP Enablement Services (SES):

Image

8.0 HA Failover

Had a strange situation this morning (or maybe my memory is not as good as it used to be!).

We have a number of components configured in HA pairs, e.g. T-Servers, Stat Servers and Routing Servers. As part of pre-OAT (Operational Acceptance Testing) I have been killing primary components via Task Manager (rather than SCI) and expecting the backup component to be promoted to primary automatically by the Management Layer. However, it wasn’t, and all I got was an ‘Unplanned Solution Status Change’ alarm:

Image

My initial thought was that I needed to configure an associated alarm reaction to force failover to the backup component:

Image

Before I did that I set ‘Auto-Restart’ on each of the HA components and tried again. It worked! When the primary component was killed via Task Manager the backup component was automatically promoted to primary and the killed component was restarted as expected. Since I do not normally configure solutions with Auto-Restart I also checked that if the component was stopped via SCI it stayed stopped (and it did).

This is not the behaviour I remember in Framework 7.

SQL Server Log File Truncation

Had a little problem this morning running out of space on one of our SQL Server boxes in development. The problem was caused by log file growth on an ICON instance.

Image

A standard shrink did not work, so I detached the database and then deleted the log file.

Image

To re-create the log file I had to copy the existing MDF file to a temporary file, create a new database of the same name and size, stop SQL Server, copy the original MDF file back over the newly created file and then restart SQL Server.

I then put the database into emergency mode and ran a CHECKDB.

To get the database back out of single user mode I needed to kill my own SQL Server Management Studio connection to the database using kill <spid>.
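For reference, here is a Python sketch that generates the sort of T-SQL involved in the last two steps. It is a reconstruction under assumptions, not a record of exactly what I ran: the database name `ICON_DEV` is hypothetical, the explicit `SINGLE_USER` step is assumed, and any CHECKDB repair option is deliberately left out because it can lose data.

```python
from typing import List


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any ']'."""
    return "[" + name.replace("]", "]]") + "]"


def emergency_check_statements(database: str) -> List[str]:
    """Return T-SQL for an emergency-mode consistency check.

    Assumption: the database is switched to single-user mode for the check
    and back to multi-user mode afterwards.
    """
    db = quote_name(database)
    literal = database.replace("'", "''")   # escape for the string literal
    return [
        f"ALTER DATABASE {db} SET EMERGENCY;",
        f"ALTER DATABASE {db} SET SINGLE_USER WITH ROLLBACK IMMEDIATE;",
        f"DBCC CHECKDB (N'{literal}') WITH NO_INFOMSGS;",
        f"ALTER DATABASE {db} SET MULTI_USER;",
    ]
```

Printing `emergency_check_statements("ICON_DEV")` line by line gives a script to review before running it in Management Studio. The final `MULTI_USER` statement can only be run from the one connection that holds the database, which is why I ended up killing my other session.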

All fixed now.

Image

Fixed the root cause of the problem by setting the database recovery model to simple rather than full (which should be OK for development) to prevent future problems:

Image
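The same change can be made with a single T-SQL statement instead of the properties dialog. A minimal Python sketch that builds it, plus a query to confirm the setting, is below; the database name `ICON_DEV` is hypothetical.

```python
def set_simple_recovery(database: str) -> str:
    """Return the T-SQL that switches `database` to the SIMPLE recovery model."""
    quoted = "[" + database.replace("]", "]]") + "]"
    return f"ALTER DATABASE {quoted} SET RECOVERY SIMPLE;"


def check_recovery_model(database: str) -> str:
    """Return a query that reports the current recovery model of `database`."""
    literal = database.replace("'", "''")
    return ("SELECT name, recovery_model_desc FROM sys.databases "
            f"WHERE name = N'{literal}';")
```

With the simple recovery model there are no log backups, so this is a trade-off that suits development but needs more thought for production.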

Big Apple

Me and the eldest are off next month to the Metallica gig at Madison Square Garden to celebrate his 16th birthday so it is time to book some flights.

Image

First attempt: Opodo (http://www.opodo.co.uk) – 586 pounds each!

Image

Second attempt: Lastminute (http://www.lastminute.com) same flight and a 7 quid saving. Wow!

Image

OK time to look at indirect flights. Opodo is the winner this time as the same combination was not offered on Lastminute. KLM 1 : Air France and Delta 0.

Yes, I know the flight time is longer and there is more hassle, but I’ve saved 420 pounds and the lad will be back home in time for school on Tuesday!

Image

Interesting to note that the tax is actually higher than the fare. Who gets 139 quid and who pockets the remaining 230? No wonder the airlines are not making the same profits these days.

Financial Markets

For the last couple of years, with the onset of 40+ life (aka grumpy old man according to the kids!), I’ve been tracking the financial markets and in particular the UK housing market as a general financial indicator.

There are lots of websites out there but the two I read regularly are http://www.marketoracle.co.uk/ and http://www.housepricecrash.co.uk/.

In relation to UK house prices the following chart keeps appearing:

Image

Despite studying Math at degree level and thinking that I understand Fast Fourier transforms and associated Complex numbers I still have a problem understanding what the chart above shows!

Does it mean (according to the Nationwide) that in February 2009 the average UK house price was down 17% but as of September 2009 there has been a recovery and the average price is now the same as at the start of 2009? I don’t think so. Anyway, if that was the case the figure in January would always be 0%.
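One possible reading, and it is only my assumption about how the chart is built, is that each point compares a month with the same month a year earlier rather than with January. The Python sketch below uses an invented price index (not Nationwide data) to show how different the two calculations look.

```python
from typing import Dict

# Invented index values for illustration only, not Nationwide data.
INDEX: Dict[str, float] = {
    "2008-02": 120.0,
    "2008-09": 110.0,
    "2009-01": 101.0,
    "2009-02": 100.0,
    "2009-09": 110.0,
}


def pct_change(current: float, reference: float) -> float:
    """Percentage change of `current` relative to `reference`."""
    return 100.0 * (current / reference - 1.0)


# Year-on-year: compare with the same month one year earlier.
yoy_feb = pct_change(INDEX["2009-02"], INDEX["2008-02"])
yoy_sep = pct_change(INDEX["2009-09"], INDEX["2008-09"])

# Year-to-date: compare with January of the same year.
ytd_sep = pct_change(INDEX["2009-09"], INDEX["2009-01"])
```

With these made-up numbers the year-on-year figure is about -16.7% in February and 0% in September, while September is about 8.9% above January. On that reading a 0% point would mean prices match those of twelve months earlier, not those at the start of the year.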

Surely a simple chart showing the average UK house price on a per month basis would be easier to understand?

Regardless, as somebody looking to purchase property in the near future I still prefer to use this chart from http://www.marketoracle.co.uk/

Image
