Prosody application note: swapping cards

Occasionally it is necessary to replace a Prosody card in a system which has already been configured. Ideally it would be possible to replace it while the system is running (called "hot swapping") because this means that the use of other cards in the system can continue uninterrupted, giving higher overall availability. If this is not possible, the card can be replaced after switching off the system (called "cold swapping"). This note considers the procedures required for swapping cards. The procedures are considered in three parts: the basic procedure for hot swapping, the ideal procedure for hot swapping, and procedures for cold swapping.

For any kind of card swapping, applications must be written so that they do not assume a particular card ordering. This means that they must use the proper API calls to match resources on a card. For example, the card number used in sw_set_output() to connect an incoming call must be obtained from call_port_2_swdrvr(). Similarly, the application must use sm_get_card_switch_ix() on Prosody cards rather than assuming that the numbering of Prosody processors follows that of switch indexes.

It is often very desirable to use Prosody resources for a call on the same card as the trunk handling the call. This is especially useful for hot swapping as it minimises the number of calls which depend on a card. To match calls to Prosody channels, allocate the channels with sm_channel_alloc_placed(), specifying a Prosody processor module on the desired card. To discover which Prosody processor module is on which card, call sm_get_card_info() on each card and use the module_count values (modules are numbered from 0, starting on card 0). To find out which trunks and Prosody processor modules share a card, match the values from call_port_2_swdrvr() with those from sm_get_card_switch_ix().
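
The matching described above can be sketched as follows. This is a minimal illustration in Python, not a use of the real API: the lists module_counts and card_switch_ix and the argument trunk_switch_ix are hypothetical stand-ins for the values an application would obtain from sm_get_card_info(), sm_get_card_switch_ix() and call_port_2_swdrvr() respectively, and the example system is invented.

```python
def modules_by_card(module_counts):
    """Return, per Prosody card, the list of processor module numbers on it.

    Modules are numbered from 0, starting on card 0, so the modules of
    card n follow directly after those of card n-1.
    """
    result = []
    next_module = 0
    for count in module_counts:
        result.append(list(range(next_module, next_module + count)))
        next_module += count
    return result


def modules_for_trunk(trunk_switch_ix, module_counts, card_switch_ix):
    """Return the module numbers that share a card with the given trunk.

    trunk_switch_ix -- switch driver index reported for the trunk's port
    module_counts   -- module count of each Prosody card, in card order
    card_switch_ix  -- switch driver index of each Prosody card, same order
    An empty list means no Prosody card shares that switch driver.
    """
    per_card = modules_by_card(module_counts)
    for card, switch_ix in enumerate(card_switch_ix):
        if switch_ix == trunk_switch_ix:
            return per_card[card]
    return []


# Hypothetical system: three Prosody cards with 2, 4 and 1 modules, whose
# switch driver indexes do not follow the Prosody card order.
counts = [2, 4, 1]
switch_ix = [1, 0, 2]
print(modules_for_trunk(0, counts, switch_ix))
```

The module numbers returned are the candidates an application would specify when allocating a placed channel for a call on that trunk. Note that the sketch never assumes that Prosody card order and switch driver order agree.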

Basic procedure for hot swapping

Hot swapping is done to maximise the availability of the system, so the procedure for replacing a card must be carefully designed to ensure that it does not cause a further failure. This is especially important as procedures for recovering from failure are notorious as a cause of further failures (numerous examples of this are to be found in the Risks forum moderated by Dr. Peter G. Neumann).

  1. Inform application that card is to be removed
  2. Wait for application to stop using card
  3. Remove old card
  4. Add new card
  5. Test and configure new card
  6. Inform application that new card is available

Inform application that card is to be removed

In general, it is essential to inform the running application that a card is to be removed. This is because each card has several resources and even when there is a fault on a card some resources may still be in use. For example, if a trunk port on a card is faulty, there will typically be three or seven other trunks on the same card and it is necessary to stop using them before the card can be removed. In many applications this means waiting for calls to end normally so that there is no abrupt interruption of service.

Wait for application to stop using card

This may take a long time. The application may need to wait for events outside its control. For example, if it is waiting for calls to clear, it may have to wait an arbitrarily long time. An application may use a timeout to prevent this causing an excessive delay, but note that this appears to external users of the system as reduced reliability, because they find that long calls are occasionally cleared. See "Ideal procedure for hot swapping" below for how to minimise this problem.
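
A wait with a timeout might look like the following minimal sketch. It assumes the application can supply a callable (here the hypothetical calls_in_progress) which reports how many calls still use the card; the function name, its parameters and the polling approach are all illustrative assumptions. The clock and sleep functions are parameters so that the logic can be exercised without real delays.

```python
import time


def wait_for_card_idle(calls_in_progress, timeout_s, poll_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Wait until no calls remain on a card, or until the timeout expires.

    calls_in_progress -- hypothetical callable returning the number of
                         calls still using the card
    Returns True if the card became idle by itself, False if the timeout
    expired and the caller must clear the remaining calls.
    """
    deadline = clock() + timeout_s
    while calls_in_progress() > 0:
        if clock() >= deadline:
            return False
        sleep(poll_s)
    return True


# Demonstration with a simulated clock so that the example runs instantly.
sim_time = [0.0]
calls_left = [2]


def sim_sleep(seconds):
    sim_time[0] += seconds
    calls_left[0] = max(0, calls_left[0] - 1)   # one call clears per poll


idle = wait_for_card_idle(lambda: calls_left[0], timeout_s=60.0,
                          clock=lambda: sim_time[0], sleep=sim_sleep)
print(idle)
```

A False result is the case which users see as reduced reliability, so the timeout should be as long as the operator can tolerate.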

Remove old card

Since the old card is no longer in use, it can be removed without disturbing the system. On a CompactPCI system, since we have waited for the application to stop using this card, the blue LED can be illuminated (using an operating system tool) to indicate which card to remove. It is especially important that the operator have a positive indication like this of which card to remove since in a generally reliable system cards will be replaced very rarely, so we cannot rely on the operator being very familiar with the layout of the cards.

Add new card

We can now add a new card to the vacant slot. If necessary, an operating system tool can be used to make the driver attach to the card and prepare it for use.

Test and configure new card

It is very important at this stage to test the card. We need to be certain that a valid card has been installed (i.e. both that it is a card of the correct type and that it is not a faulty card which has been accidentally returned to service instead of being sent for repair). We also need to ensure that it is correctly connected, for example that trunks have been connected to its ports and that they are in the right ports. We must perform any appropriate configuration (such as downloading firmware) and this must be done in a way which ensures that when the whole system is next re-booted the system configuration is correct (so, for example, scripts which download firmware correctly refer to the new card).
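
The checks described above can be collected into one function which reports every problem found, as in the following minimal sketch. The dictionary keys, the use of a serial number to recognise a card previously recorded as faulty, and the idea of a per-port "trunk detected" indication are all assumptions made for illustration; a real system would obtain this information from the card and from its own maintenance records.

```python
def check_new_card(card, expected_type, known_faulty_serials, required_ports):
    """Return a list of problems found with a newly fitted card.

    card -- dict with hypothetical keys:
            "type"     card type string
            "serial"   serial number string
            "ports_up" collection of port numbers on which a trunk is detected
    An empty list means every check passed.
    """
    problems = []
    if card["type"] != expected_type:
        problems.append("wrong card type: %s" % card["type"])
    if card["serial"] in known_faulty_serials:
        problems.append("card %s is recorded as faulty" % card["serial"])
    missing = sorted(set(required_ports) - set(card["ports_up"]))
    if missing:
        problems.append("no trunk detected on ports: %s" % missing)
    return problems


# Hypothetical example: the right type of card, but one trunk not yet connected.
new_card = {"type": "E1x4", "serial": "S1002", "ports_up": {0, 1, 2}}
for problem in check_new_card(new_card, "E1x4", {"S0917"}, [0, 1, 2, 3]):
    print(problem)
```

This sketch only detects that a trunk is present on each required port. Confirming that each trunk is in the right port needs a further check, such as a test call on each trunk. The card should be offered to the application only when the list of problems is empty.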

Inform application that new card is available

After testing, we know that we have a suitable card ready for use, so the application can use it.

Ideal procedure for hot swapping

The basic procedure for hot swapping described above has a significant problem: it leaves the system running at reduced capacity while a card is being replaced. When a card is handling phone calls and the application must wait for the calls to clear, the system may need to be at reduced capacity for a long time. This can often be avoided by adding the new card before removing the old one. If this is done, the application can be permitted to continue using the old card for much longer, so the timeout mentioned in "Wait for application to stop using card" above can be very long, perhaps as long as a day. Alternatively, an application may not use a timeout but instead allow an operator to tell it when to forcibly release any resources remaining in use. This would allow maximum availability and reliability, with calls being cleared only if they are both very long and continue past a time when the operator urgently needs to remove the card. The disadvantages of this compared with the basic procedure are that there must be a spare slot for the new card and that the application must be able to handle an extra card during the transition period.
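
One way for an application to support this is to record which cards are being drained and to place new calls only on the others. The following is a minimal sketch under that assumption; the CardPool class and its method names are hypothetical, and placing each call on the least loaded card is just one possible policy.

```python
class CardPool:
    """Tracks which cards may take new calls while others are drained."""

    def __init__(self, cards):
        self.calls = {card: 0 for card in cards}   # calls in progress per card
        self.draining = set()                      # cards awaiting removal

    def add_card(self, card):
        self.calls[card] = 0

    def start_draining(self, card):
        self.draining.add(card)

    def place_call(self):
        """Place a new call on the least loaded card that is not draining."""
        usable = [c for c in self.calls if c not in self.draining]
        if not usable:
            return None
        card = min(usable, key=lambda c: self.calls[c])
        self.calls[card] += 1
        return card

    def end_call(self, card):
        self.calls[card] -= 1

    def can_remove(self, card):
        return card in self.draining and self.calls[card] == 0

    def remove_card(self, card, force=False):
        """Remove a drained card; force=True models the operator override,
        in which any calls remaining on the card are abandoned."""
        if not (self.can_remove(card) or (force and card in self.draining)):
            return False
        del self.calls[card]
        self.draining.discard(card)
        return True


pool = CardPool(["old"])
first = pool.place_call()        # goes to the old card
pool.add_card("new")             # new card fitted in the spare slot
pool.start_draining("old")
second = pool.place_call()       # new calls now avoid the old card
pool.end_call(first)             # the existing call ends normally
print(first, second, pool.remove_card("old"))
```

Because the new card is in service before the old one starts draining, capacity is not reduced while the remaining calls on the old card are left to end normally.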

Procedures for cold swapping

Hot swapping requires suitable hardware and support from the operating system, so it is not always possible on a particular system. However, it is always possible to replace cards with the system switched off (cold swapping). This can be done as if it were a fresh installation, but it may be possible to automate the most common case of replacing a single card with an identical one.

Simple card replacement

It can be useful to have an automated method for replacing a single card at a time with an identical card. This can be achieved if all applications on the system are appropriately written. The method is essentially to run a test when the system boots to check the cards which are present. If the test finds that one card is missing and an extra card has appeared, it modifies the system configuration to substitute the new card for the missing one.
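
The core of such a test is a comparison between the cards named in the configuration and the cards actually found. The following minimal sketch shows that comparison only. It is not the supplied script: the function names are hypothetical, and identifying each card by a unique string such as a serial number is an assumption.

```python
def detect_single_swap(configured, present):
    """Compare configured card identifiers with those found at boot.

    Returns (old, new) if exactly one configured card is missing and
    exactly one unconfigured card has appeared, otherwise None.
    """
    configured = set(configured)
    present = set(present)
    missing = configured - present
    extra = present - configured
    if len(missing) == 1 and len(extra) == 1:
        return (next(iter(missing)), next(iter(extra)))
    return None


def substitute_card(config, old, new):
    """Return a copy of the configuration with card `old` replaced by `new`.

    config -- dict mapping a card identifier to that card's settings
    """
    return {(new if card == old else card): settings
            for card, settings in config.items()}


# Hypothetical configuration keyed by serial number.
config = {"S1001": "firmware-a", "S1002": "firmware-b"}
found = ["S1001", "S2005"]

swap = detect_single_swap(config, found)
if swap is not None:
    config = substitute_card(config, *swap)
print(config)
```

Any other difference, such as two cards missing or an extra card with none missing, returns None so that the configuration is left alone for an operator to examine. A fuller check would also confirm that the new card is of the same type as the one it replaces.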

The test for a card being replaced must be carefully written to avoid assuming that the system configuration is correct, since it is supposed to be checking the configuration. An example is provided as a Perl script (swapcard.pl) in the diag directory. This example is for Solaris.

swapcard.pl example script

This script is a working example of how to check for replacement of a card. It does everything necessary to change the system configuration to use the new card instead of the old one. In some systems it may be necessary to perform additional steps. For example, if an application has a configuration file which refers to cards, it may be necessary to update that file also to reflect the change. You can do this by adding the appropriate code to the swapcard subroutine. There are comments in the script to describe how all of it works.


Document reference: AN 1384