How can I configure fault tolerant NICs with RV?

Article ID: KB0085681

Products Versions
TIBCO Rendezvous -
Not Applicable -

Description

Resolution:
In situations where TIBCO Rendezvous is running on a machine with more than one NIC, the NICs can be configured in a fault tolerant mode to reduce downtime and in some cases entirely prevent data loss.

To do this, configure the NICs with the same IP address and the same netmask, with one interface up (ifconfig eth0 up) and the other(s) down (ifconfig eth1 down). In this configuration, TIBCO Rendezvous sends all traffic for that subnet through the interface that is up.
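As a sketch of the configuration described above, the POSIX shell function below gives every FT NIC the same address and netmask and leaves only the primary up. The function name ft_configure and the example address 10.10.10.5 with netmask 255.255.255.0 are hypothetical. It assumes an ifconfig that accepts the common "ifconfig IF ADDR netmask MASK up|down" form (as on Linux and Solaris), so check your platform's man page.

```shell
# ft_configure ADDRESS NETMASK PRIMARY STANDBY...  (hypothetical helper)
# Gives all FT NICs the same address/netmask; only PRIMARY ends up "up".
ft_configure() {
  _addr=$1; _mask=$2; _primary=$3
  shift 3
  # Take every FT NIC down first so the shared address is not active twice.
  for _nic in "$_primary" "$@"; do
    ifconfig "$_nic" down || return 1
  done
  # Some ifconfig implementations may bring an interface up when an address
  # is assigned; the trailing "down" leaves each standby down again.
  for _nic in "$@"; do
    ifconfig "$_nic" "$_addr" netmask "$_mask" down || return 1
  done
  # Finally bring the primary up with the shared address.
  ifconfig "$_primary" "$_addr" netmask "$_mask" up
}
```

For example, as root: ft_configure 10.10.10.5 255.255.255.0 eth0 eth1. This only changes the running system; making the configuration persist across reboots is platform specific.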

If the primary interface fails, bring the failed interface down (ifconfig eth0 down) and bring one of the redundant interfaces up (ifconfig eth1 up). While all of the FT interfaces are down, no traffic is sent to the IP subnet they serve.
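The manual failover step above can be wrapped so that the standby is brought up only after the failed interface has actually gone down, which preserves the rule that at most one FT NIC is up. ft_swap is a hypothetical name, and the sketch assumes the same ifconfig syntax used in the text.

```shell
# ft_swap FAILED STANDBY  (hypothetical helper)
# Takes the failed FT NIC down and, only if that succeeded, brings the
# standby up, so the two are never up at the same time.
ft_swap() {
  if ! ifconfig "$1" down; then
    echo "could not bring $1 down; leaving $2 down" >&2
    return 1
  fi
  ifconfig "$2" up
}
```

For example, as root: ft_swap eth0 eth1.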

NOTE:  You must make sure that no more than one of the FT NICs is up at any given time.
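One way to verify this rule is to count how many of the FT NICs ifconfig reports as up. This is a sketch: is_up and ft_check are hypothetical names, and the grep patterns assume either the older "UP BROADCAST ..." line or the newer Linux "flags=...<UP,...>" ifconfig output; other platforms may need a different test.

```shell
# is_up NIC: succeed if ifconfig reports NIC as up.  The two patterns
# cover the older "UP BROADCAST ..." line and the newer Linux
# "flags=...<UP,...>" format (assumption: adjust for your platform).
is_up() {
  ifconfig "$1" 2>/dev/null | grep -e '^[[:space:]]*UP ' -e '[<,]UP[,>]' > /dev/null
}

# ft_check NIC...: print each FT NIC that is up and fail if more than
# one of them is up.
ft_check() {
  _up=0
  for _nic in "$@"; do
    if is_up "$_nic"; then
      _up=$((_up + 1))
      echo "$_nic is up"
    fi
  done
  if test "$_up" -gt 1; then
    echo "ERROR: $_up FT NICs are up; at most one may be up" >&2
    return 1
  fi
  return 0
}
```

ft_check eth0 eth1 returns non-zero (and prints an error) when both interfaces are up, so it can be run periodically or before a failover.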

In preliminary tests, we found that traffic to the subnet resumes as soon as one of the redundant NICs is brought up. If the failover completes within the daemon's reliability window (set with the rvd -reliability option), there will not be any data loss. Failovers that take longer than this may result in data loss, which is reported with the "_RV.ERROR.SYSTEM.DATALOSS.*.*" advisory message.
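The paragraph above amounts to a simple rule: a failover loses no data if it finishes within the daemon's reliability window. The sketch below only does that arithmetic. RELIABILITY and within_window are hypothetical names, and the value 60 is a placeholder that must be set to the reliability window your rvd actually uses.

```shell
# Placeholder: set this to the reliability window (seconds) of your rvd.
RELIABILITY=60

# within_window START END: succeed if a failover that began at epoch
# second START and finished at END fits inside the reliability window.
within_window() {
  _elapsed=$(( $2 - $1 ))
  if test "$_elapsed" -le "$RELIABILITY"; then
    echo "failover took ${_elapsed}s (within ${RELIABILITY}s window)"
    return 0
  fi
  echo "failover took ${_elapsed}s; expect a DATALOSS advisory" >&2
  return 1
}
```

For example, on systems whose date supports +%s (GNU and BSD date do; it is not required by POSIX): start=$(date +%s); sh /bin/nicfailover.sh eth0 eth1; within_window "$start" "$(date +%s)".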

Because we do not QA this type of failover, we cannot guarantee that it will work on all systems, and you are encouraged to test it yourself. If you do experience problems, you must restart the daemon after the failover. Unless no messages are sent between the failover and the daemon restart, there will be unreported data loss (no advisory is sent) of the messages sent between the interface failure and the restart of the daemon. If you still experience problems after restarting the daemon, please contact TIBCO Support.

NOTE: In situations where message loss is not acceptable, please use certified messaging.

To prevent accidental misconfiguration, it is strongly recommended that you automate the failover process with a script such as the following:

----- begin nicfailover.sh -----
#! /bin/sh
#
# Usage: sh nicfailover.sh NIC1 NIC2 [NIC3 ...]
# Brings down whichever of the listed FT NICs is up, brings up the first
# of the others that comes up successfully, then optionally restarts rvd.

TEST=0  #set to 1 when testing (interfaces and the daemon are left alone)

#killall might not be a command on your system.  Replace with
#a suitable alternative if required.  WARNING: on Solaris, killall
#kills all active processes; use "pkill -x rvd" there instead.
#comment out the following line if restarting the daemon is
#not required.
KILL_DAEMON="killall rvd"

#comment out the following line if the daemon will be restarted
#automatically by the applications.
START_DAEMON="rvd"

# Print a timestamped log line.
log() {
  echo "`date '+%x %X'`: $*"
}

# Log an error, report the failover as failed and stop.
fail() {
  log "$*"
  log "NIC failover failed"
  exit 1
}

# Succeed if interface $1 is up.  Matches both the older ifconfig format
# ("UP BROADCAST ...") and the newer Linux one ("flags=...<UP,...>").
is_up() {
  ifconfig "$1" 2>/dev/null | grep -e '^[[:space:]]*UP ' -e '[<,]UP[,>]' > /dev/null
}

log "NIC failover started"
if test $# -lt 2; then
  fail "Error: At least 2 NICs must be specified to perform failover"
fi

# "command -v" is the portable replacement for "which".
if ! command -v ifconfig > /dev/null 2>&1; then
  fail "Error: Cannot find ifconfig."
fi

# Bring down every listed NIC that is currently up.
FAILED_NIC=""
for NIC in "$@"; do
  if is_up "$NIC"; then
    if test -n "$FAILED_NIC"; then
      log "  WARNING: Redundant NICs '$NIC' and '$FAILED_NIC' were both up"
    fi
    log "  Bringing down '$NIC'..."
    if test "$TEST" -eq 0; then
      ifconfig "$NIC" down || fail "  ...failed"
    fi
    log "  ...done"
    FAILED_NIC="$NIC"
  fi
done

# Bring up the first of the remaining NICs that comes up successfully.
UP=0
for NIC in "$@"; do
  if test "$NIC" != "$FAILED_NIC"; then
    log "  Bringing up '$NIC'..."
    if test "$TEST" -eq 0 && ! ifconfig "$NIC" up; then
      log "  ...failed"
    else
      log "  ...done"
      UP=1
      break
    fi
  fi
done

if test "$UP" -ne 1; then
  fail "Error: Could not bring up any of the redundant NICs"
fi

# Optionally restart the daemon so that it uses the newly active NIC.
if test "$TEST" -eq 0 && test -n "$KILL_DAEMON"; then
  log "  Killing daemon..."
  if $KILL_DAEMON; then
    log "  ...killed"
  else
    log "  ...failed.  Was daemon running?"
  fi
  if test -n "$START_DAEMON"; then
    log "  Starting daemon..."
    $START_DAEMON || fail "  ...failed."
    log "  ...started"
  fi
fi

log "NIC failover complete"
----- end nicfailover.sh -----

This script takes a list of interfaces as arguments, brings down whichever one is up, brings up one of the others, and optionally restarts the daemon.
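The choice of which NIC nicfailover.sh brings up first is simply the first listed NIC that was not the one just brought down. Isolated as a hypothetical function (next_nic is not part of the script), the rule looks like this:

```shell
# next_nic FAILED NIC...  (hypothetical helper)
# Print the first NIC in the list other than FAILED, which is the first
# candidate nicfailover.sh tries to bring up.
next_nic() {
  _failed=$1
  shift
  for _nic in "$@"; do
    if test "$_nic" != "$_failed"; then
      echo "$_nic"
      return 0
    fi
  done
  return 1
}
```

Because of this rule, the argument order is the preference order. It also means that after a second failover (eth1 failing after eth0 did), the script tries the originally failed NIC again, which is one reason to replace a failed interface or remove it from the list promptly.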

NOTE:  If the script restarts the daemon, it is recommended that you prevent your applications from restarting it themselves by specifying the loopback address in their daemon parameter.
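Following the note above, an application's daemon parameter can name the loopback address explicitly. For example, with the rvlisten sample program (the subject is a placeholder, and 7500 is the default rvd port; use the port your daemon listens on):

```shell
# Connect to the local rvd through the loopback address instead of
# relying on the default daemon parameter (placeholder subject).
rvlisten -daemon tcp:127.0.0.1:7500 "TEST.SUBJECT"
```

In your own programs, the same string is passed as the daemon argument when the transport is created.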

While the above script performs the actual failover, you still need to decide when to call it. How you do this depends on your situation, but one of the easiest options is a cron job that sends a broadcast ping to the NIC's IP subnet and performs the failover if there is no response. Many hosts are configured to ignore broadcast pings, so make sure that at least one host on the subnet answers them; otherwise the test fails even when the interface is healthy. The following script is an example of what would be called from the cron job.

----- begin testsubnet.sh -----
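#! /bin/sh

# change the address to the broadcast address of the subnet you want
# to test (for example, 10.10.10.255 for the subnet 10.10.10.0/24)
SUBNET=10.10.10.255

#change the interfaces to those set up for FT on the above subnet.
#this can be determined automatically from ifconfig, based on the
#subnet specified, but is left as an exercise for the reader.
NICS="eth0 eth1"

echo "`date '+%x %X'`: Testing subnet '$SUBNET'..."

# -b (allow pinging a broadcast address) is a Linux ping option; other
# platforms use a different ping syntax.
ping -c 1 -b "$SUBNET" > /dev/null 2>&1
if test $? -ne 0; then
  echo "`date '+%x %X'`: ...failed"
  # use the full path, since cron does not run this script from /bin
  sh /bin/nicfailover.sh $NICS
else
  echo "`date '+%x %X'`: ...OK"
fi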
----- end testsubnet.sh -----
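The comments in testsubnet.sh leave determining the FT interfaces automatically as an exercise. One possible sketch, using ip instead of ifconfig, assumes Linux with iproute2, whose "ip -o -4 addr show" prints one line per address in the form "2: eth0    inet 10.10.10.5/24 ...", including interfaces that are down (check this on your system). It matches interfaces by the broadcast address of their subnet. bcast_addr and nics_on_subnet are hypothetical names, and the code needs a POSIX shell with 64-bit arithmetic, such as dash or bash.

```shell
# bcast_addr ADDRESS PREFIX: print the broadcast address of the subnet
# containing ADDRESS, e.g. "bcast_addr 10.10.10.5 24" prints 10.10.10.255.
bcast_addr() {
  _oldifs=$IFS
  IFS=.
  set -- $1 "$2"          # split the dotted quad into $1..$4; $5 = prefix
  IFS=$_oldifs
  _ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  _b=$(( _ip | ((1 << (32 - $5)) - 1) ))   # set all host bits
  echo "$(( (_b >> 24) & 255 )).$(( (_b >> 16) & 255 )).$(( (_b >> 8) & 255 )).$(( _b & 255 ))"
}

# nics_on_subnet BCAST: read "ip -o -4 addr show" output on stdin and
# print each interface whose address lies in the subnet with that
# broadcast address.
nics_on_subnet() {
  while read -r _idx _nic _family _cidr _rest; do
    test "$_family" = inet || continue
    _b=$(bcast_addr "${_cidr%/*}" "${_cidr#*/}")
    if test "$_b" = "$1"; then
      echo "$_nic"
    fi
  done
}
```

With these functions added to testsubnet.sh, the hard-coded list could be replaced by something like NICS=`ip -o -4 addr show | nics_on_subnet 10.10.10.255`.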

By placing the above (appropriately modified) scripts in your /bin directory and adding the following line to root's crontab (crontab -e as root, since changing interface state with ifconfig requires root privileges), the subnet will be tested every minute and failover will be performed whenever the ping to the subnet fails. For less frequent testing, see 'man crontab'.

* * * * *  /bin/sh /bin/testsubnet.sh >> /var/log/testsubnet.log 2>&1
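If you prefer not to edit the crontab interactively, the entry can be added from a script. The sketch below (add_cron_line is a hypothetical name) copies the existing crontab from standard input, drops any identical copy of the line so that re-running it does not create duplicates, and appends the line once. The example entry includes 2>&1 so that error messages also reach the log.

```shell
# Example cron entry; 2>&1 sends error messages to the log as well.
CRON_LINE='* * * * *  /bin/sh /bin/testsubnet.sh >> /var/log/testsubnet.log 2>&1'

# add_cron_line LINE: copy a crontab from stdin to stdout without any
# existing identical copy of LINE, then append LINE exactly once.
add_cron_line() {
  grep -v -x -F -e "$1" || true   # grep exits 1 when no lines are left
  printf '%s\n' "$1"
}
```

Run as root: crontab -l 2>/dev/null | add_cron_line "$CRON_LINE" | crontab - (the "-" argument, meaning standard input, is accepted by the common Vixie/cronie crontab; check 'man crontab' on your platform).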

When using automatic failover, make sure that the script you use, whether your own or the one above, logs each failover to an appropriate location so that any problems can be examined. It is also recommended that your scripts send an email notification to an administrator whenever failover occurs, so that they can replace the failed interface or, at the very least, remove it from the list of redundant NICs.
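The logging and notification advice above can be sketched as a wrapper around the failover script. FAILOVER_CMD, FAILOVER_LOG, ADMIN and run_failover are hypothetical names, and the sketch assumes a mailx command that accepts -s SUBJECT RECIPIENT and reads the message body from standard input, as POSIX mailx does.

```shell
# Hypothetical settings: adjust the command, log file and recipient.
FAILOVER_CMD="sh /bin/nicfailover.sh"
FAILOVER_LOG=/var/log/nicfailover.log
ADMIN=root

# run_failover NIC...: run the failover script, append its output to the
# log file, and mail the same output (with the exit status) to ADMIN.
run_failover() {
  _out=$($FAILOVER_CMD "$@" 2>&1)
  _status=$?
  printf '%s\n' "$_out" >> "$FAILOVER_LOG"
  printf '%s\n' "$_out" | mailx -s "NIC failover on $(uname -n) (exit $_status)" "$ADMIN"
  return $_status
}
```

If the function is added to testsubnet.sh, that script could call run_failover $NICS instead of running nicfailover.sh directly.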

