Redundant Systems and D2000
High demands are often placed on SCADA and MES systems regarding availability, functionality, and uptime (ideally 24/7/365). Since every device has a limited lifespan and a non-zero failure rate, these factors must be accounted for. Redundancy increases overall system availability by keeping the system running even when one or more components fail.
The D2000 application server implements several types of redundancy:
1. Redundancy of application servers
2. Redundancy of networks
3. Redundancy of archiving
4. Redundancy of communication processes
5. Redundancy of other system processes
The sections below look at each type of redundancy in turn and describe its use and benefits for a customer.
1. Redundancy of application servers
Redundancy of application servers means running two or more application servers connected in a so-called redundant group. Every application server runs on a different physical computer. One of them is active (“HOT”); all the others are passive (“SBS”, standby servers). The standby servers receive changes in configuration and object values from the HOT server, so at every moment they hold current data and are ready to take over the role of the HOT server.
Activation of an SBS server happens either automatically, as a result of the HOT server’s failure, or when an administrator switches redundancy manually. Clients can be configured to reconnect automatically to the new HOT server.
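The takeover logic described above can be illustrated with a small sketch. This is a toy model, not D2000 code: the class names, the `write`, `fail` and `switch_to` methods, and the rule of promoting the first live standby are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    role: str = "SBS"          # "HOT" (active) or "SBS" (standby)
    alive: bool = True
    values: dict = field(default_factory=dict)


class RedundantGroup:
    """Toy model of a redundant group: one HOT server, the rest SBS."""

    def __init__(self, names):
        self.servers = [Server(n) for n in names]
        self.servers[0].role = "HOT"

    @property
    def hot(self):
        return next((s for s in self.servers if s.role == "HOT" and s.alive), None)

    def write(self, key, value):
        """Apply a change on the HOT server and replicate it to live standbys."""
        if self.hot is None:
            raise RuntimeError("no HOT server available")
        for s in self.servers:
            if s.alive:
                s.values[key] = value

    def switch_to(self, name):
        """Manual switch of redundancy by an administrator."""
        target = next(s for s in self.servers if s.name == name)
        if not target.alive:
            raise ValueError(f"{name} is not running")
        for s in self.servers:
            s.role = "SBS"
        target.role = "HOT"

    def fail(self, name):
        """Simulate a failure; promote a standby if the HOT server failed."""
        failed = next(s for s in self.servers if s.name == name)
        failed.alive = False
        if failed.role == "HOT":
            failed.role = "SBS"
            # Assumption: the first live standby in the list is promoted.
            standby = next((s for s in self.servers if s.alive), None)
            if standby is not None:
                standby.role = "HOT"


group = RedundantGroup(["A", "B", "C"])
group.write("valve_1", "open")
group.fail("A")                     # automatic takeover by a standby
print(group.hot.name, group.hot.values)
```

Because every live standby already holds the replicated values, the promoted server can serve clients without reloading state, which is the point of keeping SBS servers synchronised.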
Besides emergency situations, application server redundancy can be used, for example, for uninterrupted operation during regular updates of Windows, D2000 and other software that is part of the application, or during firmware updates of servers. A gradual update (update and restart server A, set it as HOT, run it e.g. for a week, then update server B) makes it possible to verify that the update did not cause any problems. If it did, the system can be made functional immediately by switching to the not-yet-updated server B, and only then uninstalling the updates or restoring server A from backup.
Many customers have implemented HA (high availability) + DR (disaster recovery) solutions using 2+1 redundancy (two servers at the main location and one at a backup location). 1+1, 2+1, or 2+2 redundancy has also been used temporarily in application migration projects, whether to new servers in the same network or to a different network segment.
2. Redundancy of networks
Redundancy of networks means providing a double LAN for communication between clients and D2000 application servers. In this configuration, every client is connected to the D2000 server via two independent TCP connections. If the primary connection fails, unconfirmed data are sent again via the secondary connection, so there is neither data loss nor a failure of the ‘logical’ connection, only a ‘freeze’ of data lasting a few seconds. One network can be set as preferred on the client’s side, so whenever a connection via this network is functional, it is used (an example is a client at a location with a 100 Mbit LAN backed up by a VDSL line of several Mbit).
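The resend behaviour can be sketched as follows. This is a simplified simulation under stated assumptions, not the D2000 wire protocol: the `Link`, `Receiver` and `RedundantSender` classes are hypothetical, confirmation is modelled as the return value of `send`, and sequence numbers are used to discard duplicates.

```python
from collections import OrderedDict


class Link:
    """A simulated network path; `up` toggles whether delivery succeeds."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def send(self, receiver, seq, payload):
        if not self.up:
            return False               # no confirmation arrives
        receiver.deliver(seq, payload)
        return True                    # stands in for the receiver's confirmation


class Receiver:
    def __init__(self):
        self.received = OrderedDict()

    def deliver(self, seq, payload):
        # Duplicates are ignored, so a resend never applies data twice.
        self.received.setdefault(seq, payload)


class RedundantSender:
    def __init__(self, preferred, backup, receiver):
        self.links = [preferred, backup]   # preferred link is always tried first
        self.receiver = receiver
        self.unconfirmed = OrderedDict()
        self.next_seq = 0

    def send(self, payload):
        self.unconfirmed[self.next_seq] = payload
        self.next_seq += 1
        self.flush()

    def flush(self):
        """Send every unconfirmed message, falling back to the backup link."""
        for seq, payload in list(self.unconfirmed.items()):
            for link in self.links:
                if link.send(self.receiver, seq, payload):
                    del self.unconfirmed[seq]
                    break


lan, vdsl = Link("LAN"), Link("VDSL")
rx = Receiver()
tx = RedundantSender(lan, vdsl, rx)

tx.send("v1")        # delivered over the preferred LAN
lan.up = False       # primary connection fails
tx.send("v2")        # unconfirmed on LAN, resent over VDSL
print(list(rx.received.values()))
```

Keeping messages in `unconfirmed` until a confirmation arrives is what allows the logical connection to survive the loss of one path.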
Network redundancy is available in the basic license of D2000. As with server redundancy, it can be used for maintenance of network components without outages (patching or replacing switches/routers, replacing cabling, power supplies for network components and so on).
A partial alternative for a LAN is ‘duplication’ of networks on the physical level while, on the logical level, there is still only one network with one IP address; these technologies are known as teaming or bonding. Their advantage is that even an application without built-in network redundancy can use them. The advantages of network redundancy, on the other hand, are better diagnostics (even at the user level) and the possibility of completely separating network components (teaming/bonding requires switches in one network segment). Another advantage used in practice is that it can be deployed in WAN networks (two independent connections between distant locations using different technologies, for instance fibre optics and wireless transmission).
3. Redundancy of archiving
Archiving redundancy is used in most SCADA/MES applications since data are perceived as valuable assets. It is implemented by ‘doubling’ or multiplying archives and archive databases (referred to as archive instances 1, 2 and so on; an instance can be active or passive). Every archive stores archive data in its own database. However, only an active instance serves read requests from the application and clients. D2000 also supports load balancing: several instances can be active at once, so the overall read load is spread among them.
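A minimal sketch of this arrangement follows. It assumes a round-robin policy for distributing reads, which the text does not specify; the `ArchiveInstance` and `ArchiveGroup` classes are hypothetical and each database is modelled as a plain dictionary.

```python
from itertools import cycle


class ArchiveInstance:
    def __init__(self, number, active):
        self.number = number
        self.active = active
        self.database = {}      # each instance has its own archive database
        self.reads = 0

    def store(self, timestamp, value):
        self.database[timestamp] = value

    def read(self, timestamp):
        self.reads += 1
        return self.database.get(timestamp)


class ArchiveGroup:
    def __init__(self, instances):
        self.instances = instances
        active = [i for i in instances if i.active]
        if not active:
            raise ValueError("at least one instance must be active")
        # Assumption: reads rotate over the active instances in turn.
        self._readers = cycle(active)

    def store(self, timestamp, value):
        """Every instance, active or passive, archives the value."""
        for inst in self.instances:
            inst.store(timestamp, value)

    def read(self, timestamp):
        """Only active instances serve reads."""
        return next(self._readers).read(timestamp)


instances = [
    ArchiveInstance(1, active=True),
    ArchiveInstance(2, active=True),
    ArchiveInstance(3, active=False),
]
archives = ArchiveGroup(instances)
archives.store(100, 21.5)
results = [archives.read(100) for _ in range(4)]
print([i.reads for i in instances])
```

With a single active instance the same code degenerates to the plain active/passive case: all reads go to that one instance while the others only store data.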
In practice, an asymmetric configuration of archives is often used (especially for MES systems). Besides the archive database (which contains data of all archive objects according to the configured archiving depth), one of the archives also fills so-called depositories, databases with unlimited archiving depth used for long-term archiving. In this case, the archive with depositories is usually configured as preferred (it provides users with long-term data for analyses).
It is possible (and some customers do this) to use redundant archives that store data in different archive database engines (Microsoft SQL Server, Sybase SQL Anywhere, Oracle or PostgreSQL). Such a configuration is more resistant to an SQL server error that could affect two identical databases.
Similarly to the redundancy of application servers, archiving redundancy can be used to prevent data loss while updating Windows, D2000 or a database. Archives contain a mechanism for ‘patching’ holes that appeared during an outage or a controlled temporary shutdown: missing data is copied from the second instance.
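The idea of patching holes can be shown with a short sketch. It assumes that archives can be modelled as dictionaries mapping a timestamp to a value; the function name `patch_holes` and its parameters are hypothetical and do not describe the actual D2000 mechanism.

```python
def patch_holes(local, peer, start, end):
    """Copy samples that `peer` has and `local` lacks within [start, end].

    Both archives are modelled as dicts mapping a timestamp to a value.
    Returns the sorted list of timestamps that were filled in.
    """
    missing = sorted(t for t in peer if start <= t <= end and t not in local)
    for t in missing:
        local[t] = peer[t]
    return missing


# Instance 1 was shut down between t=10 and t=40 and missed two samples.
archive_1 = {0: 20.1, 10: 20.3, 40: 21.0}
archive_2 = {0: 20.1, 10: 20.3, 20: 20.6, 30: 20.8, 40: 21.0}

filled = patch_holes(archive_1, archive_2, start=10, end=40)
print(filled)
```

Restricting the copy to the outage interval keeps the operation cheap, since only the window in which the instance was down needs to be compared.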
4. Redundancy of communication processes
Similarly to archiving redundancy, communication processes can be ‘doubled’ or multiplied. Compared with the functionality of archives, the differences are as follows:
- An active instance performs communication with devices (reading, writing).
- The functionality of a passive instance depends on the particular protocol, and redundancy is not supported for every protocol. In some protocols (Modbus), the passive instance neither sends nor reads data. In others (IEC101), it works in passive mode (it reads and analyses responses to the requests of the active instance). In yet others (IEC104 client/server), the functionality is configurable, from complete passivity through reading values to sending values (optionally with an indication that they come from a passive instance).
If a particular protocol enables the passive instance to acquire data, it facilitates ‘bumpless’ switching of the active and passive instances, without the delay related to initiating a communication channel. An example is the OPC protocol, which requires defining OPC groups and then activating all OPC items that should be read (with thousands of items, this can take a few minutes, depending on hardware performance and load, and on network latency).
Some communication protocols implement redundancy features for particular applications. For example, the Microtel 700 protocol used for telemetry in the gas industry supports, besides the primary communication path (serial, or serial wrapped in UDP packets, for example using a Moxa NPort server), one or two backup Ethernet paths that communicate using UDP packets. They are activated after the loss of connection on the primary path or manually at the operator’s request.
The TG809 and IEC101 protocols, used for communication with the Slovak electric power dispatching, enable parallel communication with the main dispatching centre in Žilina and the backup dispatching centre in Bratislava, with every dispatching centre communicating via two independent communication lines. In the case of a serial line configuration, communication uses four serial ports corresponding to the four communication lines.
If Moxa NPort servers are used for communication, they can also be doubled (each of them can be on a different network segment and on a different power supply branch). In this case, up to 8 serial servers can be configured for communication via the 4 lines.
Such a configuration can withstand an outage of one network segment or one power supply branch without loss of communication via the primary or secondary line.
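The claim about outage resistance can be checked with a small model. The layout below is an assumption made for illustration: the line names, the segment and power branch labels "A" and "B", and the pairing of each segment with one power branch are hypothetical, not taken from a real deployment.

```python
from itertools import product

# Four communication lines: two per dispatching centre (names are illustrative).
LINES = ["main-1", "main-2", "backup-1", "backup-2"]

# Assumed layout: each line is reachable through two serial servers,
# one on network segment / power branch "A" and one on "B".
servers = [
    {"line": line, "segment": side, "power": side}
    for line, side in product(LINES, ["A", "B"])
]


def lines_up(servers, failed_segment=None, failed_power=None):
    """Return the set of lines that still have at least one usable server."""
    return {
        s["line"]
        for s in servers
        if s["segment"] != failed_segment and s["power"] != failed_power
    }


assert len(servers) == 8
for side in ("A", "B"):
    # A single segment or power branch outage leaves all four lines reachable.
    assert lines_up(servers, failed_segment=side) == set(LINES)
    assert lines_up(servers, failed_power=side) == set(LINES)
```

The same function also shows the limit of the design: in this layout, losing segment "A" and power branch "B" at the same time leaves no usable server for any line.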
5. Redundancy of other system processes
Similarly to communication and archive processes, other D2000 system processes can be made redundant as well (event, see Figure 8; dbmanager); however, the benefit is smaller here. The only exception is the EDA server, which provides access to the so-called EDA database used for archiving time series. Like the archive, the EDA server also supports load balancing; in this case, it means that several EDA servers write into one EDA database (whereas every archive has its own archive database). Load balancing enables effective use of the resources of multiple servers (CPU, RAM) and the creation of a global data cache available to all clients regardless of which EDA server they are connected to.
The redundancy types described for the D2000 application server are nothing new; they are technologies proven by years of production use and deployed in many SCADA and MES systems. One example is the control system of a transit gas pipeline dispatching centre set up in 2003, which combines 2+1 application server redundancy, 2+1 archive redundancy, network redundancy and redundancy of selected communications. It is still in operation, of course with several upgrades of D2000 and hardware.
Besides the types of redundancy mentioned, every physical component and resource must be considered individually when designing a control system. The consequences of an outage of each component and resource need to be analysed comprehensively, whether it is a network component (or connectivity to the surroundings or partners), the power supply (servers usually have several power supplies, which can, for example, be connected to independent UPS units), air-conditioning or something else. Designing disaster-recovery solutions, which anticipate an outage of a whole site (server room, building) and consider building a backup site, is a separate chapter…