SAP Solutions on VMware - Business Continuity
2011 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. This product is covered by one or more patents listed at http://www.vmware.com/download/patents.html. VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.
Contents
1. Introduction
2. SAP Distributed Architecture
3. Protection with VMware High Availability
4. SAP Central Services and VMware Fault Tolerance
5. Symantec ApplicationHA
6. Clustering in Virtual Machines
7. SAP High Availability Options and Uptime Discussion
8. VMware vCenter Site Recovery Manager
8.1 vCenter Site Recovery Manager Architecture
8.2 Executing Recovery Plans
8.3 Network Customization
8.4 Storage Array Replication
8.5 Reprotection and Failback
9. Conclusion
10. Resources
1. Introduction
Business continuity describes the processes and procedures an organization puts in place to make sure that essential functions can continue in case of unplanned downtime. Unplanned downtime refers to an outage in system availability due to infrastructure or software failure (server, storage, network, software or OS crash, and so on) or site disaster. SAP products and solutions provide mission-critical business processes that need to be highly available even in the event of a site disaster.

This document describes various high availability scenarios designed to protect the SAP single points of failure. The architectures are based on VMware high availability features (VMware Fault Tolerance and VMware High Availability), Symantec ApplicationHA (a partner solution that integrates with VMware HA and helps to bridge the gap between VMware HA and in-guest clustering), and third-party in-guest clustering software running in virtual machines. Factors that influence the final design choice are discussed. Finally, an SAP disaster recovery architecture based on VMware vCenter Site Recovery Manager is described. For background and more detail about these VMware functions and products, refer to the documents in Section 10, Resources.

Though planning for business continuity of SAP implementations is part of a system-wide strategy, this document does not cover high availability network and storage features. Consult the appropriate VMware and VMware partner guides for information on these topics.
Figure 1 depicts a typical scenario with the SAP database and Central Instance running in a single virtual machine protected by VMware HA, a configuration often found in existing installations. The table summarizes the features of this configuration.
Figure 1. VMware HA Configuration for SAP
Key Points
- 2-tier or 3-tier; application server virtual machines not shown.
- Protection against server failure.
- Auto restart of virtual machines.
- Startup scripts/service required to auto-start SAP/DB instances in guest OS.
- VMware HA easy to configure (VMware out-of-the-box).

Considerations
- No monitoring of application.
- DB unavailable during failover.
- No enqueue and message services during failover.
- Time to recover includes time to boot guest OS and restart the application.
VMware FT for ASCS Key Points
- Assumes 3-tier; application server virtual machines not shown.
- ASCS protected via VMware FT, DB protected via VMware HA.
- Protection against server failure.
- Continuous availability of Central Services.
- Easy to configure (VMware out-of-the-box).
- New secondary ASCS virtual machine automatically created after failover (assumes more ESX hosts are available).

Considerations
- No monitoring of application.
- VMware FT currently supports only virtual machines with 1 vCPU, which may not be enough for very large SAP systems.
- DB unavailable during failover.
- Separate NIC/network recommended for FT logging traffic.
The configuration shown above was installed in VMware labs, and a small-scale functional test was conducted to verify continuous availability of Central Services during failover of the ASCS virtual machine protected by VMware FT. The results are described in Figure 3.
Figure 3. Lab Results: VMware FT Test with Virtual Machine Running ASCS
Test Setup
- 2x ESX hosts running vSphere.
- 1x VM, 2 x vCPU, 8GB RAM running ECC 6.0: MSSQL database; Windows Server 2003 64-bit; dialog instance.
- 1x VM, 1 x vCPU, 1GB RAM running ASCS protected by VMware FT.
- NOTE: this setup was not intended or tuned for benchmarking.

Results
- 150 concurrent users with < 0.5 sec response time (users generated by SD Benchmark Kit).
- Successful completion of workload with no user or lock errors.
- Average CPU utilization of ASCS VM < 5%.
The VMware hardware partner competency centers for SAP can provide further guidelines for sizing this distributed architecture. For technical details of VMware FT, see the document Protecting Mission-Critical Workloads with VMware Fault Tolerance (http://www.vmware.com/files/pdf/resources/ft_virtualization_wp.pdf).

When deploying SAP Central Services standalone in a virtual machine, note the following:
- A Linux-based guest OS is supported by SAP and there are no caveats.
- For a Windows-based guest OS, see SAP note 1609304, Installing a standalone ASCS instance. To obtain support on Windows for a standalone deployment, follow these guidelines:
  o Use a sapinst that allows installation of standalone Central Services (available from NetWeaver 7.3 but also possible with some earlier versions).
  o Take care of RFC destinations that point to the virtual hostname of the Central Services by maintaining RFC group destinations or implementing a standalone gateway.
  o In case of an upgrade, choose the correct upgrade tools (if you need advice, open an SAP message under support component BC-UPG).
  o For clarification, open an SAP ticket under support component BC-OP-NT-ESX before proceeding with an installation.
5. Symantec ApplicationHA
Symantec ApplicationHA is an agent-based solution that integrates with VMware vCenter Server to provide application monitoring and management from the vSphere Client. With vSphere 4.1, VMware introduced an application programming interface (API) that allows third-party vendors to integrate with VMware HA. Symantec uses this API as the basis for ApplicationHA. Using the Veritas Cluster Server framework, ApplicationHA runs inside the guest operating system to monitor and protect the applications.

Symantec ApplicationHA consists of the following components:
- Guest Component: Installed within the virtual machine running the application to be protected. It provides start and stop capabilities for the application or resource via agents.
- ApplicationHA Console: Provides the interface between the Guest Component and vCenter Server. The ApplicationHA Console is installed on a dedicated virtual or physical machine and is responsible for relaying application heartbeat information from the Guest Component to VMware HA, as well as providing vCenter Server with application health status.
- vSphere Client plug-in: Enables administrators to view the status of a monitored application and make basic configuration changes such as starting and stopping an application, placing the monitoring component in maintenance mode, and reconfiguring application monitoring.

Agents exist for monitoring the SAP single points of failure. The following table lists the supported configurations. For more details and updates on the supported agents, consult http://go.symantec.com/applicationha/. For the most recent list of supported SAP and DB versions, see https://sort.symantec.com/agents.

The application agent runs a utility to verify the status of the instance (for example, central instance or database). The agent detects an application failure if the monitoring routine reports that the instance processes are not functioning properly. When such a failure occurs, the ApplicationHA agent for SAP tries to restart the instance. If the restart fails, a virtual machine reboot is triggered.

The following figure shows a screenshot of the Symantec ApplicationHA plug-in in vCenter. This example shows the SAP instance being monitored. The key points of this solution are summarized in the table following the diagram.
Figure 5. Symantec ApplicationHA Screenshot Example
Key Points
- Builds upon VMware HA to allow for application-level awareness to improve application availability.
- Does not impede the functionality of VMware features such as DRS and vMotion.
- Application monitoring and management through a single pane of glass using the vCenter plug-in.
- Application and dependency awareness allows for graceful startup and shutdown.
- Less complex than clustering to set up.

Considerations
- Recovery time depends on the time it takes to restart the service or processes.
- If VMware HA is invoked, downtime is incurred for the amount of time it takes to boot the guest OS and start the application.
- Not supported for use with FT-protected virtual machines.
- Check with Symantec for the latest list of agents (https://sort.symantec.com/agents).
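The escalation behavior described in this section (monitor the instance, try an in-guest restart on failure, and trigger a virtual machine reboot if the restart does not help) can be sketched as follows. This is only an illustration of the control flow: the function name, the callbacks, and the `max_restarts` parameter are hypothetical and are not part of the Symantec ApplicationHA or VMware HA interfaces.

```python
from typing import Callable


def handle_instance_failure(
    is_healthy: Callable[[], bool],
    restart_instance: Callable[[], None],
    reboot_vm: Callable[[], None],
    max_restarts: int = 1,  # hypothetical setting, not a documented default
) -> str:
    """Return the action that resolved (or escalated) the failure."""
    if is_healthy():
        return "healthy"
    for _ in range(max_restarts):
        restart_instance()      # first try to restart the SAP/DB instance
        if is_healthy():
            return "restarted"
    reboot_vm()                 # escalate: request a virtual machine reboot
    return "vm_reboot"
```

In this sketch the recovery time is short when the restart succeeds and grows to a full guest OS boot plus application start when the reboot path is taken, which matches the considerations listed above.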
Requires RDM, cannot vMotion clustered virtual machine. VMware guide available Setup for Failover Clustering and Microsoft Cluster Service.
YES
YES
NO
For iSCSI and FC SAN, requires RDM; cannot use vMotion to migrate clustered virtual machines. http://www.symantec.com/connect/articles/clustering-configurations-supported-vcs-vsphere
For VMFS, need to use the multi-writer flag (see VMware KB article 1034165). Enables vMotion. http://www.cc-dresden.de/en/whitepaper
YES
YES
YES
YES
YES
Supported by Red Hat from 5.7 and later. For VMFS, need to use the multi-writer flag (see VMware KB article 1034165). Enables vMotion.
Supported by Oracle from 11.2.0.2 and later as per MyOracleSupport, Document ID #249212.1. For VMFS, need to use the multi-writer flag (see VMware KB article 1034165). Enables vMotion.
YES
YES
With the Linux clustering solutions mentioned above it is possible to use VMFS, which requires implementing VMware KB article 1034165, Disabling simultaneous write protection provided by VMFS using the multi-writer flag (kb.vmware.com/kb/1034165). VMFS is a clustered file system that, by default, prevents multiple virtual machines from opening and writing to the same virtual disk (VMDK file). This stops more than one virtual machine from inadvertently accessing the same VMDK file. The multi-writer option allows VMFS-backed disks to be shared by multiple virtual machines, that is, by two different virtual machines acting as two nodes of an in-guest cluster solution.
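As an illustration of what the multi-writer configuration looks like, the sketch below generates per-disk entries of the form `scsiX:Y.sharing = "multi-writer"` for the shared disks of a cluster node. The entry format is our reading of KB article 1034165 and should be verified against the KB before use; the helper name and the disk layout are hypothetical.

```python
def multi_writer_entries(shared_disks):
    """Build configuration lines for each shared (controller, unit) pair.

    Assumption: the per-disk setting has the form
    scsi<controller>:<unit>.sharing = "multi-writer" (check KB 1034165).
    """
    lines = []
    for controller, unit in shared_disks:
        lines.append(f'scsi{controller}:{unit}.sharing = "multi-writer"')
    return lines


if __name__ == "__main__":
    # Hypothetical layout: two shared data disks on SCSI controller 1.
    for line in multi_writer_entries([(1, 0), (1, 1)]):
        print(line)
```

The same entries would be needed on every virtual machine that shares the disk, since each cluster node opens the same VMDK file.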
The installation of SAP with cluster software using the SAP installer "sapinst" follows the same process as on a physical system. Each cluster node is a virtual machine, and the resulting architecture is similar to that described in the SAP installation guides, as shown in Figure 5 ("REP ENQ" in the diagram stands for replicated enqueue server). The table following the diagram outlines the features of this configuration.

Under normal operation, the SAP Central Services run on one node (virtual machine) and the database runs on the other node of the cluster. If one of the nodes fails, the affected Central Services or database instance is automatically moved to the other node, minimizing downtime.

The enqueue replication server contains a replica of the lock table (replication table) and behaves exactly the same way as in physical implementations. In normal operation the enqueue replication server is always active on the virtual machine where the ASCS is not running. If the enqueue server fails on the first node of a two-node cluster, it fails over to the second node and starts there. It retrieves the data from the replication table on that node and writes it to its lock table. The enqueue replication server on the second node then becomes inactive.

The following figure depicts an SAP cluster solution with two virtual machines. In this example VMFS is shown (valid for the Linux-based solutions), but RDM can also be used.
Figure 5. SAP and Cluster Configuration in Virtual Machines
Cluster S/W in VMs Key Points
- Assumes 3-tier; application server virtual machines not shown.
- Protected via cluster agents for DB, ASCS, and replicated enqueue.
- Protection against server failure plus monitoring of DB and ASCS.
- Auto-restart of SAP services.
- Continuous availability of SAP locks due to replicated enqueue.
- No virtual machine and guest OS boot required during failover.
- Planned downtime for guest OS and database patching can be minimized by evacuating cluster resources to the other node.

Considerations
- DB and message service unavailable during failover.
- Time to recover depends on time to restart the application.
- Cluster skills required, more complex setup.
- For cluster software that requires RDMs, no migration via vMotion is possible and ESX host maintenance causes downtime (manual failover of service required).
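The replicated enqueue behavior described in this section can be illustrated with a toy model. It is a deliberately simplified sketch, not SAP code: the class, the node names, and the dictionary-based lock table are all hypothetical, and real replication is performed by the SAP enqueue replication server.

```python
class EnqueueCluster:
    """Toy model of a two-node cluster with a replicated enqueue server."""

    def __init__(self):
        self.active_node = "node1"     # node running the ASCS enqueue server
        self.replica_node = "node2"    # node running the replication server
        self.lock_table = {}           # locks held by the enqueue server
        self.replication_table = {}    # replica kept on the other node

    def acquire(self, obj, owner):
        """Grant a lock if the object is free and replicate the entry."""
        if obj in self.lock_table:
            return False
        self.lock_table[obj] = owner
        if self.replica_node is not None:
            self.replication_table[obj] = owner
        return True

    def fail_active_node(self):
        """Simulate loss of the node that runs the enqueue server."""
        self.lock_table = {}                       # in-memory locks are lost
        self.active_node = self.replica_node       # enqueue starts on node 2
        self.lock_table = dict(self.replication_table)  # rebuild from replica
        self.replica_node = None                   # replica becomes inactive
```

After `fail_active_node()`, locks acquired before the failure are still held, which is the property summarized above as continuous availability of SAP locks.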
YES
LOW
YES
YES
NO
MEDIUM
In-guest Clustering
YES
YES
YES
HIGH
When evaluating the best option, consider the following:
- What is the customer's Service Level Agreement (SLA) with respect to uptime/downtime, or how much downtime is the business willing to tolerate?
  o Can the business accept some regular planned downtime for patching the guest OS and database? If yes, there may be no need for rolling patch upgrades.
- All scenarios provide protection against hardware failures. A big difference is the ability to monitor the health of the database/Central Services. Therefore, what is the impact of application awareness?
  o In past experience running SAP applications, how often has only the database or Central Services component failed at the application level in a way that required automatic restart (situations where hardware was not the cause of failure)?
  o In some environments, Operations may not want automatic restart of the application in the event of only an application error. Instead, immediate notification and manual intervention may be preferred to determine the root cause of the problem.
  o What type of application errors need to be monitored, and can the clustering agent detect such events? For example, corruption of database objects may not be detected by the cluster agent. Work with the cluster vendor to determine application error detection methods.
  o In some cases application error detection can occur when a cluster node is incorrectly patched; that is, the maintenance of the cluster software itself can lead to unexpected downtime unless strict promote-to-production testing methods are deployed.
- What is the business cost/uptime trade-off? Clustering is the more expensive solution, and VMware built-in features are the more cost-efficient solution. Therefore, the willingness of the business to incur additional costs for increased levels of detection is a key consideration.

It is common in SAP datacenters to virtualize non-production SAP systems first. Such deployments typically do not require in-guest clustering, and satisfy their SLA requirements with VMware HA and vMotion. Actual uptime data from these installations can indicate whether the same configuration is acceptable for production SAP deployments.

The final design choice depends on how much downtime a business can realistically tolerate, and on the cost it is willing to invest in the extra resources and skills to install and operate software that provides application monitoring. It is a trade-off.
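To make the SLA question concrete, an availability target can be converted into an annual downtime budget and compared with the expected duration of a single recovery. The sketch below shows the arithmetic; the function names and the example outage duration are illustrative assumptions, not measured values from this document.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def outages_within_budget(availability_percent: float,
                          minutes_per_outage: float) -> int:
    """How many outages of a given length fit into the annual budget."""
    budget_minutes = allowed_downtime_hours(availability_percent) * 60
    return int(budget_minutes // minutes_per_outage)


if __name__ == "__main__":
    # Hypothetical example: 99.9% target, 10 minutes per restart-based recovery.
    print(round(allowed_downtime_hours(99.9), 2))
    print(outages_within_budget(99.9, 10))
```

A 99.9% target corresponds to roughly 8.76 hours of downtime per year, so whether recovery by guest OS reboot is acceptable depends on how often failures are expected and how long each restart actually takes in the environment.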
8.1 vCenter Site Recovery Manager Architecture
An SAP landscape can consist of a considerable number of separate systems to host the multiple SAP products, each with separate production and non-production systems. In the production environment, multiple SAP systems typically interface with a myriad of third-party bolt-on applications. In addition, the multi-tier architecture of SAP NetWeaver may result in separate tiers of application and database servers. A fully virtualized SAP environment results in numerous virtual machines with data interfaces/flows between them. Such a volume of virtual machines can be managed with the workflow features of SRM, which process the correct sequence and order of recovery of virtual machines after a site failure.

Figure 2 shows the architecture of a deployment of a virtualized SAP landscape with SRM. In this example, production SAP systems are replicated from the protected site to a recovery site. Each site hosts a separate storage array. Customer-specific business requirements determine whether non-production systems also need to be replicated and protected against site failure. Other non-production SAP systems are hosted at the protected site. For simplicity, the SAP landscape is depicted logically by three SAP systems, connected via interfaces to demonstrate that business processes can traverse separate systems (an actual landscape would have more SAP systems and third-party bolt-on applications).
Figure 2. Example Deployment of SAP Landscape with Site Recovery Manager
Overview of the architecture:
- ESX/ESXi hosts at the recovery site run some non-production systems to maximize resource usage. These servers do not need to be idle.
- Another scenario (not shown here) based on two-way storage array replication is feasible with SRM, whereby production systems are split between the two sites and each site acts as the failover site for the other.
- An SRM server is installed at both the protected and the recovery site. Each site is managed by its own vCenter Server.
- The SRM server operates as an extension to vCenter Server, and the SRM user interface installs as a vSphere Client plug-in.
- A storage array from a certified vendor is required, with an adapter that integrates with Site Recovery Manager. See Section 10, Resources, for a list of certified storage products.
- Storage array replication needs to be correctly installed and configured for Site Recovery Manager to operate. This should follow the same process as in physical environments. Site Recovery Manager automatically detects the replicated LUNs that contain virtual machines.
- The protected and recovery sites should be connected by a reliable IP network. Storage arrays might have additional network requirements for replication.
- The SRM servers at both sites communicate with each other during normal operations.
- On the protected site, production virtual machines are replicated via storage array replication. On the recovery site (storage array B), the replicated LUNs are not visible to the ESX hosts.
8.2 Executing Recovery Plans
Protection groups are created on the protected site. A protection group is a collection of virtual machines that all use the same set of replicated LUNs and fail over together. Recovery plans are created at the recovery site from the protection groups. The recovery plan is essentially an automated runbook that consists of a set of steps that control what happens during a failover. The recovery plan determines the order of production virtual machine startup during a failover and can also suspend non-production virtual machines already running at the recovery site. Enough server resources are required at the recovery site to run the production systems, as well as any non-production systems that must also run per business requirements (otherwise, non-production systems can be suspended by the recovery plan). Callouts to custom scripts can be included in the recovery plan for customer-specific requirements. The SAP application can be configured to auto-start after a guest OS boot within the virtual machine. The execution of recovery plans enables customers to achieve a faster RTO.

A recovery plan can be executed in one of two modes:
- Actual failover: Array replication is halted and the replicated LUNs on the recovery site are enabled for read and write access. SRM initiates the power-up of the virtual machines at the recovery site according to the startup order in the recovery plan. SRM does not automatically detect a site disaster; recovery has to be started manually via the SRM user interface at the recovery site.
- Test failover: The replicated LUNs on the recovery site remain unavailable to the ESX hosts. They are copied using storage array snapshot functionality, and these snapshot LUNs are presented to the ESX hosts. The snapshot is reasonably quick because data is not duplicated (this is part of the storage array feature). The production virtual machines are started according to the recovery plan and can then be user tested.

After testing is complete, a manual step is performed via the SRM user interface to continue. This step stops the production virtual machines and removes the storage array snapshot. Any suspended non-production virtual machines are then started again on the recovery site. During this test cycle the replicated LUNs are still being refreshed per the storage array replication schedule, and production systems continue to function normally on the protected site. The recovery plan test simulates an actual recovery because it performs the same sequence of actions to recover the production SAP systems. It can be run as frequently as required, and the ability to test a disaster recovery plan on demand is a considerable business benefit, for example to satisfy auditing requirements.
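The idea of a recovery plan as an automated runbook can be sketched with a small model. This is not the SRM API: the classes, the `priority` field, and the startup order used in the usage note are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    production: bool
    priority: int = 0      # assumption: lower number starts earlier
    state: str = "off"


@dataclass
class RecoveryPlan:
    protected_vms: list        # production VMs from the protection group
    recovery_site_vms: list    # VMs already running at the recovery site
    log: list = field(default_factory=list)

    def run(self, test: bool = False):
        # Step 1: free capacity by suspending non-production VMs.
        for vm in self.recovery_site_vms:
            if not vm.production and vm.state == "on":
                vm.state = "suspended"
                self.log.append(f"suspend {vm.name}")
        # Step 2: present storage (snapshot copy for a test, replica otherwise).
        self.log.append("present snapshot LUNs" if test
                        else "enable replicated LUNs for read/write")
        # Step 3: power on production VMs in the configured order.
        for vm in sorted(self.protected_vms, key=lambda v: v.priority):
            vm.state = "on"
            self.log.append(f"power on {vm.name}")
        return self.log
```

A plan built with, for example, a database VM at priority 1, the ASCS VM at priority 2, and application servers at priority 3 would start them in that order in both test and actual mode; only the storage step differs, which is why a test run is a faithful rehearsal of a real failover.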
8.3 Network Customization
Typically there are separate networks at the protected and recovery sites. Though the networks should be connected via routers, the subnets and IP addresses differ between the locations. Therefore, when performing a site failover, IT administrators can be faced with the following challenges on the recovery site:
- Network properties of the production virtual machines need to be customized according to the network specification of the recovery site.
- Domain Name System (DNS) records pertaining to these virtual machines need to be updated.

After failover to a disparate network, the network properties of virtual machines such as IP addresses, gateway, and DNS domain all need to change for the virtual machines to return to a functional state. SRM addresses this at the recovery site via the following features:
- Customization Specification Manager: This allows administrators to create a custom network specification for each production virtual machine that is replicated from the protected site. Network properties (IP address, gateway, and so on) can be assigned to the virtual machine so that when it starts up in a recovery plan it functions correctly on the recovery site network. The hostname of the guest OS in the virtual machine needs to remain the same so as not to impact the SAP application (installed SAP instance files contain the hostname of the OS in various configuration and startup files, but the IP address is not hard coded in the files).
- DNS updates: After the IP addresses of virtual machines have changed, the DNS records of the virtual machines need updating.

These features are covered in detail in the document Automating Network Setting Changes and DNS Updates on Recovery Site Using VMware vCenter Site Recovery Manager (http://communities.vmware.com/docs/DOC-11516).
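The network customization step can be illustrated with a small sketch that derives the recovery-site settings for one virtual machine. It assumes, purely for illustration, that the host part of the address is preserved and only the subnet changes; the function names, subnets, and dictionary layout are hypothetical and do not represent the SRM customization specification format.

```python
import ipaddress


def remap_ip(ip: str, protected_subnet: str, recovery_subnet: str) -> str:
    """Keep the host part of the address and swap the network part."""
    src = ipaddress.ip_network(protected_subnet)
    dst = ipaddress.ip_network(recovery_subnet)
    addr = ipaddress.ip_address(ip)
    if addr not in src:
        raise ValueError(f"{ip} is not in {protected_subnet}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("subnets must have the same prefix length")
    offset = int(addr) - int(src.network_address)
    return str(dst.network_address + offset)


def recovery_spec(hostname: str, ip: str, protected_subnet: str,
                  recovery_subnet: str, gateway: str, dns_servers: list) -> dict:
    """Build per-VM network settings for the recovery site."""
    return {
        "hostname": hostname,  # unchanged: SAP configuration files reference it
        "ip": remap_ip(ip, protected_subnet, recovery_subnet),
        "gateway": gateway,
        "dns_servers": dns_servers,
    }
```

Because the hostname stays fixed and only the address changes, the remaining task after failover is to point the DNS record for that hostname at the new IP address.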
8.4 Storage Array Replication
The SRM solution requires storage array tools to replicate the LUNs from the protected site to the recovery site. Storage array replication needs to be installed and configured in the same manner as in physical environments, and administrators should follow guidelines from their storage vendor. Similarly, the SAP database LUN layout on the storage array should follow the same recommendations as for physical environments. The major storage array vendors have SAP practices that have developed best practice guidelines for LUN layouts of SAP databases and for how they should be replicated between separate sites in a disaster recovery scenario. The same guidelines should be followed with SRM. For example:
- Best practice for I/O performance requires that production database virtual machines do not share LUNs with other virtual machines.
- Where applicable, some storage array vendors may prefer the use of RDMs because they are compatible with their disaster recovery tools. In these cases the virtual machine guest OS drive (root or C:\) would be in VMFS format and the database data files would be RDM-based.

The RPO is managed by the storage array replication schedule. The frequency of replication, and the resulting cost with respect to bandwidth requirements over a long distance, is determined by the storage vendor specifications and is balanced against the business requirements.

Two broad replication methods that impact RPO are available from storage vendors. In both cases SRM does not manage the consistency of the SAP database during replication (quiescing of the database). This is addressed by the storage vendor technology or by separate procedures:
- Synchronous replication: Guarantees zero data loss, because a write either completes on both sides or not at all. The storage vendor technology typically guarantees consistency of a database that is spread across multiple LUNs.
- Asynchronous replication: A write is considered complete as soon as local storage acknowledges it. The remote storage is not guaranteed to have the current copy of the data. A potential scenario to guarantee database consistency in this situation involves putting the database into online backup mode before replicating, and replicating the database log files on a separate, more frequent schedule. Database recovery then involves starting the database and applying logs to roll the database forward. Such a process may be created manually or be part of tools/products from the storage vendor.
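The effect of the two replication methods on RPO can be shown with a simplified model. The sketch assumes that asynchronous replication ships all pending writes instantly at fixed cycle boundaries, which real arrays do not do exactly; the function name and parameters are hypothetical.

```python
def writes_lost_at_failure(write_times, failure_time, mode, interval=None):
    """Count locally acknowledged writes missing at the remote site.

    write_times: times (seconds) at which writes were acknowledged locally
    mode: "sync" or "async"
    interval: async replication cycle length in seconds (model assumption)
    """
    acknowledged = [t for t in write_times if t <= failure_time]
    if mode == "sync":
        return 0  # a write completes on both sides or not at all
    if mode == "async":
        if interval is None or interval <= 0:
            raise ValueError("async mode needs a positive interval")
        # Last completed replication cycle before the failure.
        last_cycle = (failure_time // interval) * interval
        return sum(1 for t in acknowledged if t > last_cycle)
    raise ValueError(f"unknown mode: {mode}")
```

In this model the worst-case data loss for asynchronous replication is bounded by one replication interval, which is why the replication schedule, and the more frequent log shipping described above, determines the achievable RPO.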
8.5 Reprotection and Failback
VMware vCenter Site Recovery Manager Version 5.0 is available as of Q3 2011. This provides additional features that can benefit SAP deployments, including reprotection and failback.
8.5.1 Reprotection
After a recovery plan or planned migration has run, there are often cases where the environment must continue to be protected against failure to ensure its resilience or to meet objectives for disaster recovery. Reprotection is an extension to recovery plans for use only with array-based replication. It enables the environment at the recovery site to establish synchronized replication and protection of the environment back to the original protected site. This enables automated failback to a primary site following a migration or failover.
8.5.2 Failback
An automated failback workflow can be run to return the entire environment to the primary site from the secondary site. This happens after reprotection has made sure that data replication and synchronization have been established to the original site. Failback runs the same workflow that was used to migrate the environment to the protected site. Failback results in:
- All virtual machines that were initially migrated to the recovery site are moved back to the primary site.
- Environments that require that disaster recovery testing be done with live environments with genuine migrations can be returned to their initial site.
- Failover can be performed in case of disaster or in case of planned migration.
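The relationship between failover, reprotection, and failback can be summarized as a small state model. This is a conceptual sketch only; the class, its attributes, and the site names are hypothetical and do not correspond to SRM objects.

```python
class SitePair:
    """Tracks where production runs and whether replication is established."""

    def __init__(self, primary="site A", secondary="site B"):
        self.primary = primary        # original protected site
        self.active = primary         # where production currently runs
        self.standby = secondary
        self.replicating = True       # replication from active to standby

    def failover(self):
        # Recovery plan run: production moves and replication is halted.
        self.active, self.standby = self.standby, self.active
        self.replicating = False

    def reprotect(self):
        # Re-establish replication from the new active site to the standby.
        self.replicating = True

    def failback(self):
        if self.active == self.primary:
            raise RuntimeError("production already runs at the primary site")
        if not self.replicating:
            raise RuntimeError("run reprotect before failback")
        self.failover()  # same workflow, opposite direction
```

In this model, failback is refused until reprotection has been run, and after failback the environment needs to be reprotected once more to restore the original direction of replication.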
9. Conclusion
SAP software solutions enable a variety of mission-critical business functions such as sales order entry, manufacturing, and accounting that depend on the availability of IT services. Failing to meet these business demands can be costly, which justifies an investment in infrastructure that is designed for high availability to protect against failures within the datacenter as well as against events that may cause a site disaster. Such failures result in unplanned downtime.

Architectural scenarios were described showing how the SAP SPOFs can be protected against hardware failure with VMware HA and FT, Symantec ApplicationHA, or clustering software in virtual machines. Though Symantec ApplicationHA and clustering require a more complex setup, they provide application awareness by checking the health of the SAP SPOFs. Note that all these high availability configurations require redundancy designed into other parts of the infrastructure (for example, network, storage, and power).

Designing a highly available SAP system on VMware vSphere requires a trade-off between the level of downtime that can be tolerated (which, in turn, has a business cost) and the complexity of the setup, which has a cost with respect to skills and IT resources. Organizations therefore need to determine their realistic requirements for availability.

The architectural deployment of an SAP landscape with VMware vCenter Site Recovery Manager was described, which provides an automated disaster recovery and testing solution for SAP landscapes. SRM enables on-demand and frequent testing of disaster recovery plans with no impact on the production systems. This can help to satisfy internal audits and business compliance requirements. SRM can help to achieve the disaster recovery RPO and RTO priorities of organizations running SAP applications. RTO is addressed by recovery plans that automate the sequence of virtual machine recovery at the remote site, including network reconfiguration. RPO is managed by the storage array, which controls the frequency of replication to the remote site and manages the consistency of data. A successful SRM deployment requires a solid partner approach between the customer, VMware, and the storage array vendor.
10. Resources
SAP Notes:
- 1609304 - Installing a standalone ASCS instance (Windows)
- 1374671 - High Availability in Virtual Environment on Windows
- 1552925 - Linux: High Availability Cluster Solutions

Protection of business-critical applications in SUSE Linux Enterprise environments virtualized with VMware vSphere 4 and SAP NetWeaver as an Example
http://www.cc-dresden.de/en/whitepaper/

http://www.vmware.com/files/pdf/VMwareHA_twp.pdf

Protecting Mission-Critical Workloads with VMware Fault Tolerance
http://www.vmware.com/files/pdf/resources/ft_virtualization_wp.pdf

Setup for Failover Clustering and Microsoft Cluster Service, ESXi 5.0, vCenter Server 5.0
http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50mscs-guide.pdf

Application Note: Clustering configurations supported for VCS with vSphere
http://www.symantec.com/connect/sites/default/files/Clustering_Conf_for_VCS_with_vSphere_0.pdf

What's New in VMware vCenter Site Recovery Manager 5.0
http://www.vmware.com/files/pdf/techpaper/Whats-New-VMware-vCenter-Site-Recovery-Manager50-Technical-Whitepaper.pdf

http://www.vmware.com/pdf/srm_storage_partners.pdf

Automating Network Setting Changes and DNS Updates on Recovery Site Using VMware vCenter Site Recovery Manager
http://communities.vmware.com/docs/DOC-11516