
How to Design and Implement Effective Disaster Recovery Strategies with Citrix Technology

Introduction

Over the last five years, we have witnessed some truly catastrophic events. The September 11th attacks and Hurricane Katrina both brought terrible devastation to people and places. Beyond that, they also had a severe impact on organizations and their ability to survive in times of crisis. Whole data centers were wiped out. People were ordered to evacuate, and could not remain behind to manage what was left. Massive damage was done to the infrastructures of both New York and New Orleans. Suddenly IT managers were faced with critical staffing decisions and the realization that they had no way to get their organizations up and running again. Disaster Recovery (DR) has become a fixture of the IT vocabulary.

This article isn’t going to walk you through DR for your entire company. Those types of exercises take years to put in place and a lot more space than I have to write. Instead we will discuss general DR best practices and then focus on a small chunk of your infrastructure, namely your Citrix environment. As we go through the environment, though, look at all the things you peripherally interact with just to keep it working. Good DR strategies don’t try to answer every what-if; they try to provide a roadmap for what to do next.

Define Your Criticality Levels

The first step is to define your levels of criticality. Worst case scenario dictates that your data center just burned down/blew up/flooded. Whatever the case, it is inaccessible. What kind of organization are you? Are you a financial institution, where every minute down costs money? Do you have a production line dependent on your applications, or shippers that need constant access to inventory numbers? All of these are critical components. Your organization needs to understand the criticality level of each component. Is email critical to your success, or can it wait 24 hours? Which is more important to your ongoing operations, the external web servers or your payroll system? These are decisions that only you can make.

Many organizations define classes of Business Criticality (BC). For instance, at my current company we define 5 levels of BC, with 5 being the lowest priority and 1 being the highest. Each level of BC has a different response level attached, and any application or infrastructure component brought into the environment is assigned a BC as part of the process. For instance, a BC of 5 means that we as an organization can live for up to a month without this piece. A BC of 1 means we cannot live more than 4 hours without it. Once you have a structure for your BC levels, it becomes easier to think about the individual components and the requirements for each of them.
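To make the idea concrete, here is a minimal sketch of a BC table in Python. Only the 4-hour limit for BC1 and the one-month limit for BC5 come from my description above; the limits for BC2 through BC4, the function names, and the sample inventory are assumptions made up for illustration, so substitute whatever your organization agrees on.

```python
from datetime import timedelta

# Maximum tolerable downtime per Business Criticality (BC) level.
# BC1 (4 hours) and BC5 (about a month) come from the text above;
# the BC2-BC4 values are assumptions made up for this illustration.
MAX_DOWNTIME = {
    1: timedelta(hours=4),
    2: timedelta(hours=24),  # assumed
    3: timedelta(days=3),    # assumed
    4: timedelta(days=7),    # assumed
    5: timedelta(days=30),
}


def restore_queue(components):
    """Sort (name, bc_level) pairs so the most critical items come first."""
    return sorted(components, key=lambda item: item[1])


def is_overdue(bc_level, downtime):
    """True if a component has been down longer than its BC level allows."""
    return downtime > MAX_DOWNTIME[bc_level]


if __name__ == "__main__":
    # Hypothetical inventory; the BC assignments are examples only.
    inventory = [("Payroll", 3), ("Email", 2), ("Network", 1), ("Power", 1)]
    for name, level in restore_queue(inventory):
        print(f"BC{level}: {name} (max downtime {MAX_DOWNTIME[level]})")
```

Keeping the table in one place means every new application gets its BC assigned against the same definitions.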

Evaluate Your Environment

Now that you have a starting point, begin your environment evaluation and assign the appropriate BC levels. Some will be obvious. Power is probably a BC1. You can’t do a whole lot without it. Network may very well be a BC1 as well. After all, if your servers can’t communicate, that might be a showstopper. How about your email environment? Is it as critical a component? Could you live 4 hours, 8 hours, 24 hours without email? Each component is going to have to be evaluated on its own merits and as a piece of the infrastructure puzzle. For instance, if an application server is a BC1 but depends on an Oracle database, then you can’t rate the database as a BC3; it has to be at least as critical as the application that needs it. It can be a very complicated web, especially with large and complex environments that have multiple 2- or 3-tier applications to restore.
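The dependency rule in that last example is easy to check mechanically once your inventory is written down. The following is a rough sketch, assuming you keep one dictionary of BC levels and another of dependencies; the function name and the sample data are hypothetical.

```python
def find_bc_violations(bc_levels, depends_on):
    """Return (component, dependency) pairs where the dependency is rated
    less critical (a higher BC number) than the component relying on it."""
    violations = []
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            if bc_levels[dependency] > bc_levels[component]:
                violations.append((component, dependency))
    return violations


if __name__ == "__main__":
    # Sample data mirroring the example above: a BC1 application server
    # that depends on a database someone has rated BC3.
    bc_levels = {"Application server": 1, "Oracle database": 3}
    depends_on = {"Application server": ["Oracle database"]}
    for component, dependency in find_bc_violations(bc_levels, depends_on):
        print(f"{dependency} must be at least as critical as {component}")
```

Any pair this reports is a place where the recovery plan would bring a component back before something it cannot run without.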

One trick is to divide your infrastructure into larger segments. Each segment is all of the applications/servers/whatever that need to interact with each other. You assign the BC to the entire chunk, and then define the recovery plan so that they are brought up in the order of criticality within that segment. In this case, not all BC1s are created equal! Your Accounting segment for instance could have 6 different systems, each with an overall BC1. But within the segment and when you are designing your recovery procedures, you know that the Oracle database has to be the first thing up. Then your PeopleSoft server, or SAP, or however you want to define that priority. This gives you a good method to make sure that each portion gets restored in a properly defined timeframe, and it also helps you understand the relationships between the pieces and the order in which they need to be restored, even within a single BC.

Obviously this is the worst-case scenario, where you have literally lost everything. In fact, most DR situations will involve only one or two applications that need to be brought back online. Still, if you have plans put together for the worst case then everything else tends to fall in line behind it. So now you have defined your BC levels, and have evaluated your current environment. If disaster strikes, where exactly will you restore?
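If you record which system depends on which, the restore order inside a segment can be derived instead of remembered. Here is a small sketch using the topological sorter from the Python standard library (Python 3.9 or later). The segment contents are illustrative: the database and PeopleSoft server come from the example above, while the reporting server is a hypothetical addition.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def restore_order(depends_on):
    """Return the systems in an order where every dependency is restored
    before the systems that rely on it."""
    return list(TopologicalSorter(depends_on).static_order())


# Each key maps a system to the systems it needs to have running first.
ACCOUNTING_SEGMENT = {
    "Oracle database": [],
    "PeopleSoft server": ["Oracle database"],
    "Reporting server": ["PeopleSoft server"],  # hypothetical extra system
}

if __name__ == "__main__":
    for step, system in enumerate(restore_order(ACCOUNTING_SEGMENT), start=1):
        print(f"{step}. {system}")
```

For this chain the database comes out first, followed by the PeopleSoft server and then the reporting server. A circular dependency raises `graphlib.CycleError`, which is worth finding out during planning and not during a restore.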

Picking a DR Site

Choosing your DR site can be an expensive proposition. For smaller organizations, there might not be a second site at all. Many will choose to try and restore at the main facility, or figure that if it is gone there is not much they can do anyway. Some organizations that are large enough to run multiple data centers will use them as failover sites for each other. A third choice is to pay one of the larger storage organizations such as Iron Mountain or SunGard to give you space in their facility to recover your environment in case of a disaster. Often these companies operate secure sites in other major metropolitan areas, and will charge you based on your requirements for space and storage. This can be an attractive option if you don’t have the resources for your own facility, but there are often time and space constraints at these types of places. You may face challenges in getting the environment configured to your specifications, and if you choose to bring in your own hardware for leased floor space, you will get little support from the facility managers.

For organizations large enough to maintain additional data centers already, planning for DR means considering what the added load will mean to each data center. Each facility will need enough spare capacity to absorb the anticipated spike in a DR situation. Obviously your infrastructure will have to support switching the network over to the new data center, clients will need to be rerouted, etc. Thankfully, with a good DNS infrastructure this task is a lot easier than it used to be. The last consideration is your restore mechanism.

How Do You Restore?

Obviously if a real disaster strikes, you may not have access to your production servers. That means your ability to restore will be based entirely on your backups, software library, and installation documentation. Regular, tested backups of key components make your restore at least feasible. For larger organizations these backups are usually kept in an off-site storage facility in case of a disaster. Whatever backup mechanism you use, you must have a method of restoring those backups at your DR facility. I have participated in several DR tests where the two sites used different backup tapes and neither was able to restore the other’s tapes. It seems ridiculous, but it happens all the time.

If your company follows the ITIL guidelines you might already have a software library implemented for all installed software. This can be a valuable tool in a DR situation, although a common roadblock is finding someone who remembers how the application was installed in the first place. Keeping a hard copy of installation media, instructions, and so on offsite is an overlooked but extremely important DR component.

Conclusion

This is simply a brief look at how to tackle DR in your organization. In the next part of this article, I will present a specific scenario for a Citrix environment and discuss how all of these DR steps apply. DR is a complicated, expensive, and often overlooked component of a stable infrastructure. In this day and age you really can’t be too careful with your DR strategy.

Designing and Implementing Effective Disaster Recovery Strategies with Citrix Technology Part 2: The Strategy

 

Introduction

In my last article, I discussed some basic steps to evaluate your environment and consider DR implications and solutions for it. Obviously this can be an incredibly difficult task, and no one article can encompass all of it. Hopefully, however, it provided you with some ideas about how to tackle your environment from a high-level view. In this article, we will look at a traditional Citrix environment and how to apply DR techniques to the critical components that your Citrix environment might contain.

Scenario

Let’s look at this from the point of view of a traditional Citrix environment using Presentation Server 4. We will take a farm with 5 PS4 application servers and an NFuse/Secure Gateway machine. The data store is hosted in a SQL Server 2000 environment along with the Resource Manager database. Installation Manager is installed and used to deploy applications when possible, but there are some manual installations you just have to do. The applications installed are SAP, Office, Email, and a home-grown database application. The applications are installed as follows:

  • ServerA: Office, Email
  • ServerB: Office, Email
  • ServerC: SAP
  • ServerD: SAP
  • ServerE: DB App

There are two datacenters available to use. Servers A, C, and E are hosted in DC1 along with the SQL server, license server, and NFuse/Secure Gateway box. Servers B and D are in DC2.

Do I want DR or Fault Tolerance?

Looking at the scenario above, the first question you have to ask is whether the solution you are looking for is Disaster Recovery or Fault Tolerance. There can be a considerable difference between the two goals. Fault Tolerance is the constant uptime of an environment for user access. In the scenario above, we are providing a Fault Tolerant solution for Office, Email, and SAP. Although there is no fail-over for user sessions, the application is load balanced and thus hosted on multiple servers. If one of the application servers goes down there is still availability to the application. This only takes into account the application server pieces however. Fault Tolerant solutions can and should be a part of your DR strategy for key applications, but they are not a true DR on their own.

Suppose in the above scenario that all the servers are hosted in the same datacenter. Something catastrophic happens, and your datacenter is completely down. That fault tolerant solution isn’t doing you a whole lot of good right now, is it? Take it back to the scenario… you are a smart admin and have the advantage of multiple data centers, so you have divided your servers between the locations. You now have fault tolerance and disaster recovery capability for those application servers. What about the rest of the environment? If your users are dependent on the Secure Gateway/NFuse box for access, you are going to face challenges in getting them reconnected.
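A quick way to see where the scenario is exposed is to model the server placement and ask what survives the loss of each datacenter. The sketch below encodes the layout exactly as described in the Scenario section; the function names are my own, and "survives" here only means that at least one server hosting the application is still standing.

```python
# Placement of the application servers from the Scenario section.
SERVERS = {
    "ServerA": {"dc": "DC1", "apps": {"Office", "Email"}},
    "ServerB": {"dc": "DC2", "apps": {"Office", "Email"}},
    "ServerC": {"dc": "DC1", "apps": {"SAP"}},
    "ServerD": {"dc": "DC2", "apps": {"SAP"}},
    "ServerE": {"dc": "DC1", "apps": {"DB App"}},
}

# Supporting components and the datacenter that hosts each of them.
INFRASTRUCTURE = {
    "SQL server": "DC1",
    "License server": "DC1",
    "NFuse/Secure Gateway": "DC1",
}


def surviving_apps(lost_dc):
    """Applications still hosted on at least one server outside lost_dc."""
    apps = set()
    for server in SERVERS.values():
        if server["dc"] != lost_dc:
            apps |= server["apps"]
    return apps


def lost_infrastructure(lost_dc):
    """Supporting components that go down together with lost_dc."""
    return sorted(name for name, dc in INFRASTRUCTURE.items() if dc == lost_dc)


if __name__ == "__main__":
    for dc in ("DC1", "DC2"):
        print(f"Lose {dc}: apps still hosted = {sorted(surviving_apps(dc))}, "
              f"infrastructure lost = {lost_infrastructure(dc)}")
```

Losing DC2 costs no applications, but losing DC1 takes out the DB App along with the SQL server, license server, and NFuse/Secure Gateway box, which is exactly why the application servers alone do not make a DR plan.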

Breaking Down the Components

Application Servers – We have already covered this a little bit above. The application servers have been split between the DCs to provide full fault tolerance and DR. If one datacenter should go down, your application servers are still available in the other for user connections, with the caveats discussed below. If you want to go a step further in providing disaster recovery, you can also maintain a set of machines in a DR location that are ONLY utilized in the case of a true DR situation. We currently do this for our own mission critical applications. Several Citrix servers sit in a hosted DR site. A second set of applications, identical to the originals, is published from those servers in a folder labeled DR in the user’s Program Neighborhood. These boxes are only accessed if a true DR is declared, and the users have instructions about when to switch to their DR folder application set.

The servers are an insurance policy. If they are never used in their lifecycle then it is probably a good thing. The problem with this configuration is the reliance on the data store replication. See the data store section for some issues with a true split DR environment and how you can look to deal with them.

Licensing Server – MPS3 and PS4 moved licensing out of the data store and to an actual license server process. It needs to be hosted on a machine running IIS, and your application servers will periodically query the box for a license status. Every time an application server is rebooted it connects to the license server and then caches the license count locally. This allows your license server to be down for up to 30 days before the application servers stop allowing connections, which is a welcome change from the old model. License files are downloaded from MyCitrix.com and imported into the License Server console. Unfortunately, license files are generated based on host name and will not work on a different license server without reclaiming them in MyCitrix and then reallocating them to the new server.

Citrix does suggest some methods for creating a DR environment for your license server. The first is to use MS Clustering on your web servers to provide High Availability. You would simply tie the license host name to the cluster service name, and you now have a High Availability environment. This will only help however if you experience a total server failure, since a network failure alone is not enough to trigger the Active to Passive failover. Having a clustered environment means that you will have to have a dedicated License Server. In a non-clustered environment you could host it on one of your existing boxes like the SQL server. And finally, it doesn’t really give you DR capability since the cluster has to be located in the same datacenter.

A second solution is to create backup license servers. This can simply be a clone of the production server that is kept offline until needed. Alternatively, you can build another identical box with a different name and then rename it and adjust your DNS if required. Honestly, given the 30-day grace period around the license server, it is one of the least critical components of the DR. The best alternative is to document your login to MyCitrix and have build instructions for recreating the license server and changing the farm settings so that it points to the new machine. It’s cheap and fairly easy.
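Because the grace period is what makes the license server low priority, it helps to put a date on it the moment the server is lost. This is a trivial sketch that assumes the 30-day figure quoted above and that the clock starts when the license server becomes unreachable; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Grace period quoted in the article; confirm it for your product version.
LICENSE_GRACE_PERIOD = timedelta(days=30)


def grace_period_ends(license_server_lost_at):
    """Date and time by which a license server must be back in service."""
    return license_server_lost_at + LICENSE_GRACE_PERIOD


def days_remaining(license_server_lost_at, now):
    """Whole days left in the grace period, never less than zero."""
    remaining = grace_period_ends(license_server_lost_at) - now
    return max(remaining.days, 0)


if __name__ == "__main__":
    lost_at = datetime(2008, 1, 1, 9, 0)  # example outage start
    print("Rebuild the license server before:", grace_period_ends(lost_at))
    print("Days left on 11 January:", days_remaining(lost_at, datetime(2008, 1, 11, 9, 0)))
```

Writing the deadline into the DR log keeps the rebuild from being forgotten while the more urgent components are restored.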

Data Store – One of the most complex issues around creating a true DR environment is how you handle your data store. These days, most administrators are choosing to host the data store on an external machine like a SQL Server 2000 server. Prior to MPS3, the loss of your data store was truly a critical event. You had 96 hours to recover the data store or no connections could be made. Since licensing has been separated from the data store and given its own grace period of 30 days, the criticality of the data store has decreased significantly. If your data store is down, configuration changes cannot be made to the farm environment. Your users, however, will see no impact on their application access, and you have in essence a static environment until you can bring the data store back up.

In larger environments, however, there might be a critical need for data store accessibility. As an example, without the ability to change the configuration you can’t point your servers to a new license server. For situations like this there are several options available. If a High Availability solution is sought, then clustering the OS and SQL environments is a very valid tactic. It’s also an expensive one! For multi-site fault tolerance and DR, SQL replication can be established. This is what we chose to do with our hosted DR environment. The live SQL data store is replicated to our hosted DR environment where the DR Citrix servers sit. Those DR machines must point to the live data store (replication is a one-way street, and your data would be out of sync otherwise) and be switched to the failover SQL server in case of a DR. The replica server has to be promoted to be the live server to take it out of read-only mode.

Conclusion

For smaller environments that use a local data store, it is important that you regularly create backup copies of your data store using the dsmaint backup command. This will create a flat file that can be backed up and restored using whatever backup methodology you have in place. In the final part of this article series, we will address restoring the data store files and DR strategies for the rest of our Citrix environment. We will also look at alternative strategies for providing application access during downtime periods. So stay tuned!
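A backup only helps if it is recent, so it is worth checking the age of the newest backup file on a schedule. The following is a minimal sketch under a few assumptions: the backup files land in a single directory, file modification time is a fair proxy for when the backup ran, and a one-day threshold suits your environment. The function names and the threshold are hypothetical and not part of any Citrix tooling.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def newest_backup_age_days(backup_dir, now=None):
    """Age in days of the most recently modified file in backup_dir,
    or None if the directory contains no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / SECONDS_PER_DAY


def backup_is_stale(backup_dir, max_age_days=1.0, now=None):
    """True if there is no backup at all or the newest one is too old."""
    age = newest_backup_age_days(backup_dir, now)
    return age is None or age > max_age_days
```

Run something like `backup_is_stale(r"D:\Backups\DataStore")` from a scheduled task (the path is an example) and raise an alert when it returns `True`.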

Introduction

In my last article we looked at recovery strategies for specific components of the Citrix environment. When designing environments for business-critical applications, it is important to consider both high availability and disaster recovery options. Once you have identified your concerns and created appropriate recovery strategies for the core components of your server farm, it is critical to make sure that you have also covered the additional requirements for a Citrix farm to function. In this final part of the series we will discuss backup and recovery strategies for your data store.

Data Store Restoration

Honestly, you could write a five-part series just on how to back up or restore your data store. As I mentioned in the last article, the criticality of your data store has been significantly decreased since Citrix separated the licensing component into its own service. That’s not to say the data store isn’t important though! After all, if your data store is truly lost you will be forced to completely recreate your farm environment. And trust me, that is not a fun way to spend your weekend! So what can you do to minimize your risks?

Well, let’s start by addressing the scenario as it was presented in the last article. The data store is hosted in a SQL Server 2000 environment along with the Resource Manager database. Obviously this is a more complex situation than a simple Access data store. I can’t tell you how important your data store is to you. It is honestly an individual choice that each organization will have to make in addressing its goals. Let’s look at two common requirements that managers like to give to their poor overworked admins:

The data store must ALWAYS be available

Frankly, you can’t really guarantee this one. You can come darn close though! For starters, you will likely want to cluster your SQL environment. On Windows Server 2003 Enterprise Edition, installing SQL Server 2000 with SP3 in a clustered environment is actually a fairly painless process. The limitations of Microsoft clustering remain in place, however. Because the cluster will be Active/Passive, one server will sit unused unless the primary is brought offline. Additionally, failover will only occur under conditions that the cluster service recognizes as a failure. And because of the requirements for shared Quorum and Data space, it becomes very difficult to locate clustered servers any real distance apart for a true DR situation.

The clustered environment gives you fault tolerance at a local level. For true disaster protection, however, you need to have some plan for offsite recovery. This could mean hot hardware at a standby site and SQL replication for the data store databases. Log shipping is also an option with SQL, but with some important caveats. If constant uptime is a concern, log shipping can cause a significant recovery time depending on the number of transactions in your database. When you are using log shipping as a means of backup and recovery, the database has to process those logs in recovery mode before it can become operational. If you haven’t refreshed your database in a year and have just been shipping those logs down to your DR site, then be prepared for a LOT of pain when you run that recovery.

So you’ve got your cluster, you’ve got your replication or log shipping… you’re safe, right? Well, probably. But what if (and I know I’m going crazy here) your DR server fails? Never happen, right? That’s what I thought. But a funny thing happens when you leave DR servers sitting for a year without really watching them. They tend to fail, and you often don’t realize it until it’s time for you to need them. So what can you do? Well, you cluster your DR environment of course! So you now have your live cluster, your DR cluster, and probably SQL replication between the live and DR database servers. I admit it is a crazy situation. You’ve got 2 live servers, 2 standby servers, and lots of redundancy. All for your Citrix data store!

I can’t say I honestly recommend this situation. In my case, the database server happened to be hosting several other critical databases and my little data store got swept up in the process. I will admit, though, that I now have absolutely no fear of any kind of data store failure! It is important to understand, however, that this environment STILL doesn’t fulfill management’s requirement that the data store be always available. In a replicated environment you still have to force the replicated SQL server to take over as the live environment. You will also have to modify the DSN on each server to point to the new SQL server unless you utilize some form of DNS aliasing. Setting up SQL replication is an extremely difficult task unless you are an accomplished SQL admin. Citrix has provided VERY thorough documentation on their website for setting up that replication, but it still requires a good understanding of SQL Server to execute. I do not recommend setting up replication without help from your DBAs if you are not a SQL person yourself.

The Data Store isn’t as Important as User Connectivity!

This is the other common attitude managers take, and it is often a valid one. You can run for quite some time without the data store, after all. Sure, your options are severely limited when it comes to managing your Citrix environment. But as long as your users can carry on, all is right with the world! Well, maybe not. User connectivity can and should be the highest priority for any Citrix farm, but that shouldn’t override your ability to manage your own farm. This attitude can be a difficult one to counter, however. Generally this is the type of environment where you’re just going to run flat backups of the database and use them to restore once you have the time. This could be to the same hardware, to a DR site, etc. Flat backups of the database can be scheduled and managed through SQL Server Enterprise Manager. You want to make sure you back up all the relevant databases, such as master, as well as the data store itself. Otherwise you will not be able to successfully restore the database.

Once you have restored the database from the flat files, you are still faced with the same requirements as above. You may need to make those same DSN changes or use DNS aliasing to make the connection work. You will also likely need to cycle the IMA service on all of your boxes to make sure they can connect to the new data store copy. Although the servers can be run for quite some time without a data store, it really isn’t a good way to leave your environment. Monitoring becomes a logistical nightmare. Performing common tasks like user session manipulation is close to impossible. And if something further goes wrong it might take a very long time to diagnose. While it is understandable that management is concerned with the user base, you have to be concerned with your own ability to work.

Conclusion

Data store redundancy and Disaster Recovery can be a touchy subject. In many cases it’s one of those “we’ll get to it when we can” situations. But sometimes, in high pressure environments, it really does make sense to provide a clustered or replicated environment for your data store. Only you as the administrator can accurately judge how important a working data store is to your organization. This concludes my series on Disaster Recovery and your Citrix environment. I realize I have only scratched the surface on some issues, but there really is no way to cover all of the facets of a true DR in 3 short articles. Hopefully this has given you some things to consider as you plan your own DR needs, and a few tips to help you implement it successfully.

