From Chaos To Control
Introduction
By Sean Daily, Series Editor
Welcome to From Chaos to Control: The CIO's Executive Guide to Managing and Securing the Enterprise! The book you are about to read represents an entirely new modality of book publishing and a major first in the publishing industry. The founding concept behind Realtimepublishers.com is the idea of providing readers with high-quality books about today's most critical IT topics, at no cost to the reader. Although this may sound like a somewhat impossible feat to achieve, it is made possible through the vision and generosity of corporate sponsors such as NetIQ, who agree to bear the book's production expenses and host the book on its Web site for the benefit of its Web site visitors.
It should be pointed out that the free nature of these books does not in any way diminish their quality. Without reservation, I can tell you that this book is the equivalent of any similar printed book you might find at your local bookstore (with the notable exception that it won't cost you $30 to $80). In addition to the free nature of the books, this publishing model provides other significant benefits. For example, the electronic nature of this eBook makes events such as chapter updates and additions, or the release of a new edition of the book, possible to achieve in a far shorter timeframe than is possible with printed books. Because we publish our titles in realtime (that is, as chapters are written or revised by the author), you benefit from receiving the information immediately rather than having to wait months or years to receive a complete product.
Finally, I'd like to note that although it is true that the sponsor's Web site is the exclusive online location of the book, this book is by no means a paid advertisement. Realtimepublishers is an independent publishing company and maintains, by written agreement with the sponsor, 100% editorial control over the content of our titles. However, by hosting this information, NetIQ has set itself apart from its competitors by providing real value to its customers and transforming its site into a true technical resource library, not just a place to learn about its company and products. It is my opinion that this system of content delivery is not only of immeasurable value to readers, but represents the future of book publishing.
As series editor, it is my raison d'être to locate and work only with the industry's leading authors and editors, and publish books that help IT personnel, IT managers, and users to do their everyday jobs. To that end, I encourage and welcome your feedback on this or any other book in the Realtimepublishers.com series. If you would like to submit a comment, question, or suggestion, please do so by sending an email to [email protected], leaving feedback on our Web site at www.realtimepublishers.com, or calling us at (707) 539-5280. Thanks for reading, and enjoy!
Sean Daily
Series Editor
Introduction
Chapter 1: Managing the Enterprise
  Business Drivers for Manageability
  Maturing Enterprise IT Management
    Reacting to Problems
      Planning
      Forecasting
      Monitoring
    Managing to Service Levels
    Optimizing Utilization
    Creating IT Agility
  Areas of Management Concern
    Network Management
    Server Management
    Storage Management
    Application Management
  Manageability Impacts on ROI and TCO
  Adopting a Manageability Roadmap
    Assembling the Manageability Team
    Identifying Manageability Concerns
    Defining a Manageability Maturation Path
    Creating a Manageability Policy
    Evaluating Manageability Results
  Summary
Chapter 1
Copyright Statement
2003 Realtimepublishers.com, Inc. All rights reserved. This site contains materials that have been created, developed, or commissioned by, and published with the permission of, Realtimepublishers.com, Inc. (the Materials) and this site and any such Materials are protected by international copyright and trademark laws. THE MATERIALS ARE PROVIDED AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice and do not represent a commitment on the part of Realtimepublishers.com, Inc or its web site sponsors. In no event shall Realtimepublishers.com, Inc. or its web site sponsors be held liable for technical or editorial errors or omissions contained in the Materials, including without limitation, for any direct, indirect, incidental, special, exemplary or consequential damages whatsoever resulting from the use of any information contained in the Materials. The Materials (including but not limited to the text, images, audio, and/or video) may not be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any way, in whole or in part, except that one copy may be downloaded for your personal, noncommercial use on a single computer. In connection with such use, you may not modify or obscure any copyright or other proprietary notice. The Materials may contain trademarks, services marks and logos that are the property of third parties. You are not permitted to use these trademarks, services marks or logos without prior written consent of such third parties. If you have any questions about these terms, or if you would like information about licensing materials from Realtimepublishers.com, please contact us via e-mail at [email protected].
Chapter 1
In short, a lack of manageability costs money. You'll spend more money on IT staff who are forced to work in reactive mode, more money on IT systems in an attempt to meet new business requirements, and more money trying to manage the entire mess. You'll spend money on lost opportunities because your IT resources weren't prepared to handle them. You'll lose money as systems fail to protect valuable corporate assets. You'll lose money as your users spend more time trying to deal with inefficient IT systems rather than getting their jobs done. Unmanaged and unmanageable IT systems are a money pit and the primary business driver for getting some manageability on the scene immediately.
Top 10 Ways a Lack of Manageability Costs Money
A lack of manageability in any area of a business costs money, particularly for IT systems. The following list highlights the top 10 ways a lack of manageability costs money:
- Unmanaged systems are subject to unexpected failure, depriving users of much-needed services and reducing general productivity.
- Unmanaged systems are less likely to provide acceptable levels of service, often resulting in hidden productivity losses as users wait for systems to catch up.
- Unmanaged systems are more difficult to redeploy and repurpose when business needs suddenly change. You're unlikely to know systems' unused capacities or their dependencies, meaning new business opportunities often require new systems and new expenses.
- Unmanaged systems are rarely optimized, meaning they either carry unused capacity (effectively money you've spent that will never be used) or they are under capacity.
- Unmanaged systems make it impossible to move beyond reactive IT management: problems must manifest themselves before being corrected; thus, the business always suffers.
- Unmanaged systems make it impossible to accurately identify common IT problems; if you can't identify problems, you can't find efficient ways to solve or eliminate them.
- Unmanaged systems might have hidden flaws or design problems that result in recurring problems.
- Unmanaged systems, particularly applications, might have scalability capabilities that you're not taking advantage of because you can't tell when the application needs to be scaled.
- Unmanaged systems, especially network storage systems, are easily abused by users, wasting corporate resources, reducing system availability, and driving up costs.
- Unmanaged systems are like unmanaged employees: they'll do their job, but they'll have no direction, no mission, and will take a lot longer (and cost a lot more) to get the job done.
Enterprise manageability, then, is a key to maturing your IT management, reducing costs, and improving the return on your expensive IT investments.
Apply these levels to your sales organization, or any traditional area of business management, and see how mature your company management really is:
- Agile: The highest level of management, agility allows you to focus on handling new business opportunities. You've fine-tuned your sales organization so much that you know exactly where your best assets are and how much extra capacity they have. When new business opportunities arise, you can take advantage almost effortlessly, relying on your optimizations and managed levels to keep things on track.
- Optimize: This level of management has you fine-tuning your resources. In other words, your sales organization's goals and quotas keep most management problems from happening, or at least let you address them proactively instead of reactively. You spend time putting the perfect salesperson on each account, helping each salesperson manage a greater number of accounts, and so forth.
- Managed: In this level of management, you've defined clear goals and are meeting them. You don't need to react to a sudden drop in sales because you're instead managing to specific short- and long-term sales goals and quotas. You can see unmet quotas coming from a long way off and do something to correct the problem before it actually occurs.
- Reactive: This is the lowest level of management and is simply a matter of reacting to problems. Reactive management can never be eliminated, as unexpected problems will never completely go away. However, reactive management should comprise a small portion of your overall management effort. For example, a well-tuned sales organization might sometimes need emergency action when a large client is unexpectedly dissatisfied, but that certainly shouldn't be the normal state of affairs.
Most companies already operate many of their divisions, such as sales and marketing, at the Optimize or Agile levels. This is almost natural, as areas such as sales and marketing are well understood, and time-tested solutions and methodologies exist for managing efforts in those areas. In other words, sales organizations have a high level of manageability. IT systems, however, are rarely so straightforward. Most IT systems provide very little support for achieving higher levels of management maturity, making it impossible to ever move beyond reactive management. Fortunately, there's growing recognition that IT management must mature, and there's a growing market of products and solutions to make IT systems more manageable.
Reacting to Problems
"The server's crashed!" "Mail's down!" "The sales entry system is running slowly again!" Phrases such as these are a common refrain in many IT environments. Without any systematic means of management for IT systems, they'll remain common, as your IT staff sits back and waits for problems to occur. For example, many folks manage their personal vehicles in a reactive fashion. Rather than taking the vehicle in for a monthly checkup, they simply drive and drive, waiting for something to fail. When something does fail, it's often catastrophic, creating significant downtime waiting for a tow truck and costly repair bills. Generally, IT management works just the same: Wait until something breaks and then fix it. The problem, of course, is that something has to break in order for things to get done. So how can you drag yourself out of the reactive level of management? Planning, forecasting, and monitoring.
Planning
What are your most common IT problems? You need to conduct an inventory of them. Oftentimes, help desk or problem-tracking software can assist you in doing so, especially for a large IT team. Large teams might not be able to communicate effectively without tools, meaning that multiple team members could all be constantly fighting the same fire, with each individual thinking they're the only one doing so. By implementing some kind of problem-tracking system, you can get a better picture of your common problems and take steps to reduce occurrences of them. For example, without realizing it, your team might be dealing with authentication problems on a daily basis. The cause could turn out to be authentication servers, such as Windows domain controllers, that are simply over capacity. An immediate solution is to increase the number of domain controllers to handle the authentication load of your business. This action illustrates a higher-level form of reaction: Rather than simply solving the authentication failure each time it occurs, you're taking steps to solve the problem in the short term.
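The problem inventory described above can be sketched in a few lines of Python. This is only an illustration: the ticket records and the "category" field are hypothetical stand-ins for whatever your help desk or problem-tracking tool actually exports.

```python
from collections import Counter

# Hypothetical ticket export; field names are illustrative, not from any real product.
tickets = [
    {"id": 1, "category": "authentication"},
    {"id": 2, "category": "mail"},
    {"id": 3, "category": "authentication"},
    {"id": 4, "category": "printing"},
    {"id": 5, "category": "authentication"},
    {"id": 6, "category": "mail"},
]


def most_common_problems(tickets, top_n=3):
    """Return (category, count) pairs, most frequent first."""
    counts = Counter(ticket["category"] for ticket in tickets)
    return counts.most_common(top_n)


for category, count in most_common_problems(tickets):
    print(f"{category}: {count} tickets")
```

Even a tally this simple makes a recurring problem, such as daily authentication failures, visible to the whole team rather than to one person at a time.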
Forecasting
Forecasting is key to preventing problems in the future. If your domain controllers are overburdened, you can add more to solve the immediate problem. Forecasting allows you to go one step further and predict when the new domain controllers will become overburdened. You can then plan to add capacity before the domain controllers reach that point, thus preventing the problem from occurring again.
Monitoring
Monitoring is the final step to getting yourself out of the reactive mode of management. Continuous monitoring lets your staff detect oncoming problems, such as domain controllers reaching their maximum utilization point, before the problem occurs. Monitoring and forecasting go hand in hand, ensuring that your forecasts aren't upset by sudden changes in the business environment or unforeseen bottlenecks in the IT infrastructure.
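One simple way to forecast, assuming utilization grows roughly linearly, is to fit a straight line to recent monitoring samples and see when it crosses a threshold. The sketch below is a minimal illustration under that assumption; the weekly sample values, the 80 percent threshold, and the function names are all hypothetical, and real growth is rarely this tidy.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_threshold(weekly_utilization, threshold):
    """Estimate weeks after the last sample until utilization reaches threshold.

    Returns None when the trend is flat or falling (no crossing is predicted).
    Needs at least two samples.
    """
    weeks = list(range(len(weekly_utilization)))
    slope, intercept = fit_line(weeks, weekly_utilization)
    if slope <= 0:
        return None
    crossing_week = (threshold - intercept) / slope
    return max(0.0, crossing_week - weeks[-1])


# Hypothetical weekly average utilization (percent) for a domain controller.
samples = [50, 52, 54, 56, 58]
print(weeks_until_threshold(samples, threshold=80))
```

Continuous monitoring keeps the sample list current, so the estimate can be recomputed whenever the business environment shifts.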
Performance Monitoring vs. Health Monitoring
One significant problem with typical IT management is that it is overly focused on performance. Suppose your company's mail servers are running at 70 percent processor capacity, with disk work queues under one or two. Is that good or bad? Who knows? Most IT systems (operating systems (OSs), applications, you name it) are great at providing raw performance data. Unfortunately, that data is just about as useful as any other raw data, which is to say it isn't very useful at all. In order to mature your IT management, you're going to need to turn your focus from performance data to system health information. IT health takes raw performance data and compares it with predefined thresholds. These thresholds, which are defined both by industry best practices and your own service level agreements (SLAs), translate raw performance data into general health indications, such as healthy, problem, or critical. Third-party management systems often use an automated process to collect performance data, compare that data with predefined levels of acceptable health, and provide a simple indication of your system's health. For example, a Microsoft Exchange management system might collect a dozen or more performance points from your Exchange servers and let you know which servers are operating within, below, or above your predefined levels. You could immediately spot servers that have extra capacity, are nearing their level of peak efficiency, or are exceeding their planned capacity. Because health management can categorize systems as still working but nearing capacity, it provides an excellent means for moving beyond reactive management to more mature levels of management. Health management lets you see problems coming from a longer way off, allowing your IT staff to take corrective action before the problems actually manifest as failed or unacceptably slow systems.
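The step from raw performance data to a health indication can be sketched as a comparison against predefined thresholds. The following is a minimal illustration, not how any particular management product works; the metric names and threshold values are assumptions chosen for the example and would in practice come from best practices and your own SLAs.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warning: float   # at or above this value the metric is a "problem"
    critical: float  # at or above this value the metric is "critical"


# Hypothetical thresholds for illustration only.
THRESHOLDS = {
    "cpu_percent": Threshold(warning=75.0, critical=90.0),
    "disk_queue_length": Threshold(warning=2.0, critical=5.0),
}

SEVERITY = ["healthy", "problem", "critical"]


def metric_health(name, value):
    """Translate one raw measurement into a health indication."""
    threshold = THRESHOLDS[name]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "problem"
    return "healthy"


def server_health(metrics):
    """A server is only as healthy as its worst metric."""
    states = (metric_health(name, value) for name, value in metrics.items())
    return max(states, key=SEVERITY.index)


print(server_health({"cpu_percent": 70.0, "disk_queue_length": 1.0}))
```

Reporting the worst state across metrics is one design choice among several; a weighted score would be another, at the cost of being harder to explain to users.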
In addition, monitoring needs to involve more than just basic IT performance data; it should also monitor user perceptions. For example, if users expect their logons to be processed within 5 seconds each morning, you need to monitor actual logon times to meet that expectation. Simply monitoring IT-centric metrics such as processor utilization or memory utilization might not reveal the entire picture, and might not alert you to problems perceived by end users. With sufficient practice and data, you'll be able to more easily match IT-centric metrics with user-centric metrics, establishing a correlation between hard performance values and users' perceptions of performance. When you've done that, you've finally moved out of reactive management, and you can begin to establish service levels that guarantee specific user-perceivable levels of performance.
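One common way to quantify the relationship between an IT-centric metric and a user-centric one is the Pearson correlation coefficient. The sketch below computes it by hand so it has no dependencies; the paired samples (processor utilization and logon time) are hypothetical, and a strong correlation is only a hint that one metric can stand in for the other, not proof of cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired samples taken at the same times each morning.
cpu_percent = [35, 48, 52, 61, 70, 78]
logon_seconds = [2.1, 2.6, 2.9, 3.8, 4.4, 5.3]

print(f"correlation: {pearson(cpu_percent, logon_seconds):.2f}")
```

A value near 1 suggests the IT-centric metric tracks what users experience; a value near 0 suggests you need to look at a different metric.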
You'll always need to react. Don't make the mistake of thinking that reactive management can be eliminated. As I said earlier, simply reacting to problems isn't a desirable way to manage an IT infrastructure, but it isn't entirely avoidable either. Unforeseen problems will always come up. However, a more mature IT management strategy will first react to these problems, then plan for them in the future. Perhaps you'll begin monitoring the health of additional systems or additional aspects of existing systems. You might alter your capacity forecasts to ensure that the problem doesn't occur again. The key is not to completely eliminate the reactive portion of management, but rather to reduce it and learn from it.
Managing to Service Levels
Although proactively reacting is a step up from simply fighting IT fires, it doesn't provide your users with any guarantees of performance, and it doesn't provide your organization with any IT stability. Once you have monitoring systems in place that can inform you of problems before they occur, you can start addressing those problems in advance. At that point, you can move up to the next level of maturity in IT management: managing to service levels. Service levels are simply agreements between you and your users, giving you a goal for performance and stability and giving your users an expectation for the IT services they consume. Service level agreements (SLAs) should always be stated in terms that users can easily verify for themselves. For example, the following basic SLA might work for your email infrastructure:
- Local network logon time, from the time the email client is launched: 5 seconds or less (might degrade acceptably to less than 45 seconds as much as twice per month due to rebalancing of mailboxes across available servers)
- Time to retrieve a message while connected to the local network: 2 seconds or less
- Availability of email during business hours: 99.999% (approx 6 seconds of downtime per month, assuming roughly 176 business hours per month)
- Availability of email during non-business hours: 95% (approx 28 hours of downtime per month, assuming roughly 554 non-business hours per month)
- Minimum notice for scheduled maintenance: 5 business days
- Recoverability: No more than 6 hours of data loss in the event of total failure
- Time for Internet-bound messages to be delivered: No more than 1 hour, typically 45 seconds or less
- Time for new mailboxes to be added (once requested): 1 business day or less
- Response time to user notification of mail system failure: 1 hour or less
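When drafting availability targets such as those in an SLA, it helps to convert between a percentage and the downtime it actually permits. The following sketch shows the arithmetic in both directions. The figure of 176 business hours per month (22 working days of 8 hours) is an assumption for illustration; substitute your own schedule.

```python
# Assumed schedule: 22 working days of 8 hours each. Adjust to your business.
BUSINESS_HOURS_PER_MONTH = 22 * 8


def allowed_downtime_hours(availability_percent, hours_in_period):
    """Hours of downtime permitted by an availability target over a period."""
    return hours_in_period * (100.0 - availability_percent) / 100.0


def achieved_availability(downtime_hours, hours_in_period):
    """Availability percentage implied by a given amount of downtime."""
    return 100.0 * (1.0 - downtime_hours / hours_in_period)


downtime = allowed_downtime_hours(99.9, BUSINESS_HOURS_PER_MONTH)
print(f"99.9% allows {downtime * 60:.1f} minutes of downtime per month")
```

Under these assumptions, a 99.9 percent target allows roughly 10.6 minutes of business-hours downtime per month, which is a far more concrete promise to put in front of users than the percentage alone.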
How can you meet these promises? Again, monitoring, planning, and forecasting. Use your monitoring infrastructure to calculate a correlation between IT-centric metrics such as processor utilization or network utilization, and user-centric metrics such as response time, downtime, and so forth. Create a monitoring system that recognizes server health, and can notify you if a server's operating parameters start to edge toward the unacceptable health zone. Automated notifications, and in many instances automated corrective actions, will enable your IT staff to respond to a worsening situation before it impacts your SLAs.
Because so few enterprise systems provide adequate built-in monitoring and reaction capabilities, you'll need to implement a third-party manageability system. This system should provide centralized management of IT resources, automated notification of problem health conditions, and ideally some form of automated corrective response to common problems. Typically, these systems are priced per end user or per monitored server or service, making them affordable even for smaller environments, and more easily scaled for larger implementations.
When will it pay off? Properly implemented, third-party monitoring systems can create a return on their investment almost immediately. By allowing your staff to react to oncoming problems before they occur, and by offering automated responses to many potential problems, these systems reduce management overhead, increase uptime, and reduce impact to productivity. Proper implementation, of course, is absolutely required, which is why many firms that sell monitoring solutions also offer implementation consulting to make sure that you're getting the most benefit from their products.
Once you're completely out of the realm of reactive management as a way of life, you can start using your proactively managed IT infrastructure more efficiently. In other words, once your IT staff quits fighting fires on a daily basis, they can start concentrating on optimizing your systems for maximum business benefit.
Moving Toward Managed: A Dot-Com's Web Servers
When I worked with a dot-com startup, most IT management was in a purely reactive state. We knew we had a problem when users or employees tried to use the Web site and discovered that it was too slow, or when customers complained that they couldn't reach specific pages on the Web site. Hearing about problems from customers is the worst way to discover them, of course, and we knew we needed to make our IT management a bit more mature. Unfortunately, our Internet Information Server (IIS)-based Web servers offered very little help, providing nothing more than some basic raw performance data and no effective means to catch failing Web pages or servers. We implemented NetIQ's WebTrends to provide better manageability for those Web servers. We determined that customers complained when server response time exceeded about 5 seconds per page load, and we were able to draw a correlation between the number of users hitting each Web server and the resulting response times. We discovered that during peak hours the Web servers were handling too many users to provide an acceptable response time, even though IT-centric values such as processor utilization and memory utilization were within acceptable values. To solve the immediate problem, we reconfigured the servers to have more capacity and prepared to add more servers to handle peak traffic. We established service levels that specified maximum page load times, and used WebTrends to help monitor user capacity and response times. As the site's popularity grew, we were able to easily predict when the existing server resources would be insufficient to maintain our service levels, and added servers well before we reached that point. We'd matured beyond simply reacting to problems to managing to a specific service level, thanks in part to third-party tools that made our Web servers more manageable than they were out of the box.
Optimizing Utilization
IT systems are expensive, and businesses benefit most when they're squeezing every drop of productivity from their systems. In a reactive management mode, of course, you're too busy simply keeping things running to worry about whether or not they're running at peak efficiency. Once you've implemented third-party monitoring systems and moved to a more mature level of management, you'll have the luxury to start optimizing your systems to their maximum potential.
Your monitoring systems should already help you determine some critical pieces of information:
- The maximum values that various IT-centric measurements can reach. For example, you should know the maximum processor utilization, number of mailboxes, and other loads that your mail servers are able to carry.
- The maximum desired operational level for your servers. This information represents the maximum performance values for various IT-centric measurements that still allow your servers to meet your SLAs. These values are typically below (often far below) the theoretical maximum performance level of your equipment, and represent an operational level that is acceptable to both your IT staff and your end users.
- The current level of performance for your servers.
Figure 1.2 illustrates these three points mapped for several common measurements on a mail server: mailbox capacity, user response time, network throughput, storage utilization, and downtime. In each instance, the red bar indicates the maximum possible value that the mail system can have. The red line indicates the maximum values that the servers can endure while still meeting your SLAs. The colored bars and the green line indicate the current performance level. Everything in between the green and red lines indicates unused capacity that can be put to use.
Figure 1.2: Monitoring system utilization is the first step toward optimization.
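The relationship between the three values (current level, SLA ceiling, and hard maximum) can be expressed as a small calculation. The sketch below is illustrative only; the function name, the returned fields, and the mailbox counts are hypothetical.

```python
def headroom(current, sla_ceiling, hard_maximum):
    """Summarize unused capacity for one measurement.

    usable:  capacity that can be put to work while still meeting the SLA
    reserve: the gap between the SLA ceiling and the hard maximum
    """
    if not 0 <= sla_ceiling <= hard_maximum:
        raise ValueError("expected 0 <= sla_ceiling <= hard_maximum")
    if current < 0:
        raise ValueError("current level cannot be negative")
    return {
        "usable": max(0, sla_ceiling - current),
        "reserve": hard_maximum - sla_ceiling,
        "over_sla": current > sla_ceiling,
    }


# Hypothetical mail server: 400 mailboxes today, 650 within SLA, 1000 at most.
print(headroom(current=400, sla_ceiling=650, hard_maximum=1000))
```

Only the "usable" figure is a candidate for optimization; the reserve exists precisely so that the SLA is not at risk.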
Once you've reached this point at which you're concerned about optimizing utilization, you're no longer worried about the mail system crashing or about meeting your SLAs. Now you're interested in putting as much load on the server as possible while continuing to meet your service level goals. In other words, you're no longer simply assuring service; you're optimizing it.
Optimizing can take a bit longer to pay off than simply managing to an SLA. Ideally, you'll find that your systems, properly managed, have enough excess capacity that you can reduce the number of systems you have to manage, which is an obvious savings. In the long term, you might discover enough excess capacity to postpone the purchase of additional systems, allowing IT to support business growth with little or no additional overhead.
Optimization isn't a one-time process. Because your IT staff should be spending less and less time fighting fires or simply maintaining SLAs, they should be spending more and more time optimizing systems for maximum performance. Regular reviews of server health and optimization targets along with carefully managed optimization efforts should continue to put you closer to the goal of a completely optimized IT environment.
Top 10 IT Optimization Points
So what, specifically, will your staff be optimizing? The following list provides the top 10 areas of focus:
- Database server utilization: Tweaking database designs, indexes, server-side code, and other technologies to maximize the load the database server can handle.
- File server throughput: Entails reorganizing files, physically relocating servers to be logically closer to users, and so forth.
- Mail server balancing: Locating high-volume mailboxes and balancing them evenly across available servers.
- Internet access: Proxy servers and other border devices can be used to reduce WAN bandwidth consumption and improve user response times for Web access.
- Authentication servers: Windows domain controllers and UNIX realm servers can be optimized to provide authentication for the maximum number of users with acceptable response times.
- Network services: Name resolution, in particular, can be placed on the network and optimized for fastest response times.
- Combining services: Reducing servers by combining services (such as authentication and name resolution) onto multi-function servers can meet SLAs, reduce hardware management, and better meet utilization goals.
- Network throughput: Redesigning LANs and WANs to reduce unnecessary network traffic, distribute high-volume users and servers across the network, and better balance access to network services.
- Management effort: Automating common tasks so that another important IT resource, your IT staff, is utilized more efficiently.
- Security auditing: Automation is a key to improving security, reducing management effort, and optimizing your security infrastructure.
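As one concrete illustration of the mail server balancing item above, the sketch below uses a simple greedy heuristic: place the highest-volume mailboxes first, each on whichever server currently carries the least load. This is a generic load-balancing technique, not the method of any particular mail product, and the mailbox names, volumes, and server names are hypothetical.

```python
import heapq


def balance_mailboxes(mailbox_volumes, server_names):
    """Greedy assignment: busiest mailboxes first, each to the least-loaded server.

    Returns (assignment, loads), where assignment maps server -> mailbox list
    and loads maps server -> total volume placed on it.
    """
    heap = [(0, name) for name in server_names]  # (current load, server)
    heapq.heapify(heap)
    assignment = {name: [] for name in server_names}
    by_volume = sorted(mailbox_volumes.items(), key=lambda kv: kv[1], reverse=True)
    for mailbox, volume in by_volume:
        load, name = heapq.heappop(heap)  # least-loaded server so far
        assignment[name].append(mailbox)
        heapq.heappush(heap, (load + volume, name))
    loads = {name: load for load, name in heap}
    return assignment, loads


# Hypothetical daily message volumes per mailbox.
volumes = {"sales": 50, "support": 40, "ceo": 30, "hr": 20, "lab": 10}
assignment, loads = balance_mailboxes(volumes, ["mail1", "mail2"])
print(assignment)
print(loads)
```

The heuristic does not guarantee a perfectly even split, but it is easy to rerun during regular optimization reviews as mailbox volumes change.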
What do you do when optimization is firmly under control? Stop worrying about getting the most from your servers, and start worrying about how to quickly reposition IT resources to help your company capitalize on business opportunities.
Moving Toward Optimized: A Telecommunications Firm's File Servers
One of my first manageability jobs was for a major East Coast telecommunications firm. The company had already moved into a basic managed model of IT management but was experiencing significantly increased costs. Their basic management methodology was to predict when existing resources would become insufficient to meet their service levels, then add more resources when the time came. The problem was that the number of resources, particularly file servers, was fast exceeding their staff's ability to manage and monitor, particularly with regard to file security. They needed to find a way to use their management and file server resources more efficiently.
First, we acquired Precise's StorageCentral SRM, a product designed to optimize storage utilization. Using its reporting features, we discovered that more than 35 percent of the file servers' space was effectively being wasted by users' personal files, such as MP3s, graphics, and so forth. We also realized that the company's files were scattered across servers with practically no discernible organization, making it difficult to efficiently manage file permissions for the company's various departments. Optimization was in order.
We began by using StorageCentral SRM to block unwanted files, such as music and graphics, from the file servers. Then we rearranged the file resources into department-specific file servers, placing files with similar security configurations on the same servers. We used NetIQ's File Security Administrator to logically group files that needed to remain spread across different servers and to apply security permissions to groups of files more efficiently than Windows' native user interface allowed. Finally, we implemented regular monthly reporting from both StorageCentral SRM and File Security Administrator to ensure that the file servers remained fine-tuned.
Hours of daily file server management were replaced by monthly optimization meetings and periodic fine-tuning of the file server organization. The optimizations helped ensure that the network met its SLAs while significantly reducing the amount of management time required. In the end, approximately 10 percent of the file servers were found to be unnecessary and were decommissioned and redeployed, reducing the overall management burden even further.
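The kind of storage report described in this case study can be approximated with a short script. The following is a minimal sketch, not the way StorageCentral SRM works: it walks a directory tree, totals the bytes used per file extension, and reports what fraction belongs to a list of "personal" file types. The function names and the extension list are assumptions made for illustration.

```python
import os
from collections import defaultdict

# Hypothetical list of "personal" file types; adjust to local policy.
PERSONAL_EXTENSIONS = {".mp3", ".wav", ".jpg", ".gif", ".bmp", ".avi"}


def space_by_extension(root):
    """Return a dict mapping lowercase file extension to total bytes under root."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanish or cannot be read
            ext = os.path.splitext(name)[1].lower()
            totals[ext] += size
    return dict(totals)


def personal_share(totals, personal=PERSONAL_EXTENSIONS):
    """Return the fraction of all bytes used by the listed personal extensions."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    wasted = sum(size for ext, size in totals.items() if ext in personal)
    return wasted / total
```

Running `personal_share(space_by_extension(path))` against a share on a regular schedule gives a rough trend line that can feed a monthly review like the one the firm adopted.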
Creating IT Agility
Agility is often a more nebulous term than the other levels of IT management maturity. In its simplest form, agility means that you've achieved a level of IT management in which service levels and optimization are under control, and you're ready to quickly reposition IT resources to meet rapidly changing business environments. For example, consider Figure 1.3, which diagrams how a business might react to a new business unit acquisition. The unused capacity of existing resources is documented, allowing management to quickly determine how new users and services will be accommodated. The acquisition's existing resources can be easily migrated into the parent company, and the need to purchase new systems is reduced.
Figure 1.3: An agile IT infrastructure can easily accommodate changing business requirements, such as the acquisition of another company.
How can you achieve this level of maturity? Again, third-party products are almost always required. Specifically, you'll need to leverage your underlying optimization and management expertise as well as create new capabilities:
- Your work with optimization tells you how much excess capacity your systems have and how much additional load they can handle. Having this information allows you to react to new business opportunities quickly, leveraging known excess capacity to soak up increased demand for IT services.
- You'll often need to leverage third-party migration tools. These can help quickly assimilate acquired assets into your infrastructure, whether it's migrating users from one domain into another, migrating files into your existing file servers, migrating data across database systems, or migrating user mailboxes from dissimilar messaging systems. Migration tools can also be useful in quickly redistributing existing company resources to concentrate enough excess capacity to create new business units or handle acquisitions. For example, if you have four Web servers with 20 percent excess capacity each, they could assimilate the load of a fifth server running at 60 percent capacity, freeing up that fifth server for a new business venture.
- Your ongoing optimization efforts can be relied upon to quickly reorganize your infrastructure to best meet increased demand. For example, in the short term it might be sufficient to simply migrate acquired assets to any system with excess capacity; in the long term, you'll want to reorganize and optimize those assets so that your IT systems are handling the maximum possible load.
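The Web server example above is simple arithmetic, and it can be checked with a few lines of code. The following is a minimal sketch under two assumptions that the text does not state: all servers are identical, and load can be divided evenly among them. The function names and the `ceiling` parameter are hypothetical.

```python
def can_absorb(target_loads, source_load, ceiling=100.0):
    """Return True if the targets' combined headroom covers source_load.

    Loads are percentages of one server's capacity. Assumes identical
    servers and a load that can be split freely (both are assumptions).
    """
    headroom = sum(max(ceiling - load, 0.0) for load in target_loads)
    return headroom >= source_load


def spread_evenly(target_loads, source_load):
    """Return each target's utilization after taking an equal share of source_load."""
    share = source_load / len(target_loads)
    return [load + share for load in target_loads]


# Four servers at 80 percent utilization (20 percent excess each),
# absorbing a fifth server that runs at 60 percent.
four_servers = [80.0, 80.0, 80.0, 80.0]
print(can_absorb(four_servers, 60.0))     # True: 80 points of headroom vs. 60 needed
print(spread_evenly(four_servers, 60.0))  # [95.0, 95.0, 95.0, 95.0]
```

Note that the consolidation works but leaves each remaining server with only 5 percent headroom. Passing a lower `ceiling` (for example, 90) shows whether a move is still possible while keeping a planned reserve, which is why the short-term migration is followed by longer-term reoptimization.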
Creating an agile IT environment pays off in ways that are difficult to measure. Often, the best way to realize a return on the necessary infrastructure investment is to consider the cost of opportunities that would be lost in a less agile environment.
Maturing your IT management to the agility level can be difficult. It requires a lot of foresight into the company's potential business future, and can require a significant investment in tools and resources to make rapid IT changes, such as migrations or server consolidation, possible and reliable. Difficult though it might be to achieve, an agile IT environment is the only one that serves the ultimate purpose of IT: to support every facet of your business in every way possible.
Don't think that moving from the optimized to the agile level of management maturity is easy. Quite the contrary; you'll have to achieve an extremely high level of discipline, configuration management, and awareness of management correlation to get there. In fact, the skills and processes necessary to mature to an agile level of management are the topic of entire books.
Moving Toward Agility: A Bank's Exchange Servers
One of my larger projects was helping an international banking firm achieve a more mature level of management over their Microsoft Exchange servers. The bank was already aggressively managing to service levels for performance and utilization, and regularly reviewed their servers for optimization. In fact, their optimization effort used home-grown reporting tools to review each mailbox on each Exchange server, and every quarter they moved mailboxes between servers to balance the number of high-volume mailboxes each server had to handle.
However, the bank had recently embarked on a series of major business unit acquisitions, while at the same time creating new business units to offer insurance, investment services, and other new lines of business. IT management was hard-pressed to redeploy resources quickly enough to meet the ever-changing business environment. Recognizing that these changes would continue to come fast and heavy and that they were inherently unpredictable, I advised that the bank begin to manage its Exchange infrastructure with deliberate excess capacity. Rather than managing servers to achieve maximum efficiency, additional servers were added and mailboxes rebalanced so that each server had approximately 15 to 20 percent extra capacity. When new business units were added, they could utilize this capacity immediately, giving IT management time to reanalyze the situation, rebalance the servers, and bring new servers online if necessary.
Their next concern was that the rapidly changing business environment would require so many new resources that their ability to manage their service levels would be compromised. We decided that the solution would be to implement additional automation for common tasks and problems. NetIQ's Exchange Administrator was used to simplify daily administrative tasks, making those tasks suitable for lower levels of IT administration. For example, new mailbox creation was delegated to the bank's desktop configuration division, freeing up more skilled resources to focus on higher-end administration.
Microsoft Operations Manager (MOM) was brought in to provide better server health monitoring and to implement common corrective actions for common problems. For example, a new overflow Exchange server was set up. MOM was programmed to watch for Exchange servers nearing their managed capacity level and to move mailboxes to the overflow server while sending an alert to an administrator. This setup provided a partially automated solution for balancing the Exchange server load, letting administrators spend less time on server optimization and more time on infrastructure planning to support the business unit growth.
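The overflow rule in this case study can be illustrated with a small simulation. The following is a minimal sketch of the logic only; it does not use MOM or Exchange interfaces, and every name in it (`Server`, `relieve`, `managed_level`) is hypothetical. It assumes each mailbox has a known load figure and that a server's managed capacity level is 80 percent, which corresponds to the 20 percent reserve mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    capacity: int  # load units the server can handle (hypothetical measure)
    mailboxes: dict = field(default_factory=dict)  # mailbox name -> load units

    @property
    def utilization(self):
        """Fraction of capacity currently in use."""
        return sum(self.mailboxes.values()) / self.capacity


def relieve(servers, overflow, managed_level=0.80):
    """Move the largest mailboxes off servers above managed_level.

    Mailboxes go to the overflow server, and one alert message is recorded
    per move. This sketch does not check the overflow server's own capacity.
    """
    alerts = []
    for server in servers:
        while server.utilization > managed_level and server.mailboxes:
            # Pick the mailbox contributing the most load.
            name = max(server.mailboxes, key=server.mailboxes.get)
            overflow.mailboxes[name] = server.mailboxes.pop(name)
            alerts.append(f"moved {name} from {server.name} to {overflow.name}")
    return alerts
```

Moving the largest mailbox first keeps the number of moves small; a real deployment might prefer the opposite choice to limit disruption to heavy users. Either way, the returned alerts give an administrator a record to review when rebalancing later.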
Top 10 Signs of Maturing IT Management
How can you tell that your IT management is maturing? The following list provides the top 10 signs that it is occurring:
- Firefighting stops being a daily occurrence.
- You're able to more easily take advantage of new business opportunities by redeploying existing systems or taking advantage of unused capacity.
- You're able to provide your users with SLAs and meet those service levels.
- You have evidence that your systems are operating at peak efficiency, not over or significantly under capacity.
- Your staff is able to detect growing IT problems before they result in a loss of service, and take corrective action.
- IT staffers are spending more time designing and implementing new systems, not repairing failures on existing systems.
- You might be able to reduce the number of lower-level IT staffers, as problems are handled proactively and require fewer hands to fight fires.
- You have evidence that IT resources aren't being wasted or abused.
- You stop hearing about "problem children" in IT status reports.
- You're able to more easily document critical business factors such as total cost of ownership (TCO) and ROI for IT systems.
Network Management
Network management involves the management of your underlying network infrastructure, including raw network capacity, network devices (such as switches, routers, and firewalls), and so forth. Signs of management maturity levels are:
- Reactive: Network speeds are uneven throughout the enterprise, outages occur without warning, the maximum capacity of the network and/or network devices is not known, and the network's actual structure might not be accurately documented.
- Managed: Network throughput is guaranteed by SLAs; outages are rare; the network's basic capacity is documented, and the major portions of the network architecture are documented and understood; and bottlenecks have been identified and are monitored through an automated solution.
- Optimized: Network speeds are evenly balanced throughout the enterprise; per-segment and per-device capacities are known, and are managed to a specified level of utilization; and network restructuring is performed regularly to maintain optimization levels.
- Agile: Network segments and devices have known excess capacity and can be expanded to meet sudden increases in demand; sudden increases in demand are met easily, and the infrastructure is re-optimized and extended to ensure planned excess capacity.
Server Management
Server management includes the basic OS and hardware of your servers as well as core network services such as authentication, name resolution, file services, print services, and so forth. Signs of maturity levels include:
- Reactive: Server outages occur without warning; server capacities are unknown and often exceeded; performance management is based primarily on anecdotal evidence and user complaints; and performance of core network services is uneven throughout the enterprise.
- Managed: Server outages occur only as part of planned maintenance; server capacities are known, bottlenecks are documented, and performance is automatically monitored for compliance with SLAs.
- Optimized: Core network services perform consistently throughout the enterprise, server capacities are known and are managed to a specific level of utilization, and server consolidation and reorganization occur regularly to maintain optimization levels.
- Agile: Servers have known excess capacities that can absorb sudden increases in demand, services are available to quickly migrate or consolidate servers in response to changing business conditions, and the server infrastructure is easily extended to ensure planned excess capacity.
Storage Management
Storage management refers to one of the most important and oft-overlooked areas of IT management: raw data storage. With today's increasingly inexpensive storage options, many organizations simply throw more storage space at their problem, failing to recognize that every megabyte of storage introduces management overhead, disaster recovery concerns, and environment complexity issues. Signs of maturity levels are:
- Reactive: Storage capacity is extended on demand; data is not distributed across storage systems in a logical fashion; users, especially new users, might have difficulty locating data; systems such as Microsoft Distributed File System (Dfs) might be in place to help alleviate complexity; and data security is inconsistent throughout the enterprise.
- Managed: Storage capacity is monitored and controlled, a storage policy exists to describe classes of data and levels of data availability, data is secured in a consistent fashion throughout the enterprise, and storage policy is enforced through technological means such as disk quotas.
- Optimized: Data is logically organized and consistently secured across the enterprise, a planned level of excess capacity is present and maintained, regular data reorganization takes place to maintain planned levels of utilization, and Storage Area Networks (SANs) are likely in use to optimize storage management.
- Agile: Storage systems (including disaster recovery systems) have a known excess capacity and are able to meet sudden increases in demand, systems exist to quickly migrate or reorganize storage, and storage systems are easily expanded to meet planned levels of excess capacity.
Application Management
Application management is a broad topic and is difficult to discuss in specific terms. It includes such obvious things as messaging systems management, database systems management, and so forth. It also includes the capacities and performance of line-of-business applications, both packaged solutions such as ERP systems and customized in-house applications. Signs of management maturity include:
- Reactive: Application outages occur without warning; outages are difficult to recover from completely; application performance is inconsistent across the enterprise, and the precise capacities of individual applications are unknown; no specific plan exists for expanding capacities; and performance is generally managed based on anecdotal evidence and user complaints.
- Managed: Application outages occur only when planned, disaster recovery is planned and tested, applications perform according to SLAs, and plans exist to expand capacity whenever applications come too close to their limits.
- Optimized: Application performance is consistent across the enterprise, applications have a known capacity and are managed to a specific level of utilization, and application management is reduced through automation techniques.
- Agile: Applications have a known excess capacity and can absorb increases in demand, systems exist to provide migration and reorganization capabilities to meet sudden changes in business requirements, and applications are expanded to maintain a planned level of excess capacity.
The effect of improved manageability on ROI is straightforward: if implementing more manageable systems reduces TCO, then every reduction in TCO shortens the time it takes to realize ROI. It's important to document these cost reductions as you seek to mature your IT management because they provide justification for what might otherwise seem like pointless additional IT overhead.
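One way to make this relationship concrete is a simple payback calculation. The following is a minimal sketch that assumes a basic payback model, which is only one of several ways to express ROI timing; it treats the recurring portion of TCO as an annual operating cost. The function name and all dollar figures are hypothetical.

```python
def payback_years(initial_investment, annual_benefit, annual_operating_cost):
    """Years until cumulative net benefit covers the initial investment.

    Simple payback model (an assumption): no discounting, and benefit and
    cost stay constant from year to year.
    """
    net_per_year = annual_benefit - annual_operating_cost
    if net_per_year <= 0:
        return float("inf")  # the system never pays for itself
    return initial_investment / net_per_year


# Hypothetical figures: the same system before and after a reduction
# in annual operating cost from 200,000 to 175,000.
print(payback_years(500_000, 300_000, 200_000))  # 5.0
print(payback_years(500_000, 300_000, 175_000))  # 4.0
```

In this illustration, trimming annual operating cost by 25,000 brings the payback point forward by a full year, which is the kind of figure worth recording when justifying management tools.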
Deliberately maturing your IT management level requires a great deal of cooperation both within your IT organization and with your company's other executives. Top managers need to be on board with SLAs and with the need to invest in tools that can provide the manageability necessary to mature the environment. A written manageability policy makes manageability goals clear to IT staffers, who can move to implement tools and procedures to support the policy. Finally, management technologies play a vital role, exposing information and providing levels of automation that are required to move beyond simple reactive management.
Assembling the Manageability Team
Who are the key players in your maturity roadmap? Generally, top managers have to agree that a more mature IT management environment is both desirable and worth a modest investment. They also have to decide what levels of performance and stability are required; higher levels, of course, will require a higher level of investment in management tools and supporting technologies. Key players include:
- Executives: CEO, CTO/CIO, CFO, COO, and so forth
- Departmental managers, vice presidents, and directors, including sales, marketing, operations, research, and so forth
Another key and often overlooked need is for a technically savvy businessperson to play the role of translator. Two-way translation is required: Executives will need a plain-language explanation of how IT management concerns will affect the business as a whole, and technologists implementing your new manageability policies will need someone to translate the business requirements into a technology plan. On the IT side, make sure your organization's top technologists are involved:
- Senior network administration
- End-user support (Help desk) management
- Lead software developers
In addition, be sure to involve representatives from any major contractors or IT outsourcing partners. These partners need to understand the impact of your manageability initiatives on their projects, and need to be provided with new requirements so that their efforts will support your manageability policies.
Identifying Manageability Concerns
Decide where you'll focus your manageability policies first. It's often unwise to try to tackle the entire IT environment at once; instead, pick a single key area of pain and mature your management in that area first. Use your experience to then mature other areas of IT management individually, until the entire environment can be considered agile. Typical areas of concern include:
- Messaging and collaboration systems
- Relational database management systems
- Storage systems
- Core network services (name resolution, authentication, and so on)
Security is another major area for concern, but can be much more complicated to bring to a more mature level of management.
Chapter 2 focuses entirely on maturing your level of IT security management.
Defining a Manageability Maturation Path
Create a plan that details exactly how you'll move to each successive level of management maturity. For example:
- "In the third quarter, we will implement a problem-tracking system for the messaging infrastructure. We will use this system to identify key problems, then provide immediate fixes for those problems."
- "By the fourth quarter, we plan to have all major problems documented and under control. We will then implement an automated monitoring and management solution for the messaging infrastructure, and document the correlation between user perceptions and performance measurements."
- "In the first quarter of next year, we will adopt SLAs that define acceptable levels of perceived performance, while documenting the corresponding server health conditions. We will proactively manage the systems to these service levels." At this point, you have achieved the managed level.
- "After maintaining our service levels for at least 6 months, we will focus on server consolidation and optimization. Our goal is to document the excess capacity of all messaging servers and potentially reduce the number of production systems." At this point, you have achieved the optimized level.
- "Once we achieve an optimized environment, we will manage to our optimization goals for 6 months, becoming proficient at reorganizing the environment on a monthly basis to meet optimization goals in the face of changing business operations."
- "After maintaining the optimized environment for 6 months, we will optimize to a specific level of excess capacity, creating room for new business opportunities. We will implement tools to assist in migrations and reorganizations, and will create standardized procedures for using these tools." You have now achieved the agile level.
Creating a Manageability Policy
Your maturation plan should communicate to your company's management how you plan to proceed to a more agile IT environment. However, your technologists will need more specific direction, which they can get from a well-crafted manageability policy. For example, your policy might include points similar to the following:
- Implement a problem-tracking system. All user problem reports will go through the Help desk and be logged into the system. All administrators will describe problem resolutions in the tracking system.
- Implement reporting on the tracking system to identify common problems and their interim resolutions. Senior administrators will begin devising ways to eliminate these common problems from the environment, either through technology, user education, revised procedures, or other means.
- Interview users to determine acceptable levels of user-perceived performance. Implement monitoring tools that will help correlate acceptable performance levels with specific IT-centric metrics. Document both the user-perceived performance levels and observed IT-centric metrics.
- Implement an automated monitoring system that alerts us when system performance moves into ranges that do not provide user-acceptable response times. Implement procedures to address these problems proactively before system response times become completely unacceptable or result in a service outage.
- Create SLAs that guarantee user-acceptable levels of performance. Manage to these levels and document all deviations. Where possible, implement automated corrective actions to ensure that server performance remains acceptable to users.
Notice that these policy elements specify technology goals, rather than tools or enabling technologies. Your technologists are the professionals; trust them to implement the right tools to meet the goals you've specified.
Evaluating Manageability Results
How can you evaluate the success of your maturity efforts? TCO is one way, but that can be incredibly complex to calculate, even in fairly small and simple environments. Other factors might provide a more useful indication. For example, consider hours of downtime, which are easy to document and work with. Figure 1.6 shows how improved manageability can reduce downtime.
In the scenario that Figure 1.6 illustrates, downtime in a reactive environment exceeded 10 hours per month. Simply implementing automated management and monitoring systems reduced this time by more than half by allowing IT staff to step in and address growing problem conditions before an actual outage occurred. Optimizing the environment reduced downtime slightly more by reducing the number of production systems in the environment. Having fewer systems to deal with means fewer points of failure. Finally, creating an agile environment allowed downtime to remain flat despite a 200 percent growth in demand, which doubtless necessitated additional resources. The agile environment can react quickly to business changes without jeopardizing IT stability.
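Downtime hours translate directly into an availability percentage, which is often the figure written into an SLA. The following is a minimal sketch of that conversion. It assumes a 30-day month of 720 hours, and the inputs of 10 and 5 hours are illustrative round numbers based on the scenario above rather than values read from Figure 1.6.

```python
def availability(downtime_hours, period_hours=720):
    """Return availability as a percentage of the period.

    period_hours defaults to a 30-day month (an assumption).
    """
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (period_hours - downtime_hours) / period_hours


# Illustrative inputs: roughly the reactive and managed cases.
print(round(availability(10), 2))  # 98.61
print(round(availability(5), 2))   # 99.31
```

Tracking this one number month over month gives a simple trend that is far easier to maintain than a full TCO model.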
Top 10 Corporate Manageability Policies
So what might your manageability policy include? Keep in mind that your policy should simply provide direction for your IT staffers, detailing what type of manageability they must provide. The following list highlights the top 10 policy directives:
- Ensure that the health of all critical systems is monitored constantly.
- Where possible, provide automated corrective actions for common problems.
- Ensure that security auditing is automated and includes alerts for problem conditions.
- Provide monthly capacity reports for all critical systems.
- Where possible, consolidate underutilized systems to improve efficiency.
- Extend system capability when existing systems reach 90 percent utilization or more.
- Maintain a 10 to 15 percent level of general underutilization to provide extra capacity to react to new business opportunities.
- Reduce administrative time spent on handling unexpected problems and conditions by 50 percent or more.
- Define and meet SLAs for general system availability for all critical systems.
- Define and meet SLAs for system response times for all critical systems and applications.
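Several of these directives (monthly capacity reports, consolidation, and the 90 percent extension trigger) can be combined into one simple report. The following is a minimal sketch under stated assumptions: utilization figures are already collected as fractions between 0 and 1, and the 30 percent consolidation threshold is a hypothetical value that the policy list does not specify. The function and system names are illustrative only.

```python
def capacity_report(utilization_by_system, extend_at=0.90, consolidate_below=0.30):
    """Map each system name to a recommended action.

    extend_at reflects the 90 percent directive; consolidate_below is a
    hypothetical threshold for flagging underutilized systems.
    """
    report = {}
    for system, utilization in sorted(utilization_by_system.items()):
        if utilization >= extend_at:
            action = "extend"
        elif utilization < consolidate_below:
            action = "review for consolidation"
        else:
            action = "ok"
        report[system] = action
    return report


# Illustrative monthly figures for three hypothetical systems.
print(capacity_report({"mail01": 0.93, "file02": 0.18, "db01": 0.72}))
```

A report like this does not decide anything by itself; it gives the monthly review a short list of systems to discuss.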
Summary
Managing IT doesn't have to mean hourly firefights, long nights in the data center, and perpetually upset users. Although it's easy to remain in a reactive management mode when it comes to IT, experienced managers will recognize the value of a more mature level of IT management, one that corresponds to the mature levels of management present elsewhere in most companies. More mature management means lowered IT overhead costs, improved returns on IT investments, improved productivity, and an ability to better meet the needs of the company in today's rapidly changing business environments.
Although few IT products provide sufficient manageability out of the box, a robust market of third-party products exists to provide monitoring, automated management, automated corrective action, reliable asset migration and reorganization, and reporting. These products can add significant manageability to your IT infrastructure, making an optimized or agile environment easier to attain.