Alois Reitbauer works as Technology Strategist for dynaTrace. As a major
contributor to dynaTrace Labs technology he influences the company's future technological direction. Besides his engineering work he supports Fortune 500 companies in implementing successful performance management. In his former life Alois worked for a number of high-tech companies as an architect and developer of innovative enterprise software. At Segue Software (now Borland, a Micro Focus company) he was part of the engineering teams for the company's active and passive monitoring products. He is a regular speaker at conferences like TheServerSide Java Symposium Europe, QCon, Jazoon, Devoxx and JAX. He was the author of the Performance Series in the German Java magazine, as well as author of other online and print publications and a contributor to several books.

Andreas Grabner
Andreas Grabner has 10 years of experience as an architect and developer in the Java and .NET space. In his current role, Andi works as a Technology Strategist. In this role he influences the dynaTrace product strategy and works closely with customers in implementing performance management solutions across the entire application lifecycle.

Michael Kopp
Michael Kopp has 10 years of experience as an architect and developer in Java/JEE and C++. He currently works as a Technology Strategist and product evangelist in the dynaTrace Center of Excellence. In this role he specializes in the architecture and performance of large-scale production deployments. As part of the R&D team he influences the dynaTrace product strategy and works closely with key customers in implementing performance management solutions for the entire lifecycle. Before joining dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In that role one special focus was always the performance and scalability of their enterprise offerings.

Klaus Enzenhofer
Klaus Enzenhofer has several years of experience and expertise in the field of web performance optimization and user experience management. He works as Technical Strategist in the Center of Excellence Team at dynaTrace software. In this role he influences the development of the dynaTrace application performance management solution and the web performance optimization tool dynaTrace AJAX Edition. He mainly gathered his experience in web and performance by developing and running large-scale web portals at Tiscover GmbH.

Copyright dynaTrace software, A Division of Compuware. All rights reserved. Trademarks remain the property of their respective owners. The dynaTrace Application Performance Almanac is brought to you by our authors at blog.dynatrace.com

Welcome
dynaTrace Application Performance Almanac, Issue 2012: A Year of APM Knowledge

We are proud to present the second issue of the Application Performance Management Almanac, a collection of technical articles drawn from our most read and discussed blog articles of last year. While keeping our focus on classical performance topics like memory management, some new topics piqued our interest. Specifically, Cloud, Virtualization and Big Data performance is getting increasingly important and becoming a focus of software companies as these technologies gain traction. Web Performance has continued to be a hot topic.
This topic also broadened from Web Diagnostics to production monitoring using User Experience Management, looking beyond one's own application into third-party component performance, which is becoming a primary contributor to application performance. Besides providing deep technical insight we also tried to answer controversial "why" questions, like in our comparison of different end user monitoring approaches or our discussion about the use cases for NoSQL technologies. We are also glad that some of our top customers have generously agreed to allow us to feature their applications as real-world examples. Furthermore, a number of excellent guest authors have contributed articles, to whom we extend our thanks. We decided to present the articles in chronological order to reflect the development of performance management over the year from our perspective. As the articles do not depend on each other, however, they can be read as individual pieces depending on your current interests. We want to thank our readers for their loyal readership and hope you enjoy this Almanac.

Klaus Enzenhofer, Andreas Grabner, Michael Kopp, Alois Reitbauer

Contents

Web & Cloud
5 Steps to Set Up ShowSlow as Web Performance Repository for dynaTrace Data [8]
5 Things to Learn from JC Penney and Other Strong Black Friday and Cyber Monday Performers [332]
eCommerce Business Impact of Third Party Address Validation Services [310]
How Case-Sensitivity for ID and ClassName can Kill Your Page Load Time [209]
How Proper Redirects and Caching Saved Us 3.5 Seconds in Page Load Time [238]
Is Synthetic Monitoring Really Going to Die? [269]
Microsoft Not Following Best Practices Slows Down Firefox on Outlook Web Access [160]
Real Life Ajax Troubleshooting Guide [31]
Slow Page Load Time in Firefox Caused by Old Versions of YUI, jQuery, and Other Frameworks [40]
Step by Step Guide: Comparing Page Load Time of US Open across Browsers [202]
Testing and Optimizing Single Page Web 2.0/AJAX Applications - Why Best Practices Alone Don't Work Any More [45]
The Impact of Garbage Collection on Java Performance [59]
Third Party Content Management Applied: Four Steps to Gain Control of Your Page Load Performance! [376]
Cassandra Write Performance - A Quick Look Inside [231]
Clouds on Cloud Nine: the Challenge of Managing Hybrid-Cloud Environments [388]
Goal-oriented Auto Scaling in the Cloud [193]
NoSQL or RDBMS? Are We Asking the Right Questions? [261]
Pagination with Cassandra, And What We Can Learn from It [352]
Performance of a Distributed Key Value Store, or Why Simple is Complex [342]

DevOps & Mobile
Application Performance Monitoring in Production - A Step-by-Step Guide - Measuring a Distributed System [98]
Application Performance Monitoring in Production - A Step-by-Step Guide Part 1 [77]
Automatic Error Detection in Production - Contact Your Users Before They Contact You [213]
Business Transaction Management Explained [275]
Field Report - Application Performance Management in WebSphere Environments [134]
How to Manage the Performance of 1000+ JVMs [366]
Top 8 Performance Problems on Top 50 Retail Sites before Black Friday [320]
Troubleshooting Response Time Problems - Why You Cannot Trust Your System Metrics [65]
Why Performance Management is Easier in Public than On-premise Clouds [166]
Why Response Times are Often Measured Incorrectly [176]
Why SLAs on Request Errors Do Not Work and What You Should Do Instead [257]
Why You Really Do Performance Management in Production [220]
You Only Control 1/3 of Your Page Load Performance! [295]
How Server-side Performance Affects Mobile User Experience [198]
Why You Have Less Than a Second to Deliver Exceptional Performance [304]

Automation & Tuning
Automated Cross Browser Web 2.0 Performance Optimizations: Best Practices from GSI Commerce [183]
dynaTrace in Continuous Integration - The Big Picture [20]
How to do Security Testing with Business Transactions - Guest Blog by Lucy Monahan from Novell [122]
Tips for Creating Stable Functional Web Tests to Compare across Test Runs and Browsers [112]
To Load Test or Not to Load Test: That is Not the Question [245]
White Box Testing - Best Practices for Performance Regression and Scalability Analysis [86]
Behind the Scenes of Serialization in Java [25]
How Garbage Collection Differs in the Three Big JVMs [145]
How to Explain Growing Worker Threads under Load [15]
Major GCs - Separating Myth from Reality [37]
The Cost of an Exception [72]
The Reason I Don't Monitor Connection Pool Usage [315]
The Top Java Memory Problems Part 1 [92]
The Top Java Memory Problems Part 2 [357]
Why Object Caches Need to be Memory-sensitive - Guest Blog by Christopher André [153]

5 Steps to Set Up ShowSlow as Web Performance Repository for dynaTrace Data
by Andreas Grabner

Alois Reitbauer has explained in detail how dynaTrace continuously monitors several thousand URLs and uploads the performance data to the public ShowSlow.com instance. More and more of our dynaTrace AJAX Edition Community Members are taking advantage of this integration in their internal testing environments. They either use Selenium, Watir or other functional testing tools to continuously test their web applications. They use dynaTrace AJAX Edition to capture performance metrics such as Time to First Impression, Time to Fully Loaded, Number of Network Requests or Size of the Site. ShowSlow is then used to receive those performance beacons, store them in a repository and provide a nice Web UI to analyze the captured data over time.
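As described later in this article, such a beacon is simply a JSON document sent via HTTP POST, so any endpoint that accepts a POST can act as a receiver. The following is a minimal, hypothetical Java sketch of such a receiver, only to illustrate the flow; ShowSlow itself is a PHP application, and the port, path and file-based storage used here are made-up examples.

import com.sun.net.httpserver.HttpServer;
import java.io.*;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class BeaconEndpoint {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/beacon/dynatrace", exchange -> {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // only POSTed beacons are accepted
                return;
            }
            // Read the JSON payload sent by the testing tool
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(exchange.getRequestBody(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            // The "repository" here is just an append-only file; a real setup would use a database
            Files.write(Paths.get("beacons.log"), body.toString().getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            exchange.sendResponseHeaders(200, -1);
        });
        server.start();
        System.out.println("Listening for beacons on http://localhost:8081/beacon/dynatrace");
    }
}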
The following illustration shows a graph from the public ShowSlow instance that contains performance results for a tested website over a period of several months.

As we received several questions regarding installation and setup of this integration I thought it is time to write a quick step-by-step guide on how to use a private ShowSlow instance and dynaTrace AJAX Edition in your test environment. I just went through the whole installation process on my local Windows 7 installation and want to describe the steps I've taken to get it running.

Step 1: Download Software
Sergey, the creator of ShowSlow, provides a good starting point for our installation: http://www.showslow.org/Installation_and_configuration
I started by downloading the latest ShowSlow version, Apache 2.2, the PHP5 binaries for Windows and MySQL. If you don't have dynaTrace AJAX Edition yet, also go ahead and get it from our AJAX Edition web site.

Step 2: Installing Components
I have to admit I am not a pro when it comes to Apache, PHP or MySQL, but even I managed to get it all installed and configured in minutes. Here are the detailed steps:

Analyze Performance Metrics from dynaTrace over time using ShowSlow as Repository

Initial Configuration of Apache
1. During the setup process I configured Apache to run on port 8080 in order to not conflict with my local IIS
2. Update Apache's httpd.conf to let the DocumentRoot point to my extracted ShowSlow directory
3. Enable all modules as explained in Sergey's Installation and Configuration description, that is mod_deflate, mod_rewrite and mod_expires

Installing the Database
1. Use the mysql command line utility and follow the instructions in Sergey's description. Before running tables.sql I had to manually switch to the ShowSlow database by executing the "use showslow" statement
2. Rename config.samples.php in your ShowSlow installation directory to config.php and change the database credentials according to your installation

Configuring PHP
1. In my extracted PHP directory (c:\php) I renamed php-development.ini to php.ini
   1. Remove the comment for the two MySQL extensions php_mysql.dll and php_mysqli.dll
   2. Set the extension_dir to c:/php/ext
   3. If you want to use the WebPageTest integration you also need to remove the comment for the extension php_curl.dll
2. In order for PHP to work in Apache I had to add the following lines to httpd.conf, following these recommendations:
   1. LoadModule php5_module c:/php/php5apache2_2.dll (at the end of the LoadModule section)
   2. AddType application/x-httpd-php .php (at the end of IfModule)
   3. AddType application/x-httpd-php .phtml (at the end of IfModule)
   4. PHPIniDir c:/php (at the very end of the config file)
   5. Change DirectoryIndex from index.html to index.php to default to this file

Step 3: Launching ShowSlow
Now you can either run Apache as a Windows service or simply start httpd.exe from the command line. When you open the browser and navigate to http://localhost:8080 you should see the following:

ShowSlow running on your local machine

Step 4: Configure dynaTrace AJAX Edition
dynaTrace AJAX Edition is configured to send performance data to the public ShowSlow instance by default.
This can be changed by modifying dtajax.ini (located in your installation directory) and adding the following parameters to it:

-Dcom.dynatrace.diagnostics.ajax.beacon.uploadurl=http://localhost:8080/beacon/dynatrace
-Dcom.dynatrace.diagnostics.ajax.beacon.portalurl=http://localhost:8080/

These two parameters allow you to manually upload performance data to your local ShowSlow instance through the Context Menu in the Performance Report.
dynaTrace AJAX Edition will prompt you before the upload actually happens in order to avoid an accidental upload. After the upload to the uploadUrl you will also be prompted to open the actual ShowSlow site. If you click Yes, a browser will be opened navigating to the URL configured in portalUrl, which in our case is our local ShowSlow instance. Now we will see our uploaded result:
Manually upload a result to ShowSlow

Step 5: Automation
The goal of this integration is not to manually upload the results after every test run but to automate this process. There is an additional parameter that you can configure in dtajax.ini:

-Dcom.dynatrace.diagnostics.ajax.beacon.autoupload=true

After restarting dynaTrace AJAX Edition the performance beacon will be sent to the configured ShowSlow instance once a dynaTrace Session is completed. What does that mean? When you manually test a web page using dynaTrace AJAX Edition, or if you use a functional testing tool such as Selenium in combination with dynaTrace AJAX Edition, a dynaTrace Session is automatically recorded. When you or the test tool closes the browser the dynaTrace Session gets completed and moves to the stored session folder. At this point dynaTrace AJAX Edition automatically sends the performance beacon to the configured ShowSlow instance.

Uploaded data visible in ShowSlow under Last Measurements

If you want to know more about how to integrate tools such as Selenium with dynaTrace then read these blogs: How to Use Selenium with dynaTrace, 5 Steps to Use Watir with dynaTrace.

Want more data and better automation support?
The integration with ShowSlow is a great way to automate your performance analysis in a continuous manner. The performance beacon that gets sent to ShowSlow can of course also be used by any other tool. The beacon is a JSON-formatted object that gets sent to the configured endpoint via HTTP POST. Feel free to write your own endpoint listener if you wish to do so. dynaTrace also offers a solution that extends what is offered in dynaTrace AJAX Edition and ShowSlow. If you need more metrics and better automation support check out Web Performance Automation. If you want to analyze more than just what's going on in the browser check out Server-Side Tracing, End-to-End Visibility. If you want to become more proactive in identifying and reacting to performance regressions check out Proactive Performance Management.

How to Explain Growing Worker Threads under Load
by Andreas Grabner

I recently engaged with a client who ran an increasing load test against their load-balanced application. I got involved because they encountered a phenomenon they couldn't explain - here is an excerpt of the email: "We have a jBoss application with mySQL that runs stable in a load testing environment with let's say 20 threads. At one point this suddenly changes and jBoss uses up to 100 threads for handling requests (or whatever the configured max number of threads in jBoss might be). Until now we have not been able to pinpoint what causes this issue." I requested their collected performance metrics to take a quick look at it. Here were my steps to analyze their problem.

Step 1: Verify what they said
Of course I trusted what I read in the email - but it is always good to confirm and verify. They subscribed several jBoss measures such as Thread Pool Current Threads Busy. Charting them verified what they said:
Looking at the busy worker threads shows us the load behavior on the load-balanced web servers

The initial high number of worker threads is probably caused by the application having to warm up - meaning that individual requests were slow, so more worker threads were needed to handle the initial load. Throughout the test the second application server constantly had more threads than the first. At the end of the test we see that both servers spike in load, with server 2 maxing out its worker threads. This confirms what they told me in the email. Let's see why that is.

Step 2: Correlate with other metrics
The next thing I do is to look at actual web request throughput, CPU utilization and response times of requests. I chart these additional metrics on my dashboard. These metrics are provided by the application server or by the performance management solution we used in this case. Let's look at the dashboard, including the red marks that I want to focus on:
Correlating thread count with CPU, throughput and response time
The top graph shows the busy worker threads again, with a red mark around the problematic timeframe. The second graph shows the number of successful web requests - which is basically the throughput. It seems that the throughput increased in the beginning, which is explained by the increasing workload. Already before we see the worker threads go up we can see that throughput stops increasing and stays flat. On server 2 (brown line) we even see a drop in throughput until the end of the test. So even though we have more worker threads we actually have fewer requests handled successfully.

The CPU graph now actually explains the root cause of the problem. I chose to split the chart to have a separate graph for each application server. Both servers max out their CPU at the point that I marked in red. This correlates with the time when we see the throughput becoming flat. It is also the time when we see more worker threads being used. The problem though is that new worker threads won't help to handle the still increasing load because the servers are simply out of CPU.

The bottom graph shows the response time of those requests that are handled successfully, again split by the two application servers. The red area shows the time spent on CPU, the blue area shows the total execution time. We can again observe that the contribution of the CPU plateaus once we have reached the CPU limit. From there on we only see increased execution time. This is time where the application needs to wait on resources such as I/O, network or even CPU cycles. Interesting here is that the second application server has a much higher execution time than application server 1. I explain this by the load balancer sending all these additional requests to the second server, which results in more worker threads which all compete for the same scarce resources.

Summarizing the problems: it seems we have a CPU problem in combination with a load balancer configuration problem. As we run out of CPU the throughput stalls and execution times get higher, but the load balancer still forwards incoming requests to the already overloaded application servers. It also seems that the load balancer is distributing the load unevenly, causing even more problems on server 2.

Is it really a CPU Problem?
Based on the analysis it seems that CPU is our main problem here. The question is whether there is something we can do about it (by fixing a problem in the application), or whether we just reached the limits of the hardware used (fixable by adding more machines). The customer uses dynaTrace with a detailed level of instrumentation, which allows me to analyze which methods consume most of the CPU. The following illustration shows what we call the Methods Dashlet. It allows me to look at those method executions that have been traced while executing the load test. Sorting by CPU Time shows me that there are two methods called getXML which spend most of the time in CPU:
There are two different implementations of getXML that consume most of the CPU

The first getXML method consumes by far more CPU than all other methods combined. We need to look at the code to see what is going on in there. Looking at the parameters I assume it is reading content from a file and then returning it as a String. File access explains the difference between Execution Time and CPU Time. The CPU Time is then probably spent processing the file and generating an XML representation of the content. Inefficient usage of an XML parser or String concatenations would be a good explanation. It seems like we have a good chance to optimize this method, save CPU and then be able to handle more load on that machine.

What about these spikes on server 2?
In the response time chart we could see that server 2 had a totally different response time behavior than server 1, with very high execution times. Let's take a closer look at some of these long running requests. The following illustration shows parts of a PurePath. A PurePath is the transactional trace of one request that got executed during the load test. In this case it was a request executed against server 2 taking a total of 68s to execute. It seems it calls an external web service that returns invalid content after 60 seconds. I assume the web service simply ran into a timeout and after 60 seconds returned an error HTML page instead of SOAP content:

External Web Service Call runs into a timeout and returns an error page instead of SOAP Content

Looking at other PurePaths from server 2 it seems that many transactions ran into the same problem, causing the long execution times. The PurePath contains additional context information such as method arguments and parameters passed to the web service. This information can be used when talking with the external service provider to narrow down the problematic calls.

Conclusion
The most important thing to remember is to monitor enough metrics: CPU, I/O, memory, execution times, throughput, application server specifics, and so on. These metrics allow you to figure out whether you have problems with CPU or other resources. Once that is figured out you need more in-depth data to identify the actual root cause. We have also learned that it is not always our own application code that causes problems. External services or frameworks that we use are also places to look.

dynaTrace in Continuous Integration - The Big Picture
by Andreas Grabner

Agile development practices have been widely adopted in R&D organizations. A core component is Continuous Integration, where code changes are continuously integrated and tested to achieve the goal of having potentially shippable code at the end of every Sprint/Iteration. In order to verify code changes, Agile team members write unit or functional tests that get executed against every build and every milestone. The results of these tests tell the team whether the functionality of all features is still there and that the recent code changes have not introduced a regression.

Verify for Performance, Scalability and Architecture
Now we have all these JUnit, NUnit, TestNG, Selenium, WebDriver, Silk or QTP tests that verify the functionality of the code.
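Such a functional test typically looks something like the following minimal sketch - a hypothetical JUnit 4 test driving Selenium WebDriver; the URL, element locators and assertion are invented for illustration and are not from the article's demo application.

import static org.junit.Assert.assertTrue;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ProductSearchTest {

    private WebDriver browser;

    @Before
    public void openBrowser() {
        browser = new FirefoxDriver();
    }

    @Test
    public void searchReturnsMatchingProducts() {
        // Functional check only: does the feature still work?
        browser.get("http://localhost:8080/shop/");             // hypothetical application URL
        browser.findElement(By.name("query")).sendKeys("tent"); // hypothetical search field
        browser.findElement(By.id("searchButton")).click();     // hypothetical button id
        assertTrue(browser.findElement(By.id("results")).getText().contains("tent"));
    }

    @After
    public void closeBrowser() {
        browser.quit();
    }
}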
By adding dynaTrace to the Continuous Integration process these existing tests automatically verify performance, scalability and architectural rules. Besides knowing that (for example) the Product Search feature returns the correct result, we want to know:

How much CPU and network bandwidth does it take to execute the search query?
How many database statements are executed to retrieve the search result?
Will product images be cached by the browser?
How does JavaScript impact the Page Load Time in the browser?
Did the last code change affect any of these performance, scalability or architectural rules?

dynaTrace analyzes all unit and browser tests and validates execution characteristics such as number of database statements, transferred bytes, cache settings and so on against previous builds and test runs. In case there is a change (regression) the developers are notified about what has actually changed.

dynaTrace automatically detects abnormal behavior on all subscribed measures, e.g. execution time, number of database statements, number of JavaScript files, ...

Automatically validate rules such as number of remoting calls or number of bytes transferred

Compare the difference between the problematic and the Last Known Good run

Besides providing confidence about functionality, these additional checks ensure that the current code base performs, scales and adheres to architectural rules.

Step-by-Step Guide to Enable dynaTrace in CI
In order to integrate dynaTrace we need to modify the CI process. The following is a high-level, step-by-step guide that explains all steps in a typical CI environment. When a new build is triggered the build server executes an Ant, NAnt, Maven, or any other type of automation script that will execute the following tasks:
1. Check out the current code base
2. Generate a new build number and then compile the code
3. (dynaTrace) Start Session Recording (through REST)
4. (dynaTrace) Set Test Meta Data Information (through REST)
5. Execute JUnit, Selenium, ... tests (the dynaTrace Agent gets loaded into the test execution process)
6. (dynaTrace) Stop Session Recording (through REST)
7. Generate test reports including dynaTrace results (through REST)

dynaTrace provides an Automation Library with both a Java and a .NET implementation to call the dynaTrace Server REST services. It also includes Ant, NAnt and Maven tasks that make it easy to add the necessary calls to dynaTrace. The Demo Application includes a fully configured sample including Ant, JUnit and Selenium. Once these steps are done dynaTrace will automatically identify unit and browser tests, learn the expected behavior of each individual test, and raise an incident when tests start behaving unexpectedly. Captured results of tests are stored in individual dynaTrace Sessions, which makes them easy to compare and share. More details on Test Automation can be found in the Online Documentation.

Conclusion
Take your Continuous Integration process to the next level by adding performance, scalability and architectural rule validations without the need to write any additional tests. This allows you to find more problems earlier in the development lifecycle, which will reduce the time spent in load testing and minimize the risk of production problems.

Behind the Scenes of Serialization in Java
by Alois Reitbauer

When building distributed applications, one of the central performance-critical components is serialization. Most modern frameworks make it very easy to send data over the wire.
In many cases you don't see at all what is going on behind the scenes. Choosing the right serialization strategy, however, is central for achieving good performance and scalability. Serialization problems affect CPU, memory, network load and response times.

Java provides us with a large variety of serialization technologies. The actual amount of data which is sent over the wire can vary substantially. We will use a very simple sample where we send a firstname, lastname and birthdate over the wire. Then we'll see how big the actual payload gets.

Binary Data
As a reference we start by sending only the payload. This is the most efficient way of sending data, as there is no overhead involved. The downside is that due to the missing metadata the message can only be de-serialized if we know the exact serialization method. This approach also has a very high testing and maintenance effort and we have to handle all implementation complexity ourselves. The figure below shows what our payload looks like in binary format.
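For illustration, writing such a raw payload by hand could look roughly like the sketch below; the field order, the use of DataOutputStream and the epoch-millis encoding of the birthdate are assumptions, not the exact layout shown in the figure.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawPayload {

    // Writes firstname, lastname and birthdate with no metadata at all.
    // The receiver must know this exact layout to be able to read it back.
    static byte[] encode(String firstName, String lastName, long birthDateMillis) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(firstName);        // 2 length bytes plus modified UTF-8 characters
        out.writeUTF(lastName);
        out.writeLong(birthDateMillis); // 8 bytes, instead of a full Calendar object
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = encode("Franz", "Musterman", 303350400000L); // roughly 1979-08-13 UTC, illustrative
        System.out.println("payload size: " + payload.length + " bytes");
    }
}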
Binary Representation of Address Entity

Java Serialization
Now we switch to standard serialization in Java. As you can see below we are now transferring much more metadata. This data is required by the Java runtime to rebuild the transferred object on the receiver side. Besides structural information the metadata also contains versioning information which allows communication across different versions of the same object. In reality, this feature often turns out to be more troublesome than it initially looks.

The metadata overhead in our example is rather high. This is caused by the large amount of data in the GregorianCalendar object we are using. The conclusion that Java serialization comes with a very high overhead per se, however, is not valid. Most of this metadata will be cached for subsequent invocations.

Person Entity with Java Serialization

Java also provides the ability to override serialization behavior using the Externalizable interface. This enables us to implement a more efficient serialization strategy. In our example we could serialize the birthdate as just a long rather than a full object. The downside again is the increased effort regarding testing and maintainability.

Java serialization is used by default in RMI communication when not using IIOP as a protocol. Application server providers also offer their own serialization stacks which are more efficient than default serialization. If interoperability is not important, provider-specific implementations are the better choice.

Alternatives
The Java ecosystem also provides interesting alternatives to Java serialization. A widely known one is Hessian, which can easily be used with Spring. Hessian allows an easy, straightforward implementation of services. Underneath it uses a binary protocol. The figure below shows our data serialized with Hessian. As you can see the transferred data is very slim. Hessian therefore provides an interesting alternative to RMI.
Hessian Binary Representation of Person Object
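To make the Externalizable point from above concrete, here is a minimal sketch of how our person entity could take over its own serialization and write the birthdate as a plain long; the class layout and field choice are assumed for illustration and are not the article's original code.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.Date;

public class Person implements Externalizable {

    private String firstName;
    private String lastName;
    private Date birthDate;

    public Person() { }  // a public no-arg constructor is required for Externalizable

    public Person(String firstName, String lastName, Date birthDate) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.birthDate = birthDate;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(firstName);
        out.writeUTF(lastName);
        out.writeLong(birthDate.getTime()); // just a long instead of a full Calendar/Date object graph
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        firstName = in.readUTF();
        lastName = in.readUTF();
        birthDate = new Date(in.readLong());
    }
}

The trade-off mentioned above applies directly: the payload shrinks considerably, but every field now has to be written, read and tested by hand.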
JSON
A newcomer in serialization formats is JSON (JavaScript Object Notation). Originally used as a text-based format for representing JavaScript objects, it's increasingly been adopted in other languages as well. One reason is the rise of Ajax applications, but also the availability of frameworks for most programming languages. As JSON is a purely text-based representation it comes with a higher overhead than the serialization approaches shown previously. The advantage is that it is more lightweight than XML and it has good support for describing metadata. The listing below shows our person object represented in JSON.

{"firstName":"Franz", "lastName":"Musterman", "birthDate":"1979-08-13"}

XML
XML is the standard format for exchanging data in heterogeneous systems. One nice feature of XML is out-of-the-box support for data validation, which is especially important in integration scenarios. The amount of metadata, however, can become really high depending on the mapping used. All data is transferred in text format by default. However, the usage of CDATA tags enables us to send binary data. The listing below shows our person object in XML. As you can see the metadata overhead is quite high.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person>
  <birthDate>1979-08-13T00:00:00-07:00</birthDate>
  <firstName>Franz</firstName>
  <lastName>Musterman</lastName>
</person>

Fast InfoSet
Fast InfoSet is becoming a very interesting alternative to XML. It is more or less a lightweight version of XML which reduces unnecessary overhead and redundancies in data. This leads to smaller data sets and better serialization and deserialization performance. When working with JAX-WS 2.0 you can enable Fast InfoSet serialization by using the @FastInfoset annotation. Web service stacks then automatically detect whether it can be used for cross-service communication using HTTP Accept headers.

When looking at data serialized using Fast InfoSet the main difference you will notice is that there are no end tags. After their first occurrence, tags are only referenced by an index. There are a number of other indexes for content, namespaces etc. Data is prefixed with its length. This allows for faster and more efficient parsing. Additionally, binary data can avoid being serialized in base64 encoding as it would be in XML. In tests with standard documents the transfer size could be shrunk down to only 20 percent of the original size and the serialization speed could be doubled. The listing below shows our person object now serialized with Fast InfoSet. For illustration purposes I skipped the processing instructions and used a textual representation instead of binary values. Values in curly braces refer to indexed values. Values in brackets refer to the use of an index.

{0}<person>
{1}<birthDate>{0}1979-08-13T00:00:00-07:00
{2}<firstName>{1}Franz
{3}<lastName>{2}Musterman

The real advantage can be seen when we look at what the next address object would look like. As the listing below shows we can work mostly with index data only.

[0]<>
[1]<>{0}
[2]<>{3}Hans
[3]<>{4}Musterhaus

Object Graphs
Object graphs can be quite tricky to serialize. This form of serialization is not supported by all protocols. As we need to work with references to entities, the language used by the serialization approach must provide a proper language construct. While this is no problem in serialization formats which are used for (binary) RPC-style interactions, it is often not supported out of the box by text-based protocols.
XML itself, for example, supports serializing object graphs using references. The WS-I, however, forbids the usage of the required language construct. If a serialization strategy does not support this feature it can lead to performance and functional problems, as entities get serialized individually for each occurrence of a reference. If we are, for example, serializing addresses which refer to country information, this information will be serialized for each and every address object, leading to large serialization sizes.

Conclusion
Today there are numerous variants to serialize data in Java. While binary serialization remains the most efficient approach, modern text-based formats like JSON or Fast InfoSet provide valid alternatives, especially when interoperability is a primary concern. Modern frameworks often allow using multiple serialization strategies at the same time, so the approach can even be selected dynamically at runtime.

Real Life Ajax Troubleshooting Guide
by Andreas Grabner

One of our clients occasionally runs into the following problem with their web app: they host their B2B web application in their East Coast data center, with their clients accessing the app from all around the United States. Occasionally they have clients complain about bad page load times or that certain features just don't work in their browsers. When the problem can't be reproduced in-house and all of the usual suspects (problem with the internet connection, faulty proxy, user error, and so on) are ruled out, they actually have to fly out an engineer to the client to analyze the problem on-site. That's a lot of time and money spent to troubleshoot a problem.

Capturing Data from the End User
In one recent engagement we had to work with one of their clients on the West Coast complaining that they could no longer log in to the application. After entering username and password and clicking the login button, the progress indicator shown while validating the credentials never goes away. The login worked fine when trying it in-house. The login also worked for other users in the same geographical region using the same browser version. They run dynaTrace on their application servers, which allowed us to analyze the requests that came from that specific user. No problems could be detected on the server side. So we ruled out all potential problems that we could identify from within the data center.

Instead of flying somebody to the West Coast we decided to use a different approach. We asked the user on the West Coast to install the dynaTrace Browser Agent. The Browser Agent captures similar data to dynaTrace AJAX Edition. The advantage of the agent is that it automatically ties into the backend. Requests by the browser that execute logic on the application server can be traced end-to-end, from the browser all the way to the database.

dynaTrace Timeline showing browser (JavaScript, rendering, network) and server-side activity (method executions and database statements)

The Timeline view as shown above gives us a good understanding of what is going on in the browser when the user interacts with a page.
Drilling into the details lets us see where time is spent, which methods are executed and where we might have a problem/exception:

End-to-end PurePath that shows what really happens when clicking on a button on a web page

Why the Progress Indicator Didn't Stop
In order to figure out why the progress indicator didn't stop spinning and therefore blocked the UI for this particular user, we compared the data of the user that experienced the problem with the data from a user that had no problems. From a high level we compared the Timeline views.

Identifying the general difference by comparing the two Timeline Views

Both Timelines show the mouse click which ultimately results in sending two XHR requests. In the successful case we can see a long running JavaScript block that processes the returned XHR response. In the failing case this processing block is very short (only a few milliseconds). We could also see that in the failing case the progress indicator was not stopped, as we can still observe the rendering activity that updates the rotating progress indicator.

In the next step we drilled into the response handler of the second XHR request as that's where we saw the difference. It turned out that the XHR response was an XML document and the JavaScript handler used an XML DOM parser to parse the response and then iterate through nodes that match a certain XPath query:

JavaScript loads the XML response and iterates through the DOM nodes using an XPath expression

The progress indicator itself was hidden after this loop. In the successful case we saw the hideProgressIndicator() method being called; in the failing one it wasn't. That brought us to the conclusion that something in the load function above caused the JavaScript to fail.

Wrong XML Encoding Caused the Problem
dynaTrace not only captures JavaScript execution but also captures network traffic. We looked at the two XML responses that came back in the successful and failing cases. Both XML documents were about 350k in size with very similar content. Loading the two documents in an XML editor highlighted the problem. In the problematic XML document certain special characters such as German umlauts were not encoded correctly. This caused the dom.loadXML function to fail and exit the method without stopping the progress indicator.

Incorrect encoding of umlauts and other special characters caused the problem in the XML Parser

As there was no proper error handling in place, this problem never made it to the surface in the form of an error message.

Conclusion
To troubleshoot problems it is important to have as much information at hand as possible. Deep dive diagnostics as we saw in this use case is ideal as it makes it easy to spot the problem and therefore allows us to fix problems faster. Want to know more about dynaTrace and how we support web performance optimization from development to production? Then check out the following articles: Best Practices on Web Performance Optimization, dynaTrace in Continuous Integration - The Big Picture, How to Integrate dynaTrace with your Selenium Tests.

Major GCs - Separating Myth from Reality
by Michael Kopp

In a recent post we showed how the Java Garbage Collection MXBean counters have changed for the Concurrent Mark-and-Sweep collector. It now reports all GC runs instead of just major collections. That prompted me to think about what a major GC actually is, or what it should be. It is actually quite hard to find any definition of major and minor GCs.
This well-known Java Memory Management Whitepaper only mentions in passing that a full collection is sometimes referred to as a major collection.

Stop-the-world
One of the more popular definitions is that a major GC is a stop-the-world event. While that is true, the reverse is not. It is often forgotten that every single GC, even a minor one, is a stop-the-world event. Young generation collections are only fast if there is a high mortality rate among young objects. That's because they copy the few surviving objects, and the number of objects to check is relatively small compared to the old generation. In addition they are done in parallel nowadays. But even the Concurrent GC has to stop the JVM during the initial mark and the remark. That brings us immediately to the second popular definition.

The Old Generation GC
Very often GC runs in the old generation are considered major GCs. When you read the tuning guides or other references, GC in the tenured or old generation is often equated with a major GC. While every major GC cleans up the old generation, not all runs can be considered major. The CMS (Concurrent Mark and Sweep) was designed to run concurrently to the application. It executes more often than the other GCs and only stops the application for very short periods of time. Until JDK6 Update 23 its runs were not reported via its MXBean. Now they are, but the impact on the application has not changed and for all intents and purposes I would not consider them major runs. In addition, not all JVMs have a generational heap; IBM and JRockit both feature a continuous heap by default. We would still see GC runs that we would consider either minor or major. The best definition that we could come up with is that a major GC stops the world for a considerable amount of time and thus has a major impact on response time. With that in mind there is exactly one scenario that fits all the time: a Full GC.

Full GC
According to the aforementioned whitepaper a Full GC will be triggered whenever the heap fills up. In such a case the young generation is collected first, followed by the old generation. If the old generation is too full to accept the content of the young generation, the young generation GC is omitted and the old generation GC is used to collect the full heap, either in parallel or serially. Either way the whole heap is collected with a stop-the-world event. The same is true for a continuous heap strategy, as apart from the concurrent strategy every GC run is a Full GC!

In the case of the concurrent GC the old generation should never fill up. Hence it should never trigger a major GC, which is of course the desired goal. Unfortunately the concurrent strategy will fail if too many objects are constantly moved into the old generation, the old generation is too full or there are too many allocations altogether. In that case it will fall back on one of the other strategies, and in the case of the Sun JVM will use the Serial Old collector. This in turn will of course lead to a collection of the complete heap. This was exactly what was reported via the MXBean prior to Update 23. Now we have a good and useful definition of a major GC.
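If you want to watch collections yourself, the standard GarbageCollectorMXBeans are one starting point. The sketch below simply polls the cumulative collection count and time per collector; note that what each bean counts as a "collection" depends on the JVM and the chosen collector, which is exactly the caveat discussed next.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> lastCount = new HashMap<>();
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long count = gc.getCollectionCount();   // cumulative number of collections
                long time = gc.getCollectionTime();     // cumulative collection time in milliseconds
                Long previous = lastCount.put(gc.getName(), count);
                if (previous != null && count > previous) {
                    // Name is e.g. "PS MarkSweep" or "ConcurrentMarkSweep", depending on the collector in use
                    System.out.printf("%s: %d collections, %d ms total so far%n",
                            gc.getName(), count, time);
                }
            }
            Thread.sleep(1000);
        }
    }
}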
Unfortunately, since JDK6 Update 23 we cannot monitor for a major GC in the case of the concurrent strategy anymore. It should also be clear by now that monitoring for major GCs might not be the best way to identify memory problems, as it ignores the impact minor GCs have. In one of my next posts I will show how we can measure the impact garbage collection has on the application in a better way.

Slow Page Load Time in Firefox Caused by Old Versions of YUI, jQuery, and Other Frameworks
by Andreas Grabner

We blogged a lot about performance problems in Internet Explorer caused by the missing native implementation of getElementsByClassName (in 101 on jQuery Selector Performance, 101 on Prototype CSS Selectors, Top 10 Client Side Performance Problems and several others). Firefox, on the other hand, has always implemented this and other native lookup methods. This results in much faster page load times on many pages that rely on lookups by class name in their onLoad JavaScript handlers. But this is only true if the web page also takes advantage of these native implementations.

Yahoo News with 1 Second of CSS Class Name Lookups per Page
Looking at a site like Yahoo News shows that this is not necessarily the case. The following screenshot shows the JavaScript/Ajax Hotspot analysis from dynaTrace AJAX Edition for http://news.yahoo.com on Firefox 3.6:

6 element lookups by class name result in about 1 second of pure JavaScript execution time

The screenshot shows the calls to getElementsByClassName doing lookups for classes such as yn-menu, dynamic-_ad or filter-controls. Even though Firefox supports a native implementation of CSS class name lookups, it seems that YUI 2.7 (which is used on this page) does not take advantage of this. When we drill into the PurePath for one of these calls we can see what these calls actually do and why it takes them about 150ms to return a result:

The YUI 2.7 implementation of getElementsByClassName iterates through 1451 DOM elements checking the class name of every element

How to Speed Up These Lookups?
There are two solutions to this specific problem: a) upgrade to a newer version of YUI or b) specify a tag name in addition to the class name.

Upgrade to a Newer Library Version
Framework developers have invested a lot in improving performance over the last couple of years. The guys from Yahoo did a great job in updating their libraries to take advantage of browser-specific implementations. Other frameworks such as jQuery did the same thing. Check out the blogs from the YUI Team or jQuery. If you use a different framework I am sure you will find good blogs or discussion groups on how to optimize performance on these frameworks. Coming back to YUI: I just downloaded the latest version and compared the implementation of getElementsByClassName from 2.7 (dom.js) to 3.3 (compat.js). There has been a lot of change between the version that is currently used at Yahoo News and the latest version available. Changing framework versions is not always as easy as just using the latest download, as a change like this involves a lot of testing, but the performance improvements are significant and everybody should consider upgrading to newer versions whenever it makes sense.
Specify Tag Name Additionally to Class Name
The following text is taken from the getElementsByClassName documentation of YUI 2.7: "For optimized performance, include a tag and/or root node when possible." Why does this help? Instead of iterating through all DOM elements, YUI can query elements by tag name first (using the native implementation of getElementsByTagName) and then only iterate through this subset. This works if the elements you query are of the same type. On all the websites I've analyzed, the majority actually query elements of the same type. Also, if you are just looking for elements under a certain root node, specify the root node, e.g. a root DIV element of your dynamic menus. Implementing this best practice should be fairly easy. It doesn't require an upgrade to a newer framework version but will significantly improve JavaScript execution time.

Conclusion: Stay Up-to-date With Your Frameworks
To sum this blog post up: follow the progress of the frameworks you are using. Upgrade whenever possible and whenever it makes sense. Also stay up-to-date with blogs and discussion forums about these frameworks. Follow the framework teams on Twitter or subscribe to their news feeds. If you are interested in dynaTrace AJAX Edition check out the latest Beta announcement and download it for free on the dynaTrace website.

Testing and Optimizing Single Page Web 2.0/AJAX Applications - Why Best Practices Alone Don't Work Any More
by Andreas Grabner

Testing and optimizing what I call traditional page-based web applications is not too hard to do. Take CNN as an example. You have the home page www.cnn.com. From there you can click through the news sections such as U.S., World, Politics, and many more - each click loading a new page with a unique URL. Testing this site is rather easy. Use Selenium, WebDriver or even an HTTP-based testing tool to model your test cases. Navigate to all these URLs and, if you are serious about performance and load time, use tools such as YSlow, PageSpeed or dynaTrace AJAX Edition (these tools obviously only work when testing is done through a real browser). These tools analyze page performance based on common Best Practices for every individually tested URL. This is great for traditional web sites but doesn't work anymore for modern Web 2.0 applications that provide most of their functionality on a single page. An example here would be Google and their apps like Search, Docs or GMail. Only a very small amount of time is actually spent in the initial page load. The rest is spent in JavaScript, XHR calls and DOM manipulations triggered by user actions on the same URL. The following illustration gives us an overview of what part of the overall user interaction time actually gets analyzed with current Best Practice approaches and which parts are left out:

The initial Page Load Time on Web 2.0 applications only contributes a small percentage to the overall perceived performance by the end user

Let me explain the differences between these traditional vs. modern web applications and give you some ideas on how to solve these new challenges.

Optimizing Individual Pages Has Become Straightforward - Anybody Can Do It
Let's look at CNN again.
For each URL it makes sense to verify the number of downloaded resources, the time until the onLoad event, the number of JavaScript files and the execution time of the onLoad event handlers. All these are things that contribute to the Page Load Time of a single URL. The following screenshot shows the timeline of the CNN start page:

Key Performance Indicators that can be optimized for individual pages by optimizing Network Roundtrips, JavaScript and Rendering

Now we can follow recommendations like reducing the number of images, JavaScript and CSS files.

Optimizing Network Resources and with that speeding up Page Load Time for single pages

Once you are through the recommendations your pages will most likely load faster. Job well done - at least for web sites that have a lot of static pages. But what about pages that leverage JavaScript, DOM manipulations and make use of XHR? You will only speed up the initial page load time, which might only be one percent of the time the user spends on your page.

Single Page Web 2.0 Applications - That's the Next Challenge!
I am sure you use Google on a regular basis, whether it is Google Search, Docs or Gmail. Work with these web apps and pay attention to the URL. It hardly ever changes even though you interact with the application by clicking on different links. Not all of these actions actually cause a new page with a different URL to be loaded. Let's look at a simple example: I open my local Google Search (in my case it is Google Austria) and see the search field in the middle of the page. Once I start typing a keyword two things happen:
1. the search field moves to the top and Google shows me Instant Results based on the currently entered fraction of the keyword
2. a drop-down box with keyword suggestions pops up

But the URL stays the same - I am still on www.google.at. Check out the following screenshot:

All the time while I execute actions on the site I stay on the same page

When we now look at the dynaTrace timeline we see all these actions and the browser activities corresponding to these actions:

Executing a Google Search includes several actions that are all executed on the same URL

The question that arises is, well, is the page performance good? Does Google follow their own Best Practices?

Analyzing Page Load Time is Not Enough
If we want to analyze this scenario with PageSpeed in Firefox we run into the problem that PageSpeed and YSlow are focused on optimizing Page Load Time. These tools just analyze the loading of a URL. In our Google scenario it is just the loading of the Google home page, which (no surprise) gets a really good score:

PageSpeed and YSlow only analyze activities when loading the initial page

Why doesn't this work? In this scenario we miss all the activities that happen on that page after the initial page was loaded.

Analyzing All Activities on a Page Delivers Misleading Results
On the other hand we have tools like dynaTrace AJAX Edition and Google SpeedTracer that not only look at the page load time but at all activities while the user interacts with the page. This is a step forward but can produce a misleading result. Let's look at the Google example once again: whenever I strike a key, an XHR request is sent to the Google Search servers, returning a JavaScript file that is used to retrieve suggestions and instant results. The longer my entered keyword, the more JavaScript files are downloaded. This is as designed for the action "Search something on Google" but it violates some Web Performance Best Practice rules.
That's why analyzing all activities on a single URL will lead to the wrong conclusions. Check out the following screenshot - it tells me that we have too many JavaScript files on that page:

12 JavaScript files from the same domain can be problematic when analyzing a single page load but not when analyzing a single-page Web 2.0 application

How to Test and Optimize Single Page Web 2.0 Applications?
Testing Web 2.0 applications is not as challenging as it used to be a couple of years back. Modern web testing tools, both commercial and open source, provide good support for Web 2.0 applications by actually driving a real browser instance, executing different actions such as loading a URL, clicking on a link or typing keys. I see a lot of people using Selenium or WebDriver these days. These open source tools work well for most scenarios and also work across various browsers. But depending on the complexity of the web site it is possible that you will find limitations and need to consider commercial tools that in general do a better job of simulating a real end user, e.g. really simulating mouse moves and keystrokes, and not just doing this through JavaScript injection or low-level browser APIs. For my Google Search example I will use WebDriver. It works well across Firefox and Internet Explorer. It gives me access to all DOM elements on a page, which is essential for me to verify if certain elements are on the page, whether they are visible or not (e.g. verifying if the suggestion drop-down box becomes visible after entering a key) and what values certain controls have (e.g. what are the suggested values in a Google Search).

Page Object Pattern for Action-based Testing
The following is the test case that I implemented using WebDriver. It is really straightforward. I open the Google home page. I then need to make sure we are logged in because Instant Search only works when you are logged in. I then go back to the main Google Search page and start entering a keyword. Instead of taking the search result of my entered keyword I pick a randomly suggested keyword:

Unit Test Case that tests the Google Search Scenario using the Page Object Pattern

The script is really straightforward and fairly easy to read. As you can see I implemented classes called GoogleHomePage, GoogleSuggestion or GoogleResultPage. These classes implement the actions that I want to execute in my test case. Let's look at the suggestion methods implemented on the GoogleHomePage class returning a GoogleSuggestion object:

In order to test the suggestion box we simulate a user typing in keys, then wait for the result and return a new object that handles the actual suggestions

The code of this method is again not that complicated. What you notice though is that I added calls to a dynaTrace helper class. dynaTrace AJAX Edition allows me to set Markers that will show up in the Timeline View (for details on this see the blog post Advanced Timing and Argument Capturing). The addTimerName method is a method included in the premium version of dynaTrace AJAX Edition which we will discuss in a little bit.

Timestamp-based Performance Analysis of Web 2.0 Actions
When I execute my test and instruct WebDriver to launch the browser with the dynaTrace-specific environment variables, dynaTrace AJAX Edition will automatically capture all browser activities executed by my WebDriver script. Read more on these environment variables in our forum post Automation with dynaTrace AJAX Edition.
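Since the script is only shown as a screenshot, here is a compressed sketch of how such a page-object-based test might be structured. The class names follow the description above; the element locators, the keyword and the dynaTrace helper calls shown as comments are assumptions, not the original code.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import java.util.List;
import java.util.Random;

// Page object for the Google Search start page
class GoogleHomePage {
    private final WebDriver browser;

    GoogleHomePage(WebDriver browser) {
        this.browser = browser;
        browser.get("http://www.google.at"); // the local Google Search used in the article
    }

    GoogleSuggestion typeKeyword(String keyword) {
        // dynaTrace.addTimerName("Start Suggestion");   // hypothetical helper call, as described above
        WebElement searchBox = browser.findElement(By.name("q"));
        for (char c : keyword.toCharArray()) {
            searchBox.sendKeys(Character.toString(c));   // simulate individual keystrokes
        }
        // dynaTrace.addTimerName("Stop Suggestion");
        return new GoogleSuggestion(browser);
    }
}

// Page object wrapping the suggestion drop-down box
class GoogleSuggestion {
    private final WebDriver browser;

    GoogleSuggestion(WebDriver browser) {
        this.browser = browser;
    }

    void clickRandomSuggestion() {
        // Locator is illustrative only; it assumes the suggestions are rendered as list items
        List<WebElement> suggestions = browser.findElements(By.cssSelector("ul li"));
        suggestions.get(new Random().nextInt(suggestions.size())).click();
    }
}

// The test itself then reads almost like the scenario description
public class GoogleSearchTest {
    public static void main(String[] args) {
        WebDriver browser = new org.openqa.selenium.firefox.FirefoxDriver();
        try {
            new GoogleHomePage(browser).typeKeyword("dynatrace").clickRandomSuggestion();
        } finally {
            browser.quit();
        }
    }
}

The point of the pattern is that the test reads like the scenario while the page objects hide the WebDriver details and are the natural place for the timer and marker calls.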
Lets have a look at the recorded dynaTrace AJAX session that we get from executing this script and at the Timeline that shows all the actions that were executed to get the suggestions, as well as clicking on a random link: The Markers in the Timeline also have a timestamp that allows us to measure performance of individual actions we executed The Timeline shows every marker that we inserted through the script. In addition to the two that are called Start/Stop Suggestion, I also placed a marker before clicking a random suggestion and placed another when the nal search result was rendered to the page. This timestamp-based approach is a step forward in tracking performance of individual actions. From here we can manually drill into the timeframe between two markers and analyze the network roundtrips, JavaScript executions and rendering activity. The problem that we still have though is that we cant really apply Best Practices such as #of Roundtrips as this number would be very action specic. The goal here must be to see how many requests we have per action and then track this over time. And thats what we want but we want it automated! Page 56 Action/Timer-based Analysis That Allows Automation Remember the calls to dynaTrace.addTimerName? This allows me to tag browser activities with a special name a timer name. In the premium extension of dynaTrace, activities are analyzed by these timer names allowing me to not only track execution time of an action; it allows me to track all sorts of metrics such as number of downloaded resources, execution time of JavaScript, number of XHR requests, and so on. The following screenshot shows the analysis of a single test run focusing on one of the actions that I named according to the action in my test script: Page 57 Key Performance Indicators by Timer Name (=Action in the Test Script) This allows us to see how many network requests, JavaScript executions, XHR Calls, etc. we have per action. Based on these numbers we can come up with our own Best Practice values for each action and verify that we meet these numbers for every build we test avoiding regressions. The following screenshot shows which measures dynaTrace allows us to track over time: Multiple Key Performance Indicators can be subscribed and tracked over time Instead of looking at these metrics manually dynaTrace supports us with automatically detecting regressions on individual metrics per Timer Name (=Action). If I run this test multiple times dynaTrace will learn the expected values for a set of metrics. If metrics fall outside the expected value range I get an automated notication. The following screenshot shows how over time an expected value range will be calculated for us. If values fall out of this range we get a notication: Automatically identify regressions on the number of network resources downloaded for a certain user action Page 58 If you are interested in more check out my posts on dynaTrace in CI The Big Picture and How to Integrate dynaTrace with Selenium Conclusion: Best Practices Only Work on Page Load Time -Not on Web 2.0 Action-based Applications It is very important to speed up Page Load Time dont get me wrong. It is the initial perceived performance by a user who interacts with your site. But it is not all we need to focus on. Most of the time in modern web applications is spent in JavaScript, DOM manipulations, XHR calls and rendering that happen after the initial page load. 
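To make the idea of automated regression detection on per-action metrics concrete, here is a minimal corridor check. It is not the dynaTrace implementation, just an illustration: the expected range is derived from the mean and standard deviation of previous runs, and a new measurement outside that range raises a flag.

    // Minimal corridor check: flag a measurement that falls outside the range
    // learned from previous test runs (mean +/- k * standard deviation).
    public class RegressionCorridor {

        static boolean isRegression(double[] previousRuns, double newValue, double k) {
            double sum = 0;
            for (double v : previousRuns) {
                sum += v;
            }
            double mean = sum / previousRuns.length;

            double squaredDiffs = 0;
            for (double v : previousRuns) {
                squaredDiffs += (v - mean) * (v - mean);
            }
            double stdDev = Math.sqrt(squaredDiffs / previousRuns.length);

            return Math.abs(newValue - mean) > k * stdDev;
        }

        public static void main(String[] args) {
            // e.g. number of resources downloaded for the "Search" action in past builds
            double[] resourceCounts = { 12, 13, 12, 14, 13 };
            System.out.println(isRegression(resourceCounts, 25, 3)); // outside the range -> alert
            System.out.println(isRegression(resourceCounts, 13, 3)); // within the range
        }
    }

The same check applies whether the metric is an action's duration, its resource count or its number of XHR calls.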
Automatic verication against Best Practices wont work here anymore because we have to analyze individual user actions that do totally different things. The way this will work is to analyze the individual user actions, track performance metrics and automate regression detection based on these measured values. Page 59 The Impact of Garbage Collection on Java Performance by Michael Kopp In my last post I explained what a major Garbage Collection is. While a major Collection certainly has a negative impact on performance it is not the only thing that we need to watch out for. And in the case of the CMS we might not always be able to distinguish between a major and minor GC. So before we start tuning the garbage collector we rst need to know what we want to tune for. From a high level there are two main tuning goals. Execution Time vs. Throughput This is the rst thing we need to clarify if we want to minimize the time the application needs to respond to a request or if we want to maximize the throughput. As with every other optimization these are competing goals and we can only fully satisfy one of them. If we want to minimize response time we care about the impact a GC has on the response time rst and on resource usage second. If we optimize for throughput we dont care about the impact on a single transaction. That gives us two main things to monitor and tune for: runtime suspension and Garbage Collection CPU usage. Regardless of which we tune for, we should always make sure that a GC run is as short as possible.But what determines the duration of GC run? 9 Week 9 Sun Mon Tue Wed Thu Fri Sat 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 3 10 17 24 31 4 11 18 25 5 12 19 26 The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 60 What makes a GC slow? Although it is called Garbage Collection the amount of collected garbage has only indirect impact on the speed of a run. What actually determines this is the number of living objects. To understand this lets take a quick look at how Garbage Collection works. Every GC will traverse all living objects beginning at the GC roots and mark them as alive. Depending on the strategy it will then copy these objects to a new area (Copy GC), move them (compacting GC) or put the free areas into a free list. This means that the more objects stay alive the longer the GC takes. The same is true for the copy phase and the compacting phase. The more objects stay alive, the longer it takes. The fastest possible run is when all objects are garbage collected! With this in mind lets have a look at the impact of garbage collections. Impact on Response Time Whenever a GC is triggered all application threads are stopped. In my last post I explained that this is true for all GCs to some degree, even for so called minor GCs. As a rule every GC except the CMS (and possibly the G1) will suspend the JVM for the complete duration of a run. Page 61 The easiest way to measure impact on the response time is to use your favorite tool to monitor for major and minor collections via JMX and correlate the duration with the response time of your application. The problem with this is that we only look at aggregates, so the impact on a single transaction is unknown. In this picture it does seem like there is no impact from the garbage collections. A better way of doing this is to use the JVM-TI interface to get notied about stop-the-world events. 
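As a rough sketch of the JMX-based approach just mentioned, the following snippet polls the garbage collector MXBeans and reports how much collection time was added per interval. It only illustrates where the numbers come from; a real monitoring setup would chart them and correlate them with response times.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Polls the garbage collector MXBeans and prints the delta of collection
    // count and accumulated collection time per interval.
    public class GcPoller {
        public static void main(String[] args) throws InterruptedException {
            long lastCount = 0;
            long lastTime = 0;
            while (true) {
                long count = 0;
                long time = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    count += gc.getCollectionCount();   // number of collections so far
                    time += gc.getCollectionTime();     // accumulated time in milliseconds
                }
                System.out.printf("GC runs: +%d, GC time: +%d ms%n",
                        count - lastCount, time - lastTime);
                lastCount = count;
                lastTime = time;
                Thread.sleep(10000); // polling interval - anything in between is invisible
            }
        }
    }

Because this is polling, everything that happens between two samples is invisible and suspensions cannot be attributed to individual transactions. Getting notified about each stop-the-world event, for example through the JVM-TI interface mentioned above, avoids that.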
This way the response time correlation is 100% correct, whereas otherwise it would depend on the JMX polling frequency. In addition, measuring the impact that the CMS has on response time is harder as its runs do not stop the JVM for the whole time and since Update 23 the JMX Bean does not report the real major GC anymore. In this case we need to use either verbose:gc or a solution like dynaTrace that can accurately measure runtime suspensions via a native agent technology. Page 62 Here we see a constant but small impact on average, but the impact on specic PurePaths is sometimes in the 10 percent range. Optimizing for minimal response time impact has two sides. First we need to get the sizing of the young generation just right. Optimal would be that no object survives its rst garbage collection, because then the GC would be fastest and the suspension the shortest possible. As this optimum cannot be achieved we need to make sure that no object gets promoted to old space and that an object dies as young as possible. We can monitor that by looking at the survivor spaces. This chart shows the survivor space utilization. It always stays well above 50% which means that a lot of objects survive each GC. If we were to look at the old generation it would most likely be growing, which is obviously not what we want. Getting the sizing right, also means using the smallest young generation possible. If it is too big, more objects will be alive and need to be checked, thus a GC will take longer. If after the initial warm-up phase no more objects get promoted to old space, we will not need to do any special tuning of the old generation. If only a few objects get promoted over time and we can take a momentary hit on response time once in a while we should choose a parallel collector in the old space, as it is very efcient and avoids some problems that the CMS has. If we cannot take the hit in response time, we need to choose the CMS. Page 63 The Concurrent Mark and Sweep Collector will attempt to have as little response time impact as possible by working mostly concurrently with the application. There are only two scenarios where it will fail. Either we allocate too many objects too fast, in which case it cannot keep up and will trigger an old-style major GC; or no object can be allocated due to fragmentation. In such a case a compaction or a full GC (serial old) must be triggered. Compaction cannot occur concurrently to the application running and will suspend the application threads. If we have to use a continuous heap and need to tune for response time we will always choose a concurrent strategy. CPU Every GC needs CPU. In the young generation this is directly related to the number of times and duration of the collections. In old space and a continuous heap things are different. While CMS is a good idea to achieve low pause time, it will consume more CPU, due to its higher complexity. If we want to optimize throughput without having any SLA on a single transaction we will always prefer a parallel GC to the concurrent one. There are two thinkable optimization strategies. Either enough memory so that no objects get promoted to old space and old generation collections never occur, or have the least amount of objects possible living all the time. It is important to note that the rst option does not imply that increasing memory is a solution for GC related problems in general. If the old space keeps growing or uctuates a lot than increasing the heap does not help, it will actually make things worse. 
While GC runs will occur less often, they will be that much longer, as more objects might need checking and moving. As GC becomes more expensive with the number of living objects, we need to minimize that factor.

Allocation Speed

The last and least known impact of a GC strategy is the allocation speed. While a young generation allocation will always be fast, this is not true in the old generation or in a continuous heap. In these two cases continued allocation and garbage collection lead to memory fragmentation. To solve this problem the GC will do a compaction to defragment the area. But not all GCs compact all the time or incrementally. The reason is simple: compaction would again be a stop-the-world event, which GC strategies try to avoid. The Concurrent Mark and Sweep collector of the Sun JVM does not compact at all. Because of that, these GCs must maintain a so-called free list to keep track of free memory areas. This in turn has an impact on allocation speed. Instead of just allocating an object at the end of the used memory, Java has to go through this list and find a big enough free area for the newly allocated object. This impact is the hardest to diagnose, as it cannot be measured directly. One indicator is a slowdown of the application without any other apparent reason, only to be fast again after the next major GC. The only way to avoid this problem is to use a compacting GC, which will lead to more expensive GCs. The only other thing we can do is to avoid unnecessary allocations while keeping the amount of memory usage low.

Conclusion

Allocate as much as you like, but forget as soon as you can - before the next GC run if possible. Don't overdo it either; there is a reason why using StringBuilder is more efficient than simple String concatenation. And finally, keep your overall memory footprint, and especially your old generation, as small as possible. The more objects you keep alive, the worse the GC will perform.

Week 10

Troubleshooting Response Time Problems - Why You Cannot Trust Your System Metrics by Michael Kopp

Production monitoring is about ensuring the stability and health of your system, including the application. A lot of times we encounter production systems that concentrate on system monitoring, under the assumption that a stable system leads to stable and healthy applications. So let's see what system monitoring can tell us about our application. Let's take a very simple two-tier web application:

A simple two tier web application

This is a simple multi-tier eCommerce solution. Users are concerned about bad performance when they do a search. Let's see what we can find out about it if performance is not satisfactory. We start by looking at a couple of simple metrics.

CPU Utilization

The best known operating system metric is CPU utilization, but it is also the most misunderstood. This metric tells us how much time the CPU spent executing code in the last interval and how much more it could execute theoretically. Like all other utilization measures it tells us something about capacity, but not about health, stability or performance. Simply put: 99% CPU utilization can either be optimal or indicate impending disaster, depending on the application.
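If you want to collect these numbers yourself rather than through a monitoring product, the JVM exposes some of them directly. The following sketch reads the number of available processors and the system load average (discussed next) via the standard OperatingSystemMXBean; it only shows where the values come from, not how to interpret them.

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    // Samples basic CPU-related system metrics from inside the JVM.
    // Note: getSystemLoadAverage() returns -1 on platforms where it is not
    // available (e.g. Windows, where PerfMon counters are used instead).
    public class CpuSampler {
        public static void main(String[] args) throws InterruptedException {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            int cores = os.getAvailableProcessors();
            while (true) {
                double loadAverage = os.getSystemLoadAverage();
                System.out.printf("cores=%d, 1-minute load average=%.2f%n", cores, loadAverage);
                Thread.sleep(5000);
            }
        }
    }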
The CPU charts show no shortage on either tier Lets look at our setup. We see that the CPU utilization is well below 100%, so we do have capacity left. But does that mean the machine or the application can be considered healthy? Lets look at another measure that is better suited for the job, the Load Average (System\Processor QueueLength on Windows). The Load Average tells us how many threads or processes are currently executed or waiting to get CPU time. Unix Top Output:load average: 1.31, 1.13, 1.10 Linux systems display three sliding load averages for the last one, ve and 15 minutes. The output above shows that in the last minute there were on average 1.3 processes that needed a CPU core at the same time. If the Load Average is higher than the number of cores in the system we should either see near 100% CPU utilization, or the system has to wait for other resources and cannot max out the CPU. Examples would be Swapping or other I/O related tasks. So the Load Average tells us if we should trust the CPU usage on the one hand and if the machine is overloaded on the other. It does not tell us how well the application itself is performing, but whether the shortage of CPU might impact it negatively. If we do notice a Page 67 problem we can identify the application that is causing the issue, but not why it is happening. In our case we see that neither the load average nor the CPU usage shines any light on our performance issue. If it were to show high CPU utilization or a high load average we could assume that the shortage in CPU is a problem, but we could not be certain. Memory Usage Memory use is monitored because lack of memory will lead to system instability. An important fact to note is that Unix and Linux operating systems will mostly show close to 100% memory utilization over time. They ll the memory up with buffers and caches which get discarded, as opposed to swapped out, if that memory is needed otherwise. In order to get the real memory usage we need subtract these. In Linux we can do so by using the free command. Memory Usage on the two systems, neither is suffering memory problems If we do not have enough memory we can try to identify which application consumes the most by looking at the resident memory usage of a process. Once identied we will have to use other means to identify why the process uses up the memory and whether this is OK. When we look at memory thinking about Java/.NET performance we have to make sure that the application itself is never swapped out. This is especially important because Java accesses all its memory in a random-access fashion and if a portion were to be swapped out it would have severe performance penalties. We can monitor this via swapping measures on the process Page 68 itself. So what we can learn here is whether the shortage of memory has a negative impact on application performance. As this is not the case, we are tempted to ignore memory as the issue. We could look at other measures like network or disk, but in all cases the same thing would be true, the shortage of a resource might have impact, but we cannot say for sure. And if we dont nd a shortage it does not necessarily mean that everything is ne. Databases An especially good example of this problem is the database. Very often the database is considered the source of all performance problems, at least by application people. From a DBA and operations point of view the database is often running ne though. 
Their reasoning is simple enough: the database is not running out of any resources, there are no especially long-running or CPU-consuming statements or processes, and most statements execute quite fast. So the database cannot be the problem. Let's look at this from an application point of view.

Looking at the Application

As users are reporting performance problems, the first thing we do is look at the response time and its distribution within our system. The overall distribution in our system does not show any particular bottleneck
Page 69 At rst glance we dont see anything particularly interesting when looking at the whole system. As users are complaining about specic requests lets go ahead and look at these in particular: The response time distribution of the specic request shows a bottleneck in the backend and a lot of database calls for each and every search request We see that the majority of the response time lies in the backend and the database layer. That the database contributes a major portion to the response time does not mean however that the DBA was wrong. We see that every single search executes 416 statements on average! That means that every statement is executing in under one millisecond and this is fast enough from the database point of view. The problem really lies within the application and its usage of the database. Lets look at the backend next. The heap usage and GC activity chart shows a lot of GC runs, but does it have a negative impact? Looking at the JVM we immediately see that it does execute a lot of garbage collection (the red spikes), as you would probably see in every monitoring tool. Although this gives us a strong suspicion, we do not know how this is affecting our users. So lets look at that impact: Page 70 These are the runtime suspensions that directly impact the search. It is considerable but still amounts to only 10% of the response time A single transaction is hit by garbage collection several times and if we do the math we nd out that garbage collection contributes 10% to the response time. While that is considerable it would make sense to spend a lot of time on tuning it just now. Even if we reduce it by half it will only save us 5% of the response time. So while monitoring garbage collection is important, we should always analyze the impact before we jump to conclusions. So lets take a deeper look at where that particular transaction is spending time on the backend. To do this we need to have application centric monitoring in place which we can then use to isolate the root cause. The detailed response time distribution of the search within the backend shows two main problems: too many EJB calls and a very slow doPost method
Page 71 With the right measuring points within our application we immediately see the root causes of the response time problem. At rst we see that the WebService call done by the search takes up a large portion of the response time. It is also the largest CPU hotspot within that call. So while the host is not suffering CPU problems, we are in fact consuming a lot of it in that particular transaction. Secondly we see that an awful lot of EJB calls are done which in turn leads to the many database calls that we have already noticed. That means we have identied a small memory-related issue; although there are no memory problems noticeable if we were to look only at system monitoring. We also found that we have a CPU hotspot, but the machine itself does not have a CPU problem. And nally we found that the biggest issue is squarely within the application; too many database and EJB calls, which we cannot see on a system monitoring level at all. Conclusion System metrics do a very good job at describing the environment - after all, that is what they are meant for. If the environment itself has resource shortages we can almost assume that this has a negative impact on the applications, but we cannot be sure. If there is no obvious shortage this does not, however, imply that the application is running smoothly. A healthy and stable environment does not guarantee a healthy, stable and performing application. Similar to the system, the application needs to be monitored in detail and with application-specic metrics in order to ensure its health and stability. There is no universal rule as to what these metrics are, but they should enable us to describe the health, stability and performance of the application itself. Page 72 11 Week 11 The Cost of an Exception by Alois Reitbauer Recently there was an extensive discussion at dynaTrace about the cost of exceptions. When working with customers we very often nd a lot of exceptions they are not aware of. After removing these exceptions, the code runs signicantly faster than before. This creates the assumption that using exceptions in your code comes with a signicant performance overhead. The implication would be that you had better avoid using exceptions. As exceptions are an important construct for handling error situations, avoiding exceptions completely does not seem to be good solution. All in all this was reason enough to have a closer look at the costs of throwing exceptions. Sun Mon Tue Wed Thu Fri Sat 3 10 17 24 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 73 The Experiment I based my experiment on a simple piece of code that randomly throws an exception. This is not a really scientically profound measurement and we also dont know what the HotSpot compiler does with the code as it runs. Nevertheless it should provide us with some basic insights. public class ExceptionTest {
    public long maxLevel = 20;

    public static void main(String... args) {
        ExceptionTest test = new ExceptionTest();
        long start = System.currentTimeMillis();
        int count = 10000;
        for (int i = 0; i < count; i++) {
            try {
                test.doTest(2, 0);
            } catch (Exception ex) {
                // ex.getStackTrace();
            }
        }
        long diff = System.currentTimeMillis() - start;
        System.out.println(String.format(
                "Average time for invocation: %1$.5f",
                ((double) diff) / count));
    }
    public void doTest(int i, int level) {
        if (level < maxLevel) {
            try {
                doTest(i, ++level);
            } catch (Exception ex) {
                // ex.getStackTrace();
                throw new RuntimeException("UUUPS", ex);
            }
        } else {
            if (i > 1) {
                throw new RuntimeException("Ups".substring(0, 3));
            }
        }
    }
}

The Result

The result was very interesting. The cost of throwing and catching an exception seems to be rather low. In my sample it was about 0.002 ms per exception. This can more or less be neglected unless you really throw too many exceptions - we are talking about 100,000 or more. While these results show that exception handling itself does not affect code performance, they leave open the question: what is responsible for the huge performance impact of exceptions? So obviously I was missing something: something important. After thinking about it again, I realized that I was missing an important part of exception handling: what you do when exceptions occur. In most cases you hopefully do not just catch the exception and that's it. Normally you try to compensate for the problem and keep the application functioning for your end users. So the point I was missing was the compensation code that is executed for handling an exception. Depending on what this code is doing, the performance penalty can become quite significant. In some cases it might mean retrying a connection to a server; in other cases it might mean falling back to a default solution that performs far worse. While this seemed to be a good explanation for the behavior we saw in many scenarios, I decided I was not done yet with the analysis. I had the feeling that there was something else I was missing.

Stack Traces

Still curious about this problem, I looked into how the situation changes when I collect stack traces. This is what very often happens: you log an exception and its stack trace to try to figure out what the problem is. I therefore modified my code to get the stack trace of an exception as well. This changed the situation dramatically. Getting the stack traces of exceptions had a 10x higher impact on performance than just catching and throwing them. So while stack traces help to understand where, and possibly also why, a problem has occurred, they come with a performance penalty. The impact here is often very great, as we are not talking about a single stack trace. In most cases exceptions are thrown and caught at multiple levels. Let us look at a simple example of a Web Service client connecting to a server. First there is an exception at the Java library level for the failed connection. Then there is a framework exception for the failed client, and then there might be an application-level exception that some business logic invocation failed. This totals three stack traces being collected. In most cases you should see them in your log files or application output. Writing these potentially long stack traces also comes with some performance impact. At least you normally see them and can react to them if you look at your log files regularly - you do look at your log files regularly, don't you? In some cases I have seen even worse behavior due to incorrect logging code. Instead of checking whether a certain log level is enabled by calling log.isXxxEnabled() first, developers just call the logging methods. When this happens, logging code is always executed, including getting stack traces of exceptions. As the log level is set too low, however, they never show up anywhere - you might not even be aware of them.
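The incorrect logging pattern just described is easy to demonstrate. The following sketch uses java.util.logging for illustration (Log4j and SLF4J offer isDebugEnabled() and similar guards); the unguarded call builds the message string and walks the stack trace even though nothing is ever written at that level:

    import java.util.Arrays;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Demonstrates the guarded-logging idiom. With java.util.logging the guard
    // is Logger.isLoggable(Level); other frameworks offer isDebugEnabled() etc.
    public class GuardedLogging {
        private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

        static void handle(Exception ex) {
            // Unguarded: the message string and the stack trace array are built
            // on every call, even if FINE is disabled and nothing is written.
            LOG.fine("Request failed: " + ex.getMessage()
                    + " at " + Arrays.toString(ex.getStackTrace()));

            // Guarded: the expensive work only happens when the level is enabled.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("Request failed: " + ex.getMessage()
                        + " at " + Arrays.toString(ex.getStackTrace()));
            }
        }

        public static void main(String[] args) {
            handle(new RuntimeException("connection refused"));
        }
    }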
Checking for log levels rst should be a general rule as it also avoids unnecessary object creation. Conclusion Not using exceptions because of their potential performance impact is a bad idea. Exceptions help to provide a uniform way to cope with runtime problems and they help to write clean code. You however need to trace the number of exceptions that are thrown in your code. Although they might be caught they can still have a signicant performance impact. In dynaTrace we, by default, track thrown exceptions and in many cases people are surprised by what is going on in their code and what the performance impact is in resolving them. While exception usage is good you should avoid capturing too many stack traces. In many cases they are not even necessary to understand the problem especially if they cover a problem you already expect. The exception message therefore might prove to be enough information. I get enough out of a connection refused message to not need the full stack trace into the internal of the java.net call stack. Page 77 12 Week 12 Application Performance Monitoring in Production A Step-by-Step Guide Part 1 by Michael Kopp Setting up application performance monitoring is a big task, but like everything else it can be broken down into simple steps. You have to know what you want to achieve and subsequently where to start. So lets start at the beginning and take a top-down approach. Know What You Want The rst thing to do is to be clear about what we want when monitoring the application. Lets face it: we do not want to ensure CPU utilization to be below 90 percent or a network latency of under one millisecond. We are also not really interested in garbage collection activity or whether the database connection pool is of our application and business services. To ensure that, we need to leverage all of the above-mentioned metrics.What does the health and stability of the application mean though? A healthy and stable application performs its function without errors and delivers accurate results within a predened satisfactory time frame. In technical terms this means low response time and/or high throughput and low to non-existant error rate. If we monitor and ensure this then the health and stability of the application is likewise guaranteed. Sun Mon Tue Wed Thu Fri Sat 3 10 17 24 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 78 Dene Your KPIs First we need to dene what satisfactory performance means. In case of an end-user facing application things like rst impression and page load time are good KPIs. The good thing is that satisfactory is relatively simple as the user will tolerate up to a 3-4 second wait but will get frustrated after that. Other interactions, like a credit card payment or a search have very different thresholds though and you need to dene them. In addition to response time, you also need to dene how many concurrent users you want, or need, to be able to serve without impacting the overall response time. These two KPIs, response time and concurrent users, will get you very far if you apply them on a granular enough level.If we are talking about a transaction oriented application your main KPI will be throughput. The desired throughput will depend on the transaction type. 
Most likely you will have a time window within which you have to process a certain known number of transactions, which dictates what satisfactory performance means to you. Resource and hardware usage can be considered secondary KPIs. As long as the primary KPI is not met, we will not look too closely at the secondary ones. On the other hand, as soon as the primary KPI is met optimizations must always be towards improving these secondary KPIs. If we take a strict top-down approach and measure end-to-end we will not need more detailed KPIs for response time or throughput. We of course need to measure in more detail than that in order to ensure performance. Page 79 Know What, Where and How to Measure In addition to specifying a KPI for e.g. the response time of the search feature we also need to dene where we measure it. The different places where we can measure response time This picture shows several different places where we can measure the response time of our application. In order to have objective and comparable measurements we need to dene where we measure it. This needs to be communicated to all involved parties. This way you ensure that everybody talks about the same thing. In general the closer you come to the end user the closer it gets to the real world and also the harder it is to measure. We also need to dene how we measure. If we measure the average we will need to dene how it is calculated. Averages themselves are alright if you talk about throughput, but very inaccurate for response time. The average tells you nearly nothing about the actual user experience, because it ignores volatility. Even if you are only interested in throughput volatility is interesting. It is harder to plan capacity for a highly volatile application than for one that is stable. Personally I prefer percentiles over averages, as they give us a good picture of response time distribution and thus volatility. Page 80 50th, 75th, 90th and 95th percentile of end user response time for page load In the above picture we see that the page load time of our sample has a very high volatility. While 50 percent of all page requests are loaded in 3 seconds, the slowest 10 percent take between 5 and 20 seconds! That not only bodes ill for our end user experience and performance goals, but also for our capacity management (wed need to over provision a lot to compensate). High volatility in itself indicates instability and is not desirable. It can also mean that we measure the response time with not enough granularity. It might not be enough to measure the response time of e.g. the payment transactions in general. For instance, credit card and d ebit card payment transactions might have very different characteristics so we should measure them separately. Without doing that type of measuring, response time becomes meaningless because we will not see performance problems and monitoring a trend will be impossible. Page 81 This brings us to the next point: what do we measure? Most monitoring solutions allow the monitoring either on an URL level, servlet level (JMX/ App Servers) or network level. In many cases the URL level is good enough as we can use pattern matching on specic URI parameters. Create measures by matching the URI of our Application and Transaction type For Ajax, WebService Transactions or SOA applications in general this will not be enough. WebService frameworks often provide a single URI entry point per application or service and distinguish between different business transactions in the SOAP message. 
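Before turning to how the transaction type can be identified inside the application, a quick note on the percentile calculation mentioned above: it is cheap to compute from raw response times, for example with the nearest-rank method. The values below are made up for illustration.

    import java.util.Arrays;

    // Computes percentiles from raw response times using the nearest-rank
    // method. Percentiles describe the distribution (and thus the volatility)
    // that an average hides.
    public class Percentiles {

        static double percentile(double[] responseTimes, double percent) {
            double[] sorted = responseTimes.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(percent / 100.0 * sorted.length);
            return sorted[Math.max(0, rank - 1)];
        }

        public static void main(String[] args) {
            double[] pageLoadTimes = { 2.1, 2.8, 3.0, 3.2, 3.4, 4.9, 5.5, 8.7, 12.0, 19.6 };
            System.out.printf("50th percentile: %.1f s%n", percentile(pageLoadTimes, 50));
            System.out.printf("90th percentile: %.1f s%n", percentile(pageLoadTimes, 90));
            System.out.printf("95th percentile: %.1f s%n", percentile(pageLoadTimes, 95));
        }
    }

Returning to the question of where the business transaction type can be identified: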
Transaction-oriented applications have different transaction types which will have very different performance characteristics, yet the entry point to the application will be the same nearly every time (e.g. JMS). The transaction type will only be available within the request and within the executed code. In our credit/debit card example we would most likely see this only as part of the SOAP message. So what we need to do is to identify the transaction within our application. We can do this by modifying the code and providing the measurements ourselves (e.g. via JMX). If we do not want to modify our code we could also use aspects to inject it or use one of the many monitoring solutions that supports this kind of categorization via business transactions. Page 82 We want to measure response time of requests that call a method with a given parameter In our case we would measure the response time of every transaction and label it as a debit card payment transaction when the shown method is executed and the argument of the rst parameter is DebitCard. This way we can measure the different types of transactions even if they cannot be distinguished via the URI. Think About Errors Apart from performance we also need to take errors into account. Very often we see applications where most transactions respond within 1.5 seconds and sometimes a lot faster, e.g. 0.2 seconds. More often than not these very fast transactions represent errors. The result is that the more errors you have the better your average response time will get, which is of course misleading. Page 83 Show error rate, warning rate and response time of two business transactions We need to count errors on the business transaction level as well. If you dont want to have your response time skewed by those errors, you should exclude erroneous transaction from your response time measurement. The error rate of your transactions would be another KPI on which you can put a static threshold. An increased error rate is often the rst sign of an impending performance problem, so you should watch it carefully. I will cover how to monitor errors in more detail in one of my next posts. What Are Problems? It sounds like a silly question but I decided to ask it anyway, because in order to detect problems, we rst need to understand them. ITIL denes a problem as a recurring incident or an incident with high impact. In our case this means that a single transaction that exceeds our response time goal is not considered a problem. If you are monitoring a big system you will not have the time or the means to analyze every single violation anyway. But it is a problem if the response time goal is exceeded by 20% of your end user requests. This is one key reason why I prefer percentiles over averages. I know I have a problem if the 80th percentile exceeds the response time goal. The same can be said for errors and exceptions. A single exception or error might be interesting to the developer. We should therefore save the information so that it can be xed in a later release. But as far as Operations is concerned, it will be ignored if it only happens once or twice. On the other hand if the same error happens again and again we need to treat it as a problem as it clearly violates our goal of ensuring a healthy application. Alerting in a production environment must be set up around this idea. If Page 84 we were to produce an alert for every single incident we would have a so- called alarm storm and would either go mad or ignore them entirely. 
On the other hand if we wait until the average is higher than the response time goal customers will be calling our support line, before we are aware of the problem. Know your system and application The goal of monitoring is to ensure proper performance. Knowing there is a problem is not enough, we need to isolate the root cause quickly. We can only do that if we know our application and which other resources or services it uses. It is best to have a system diagram or ow chart that describes your application. You most likely will want to have at least two or three different detail levels of this. 1. System Topology This should include all your applications, service, resources and the communication patterns on a high level. It gives us an idea of what exists and which other applications might inuence ours. 2. Application Topology This should concentrate on the topology of the application itself. It is a subset of the system topology and would only include communication ows as seen from that applications point of view. It should end when it calls third party applications. 3. Transaction Response Flow Here we would see the individual business transaction type. This is the level that we use for response time measurement. Maintaining this can be tricky, but many monitoring tools provide this automatically these days. Once we know which other applications and services our transaction is using we can break down the response time into its contributors. We do this by measuring the request on the calling side, inside our application and on the receiving end. Page 85 Show response time distribution throughout the system of a single transaction type This way we get a denite picture of where response time is spent. In addition we will also see if we lose time on the network in the form of latency. Next Steps At this point we can monitor the health, stability and performance of our application and we can isolate the tier responsible in case we have a problem. If we do this for all of our applications we will also get a good picture of how the applications impact on each other. The next steps are to monitor each application tier in more detail, including resources used and system metrics. In the coming weeks I will explain how to monitor each of these tiers with the specic goal of allowing production-level root- cause analysis. At every level we will focus on monitoring the tier from an application and transaction point of view as this is the only way we can accurately measure performance impact on the end user. Finally I will also cover system monitoring. Our goal is however not to monitor and describe the system itself, but measure how it affects the application. In terms of application performance monitoring, system monitoring is an integral part and not a separate discipline. Page 86 White Box Testing Best Practices for Performance Regression and Scalability Analysis by Andreas Grabner Every change in your code can potentially introduce a performance regression and can impact the applications ability to scale. Regression analysis addresses this problem by tracking performance aspects of different components throughout the development process and under different load patterns. Black vs. White Box Analysis There are different avors of regression analysis. We can either look at the overall response time of a transaction/request or at the execution times of all involved components. 
The terms black box and white box testing are commonly used: black box testing looks at the application as one entity; white box testing on the other hand analyzes the performance of all individual components. At rst glance this might not seem a big deal as both approaches allow us to identify changes in the application that affect end user response times. But lets have a look at where white box testing really rocks! 13 Week 13 Sun Mon Tue Wed Thu Fri Sat 3 10 17 24 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 87 Simple Scalability Analysis When you want to test how your application scales I recommend using a simple increasing-workload load test. In this test you start with minimum load and ramp it up over time. Using a black box approach we analyze the response time of our application. The following illustration shows how our application behaves from a response time perspective: Black box testing response time analysis. We can probably predict how response time of the application will change with increasing load - or can we? It looks like our application is doing OK. With increasing load we see response time going up slightly which can be expected. But does this mean that trend will continue if we put more load on the system? Black box testing forces us to assume that the application will continue to scale in the same way which we cant really be sure of. With white box testing we can see into the individual components and can gure out how performance changes in the individual layers. Page 88 White box testing analyzes all involved components. We can see which components scale better than others and how well the application really scales Getting white box insight into application components while executing a load test allows us to learn more about the scalability of our application. In this case, it seems like our business logic layer is by far the worst scaling component in our application when we increase load. Knowing this allows us to a) focus on this component when improving performance and b) make a decision on whether and how to distribute our application components. Analyzing Regressions on Component Level During the development process you also want to verify if code changes have any negative impact on performance. An improved algorithm in one component can improve overall performance but can also have a negative impact on other components. The improvement effort is neutralized by other components that cant deal with the changed behavior. Lets have a look at an example. We have an application that contains presentation, business and database layers. An analysis of the rst Page 89 implementation shows that the business layer makes many roundtrips to the database to retrieve objects that hardly ever change. The engineers decide that these types of objects would be perfect for caching. The additional cache layer frees up database resources that can be used for other applications. In the next iteration the business owner decides to change certain aspects of the business logic. This change doesnt seem to have any negative impact on the response time as reported from the black box tests. When looking at all involved components, however, we see that the changed business logic is bypassing the cache because it requires other objects that have not been congured for caching. 
This puts the pressure back on the database and is good example for a component level regression. Identify performance regressions on component level Where to Go from Here? It doesnt really matter which tools you are using: whether you use open source or commercial load testing tools, whether you use commercial performance management software like dynaTrace or your own home grown performance logging mechanisms. It is important that you run tests continuously and that you track your performance throughout the development lifecycle. You are more efcient if you have tools that allow you to automate many of these tasks such as automatically executing tests, automatically collecting performance relevant data on code level and also Page 90 automatically highlighting regressions or scalability problems after tests are executed. To give you an idea here are some screenshots of how automated and continuous white box testing can help you in your development process. Track Performance of Tests Across Builds and Automatically Alert on Regressions Automatically Identify Performance Regressions on your daily performance and integration tests Analyze an Increasing Load test and Identify the Problematic Transactions and Components Analyze scalability characteristics of components during an increasing load test Page 91 Analyze Regressions on Code Level Between Builds/ Milestones Compare performance regressions on component and method level Additional Reading Material If you want to know more about testing check out the other blog posts such as 101 on Load Testing, Load Testing with SilkPerformer or Load Testing with Visual Studio Page 92 The Top Java Memory Problems Part 1 by Michael Kopp Memory and garbage collection problems are still the most prominent issues in any Java application. One of the reasons is that the very nature of garbage collection is often misunderstood. This prompted me to write a summary of some of the most frequent and also most obscure memory related issues that I have encountered in my time. I will cover the causes of memory leaks, high memory usage, class loader problems and GC conguration and how to detect them. We will begin this series with the best known one memory leaks. Memory Leaks A memory leak is the most-discussed Java memory issue there is. Most of the time people only talk about growing memory leaks - that is, the continuous rise of leaked objects. They are in comparison, the easiest to track down as you can look at trending or histogram dumps. 14 Week 14 Sun Mon Tue Wed Thu Fri Sat 3 10 17 24 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 93 Memory trending dump that shows the number of objects of the same type increasing In the picture we see the dynaTrace trending dump facility. You can achieve similar results manual by using jmap -histo <pid> multiple times and compare the results. Single object leaks are less thought-about. As long as we have enough memory they seldom pose a serious problem. From time to time though there are single object leaks that occupy a considerable amount of memory and become a problem. The good news is that single big leaks are easily discovered by todays heap analyzer tools, as they concentrate on that. 
Single object which is responsible for a large portion of the memory being leaked Page 94 There is also the particularly nasty, but rarely-seen, case of a lot of small, unrelated memory leaks. Its theoretically possible, but in reality it would need a lot of seriously bad programmers working on the same project. So lets look at the most common causes for memory leaks. ThreadLocal Variables ThreadLocals are used to bind a variable or a state to a thread. Each thread has its own instance of the variable. They are very useful but also very dangerous. They are often used to track a state, like the current transaction id, but sometimes they hold a little more information. A ThreadLocal variable is referenced by its thread and as such its lifecycle is bound to it. In most application servers threads are reused via thread pools and thus are never garbage-collected. If the application code is not carefully clearing the thread local variable you get a nasty memory leak. These kinds of memory leaks can easily be discovered with a heap dump. Just take a look at the ThreadLocalMap in the heap dump and follow the references. The heap dump shows that over 4000 objects which amount to about 10MB are held by ThreadLocals Page 95 You can then also look at the name of the thread to gure out which part of your application is responsible for the leak. Mutable Static Fields and Collections The most common reason for a memory leak is the wrong usage of statics. A static variable is held by its class and subsequently by its classloader. While a class can be garbage-collected it will seldom happen during an applications lifetime. Very often statics are used to hold cache information or share state across threads. If this is not done diligently it is very easy to get a memory leak. Static mutable collections especially should be avoided at all costs for just that reason. A good architectural rule is not to use mutable static objects at all; most of the time there is a better alternative. Circular and Complex Bi-directional References This is my favorite memory leak. It is best explained by example: org.w3c.dom.Document doc = readXmlDocument(); org.w3c.dom.Node child = doc.getDocumentElement(). getFirstChild(); doc.removeNode(child); doc = null; At the end of the code snippet we would think that the DOM document will be garbage-collected. That is however not the case. A DOM node object always belongs to a document. Even when removed from the document, the Node Object still has a reference to its owning document. As long as we keep the child object the document and all other nodes it contains will not be garbage-collected. Ive see this and other similar issues quite often. JNI Memory Leaks This is a particularly nasty form of memory leak. It is not easily found unless you have the right tool, and it is also not known to a lot of people. JNI is used to call native code from Java. This native code can handle, call and also create Java objects. Every Java object created in a native method Page 96 begins its life as a so called local reference. That means that the object is referenced until the native method returns. We could say the native method references the Java object. So you dont have a problem unless the native method runs forever. In some cases you want to keep the created object even after the native call has ended. To achieve this you can either ensure that it is referenced by some other Java object or you can change the local reference into a global reference. 
A global reference is a GC root and will never be garbage-collected until explicitly deleted by the native code. The only way to discover such a memory leak is to use a heap dump tool that explicitly shows global native references. If you have to use JNI you should rather make sure that you reference these objects normally and forgo global references altogether. You can nd this sort of leak when your heap dump analysis tool explicitly marks the GC Root as a native reference, otherwise you will have a hard time. Wrong Implementation of Equals/Hashcode It might not be obvious on the rst glance, but if your equals/hashcode methods violate the equals contract it will lead to memory leaks when used as a key in a map. A hashmap uses the hashcode to look up an object and verify that it found it by using the equals method. If two objects are equal they must have the same hashcode, but not the other way around. If you do not explicitly implement hashcode yourself this is not the case. The default hashcode is based on object identity. Thus using an object without a valid hashcode implementation as a key in a map, you will be able to add things but you will not nd them anymore. Even worse: if you re-add it, it will not overwrite the old item but actually add a new one - and just like that you have a memory leak. You will nd it easily enough as it is growing, but the root cause will be hard to determine unless you remember this article. The easiest way to avoid this is to use unit testcases and one of the available frameworks that tests the equals contract of your classes (e.g. http://code. google.com/p/equalsverier/). Page 97 Classloader Leaks When thinking about memory leaks we think mostly about normal Java objects. Especially in application servers and OSGi containers there is another form of memory leak, the class loader leak. Classes are referenced by their classloader and normally they will not get garbage-collected until the classloader itself is collected. That however only happens when the application gets unloaded by the application server or OSGi container. There are two forms of classloader leaks that I can describe off the top of my head. In the rst an object whose class belongs to the class loader is still referenced by a cache, a thread local or some other means. In that case the whole classloader and so, the whole application - cannot be garbage-collected. This is something that happens quite a lot in OSGi containers nowadays and used to happen in JEE application servers frequently as well. As it only happens when the application gets unloaded or redeployed it does not happen very often. The second form is nastier and was introduced by bytecode manipulation frameworks like BCEL and ASM. These frameworks allow the dynamic creation of new classes. If you follow that thought you will realize that now classes, just like objects, can be forgotten by the developer. The responsible code might create new classes for the same purpose multiple times. As the class is referenced in the current class loader you get a memory leak that will lead to an out of memory error in the permanent generation. The real bad news is that most heap analyzer tools do not point out this problem either, we have to analyze it manually, the hard way. This form or memory leak became famous due to an issue in an older version of hibernate and its usage of CGLIB. Summary As we see there are many different causes for memory leaks and not all of them are easy to detect. 
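To make the equals/hashCode pitfall described above tangible, here is a minimal, self-contained demonstration (class and key names are made up). The key class overrides equals but not hashCode, so logically identical keys land in different buckets: re-adding the "same" key grows the map instead of overwriting the existing entry, and lookups come back empty.

    import java.util.HashMap;
    import java.util.Map;

    // A key class that violates the equals/hashCode contract: equals is
    // overridden, hashCode is not (it falls back to object identity).
    class BrokenKey {
        private final String id;

        BrokenKey(String id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BrokenKey && ((BrokenKey) o).id.equals(id);
        }
        // Missing: @Override public int hashCode() { return id.hashCode(); }
    }

    public class HashCodeLeak {
        public static void main(String[] args) {
            Map<BrokenKey, String> cache = new HashMap<BrokenKey, String>();
            for (int i = 0; i < 100000; i++) {
                // Logically the same key every time, so the map should hold one
                // entry. Because each instance has a different hash code, every
                // put adds a new entry instead of replacing the old one.
                cache.put(new BrokenKey("user-42"), "cached value");
            }
            System.out.println("Expected size 1, actual size: " + cache.size());
            // Almost always prints null: the lookup key hashes to another bucket.
            System.out.println("Lookup result: " + cache.get(new BrokenKey("user-42")));
        }
    }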
In my next post I will look at further Java memory problems, so stay tuned. Page 98 Application Performance Monitoring in Production A Step by Step Guide Measuring a Distributed System by Michael Kopp Last time I explained logical and organizational prerequisites to successful production-level application performance monitoring. I originally wanted to look at the concrete metrics we need on every tier, but was asked how you can correlate data in a distributed environment, so this will be the rst thing that we look into. So lets take a look at the technical prerequisites of successful production monitoring. Collecting Data from a Distributed Environment The rst problem that we have is the distributed nature of most applications (an example is shown in the transaction ow diagram below). In order to isolate response time problems or errors we need to know which tier and component is responsible. The rst step is to record response times on every entry and exit from a tier. 15 Week 15 Sun Mon Tue Wed Thu Fri Sat 3 10 17 24 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 99 A simple transaction ow showing tier response time The problem with this is twofold. Firstly, the externalJira tier will host multiple different services which will have different characteristics. This is why we need to measure the response time on that service level and not just on the tier level. We need to do this on both sides of the fence, otherwise we will run into an averaging problem. The second problem is that externalJira is called from several other tiers, not just one.
Figure: In a complex system average tier response times are not helpful. Tier response times need to be looked at within a transaction context, as shown by this transaction flow.

When we look at the second transaction flow diagram (above) we see that externalJira is called from three different tiers. These tiers sometimes call the same services on externalJira, but with vastly different parameters, which leads to different response times on externalJira. We have a double averaging problem:

different tiers calling different services on externalJira, skewing the average
different tiers calling the same service on externalJira with different parameters, skewing the average

Let's look at this in a little more detail with the following example. In this table we see which tier entry point calls which services on other tiers. The Payment 1 service calls services 1-3 and measures the response time on its side. The Payment 2 service calls the same three services but sees very different response times. When we look at the times measured on services 1-3 respectively, we see completely different timings. We measured the response times of services 1-3 irrespective of their calling context and ended up with an average! Service 1 does not contribute 500ms to the response times of either Payment 1 or 2, but the overall average is 500ms. This average becomes more and more useless the more tiers we add. One of our biggest customers hits 30 JVMs in every single transaction. In such complex environments quick root cause isolation is nearly impossible if you only measure on a tier-by-tier basis. In order to correlate the response times in a complex system we need to retain the transaction context of the original caller. One way to solve this is to trace transactions, either by using a monitoring tool that can do that or by modifying code and building it into the application.

Correlating Data in a Distributed Environment

HTTP uses a concept called the referrer, which enables a web page to know from which other pages it was called. We can use something similar and leverage it for our response time monitoring. Let's assume for the moment that the calls made in our imaginary application are all web service HTTP calls. We can then use either the referrer header or a custom URL query parameter to track the calling service. Once that is achieved we can track response time based on that custom property. Many monitoring tools allow you to segregate response time based on referrer or query parameters. Another possibility, as always, is to report this yourself via your own JMX bean. If we do that we will get a response time that is context aware.

We now see that Service 2 only calls Service 3 when it is called directly from Payment 1, which means its contribution is far less than the original table suggested. We also still see a difference between the request and the response time of the services. This is due to the network communication involved. By measuring the response time aware of its context we can now also see the time that we spend in the communication layer more clearly, which enables us to isolate network bottlenecks and their impact. The average response time table did not allow us to do that. We can push this to any level we want. For example, we can divide the Payment 1 web service call into the three variants supported by our shop: Visa, MasterCard, and AMEX. If we push this as a tag/referrer down the chain we get an even more detailed picture of where we spend time.
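If you wanted to build this tagging yourself rather than use a monitoring tool, a minimal sketch could look like the following: a servlet filter that reads a caller tag from a custom HTTP header and records response times per calling context. The header name, class name and the simple in-memory counters are assumptions for illustration, not part of any product:

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    // Records response times per calling context for every request hitting this tier.
    // "X-Caller-Context" is an assumed convention: the calling tier sets it
    // (e.g. "Payment1/Visa") and extends it on its own outgoing calls.
    public class CallerContextFilter implements Filter {

        private final Map<String, AtomicLong> totalMillis = new ConcurrentHashMap<String, AtomicLong>();
        private final Map<String, AtomicLong> invocations = new ConcurrentHashMap<String, AtomicLong>();

        public void init(FilterConfig config) { }
        public void destroy() { }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest httpReq = (HttpServletRequest) req;
            String caller = httpReq.getHeader("X-Caller-Context");
            String key = (caller != null ? caller : "unknown") + " -> " + httpReq.getRequestURI();

            long start = System.nanoTime();
            try {
                chain.doFilter(req, res);
            } finally {
                add(totalMillis, key, (System.nanoTime() - start) / 1000000L);
                add(invocations, key, 1L);
                // expose totalMillis/invocations via JMX or a periodic report to get
                // context-aware averages instead of one flat average per service
            }
        }

        private static void add(Map<String, AtomicLong> map, String key, long delta) {
            AtomicLong counter = map.get(key);
            if (counter == null) {
                AtomicLong created = new AtomicLong();
                AtomicLong raced = map.putIfAbsent(key, created);
                counter = (raced != null) ? raced : created;
            }
            counter.addAndGet(delta);
        }
    }

The calling side would set the same header on its outgoing HTTP calls, appending its own service and business transaction name, so the context travels down the chain much like the referrer does for web pages.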
The problem with this approach is that it is not agnostic to your application or the remoting technology. It requires you to change your code base, and monitoring the different response times becomes more complicated with every tier you add. Of course you also need to maintain this alongside the actual application features, which increases cost and risk. This is where professional APM tools come in. Among other things they do transaction tracing and tagging transparently, without code changes. They can also measure and split response time in a context-aware way; they can differentiate between an AMEX and a Visa credit card payment via Business Transactions. And finally they allow you to focus on the entry response time, in our case Payment 1 and Payment 2. In case you have a problem, you can drill down to the next level from there. So there is no need to keep an eye on all the deeper-level response times.

Figure: dynaTrace automatically traces calls across tiers (synchronous and asynchronous), captures contextual information per transaction and highlights which tiers contribute how much to the response time

Beyond Response Time

By analyzing the response time distribution across services and tiers we can quickly isolate the offending tier/service in case we face a performance problem. I stated before that monitoring must not only allow us to detect a problem but also isolate the root cause. To do this we need to measure everything that can impact the response time either directly or indirectly. In general I like to distinguish between usage, utilization and impact measures.

Usage and Utilization Measurement

A usage measure describes how much a particular application or transaction uses a particular resource. A usage metric can usually be counted and is not time based. An exception is maybe the best-known usage measure, CPU time. But CPU time is not really time based; it is based on CPU cycles, and there is a limited number of CPU cycles that can be executed in a specific time. We can directly measure how much CPU time is consumed by our request by looking at the thread's consumed CPU time. In addition we can measure the CPU usage on the process and system level. Most of the time we are measuring a limited resource, and as such we also have a utilization measure, e.g. the CPU utilization of a system. Other examples include the number of database calls of a transaction or the connection pool usage. What is important is that the usage is a characteristic of the transaction and does not increase if performance goes down. If the specific resource is fully utilized we have to wait for it, but we will still use the same amount we always do!

Figure: While response time and overall CPU usage fluctuate, the average CPU usage per transaction is stable

To illustrate that, we can again think about CPU time. An AMEX credit card payment transaction will always use roughly the same amount of CPU time (unless there is a severe error in your code). If the CPU is fully utilized the response time will go up, because the transaction has to wait for CPU, but the amount of CPU time consumed will stay the same. This is what is illustrated in the chart. The same should be true if you measure the number of database statements executed per transaction, how many web services were called or how many connections were used. If a usage measure has high volatility then either you are not measuring on a granular enough business transaction level (e.g.
AMEX and Visa payments may very well have different usage measures) or it is an indicator of an architectural problem within the application. This in itself is, of course, useful information for the R&D department. The attentive reader will note that caches might also lead to such volatility, but they should only do so during the warm-up phase. If we still have high volatility after that due to the cache, then the cache configuration is not optimal.

The bottom line is that a usage measure is ideally suited to determine which sorts of transactions utilize your system and resources the most. If one of your resources reaches 100% utilization you can use this to easily identify which transactions or applications are the main contributors. With this information you can plan capacity properly or change the deployment to distribute the load better. Usage measures on a transaction level are also the starting point for every performance optimization activity and are therefore most important for the R&D department.

Unfortunately, the very fact that makes a usage measure ideal for performance tuning makes it unsuitable for troubleshooting scalability or performance problems in production. If the connection pool is exhausted all the time, we can assume that it has a negative impact on performance, but we do not know which transactions are impacted. Turning this around: if you have a performance problem with a particular transaction type, you cannot automatically assume that the performance would be better if the connection pool were not exhausted! The response time will increase, but all your transaction usage measures will stay the same, so how do you isolate the root cause?

Impact Measurement

In contrast to usage measures, impact measures are always time based. We measure the time that we have to wait for a specific resource or a specific request. An example is the getConnection method in the case of database connection pools. If the connection pool is exhausted, the getConnection call will wait until a connection is free. That means that if we have a performance problem due to exhaustion of the database connection pool, we can measure that impact by measuring the getConnection method. The important point is that we can measure this inside the transaction and therefore know that it negatively impacts the AMEX, but not the Visa transactions. Another example is the execution time of a specific database statement. If the database slows down in a specific area, we will see this impact on the AMEX transaction by measuring how long it has to wait for its database statements.

Figure: The database statement impacts the response time by contributing 20%

When we take that thought further we see that every time a transaction calls a different service, we can measure the impact that the external service has by measuring its response time at the calling point. This closes the loop back to our tier response times and explains why we need to measure them in a context-aware fashion. If we only measured the overall average response time of that external service, we would never know the impact it has on our transaction. This brings us to a problem that we have with impact measurement on the system level and in general.

Lack of Transaction Context

In an application server we can measure the utilization of the connection pool, which tells us if there is a resource shortage.
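As a side note, if you wanted to capture the getConnection impact measure from the previous section yourself, in transaction context, a small wrapper around the pool's DataSource is enough. This is only a sketch: the thread-local transaction name is an assumption of this example, and APM tools achieve the same transparently via byte-code instrumentation:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.sql.DataSource;

    // Wraps a DataSource and records how long each business transaction waits for
    // a pooled connection - an impact measure captured in transaction context.
    public class TimedDataSource {

        // set to e.g. "Payment/AMEX" by a servlet filter at the transaction entry point
        public static final ThreadLocal<String> TXN = new ThreadLocal<String>();

        private static final Map<String, AtomicLong> WAIT_MS = new ConcurrentHashMap<String, AtomicLong>();

        private final DataSource delegate;

        public TimedDataSource(DataSource delegate) { this.delegate = delegate; }

        public Connection getConnection() throws SQLException {
            long start = System.nanoTime();
            try {
                return delegate.getConnection();   // blocks while the pool is exhausted
            } finally {
                long waited = (System.nanoTime() - start) / 1000000L;
                String key = (TXN.get() == null ? "unknown" : TXN.get()) + ".getConnection.waitMs";
                AtomicLong sum = WAIT_MS.get(key);
                if (sum == null) {
                    AtomicLong fresh = new AtomicLong();
                    AtomicLong prev = WAIT_MS.putIfAbsent(key, fresh);
                    sum = (prev == null) ? fresh : prev;
                }
                sum.addAndGet(waited);             // expose via JMX or periodic logging
            }
        }
    }

A flat average per pool would hide which transaction types pay the price; keyed by transaction name, the same measurement answers exactly that question.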
We can measure the average wait time and/or the average number of threads waiting for a connection from the pool, which (similar to the load average) tells us that the resource shortage does indeed have an impact on our application. But both the usage and the impact measure lack transaction context. We can correlate the measurements on a time basis if we know which transactions use which resources, but we will have to live with a level of uncertainty. This forces us to do guesswork in case we have a performance problem. Instead of zeroing in on the root cause quickly and directly, we guess, and this is the main reason that troubleshooting performance problems takes a long time and lots of experts. The only way to avoid that guesswork is to measure the impact directly in the transaction, either by modifying the code or by using a monitoring tool that leverages byte-code injection and provides transaction context.

Of course there are some things that just cannot be measured directly. The CPU again is a good example. We cannot directly measure waiting for a CPU to be assigned to us, at least not easily. So how do we tackle this? We measure the usage, the utilization and the indirect impact. In the case of the CPU the indirect impact is measured via the load average. If the load average indicates that our process is indeed waiting for CPU, we need to rule out any other cause for the performance problem. We do this by measuring the usage and impact of all other resources and services used by the application. To paraphrase Sherlock Holmes: if all directly measured root causes can be ruled out, then the only logical conclusion is that the performance problem is caused by whatever resource cannot be directly measured. In other words, if nothing else can explain the increased response time, you can be sufficiently certain that CPU exhaustion is the root cause.

What About Log Files?

As a last item I want to look at how to store and report the monitoring data. I was asked before whether log files are an option. The idea was to change the code, measure the application from the inside (as hinted at several times by me) and write this to a log file. The answer is a definitive NO; log files are not a good option. The first problem is the distributed nature of most applications. At the beginning I explained how to measure distributed transactions. It becomes clear that while you can write all this information to a log file periodically, it will be highly unusable, because you would have to retrieve all the log files, correlate the distributed transactions manually and on top of that correlate them with system and application server metrics taken during that time. While doable, it is a nightmare if we are talking about more than two tiers.

Figure: Trying to manually correlate log files from all the involved servers and databases is nearly impossible in bigger systems. You need a tool that automates this.

The second problem is lack of context. If you only write averages to the log file you will quickly run into the averaging problem. One can of course refine this no end, but it will take a long time to reach a satisfactory level of granularity and you will have to maintain it in addition to application functionality, which is what should really matter to you. On the other hand, if you write the measured data for every transaction, you will never be able to correlate the data without tremendous effort, and you will also have to face a third problem.
Both logging all of the measures and aggregating the measures before logging them lead to overhead which has a negative impact on performance. On the other hand, if you only turn on this performance logging once you already have a problem, we are not talking about monitoring anymore. You will not be able to isolate the cause of a problem that has already occurred until it happens again. The same is true if you do this automatically, e.g. automatically start capturing data once you realize something is wrong. It sounds intuitive, but it really means that you have already missed the original root cause of why it is slow.

On the other hand, log messages often provide valuable error or warning information that is needed to pinpoint problems quickly. The solution is to capture log messages the same way that we measure response times and execution counts: within the transaction itself. This way we get the valuable log information within the transaction context and do not have to correlate dozens of log files manually.

Figure: The dynaTrace PurePath includes log messages and exceptions in the context of a single transaction

In addition, a viable monitoring solution must store, aggregate and automatically correlate all retrieved measurements outside the monitored application. It must store them permanently, or at least for some time. This way you can analyze the data after the problem happened and do not have to actively wait until it happens again.

Conclusion

By now it should be clear why we need to measure everything that we can in the context of the calling transaction. By doing this we can create an accurate picture of what is going on. It enables us to rule out possible root causes for a problem and zero in on the real cause quickly. It also enables us to identify resource and capacity issues on an application and service level instead of just the server level. This is equally important for capacity planning and cost accounting. As a next step we will look at the exact metrics we need to measure in each tier and how to interpret and correlate them to our transaction response time.

Week 16

Tips for Creating Stable Functional Web Tests to Compare across Test Runs and Browsers
by Andreas Grabner

In the last week I created stable functional tests for a new eCommerce application. We picked several use cases, e.g. clicking through the different links, logging in, searching for products and actually buying a product. We needed functional tests that run on both Internet Explorer and Firefox. With these tests we want to make sure to automatically find any functional problems, but also performance and architectural problems (e.g. too many JavaScript files on the site, too many exceptions on the server or too many database statements executed for a certain test scenario). We also want to find problems that happen only on certain browsers, which is why we ran the tests on the two major browsers.

Test Framework: Selenium WebDriver

As the test framework I decided to pick Selenium WebDriver and downloaded the latest version. I really thought it would be easier to write tests that work in a similar way on both browsers. Here are my lessons learned:

1. When you write a script, always test it immediately on both browsers.
2. Use a page object approach when developing your scripts.
With that you keep the actual implementation separated from the test cases (you will see my test scripts later in this blog; it will make more sense when you see them).
3. Be aware of the different behaviors of IE and FF.
4. Make sure your test code can deal with unexpected timings or error situations.

What a Test Script Should Look Like (Slick and Easy to Understand)

Here is a screenshot of one of my test cases.

Figure: Selenium test using a page object pattern that facilitates writing easy-to-understand test cases

Common Functionality in PageObjectBase

I put lots of helper methods in a base class that I called PageObjectBase. As WebDriver won't wait for certain objects or for the page to be loaded (or at least, I haven't found that functionality), I created my own waitFor methods to wait until certain objects are on the page. This allows me to verify whether my app made it to the next stage or not. Here is another screenshot of one of my helper methods (a small sketch of such a base class follows at the end of this section). You see that I had to work around a certain limitation I came across: in IE it seems that By.linkText doesn't work, and the same is true for most of the other lookup methods in By. What worked well for me is By.xpath, with the only limitation that certain functions such as contains() don't work on Firefox. As you can see, there are lots of things to consider; unfortunately not everything works the same way on every browser.

Figure: Helper methods in my PageObjectBase class

Easy to Switch Browsers

My test classes create the WebDriver runner. Here I also created a base class that, depending on a system property I can set from my Ant script, instantiates the correct WebDriver implementation (IE or FF). This base class also checks whether dynaTrace will be used to collect performance data. If that's the case it creates a dynaTrace object that I can use to pass test and test step names to dynaTrace. This makes it easier to analyze performance data later on; more on this later in this article.

Figure: Base object that makes sure we have the correct WebDriver and that dynaTrace is properly set up
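For illustration, here is a minimal sketch of what such a PageObjectBase and driver factory could look like. It is not the actual test code from this project; the timeout, the polling interval and the test.browser system property are my own assumptions:

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.ie.InternetExplorerDriver;

    // Base class for all page objects: owns the driver and the custom waitFor helper.
    public abstract class PageObjectBase {

        protected final WebDriver driver;

        protected PageObjectBase(WebDriver driver) {
            this.driver = driver;
        }

        // Polls until at least one element matching the locator is present, or fails.
        protected WebElement waitFor(By by, long timeoutMillis) {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (System.currentTimeMillis() < deadline) {
                java.util.List<WebElement> found = driver.findElements(by);
                if (!found.isEmpty()) {
                    return found.get(0);
                }
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            }
            throw new IllegalStateException("Timed out waiting for " + by);
        }

        // Factory used by the test base class: the system property name is my own
        // convention here, set e.g. from the Ant script with -Dtest.browser=ie
        public static WebDriver createDriver() {
            String browser = System.getProperty("test.browser", "firefox");
            if ("ie".equalsIgnoreCase(browser)) {
                return new InternetExplorerDriver();
            }
            return new FirefoxDriver();
        }
    }

A concrete page object (e.g. a LoginPage) would extend this base class and expose business-level methods such as loginAs(...), so the test cases stay free of locators and browser specifics.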
Analyzing Tests across Test Runs

As recently blogged, dynaTrace offers Premium Extensions to our free dynaTrace AJAX Edition. These extensions allow us not only to collect performance data automatically from Internet Explorer or Firefox, they also automatically analyze certain key metrics per test case. Metrics can be the number of resources downloaded, the time spent in JavaScript, the number of redirects or the number of database queries executed on the application server.

Identify Client-Side Regressions across Builds

I have access to different builds. Against every build I run my Selenium tests and then verify the Selenium results (succeeded, failed, errors) and the numbers I get from dynaTrace (#roundtrips, time in JS, #database statements, #exceptions, and so on). With one particular build I still got all successful Selenium test executions, but got a notification from dynaTrace that some values were outside of the expected value range. The following screenshot shows some of these metrics that triggered an alert:

Figure: JavaScript errors, number of resources and number of server-side exceptions show a big spike starting with a certain build

A double-click on one of the metrics of the build with this changed behavior opens a comparison view of this particular test case. It compares it with the previous test run where the numbers were OK:

Figure: The Timeline makes it easy to spot the difference visually. It seems we have many more network downloads and JavaScript executions

A side-by-side comparison of the network requests is also opened automatically, showing me the differences in downloaded network resources. It seems that a developer added a new version of jQuery including a long list of jQuery plugins.

Figure: If these libraries are really required we need to at least consider consolidating the jQuery library and using a minified version of these plugins

Now we know why we have so many more resources on the page. Best practices recommend that we merge all CSS and JavaScript files into a single file and deploy a minified version of it instead of deploying all these files individually. The JavaScript errors that were thrown were caused by an incompatibility between the multiple versions of jQuery. So even though the Selenium test was still successful, we have several problems with this build that we can now start to address.

Identify Server-Side Regressions across Builds

Even though more and more logic gets executed in the browser, we still need to look at the application running on the application server. The following screenshot shows another test case with a dramatic growth in database statements (from 1 to more than 9,000). Looks like another regression.

Figure: Database executions exploded from 1 to more than 9,000 for this particular test case

The drill-down to compare the results of the problematic build with the previous one works in the same way. Double-click the measure and we get to a comparison dashboard. This time we are interested in the database statements. It seems it is one statement that got called several thousand times.

Figure: This database statement was called several thousand times more often than in the previous test runs

When we want to know who executed these statements and why they weren't executed in the previous build, we can open the PurePath comparison dashlet. The PurePath represents the actual transactional trace that dynaTrace captured for every request of every test run. As we want to focus on this particular database statement, we can drill from here to the comparison view and see where it has been called.

Figure: Comparing the same transaction in both builds. A code change caused the call of allJourneys as compared to getJourneyById

Analyzing Tests across Browsers

In the same way as comparing results across test runs or builds, it is possible to compare tests against different browsers. It is interesting to see how applications behave differently in different browsers. But it is also interesting to identify regressions on individual browsers and compare these results with the browser that doesn't show the regression. The following screenshot shows the comparison of browser metrics taken from the same test executed against Internet Explorer and Firefox. It seems that for IE we have 4 more resources that get downloaded:

Figure: Not only compare metrics across test runs but also compare metrics across browsers

From here we can go on in the same way as I showed above.
Drill into the detailed comparison, e.g. Timeline, Network, JavaScript or server-side execution, and analyze the different behavior.

Want More?

Whether you use Selenium, WebDriver, QTP, Silk, dynaTrace, YSlow, PageSpeed or ShowSlow, I imagine you are interested in testing and you want to automate things. Check out my recent blogs such as those on Testing Web 2.0 Applications, Why You Can't Compare Execution Times Across Browsers or dynaTrace AJAX Premium.

Week 17

How to do Security Testing with Business Transactions
Guest blog by Lucy Monahan from Novell, posted by Andreas Grabner

Lucy Monahan is a Principal Performance QA Engineer at Novell and helps to manage their distributed Agile process.

One of the most important features of an application is to provide adequate security and protect the secrets held within. Business Transactions used with continuous integration, unit, feature and negative testing specialized for security can detect known security vulnerabilities before your product goes to market.

Catch Secrets Written to Logging

Plaintext secrets written to a log file are a well-known vulnerability. Once intruders gain access to a hard disk they can easily comb through log files to further exploit the system. Here a Business Transaction is used to search logging output for secrets. Application data slips into logging in a variety of ways: lack of communication regarding which data is a secret, lingering early code (perhaps written before a comprehensive logging strategy was implemented), old debug statements or perhaps inadvertent inclusion via localization processing.

It's a good idea to grep log files after your test run, but that will not cover output to the server console, which may contain different content. For example, users may start the application using nohup, which may write terminal logging to nohup.out. And starting an application with redirection of stdout and stderr will persist console output to a file, such as:

startserver.sh > /opt/logs/mylogfile.log 2>&1

Use Business Transactions to search logging content during your testing and trap secrets before they are revealed! And be sure to enable a high logging level, since trace level may contain secrets that info level does not.

This Business Transaction uses two Measures for two of the Log4J LoggingEvent class constructors (org.apache.log4j.spi.LoggingEvent). Other methods that can be trapped include those from proprietary logging classes or classes associated with the auditing channel or retrieval of localized messages. These Measures simply search the constructor's message argument for the word "testuser" for purposes of demonstration:

Figure: Create the two argument measures for the two different LoggingEvent constructors
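If you want a similar check outside of dynaTrace, e.g. directly in a unit test, the LoggingEvent interception can be approximated with a trapping Log4J 1.x appender. This is only a hedged sketch of the idea; the secret token "testuser" matches the demonstration above, and real tests would scan for a list of secrets:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.log4j.AppenderSkeleton;
    import org.apache.log4j.Logger;
    import org.apache.log4j.spi.LoggingEvent;

    // A trapping appender: records every LoggingEvent whose rendered message
    // contains a known secret token.
    public class SecretTrapAppender extends AppenderSkeleton {

        private final List<String> hits = new ArrayList<String>();

        @Override
        protected void append(LoggingEvent event) {
            String message = event.getRenderedMessage();
            if (message != null && message.contains("testuser")) {
                hits.add(message);
            }
        }

        public List<String> getHits() { return hits; }

        public boolean requiresLayout() { return false; }
        public void close() { }

        // Typical use in a test: register, run the scenario, then assert no hits.
        public static void main(String[] args) {
            SecretTrapAppender trap = new SecretTrapAppender();
            Logger.getRootLogger().addAppender(trap);

            Logger.getLogger("demo").info("login attempt for testuser"); // would be flagged

            if (!trap.getHits().isEmpty()) {
                throw new AssertionError("Secret leaked to logging: " + trap.getHits());
            }
        }
    }
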
The Business Transaction filter defines both measures with an OR operator to activate the Business Transaction:

Figure: The Business Transaction will filter transactions that contain a log message for "testuser"

We want to know about any instance of a secret being logged, thus the threshold Upper Severe is set to 1.0 occurrence. Each of the Business Transactions outlined here uses this threshold.

Figure: The threshold on the Business Transaction can later be used for automatic alerting

When the secret is written to logging the Business Transaction is activated:

Figure: The Business Transaction dashlet shows us how many log messages have been written that include our secret text

The displayed columns in the Business Transaction can be customized to show counts of each Measure that matched the filter.

TIP: Running this Business Transaction against install and uninstall programs is highly recommended, since a lot of secret data is requested and used during these processes.

From the Business Transaction dashlet we can drill down to the actual transactional trace (PurePath) and see where this secret was actually logged. The PurePath contains additional contextual information such as HTTP parameters, method arguments, exceptions, log messages, database statements, remoting calls, and so on.

Figure: The actual transaction that contains our captured secret. The dynaTrace PurePath contains lots of contextual information valuable for developers.

Catch Exception Messages Containing Secrets

If your application prints exceptions and their messages then you need to catch any secrets embedded within them. This Business Transaction uses a measure to obtain the return value of java.lang.Throwable.getLocalizedMessage(). Depending on your application's architecture a measure for java.lang.Throwable.getMessage() may also be required. With the Business Transaction defined, perform a negative-testing test run that intentionally throws exceptions. The measure in this example simply searches for the word "authentication" in the getLocalizedMessage() return value for demonstration purposes:

Figure: Argument measure that counts the occurrences of "authentication" in the return value of getLocalizedMessage

Figure: Business Transaction that filters based on the captured secret in getLocalizedMessage

Figure: Business Transaction dashlet showing us instances where "authentication" was part of the localized message

Catch Non-SSL Protocols

Any non-SSL traffic over the network has the potential to expose secrets. Most applications requiring high security will disallow all non-SSL traffic. Your application needs to be tested to ensure that all requests are being made over SSL. Non-SSL connections can be opened and used inadvertently when different components within an application handle their own connections, or when an oversight in installation or configuration allows them. Unless you include an explicit check for non-SSL connections, they may be opened and used stealthily.

One way to ensure that only SSL connections are being used is to trap any non-SSL connections being opened. This Business Transaction uses a measure to obtain the string value of the first argument to the constructor of javax.naming.ldap.InitialLdapContext. The first argument is a hashtable, and when SSL is enabled one of the entries contains the text protocol=ssl. It's worth noting that the presence of the protocol=ssl value is particular to the JVM implementation being used. Indeed, when SSL is not enabled the protocol value is omitted, hence the use of the "not contains" operator for this measure. Use a method sensor to capture the value being used in your JVM implementation to confirm the text token for your application environment.
Figure: Measure that evaluates whether SSL is not passed as argument

Figure: Business Transaction that filters those transactions that do not pass SSL as protocol argument

Figure: Business Transaction dashlet shows the transactions that do not use SSL as parameter

What other protocols are being used in your application? You can write a similar Business Transaction to ensure that only SSL is being used for those protocols as well.

Catch Secrets Written to Print Statements

The above examples catch secrets written to exceptions and logging. But a common debug technique is to use a print statement, such as System.out.println() in Java, rather than a logger function. A Business Transaction that catches print statements is recommended for the test suite. Having said that, in Java java.lang.System.out.println() is not accessible in the usual way because out is a field of the System class. An alternative approach may be to use a Business Transaction with a measure based on the classname value to trap all calls to java.lang.System. All related PurePaths for this Business Transaction would then merit review to assess security risks. This approach may or may not be feasible depending on how often your application calls java.lang.System, so feasibility testing of this approach would be necessary. The test application used here does not call java.lang.System, so an example is not included. In any case, a companion test that performs a full-text search of the source code tree for calls to java.lang.System.out.println() is highly recommended.

Summary

For each Business Transaction the process is:

Define which entities within your application represent secrets (e.g. passwords, account numbers)
Define which classes and methods will be used for the Measure definition
Create a new Business Transaction using the Measure definition
Ensure that the input data used in the test contains many secrets

Include these types of Business Transactions in continuous integration, unit, feature and negative testing that contain a diversity of secrets. These Business Transactions are not intended for load testing, however, since load testing may simply contain many instances of the same secret rather than a diversity of secrets, and the results would flood your result set.

Security-related issues released into the field are costly. The customer's sensitive data is at risk, and the expense of security field patches and the damage done to the corporation's reputation is high. Understanding your application's architecture will help identify possible vulnerabilities and enable you to grow a suite of Business Transactions designed specifically to trap secrets and protect your application's security.

Follow-up Links for dynaTrace Users

If you are a dynaTrace user, you can check out the following links on our Community Portal that explain more about how dynaTrace does Business Transaction Management, how to improve your Continuous Integration process and how to integrate dynaTrace into your testing environment.

Week 18

Field Report: Application Performance Management in WebSphere Environments
by Andreas Grabner

Just in time for our webinar with The Bon-Ton Stores, where we talk about the challenges of operating complex WebSphere environments, we had another set of prospects running their applications on WebSphere. Francis Cordon, a colleague of mine, shares some of the sample data resulting from these engagements. In this article I want to highlight important areas when managing performance in WebSphere environments.
This includes WebSphere health monitoring, end-to-end performance analysis, performance and business impact analysis, as well as WebSphere memory analysis and management.

WebSphere Health Monitoring

WebSphere application servers provide many different metrics that we need to consider when monitoring server health. This includes system metrics such as memory and CPU utilization. It also includes transaction response times and connection pool usage. The following screenshot shows a dashboard that gives a good overview of the general health of a WebSphere server:

Figure: Monitoring WebSphere server health including memory, CPU, response times, connection pools and thread information

From a very high-level perspective we can look at overall response times but also at response times of individual services. The following illustration shows a dashboard that visualizes response times and whether we have any SLA violations on any of our monitored service tiers:

Figure: Easy to spot whether we have any SLA breaches on any of our tiers

The following dashboard provides an extended in-depth view. Not only does it show response times or memory usage, it also shows which layers of the application contribute to the overall performance and provides an additional overview of problematic SQL statements or exceptions:

Figure: A more in-depth WebSphere health monitoring dashboard including layer performance breakdown, database and exception activity

End-to-End Performance View

The transaction flow dashlet visualizes how transactions flow through the WebSphere environment. We can look at all transactions, at certain business transactions (e.g. product searches, check-outs, logins, and many others) or pick individual ones from specific users. From this high-level flow we can drill down to explore more technical details to understand where time is spent or where errors happen. The following screenshot shows how to drill into the details of those transactions that cross through a specific WebSphere server node. For every transaction we get to see the full execution trace (PurePath) that contains contextual information such as executed SQL statements, exceptions, log messages, executed methods including arguments, etc.

Figure: Drill into the transactions flowing through WebSphere. Each individual transaction contains contextual information and provides the option to look up offending source code

If we want to focus on database activity we simply drill down into the database details. Database activity is captured from within the application server, including SQL statements, bind variables and execution times. The following three illustrations show different ways to analyze database activity executed by our WebSphere transactions.

Figure: Analyze all queries including bind values executed by our WebSphere application. Identify slow ones or those that are executed very often

We can pick an individual database statement to see which transaction made the call and how it impacts the performance of this transaction:

Figure: Identify the impact of a database query on its transaction. In this case a stored procedure is not returning the expected result and throws an exception

It is not enough to look at the actual transaction and its database statements.
We also monitor performance metrics exposed by the database, in this case an Oracle database instance. dynaTrace users can download the Oracle Monitor Plugin from our Community Portal. You can also read the article on How to Monitor Oracle Database Performance.

Figure: Analyze the activity on the database by monitoring Oracle's performance metrics and correlate it with our transactional data

Business Impact Analysis

As many different end users access applications running on WebSphere, it is important to identify problems that impact all users, but also problems that impact only individual users. The following illustration shows how Business Transactions allow us to analyze individual users, and from there dig deeper into the root cause of their individual performance problems.

Figure: Analyze the performance impact for individual users using Business Transactions

Analyzing Memory Usage and Memory Leaks

Memory management and analysis can be hard if you don't know what to look out for. Read our blogs on Top Memory Leaks in Java, Memory Leak Detection in Production or the Impact of Garbage Collection on Performance to make yourself familiar with the topic. The following screenshots show how to analyze memory usage in WebSphere and how to track potential memory leaks by following the object reference paths of identified memory hotspots:

Figure: We start by analyzing our memory usage

Figure: Identify hotspots and then identify the root cause of memory leaks

Final Words

Thanks again to Francis for sharing his experience with us. Existing dynaTrace customers, please check out the content we have on our Community Portal. For more information, download the recorded version of our webinar with The Bon-Ton Stores.

Week 19

How Garbage Collection Differs in the Three Big JVMs
by Michael Kopp

Most articles about garbage collection (GC) ignore the fact that the Sun Hotspot JVM is not the only game in town. In fact, whenever you have to work with either IBM WebSphere or Oracle WebLogic you will run on a different runtime. While the concept of garbage collection is the same, the implementation is not, and neither are the default settings or how to tune it. This often leads to unexpected problems when running the first load tests, or in the worst case when going live. So let's look at the different JVMs, what makes them unique and how to ensure that garbage collection runs smoothly.

The Garbage Collection Ergonomics of the Sun Hotspot JVM

Everybody believes they know how garbage collection works in the Sun Hotspot JVM, but let's take a closer look for the purpose of reference.

Figure: The memory model of the Sun Hotspot JVM

The Generational Heap

The Hotspot JVM always uses a generational heap. Objects are first allocated in the young generation, specifically in the Eden area. Whenever Eden is full, a young generation garbage collection is triggered. This will copy the few remaining live objects into the empty Survivor space. In addition, objects that were copied to the Survivor space in the previous garbage collection will be checked and the live ones will be copied as well. The result is that objects only exist in one Survivor, while Eden and the other Survivor are empty.
This form of garbage collection is called copy collection. It is fast as long as nearly all objects have died. In addition, allocation is always fast because no fragmentation occurs. Objects that survive a couple of garbage collections are considered old and are promoted into the Tenured (old) space.

Tenured Generation GCs

The mark-and-sweep algorithms used in the Tenured space are different because they do not copy objects. As we have seen in one of my previous posts, garbage collection takes longer the more objects are alive. Consequently, GC runs in Tenured are nearly always expensive, which is why we want to avoid them. In order to avoid them we need to ensure that objects are only copied from young to old when they are permanent, and in addition ensure that the Tenured space does not run full. Therefore generation sizing is the single most important optimization for the GC in the Hotspot JVM. If we cannot prevent objects from being copied to Tenured space once in a while, we can use the concurrent mark-and-sweep algorithm, which collects objects concurrently with the application.

Figure: Comparison of the different garbage collector strategies

While that shortens the suspensions it does not prevent them, and they will occur more frequently. The Tenured space also suffers from another problem: fragmentation. Fragmentation leads to slower allocation, longer sweep phases and eventually out-of-memory errors when the holes get too small for big objects.

Figure: Java heap before and after compacting

This is remedied by a compacting phase. The serial and parallel compacting GCs perform compaction for every GC run in the Tenured space. It is important to note that, while the parallel GC performs compaction every time, it does not compact the whole Tenured heap but just the area that is worth the effort. By "worth the effort" I mean when the heap has reached a certain level of fragmentation. In contrast, the concurrent mark and sweep does not compact at all. Once objects cannot be allocated anymore, a serial major GC is triggered. When choosing the concurrent mark-and-sweep strategy we have to be aware of that side effect. The second big tuning option is therefore the choice of the right GC strategy. It has big implications for the impact the GC has on application performance. The last and least-known tuning option is around fragmentation and compacting. The Hotspot JVM does not provide a lot of options to tune it, so the only way is to tune the code directly and reduce the number of allocations.

There is another space in the Hotspot JVM that we all came to love over the years: the Permanent Generation. It holds classes and string constants that are part of those classes. While garbage collection is executed in the permanent generation, it only happens during a major GC. You might want to read up on what a major GC actually is, as it does not simply mean an old generation GC. Because a major GC does not happen often, and mostly nothing happens in the permanent generation, many people think that the Hotspot JVM does not do garbage collection there at all. Over the years all of us have run into many different forms of OutOfMemory situations in PermGen, and you will be happy to hear that Oracle intends to do away with it in future versions of Hotspot.

Oracle JRockit

Now that we have had a look at Hotspot, let us look at the differences in Oracle JRockit. JRockit is used by Oracle WebLogic Server, and Oracle has announced that it will merge it with the Hotspot JVM in the future.
Heap Strategy

The biggest difference is the heap strategy itself. While Oracle JRockit does have a generational heap, it also supports a so-called continuous heap. In addition, the generational heap looks different as well.

Figure: Heap of the Oracle JRockit JVM

The young space is called the Nursery and it has only two areas. When objects are first allocated they are placed in a so-called Keep Area. Objects in the Keep Area are not considered during garbage collection, while all other objects still alive are immediately promoted to Tenured. That has major implications for the sizing of the Nursery. While you can configure how often objects are copied between the two Survivors in the Hotspot JVM, JRockit promotes objects in the second young generation GC.

In addition to this difference, JRockit also supports a completely continuous heap that does not distinguish between young and old objects. In certain situations, like throughput-oriented batch jobs, this results in better overall performance. The problem is that this is the default setting on a server JVM and often not the right choice. A typical web application is not throughput- but response-time-oriented, and you will need to explicitly choose the low pause time garbage collection mode or a generational garbage collection strategy.

Mostly Concurrent Mark and Sweep

If you choose the concurrent mark-and-sweep strategy you should be aware of a couple of differences here as well. The mostly concurrent mark phase is divided into four parts:

Initial marking, where the root set of live objects is identified. This is done while the Java threads are paused.
Concurrent marking, where the references from the root set are followed in order to find and mark the rest of the live objects in the heap. This is done while the Java threads are running.
Precleaning, where changes in the heap during the concurrent mark phase are identified and any additional live objects are found and marked. This is done while the Java threads are running.
Final marking, where changes during the precleaning phase are identified and any additional live objects are found and marked. This is done while the Java threads are paused.

The sweeping is also done concurrently with your application, but in contrast to Hotspot it happens in two separate steps. It first sweeps the first half of the heap. During this phase threads are allowed to allocate objects in the second half. After a short synchronization pause the second half is swept. This is followed by another short final synchronization pause. The JRockit algorithm therefore stops more often than the Sun Hotspot JVM, but the remark phase should be shorter. Unlike the Hotspot JVM, you can tune the CMS by defining the percentage of free memory that triggers a GC run.

Compacting

JRockit does compacting for all Tenured generation GCs, including the concurrent mark and sweep. It does so in an incremental mode for portions of the heap. You can tune this with various options, like the percentage of heap that should be compacted each time or how many objects are compacted at most. In addition you can turn off compacting completely or force a full compaction for every GC. This means that compacting is a lot more tunable in JRockit than in the Hotspot JVM, and the optimum depends very much on the application itself and needs to be carefully tested.

Thread Local Allocation

Hotspot does use thread local allocation (TLA), but it is hard to find anything in the documentation about it or how to tune it. JRockit uses it by default.
This allows threads to allocate objects without any need for synchronization, which is beneficial for allocation speed. The size of a TLA can be configured, and a large TLA can be beneficial for applications where multiple threads allocate a lot of objects. On the other hand, too large a TLA can lead to more fragmentation. As a TLA is used exclusively by one thread, the size is naturally limited by the number of threads. Thus both decreasing and increasing the default can be good or bad, depending on your application's architecture.

Large and Small Objects

JRockit differentiates between large and small objects during allocation. The limit for when an object is considered large depends on the JVM version, the heap size, the garbage collection strategy and the platform used. It is usually somewhere between 2 and 128 KB. Large objects are allocated outside the thread local area and, in the case of a generational heap, directly in the old generation. This makes a lot of sense when you start thinking about it: the young generation uses a copy collection, and at some point copying an object becomes more expensive than traversing it in every garbage collection.

No Permanent Generation

And finally it needs to be noted that JRockit does not have a permanent generation. All classes and string constants are allocated within the normal heap area. While that makes life easier on the configuration front, it means that classes can be garbage collected immediately when they are no longer used. In one of my future posts I will illustrate how this can lead to some hard-to-find performance problems.

The IBM JVM

The IBM JVM shares a lot of characteristics with JRockit: the default heap is a continuous one (especially in WebSphere installations this is often the initial cause of bad performance); it differentiates between large and small objects, with the same implications, and uses thread local allocation by default. It also does not have a permanent generation. But while the IBM JVM also supports a generational heap model, that model looks more like Sun's than JRockit's.

Figure: The IBM JVM generational heap

Allocate and Survivor act like Eden and Survivor in the Sun JVM. New objects are allocated in one area and copied to the other on garbage collection. In contrast to JRockit, the two areas are switched upon GC. This means that an object is copied multiple times between the two areas before it gets promoted to Tenured. Like JRockit, the IBM JVM has more options to tune the compaction phase. You can turn it off or force it to happen for every GC. In contrast to JRockit, the default triggers it based on a series of triggers, but it will then lead to a full compaction. This can be changed to an incremental one via a configuration flag.

Conclusion

We see that while the three JVMs are essentially trying to achieve the same goal, they do so via different strategies. This leads to different behavior that needs tuning. With Java 7, Oracle will finally declare the G1 (Garbage First) collector production ready, and G1 is a different beast altogether, so stay tuned.

If you're interested in hearing me discuss more about WebSphere in a production environment, then check out our webinar with The Bon-Ton Stores. I'll be joined by Dan Gerard, VP of Technical & Web Services at Bon-Ton, to discuss the challenges they've overcome in operating a complex WebSphere production eCommerce site to deliver great web application performance and user experience. Watch it now to hear me go into more detail about WebSphere and production eCommerce environments.
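As a practical footnote to the Hotspot part of this comparison, generation sizing and collector choice are set via command-line flags. The following is only an illustrative starting point with made-up sizes, not a recommendation; every application needs its own load testing to find suitable values:

    # Illustrative Hotspot settings (sizes are placeholders; measure before adopting):
    # -Xms/-Xmx fix the overall heap, -Xmn sizes the young generation
    # -XX:SurvivorRatio controls Eden vs. Survivor sizing
    # -XX:+UseConcMarkSweepGC selects concurrent mark and sweep for the Tenured space
    # -XX:CMSInitiatingOccupancyFraction starts CMS before Tenured runs full
    # -XX:MaxPermSize caps the permanent generation
    # -verbose:gc -XX:+PrintGCDetails log GC activity so the effect can be verified
    java -Xms2g -Xmx2g -Xmn512m -XX:SurvivorRatio=8 \
         -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
         -XX:MaxPermSize=256m -verbose:gc -XX:+PrintGCDetails -jar myapp.jar

JRockit and the IBM JVM use their own, different option sets for the equivalent choices (generational vs. continuous heap, nursery sizing, compaction), so settings cannot simply be copied between the three JVMs.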
Week 20

Why Object Caches Need to be Memory-sensitive
Guest blog by Christopher André, posted by Michael Kopp

Christopher André is an Enablement Service Consultant at dynaTrace and helps our customers maximize the value they get out of dynaTrace.

The other day I went to a customer who was experiencing a problem that happens quite frequently: he had a cache that was constantly growing, leading to OutOfMemory exceptions. Other problems in the application seemed linked to it. Analyzing and finding the root cause of this memory-related problem triggered me to write this blog on why they ran into OutOfMemory exceptions despite having a properly configured cache. He was trying to cache the results of database selects so he wouldn't have to execute them multiple times. This is generally a good idea, but most Java developers don't really know how to do this right and forget about the growing size of their caches.

How Can We Have Memory Problems in Java?

Quite often I hear Java developers saying "I can't have memory problems; the JVM is taking care of everything for me." While the JVM's memory handling is great, this does not mean we don't have to think about it at all. Even if we do not make any obvious mistakes, we sometimes have to help the JVM manage memory efficiently. The behavior of the garbage collection (GC) has been explained in several blogs. It will reclaim the memory of all objects that can't be reached by so-called GC roots (GC roots are objects that are assumed to be always reachable). Problems often happen when an object creates many references to different objects and the developer forgets to release them.

Figure: Java object references and GC roots

Cache Systems

A cache is, to simplify to the extreme, a map. You want to remember a particular object and associate it with an identifier. Because we don't have an endless supply of memory, there are specific algorithms dedicated to evicting entries that are no longer needed from this cache. Let's have a quick look at some of them:

Least Recently Used (LRU): The cache maintains an access timestamp for every entry. Whenever it is triggered to remove entries, due to a size limitation, it evicts those with the oldest access timestamp first.

Timed LRU: A special form of LRU that evicts items that have not been used for a specific timeframe, instead of enforcing a size limitation. It is one of the most frequently used algorithms for database caches and was used in my customer's case.

Least Frequently Used (LFU): This algorithm tracks the number of times an entry has been accessed. When the cache tries to evict some of its entries, it removes those that are accessed least often.

Despite the use of a timed LRU algorithm, my customer faced the problem that the number of referenced objects grew too big. The memory used by these objects could not be reclaimed by the JVM because they were still hard-referenced by the cache. In his case the root cause was not an inappropriately sized cache or a bad eviction policy. The problem was that the cached objects were too big and occupied too much memory.
The cache had not yet evicted these objects based on the timed LRU algorithm, and therefore the GC could not reclaim them to free up memory.

Solving This with Memory-sensitive Standard Java Mechanisms

The problem caused by hard references was addressed by the Java standard library early on (version 1.2) with so-called reference objects. We'll only focus on one of them: soft references. SoftReference is a class that was created explicitly for the purpose of being used with memory-sensitive caches. A soft reference will, according to the official Javadoc, be "cleared at the discretion of the garbage collector in response to memory demand." In other words, the reference is kept as long as there is no need for more memory and can potentially be cleared if the JVM needs more memory. While the specification states that this can happen at any time, the implementations I know only do it to prevent an OutOfMemoryError: when the garbage collector cannot free enough memory, it will clear all SoftReferences and then run another garbage collection before throwing an OutOfMemoryError. If a SoftReference is the only thing that keeps an object tree alive, the whole tree can be collected. In my customer's case this means that the cache would be flushed before an OutOfMemoryError is thrown, preventing the error and making soft references a perfect fail-safe option for his cache.

Side Effects

While soft references have their uses, nothing is without side effects.

Garbage Collection and Memory Demand

First of all, a memory-sensitive cache often has the effect that people size it too big. The assumption is that since the cache can be cleared on memory demand, we should use all available memory for it, because it will not pose a problem. The cache will then grow until it fills a large portion of the memory. As we have learned before, this leads to slower and slower garbage collections because of the many objects to check. At some point the GC will flush the cache and everything will be peachy again, right? Not really; the cache will just grow again. This is often mistaken for a memory leak, and because many heap analyzers treat soft references in a special way, it is not easily spotted.

Additionally, SoftReference objects occupy memory themselves, and the mere number of these soft reference objects can also create OutOfMemory errors. The SoftReference wrappers themselves are not removed by the garbage collector when their referents are cleared! For example, if you create a SoftReference object for every key and every value in your map, these SoftReference objects are not going to be collected when the objects they point to are collected. That means that you'll get the same problem as previously mentioned, except that instead of being caused by many objects of type Key and Value, it'll be triggered by SoftReference objects. The small schema below explains how a soft reference works in Java and how an OutOfMemory error can still happen:

Figure: Soft references before and after flush

This is why you generally cannot use a memory-sensitive cache without the cache algorithms mentioned above: the combination of a good cache algorithm with SoftReference creates a very robust cache system that should limit the number of OutOfMemory occurrences.

Cache Systems Must Handle Flushed Soft References

Your cache system must be able to deal with the situation when it gets flushed.
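A minimal sketch of such a memory-sensitive cache, combining a size-bounded LRU map with SoftReference values, and a get() that cleans out entries the GC has cleared (the class is illustrative and leaves out the timed eviction from the customer's scenario):

    import java.lang.ref.SoftReference;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // A size-bounded LRU cache whose values are softly referenced, so the GC may
    // clear them under memory pressure before an OutOfMemoryError is thrown.
    public class MemorySensitiveCache<K, V> {

        private static final int MAX_ENTRIES = 1000; // illustrative limit

        // access-order LinkedHashMap gives us the LRU eviction policy
        private final Map<K, SoftReference<V>> map =
                new LinkedHashMap<K, SoftReference<V>>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                        return size() > MAX_ENTRIES;
                    }
                };

        public synchronized void put(K key, V value) {
            map.put(key, new SoftReference<V>(value));
        }

        public synchronized V get(K key) {
            SoftReference<V> ref = map.get(key);
            if (ref == null) {
                return null;          // never cached or evicted by the LRU policy
            }
            V value = ref.get();
            if (value == null) {
                map.remove(key);      // cleared by the GC under memory pressure
            }
            return value;             // caller must handle null and reload the data
        }
    }

Note that the SoftReference wrappers stay in the map until they are looked up or evicted by the LRU policy, which is exactly the bookkeeping issue described above, and callers must be prepared for get() to return null and reload the data.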
There is No Standard Map Implementation Using SoftReferences

To preempt any hardcore coders posting comments about WeakHashMap, let me explain that a WeakHashMap uses WeakReferences and not SoftReferences. As a weak reference can be flushed on every GC, even minor ones, it is not suitable for a cache implementation. In my customer's case the cache would have been flushed far too often, not adding any real value. This being said, do not despair! The Apache Commons Collections library provides a map implementation that lets you choose the kind of reference to use for keys and values independently. This implementation is the ReferenceMap class, and it allows you to create your own cache based on a map without having to develop it from scratch.

Conclusion

Soft references provide a good means of making cache systems more stable. They should be thought of as an enhancement to, not a replacement for, an eviction policy. Many existing cache systems already leverage them, but I encounter many home-grown cache solutions at our clients, so it is good to know about this. And finally it should be said that while soft references make a cache system better, they are not without side effects, and we should take a hard look at them before trusting that everything is fine.

Microsoft Not Following Best Practices Slows Down Firefox on Outlook Web Access
by Andreas Grabner

From time to time I access my work emails through Outlook Web Access (OWA), which works really well in all browsers I run on my laptop (IE, FF and Chrome). Guessing that Microsoft probably optimized OWA for its own browser, I thought that I would definitely find JavaScript code that doesn't execute as well on Firefox as it does on Internet Explorer. From an end user's perspective there seems to be no noticeable performance difference, but using dynaTrace AJAX Edition (also check out the video tutorials) I found a very interesting JavaScript method that shows a big performance difference when iterating over DOM elements.

Allow Multiple DOM Elements with the Same ID? That is Not a Good Practice!

I recorded the same sequence of actions on both Internet Explorer 8 and Firefox 3.6. This includes logging on, selecting an email folder, clicking through multiple emails, selecting the Unread Mail Search Folder and then logging out. The following screenshot shows the Timeline Tab (YouTube Tutorial) of the Performance Report including all my activities while I was logged on to Outlook Web Access.

Outlook Web Access Timeline showing all browser activities when clicking through the folders

Instead of comparing the individual mouse and keyboard event handlers I opened the hotspot view (YouTube tutorial) to identify the slowest JavaScript methods.
If you use dynaTrace, just drill into the HotSpot view from the selected URL in your performance report (YouTube tutorial). There was one method with a slow execution time. This is the pure method execution time, excluding the time of child method calls. The method in question is called getDescendantById. It returns the DOM element, identified by its ID, that is a descendant of the current DOM element. If you look at the following screenshot we see the same method call returning no value in both browsers, meaning that the element we are looking for (divReplaceFolderId) is not on that page. It's interesting to see that the method executes in 0.47ms on Internet Explorer but takes 12.39ms on Firefox. A closer look at the method implementation makes me wonder what the developer is trying to accomplish with this method:

Special implementation for non-IE browsers to get elements by ID

If I understand the intention of this method correctly, it should return THE unique element identified by its ID. It seems, though, that IE allows multiple elements with the same ID on a single page. That's why the method implements a workaround that uses getElementsByTagName and then accesses the returned element array by ID. In case there are multiple elements with the same ID, the method returns the first one. In case no element was found and we are not running on IE, the implementation iterates through ALL DOM elements and returns the first element that matches the ID. This looks like an odd implementation to me, with the result that on non-IE browsers we have to iterate through a loop that will probably never return any element anyway. Here is some pseudo-code on how I would implement this; happy to get your input on it:

var elem = document.getElementById(d); // d is the ID we are looking for
if (elem != null) {
    // check if this element is a descendant of our current element
    var checkNode = elem.parentNode;
    while (checkNode != null && checkNode != this._get_DomNode())
        checkNode = checkNode.parentNode;
    if (checkNode == null)
        elem = null; // not a descendant of the current element
}
return elem;

This code works on both IE and FF, even if there were duplicated elements with the same ID, which we should definitely avoid anyway.

Firefox Faster in Loading Media Player?

I continued exploring the sessions I recorded on both IE and FF. Interestingly enough, I found a method that initializes a Media Player JavaScript library. Check out the following image. It shows the difference in execution time between IE and FF. This time Firefox is much faster, at least at first peek:

It appears that initializing the Media Player library is much faster in Firefox

The time difference here is very significant: 358ms compared to 0.08ms. When we look at the actual execution trace of both browsers, however, we see that IE is executing the if(b.controls) control branch whereas Firefox does not. This tells me that I haven't installed the Media Player plugin on Firefox:

Actual JavaScript trace comparison between Internet Explorer and Firefox

The lesson learned here is that we always have to look at the actual PurePath (YouTube tutorial), as we can only compare performance when both browsers executed the same thing.

Conclusion

Before writing this blog I hoped to find JavaScript performance problem patterns in Firefox similar to the IE-related problem patterns I blogged about in the past, such as Top 10 Client-Side Performance Problems. Instead of finding real JavaScript performance patterns it seems I keep finding code that has only been optimized for IE but was not really tested, updated or optimized for Firefox.
Similar to my blog Slow Page Load Time in Firefox caused by older versions of YUI, jQuery, I consider the findings in this blog to go in the same direction. Microsoft implements its JavaScript code to deal with an IE-specific situation (allowing multiple elements with the same ID) but hasn't really implemented the workaround to work efficiently enough in other browsers. By the way: comparing JavaScript method executions as I did here with the free dynaTrace AJAX Edition is easier and can be automated using the dynaTrace Premium AJAX Extensions. If you have any similar findings or actual Firefox JavaScript performance problem patterns, let me know; I would be happy to blog about it.

Why Performance Management is Easier in Public than On-premise Clouds
by Michael Kopp

Performance is one of the major concerns in the cloud. But the question should not really be whether or not the cloud performs, but whether the application in question can and does perform in the cloud. The main problem here is that application performance is either not managed at all or managed incorrectly, and therefore this question often remains unanswered. Now granted, performance management in cloud environments is harder than in physical ones, but it can be argued that it is easier in public clouds than in on-premise clouds or even a large virtualized environment. How do I come to that conclusion? Before answering that, let's look at the unique challenges that virtualization in general and clouds in particular pose to the realm of application performance management (APM).

Time is relative

The problem with timekeeping is well known in the VMware community. There is a very good VMware whitepaper that explains this in quite some detail. It doesn't tell the whole story, however, because obviously there are other virtualization solutions like Xen, KVM, Hyper-V and more. All of them solve this problem differently. On top of that, the various guest operating systems behave very differently as well. In fact I might write a whole article just about that, but the net result is that time measurement inside a guest is not accurate unless you know what you are doing. It might lag behind real time and speed up to catch up in the next moment. If your monitoring tool is aware of that and supports native timing calls it can work around this and give you real response times. Unfortunately that leads to yet another problem. Your VM is not running all the time: like a process it will get de-scheduled from time to time; however, unlike a process, it will not be aware of that. While real time is important for response time, it will skew your performance analysis on a deeper level.

The effects of timekeeping on response and execution time

If you measure real time, then Method B looks more expensive than it actually is. This might lead you down the wrong track when you are looking for a performance problem. When you measure apparent time you don't have this problem, but your response times do not reflect the real user experience. There are generally two ways of handling this. Your monitoring solution can capture these de-schedule times and account for them all the way against your execution times. The more granular your measurement, the more overhead this will produce. The more pragmatic approach is to simply account for this once per transaction and thus capture the impact that the de-schedules have on your response time. Yet another approach is to periodically read the CPU steal time (either from vSphere or via mpstat on Xen) and correlate this with your transaction data.
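As an illustration of that last approach, the sketch below samples the aggregated CPU counters of a Linux guest from /proc/stat and computes the steal-time percentage between two samples. It is a rough, hedged example: the column layout assumes a kernel that reports steal time (2.6.11 or later), and in practice you would take these numbers from the hypervisor, CloudWatch or mpstat rather than rolling your own sampler:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Rough sketch: read the aggregated "cpu" line from /proc/stat on a Linux guest and
// compute the share of steal time (cycles the hypervisor gave to other VMs) between two samples.
public class StealTimeSampler {

    // returns { total jiffies, steal jiffies } from the first line of /proc/stat
    static long[] readCpuCounters() throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("/proc/stat"));
        try {
            // first line looks like: cpu user nice system idle iowait irq softirq steal ...
            String[] parts = reader.readLine().trim().split("\\s+");
            long total = 0;
            long steal = 0;
            for (int i = 1; i < parts.length; i++) {
                long value = Long.parseLong(parts[i]);
                total += value;
                if (i == 8) {
                    steal = value; // the 8th numeric column is steal time
                }
            }
            return new long[] { total, steal };
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] before = readCpuCounters();
        Thread.sleep(10000); // sampling interval of ten seconds
        long[] after = readCpuCounters();
        double stealPercent = 100.0 * (after[1] - before[1]) / (after[0] - before[0]);
        System.out.printf("steal time over the last interval: %.2f%%%n", stealPercent);
    }
}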
Any of these approaches will give you a better grasp on things. Even then it will add a level of uncertainty to your performance diagnostics, but at least you know the real response time and how fast your transactions really are. Bottom line: those two are no longer the same thing.

The Impact of Shared Environments

The sharing of resources is what makes virtualization and cloud environments compelling from a cost perspective. Most normal data centers have an average CPU utilization far below 20%. The reason is twofold: on the one hand they isolate the different applications by running them on different hardware; on the other hand they have to provision for peak load. By using virtualization you can put multiple isolated applications on the same hardware. Resource utilization is higher, but even then it does not go beyond 30-40 percent most of the time, as you still need to take peak load into account. But the peak loads for the different applications might occur at different times! The first order of business here is to find the optimal balance.

The first thing to realize is that your VM is treated like a process by the virtualization infrastructure. It gets a share of resources; how much can be configured. If it reaches the configured limit it has to wait. The same is true if the physical resources are exhausted. To drive utilization higher, virtualization and cloud environments overcommit. That means they allow, say, ten 2GHz VMs on a 16GHz physical machine. Most of the time this is perfectly fine, as not all VMs will demand 100 percent CPU at the same time. If there is not enough CPU to go around, some will be de-scheduled and will be given a greater share the next time around. Most importantly, this is not only true for CPU but also for memory, disk and network I/O. What does this mean for performance management? It means that increasing load on one application, or a bug in the same, can impact another application negatively without you being aware of it. Without a virtualization-aware monitoring solution that also monitors the other applications you will not see this. All you see is that application performance goes down!

When the load increases on one application it affects the other

With proper tools this is relatively easy to catch for CPU-related problems, but a lot harder for I/O-related issues. So you need to monitor both applications, their VMs and the underlying virtualization infrastructure and correlate the information. That adds a lot of complexity. The virtualization vendors try to solve this by looking purely at VM- and host-level system metrics. What they forget is that high utilization of a resource does not mean the application is slow! And it is the application we care about.

OS Metrics are Worse than Useless

Now for the good stuff. Forget your guest operating system utilization metrics; they are not showing you what is really going on. There are several reasons why that is so. One is the timekeeping problem. Even if you and your monitoring tool use the right timer and measure time correctly, your operating system might not.
In fact most systems will not read out the timer device all the time, but rely on the CPU frequency and counters to estimate time, as this is faster than reading the timer device. As utilization metrics are always based on a total number of possible requests or instructions per time slice, they get skewed by that. This is true for every metric, not just CPU. The second problem is that the guest does not really know the upper limit for a resource, as the virtualization environment might overcommit. That means you may never be able to get 100%, or you can get it at one time but not at another. A good example is the Amazon EC2 cloud. Although I cannot be sure, I suspect that the guest CPU metrics are actually correct: they correctly report the CPU utilization of the underlying hardware, only you will never get 100% of the underlying hardware. So without knowing how much of a share you get, they are useless.

What does this mean? You can rely on absolute numbers like the number of I/O requests, the number of SQL statements and the amount of data sent over the wire for a specific application or transaction. But you do not know whether an over-utilization of the physical hardware presents a bottleneck. There are two ways to solve this problem.

The first involves correlating resource and throughput metrics of your application with the reported utilization and throughput measures on the virtualization layer. In the case of VMware that means correlating detailed application- and transaction-level metrics with metrics provided by vSphere. On EC2 you can do the same with metrics provided by CloudWatch.

EC2 cloud monitoring dashboard showing 3 instances

This is the approach recommended by some virtualization vendors. It is possible, but because of the complexity it requires a lot of expertise. You do, however, know which VM consumes how much of your resources. With a little calculation magic you can break this down to application and transaction level, at least on average. You need this for resource optimization and to decide which VMs should be moved to different physical hardware. It does not do you a lot of good in case of acute performance problems or troubleshooting, as you don't know the actual impact of the resource shortage (or whether it has an impact at all). You might move a VM and not actually speed things up. The real crux is that just because something is heavily used does not mean that it is the source of your performance problem! And of course this approach only works if you are in charge of the hardware, meaning it does not work with public clouds.

The second option is one that is, among others, proposed by Bernd Harzog, a well-known expert in the virtualization space. It is also the one that I would recommend.

Response Time, Response Time, Latency and More Response Time

On the Virtualization Practice blog Bernd explains in detail why resource utilization helps you with neither performance management nor capacity planning. Instead he points out that what really matters is the response time or throughput of your application. If your physical hardware or virtualization infrastructure runs into utilization problems, the easiest way to spot this is when it slows down. In effect that means that I/O requests done by your application are slowing down, and you can measure that. What's more important is that you can turn this around: if your application performs fine then, whatever the virtualization or cloud infrastructure reports, there is no performance problem.
To be more accurate, you only need to analyze the virtualization layer if your application performance monitoring shows that a high portion of your response time is down to CPU shortage, memory shortage or I/O latency. If that is not the case, then nothing is gained by optimizing the virtualization layer from a performance perspective.

Network impact on the transaction is minimal, even though network utilization is high

Diagnosing the Virtualization Layer

Of course, in the case of virtualization and private clouds you still need to diagnose an infrastructure response time problem once it has been identified. You measure the infrastructure response time inside your application. If you have identified a bottleneck, meaning it slows down or makes up a big portion of your response time, you need to relate that infrastructure response time back to your virtualized infrastructure: which resource slows down? From there you can use the metrics provided by VMware (or whoever your virtualization vendor is) to diagnose the root cause of the bottleneck. The key is that you identify the problem based on actual impact and then use the infrastructure metrics to diagnose its cause.

Layers Add Complexity

What this of course means is that you now have to manage performance on even more levels than before. It also means that you have to somehow manage which VMs run on the same physical host. We have already seen that the nature of the shared environment means that applications can impact each other. So a big part of managing performance in a virtualized environment is to detect that impact and tune your environment in a way that both minimizes the impact and maximizes your resource usage and utilization. These are diametrically opposed goals!

Now, What about the Cloud?

A cloud is by nature more dynamic than a simple virtualized environment. A cloud will enable you to provision new environments on the fly and also dispose of them again. This will lead to spikes in your utilization, leading to performance impact on existing applications. So in the cloud the minimize-impact vs. maximize-resource-usage goal becomes even harder to achieve. Cloud vendors usually provide you with management software to manage the placement of your VMs. They will move them around based on complex algorithms to try and achieve the impossible goal of high performance and high utilization. The success is limited, because most of these management solutions ignore the application and only look at the virtualization layer to make these decisions. It's a vicious cycle and the price you pay for better utilizing your datacenter and faster provisioning of new environments.

Maybe a bigger issue is capacity management. The shared nature of the environment prevents you from making straightforward predictions about capacity usage on a hardware level. You get a long way by relating the requests done by your application on a transactional level with the capacity usage on the virtualization layer, but that is cumbersome and does not lead to accurate results. Then of course a cloud is dynamic and your application is distributed, so without a solution that measures all your transactions and auto-detects changes in the cloud environment you can easily make this a full-time job. Another problem is that the only way to notice a real capacity problem is to determine whether the infrastructure response time goes down and negatively impacts your application. Remember: utilization does not equal performance, and you want high utilization anyway!
But once you notice capacity problems, it is too late to order new hardware. That means that you not only need to provision for peak loads, effectively over-provisioning again, you also need to take all those temporary and newly-provisioned environments into account. A match made in planning hell.

Performance Management in a Public Cloud

First let me clarify the term public cloud here. While a public cloud has many characteristics, the most important ones for this article are that you don't own the hardware, have limited control over it and can provision new instances on the fly. If you think about this carefully you will notice immediately that you have fewer problems. You only care about the performance of your application and not at all about the utilization of the hardware; it's not your hardware after all. Meaning there are no competing goals! Depending on your application you will add a new instance if response time goes down on a specific tier or if you need more throughput than you currently achieve. You provision on the fly, meaning your capacity management is done on the fly as well. Another problem solved.

You still run in a shared environment and this will impact you. But your options are limited, as you cannot monitor or fix this directly. What you can do is measure the latency of the infrastructure. If you notice a slowdown you can talk to your vendor, though most of the time you will not care and will just terminate the old instance and start a new one if infrastructure response time goes down. Chances are the new instance is started on a less utilized server, and that's that. I won't say that this is easy. I also do not say that this is better, but I do say that performance management is easier than in private clouds.

Conclusion

Private and public cloud strategies are based on similar underlying technologies. Just because they are based on similar technologies, however, doesn't mean that they are similar in any way in terms of actual usage. In the private cloud, the goal is to become more efficient by dynamically and automatically allocating resources in order to drive up utilization while also lowering the management costs of those many instances. The problem with this is that driving up utilization and having high performance are competing goals. The higher the utilization, the more the applications will impact one another. Reaching a balance is highly complex, and is made more complex by the dynamic nature of the private cloud. In the public cloud, these competing goals are split between the cloud provider, who cares about utilization, and the application owner, who cares about performance. In the public cloud the application owner has limited options: he can measure application performance; he can measure the impact of infrastructure degradation on the performance of his business transactions; but he cannot resolve the actual degradation. All he can do is terminate slow instances and/or add new ones in the hope that they will perform at a higher level. In this way, performance in the public cloud is in fact easier to manage.

But whether it be public or private, you must actively manage performance in a cloud production environment. In the private cloud you need to maintain a balance between high utilization and application performance, which requires you to know what is going on under the hood. And without application performance management in the public cloud, application owners are at the mercy of cloud providers, whose goals are not necessarily aligned with theirs.
Why Response Times are Often Measured Incorrectly
by Alois Reitbauer

Response times are in many, if not most, cases the basis for performance analysis. When they are within expected boundaries everything is OK. When they get too high we start optimizing our applications. So response times play a central role in performance monitoring and analysis. In virtualized and cloud environments they are the most accurate performance metric you can get. Very often, however, people measure and interpret response times the wrong way. This is more than reason enough to discuss the topic of response time measurement and how to interpret the results. I will therefore discuss typical measurement approaches, the related misunderstandings and how to improve them.

Averaging Information Away

When measuring response times, we cannot look at each and every single measurement. Even in very small production systems the number of transactions is unmanageable. Therefore measurements are aggregated for a certain timeframe. Depending on the monitoring configuration this might be seconds, minutes or even hours. While this aggregation helps us to easily understand response times in large-volume systems, it also means that we are losing information. The most common approach to measurement aggregation is using averages. This means the collected measurements are averaged and we work with the average instead of the real values. The problem with averages is that they in many cases do not reflect what is happening in the real world. There are two main reasons why working with averages leads to wrong or misleading results.

In the case of measurements that are highly volatile, the average is not representative of the actually measured response times. If our measurements range from 1 to 4 seconds, the average might be around 2 seconds, which certainly does not represent what many of our users perceive. So averages provide only little insight into real-world performance. Instead of working with averages you should use percentiles. If you talk to people who have been working in the performance space for some time, they will tell you that the only reliable metrics to work with are percentiles. In contrast to averages, percentiles define how many users perceived response times slower than a certain threshold. If the 50th percentile, for example, is 2.5 seconds, this means that the response times for 50 percent of your users were less than or equal to 2.5 seconds. As you can see, this approach is far closer to reality than using averages.

Percentiles and average of a measurement series

The only potential downside of percentiles is that they require more data to be stored than averages do. While average calculation only requires the sum and count of all measurements, percentiles require a whole range of measurement values, as their calculation is more complex. This is also the reason why not all performance management tools support them.
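To make the contrast concrete, here is a small, purely illustrative calculation of the average and two nearest-rank percentiles over a handful of response times. The numbers and the nearest-rank definition are just an example, not how any particular monitoring product computes its percentiles:

import java.util.Arrays;

// Illustrative only: average vs. nearest-rank percentiles over response times in milliseconds.
public class PercentileExample {

    static double average(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // nearest-rank definition
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        double[] responseTimes = { 900, 1100, 1200, 1300, 1500, 1800, 2500, 3200, 3900, 4000 };
        System.out.println("average:         " + average(responseTimes));        // 2140 ms
        System.out.println("50th percentile: " + percentile(responseTimes, 50)); // 1500 ms
        System.out.println("95th percentile: " + percentile(responseTimes, 95)); // 4000 ms
    }
}

The average of roughly 2.1 seconds suggests everything is fine, while the percentiles show that a noticeable share of users waited 3 seconds or more; that is exactly the information averaging throws away.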
Putting Everything into One Box

Another important question when aggregating data is which data you use as the basis of your aggregations. If you mix together data for different transaction types, like the start page, a search and a credit card validation, the results will be of little value, as the base data is as different as apples and oranges. So in addition to working with percentiles it is necessary to split transaction types properly, so that the data that forms the basis of your calculations fits together. The concept of splitting transactions by their business function is often referred to as business transaction management (BTM). While the field of BTM is wide, the basic idea is to distinguish transactions in an application by logical parameters like what they do or where they come from. An example would be a "put into cart" transaction or the requests of a certain user. Only a combination of both approaches ensures that the response times you measure are a solid basis for performance analysis.

Far from the Real World

Another point to consider with response times is where they are measured. Most people measure response times at the server side and implicitly assume that they represent what real users see. While server-side response times may be down at 500 milliseconds and everyone thinks everything is fine, users might experience response times of several seconds. The reason is that server-side response times don't take a lot of the factors influencing end-user response times into account. First of all, server-side measurements neglect the network transfer time to the end users. This easily adds half a second or more to your response times.
Server vs. client response time

At the same time, server-side response times often only measure the initial document sent to the user. All the images, JavaScript and CSS files that are required to render the page properly are not included in this calculation at all. Experts like Souders even say that only 10 percent of the overall response time is influenced by the server side. Even if we consider this an extreme scenario, it is obvious that basing performance management solely on server-side metrics does not provide a solid basis for understanding end-user performance. The situation gets even worse with JavaScript-heavy Web 2.0 applications, where a great portion of the application logic is executed within the browser. In this case server-side metrics cannot be taken as representative of end-user performance at all.

Not Measuring What You Want to Know

A common approach to solving this problem is to use synthetic transaction monitoring. This approach often claims to be close to the end user. Commercial providers offer a huge number of locations around the world from which you can test the performance of pre-defined transactions. While this provides better insight into the perceived performance of end users, it is not the full truth. The most important thing to understand is how these measurements are collected. There are two approaches to collecting this data: via emulators or via real browsers. From my very personal perspective, any approach that does not use real browsers should be avoided, as real browsers are also what your users use. They are the only way to get accurate measurements.

The issue with using synthetic transactions for performance measurement is that it is not about real users. Your synthetic transactions might run pretty fast, but that guy with a slow internet connection who just wants to book a $5,000 holiday (OK, a rare case) still sees 10-second response times. Is it the fault of your application? No. Do you care? Yes, because this is your business. Additionally, synthetic transaction monitoring cannot monitor all of your transactions. You cannot really book a holiday every couple of minutes, so in the end you only get a portion of your transactions covered by your monitoring. This does not mean that there is no value in using synthetic transactions. They are great for being informed about availability or network problems that might affect your users, but they do not represent what your users actually see. As a consequence, they do not serve as a solid basis for performance improvements.

Measuring at the End-User Level

The only way to get real user performance metrics is to measure from within the user's browser. There are two approaches to do this. You can use a tool like the free dynaTrace AJAX Edition, which uses a browser plug-in to collect performance data, or inject JavaScript code to get performance metrics. The W3C now also has a number of standardization activities for browser performance APIs. The Navigation Timing specification is already supported by recent browser releases, as is the Resource Timing specification. Open-source implementations like Boomerang provide a convenient way to access performance data within the browser. Products like dynaTrace User Experience Management (UEM) go further by providing a highly scalable backend and full integration with your server-side systems. The main idea is to inject custom JavaScript code which captures timing information like the beginning of a request, DOM ready and fully loaded.
While these events are sufficient for classic web applications, they are not enough for Web 2.0 applications, which execute a lot of client-side code. In this case the JavaScript code has to be instrumented as well.

Is it Enough to Measure on the Client-side?

The question now is whether it is enough to measure performance from the end-user perspective. If we know how our web application performs for each user, we have enough information to see whether an application is slow or fast. If we then combine this data with information like geo location, browser and connection speed, we know for which users a problem exists. So from a pure monitoring perspective this is enough. In case of problems, however, we want to go beyond monitoring. Monitoring only tells us that we have a problem but does not help in finding its cause. Especially when we measure end-user performance, our information is less rich compared to development-centric approaches. We could still use a development-focused tool like dynaTrace AJAX Edition for production troubleshooting. This, however, requires installing custom software on an end user's machine. While this might be an option for SaaS environments, it is not in a typical eCommerce scenario. The only way to gain this level of insight for diagnostic purposes is to collect information from the browser as well as from the server side to get a holistic view of application performance. As discussed, using averaged metrics is not enough in this case; aggregated data does not provide the insight we need. So instead of aggregated information we need the ability to identify and relate the requests of a user's browser to server-side requests.

Client/server drill-down of pages and actions

The figure below shows an architecture based on (and abstracted from) dynaTrace UEM which provides this functionality. It shows the combination of browser and server-side data capturing on a transactional basis and a centralized performance repository for analysis.

Architecture for end-to-end user experience monitoring

Conclusion

There are many places where, and many ways how, response times can be measured. Depending on what we want to achieve, each of them provides more or less accurate data. For the analysis of server-side problems, measuring at the server side is enough. We do, however, have to be aware that this does not reflect the response times of our end users. It is a purely technical metric for optimizing the way we create content and serve requests. The prerequisite for meaningful measurements is that we separate different transaction types properly. Measurements from anything but the end-user's perspective can only be used to optimize your technical infrastructure and only indirectly the performance of end users. Only performance measurements in the browser enable you to understand and optimize user-perceived performance.

Automated Cross Browser Web 2.0 Performance Optimizations: Best Practices from GSI Commerce
by Andreas Grabner

A while back I hosted a webinar with Ron Woody, Director of Performance at GSI Commerce (now part of eBay). Ron and his team are users of dynaTrace, both the AJAX and Test Center Editions. During the webinar we discussed the advantages and challenges that Web 2.0 offers, with a big focus on eCommerce. This blog is a summary of what we discussed, including Ron's tips, tricks and best practices. The screenshots are taken from the original webinar slide deck. If you want to watch the full webinar, you can go ahead and access it online.
Web 2.0: An Opportunity for eCommerce

Web 2.0 is a great chance to make the web more interactive. Especially for eCommerce sites it brings many benefits. In order to leverage those benefits we have to understand how to manage the complexity that comes with this new technology.

The Benefits of Web 2.0

JavaScript, CSS, XHR and many others: that's what makes interactive web sites possible, and that's what many of us consider Web 2.0 to be. When navigating through an online shop, users can use dynamic menus or search suggestions to more easily find what they are looking for. Web 2.0 also eliminates the need for full page reloads for certain user interactions, e.g. displaying additional product information when hovering the mouse over the product image. This allows the user to become more productive by reducing the time it takes to find and purchase a product.

Web 2.0 allows us to build more interactive web sites that support the user in finding the right information faster

The Challenges

The challenge is that you are not alone in what you have to offer to your users. Your competition leverages Web 2.0 to attract more users and is there for those users that are not happy with the experience on your own site. If your pages are slow or don't work as expected, online shoppers will go to your competitor. You may only lose them for this one shopping experience, but you may lose them forever if the competitor satisfies their needs. Worse than that, frustrated users share their experience with their friends, impacting your reputation.

Performance, reliability and compatibility keep your users happy. Otherwise you lose money and damage your reputation

The Complexity of Web 2.0

Performance optimization was easier before we had powerful browsers supporting JavaScript, CSS, DOM, AJAX and so on. When we take a look at a Web 2.0 application, we no longer deal with an application that only lives on the application server and whose generated content simply gets rendered by the browser. We have an application that spans both server and client (browser). Application and navigation logic has moved into the browser to provide a better end-user experience. Web 2.0 applications leverage JavaScript frameworks that make building these applications easier. But just because an application can be built faster doesn't mean it operates faster and without problems. The challenge is that we have to understand all the moving parts in a Web 2.0 application, as outlined in the following illustration:

Web 2.0 applications run on both server and client (browser) using a set of new components (JS, DOM, CSS, AJAX, etc.)

Performance in Web 2.0

With more application logic sitting in the client (browser), it becomes more important to measure performance for the actual end user. We need to split page load times into time spent on the server vs. time spent on the client (browser). The more JavaScript libraries we use, the more fancy UI effects we add to the page and the more dynamic content we load from the server, the higher the chance that we end up with a performance problem in the client. The following illustration shows a performance analysis of a typical eCommerce user scenario.
It splits the end-user's perceived performance into time spent in the browser and time spent on download (server time):

Up to 6 seconds spent in the browser for shipping, payment and confirm

Users get frustrated when waiting too long for a page. Optimizing the server side is one aspect and will speed up page load time. But the more logic that gets moved to the browser, the more you need to focus on optimizing the browser side as well. The challenge here is that you cannot control the environment the way you can on the server side. You have users browsing with the latest version of Chrome or Firefox, but you will also have users that browse with an older version of Internet Explorer. Why does that make a difference? Because the JavaScript engines in older browsers are slower and increase the time spent in the browser. Older browsers also lack a set of core features, such as looking up elements by class name. JavaScript frameworks such as jQuery work around this problem by implementing the missing features in JavaScript, which is much slower than a native implementation. It is therefore important to test your applications on a broad range of browsers and optimize your pages if necessary.

How GSI Commerce Tames the Web 2.0 Beast

GSI Commerce (now eBay) powers sites such as NFL, Toys R Us, ACE, Adidas and many other leading eCommerce companies. In order to make sure these sites attract new and keep existing online users, it is important to test and optimize these applications before new versions get released.

Business Impact of Performance

Ron discussed the fact that performance indeed has a direct impact on business. We've already heard about this impact when Google and Bing announced the results of their performance studies. GSI confirms these results: poor performance has a direct impact on sales. Here is why:

Our clients' competitors are only a click away
Poor performance increases site abandonment risk
Slow performance may impact brand

Client and Server-Side Testing

GSI does traditional server-side load testing using HP LoadRunner in combination with dynaTrace Test Center Edition. They execute tests against real-life servers hosting their applications. On the other hand, they also execute client-side tests using HP Quick Test Pro with dynaTrace AJAX Edition to test and analyze performance in the browser. They also leverage YSlow and WebPageTest.org.

GSI Browser Lab

Online users of web sites powered by GSI use all different types of browsers. Therefore GSI built their own browser lab including all common versions of Internet Explorer and Firefox. Since dynaTrace AJAX also supports Firefox, they run dynaTrace AJAX Edition on all of their test machines, as it gives them full visibility into rendering, JavaScript, DOM and AJAX. They use HP Quick Test Pro for their test automation:

GSI Browser Lab is powered by dynaTrace AJAX Edition and includes multiple versions of Internet Explorer and Firefox

How GSI uses dynaTrace

While HP Quick Test Pro drives the browser to test the individual use cases, dynaTrace AJAX Edition captures performance-relevant information. GSI uses a public API to extract this data and pushes it into a web-based reporting solution that helps them easily analyze performance across browsers and builds. In case there are problems on the browser side, the recorded dynaTrace AJAX Edition sessions contain all the information necessary to diagnose and fix JavaScript, rendering, DOM or AJAX problems.
This allows developers to see what really happened in the browser when the error occurred, without having to try to reproduce the problem. In case it turns out that certain requests took too long on the application server, GSI can drill into the server-side PurePaths, as they also run dynaTrace on their application servers. Using dynaTrace in testing bridges the gap between testers and developers. Capturing this rich information and sharing it with a mouse click makes collaboration between these departments much easier. Developers also start using dynaTrace prior to shipping their code to testing. Ron actually specified acceptance criteria: every new feature must at least reach a certain dynaTrace AJAX Edition page rank before it can go into testing. Besides running dynaTrace on the desktops of developers and testers, GSI is moving towards automating dynaTrace in Continuous Integration (CI) to identify more problems in an automated way during development.

Analyze performance and validate architectural rules by letting dynaTrace analyze unit and functional tests in CI

Saving Time and Money

With dynaTrace's in-depth visibility and the ability to automate many tasks on both the client and the server side it was possible to:

Reduce test time from 20 hours to 2 hours
Find more problems faster
Shorten project time

The fact that developers actually see what happened improves collaboration with the testers. It eliminates the constant back and forth. The deep visibility allows identification of problems that were difficult or impossible to see before. Especially rendering, JavaScript and DOM analysis have been really helpful for optimizing page load time.

Tips and Tricks

Here is a list of tips and tricks that Ron shared with the audience:

Clear the browser cache when testing load time
> Different load behavior depending on network time
Multiple page testing
> Simulates consumer behavior with regard to caching
Test in different browsers
> IE 6, 7 & 8 have different behavior
> Compare Firefox & IE to understand cross-browser behavior
> Use weighted averages based on browser traffic
Test from different locations
> www.webpagetest.org to test sites from around the world (including dynaTrace Session Tracking)
> Commercial solutions for network latency, cloud testing, etc.

Best Practices

Ron concluded with the following best practices:

Performance matters
Define performance targets
Test client-side performance
> Cross-browser testing
> Add server-side testing
> Tie everything together
Automate!
Get Test and Development on the same page
Get proactive
Benchmark your site against the competition

Conclusion

If you are interested in listening to the full webinar, which also includes an extended Q&A session at the end, go ahead and listen to the recorded version.

Goal-oriented Auto Scaling in the Cloud
by Michael Kopp

The ability to scale your environment on demand is one of the key advantages of a public cloud like Amazon EC2. Amazon provides a lot of functionality, like Auto Scaling groups, to make this easy. The one downside in my mind is that basing auto scaling on system metrics is a little naive and, from my experience, only works well in a limited number of scenarios. I wouldn't want to scale indefinitely either, so I need to choose an arbitrary upper boundary for my Auto Scaling group. Both the upper boundary and the system metrics are unrelated to my actual goal, which is always application related, e.g. throughput or response time.
Some time back Amazon added the ability to add custom metrics to CloudWatch. This opens up interesting possibilities. One of them is to do goal-oriented auto scaling.

Scale for Desired Throughput

A key use case that I see for a public cloud is batch processing. This is throughput- and not response-time-oriented. I can easily upload the measured throughput to CloudWatch and trigger auto scaling events on lower and upper boundaries. But of course I don't want to base scaling events on throughput alone: if my application isn't doing anything I wouldn't want to add instances. On the other hand, defining the desired throughput statically might not make sense either, as it depends on the current job. My actual goal is to finish the batch in a specific timeframe. So let's size our EC2 environment based on that! I wrote a simple Java program that takes the current throughput, the remaining time and the remaining number of transactions, and calculates the throughput needed to finish in time. It then calculates the ratio of actual to needed throughput as a percentage and pushes it out to CloudWatch:

public void setCurrentSpeed(double transactionsPerMinute, long remainingTransactions,
                            long remainingTimeInMinutes, String jobName) {
    double targetTPM;
    double currentSpeed;
    if (remainingTimeInMinutes > 0 && remainingTransactions > 0) {
        // time left and something left to be done
        targetTPM = remainingTransactions / remainingTimeInMinutes;
        currentSpeed = transactionsPerMinute / targetTPM;
    } else if (remainingTransactions > 0) {
        // no time left but transactions left? (SLAViolation is the author's own exception type)
        throw new SLAViolation(remainingTransactions);
    } else {
        // all done: tell our algorithm that we are too fast
        // if we don't have anything left to do
        currentSpeed = 2;
    }

    PutMetricDataRequest putMetricDataRequest = new PutMetricDataRequest();
    MetricDatum o = new MetricDatum();
    o.setDimensions(Collections.singleton(new Dimension()
            .withName("JobName")
            .withValue(jobName)));
    o.setMetricName("CurrentSpeed");
    o.setUnit("Percent");
    o.setValue(currentSpeed);
    putMetricDataRequest.setMetricData(Collections.singleton(o));
    putMetricDataRequest.setNamespace("dynaTrace/APM");
    // cloudWatch is an AmazonCloudWatch client initialized elsewhere
    cloudWatch.putMetricData(putMetricDataRequest);
}

After that I started my batch job with a single instance and started measuring the throughput. When putting the CurrentSpeed metric into a chart it looked something like this:

The speed would start at 200% and go down according to the target time after the start

It started at 200%, which my Java code reports if the remaining transactions are zero. Once I started the load, the calculated speed went down to indicate the real relative speed. It quickly dropped below 100%, indicating that it was not fast enough to meet the time window. The longer the run took, the less time it had to finish. This means that the throughput required to be done in time would grow; in other words, the relative speed was decreasing. So I went ahead and created three auto scaling actions and the respective alarms. The first doubled the number of instances if the current speed was below 50%. The second added 10% more instances as long as the current speed was below 105% (a little safety margin). Both actions had a proper threshold and cool-down periods attached to prevent unlimited sprawl.
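The scaling rules themselves boil down to a few thresholds on the relative speed. Expressed as a hypothetical plain-Java helper (in reality these thresholds live in CloudWatch alarms and Auto Scaling policies, not in application code; the scale-down rule is the one introduced a little further below):

// Hypothetical sketch of the threshold logic behind the scaling policies described here.
// currentSpeed is the relative speed pushed to CloudWatch above (1.0 means exactly on target).
public class ScalingDecision {

    // returns the suggested change in instance count for the current relative speed
    static int adjustment(double currentSpeed, int runningInstances) {
        if (currentSpeed < 0.50) {
            return runningInstances;                   // far too slow: double the group
        } else if (currentSpeed < 1.05) {
            return Math.max(1, runningInstances / 10); // slightly too slow: add roughly 10% more instances
        } else if (currentSpeed > 1.20) {
            return -1;                                 // comfortably ahead of schedule: remove one instance
        }
        return 0;                                      // within the safety margin: do nothing
    }
}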
The result was that the number of instances grew quickly until the throughput was a little more than required. I then added a third policy. This one would remove one instance as long as the relative current speed was above 120%.

The adjustments result in higher throughput which adjusts the relative speed

As the number of instances increased, so did my application's throughput, until it achieved the required speed. As it was faster than required, the batch would eventually be done ahead of time. That means that for every minute it kept being faster than needed, the required throughput kept shrinking, which is why you see the relative speed increasing in the chart although no more instances were added. Upon breaching the 120% barrier the last auto scaling policy removed an instance and the relative speed dropped. This led to a more optimal number of instances required to finish the job.

Conclusion

Elastic scaling is very powerful and especially useful if we couple it with goal-oriented policies. The example provided does of course need some fine tuning, but it shows why it makes sense to use application-specific metrics instead of indirectly related system metrics to meet an SLA target.

How Server-side Performance Affects Mobile User Experience
by Alois Reitbauer

Testing mobile web sites on the actual device is still a challenge. While tools like dynaTrace AJAX Edition make it very easy to get detailed performance data from desktop browsers, we do not have the same luxury for mobile. I was wondering whether desktop tooling can be used for analyzing and optimizing mobile sites. My idea was to start testing mobile web sites in desktop browsers. Many websites return mobile content even when requested by a desktop browser, and for all sites one has control over it is also possible to override browser checks. The basic rationale behind this approach is that if something is already slow in a desktop browser it will not be fast in a mobile browser. Typical problem patterns can also be more easily analyzed in a desktop environment than on a mobile device.

I chose the United website, inspired by Maximiliano Firtman's talk at Velocity. I loaded the regular and the mobile sites with Firefox and collected all performance data with dynaTrace. The first interesting fact was that the mobile site was much slower than the regular site.

mobile.united.com is slower than united.com

This is quite surprising as the mobile site has much less visual content, as you can see below. So why is the site that slow?

Desktop and mobile website of United

When we look at the timeline we see that the mobile site is only using one domain while the regular site is using an additional content domain. So serving everything from one domain has a serious impact on performance.

Timeline comparison of United sites

I checked the latest results from BrowserScope to see how many connections mobile browsers can handle. They use up to 35 connections, which is quite a lot. The United mobile site does not leverage this fact.

Connections per domain and total for mobile browsers

Looking at the content reveals two optimization points. First, a lot of the content is images, which could be sprited.
This would then block only one connection and also speed up download times. The second point is that the CSS in use is huge: a 70k CSS file for a 12k HTML page is quite impressive.

Very large CSS file on mobile.united.com

While these improvements will make the page faster, they are not the biggest concern. Looking at the requests we can see that there are several network requests which take longer than 5 seconds. One of them is the CSS file which is required to lay out the page. This means that the user does not see a nicely laid-out page in less than 5 seconds (not even taking network transfer time into consideration). So in this case the server used for the mobile website is the real problem.

Request with very high server times

Conclusion

This example shows that basic analysis of mobile web site performance can also be done on the desktop. Especially performance issues caused by slow server-side response times or non-optimized resource delivery can be found easily. The United example also shows how important effective server-side performance optimization is in a mobile environment. When we have to deal with higher latency and smaller bandwidth, we have to optimize server-side delivery to get more legroom for dealing with slower networks.

Content delivery chain of web applications

Looking at the content delivery chain, which starts at the end user and goes all the way back to the server side, it becomes clear that any time we lose on the server cannot be compensated for by upstream optimization.

Step by Step Guide: Comparing Page Load Time of US Open across Browsers
by Andreas Grabner

The US Open is one of the major world sport events these days. Those tennis enthusiasts that can't make it to the Centre Court in Flushing Meadows are either watching the games on television or following the scores on the official US Open web site. The question is: how long does it take to get the current standings? And is my computer running Firefox (FF) faster than my friend's Internet Explorer (IE)?

Comparing US Open 2011 Page Load Time

I made this test easy. I recorded page load activity in both Firefox 6 and Internet Explorer 8 when loading http://www.usopen.org. I am using the dynaTrace AJAX Edition Premium Version to analyze and compare the activity side-by-side. The following screenshot shows the high-level Key Performance Indicators (KPIs) for both browsers. The interesting observations for me are:

27 more roundtrips in IE (column Request Count), resulting in 700k more downloaded data
Slower JavaScript (JS) execution time in IE
Slower rendering in Firefox

High-level comparison between Internet Explorer and Firefox

Comparing Page Activity in the Timeline

The next step is to compare the page load in the browser Timeline. In the following screenshot you see the page activity for Internet Explorer (top) and Firefox (bottom). I highlighted what are to me three significant differences:

1. The loading behavior for the Google, Twitter and Facebook JavaScript files is much faster in Firefox
2. In Internet Explorer we have 6 XHR calls whereas in Firefox we only see 5
3. Long running onLoad event handler in Internet Explorer

Easy to spot differences in the timeline comparison between Internet Explorer and Firefox

What is the Difference in Network Roundtrips?

The next step is to compare the network requests. We have already learned from the high-level KPIs that there is a significant difference in network roundtrips, e.g. Internet Explorer makes 27 more resource requests than Firefox. The following screenshot shows the browser network dashlet in comparison mode. It compares the network requests from IE and FF and uses color coding to highlight the differences. Grey means that this request was made by IE but not by FF. Red means that IE had more requests to a specific resource than FF. If we look at this table we can observe the following interesting facts:

Internet Explorer tries to download flashcanvas.swf twice, whereas this component is not loaded in Firefox at all (top row)
It requests certain JS, CSS and one JPG file twice, making it one request more than in Firefox (rows in red)
It requests certain files that are not requested by Firefox at all (rows in grey)

A network comparison shows that Internet Explorer is requesting additional resources, and some resources more than once

What is the Extra AJAX Request?

The same browser network comparison view allows us to focus on the AJAX requests. Looking at the data side-by-side shows us that the Flash object (the one that was requested twice) is requested once using AJAX/XHR. As this Flash component is only requested in IE, we see the extra XHR request.

Comparison of AJAX requests between Firefox and Internet Explorer. It's easy to spot the extra request that downloads the Flash component

With a simple drill down we also see where this AJAX/XHR request comes from. The following screenshot displays the browser PurePath with the full JavaScript trace, including the XHR request for the Flash component.

The AJAX request for the Flash component is triggered when flashcanvas.js is loaded

Why is the onLoad Event Handler Taking Longer in IE?

The Performance Report shows us the JavaScript hotspots in Internet Explorer. From here we see, for instance, that the $ method (the jQuery lookup method) takes a significant amount of time. The report also shows us where this method gets called. When we drill from there to the actual browser PurePaths we see which calls to the $ method were actually taking a long time:

JavaScript hotspot analysis brings us to problematic $ method calls that are slower on Internet Explorer

Want to Analyze Your Own Web Site?

This was just a quick example of how to analyze and compare page load performance across browsers. For more recommendations on how to actually optimize page load time I recommend checking out our other blogs on Ajax/JavaScript.

To analyze individual pages of medium-complex Web 2.0 applications, download the free dynaTrace AJAX Edition.
For advanced scenarios such as the following, take a look at the dynaTrace AJAX Edition Premium Version:

JavaScript-heavy Web 2.0 applications: the Premium Version is unlimited in the JavaScript activities it can process; the free AJAX Edition only works for medium-complex Web 2.0 applications
Comparing performance and load behavior across browsers: the Premium Version automatically compares different sessions; the free AJAX Edition only allows manual comparison
Identifying regressions across different versions of your web site: the Premium Version automatically identifies regressions across tested versions
Automating performance analysis: the Premium Version automatically identifies performance problems that can be reported through REST or HTML reports

I also encourage everybody to participate in the discussions on our Community Forum.

How Case-Sensitivity for ID and ClassName can Kill Your Page Load Time by Andreas Grabner

Week 28

Many times we have posted the recommendation to speed up your DOM element lookups by using unique IDs or at least a tag name. Instead of using $(".wishlist") you should use $("div.wishlist"), which speeds up lookups in older browsers; if you want to look up a single element, give it a unique ID and change your call to $("#mywishlist"). This speeds up the lookup in older browsers from 100-200ms to about 5-10ms (times vary depending on the number of DOM elements on your page). More on this in our blogs 101 on jQuery Selector Performance or 101 on Prototype CSS Selectors.

Case-Sensitive ID Handling Results in an Interesting Performance Impact

With the recommendation from above in mind, I was surprised to see the following $("#addtowishlist") call with a huge execution time difference between Internet Explorer (IE) 7, 8 and Firefox (FF) 6:

The same $("#addtowishlist") call with huge performance differences across browsers doesn't only reveal performance problems

So Why is This Call Taking That Long?

It turns out that the ID attribute of the element in question (addtowishlist) is actually defined as addToWishList. As class and ID are case-sensitive (read this article on the Mozilla Developer Network), the call $("#addtowishlist") should in fact return no element. This leads us to an actual functional problem on this page: the element exists but is not identified, because the developer used a different name in the $ method than the one defined in the HTML. The performance difference is explained by a quirk of older Internet Explorer versions and by the way jQuery implements its $ method.

jQuery 1.4.2 is the version used on the page we analyzed. The following screenshot shows what happens in Internet Explorer 7:

jQuery iterates through all DOM elements in case the element returned by getElementById doesn't match the query string

The screenshot shows the dynaTrace browser PurePath for IE 7. In fact, getElementById returns the DIV tag even though it shouldn't, based on the HTML standard specification. jQuery adds an additional check on the returned element. Because the DOM element's ID addToWishList is not case-equal to addtowishlist, jQuery calls its internal find method as a fallback. The find method iterates through ALL DOM elements (1944 in this case) and does a string comparison on the ID attribute.
In the end, jQuery doesn't return any element, because none match the lower-case ID. This additional check through 1944 elements takes more than 50ms in Internet Explorer 7. Why the difference between IE 7 and the other browsers? IE 8 and FF 6 execute so much faster because there getElementById doesn't return an object, and jQuery therefore doesn't perform the additional check.

Lessons Learned: We Have a Functional and a Performance Problem

There are two conclusions to this analysis:

We have a functional problem, because IDs are used with different casing in the HTML and in the JavaScript/CSS, and therefore certain event handlers are not registered correctly.
We have a performance problem, because IE 7 incorrectly returns an element, leading to a very expensive jQuery check.

So watch out and check how you write your IDs and class names. Use tools to verify that your lookups return the expected objects, and make sure you always use a lookup mechanism that performs well across browsers.

Automatic Error Detection in Production: Contact Your Users Before They Contact You by Andreas Grabner

Week 29

In my role I am responsible for our Community and our Community Portal. For our Community Portal to be accepted by our users I need to ensure that they find the content they are interested in. In a recent upgrade we added lots of new multi-media content that makes it easier for our community members to get educated on Best Practices, First Steps, and much more.

Error in Production: 3rd Party Plugin Prevents Users from Accessing Content

Here is what happened today when I figured out that some of our users actually had a problem accessing some of the new content. I was able to directly contact these individual users before they reported the issue. We identified the root cause of the problem and are currently working on a permanent fix preventing these problems for other users. Let me walk you through my steps.

Step 1: Verify and Ensure Functional Health

One dashboard I look at to check whether there are any errors on our Community Portal is the Functional Health dashboard. dynaTrace comes with several out-of-the-box error detection rules. These are rules that, for example, check whether there are any HTTP 500s, exceptions thrown between application tiers (e.g. from our authentication web service back to our frontend system), severe log messages, or exceptions when accessing the database.

The following screenshot shows the Functional Health dashboard. As we monitor more than just our Community Portal with dynaTrace, I filter to this application. I see that we had 14 failed transactions in the last hour. It seems we also had several unhandled exceptions and several HTTP 400s between transaction tiers:

dynaTrace automates error detection by analyzing every transaction against error rules. In my case I had 14 failed transactions in the last hour on our Community Portal

My first step tells me that we have users that experience a problem.

Step 2: Analyze Errors

A click on the error on the bottom right brings me to the error details, allowing me to analyze what these errors are. The following screenshot shows the Error dashboard with an overview of all detected errors based on the configured error rules.
A click on one error rule shows me the actual errors at the bottom. It seems we have a problem with some of the new PowerPoint slides we made available on our Community Portal:

The 14 errors are related to the PowerPoint slide integration we recently added to our Community Portal, as well as some internal Confluence problems

Now I know what these errors are. The next step is to identify the impacted users.

Step 3: Identify Impacted Users

A drill into our Business Transactions tells me which users were impacted by this problem. It turns out that we had 5 internal users (those with the short usernames) and 2 actual customers having problems.

Knowing which users are impacted by this problem allows me to proactively contact them before they contact me

What is also interesting for me is to understand what these users were doing on our Community Portal. dynaTrace gives me information about every visit, including all page actions with detailed performance and context information. The following shows the activities of one of the users that experienced the problem. I can see how they got to the problematic page and whether they continued browsing for other material or whether they stopped because of this frustrating experience:

Analyzing the visit shows me where the error happened. Fortunately the user continued browsing to other material

I now know exactly which users were impacted by the errors. I also know that, even though they had a frustrating experience, these users still continued browsing other content. Just to be safe I contacted them, letting them know we are working on the problem, and also sent them the content they couldn't retrieve through the portal.

Step 4: Identify Root Cause and Fix the Problem

My last step is to identify the actual root cause of these errors, because I want them fixed as soon as possible to prevent more users from being impacted. A drill into our PurePaths shows me that the error is caused by a NullPointerException thrown by the Confluence plugin we use to display PowerPoint presentations embedded in a page.

Having a PurePath available for every single request (failed or not) makes it easy to identify problems. In this case we have a NullPointerException being thrown all the way to the web server, leading to an HTTP 500

dynaTrace also captures the actual exception including the stack trace, giving me just the information I was looking for.

The Exception Details window reveals more information about the actual problem

Conclusion

Automatic error detection helped me to proactively work on problems and contact my users before they reported them. In this particular case we identified a problem with the viewfile Confluence plugin. In case you use it, make sure you do not have path-based animations in your slides; this seems to be the root cause of the NullPointerException.

For our dynaTrace users: if you are interested in more details on how to use dynaTrace, best practices or self-guided walkthroughs, check out our updated dynaLearn section on our Community Portal. For those that want more information on how to become more proactive in your application performance management, check out What's New in dynaTrace 4.

Why You Really Do Performance Management in Production by Michael Kopp

Week 30

Often performance management is still confused with performance troubleshooting.
Others think that performance management in production is simply about system and Java Virtual Machine (JVM) level monitoring and that they are therefore already doing application performance management (APM). The first perception assumes that APM is about speeding up some arbitrary method; the second assumes that performance management is just about discovering that something is slow. Neither of these is what we at dynaTrace would consider a prime driver for APM in production. So what does it mean to have APM in production, and why do you do it?

The reason our customers need APM in their production systems is to understand the impact that end-to-end performance has on their end users and therefore their business. They use this information to optimize and fix their application in a way that has a direct and measurable return on investment (ROI). This might sound easy, but in environments that include literally thousands of JVMs and millions of transactions per hour, nothing is easy unless you have the right approach!

True APM in production answers questions and solves problems such as the following:

How does performance affect the end users' buying behavior or the revenue of my tenants?
How is the performance of my search for a specific category?
Which of my 100 JVMs, 30 C++ business components and 3 databases participate in my booking transaction, and which of them is responsible for my problem?
Enable Operations, Business and R&D to look at the same production performance data from their respective vantage points
Enable R&D to analyze production-level data without requiring access to the production system

Gain End-to-end Visibility

The first thing you realize when looking at any serious web application (pick any of the big e-commerce sites) is that much of the end user response time is spent outside the data center. Doing performance management on the server side only leaves you blind to all problems caused by JavaScript, content delivery networks (CDNs), third party services or, in the case of mobile users, simply bandwidth.
Web delivery chain

As you are not even aware of these problems, you cannot fix them. Without knowing the effect that performance has on your users, you do not know how performance affects your business. And without knowing that, how do you decide whether your performance is OK?

This dashboard shows that there is a relationship between performance and conversion rate

The primary metric on the end user level is the conversion rate. What end-to-end APM tells you is how application performance, or non-performance, impacts that rate. In other words, you can put a dollar figure on response time and error rate! Thus the first reason why you do APM in production is to understand the impact that performance and errors have on your users' behavior.

Once you know the impact that some slow request has on your business, you want to zero in on the root cause, which can be anywhere in the web delivery chain. If the issue is on the browser side, the optimal thing to have is the exact click path of the users affected.

A visit's click path plus the PageAction PurePath of the first click

You can use this to figure out whether the issue is in a specific server-side request, related to third party requests, or in the JavaScript code. Once you have the click path, plus some additional context information, a developer can easily use something like dynaTrace AJAX Edition to analyze it.

If the issue is on the server side, we need to isolate the root cause there. Many environments today encompass several hundred JVMs, Common Language Runtimes (CLRs) and other components. They are big, distributed and heterogeneous. To isolate a root cause here you need to be able to extend the click path into the server itself.

From the click path to the server side

But before we look at that, we should look at the other main driver of performance management: the business itself.

Create Focus: It's the Business that Matters

One problem with older forms of performance management is the disconnect from the business. It simply has no meaning for the business whether average CPU on 100 servers is at 70% (or whatever else). It does not mean anything to say that JBoss xyz has a response time of 1 second on webpage abc. Is that good or bad? Why should I invest money to improve that? On top of this we don't have one server but thousands, with thousands of different webpages and services all calling each other, so where should we start? How do we even know whether we should do something at all?

The last question is actually crucial and is the second main reason why we do APM. We combine end user monitoring with business transaction management (BTM). We want to know the impact that performance has on our business, and as such we want to know whether the business performance of our services is influenced by performance problems of our applications. While end user monitoring enables you to put a general dollar figure on your end user performance, business transactions go one step further. Let's assume that the user can buy different products based on categories. If I have a performance issue I want to know how it affects my best selling categories and prioritize based on that. The different product categories trigger different services on the server side. This is important for performance management in itself, as I would otherwise look at too much data and could not focus on what matters.

The payment transaction has a different path depending on the context
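dynaTrace defines business transactions declaratively rather than in application code, but the underlying idea can be sketched in a few lines. The following is a toy stand-in, not the product mechanism: it splits response-time measurements by a business dimension (here a hypothetical product category taken per request) instead of lumping everything under one technical endpoint. All class and parameter names are invented.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration: aggregate response times per business context (e.g. product category)
// instead of per URL. Real BTM solutions do this declaratively and end-to-end.
public class BusinessTransactionStats {

    static class Stats {
        final AtomicLong count = new AtomicLong();
        final AtomicLong totalMillis = new AtomicLong();
        long averageMillis() { long c = count.get(); return c == 0 ? 0 : totalMillis.get() / c; }
    }

    private final Map<String, Stats> byCategory = new ConcurrentHashMap<String, Stats>();

    // Called once per completed request, e.g. from a servlet filter that knows the category.
    public void record(String category, long durationMillis) {
        Stats stats = byCategory.get(category);
        if (stats == null) {
            Stats fresh = new Stats();
            Stats existing = byCategory.putIfAbsent(category, fresh);
            stats = (existing != null) ? existing : fresh;
        }
        stats.count.incrementAndGet();
        stats.totalMillis.addAndGet(durationMillis);
    }

    public void print() {
        for (Map.Entry<String, Stats> e : byCategory.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().count
                    + " requests, avg " + e.getValue().averageMillis() + " ms");
        }
    }
}

Even this crude split already answers "how is my best selling category doing?" rather than "how is server xyz doing?", which is the shift in focus described above.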
Business transaction management does not just label a specific web request with a name like "booking"; it enables you to do performance management on a higher level. It is about knowing if and why the revenue of one tenant is affected by the response time of the booking transaction.

In this way business transactions create a twofold focus. They enable the business and management to set the right focus, a focus that is always based on company success, revenue and ROI. At the same time business transactions enable the developer to exclude 90% of the noise from the investigation and immediately zero in on the real root cause. This is due to the additional context that business transactions bring: if only bookings via credit card are affected, then diagnostics should focus on only these and not on all booking transactions. This brings me to the actual diagnosis of performance issues in production.

The Perfect Storm of Complexity

At dynaTrace we regularly see environments with several hundred or even over a thousand web servers, JVMs, CLRs and other components running as part of a single application environment. These environments are not homogeneous. They include native business components, integrations with, for example, Siebel or SAP, and of course the mainframe. These systems are here to stay, and their impact on the complexity of today's environments cannot be underestimated. Mastering this complexity is another reason for APM.

Today's systems serve huge user bases and in some cases need to process millions of transactions per hour. Ironically, most APM solutions and approaches simply break down in such an environment, but the value that the right APM approach brings here is vital. The way to master such an environment is to look at it from an application and transaction point of view.

Monitoring of service level agreements

Service level agreement (SLA) violations and errors need to be detected automatically, and the data to investigate them needs to be captured; otherwise we will never have the ability to fix them. The first step is to isolate the offending tier and find out whether the problem is due to a host, a database, a JVM, the mainframe, a third party service or the application itself.

Isolating the credit card tier as the root cause

Instead of seeing hundreds of servers and millions of data points, you can immediately isolate the one or two components that are responsible for your issue. Issues happening here cannot be reproduced in a test setup. This has nothing to do with lack of technical ability; we simply do not have the time to figure out which circumstances led to a problem. So we need to ensure that we have all the data we need for later analysis available all the time. This is another reason why we do APM: it gives us the ability to diagnose and understand real world issues. Once we have identified the offending tier, we know whom to talk to, and that brings me to my last point: collaboration.

Breaking the Language Barrier

Operations is looking at SLA violations and uptime of services, the business is looking at revenue statistics of products sold, and R&D is thinking in terms of response time, CPU cycles and garbage collection. It is a fact that these three teams speak completely different languages. APM is about presenting the same data in those different languages and thus breaking the barrier.
Another thing is that as a developer you never get access to the production environment, so you have a hard time analyzing the issues. Reproducing issues in a test setup is often not possible either. Even if you do have access, most issues cannot be analyzed in real time. In order to effectively share the performance data with R&D we first need to capture and persist it.

It is important to capture all transactions and not just a subset. Some think that you only need to capture slow transactions, but there are several problems with this. Either you need to define what "slow" means, or, if you have baselining, you will only get what is slower than before. The first is a lot of work and the second assumes that performance is fine right now; that is not good enough. In addition, such an approach ignores the fact that concurrency exists. Concurrently running transactions impact each other in numerous ways, and whoever diagnoses an issue at hand will need that additional context.
A typical Operations to Development conversation without APM

Once you have the data you need to share it with R&D, which most of the time means physically copying a scrubbed version of that data to the R&D team. While the scrubbed data must exclude things like credit card numbers, it must not lose its integrity. The developer needs to be able to look at exactly the same picture as operations. This enables better communication with Operations while at the same time enabling deep-dive diagnostics. Once a fix has been supplied, Operations needs to ensure that there are no negative side effects and will also want to verify that it has the desired positive effect. Modern APM solves this by automatically understanding the dynamic dependencies between applications and automatically monitoring new code for performance degradations. Thus APM in production improves communication, speeds up deployment cycles and at the same time adds another layer of quality assurance. This is the final, but by far not the least important, reason we do APM.

Conclusion

The reason we do APM in production is not to fix a CPU hot spot, speed up a specific algorithm or improve garbage collection. Neither the business nor operations care about that. We do APM to understand the impact that the application's performance has on our customers and thus our business. This enables us to invest precious development time where it has the most impact, furthering the success of the company. APM truly serves the business of a company and its customers by bringing focus to the performance management discipline. My recommendation: if you do APM in production, and you should, do it for the right reasons.

Cassandra Write Performance: a Quick Look Inside by Michael Kopp

Week 31

I was looking at Cassandra, one of the major NoSQL solutions, and I was immediately impressed with its write speed, even on my notebook. But I also noticed that its response time was very volatile, so I took a deeper look.

First Cassandra Write Test

I did the first write tests on my local machine, but I had a goal in mind. I wanted to see how fast I could insert 150K data points, each consisting of 3 values. In Cassandra terms this meant adding 150K rows to a single column family, adding three columns each time. Don't be confused by the term column here; it really means a key/value pair. At first I tried to load the 150K rows in one single mutator call. It worked just fine, but I had huge garbage collection (GC) suspensions. So I switched to sending buckets of 10K. That got good enough performance. Here is the resulting response time chart:

Cassandra client/server performance and volatility

The upper chart shows client and server response time respectively. This indicates that we leave a considerable amount of time either on the wire or in the client. The lower chart compares average and maximum response time on the Cassandra server, clearly showing high volatility.
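The post does not show the client code or name the Java client library that was used. Purely as a sketch of such a bucketed insert loop, assuming the Hector client (keyspace, column family, row keys and values below are made up), it could look roughly like this:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class BucketedInsertSketch {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);
        StringSerializer ser = StringSerializer.get();

        int total = 150000;
        int bucket = 10000;
        for (int start = 0; start < total; start += bucket) {
            // One mutator per bucket; execute() sends the whole bucket in a single batch call.
            Mutator<String> mutator = HFactory.createMutator(keyspace, ser);
            for (int i = start; i < start + bucket; i++) {
                String rowKey = "datapoint-" + i;
                // Three columns (key/value pairs) per row, as in the test described above.
                mutator.addInsertion(rowKey, "DataPoints", HFactory.createStringColumn("v1", "..."));
                mutator.addInsertion(rowKey, "DataPoints", HFactory.createStringColumn("v2", "..."));
                mutator.addInsertion(rowKey, "DataPoints", HFactory.createStringColumn("v3", "..."));
            }
            mutator.execute();
        }
    }
}

Each execute() call should map to one batch_mutate request on the wire, which is the unit of work visible in the charts and transaction flow below.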
So I let dynaTrace do its magic and looked at the transaction flow to check the difference between client and server response time.

Getting a Look at the Insides of Cassandra

batch_mutate transactions from client to Cassandra server

This is what I got 5 minutes after I first deployed the dynaTrace agent. It shows that we do indeed leave a large portion of the time on the wire, either due to the network or waiting for Cassandra. But the majority is still spent on the server. A quick check of the response time hotspots reveals even more:

This shows that most of the time spent on the server is CPU and I/O

The hotspots show that most of the time on the Cassandra server is spent in CPU and I/O, as it should be, but a considerable portion is also attributed to GC suspension. Please note that this is not time spent in garbage collection, but the time that my transactions were actively suspended by the garbage collector (read about the difference here)! What is also interesting is that a not insignificant portion is spent inside Thrift, the communication protocol of Cassandra, which confirmed the communication as part of the issue. Another interesting point is that the majority of the transactions are in the 75ms range (as can be seen in the upper right corner), but a lot of transactions are slower and some go all the way up to 1.5 seconds.

Hotspots of the slowest 5% of the batch_mutate calls

I looked at the slowest 5% and could see that GC suspension plays a much bigger role there, and that the time spent waiting on I/O is also greatly increased. So the next thing I checked was garbage collection, always one of my favorites.

The charts show that nearly all GC suspensions are due to minor collections

What we see here is a phenomenon that I have blogged about before. The GC suspensions are mostly due to so-called minor collections. Major collections do happen, but are responsible for only two of the suspensions. If I had only monitored major GCs, I would not have seen the impact on my performance. What it means is that Cassandra is allocating a lot of objects and my memory setup couldn't keep up with it, which is not very surprising with 150K data points every 10 seconds.

Finally I took a look at the single transactions themselves:

Single batch_mutate business transactions, each inserting 10K rows

What we see here is that the PurePath follows the batch_mutate call from the client to the server. This allows us to see that it spends a lot of time between the two layers (the two synchronization nodes indicate start and end of the network call). More importantly, we see that we only spend about 30ms of CPU in the client-side batch_mutate function, and according to the elapsed time this all happened while sending. That means that either my network was clogged or the Cassandra server couldn't accept my requests quickly enough. We also see that the majority of the time on the server is spent waiting on the write. That did not surprise me, as my disk is not the fastest. A quick check on the network interface showed me that my test (10 x 150K rows) accumulated to 300MB of data; being quick at math, this told me that a single batch_mutate call sent roughly 2MB of data over the wire, so we can safely assume that the latency is due to the network. It also means that we need to monitor the network, and Cassandra's usage of it, closely.

Checking the Memory

I didn't find a comprehensive GC tuning guide for Cassandra and didn't want to invest a lot of time, so I took a quick peek to get an idea of the main drivers of the obvious high object churn:

The memory trend shows the main drivers

What I saw was pretty conclusive.
The Mutation creates a column family object and a column object for each single column value that I insert. More importantly, the column family holds a ConcurrentSkipListMap which keeps track of the modified columns. That produced nearly as many allocations as any other primitive, something I have rarely seen. So I immediately found the reasons for the high object churn.

Conclusion

NoSQL or Big Data solutions are very, very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O and, most importantly, how they are used! Although Cassandra is lightning fast and mostly I/O bound, it is still Java and you get the usual problems, e.g. GC needs to be watched. Cassandra provides a lot of monitoring metrics that I didn't explain here, but seeing the flow end-to-end really helps to understand whether the time is spent in the client, on the network or on the server, and makes the runtime dynamics of Cassandra much clearer. Understanding is really the key to effective usage of NoSQL solutions, as we shall see in my next blogs. New problem patterns emerge and they cannot be solved by simply adding an index here or there. It really requires you to understand the usage pattern from the application's point of view. The good news is that these new solutions allow us a really deep look into their inner workings, at least if you have the right tools at hand.

How Proper Redirects and Caching Saved Us 3.5 Seconds in Page Load Time by Andreas Grabner

Week 32

We like to blog about real life scenarios to demonstrate practical examples of how to manage application performance. In this blog I will tell you how we internally improved page load time for some of our community users by 3.5 seconds, simply by following the web performance guidelines we promote through our blog, community articles and the performance report in dynaTrace AJAX Edition and dynaTrace AJAX Premium Version.

Step 1: Identifying That We Have a Problem

We had users complaining about long page load times when entering http://community.dynatrace.com. Testing it locally showed acceptable page load times: not perfect, but not too bad either. In order to verify the complaints we looked at the actual end user response times captured by dynaTrace User Experience Management (UEM). Focusing on our Community home page we saw that we do in fact have a fair amount of users experiencing very high page load times. The following screenshot shows the response time histogram of our Community home page.

Response times ranging from 1 to 22 seconds for our Community home page

I was also interested in seeing whether any particular browsers experience this problem. The next screenshot therefore shows the response time grouped by browser:

Mobile browsers have a significant performance problem when loading our home page. My first thought is latency problems

Now we know that we really have a performance problem for a set of users. I assume that the really long load times are somehow related to network latency, as the main user group uses Mobile Safari. Let's continue and analyze page load behavior.
Step 2: Analyzing Page Load Behavior

Analyzing the page load behavior of http://community.dynatrace.com and looking at the network roundtrips shows one of our problems immediately. The following screenshot highlights that we have a chain of 6 redirects from entering http://community.dynatrace.com until the user ends up on the actual home page http://community.dynatrace.com/community/display/PUB/Community+Home. Depending on network latency this can take several seconds, and it would explain why mobile users experience very long page load times. The following screenshot shows that even on my machine, which is very close to our web servers, it takes 1.2 seconds until the browser can actually download the initial HTML document:

6 redirects take 1.2 seconds to resolve. On a mobile network this can be much higher depending on latency

We have often seen this particular problem (lots of redirects) on sites we have analyzed in the past. Now we ran into the same problem on our own Community portal. There are too many unnecessary redirects, leaving the browser blank and the user waiting for a very long time. The first fix therefore is to eliminate these redirects and send the user to the home page URL directly.

A second problem that became very obvious when looking at the dynaTrace browser performance report is browser caching. We have a lot of static content on our Community pages, so the caching ratio should be as high as possible. The Content tab on the report however shows us that 1.3MB is currently not cached. This is content that needs to be downloaded every time the user browses to our Community page, even though it hardly ever changes:

dynaTrace tells me how well my web page utilizes client-side caching. 1.3MB is currently not cached on a page that mainly contains static content

The report not only shows us the percentage and size of content that is cached vs. not cached; it also shows the actual resources. It seems that our system is configured to force the browser not to cache certain images at all, by setting an expiration date in the past:

Lots of static images have an expiration header set to January 1st 1970, forcing the browser not to cache these resources

Step 3: Fix and Verify Improvements

Fixing the redirects and caching was easy. Instead of 6 redirects we only have 1 (as we have to switch from http to https we can't just do a server-side URL rewrite). Optimized caching will have a positive impact for returning users, as they have to download fewer resources and with that also save on roundtrips.

The following image shows the comparison of the page load time before and after the improvements:

Improved page load time by 3.5 seconds due to better caching and optimized redirect chains

The improvement of 3.5 seconds is significant, but we still have many other opportunities to make our Community portal faster. One of the next areas we want to focus on is server-side caching, as we see a lot of time spent on our application servers for content pages that are created on the fly by evaluating Wiki-style markup. More on that in a later blog post.
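The blog does not show how the caching fix was implemented on our servers. Purely as an illustration of the idea, a servlet filter along these lines would replace the 1970 expiration date with a far-future Expires header and a Cache-Control max-age for static resources; the URL patterns and the one-week lifetime are invented values, not our actual configuration:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: mark static resources as cacheable for one week
// instead of forcing re-downloads with an Expires date in the past.
public class StaticCachingFilter implements Filter {

    private static final long ONE_WEEK_SECONDS = 7L * 24 * 60 * 60;

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        String uri = request.getRequestURI();
        if (uri.endsWith(".png") || uri.endsWith(".gif") || uri.endsWith(".css") || uri.endsWith(".js")) {
            response.setHeader("Cache-Control", "public, max-age=" + ONE_WEEK_SECONDS);
            response.setDateHeader("Expires", System.currentTimeMillis() + ONE_WEEK_SECONDS * 1000);
        }
        chain.doFilter(req, res);
    }

    public void init(FilterConfig config) { }

    public void destroy() { }
}

Whether such headers are set in a filter, in the web server configuration or by the CMS is a deployment detail; the important part is that rarely changing resources carry a far-future expiration so returning visitors skip those roundtrips entirely.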
Conclusion and More

You have to analyze performance from the end-user perspective. Just analyzing page load time in your local network is not enough, as it doesn't give you the experience your actual end users perceive. Following the common web performance best practices improves performance, and this should not only happen once your users start complaining; it should be something you constantly check during development.

There is more for you to read:

Check out our Best Practices on WPO and our other blogs on web performance
For more information on automating web performance optimization check out dynaTrace AJAX Premium Version
For information on User Experience Management check out dynaTrace UEM

To Load Test or Not to Load Test: That is Not the Question by Andreas Grabner

Week 33

There is no doubt that performance is important for your business. If you don't agree, check out what we and others think about performance's impact on business, or remember headlines like these:

Target.com was down after promoting new labels: article on MSN
Twitter was down and people were complaining about it on Facebook: Huffington Post article
People stranded in airports because United had a software issue: NY Times article

The question therefore is not whether performance is important or not. The question is how to verify and ensure that your application performance is good enough.

Use Your End-Users as Test Dummies?

In times of tight project schedules and very frequent releases, some companies tend to release new software versions without going through a proper test cycle. Only a few companies can actually afford this, because they have their users' loyalty regardless of functional or performance regressions (and again, only a few companies have that luxury). If the rest of us were to release projects without proper load testing, we would end up as another headline in the news. Releasing a new version without proper load testing is therefore not the correct answer.

Don't Let Them Tell You that Load Testing is Hard

When asking people why they are not performing any load tests, you usually hear things along the following lines:

We don't know how to test realistic user load, as we know neither the use cases nor the expected load in production
We don't have the tools, expertise or hardware resources to run large scale load tests
It is too much effort to create and especially to maintain testing scripts
Commercial tools are expensive and sit too long on the shelf between test cycles
We don't get actionable results for our developers

If you are the business owner or a member of a performance team, you should not accept answers like these. Let me share my opinion so that you can counter some of these arguments in your quest for better application performance.

Answer to: What Are Realistic User Load and Use Cases?

Indeed it is not easy to know what realistic user load and use cases are if you are about to launch a new website or service. In this case you need to do enough research on how your new service will be used once launched. Factor in how much money you spend on promotions and what conversion rate you expect. This will allow you to estimate peak loads.

Learn from Your Real Users

It's going to be easier when you launch an update to an existing site. I am sure you use something like Google Analytics, Omniture, or dynaTrace UEM to monitor your end users. If so, you have a good understanding of current transaction volume.
Factor in the new features and how many new users you want to attract. Also factor in any promotions you are about to run. Talk with your marketing folks: they are going to spend a lot of money and you don't want your system to go down and all that money to be wasted. Also analyze your web server logs, as they can give you even more valuable information regarding request volume. Combining all this data allows you to answer the following questions:

What are the main landing pages I need to test?
What is the peak load, and what is the current and expected page load time?
What are the typical click paths through the application? Do we have common click scenarios that we can model into a user type?
Where are my users located on the world map, and what browsers do they use? What are the main browser/location combinations we need to test?

The following screenshots give some examples of how we can extract data from services such as Google Analytics or dynaTrace UEM to better understand how to create realistic tests:

What are the top landing pages, the load behavior and page load performance? Testing these pages is essential as it impacts whether a user stays or leaves the web site

Browser and bandwidth information allow us to do more realistic tests as these factors impact page load time significantly

Analyzing click sequences of real users allows us to model load test scripts that reflect real user behavior

CDN, Proxies, Latency: There is More than Meets the Eye

What we also learn from our real users is that not every request makes it to our application environment. Between the end user and the application there are different components that participate and impact load times: connection speed, browser characteristics, latency, content delivery networks (CDNs) and geolocation. A user in the United States on broadband will experience a different page load time than a user on a mobile device in Europe, even though both are accessing an application hosted in the US. To execute tests that take this into consideration, you would actually need to generate your load from different locations in the world using different connection speeds and different devices. Some cloud-based testing services offer this type of testing by executing load from different data centers or even real browsers located around the globe. One example is Gomez Last Mile Testing.

Answer to: We Don't Have the Tools or the Expertise

This is a fair point. Load testing is usually not done on a day-to-day basis, so it is hard to justify the costs for commercial tools, for hardware resources to simulate the load, or for people that need constant training on tools they hardly use. All these challenges are addressed by a new type of load testing: load testing done from the cloud, offered as a service. The benefits of cloud-based load testing are:

Cost control: you only pay for the actual load tests, not for the time the software sits on the shelf
Script generation and maintenance is included in the service and is done by people that do this all the time
You do not need any hardware resources to generate the load, as it is generated by the service provider

Answer to: It's Too Much Effort to Create and Maintain Scripts

Another very valid claim, but typically caused by two facts:

A. Free vs. commercial tools: too often free load testing tools are used that offer easy record/replay but do not offer a good scripting language that makes it easy to customize or maintain scripts. Commercial tools put a lot of effort into solving exactly this problem.
They are more expensive but make it easier, saving time.

B. Tools vs. service: load testing services from the cloud usually include script generation and script maintenance done by professionals. This removes the burden from your R&D organization.

Answer to: Commercial Tools are Too Expensive

A valid argument if you don't use your load testing tool enough, as then the cost per virtual user hour goes up. An alternative, as you can probably guess by now, are cloud-based load testing services that only charge for the actual virtual users and time executed. Here we often talk about the cost of a Virtual User Hour. If you know how often you need to run load tests and how much load you need to execute over which period of time, it is very easy to calculate the actual cost.

Answer to: No Actionable Data after a Load Test

Just running a load test and presenting the standard load testing report to your developers will probably do no good. It's good to know under how much load your application breaks, but a developer needs more information than "We can't handle more than 100 virtual users." With only this information the developers need to go back to their code, add log output for later diagnostics and ask the testers to run the test again, because they need more actionable data. This usually leads to multiple testing cycles, jeopardizes project schedules and leads to frustrated developers and testers.

Too many test iterations consume valuable resources and impact your project schedules

To solve this problem, load testing should always be combined with an application performance management (APM) solution that provides rich, actionable in-depth data for developers to identify and fix problems without going through extra cycles, so that you stay within your project schedules.

Capturing enough in-depth data eliminates extra test cycles, saves time and money

The following screenshots show some examples of the data that can be captured to make it very easy for developers to go right to fixing the problems. The first one shows a load testing dashboard including load characteristics, memory consumption, database activity and a performance breakdown into application layers/components:

The dashboard tells us right away whether we have hotspots in memory, database, exceptions or in one of our application layers

In distributed applications it is important to understand which tiers contribute to response time and where potential performance and functional hotspots are:

Analyzing transaction flow makes it easy to pinpoint problematic hosts or services

Methods executed contribute to errors and bad response time. To speed up response time hotspot analysis we can first look at the top contributors

In-depth transactional information makes it easy to identify code-level problems before analyzing individual transactions that have a problem

As every single transaction is captured, it is possible to analyze transaction executions including HTTP parameters, session attributes, method arguments, exceptions, log messages or SQL statements, making it easy to pinpoint problems.

Are We on the Same Page that Load Testing is Important?

By now you should have enough arguments to push load testing in your development organization and ensure that there won't be any business impact from new releases. I've talked about cloud-based load testing services multiple times, as they come with all the benefits I explained.
I also know that it is not the answer for every environment, as it requires your application to be accessible from the web. Opening or tunneling ports through firewalls, or running load tests against the actual production environment during off-hours, are options you have to enable your application for cloud-based load testing.

One Answer to these Questions: Compuware Gomez 360 Web Load Testing and dynaTrace

The new combined Gomez and dynaTrace web load testing solution provides an answer to all the questions above and more. Without going into too much detail I want to list some of the benefits:

Realistic load generation using Gomez First Mile to Last Mile web testing
In-depth root-cause analysis with dynaTrace Test Center Edition
Load testing as a service that reduces in-house resource requirements
Keep your costs under control with per Virtual User Hour billing
Works throughout the application lifecycle, from production to test to development

Running a Gomez load test allows you to execute load from both backbone testing nodes and real user browsers located around the world. Especially the last mile is an interesting option, as this is the closest you can get to your real end users. The following screenshot shows the response time overview during a load test from the different regions of the world, allowing you to see how the performance of your application is perceived in the locations of your real end users:

The world map gives a great overview of how pages perform from the different test nodes

From here it is an easy drill down in dynaTrace to analyze how increasing load affects performance as well as the functional health of the tested application:

There is a seamless drill option from the Gomez load testing results to the dynaTrace dashboards

Find out more about Gomez 360 Load Testing for web, mobile and cloud applications.

Why SLAs on Request Errors Do Not Work and What You Should Do Instead by Klaus Enzenhofer

Week 34

We often see request error rates used as an indicator of service level agreement (SLA) compliance. Reality, however, shows that this paints a wrong picture. Let's start with an example. We had a meeting with a customer and were talking about their SLA and what it is based on. Like in many other cases the request error rate was used, and the actual SLA they agreed on was 0.5%. From the operations team we got the input that at the moment they have a request error rate of 0.1%, so they are far below the agreed value. The assumption drawn from the current rate is that every 1000th customer has a problem while using the website. That really sounds good, but is this assumption true, or do more customers have problems?

Most people assume that a page load equals a single request; however, if you start thinking about it you quickly realize that this is of course not the case. A typical page consists of multiple resource requests, so from now on we focus on all resource requests. Let's take a look at a typical eCommerce example. A customer searches for a certain product and wants to buy it in our store. Typically he will have to go through multiple pages. Each click leads to a page load, which executes multiple resource requests, or executes one or more Ajax requests.
In our example the visitor has to go through at least seven steps/pages, starting at the product detail page and ending up on the confirmation page.

Browser performance report from dynaTrace AJAX Edition Premium Version showing the resource requests per page of the buying process

The report shows the total request count per page. The shortest possible click path for a successful purchase leads to 317 resource requests. To achieve a good user experience we need to deliver these resources fast and without any errors. However, if we do the math for the reported error rate:

Customers with Errors = 317 requests * 0.1% = 31.7%

That means that roughly every third user will have at least one failing request, and it doesn't even violate our SLA! The problem is that our error rate is independent of the number of requests per visiting customer. Therefore the SLA does not reflect any real world impact.

Instead of a request failure rate we need to think about failed visits. The rate of failed visits has a direct impact on the conversion rate and thus on the business; as such it is a much better key performance indicator (KPI). If you ask your Operations team for this number, most will not be able to give you an exact answer. This is not a surprise, as it is not easy to correlate independent web requests into a visit.

Another thing that needs to be taken into account is the importance of a single resource request for the user experience. A user will be frustrated if the photo of the product he wants to buy is missing or, even worse, if the page does not load at all. He might not care if the background image does not load, and might even be happy if the ads do not pop up. This means we can define which missing resources are just errors and which constitute failed visits. Depending on the URI pattern we can distinguish between different resources and define a different severity for each rule. In our case we defined separate rules for CSS files, images used by CSS, product images, JavaScript resources and so forth.

Error rules for different resources within dynaTrace

This allows us to count errors and severe failures separately, on a per page action or per visit basis. In our case a page action is either a page load (including all resource requests) or a user interaction (including all resource and Ajax requests). A failed page action means the content displayed in the browser is incomplete or even unusable, and the user will not have a good experience. Therefore, instead of looking at failed requests it is much better to look at failed page actions.

The red portion of the bars represents failed page actions

When talking about user experience we are, however, not only interested in single pages but in whole visits. We can tag visits that have errors as non-satisfied, and visits that abandoned the page after an error as frustrated.

Visits by user experience

Such a failed visit rate draws a more accurate picture of reality, of the impact on the business, and in the end of whether we need to investigate further or not.

Conclusion

An SLA on the request failure rate is not enough. One might even say it is worthless if you really want to find out how good or bad the user experience is for your customers. It is more important to know the failure rate per visit, and you should think about defining an SLA on this value. In addition, we need to define which failed requests constitute a failed visit and are of high priority. This allows us to fix the problems with real impact and improve the user experience quickly.
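To make the arithmetic above easy to reproduce, here is a small, purely illustrative calculation that turns a per-request error rate and a requests-per-visit count into the share of visits expected to see at least one failed request. The 0.1% rate and the 317 requests are the values from the example; everything else (names, the independence assumption) is mine:

// Illustrative only: estimate how many visits see at least one failed request,
// given a per-request error rate and the number of requests in a typical visit.
public class FailedVisitEstimate {
    public static void main(String[] args) {
        double requestErrorRate = 0.001;   // 0.1% failed requests, the SLA figure from the example
        int requestsPerVisit = 317;        // shortest click path to a successful purchase

        // Simple upper-bound approximation used in the article
        double approx = requestsPerVisit * requestErrorRate;

        // Exact value, assuming request failures are independent of each other
        double exact = 1.0 - Math.pow(1.0 - requestErrorRate, requestsPerVisit);

        System.out.printf("approximation: %.1f%%, exact: %.1f%%%n", approx * 100, exact * 100);
        // Prints roughly: approximation: 31.7%, exact: 27.2%
    }
}

Either way the conclusion is the same: with a "compliant" 0.1% request error rate, somewhere between a quarter and a third of all buying visits are affected, which is why the failure rate per visit is the number worth putting an SLA on.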
NoSQL or RDBMS? Are We Asking the Right Questions? by Michael Kopp

Week 35

Most articles on the topic of NoSQL revolve around the theme of relational database management systems (RDBMS) vs. NoSQL. Database administrators (DBAs) defend the RDBMS by stating that NoSQL solutions are all dumb, immature data stores without any standards. Many NoSQL proponents react with the argument that an RDBMS does not scale and that today everybody needs to deal with huge amounts of data. I think NoSQL is sold short here. Yes, Big Data plays a large role, but it is not the primary driver in all NoSQL solutions. There are no standards because there really is no single NoSQL solution, but different types of solutions that cater for different use cases. In fact nearly all of them state that theirs is not a replacement for a traditional RDBMS! When we compare an RDBMS against them we need to do so on a use case basis. There are very good reasons for choosing an RDBMS as long as the amount of data is not prohibitive. There are, however, equally good reasons not to do so and to choose one of the following solution types:

Distributed key-value stores
Distributed column family stores
(Distributed) document databases
Graph databases

It has to be said, however, that there are very simple and specific reasons why traditional RDBMS solutions cannot scale beyond a handful of database nodes, and even that is painful. Before we look at why NoSQL solutions tend not to have that problem, we will take a look at why and when you should choose an RDBMS and when you shouldn't.

When and Why You (Should) Choose an RDBMS

While data durability is an important aspect of an RDBMS, it is not a differentiator compared to other solutions. So I will concentrate first and foremost on the unique features of an RDBMS that also have an impact on application design and performance:

Table based
Relations between distinct table entities and rows (the R in RDBMS)
Referential integrity
ACID transactions
Arbitrary queries and joins

If you really need all or most of these features, then an RDBMS is certainly right for you, although the amount of data you have might force you in another direction. But do you really need them? Let's look closer.

The table based nature of an RDBMS is not a real feature; it is just the way it stores data. While I can think of use cases that specifically benefit from this, most of them are simple in nature (think of Excel spreadsheets). That nature, however, requires a relational concept between rows and tables in order to make up complex entities.

Data model showing two different kinds of relationships

There are genuine relations between otherwise stand-alone entities (like one person being married to another) and relationships that really define hierarchical context or ownership of some sort (a room is always part of a house). The first one is a real feature; the second is a result of the storage nature. It can be argued that a document (e.g. an XML document) stores such a relation more naturally, because the house document contains the room instead of having the room as a separate document. Referential integrity is really one of the cornerstones of an RDBMS: it ensures the logical consistency of my domain model.
Not only does it ensure consistency within a certain logical entity (which might span multiple rows/tables) but, more importantly, cross-entity consistency. If you access the same data via different applications and need to enforce integrity in one central location this is the way to go. We could check this in the application as well, but the database often acts as the final authority on consistency.

The final aspect of consistency comes in the form of ACID transactions. They ensure that either all my changes are consistently seen by others in their entirety, or that none of my changes are committed at all. Consistency really is the hallmark of an RDBMS. However, we often set commit points for reasons other than consistency. How often did I use a bulk update for the simple reason of increased performance? In many cases I did not care about the visibility of those changes, but just wanted to have them done fast. In other cases we would deliberately commit more often in order to decrease locking and increase concurrency. The question is: do I care whether Peter shows up as married while Alice is still seen as unmarried? The government for sure does; Facebook on the other hand does not!

The final defining feature of an RDBMS is its ability to execute arbitrary queries: SQL selects such as the following.

    SELECT COUNT(e.isbn) AS number_of_books,
           p.name        AS publisher
    FROM editions AS e
    INNER JOIN publishers AS p ON (e.publisher_id = p.id)
    GROUP BY p.name;

Very often NoSQL is understood as not being able to execute queries. While this is not true, it is true that RDBMS solutions offer a far superior query language. Especially the ability to group and join data from unrelated entities into a new view on the data is something that makes an RDBMS a powerful tool. If your business is defined by the underlying structured data and you need the ability to ask different questions all the time, then this is a key reason to use an RDBMS. However, if you know in advance how you will access the data, or you need to change your application anyway in case you want to access it differently, then a lot of that advantage is overkill.

Why an RDBMS Might Not Be Right for You

These features come at the price of complexity in terms of data model, storage, data retrieval and administration; and, as we will see shortly, a built-in limit for horizontal scalability. If you do not need any or most of these features you should not use an RDBMS.

- If you just want to store your application entities in a persistent and consistent way then an RDBMS is overkill. A key-value store might be perfect for you. Note that the value can be a complex entity in itself!
- If you have hierarchical application objects and need some query capability into them then any of the NoSQL solutions might be a fit. With an RDBMS you can use object-relational mapping (ORM) to achieve the same, but at the cost of adding complexity to hide complexity.
- If you have ever tried to store large trees or networks you will know that an RDBMS is not the best solution here. Depending on your other needs a graph database might suit you.
- You are running in the cloud and need to run a distributed database for durability and availability. This is what Dynamo- and Bigtable-based data stores were built for. RDBMS solutions, on the other hand, do not do well here.
- You might already use a data warehouse for your analytics. This is not too dissimilar from a column family database.
- If your data grows too large to be processed on a single machine, you might look into Hadoop or any other solution that supports distributed map/reduce.

There are many scenarios where a fully ACID-driven, relational, table-based database is simply not the best or simplest option to go with. Now that we have got that out of the way, let's look at the big one: amount of data and scalability.

Why an RDBMS Does Not Scale and Many NoSQL Solutions Do

The real problem with an RDBMS is the horizontal distribution of load and data. The fact is that RDBMS solutions cannot easily achieve automatic data sharding. Data sharding would require distinct data entities that can be distributed and processed independently. An ACID-based relational database cannot do that due to its table-based nature. This is where NoSQL solutions differ greatly. They do not distribute a logical entity across multiple tables; it is always stored in one place. A logical entity can be anything from a simple value to a complex object or even a full JSON document. They do not enforce referential integrity between these logical entities. They only enforce consistency inside a single entity, and sometimes not even that.

NoSQL differs from RDBMS in the way entities get distributed and in that no consistency is enforced across those entities

This is what allows them to automatically distribute data across a large number of database nodes and also to write entities independently. If I were to write 20 entities to a database cluster with 3 nodes, chances are I can evenly spread the writes across all of them. The database does not need to synchronize between the nodes for that to happen and there is no need for a two-phase commit, with the visible effect that Client 1 might see changes on Node 1 before Client 2 has written all 20 entities. A distributed RDBMS solution, on the other hand, needs to enforce ACID consistency across all three nodes. That means that Client 1 will either not see any changes until all three nodes have acknowledged a two-phase commit, or will be blocked until that has happened. In addition to that synchronization, the RDBMS also needs to read data from other nodes in order to ensure referential integrity, all of which happens during the transaction and blocks Client 2. NoSQL solutions do no such thing for the most part.

The fact that such a solution can scale horizontally also means that it can leverage its distributed nature for high availability. This is very important in the cloud, where every single node might fail at any moment.

Another key factor is that these solutions do not allow joins and groups across entities, as that would not be possible in a scalable way if your data ranges in the millions of entities and is distributed across 10 nodes or more. I think this is something that a lot of us have trouble with. We have to start thinking about how we access data and store it accordingly, and not the other way around.

So it is true that NoSQL solutions lack some of the features that define an RDBMS solution. They do so for the sake of scalability; that does however not mean that they are dumb data stores. Document, column family and graph databases are far from unstructured and simple.

What about Application Performance?

The fact that all these solutions scale in principle does however not mean that they do so in practice, or that your application will perform better because of it! Indeed the overall performance depends to a very large degree on choosing the right implementation for your use case.
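To illustrate why independent entities distribute so easily, here is a minimal sketch of the idea described above; the node names are hypothetical and the naive modulo placement stands in for the consistent hashing that real stores use. Each entity key maps to exactly one node, so writes need no cross-node coordination and no two-phase commit:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: place each entity on a node purely by hashing its key.
    public class SimplePartitioner {

        private final List<String> nodes;

        public SimplePartitioner(List<String> nodes) {
            this.nodes = nodes;
        }

        // No other node has to be consulted to decide where an entity lives.
        public String nodeFor(String entityKey) {
            int bucket = Math.floorMod(entityKey.hashCode(), nodes.size());
            return nodes.get(bucket);
        }

        public static void main(String[] args) {
            SimplePartitioner partitioner =
                    new SimplePartitioner(List.of("node-1", "node-2", "node-3"));

            // Writing 20 entities spreads the load roughly evenly across the 3 nodes.
            Map<String, Integer> writesPerNode = new HashMap<>();
            for (int i = 0; i < 20; i++) {
                writesPerNode.merge(partitioner.nodeFor("entity-" + i), 1, Integer::sum);
            }
            System.out.println(writesPerNode);
        }
    }

How evenly the keys actually spread across the cluster is exactly the kind of data distribution question the next paragraphs point to as a performance factor.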
Key-value stores are very simple, but you can still use them wrong. Column family stores are very interesting and also very different from a table-based design. Because of this it is easy to come up with a bad data model design, and that will kill your performance. Besides the obvious factors of disk I/O, network and caching (which you must of course take into consideration), both application performance and scalability depend heavily on the data itself; more specifically on its distribution across the database cluster. This is something that you need to monitor in live systems and take into consideration during the design phase as well. I will talk more about this and about specific implementations in the coming months.

There is one other factor that will play a key role in the choice between NoSQL and more traditional databases. Companies are used to RDBMS solutions and they have experts and DBAs for them. NoSQL is new and not well understood yet. The administration is different. Performance tuning and analysis are different, as are the problem patterns that we see. More importantly, performance and setup are more than ever governed by the applications that use these stores and not by index tuning. Application performance management as a discipline is well equipped to deal with this. In fact, by looking at end-to-end application performance it can handle the different NoSQL solutions just like any other database. Actually, as we have seen in my last article, we can often do better!

Week 36: Is Synthetic Monitoring Really Going to Die?
by Alois Reitbauer

More and more people are talking about the end of synthetic monitoring. It is associated with high costs and missing insight into real user performance. This is supported by the currently evolving standards of the W3C Performance Working Group, which will help to get more accurate data from end users directly in the browser, with deeper insight. Will user experience management (UEM) using JavaScript agents eventually replace synthetic monitoring, or will there be a coexistence of both approaches in the end?

I think it is a good idea to compare these two approaches in a number of categories which I see as important from a performance management perspective. Having worked intensively with both approaches I will present my personal experience. Some judgments might be subjective, but this is what comments are for, after all!

Real User Perspective

One of the most important - if not the most important - requirements of real user monitoring is to experience performance exactly as real users do; in other words, how close the monitoring results are to what real application users see.

Synthetic monitoring collects measurements using pre-defined scripts executed from a number of locations. How close this is to what users see depends on the actual measurement approach. Only solutions that use real browsers, and don't just emulate them, provide reliable results. Some approaches only monitor from high-speed backbones like Amazon EC2 and only emulate different connection speeds, making measurements only an approximation of real user performance.
Solutions like Gomez Last Mile, in contrast, measure from real user machines spread out across the world, resulting in more precise results. Agent-based approaches like dynaTrace UEM measure directly in the user's browser, taking actual connection speed and browser behavior into account. Therefore they provide the most accurate metrics on actual user performance.

Transactional Coverage

Transactional coverage defines how many types of business transactions or how much application functionality are covered. The goal of monitoring is to cover 100 percent of all transactions. The minimum requirement is to cover at least all business-critical transactions.

For synthetic monitoring, coverage directly relates to the number of transactions which are modeled by scripts: the more scripts, the greater the coverage. This comes at the cost of additional development and maintenance effort. Agent-based approaches measure using JavaScript code which gets injected into every page automatically. This results in 100 percent transactional coverage. The only content that is not covered by this approach is streaming content, as agent-based monitoring relies on JavaScript being executed.

SLA Monitoring

SLA monitoring is central to ensuring service quality at the technical and business level. For SLA management to be effective, not only internal but also third party services like ads have to be monitored.

While agent-based approaches provide rich information on end-user performance, they are not well suited for SLA management. Agent-based measurements depend on the user's network speed, local machine, etc. This means a very volatile environment. SLA management, however, requires a well-defined and stable environment. Another issue with agent-based approaches is that third parties like content delivery networks (CDNs) or external content providers are very hard to monitor. Synthetic monitoring uses pre-defined scripts and provides a stable and predictable environment. The use of real browsers and the resulting deeper diagnostics capabilities enable more fine-grained diagnostics and monitoring, especially for third party content. Synthetic monitoring can also check SLAs for services which are currently not used by actual users.

Availability Monitoring

Availability monitoring is an aspect of SLA monitoring. We look at it separately as availability monitoring comes with some specific technical prerequisites which are very different between agent-based and synthetic monitoring approaches. For availability monitoring only synthetic script-based approaches can be used. They do not rely on JavaScript code being injected into the page but measure from points of presence instead. This enables them to measure a site even though it may be down, which is essential for availability monitoring. Agent-based approaches will not collect any monitoring data if a site is actually down. The only exception is an agent-based solution which also runs inside the web server or proxy, like dynaTrace UEM. Availability problems resulting from application server problems can then be detected based on HTTP response codes.

Understanding User-specific Problems

In some cases, especially in a SaaS environment, the actual application functionality heavily depends on user-specific data. In case of functional or performance problems, information on a specific user request is required to diagnose the problem. Synthetic monitoring is limited to the transactions covered by scripts.
In most cases these scripts are based on test users rather than real user accounts (you would not want a monitoring system to operate a real banking account). For an eCommerce site where a lot of functionality does not depend on the actual user, synthetic monitoring provides reasonable insight here. For many SaaS applications, however, this is not the case. Agent-based approaches are able to monitor every single user click, resulting in a better ability to diagnose user-specific problems. They also collect metrics for actual user requests instead of synthetic duplicates. This makes them the preferred solution for web sites where functionality heavily depends on the actual user.

Third Party Diagnostics

Monitoring of third party content poses a special challenge. As the resources are not served from our own infrastructure we only have limited monitoring capabilities. Synthetic monitoring using real browsers provides the best insight here. All the diagnostics capabilities available within browsers can be used to monitor third party content. In fact the possibilities for third party content and your own content are the same. Besides the actual content, networking or DNS problems can also be diagnosed. Agent-based approaches have to rely on the capabilities accessible via JavaScript in the browser. While the new W3C standards of the Web Performance Working Group will make this easier in the future, it is hard to do in older browsers. It requires a lot of tricks to find out whether third party content loads and performs well.

Proactive Problem Detection

Proactive problem detection aims to find problems before users do. This not only gives you the ability to react faster but also helps to minimize business impact. Synthetic monitoring tests functionality continuously in production. This ensures that problems are detected and reported immediately, irrespective of whether someone is using the site or not. Agent-based approaches only collect data when a user actually accesses your site. If, for example, you are experiencing a problem with a CDN from a certain location in the middle of the night when nobody uses your site, you will not see the problem before the first user accesses your site in the morning.

Maintenance Effort

Cost of ownership is always an important aspect of software operation. So the effort needed to adjust monitoring to changes in the application must be taken into consideration as well. As synthetic monitoring is script-based, it is likely that changes to the application will require changes to scripts. Depending on the scripting language and the script design the effort will vary, but in any case there is continuous manual effort required to keep scripts up to date. Agent-based monitoring, on the other hand, does not require any changes when the application changes. Automatic instrumentation of event handlers and so on ensures zero effort for new functionality. At the same time modern solutions automatically inject the HTML fragments required to collect performance data into the page content at runtime.

Suitability for Application Support

Besides operations and business monitoring, support is the third main consumer of end user data. In case a customer complains that a web application is not working properly, information on what this user was doing and why it is not working is required. Synthetic monitoring can help here in case of general functional or performance issues like a slow network from a certain location or broken functionality.
It is, however, not possible to get information on what a specific user was doing and to follow that user's click path. Agent-based solutions provide much better insight. As they collect data for real user interactions they have all the information required for understanding potential issues users are experiencing. Problems experienced by a single user can thus also be discovered.

Conclusion

Putting all these points together we can see that both synthetic monitoring and agent-based approaches have their strengths and weaknesses. One cannot simply choose one over the other. This is validated by the fact that many companies use a combination of both approaches, as do APM vendors who provide products in both spaces. The advantage of using both approaches is that modern agent-based monitoring compensates for the weaknesses of synthetic monitoring and vice versa, leading to an ideal solution.

Week 37: Business Transaction Management Explained
by Andreas Grabner

The terms business transaction and business transaction management (BTM) are widely used in the industry, but it is not always well understood what we really mean by them. The BTM Industry Portal provides some good articles on this topic and is definitely worth checking out. The general goal is to answer the business-centric questions that business owners ask of application owners: "How much revenue is generated by a certain product?", "What are my conversion and bounce rates and what impacts them?" or "Do we meet our SLAs for our premium account users?"

Challenge 1: Contextual Information is More Than Just the URL

In order to answer these questions we need information captured from the underlying technical transactions that get executed by your applications when users interact with your services/web site. Knowing the URL accessed and its average response time and then mapping it to a business transaction is the simplest form of business transaction management, but it doesn't work in most cases because modern applications don't pass the whole business transaction context in the URL. Business context information such as the username, product details or cash value usually comes from method arguments, from the user session on the application server or from service calls that are made along the processed transaction.

Challenge 2: Business Context is Somewhere Along a Distributed Transaction

Modern applications are no longer monolithic. The challenge with that is that transactions are distributed, they can take differing paths, and the data we need for our business context (username, product information, price, and so on) is often available on different tiers. This requires us to trace every single transaction across all tiers in order to collect this data for a single user transaction:

Every transaction is different: it involves different services, crosses multiple tiers and we need to capture business information along the full end-to-end transaction
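To make Challenge 2 a bit more concrete, here is a minimal sketch of where such context typically lives. The filter, the OrderService and the TransactionContext holder are hypothetical names used only for illustration; a real APM agent captures these values without any code changes:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    // Hypothetical per-transaction holder; a real solution propagates this across tiers.
    class TransactionContext {
        private static final ThreadLocal<Map<String, String>> CONTEXT =
                ThreadLocal.withInitial(HashMap::new);

        static void put(String key, String value) {
            CONTEXT.get().put(key, value);
        }
    }

    // On the web tier the URL alone ("/checkout") says little;
    // the username sits in the HTTP session.
    public class BusinessContextFilter implements Filter {
        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpSession session = ((HttpServletRequest) req).getSession(false);
            if (session != null && session.getAttribute("username") != null) {
                TransactionContext.put("username", session.getAttribute("username").toString());
            }
            chain.doFilter(req, res);
        }

        @Override
        public void init(FilterConfig filterConfig) { }

        @Override
        public void destroy() { }
    }

    // On a different tier the cash value only shows up as a method argument.
    class OrderService {
        public void placeOrder(String productId, double cashValue) {
            TransactionContext.put("product", productId);
            TransactionContext.put("cashValue", String.valueOf(cashValue));
            // ... actual order processing ...
        }
    }

The point is only that the username, the product and the cash value are captured on the tiers where they naturally occur and then attached to the same end-to-end transaction.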
Challenge 3: Understanding Your Users Means All Users, All Actions, All the Time

Knowing that a certain transaction of a user failed, including all contextual information, is great. In modern applications, however, users have many ways to reach a certain point in the application, e.g. checking out. So the questions are: how did the user get to this point? What were the actions prior to the problem? Is one of the previous actions responsible for the problem we see? In order to answer these questions we need access to all transactions of every single user. This allows us to a) understand how our users interact with the system, b) see how users reach our site (identify landing pages), c) see where users drop off the site (important for bounce rate and bounce page evaluation) and d) learn which paths tend to lead to errors. It also supports a critical use case in business transaction management, which is user complaint resolution: when user Maria calls the support line we can look up all her transactions, follow her steps and identify where things went wrong.

Knowing every action of every user allows us to better understand our users, follow their click paths and speed up individual problem resolution

Why Continue Reading?

In this blog I give you more examples of business transaction management and focus on the challenges I just explained:

- We need to analyze more than just URLs, as modern web applications have become more complex
- We live in a distributed transactional world where business context data can come from every tier involved
- We need to focus on every user and every action to understand our users

To make it easier to understand I also include examples from PGAC and other real-life BTM implementations.

An Example of Business Transactions

Let's assume we have a web site for a travel agency. The interesting business data for this type of application is:

- What destinations are people searching for? I want to make sure I offer trips to those regions and that people find what they are looking for
- How many trips are actually being sold? How much money do we make?
- How many people that search actually buy (convert)? Are there certain trips that have a better conversion rate?

From a technical perspective we can monitor every single technical transaction, which means we can look at every single web request that is processed by our application server. If we do that we basically look at a huge set of technical transactions. Some of these transactions handle search requests. Some of them handle the shopping cart or the check-out procedure. The first goal for us, however, is to identify the individual business transactions such as Search:

Among all technical transactions we want to identify those that perform a certain business transaction, e.g. search, checkout and log on

Now, not every search is the same. In our case the search keyword (trip destination) is interesting, so we want to split the search transactions by the destination used in the search criteria. That allows us to see how many search requests are executed by destination and also how long these search requests take to execute. Assuming we have to query external services to fetch the latest trip offerings, it could be very interesting to see how fast or slow searches for specific destinations are, or whether the number of search results impacts query time and therefore user experience.
Splitting the searches by destination allows us to see how many searches are actually executed and whether there are any performance implications for specific destinations

The next interesting aspect is of course the actual business we do or don't do. If our website serves different markets (US, EMEA, Asia, and others) it is interesting to see revenue by market. The next illustration shows that we can focus on the actual purchase business transactions. On these we evaluate the revenue generated, the market that generated the revenue and, by looking at response time as well, whether there is a performance impact on business:

Looking at those business transactions that handle the purchase allows us to see how much money we actually make, split by the markets we serve

Contextual Information for Every Transaction

In order for all of this to actually work we need more than just the URL and the response time. A lot of the time the context information we need is in the user session, in the HTTP POST body, or in a parameter of a method call or an SQL bind value. The next illustration shows what technical information we are talking about in order to map these values to business transactions:

Capturing context information for every single technical transaction is mandatory for business transaction management

As we live in a distributed transactional world we need the full end-to-end transaction and, on that transaction, the ability to capture this type of technical but business-relevant data.

Every Action of Every Visit to Identify Landing Pages, Conversion and Bounce Rates

Some of the most interesting metrics for business owners are conversion rates, bounce rates, user satisfaction and how well landing pages do. In order to get these metrics we need to identify every individual user (typically called visitors), all actions (individual technical transactions/requests) for that user and information on whether the user had a good or bad experience (we factor performance and functional problems into that). With that information we can:

- Identify the landing page of a visitor -> that's the first page that was requested by an individual visitor. The landing page tells you how people get to your website and whether promotions or ads/banners work
- Identify bounce rates and bounce pages -> a visitor bounces off your site if only one page was visited. We want to know which pages typically make people bounce off so we can optimize them. If you spend a lot of money on marketing campaigns but people bounce off these landing pages due to bad performance or a problem, that money is wasted
- Identify the click paths visitors take until they actually convert (buy a product), or where along this path they actually leave -> visitors usually need to click through a series of pages before they click on the Confirm button. If they do, these visitors convert. If they don't and leave your site somewhere along the way, we want to know where and why they are leaving in order to optimize these pages and therefore increase conversion rates
- Identify how satisfied our end users are by looking at response times and any functional problems along their execution path -> visitors that experience long page load times or run into any type of functional problem are more likely to be frustrated and leave the site before converting. Therefore we want to track this and identify which actions result in a bad user experience

Landing Pages Impact Whether Users Make it Across the First Site Impression

Knowing the first action of every visit lets us identify our landing pages. Visitors that don't make it further than the landing page bounce off our site right away. This is the first problem we want to address, as these people are never going to generate any business. The next screenshot shows how we want to analyze landing pages. It is interesting to see which landing pages we have, which are frequented more often, what the bounce rate of these landing pages is and whether there is any correlation to performance (e.g. a higher bounce rate on pages that take longer):

Landing page report shows bounce rates, access count and compares load time to the industry standard

Bounce and Conversion Rates, User Satisfaction and Activity are Important Metrics for Business Owners

Having every action of every visitor available, and knowing whether visitors only visit the landing page and bounce off or whether they also make it to the checkout page, allows us to calculate bounce and conversion rates. Looking at response times, request counts and errors also allows us to assess visitor satisfaction and usage. The following illustration shows all these metrics that a business owner is interested in:

Dashboard for business owner to monitor those metrics that impact business

Understanding Which Path Visitors Take Allows You to Improve Conversion and Lower Bounce Rates

The underlying data for all the metrics we just discussed are the individual actions of every visitor. Only with that information can we identify why certain visitors convert and others don't, which will help us to improve conversion rates and lower bounce rates. The following screenshot shows a detailed view of visits. Having all this data allows us to see which visitors actually bounce off the site after just looking at their landing page, which users actually convert and what path users take when interacting with our web site:

Having every visitor and all their actions allows us to manage and optimize business transactions

Speed Up User Complaint Resolution With All Actions Available

As already explained in the introduction of this blog, one important use case is speeding up the resolution of individual user complaints. If you have a call center and one of your users complains about a problem, it is best to see exactly what that particular user did leading up to the reported error. Having all user actions available, and knowing the actual user along with the captured transactions, allows us to look up only the transactions for that particular user and really see what happened:

When user Maria calls in we can look up all the actions from this user and see exactly what error occurred

Deep Technical Information for Fast Problem Resolution

Besides using technical context information as input for business transactions (e.g. username, search keywords, cash amount) we also need very deep technical information in scenarios where we need to troubleshoot problems. If visitors bounce off the page because of slow response times or because of an error, we want to identify and fix the problem right away. In order to do that our engineers need access to the technical data captured for those users that ran into a particular problem.
The following screenshot shows a user that experienced a problem on one of the pages, which is great information for the business owner as he can proactively contact this user. The engineer can now go deeper and access the underlying technical information captured, including transactional traces that show the problem encountered, such as a failed web service call, a long-running database statement or an unhandled exception:

From business impact to technical problem description: technical context information helps to solve problems faster

To Sum Up: What We Need For Business Transaction Management

There are several key elements we need to perform the type of business transaction management explained above:

- All visitors, all actions, all the time
  > The first action is the landing page
  > The last action is the bounce page
  > This helps us to understand the click path through the site, where people bounce off and which paths converting visitors take
  > Knowing the click paths allows us to improve conversion rates and lower bounce rates
  > Looking up the actions of complaining users speeds up problem resolution
- Technical context information on a distributed transaction
  > URLs alone are not enough, as the business transaction itself is not always reflected in the URL or URL parameters
  > We need to capture business context information from the HTTP session, method arguments, web service calls, SQL statements and other pertinent information sources. This information comes from multiple tiers that participate in a distributed transaction
  > Technical context information speeds up problem resolution

Out of the Box Business Transactions for Standard Applications

From an implementation perspective the question that always comes up is: do I need to configure all my business transactions manually? The answer is: not all, but for most applications we have seen it is necessary, as business context information is buried somewhere along the technical transaction and is not necessarily part of the URL. Identifying business transactions based on the URL or the web services that get called is of course a valid start and is something that most business transaction management solutions provide, and it actually works quite well for standard applications that use standard web frameworks. For more complex or customized applications it is a different story.

Business Transactions by URL

The easiest way to identify business transactions is by URL, assuming your application uses URLs that tell you something about the related business transactions. If your application uses URLs like the following you can easily map these URLs to business transactions:

- /home maps to Home
- /search maps to Search
- /user/cart maps to Cart
- /search?q=shoes still maps to Search, but it would be great to actually see the search split by keyword

The following screenshot shows the automatically identified business transactions based on URLs in dynaTrace. We automatically get information on whether there are any errors on these transactions, what the response time is and how much time is spent in the database:

Business transactions by web request URL work well for standard web applications using meaningful URLs that can easily be mapped

Business Transactions by Service/Servlet

Another often-seen scenario is business transactions based on servlet names or on the web service calls that are executed by the technical transaction. This is most often very interesting as you want to know how your calls to the search or credit card web service are doing.
The name of the invoked web service method is often very descriptive and can therefore be used for automatic business transactions. Here is an example:

Business transactions by web service method name work well as method names are often very descriptive

Business Transactions by Page Title

Page titles are very often better than URLs - to clarify, these are the actual titles of the pages users visit. The following shows us business transactions per page title, including information on whether problems are to be found in the browser (browser errors), on the client, in the network or on the server, allowing a first quick root cause analysis:

Business transactions by page title help us to understand end user experience

Customized Business Transactions for Non-standard Applications

A lot of the applications our customers are using don't use standard web frameworks where URLs tell them everything they need to identify their business transactions. Here are some examples:

- Web 2.0 applications use a single service URL. The name of the actual business transaction executed can only be captured from a method argument of the dispatcher method on the backend
- Enforcing SLAs by account type (free, premium, elite members): the account type of a user doesn't come via the URL but is determined by an authentication call on the backend
- Search options are passed via the URL using internal IDs. The human-readable name of an option is queried from the database
- The booking business transaction is only valid if multiple conditions along the technical transaction are fulfilled, e.g. the credit card check is OK and the booking was forwarded to delivery

Let's have a look at some examples of how we use customized business transactions in our own environment.

Requests by User Type

On our Community Portal we have different user types that access our pages. These include employees, customers, partners and several others. When a user is logged in we only get the actual user type from an authentication call that is made to the backend JIRA authentication service. We can capture the return value of the getGroup service call and use this to define a business transaction that splits all authenticated transactions by this user type, allowing me to see which types of users are actually consuming content on our Community Portal:

Using a method return value allows me to analyze activity per user type

Search Conversion Rates

We have a custom search feature on our Community Portal. In order to ensure that people find content based on their keywords I need to understand two things: a) what keywords are used and b) which keywords result in a successful click to a search result and which ones don't -> that helps me to optimize the search index. The following screenshot shows the two business transactions I created. The first splits the search requests based on the keyword, which is passed as an HTTP POST parameter on an Ajax call. The second looks at clicks to content pages and shows those that actually came from a previous search result. For that I use the referrer header (so I know the user came from a search result page) and the last used search keyword (part of the user session):

Using HTTP POST parameter, referrer header and HTTP session information to identify search keywords and the conversion rate to actual result clicks

These were just two examples of how we use business transactions internally. In the use cases described it is not possible to just look at a URL, a servlet or a web service name to identify the actual business transaction.
In these scenarios, and in the scenarios we regularly see with our customers, it is necessary to capture information from within the technical transaction and then define business transactions based on the captured context data.

A Practical Example: How Permanent General Assurance Corporation Uses Business Transactions

As a last example of business transaction management in real life I want to highlight some of what was shown during a webinar by Glen Taylor, Web Service Architect at Permanent General Assurance Corporation (PGAC). PGAC runs TheGeneral.com and PGAC.com. Their web applications don't have URLs that tell them whether the user is currently asking for an insurance quote or is in the process of verifying their credit card. The information about the actual business transaction comes from the instance class of an object passed to their ProcessFactory. dynaTrace captures the object passed as an argument, and its actual class. With that information they are able to split their technical transactions into business transactions.

PGAC uses the instance name of a class to define their business transactions

If you are interested in their full story check out the recorded webinar. It is available for download on the dynaTrace recorded webinar page.

More on Business Transaction Management

If you are already a dynaTrace user you should check out the material we have on our dynaTrace Community Portal: Business Transactions in Core Concepts and Business Transactions in Production. If you are new to dynaTrace, check out the information on our website regarding Business Transaction Management and User Experience Management.

Week 38: You Only Control 1/3 of Your Page Load Performance!
by Klaus Enzenhofer

You don't agree? Have you ever looked at the details of your page load time and analyzed what really impacts it? Let me show you with a real-life example and explain why in most cases you only control 1/3 of the time required to load a page, as the rest is consumed by third party content that you do not have under control.

Be Aware of Third Party Content

When analyzing web page load times we can use tools such as dynaTrace, Firebug or PageSpeed. The following two screenshots show timeline views from dynaTrace AJAX Edition Premium Version. The timelines show all network downloads, rendering activities and JavaScript executions that happen when loading almost exactly the same page. The question is: where does the huge difference come from?
Timeline without/with third party content

The two screenshots below show these two pages as rendered by the browser. From your own application's perspective it is the exact same page; the only difference is the additional third party content. The screenshot on the left hand side refers to the first timeline, the screenshot on the right to the second timeline. To make the differences easier to see I have marked them with red boxes.

Screenshot of the page without and with highlighted third party content

The left hand screenshot shows the page with the content delivered by your application. That's all the business-relevant content you want to deliver to your users, e.g. information about travel offers. Over time this page got enriched with third party content such as tracking pixels, ads, Facebook Connect, Twitter and Google Maps. These third party components make the difference between the two page loads. Everyone can easily see that this enrichment has an impact on page load performance and therefore affects user experience. Watch this video to see how users experience the rendering of the page. The super-fast page that finishes the download of all necessary resources after a little over two seconds is slowed down by eight seconds. Table 1 shows 5 key performance indicators (KPIs) that represent the impact of the third party content.
4 Typical Problems with Third Party Content

Let me explain 4 typical problems that come with adding third party content and why this impacts page load time.

Problem 1: Number of Resources

With every new third party feature we are adding new resources that have to be downloaded by the browser. In this example the number of resources increased by 117. Let's compare it with the SpeedOfTheWeb baseline for the shopping industry. The best shopping page loads at least 72 resources. If we stuck with our original page we would be the leader in this category with just 59 resources. In addition to the 117 roundtrips that it takes to download these resources, the total download size of the page also grows significantly. To download the extra (approximately) 2 MB from the servers of the third party content providers your customer will need extra time. Depending on bandwidth and latency the download time can vary, and if you think of downloading the data via a mobile connection, it really can be time consuming.

Problem 2: Connection Usage

Domain sharding is a good way to enable older browsers to download resources in parallel. Looking at modern web sites, however, domain sharding is often used too aggressively. But how can you do too much domain sharding? Table 2 shows all the domains from which we download only one or two resources. There are 17 domains for downloading 23 resources - domain sharding at its best!

And what about connection management overhead? For each domain we have to do a DNS lookup so that we know which server to connect to. The setup of a connection also needs time. Our example needed 1286 ms for DNS lookups and another 1176 ms for establishing the connections to the servers. As almost every domain belongs to third party content you have no control over these lookups and connections and you cannot reduce them.

Table 2: Domains serving only one or two resources

    URL                          Count
    www.facebook.com             2
    plusone.google.com           2
    www.everestjs.net            2
    pixel2823.everesttech.net    2
    pixel.everesttech.net        2
    metrics.tiscover.com         2
    connect.facebook.net         1
    apis.google.com              1
    maps.google.com              1
    api-read.facebook.com        1
    secure.tiscover.com          1
    www.googleadservices.com     1
    googleads.g.doubleclick.net  1
    ad-dc2.adtech.de             1
    csi.gstatic.com              1
    ad.yieldmanager.com          1
    ssl.hurra.com                1
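A quick back-of-the-envelope sketch of why this hurts; the per-domain costs below are assumptions chosen so that the totals roughly match the measured 1286 ms of DNS time and 1176 ms of connection time from the example, not values taken from the report itself:

    import java.util.List;

    // Rough estimate: every extra third party domain costs at least one DNS lookup
    // and one TCP connection setup before the first byte can be downloaded.
    public class ConnectionOverheadEstimate {
        public static void main(String[] args) {
            List<String> thirdPartyDomains = List.of(
                    "www.facebook.com", "plusone.google.com", "www.everestjs.net",
                    "pixel2823.everesttech.net", "pixel.everesttech.net", "metrics.tiscover.com",
                    "connect.facebook.net", "apis.google.com", "maps.google.com",
                    "api-read.facebook.com", "secure.tiscover.com", "www.googleadservices.com",
                    "googleads.g.doubleclick.net", "ad-dc2.adtech.de", "csi.gstatic.com",
                    "ad.yieldmanager.com", "ssl.hurra.com");

            long assumedDnsLookupMs = 75;        // assumption: average DNS lookup per domain
            long assumedConnectionSetupMs = 70;  // assumption: average TCP handshake per domain

            long totalMs = thirdPartyDomains.size() * (assumedDnsLookupMs + assumedConnectionSetupMs);
            System.out.printf("%d extra domains cost roughly %d ms in lookups and connection setups%n",
                    thirdPartyDomains.size(), totalMs);
        }
    }

Seventeen domains at those rates already add up to roughly two and a half seconds of pure connection management, before a single byte of content has arrived.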
Problem 3: Non-Minified Resources

You try to reduce the download size of your page as much as possible. You have put a lot of effort into your continuous integration (CI) process to automatically minify your JavaScript, CSS and images, and then you are forced to put (for example) ads on your pages. On our example page we can find an ad provider that does not minify its JavaScript. The screenshot below shows part of the uncompressed JavaScript file.

Uncompressed JavaScript code of third party content provider

I have put the whole file content into a compressor tool and the size can be reduced by 20%. And again, you cannot do anything about it.

Problem 4: Awareness of Bad Response Times of Third Party Content Providers

Within your data center you monitor the response times of incoming requests. If response times degrade you will be alerted: within your own data center you know when something is going wrong and you can do something about it. But what about third party content? Do Facebook, Google, etc. send you alerts if they are experiencing bad performance? You might now say that these big providers will never have bad response times, but take a look at the following two examples:

Timeline with slow Facebook request

This timeline shows a very long running resource request. You will never see this request, lasting 10698 ms, in your data center monitoring environment, as the resource is provided by Facebook, one of the third party content providers on this page.

Timeline with slow Facebook and Google+ requests

The second example shows the timeline of a different page, but with the same problem. On this page not only Facebook is slow but Google+ as well. The slow requests have durations from 1.6 seconds to 3.5 seconds and have a big impact on the experience of your users. The problem is that the bad experience is not ascribed to the third party content provider but to YOU!

Conclusion

What we have seen is that third party content has a big impact on user experience. You cannot rely on big third party content providers to always deliver high performance. You should be aware of the problems that can occur if you put third party content on your page, and you really have to take action. In this blog I have highlighted several issues you face with third party content. What should be done to prevent these types of problems will be discussed in my next blog, Third Party Content Management!

Week 39: Why You Have Less Than a Second to Deliver Exceptional Performance
by Alois Reitbauer

The success of the web performance movement shows that there is increasing interest in, and value attached to, fast websites. That faster websites lead to more revenue and reduced costs is a well proven fact today. So being exceptionally fast is becoming the dogma for developing web applications. But what is exceptionally fast, and how hard is it to build a top performing web site?

Defining Exceptionally Fast

First we have to define what exceptionally fast really means. Certainly it means faster than just meeting user expectations. So we have to look at user expectations first. A great source on which response times people expect from software is this book. It provides really good insight into time perception in software, and I can highly recommend it to anybody who works in the performance space. There is no single value that defines what performance users expect.
It depends on the type of user interaction and ranges from a tenth of a second to five seconds. In order to ensure smooth, continuous interaction with the user an application is expected to respond within two to four seconds. So ideally an application should respond within two seconds.

This research is also backed up by studies done by Forrester asking people about their expectations regarding web site response times. The survey shows that while users in 2006 accepted up to four seconds, they expected a site to load within two seconds in 2009. It seems like two seconds is the magic number for a web site to load. As we want to be exceptionally fast, this means that our pages have to load in less than two seconds to exceed user expectations.

How Much Faster Do We Have to Be?

From a purely technical perspective everything faster than two seconds should be considered exceptional. This is however not the case for human users. As we are not clocks, our time perception is not that precise. We are not able to discriminate time differences of only a couple of milliseconds. As a general rule we can say that humans are able to perceive time differences of about 20 percent. This 20 percent rule means that we have to be at least 20 percent faster to ensure that users notice the difference. For delivering exceptional performance this means a page has to load in 1.6 seconds or less to be perceived as exceptionally fast.

How Much Time Do We Have?

At first sight 1.6 seconds seems to be a lot of processing time for responding to a request. This would be true if this time were under our control. Unfortunately this is not the case. As a rule of thumb, about 80 percent of this time cannot be controlled by us, or can only be controlled indirectly. Let's take a closer look at where we lose this time. A good way to understand it is the web application delivery chain. It shows all the parts that play together to deliver a web page and thus influence response times.
Web application delivery chain

On the client side we have to consider rendering, parsing and executing JavaScript. Then there is the whole Internet infrastructure required to deliver content to the user. Then there is our server infrastructure and also the infrastructure of all the third party content providers (like ads, tracking services, social widgets) we have on our page.

Sending Our Initial Request

The first thing we have to do is send the initial request. Let us investigate how much time we lose here. To be able to send the request to the proper server the browser has to look up the domain's IP address via DNS.
Interactions for initial web request

Whenever we communicate over the Internet we have to take two factors into account: bandwidth and latency. Thinking of the Internet as a pipe, bandwidth is the diameter and latency is the length. So while bandwidth helps us to send more data at each single point in time, latency tells us how long it takes each piece of data to travel. For the initial page request we are therefore more interested in latency, as it directly reflects the delay from a user request to the response. So, what should we expect latency-wise? A study on the Yahoo Development Blog has shown that latency varies between 160 and over 400 milliseconds depending on the connection type. So even if we assume a pretty fast connection we have to consider about 300 ms for the two roundtrips. This means we now have 1.3 seconds left.

Getting the Content

So far we haven't downloaded any content yet. How big a site actually is, is not that easy to say. We can however use the stats from the HTTP Archive. Let's assume we have a very small page of about 200 kB. Using a 1.5 Mbit connection it will take about a second to download all the content. This means we now have only 300 ms left. Up to now we have lost about 80 percent of our overall time.

Client Side Processing

Next we have to consider client side processing. Depending on the complexity of the web page this can be quite significant. We have seen cases where this might take up several seconds. Let's assume for now that you are not doing anything really complex. Our own tests at SpeedoftheWeb.org show that 300 ms is a good estimate for client side processing time. This however means that we have no more time left for server side processing. In other words, if we have to do any processing on the server, delivering an exceptionally fast web site is close to impossible, or we have to apply a lot of optimization to reach this ambitious goal.

Conclusion

Delivering exceptional performance is hard - really hard - considering the entire infrastructure in place to deliver content to the user. It is not something you can simply build in later. A survey by Gomez testing a large number of sites shows that most pages miss the goal of delivering exceptional performance across all browsers.

Performance across 200 web sites

Faster browsers help but are not silver bullets for exceptional performance. Many sites even fail to deliver expected user performance. While sites do better when we look at perceived render time, also called above-the-fold time, they still cannot deliver exceptional performance.

Week 40: eCommerce Business Impact of Third Party Address Validation Services
by Andreas Grabner

Are you running an eCommerce site that relies on third party services such as address validation, credit card authorization or mapping services? Do you know how fast, reliable and accurate these service calls (free or charged) are for your web site? Do you know if it has an impact on your end users when one of these services is not available or returns wrong data?

End User and Business Impact of Third Party Service Calls

In last week's webinar, Daniel Schrammel, IT System Manager at Leder & Schuh (responsible for sites such as www.shoemanic.com or www.jelloshoecompany.com), told his story about the impact of third party online services on his business.
One specific problem Leder & Schuh had was with a service that validates shipping address information. If the shipping address entered is valid, users can opt for a cash on delivery option, which is highly popular in the markets they sell to. If the address can't be validated or the service is unreachable, this convenient way of payment is not available and users have to go with credit card payment. As the eCommerce platform used to run their online stores also comes from a third party provider, the Leder & Schuh IT team had no visibility into these third party online service calls, whether they succeed and how that impacts end user behavior.
The bottom left chart shows an aggregation of these 3 return states showing spikes where up to 30% of the validation calls dont return a success: Monitoring third party service calls, the response code and impact on end users Monitoring the service like this allows Leder & Schuh to: Get live data on service invocations -> dont have to wait until the end of the month Can look at those addresses that failed -> to verify if the data was really invalid of whether the validation service uses an out-of-data database Can verify the number of calls made to the service -> verify if they dont get charged for more calls matches what they get charged Can monitor availability of the service -> in case the service is not reachable this breaches the SLA Page 313 Impact of Service Quality on User Experience and Business As indicated in the beginning, the option cash on delivery is much more popular than paying by credit card. In case the address validation service returns that the address is invalid or in case the service is down (not reachable) the user only gets the option to pay with credit card. Correlating the status and the response time of the service call with the actual orders that come in allows Leder & Schuh to see the actual business impact. It turns out that more users bounce off the site when the only payment option they are given is paying by credit card (functional impact) or if the validation service takes too long to respond (performance impact). The following dashboard shows how business can be impacted by the quality of service calls: Quality of service calls (performance or functional) has a direct impact on orders and revenue
Want to Learn More?

During the webinar we also talked about the general response time, service level and system monitoring they are now doing on their eCommerce platform. With the visibility they gained they achieved some significant application performance improvements and boosted overall business. Here are some of the numbers Daniel presented:

50% fewer database queries
30% faster landing pages
100% visibility into all transactions and third party calls

You can watch the recording in the dynaTrace webinar library.

Week 41

The Reason I Don't Monitor Connection Pool Usage
by Michael Kopp

I have been working with performance sensitive applications for a long time now. As can be expected, most of them have to use the database at one point or another, so you inevitably end up having a connection pool. Now, to make sure that your application is not suffering from waiting on connections, you monitor the pool usage. But is that really helping? To be honest - not really.

How an Application Uses the Connection Pool

Most applications these days use connection pools implicitly. They get a connection, execute some statements and close it. The close call does not destroy the connection but puts it back into a pool. The goal is to minimize the so-called busy time. Under the hood most application servers refrain from putting a connection back into the pool until the transaction has been committed. For this reason it is good practice to get the database connection as late as possible during a transaction. Again the goal is to minimize usage time, so that many application threads can share a limited number of connections.

All connection pools have a usage measure to determine if enough connections are available, or in other words to see if the lack of connections has a negative effect. However, as a connection is occupied only for very short amounts of time - often fractions of a second - we would need to check the usage equally often to have a statistically significant chance of seeing the pool being maxed out under normal conditions.

Connection pool usage if polled every 10 seconds

In reality this is not done, as checking the pool too often (say several times a second) would lead to a lot of monitoring overhead. Most solutions check every couple of seconds and as a result we only see pool usage reaching 100% if it is constantly maxed out. If we were to track the usage on a continuous basis the result would look different:

Pool usage as seen if min/max and average are tracked continuously instead of polled

This means that by the time we see 100% pool usage with regular monitoring solutions we would already have a significant negative performance impact - or would we?

What Does 100% Pool Usage Really Mean?

Actually, it does not mean much. It means that all connections in the pool are in use, but not that any transactions are suffering performance problems because of it. In a continuous load scenario we could easily tune our setup to have 100% pool usage all the time without a single transaction suffering; it would be perfect. However, many use cases do not have a steady, continuous load pattern, and we would notice performance degradation long before that.
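For reference, the sampled measurement discussed above typically boils down to reading a JMX gauge every few seconds, roughly like in the following sketch. The MBean name and attribute names are assumptions - they differ between application servers and pool implementations - but the pattern is the same everywhere, and so is its blind spot: anything that happens between two samples is invisible.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PoolUsagePoller {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint and MBean name; adjust for your server and pool.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        ObjectName pool = new ObjectName("com.example:type=ConnectionPool,name=MainDS");

        while (true) {
            // A single gauge value, sampled every 10 seconds. Everything that happens
            // between two samples - including short 100% spikes - goes unnoticed.
            Number active = (Number) mbs.getAttribute(pool, "ActiveConnections");
            Number max = (Number) mbs.getAttribute(pool, "MaxConnections");
            System.out.printf("pool usage: %.0f%%%n",
                    100.0 * active.doubleValue() / max.doubleValue());
            Thread.sleep(10000);
        }
    }
}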
Pool usage alone does not tell us anything; acquisition time does!

This shows the pool usage and the min/max acquisition time, which is non-zero even though the pool is never maxed out

Most application servers and connection pools have a wait or acquisition metric that is far more interesting than pool usage. Acquisition time represents the time that a transaction has to wait for a connection from the pool. It therefore represents real, actionable information. If it increases, we know for a fact that we do not always have enough connections in the pool (or that the connection pool itself is badly written). This measure can show significant wait time long before the average pool usage is anywhere close to 100%. But there is still a slight problem: the measure is still an aggregated average across the whole pool or, more specifically, across all transactions. Thus, while it allows us to understand whether or not there are enough connections overall, it does not enable us to identify which business transactions are impacted and by how much.

Measuring Acquisition Time Properly

Acquisition time is simply the time it takes for the getConnection call to return. We can easily measure that inside our transaction, and if we do, we can account for it on a per business transaction basis and not just as an aggregate of the whole pool. This means we can determine exactly how much time we spend waiting for each transaction type. After all, I might not care about waiting 10ms in a transaction that has an average response time of a second, but the same wait would be unacceptable in a transaction type with a 100ms response time.

The getConnection call as measured in a single transaction: it is 10ms although the pool average is 0.5ms

We could even determine which transaction types are concurrently fighting over limited connections and understand outliers - the occasional case when a transaction waits a relatively long time for a connection - which would otherwise be hidden by the averaging effect.

Configuring the Optimal Pool Size

Knowing upfront how big to configure a pool is not always easy. In reality most people simply set it to an arbitrary number that they assume is big enough. In some high volume cases it might not be possible to avoid wait time entirely, but we can understand and optimize it. There is a very easy and practical way to do this. Simply monitor the connection acquisition time during peak load hours, ideally on a per business transaction basis as described above. You want to pay special attention to how much it contributes to the response time. Make sure that you exclude those transactions from your analysis that do not wait at all; they would just skew your calculation. If the average response time contribution for your specific business transaction is very low (say below 1%) then you can reasonably say that your connection pool is big enough. It is important to note that I am not talking about an absolute value in terms of milliseconds but about contribution time! If that contribution time is too high (e.g. 5% or higher) you will want to increase your connection pool until you reach an acceptable value. The resulting pool usage might be very low on average or close to 100%; it does not really matter!

Conclusion

The usefulness of a pool measure depends on the frequency of polling it. The more often we poll, the more overhead we add - and in the end it is still only a guess. Impact measures like acquisition time are far more useful and actionable; a minimal sketch of such a measurement follows below.
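Here is a minimal sketch of what such a measurement can look like, assuming a standard JDBC DataSource. The business transaction name is simply passed in by the caller; an APM solution would derive it automatically and also take care of percentiles and outliers.

import java.sql.Connection;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import javax.sql.DataSource;

public class MeasuredDataSource {

    private final DataSource delegate;
    // accumulated getConnection wait time per business transaction, in nanoseconds
    private final Map<String, AtomicLong> acquisitionTime =
            new ConcurrentHashMap<String, AtomicLong>();

    public MeasuredDataSource(DataSource delegate) {
        this.delegate = delegate;
    }

    public Connection getConnection(String businessTransaction) throws SQLException {
        long start = System.nanoTime();
        try {
            return delegate.getConnection();
        } finally {
            // record the wait regardless of whether the call succeeded
            long waited = System.nanoTime() - start;
            acquisitionTime.putIfAbsent(businessTransaction, new AtomicLong());
            acquisitionTime.get(businessTransaction).addAndGet(waited);
        }
    }

    public long acquisitionTimeMillis(String businessTransaction) {
        AtomicLong total = acquisitionTime.get(businessTransaction);
        return total == null ? 0 : total.get() / 1000000;
    }
}

Measured this way, acquisition time becomes a per-transaction impact measure rather than a pool-wide average.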
It allows us to tune the connection pool to a point where it has no or at least acceptable overhead when compared to response time. Like all impact measures it is best not to use the overall average, but to understand it in terms of contribution to the end user response time. Page 320 42 Week 42 Sun Mon Tue Wed Thu Fri Sat 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 3 10 17 24 4 11 18 25 5 12 19 26 Top 8 Performance Problems on Top 50 Retail Sites before Black Friday by Andreas Grabner The busiest online shopping season is about to start and its time to make a quick check on whether the top shopping sites are prepared for the big consumer rush, or whether they are likely going to fail because their pages are not adhering to web performance best practices. The holiday rush seems to start ramping up earlier and earlier each year. The Gomez Website Performance Pulse showed an average degradation in performance satisfaction of +13% among the top 50 retail sites in the past 24 hours when compared to a typical non-peak period. For some, like this website below, the impact began a week ago. Its important to understand not only whether there is a slowdown, but also why is it the Internet or my app? The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 321 Nobody wants their site to respond with a page like that as users will most likely be frustrated and run off to their competition Before we provide a deep dive analysis on Cyber Monday looking at the pages that ran into problems we want to look at the top 50 retail pages and see whether they are prepared for the upcoming weekend. We will be using the free SpeedOfTheWeb service as well as deep dive dynaTrace browser diagnostics. We compiled the top problems we saw along the 5 parts of the web delivery chain that are on these pages analyzed today, 3 days before Black Friday 2011. These top problems have the potential to lead to actual problems once these pages are pounded by millions of Christmas shoppers. Page 322 High level results of one of the top retail sites with red and yellow indicators on all parts of the web delivery chain Page 323 User Experience: Optimizing Initial Document Delivery and Page Time The actual user experience can be evaluated by 3 key metrics: rst impression; onload; and fully loaded. Looking at one of the top retail sites we see values that are far above the average for the shopping industry (calculated on a day-to-day basis based on URLs of the Alexa shopping index): 4.1 seconds until the user gets the rst visual indication of the page and a total of 19.2 seconds to fully load the page Problem #1: Too Many Redirects Results in Delayed First Impression The browser cant render any content until there is content to render. From entering the initial URL until the browser can render content there are several things that happen: resolving DNS, establishing connections, following every HTTP redirect, downloading HTML content. One thing that can be seen on some of the retail sites is the excessive use of HTTP redirects. 
Page 324 Here is an example: from entering the initial URL until retrieved from the initial HTML document the browser had to follow 4 redirects, taking 1.3 seconds: 4 HTTP redirects from entering the initial URL until the browser can download the initial HTML document Problem #2: Web 2.0 / JavaScript Impacting OnLoad and Blocking the Browser The timeline view of one of the top retail sites makes it clear that JavaScript the enabler of dynamic and interactive Web 2.0 applications does not necessarily improve user experience but actually impacts user experience by blocking the browser for several seconds before the user can interact with the site: Many network resources from a long list of source domains as well as problematic JavaScript code impact user experience Page 325 The following problems can be seen on the page shown in the timeline above: All JavaScript les and certain CSS les are loaded before any images. This delays rst impression time as the browser has to parse and execute JavaScript les before downloading and painting images One particular JavaScript block takes up to 15 seconds to apply dynamic styles to specic DOM elements Most of the 3rd party content is loaded late which is actually a good practice Browser: JavaScript As already seen in the previous timeline view, JavaScript can have a huge impact in user experience when it performs badly. The problem is that most JavaScript actually performs well on the desktops of web developers. Developers tend to have the latest browser version, have blocked certain JavaScript sources (Google Ads, any types of analytics, etc.) and may not test against the full blown web site. The analysis in this blog was done on a laptop running Internet Explorer (IE) 8 on Windows 7. This can be considered an average consumer machine. Here are two common JavaScript problem patterns we have seen on the analyzed pages: Problem #3: Complex CSS Selectors Failing on IE 8 On multiple pages we can see complex CSS selectors such as the following that take a long time to execute: Complex jQuery CSS lookups taking a very long time to execute on certain browsers Page 326 Why is this so slow? This is because of a problem in IE 8s querySelectorAll method. The latest versions of JavaScript helper libraries (such as jQuery, Prototype and YUI) take advantage of querySelectorAll and simply forward the CSS selector to this method. It however seems that some of these more complex CSS lookups cause querySelectorAll to fail. The fallback mechanism of JavaScript helper libraries is to iterate through the whole DOM in case querySelectorAll throws an exception. The following screenshot shows exactly what happens in the above example: Problem in IEs querySelectorAll stays unnoticed due to empty catch block. Fallback implementation iterates through the whole DOM Page 327 Problem #4: 3rd Party Plugins uch as Supersh One plugin that we found several times was Supersh. We actually already blogged about this two years ago - it can lead to severe client side performance problems. Check out the blog from back then: Performance Analysis of dDynamic JavaScript Menus. One example this year is the following Supersh call that takes 400ms to build the dynamic menu: jQuery plugins such as Supersh can take several hundred milliseconds and block the browser while doing the work Content: Size and Caching In order to load and display a page the browser is required to load the content. That includes the initial HTML document, images, JavaScript and CSS les. 
Users that come back to the same site later in the holiday season need not necessarily download the full content again but rather access already-cached static content from the local machine. Two problems related to loading content are the actual size as well as the utilization of caching: Page 328 Problem #5: Large Content Leads to Long Load Times Too much and large content is a problem across most of the sites analyzed. The following is an example of a site with 2MB of total page time where 1.5MB was JavaScript alone: This site has 2MB in size which is far above the industry average of 568kb Even on high-speed Internet connections a page size that is 4 times the industry average is not optimal. When accessing pages of that size from a slow connection maybe from a mobile device loading all content to display the site can take a very long time leading to frustration of the end user. Problem #6: Browser Caching Not Leveraged Browser-side caching is an option: web sites have to cache mainly static content on the browser to improve page load time for revisiting users. Many of the tested retail sites show hardly any cached objects. The following report shows one example where caching basically wasnt used at all: Client-side caching is basically not used at all on this page. Caching would improve page load time for revisiting users For more information check out our Best Practices on Browser Caching. Page 329 Network: Too Many Resources and Slow DNS Lookups Analyzing the network characteristics of the resources that get downloaded from a page can give a good indication whether resources are distributed optimally across the domains they get downloaded from. Problem #7: Wait and DNS Time The following is a table showing resources downloaded per domain. The interesting numbers are highlighted. There seems to be a clear problem with a DNS lookup to one of the 3rd party content domains. It is also clear that most of the content comes from a single domain which leads to long wait times as the browser cant download all of them in parallel: Too many resources lead to wait times. Too many external domains add up on DNS and connect Time. Its important to identify the bottlenecks and nd the optimal distribution Reducing resources is a general best practice which will lower the number of roundtrips and also wait time per domain. Checking on DNS and connection times especially with 3rd party domains allows you to speed up these network related timings. Server: Too Many Server-Requests and Long Server Processing Time Besides serving static JavaScript, CSS and image les, eCommerce sites have dynamic content that gets delivered by application servers. Page 330 Problem #8: Too Much Server Side Processing Time Looking at the server request report gives us an indication on how much time is spent on the web servers as well as application servers to deliver the dynamic content. Server processing time and the number of dynamic requests impact highly dynamic pages as we can nd them on eCommerce sites Long server-side processing time can have multiple reasons. Check out the latest blog on the Impact of 3rd Party Service Calls on your eBusiness as well as our other server-side related articles on the dynaTrace blog for more information on common application performance problems. 
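Coming back to Problem #6 for a moment: on the server side, enabling browser caching for static resources mostly means sending appropriate Cache-Control and Expires headers. Many teams configure this directly in the web server; as a rough illustration (the URL mapping and the one-week max-age are arbitrary choices for this sketch), a simple servlet filter can achieve the same:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Map this filter to static content, e.g. /images/*, /css/* and /js/* in web.xml.
public class StaticCacheHeaderFilter implements Filter {

    private static final long ONE_WEEK_SECONDS = 7L * 24 * 60 * 60;

    public void init(FilterConfig filterConfig) { }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse resp = (HttpServletResponse) response;
        // Tell the browser it may reuse the resource for a week without re-requesting it.
        resp.setHeader("Cache-Control", "public, max-age=" + ONE_WEEK_SECONDS);
        resp.setDateHeader("Expires", System.currentTimeMillis() + ONE_WEEK_SECONDS * 1000);
        chain.doFilter(request, response);
    }

    public void destroy() { }
}

Applied to the static content of the sites analyzed above, this alone would let revisiting users skip most of the repeated downloads.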
Page 331 Waiting for Black Friday and Cyber Monday This analysis shows that, with only several days until the busiest online shopping season of the year starts, most of the top eCommerce sites out there still have the potential to improve their websites in order to not deliver a frustrating user experience once shoppers go online and actually want to spend money. We will do another deep dive blog next week and analyze some of the top retail pages in more detail providing technical as well as business impact analysis. Page 332 43 Week 43 Sun Mon Tue Wed Thu Fri Sat 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 3 10 17 24 4 11 18 25 5 12 19 26 5 Things to Learn from JC Penney and Other Strong Black Friday and Cyber Monday Performers by Andreas Grabner The busiest online shopping time in history has brought a signicant increase in visits and revenue over the last year. But not only were shoppers out there hunting for the best deals - web performance experts and the media were out there waiting for problems to happen in order to blog and write about the business impact of sites performing badly or sites that actually went down. We therefore know by now that even though the Apple Store performed really well it actually went down for a short while Friday morning. The questions many are asking right now are: what did those that performed strongly do right, and what did those performing weakly miss in preparing for the holiday season? Learn by Comparing Strong with Weak Performers We are not here to do any nger-pointing but want to provide an objective analysis on sites that performed well vs. sites that could do better in order to keep users on their site. Looking at what sites did right allows you to follow their footsteps. Knowing what causes weak performance allows you to avoid the things that drag your site down. The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 333 Strong Performance by Following Best Practices JC Penney was the top performer based on Gomez Last Mile Analysis on both Black Friday and Cyber Monday, followed by Apple and Dell. For mobile sites it was Sears followed by Amazon and Dell. Taking a closer look at their sites allows us to learn from them. The SpeedoftheWeb speed optimization report shows us that they had strong ratings across all 5 dimensions in the Web performance delivery chain: SpeedoftheWeb analysis shows strong ratings across all dimensions of the web delivery chain Using dynaTrace browser diagnostics technology allows us to do some forensics on all activities that happen when loading the page or when interacting with dynamic Web 2.0 elements on that page. The following screenshot shows the timeline and highlights several things that allow JC Penney to load very quickly as compared to other sites: Low number of resources, light on 3rd party content and not overloaded with Web 2.0 elements Page 334 Things they did well to improve performance: Light on 3rd Party content, e.g.: They dont use Facebook or Twitter JavaScript plugins but rather just provide a popup link to access their pages on these social networks. The page is not overloaded with hundreds of images or CSS les. They for instance only have one CSS le. Minifying this le (removing spaces and empty lines) would additionally minimize the size. JavaScript on that page is very lightweight. No long running script blocks, onLoad event handlers or expensive CSS selectors. 
One thing that could be done is merging some of the JavaScript les. Static images are hosted on a separate cache-domain served with cache-control headers. This will speed up page load time for revisiting users. Some of these static images could even be sprited to reduce download time. Things they could do to become even better Use CSS sprites to merge some of their static images, e.g. sprite the images in the top toolbar (Facebook, Twitter, sign-up options) Combine and minify JavaScript les Page 335 Weak Performance by not Following Best Practices On the other hand we have those pages that didnt perform that well. Load times of 15 seconds often lead to frustrated users who will then shop somewhere else. The following is a typical SpeedoftheWeb report for these sites showing problems across the web delivery chain: Bad ratings across the web delivery chain for pages that showed weak performance on Black Friday and Cyber Monday Now lets look behind the scenes and learn what actually impacts page load time. The following is another timeline screenshot with highlights on the top problem areas: Page 336 Weak performing sites have similar problem patterns in common, e.g. overloaded with 3rd party content, heavy on JavaScript and not leveraging browser caching Things that impact page load time: Overloaded pages with too much content served in non-optimized form, e.g. static images are not cached or sprited. JavaScript and CSS les are not minimized or merged. Heavy on 3rd party plugins such as ad services, social networks or user tracking. Many single resource domains (mainly due to 3rd party plugins) with often high DNS and connect time. Heavy JavaScript execution e.g. inefcient CSS selectors, slow 3rd party JavaScript libraries for UI effects, etc. Multiple redirects to end up on correct URL, e.g. http://site.com -> http://www.site.com -> http://www.site.com/start.jsp Page 337 To-do-List to Boost Performance for the Remaining Shopping Season Looking across the board at strong and weak performers allows us to come up with a nice to-do list for you to make sure you are prepared for the shoppers that will come to your site until Christmas. Here are the top things to do: Task 1: Check Your 3rd Party Content We have seen from the previous examples that 3rd party content can impact your performance by taking lengthy DNS and connect time, long- lasting network downloads as well as adding heavy JavaScript execution. We know that 3rd party content is necessary but you should do a check on what the impact is and whether there is an alternate solution to embedding 3rd party content, e.g. embed Facebook with a static link rather than the full blown Facebook Connect plugin. Analyze how well your 3rd party components perform with network, server and JavaScript time Also check out the blog on the Impact of 3rd Party Content on Your Load Time. Task 2: Check the Content You Deliver and Control Too often we see content that has not been optimized at all. HTML or JavaScript les that total up to several hundred kilobytes per le often caused by server-side web frameworks that generate lots of unnecessary empty lines, blanks, add code comments to the generated output, etc. The best practice is to combine, minify and compress text les such as HTML, Page 338 JavaScript and CSS les. There are many free tools out there such as YUI Compressor or Closure Compiler. Also check out Best Practices on Network Requests and Roundtrips. Check the content size and number of resources by content type. 
Combining les and compressing content helps reduce roundtrips and download time Developer comments generated in the main HTML document are good candidates for saving on size Task 3: Check Your JavaScript Executions JavaScript whether custom-coded, a popular JavaScript framework or added through 3rd party content is a big source of performance problems. Analyze the impact of JavaScript performance across all the major browsers, not just the browsers you use in development. Problems with methods Page 339 such as querySelectorAll in IE 8 (discovered in my previous blog post) or problems with outdated jQuery/Yahoo/GWT libraries can have a huge impact on page load time and end user experience: Inefcient CSS selectors or slow 3rd party JavaScript plugins can have major impact on page load time Find more information on problematic JavaScript in the following articles: Impact of Outdated JavaScript Libraries and 101 on jQuery Selector Performance Task 4: Check Your Redirect Settings As already highlighted in my previous blog post where I analyzed sites prior to Black Friday, many sites still use a series of redirects in order to get their users onto the initial HTML document. The following screenshot shows an example and how much time is actually wasted before the browser can start downloading the rst initial HTML document: Proper redirect conguration can save unnecessary roundtrips and speed up page load time For more information and best practices check out our blog post How We Saved 3.5 Seconds by Using Proper Redirects. Page 340 Task 5: Check Your Server-Side Performance Especially with dynamic pages containing location-based offers, your shopping cart or a search result page requires server-side processing. When servers get overloaded with too many requests and when the application doesnt scale well we can observe performance problems when returning the initial HTML document. If this document is slow, the browser isnt able to download any other objects and therefore leaves the screen blank. Typical server-side performance problems are either a result of architectural problems (not built to scale), implementation problems (bad algorithms, wasteful with memory leading to excessive garbage collection, too busy on the database, etc.) or problems with 3rd party Web services (credit card authentication, location based services, address validation, and so on). Check out the Top 10 Server-side Performance Problems Taken from Zappos, Monster, Thomson and Co: Application performance problems on the application server will slow down initial HTML download and thus impact the complete page load Page 341 A good post on how 3rd party web services can impact your server-side processing was taken from the experience report of a large European eCommerce site: Business Impact of 3rd Party Address Validation Service. Conclusion: Performance Problems are Avoidable I hope this analysis gave you some ideas or pointed you to areas you havent thought about targeting when it comes to optimizing your web site performance. Most of these problems are easily avoidable. If you need further information, I leave you with some additional links to blogs we did in the past that show how to optimize real-life application performance: Top 10 Client-side Performance Problems Top 10 Server-side Performance Problems taken from Zappos, Monster, Thomson and Co Real life page analysis on: US Open 2011, Masters, Yahoo News For more, check out our other blogs we have on web performance, and mobile performance. 
Page 342 44 Week 44 Sun Mon Tue Wed Thu Fri Sat 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 3 10 17 24 4 11 18 25 5 12 19 26 Performance of a Distributed Key Value Store, or Why Simple is Complex by Michael Kopp Last time I talked about the key differences between RDBMS and the most important NoSQL databases. The key reasons why NoSQL databases can scale the way they do is that they shard based on the entity. The simplest form of NoSQL database shows this best: the distributed key/value store. Last week I got the chance to talk to one of the Voldemort developers on LinkedIn. Voldemort is a pure Dynamo implementation; we discussed its key characteristics and we also talked about some of the problems. In a funny way its biggest problem is rooted in its very cleanliness, simplicity and modular architecture. Performance of a Key/Value Store A key/value store has a very simple API. All there really is to it is a put, a get and a delete. In addition Voldemort, like most of its cousins, supports a batch get and a form of an optimistic batch put. That very simplicity makes the response time very predictable and should also make performance analysis and tuning rather easy. After all one call is like the other and there are only so many factors that a single call can be impacted by: The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 343 I/O in the actual store engine on the server side (this is plug-able in Voldemort, but Berkeley DB is the default) Network I/O to the Voldemort instance Cache hit rate Load distribution Garbage collection Both disk and network I/O are driven by data size (key and value) and load. Voldemort is a distributed store that uses Dynamos consistent hashing; as such the load distribution across multiple nodes can vary based on key hotness. Voldemort provides a very comprehensive JMX interface to monitor these things. On the rst glance this looks rather easy. But on the second glance Voldemort is a perfect example of why simple systems can be especially complex to monitor and analyze. Lets talk about the distribution factor and the downside of a simple API. Performance of Distributed Key/Value Stores Voldemort, like most distributed key/value stores, does not have a master. This is good for scalability and fail-over but means that the client has a little more work to do. Even though Voldemort (and most of its counterparts) does support server side routing, usually the client communicates with all server instances. If we make a put call it will communicate with a certain number of instances that hold a replica of the key (the number is congurable). In a put scenario it will do a synchronous call to the rst node. If it gets a reply it will call the remainder of required nodes in parallel and wait for the reply. A get request, on the other hand, will call the required number of nodes in parallel right away. Page 344 Transaction ow of a single benchmark thread showing that it calls both instances every time What this means is that the client performance of Voldemort is not only dependent on the response time of a single server instance. It actually depends on the slowest one, or in case of put the slowest plus one other. This can hardly be monitored via JMX of the Voldemort instances. Lets understand why. What the Voldemort server sees is a series of put and get calls. Each and every one can be measured. 
But we are talking about a lot of them, and what we get via JMX is moving averages and maximums:

Average and maximum get and put latency as measured on the Voldemort instance

Voldemort also comes with a small benchmarking tool which reports the client-side performance of the executed test:

[reads] Operations: 899
[reads] Average(ms): 11.3326
[reads] Min(ms): 0
[reads] Max(ms): 1364
[reads] Median(ms): 4
[reads] 95th(ms): 13
[reads] 99th(ms): 70
[transactions] Operations: 101
[transactions] Average(ms): 74.8119
[transactions] Min(ms): 6
[transactions] Max(ms): 1385
[transactions] Median(ms): 18
[transactions] 95th(ms): 70
[transactions] 99th(ms): 1366

Two facts stick out. First, the client-side average performance is a lot worse than reported by the server side. This could be due to the network, or due to the fact that we always see the average of the slower call instead of the overall average (remember, we call multiple server instances for each read/write). The second important piece of data is the relatively high volatility. Neither of the two can be explained by looking at the server side metrics!

The performance of a single client request depends on the response time of the replicas that hold the specific key. In order to understand client-side performance we would need to aggregate response time on a per-key and per-instance basis. The volume of statistical data would be rather large. Capturing response times for every single key read and write is a lot to capture, but, more to the point, analyzing it would be a nightmare. What is even more important: the key alone might tell us which key range is slow, but not why - it is not actionable. As we have often explained, context is important for performance monitoring and analysis. In the case of a key/value store, the context of the API alone is not enough, and the context of the key is far too much and not actionable. This is the downside of a simple API. The distributed nature only makes this worse, as key hotness can lead to an uneven distribution in your cluster.

Client Side Monitoring of Distributed Key Value Stores

To keep things simple I used the performance benchmark that comes with Voldemort to show things from the client side.

Single benchmark transaction showing the volatility of single calls to Voldemort

As we can see, the client does indeed call several Voldemort nodes in parallel and has to wait for all of them (at least in my example) to return. By looking at things from the client side we can understand why some client functionality has to wait for Voldemort even though the server side statistics would never show that. Furthermore, we can show the contribution of Voldemort operations overall, or of a specific Voldemort instance, to a particular transaction type. In the picture we see that Voldemort (at least end-to-end) contributes 3.7% to the response time of doit. We also see that the vast majority is in the put calls of the applyUpdate. And we also see that the response time of the nodes in the put calls varies by a factor of three!
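To make the client-side view concrete, the following helper shows the kind of timing behind the charts above. The storeCall parameter is a placeholder for the actual Voldemort client invocation (a get, put or batch get); the essential point is that the measurement happens on the client and is tagged with the business transaction, which the server-side JMX metrics can never give us.

import java.util.concurrent.Callable;

public class KeyValueCallTimer {

    // Executes a store call and reports its client-side latency,
    // tagged with the business transaction it belongs to.
    public static <T> T timed(String businessTransaction, String operation, Callable<T> storeCall)
            throws Exception {
        long start = System.nanoTime();
        try {
            return storeCall.call();
        } finally {
            long micros = (System.nanoTime() - start) / 1000;
            // A real setup would feed this into an APM or metrics system instead of stdout.
            System.out.printf("%s %s took %d microseconds%n",
                    businessTransaction, operation, micros);
        }
    }
}

Wrapping every client call of the benchmark (or of your application) in such a helper yields per-transaction, per-operation latencies instead of a single pool of averages.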
If the application is mostly user-driven it might be nigh impossible to predict up front. One way to overcome this is to correlate end user business transactions with the triggered Voldemort load and response time. The idea is that an uneven load distribution on your distributed key/value store should be triggered by one of the three scenarios All the excessive load is triggered by the same application functionality This is pretty standard and means that the keys that you use in that functionality are either not evenly spread, monotonic increasing or otherwise unbalanced. All the excessive load is triggered by a certain end user group or a specic dimension of a business transaction One example would be that the user group is part of the key(s) and that user group is much more active than usual or others. Restructuring the key might help to make it more diverse. Another example is that you are accessing data sets like city information and for whatever reason New York, London and Vienna are accessed much more often than anything else. (e.g. more people book a trip to these three cities than to anything else) A combination of the above Either the same data set is accessed by several different business Page 348 transactions (in which case you need a cross cut) or the same data structure is accessed by the same business transaction. The key factor is that all this can be identied by tracing your application and monitoring it via your business transactions. The number of discrete business transactions and their dimensions (booking per location, search by category) is smaller than the number of keys you use in your store. More importantly, it is actionable! The fact that 80% of the load on your 6 overloaded store instances results from the business transaction search books thriller enables you to investigate further. You might change the structure of the keys, optimize the access pattern or setup a separate store for the specic area if necessary. Identifying Outliers The second area of issues that are hard to track down are outliers. These are often considered to be environmental factors. Again, JMX metrics arent helping much here, but taking a look at the internals quickly reveals what is happening: PurePath showing the root cause of an outlier Page 349 In my load test of two Voldemort instances (admittedly a rather small cluster) the only outliers were instantly tracked down to synchronization issues within the chosen store engine: Berkeley DB. What is interesting is that I could see that all requests to the particular Voldemort instance that happened during that time frame were similar blocked in Berkley DB. Seven were waiting for the lock in that synchronized block and the 8th was blocking the rest while waiting for the disk. Hotspot showing where 7 transactions were all waiting for a lock in Berkley DB Page 350 The root cause for the lock was that a delete had to wait for the disk This issue happened randomly, always on the same node and was unrelated to concurrent access of the same key. By seeing both the client side (which has to wait and is impacted), the corresponding server side (which shows where the problem is) and having all impacted transactions (in this case I had 8 transactions going to Voldemort1 which were all blocked) I was able to pinpoint the offending area of code immediately. Granted, as a Voldemort user that doesnt want to dig into Berkley DB I cannot x it, but it does tell me that the root cause for the long synchronization block is disk wait and I can work with that. 
Conclusion

Key/value stores like Voldemort are usually very fast and have very predictable performance. The key to this is the very clean and simple interface (usually get and put) that does not allow for much volatility in terms of execution path on the server side. This also means that they are much easier to configure and optimize, at least as far as the speed of a single instance goes. However, this very simplicity can also be a burden when trying to understand end user performance, contribution and outliers. In addition, even simple systems become complex when you make them distributed, add more and more instances and send millions of requests to them. Luckily the solution is easy: focus on your own application and its usage of the key/value store instead of on the key/value store alone.

Week 45

Pagination with Cassandra, And What We Can Learn from It
by Michael Kopp

Like everybody else it took me a while to wrap my head around the BigTable concepts in Cassandra. The brain needs some time to accept that a column in Cassandra is really not the same as a column in our beloved RDBMS. After that I wrote my first web application with it and ran into a pretty typical problem: I needed to list a large number of results and page through them on my web page. And like many others I ran straight into the next wall. Only this time it was not really Cassandra's fault, so I thought I would share what I found.

Pagination in the Table Oriented World

In the mind of every developer there is a simple solution for paging. You add a sequence column to the table that is monotonically increasing and use a select like the following:

select * from my_data where sq_num between 25 and 50

This would get me one page of rows, and it is fast too, because I made sure the sq_num column had an index attached to it. On the face of it this sounds easy, but you run into problems quickly. Almost every use case requires the result to be sorted by some of the columns. In addition, the data is not static, but gets inserted and possibly updated all the time. Imagine you are returning a list of names, sorted by first name. The sq_num approach will not work because you cannot re-sequence large amounts of data every time. But luckily databases have a solution for that. You can do crazy selects like the following:

select name, address from (
    select rownum r, name, address from (
        select name, address from person order by name
    )
) where r > 25 and r <= 50;

It looks crazy, but is actually quite fast on Oracle (and I think SQL Server too) as it is optimized for it. Although all databases have similar concepts, most don't do so well in terms of performance. Often the only thing possible with acceptable performance is to limit the number of returned rows. Offset queries, as presented here, incur a severe performance overhead. With that in mind I tried to do the same for Cassandra.

Pagination in Cassandra

I had a very simple use case. I stored a list of journeys on a per-tenant basis in a column family. The name of the journey was the column name and the value was the actual journey. So getting the first 25 items was simple.
get_slice(key : tenant_key,
          column_parent : {column_family : Journeys_by_Tenant},
          predicate : {
            slice_range : {
              start : "A",
              end : "Z",
              reverse : false,
              count : 25
            }
          })

But like so many others I got stuck here. How do I get the next 25 items? I looked, but there was no offset parameter, so I asked doctor Google, and the first thing I found was: Don't do it! After some more reading, however, I found the solution, and it is very elegant indeed - more so than what I was doing in my RDBMS, and best of all it is applicable to RDBMS as well! The idea is simple: instead of using a numeric position and a counter, you simply remember the last returned column name and use it as the starting point of your next request. So if the first result returned a list of journeys and the 25th was "Bermuda", then the next button would execute the following:

get_slice(key : tenant_key,
          column_parent : {column_family : Journeys_by_Tenant},
          predicate : {
            slice_range : {
              start : "Bermuda",
              end : "Z",
              reverse : false,
              count : 26
            }
          })

You will notice that I now retrieve 26 items. This is because start and end are inclusive and I simply ignore the first item in the result. Sounds super, but how do we go backwards? It turns out that is also simple: you use the first result of your last page and execute the following:

get_slice(key : tenant_key,
          column_parent : {column_family : Journeys_by_Tenant},
          predicate : {
            slice_range : {
              start : "Bermuda",
              end : "A",
              reverse : true,
              count : 26
            }
          })

The reverse attribute tells get_slice to go backwards. What is important is that the end of a reverse slice must be before the start. Done! Well, not quite. Having a First and a Last button is no problem (simply use reverse starting with "Z" for the last page), but if, like many web pages, you want direct jumpers to page numbers, you will have to add some ugly cruft code. However, you should ask yourself how useful it really is to jump to page 16. There is no contextual meaning to the 16th page. It might be better to add bookmarks like A, B, C instead of direct page numbers.

Applying This to RDBMS?

The pagination concept found in Cassandra can be applied to every RDBMS. For the first select, simply limit the number of returned rows by ROWNUM, LIMIT or similar (you might also use the JDBC API, as sketched at the end of this article):

select name, address from person
 order by name
 fetch first 25 rows only;

For the next call, we can apply what we learned from Cassandra:

select name, address from person
 where name > 'Andreas'
 order by name
 fetch first 25 rows only;

If we want to apply this to the previous button it looks like this:

select name, address from person
 where name < 'Michael'
 order by name desc
 fetch first 25 rows only;

For the Last button simply omit the where clause. The advantage? It is far more portable than offset selects - virtually every database will support it. It should also perform rather well, as long as you have an index on the name column (the one you sort by). Finally, there is no need for a counter column!

Conclusion

NoSQL databases challenge us because they require some rewiring of our RDBMS-trained brain. However, some of the things we learn can also make our RDBMS applications better. Of course you can always do even better and build pagination into your API. Amazon's SimpleDB is doing that, but more on SimpleDB later.
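Before wrapping up, here is the JDBC sketch referenced in the RDBMS section above. Table and column names are the same illustrative ones used in the SQL statements; the row limit is applied through the JDBC API to keep the statement portable.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PersonPager {

    // Returns the next page of up to 'pageSize' names after 'lastNameOnPreviousPage'.
    // Pass null to fetch the first page.
    public static void printNextPage(Connection con, String lastNameOnPreviousPage, int pageSize)
            throws SQLException {
        String sql = (lastNameOnPreviousPage == null)
                ? "select name, address from person order by name"
                : "select name, address from person where name > ? order by name";
        PreparedStatement ps = con.prepareStatement(sql);
        try {
            if (lastNameOnPreviousPage != null) {
                ps.setString(1, lastNameOnPreviousPage);
            }
            ps.setMaxRows(pageSize); // limit via JDBC instead of vendor-specific SQL
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("name") + " - " + rs.getString("address"));
            }
        } finally {
            ps.close();
        }
    }
}

The only state the client has to remember is the last name on the page it just displayed.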
Stay tuned Page 357 46 Week 46 Sun Mon Tue Wed Thu Fri Sat 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 3 10 17 24 31 The Top Java Memory Problems Part 2 by Michael Kopp Some time back I planned to publish a series about Java memory problems. It took me longer than originally planned, but here is the second installment. In the rst part I talked about the different causes for memory leaks, but memory leaks are by no means the only issue around Java memory management. High Memory Usage It may seem odd, but too much memory usage is an increasingly frequent and critical problem in todays enterprise applications. Although the average server often has 10, 20 or more GB of memory, a high degree of parallelism and a lack of awareness on the part of the developer leads to memory shortages. Another issue is that while it is possible to use multiple gigabytes of memory in todays JVMs the side effects are very long garbage collection (GC) pauses. Sometimes increasing the memory is seen as a workaround to memory leaks or badly-written software. More often than not this makes things worse in the long run and not better. These are the most common causes for high memory usage. The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 358 HTTP Session as Cache The session caching anti-pattern refers to the misuse of the HTTP session as data cache. The HTTP session is used to store user data or a state that needs to survive a single HTTP request. This is referred to as conversational state and is found in most web applications that deal with non-trivial user interactions. The HTTP session has several problems. First, as we can have many users, a single web server can have quite a lot of active sessions, so it is important to keep them small. The second problem is that they are not specically released by the application at a given point. Instead, web servers have a session timeout which is often quite high to increase user comfort. This alone can easily lead to quite large memory demands if we consider the number of parallel users. However in reality we often see HTTP sessions multiple megabytes in size. These so called session caches happen because it is easy and convenient for the developer to simply add objects to the session instead of thinking about other solutions like a cache. To make matters worse this is often done in a re and forget mode, meaning data is never removed. After all why should you? The session will be removed after the user has left the page anyway (or so we may think). What is often ignored is that session timeouts from 30 minutes to several hours are not unheard of. A practical example is the storage of data that is displayed in HTML selection elds (such as country lists). This semi-static data is often multiple kilobytes in size and is held per user in the heap if kept in the session. It is better to store this which moreover is not user-specic in one central cache. Another example is the misuse of the hibernate session to manage the conversational state. The hibernate session is stored in the HTTP session in order to facilitate quick access to data. This means storage of far more states than necessary, and with only a couple of users, memory usage immediately increases greatly. In modern Ajax applications, it may also be possible to shift the conversational state to the client. 
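To illustrate the country-list example above: semi-static reference data can be held once per application instead of once per session. The following sketch is deliberately simplified (hard-coded data, no reloading, no clustering) and just shows the principle.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Holds semi-static reference data once per JVM instead of once per HTTP session.
public final class CountryListHolder {

    private static volatile List<String> countries;

    private CountryListHolder() { }

    public static List<String> getCountries() {
        List<String> result = countries;
        if (result == null) {
            synchronized (CountryListHolder.class) {
                if (countries == null) {
                    // In a real application this would be loaded from a database or file.
                    countries = Collections.unmodifiableList(
                            Arrays.asList("Austria", "Germany", "Switzerland"));
                }
                result = countries;
            }
        }
        return result;
    }
}

Rendering the selection field then reads from this shared holder rather than from the HTTP session, which keeps each individual session small. Shifting conversational state to the client, as mentioned above, goes one step further.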
In the ideal case, this leads to a state-less or state-poor server application that scales much better. Another side effect of big HTTP sessions is that session replication becomes a real problem. Page 359 Incorrect Cache Usage Caches are used to increase performance and scalability by loading data only once. However, excessive use of caches can quickly lead to performance problems. In addition to the typical problems of a cache, such as misses and high turnaround, a cache can also lead to high memory usage and, even worse, to excessive GC behavior. Mostly these problems are simply due to an excessively large cache. Sometimes, however, the problem lies deeper. The key word here is the so-called soft reference. A soft reference is a special form of object reference. Soft references can be released at any time at the discretion of the garbage collector. In reality however, they are released only to avoid an out-of-memory error. In this respect, they differ greatly from weak references, which never prevent the garbage collection of an object. Soft references are very popular in cache implementations for precisely this reason. The cache developer assumes, correctly, that the cache data is to be released in the event of a memory shortage. If the cache is incorrectly congured, however, it will grow quickly and indenitely until memory is full. When a GC is initiated, all the soft references in the cache are cleared and their objects garbage collected. The memory usage drops back to the base level, only to start growing again. This phenomenon can easily be mistaken to be an incorrectly congured young generation. It looks as if objects get tenured too early only to be collected by the next major GC. This kind of problem often leads to a GC tuning exercise that cannot succeed. Only proper monitoring of the cache metrics or a heap dump can help identify the root cause of the problem. Churn Rate and High Transactional Memory Usage Java allows us to allocate a large number of objects very quickly. The generational GC is designed for a large number of very short-lived objects, but there is a limit to everything. If transactional memory usage is too high, it can quickly lead to performance or even stability problems. The difculty here is that this type of problem comes to light only during a load test and Page 360 can be overlooked very easily during development. If too many objects are created in too short a time, this naturally leads to an increased number of GCs in the young generation. Young generation GCs are only cheap if most objects die! If a lot of objects survive the GC it is actually more expensive than an old generation GC would be under similar circumstances! Thus high memory needs of single transactions might not be a problem in a functional test but can quickly lead to GC thrashing under load. If the load becomes even higher these transactional objects will be promoted to the old generation as the young generation becomes too small. Although one could approach this from this angle and increase the size of the young generation, in many cases this will simply push the problem a little further out, but would ultimately lead to even longer GC pauses (due to more objects being alive at the time of the GC). The worst of all possible scenarios, which we see often nevertheless, is an out-of-memory error due to high transactional memory demand. If memory is already tight, higher transaction load might simply max out the available heap. 
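Coming back to the cache sizing issue described earlier: the most robust remedy is a hard upper bound on the number of entries, so that the memory footprint stays predictable regardless of load. In practice you would configure the size limit of whatever cache library you already use; the sketch below merely shows the principle with a plain LinkedHashMap.

import java.util.LinkedHashMap;
import java.util.Map;

// A very small LRU cache with a hard upper bound on the number of entries.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access-order = true turns this into an LRU map
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the limit is reached,
        // instead of growing until the garbage collector has to step in.
        return size() > maxEntries;
    }
}

Combined with monitoring of hit rate and size, such a bound avoids both the unbounded growth and the mass eviction on every major GC that soft-reference caches tend to show.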
The tricky part is that once the OutOfMemory hits, transactions that wanted to allocate objects but couldnt are aborted. Subsequently a lot of memory is released and garbage collected. In other words the very reason for the error is hidden by the OutOfMemory error itself! As most memory tools only look at the Java memory every couple of seconds they might not even show 100% memory at any point in time. Since Java 6 it is possible to trigger a heap dump in the event of an OutOfMemory which will show the root cause quite nicely in such a case. If there is no OutOfMemory one can use trending or -histo memory dumps (check out jmap or dynaTrace) to identify those classes whose object numbers uctuate the most. Those are usually classes that are allocated and garbage collected a lot. The last resort is to do a full scale allocation analysis. Page 361 Large Temporary Objects In extreme cases, temporary objects can also lead to an out-of-memory error or to increased GC activity. This happens, for example, when very large documents (XML, PDF) have to be read and processed. In one specic case, an application was unavailable temporarily for a few minutes due to such a problem. The cause was quickly found to be memory bottlenecks and garbage collection that was operating at its limit. In a detailed analysis, it was possible to pin down the cause to the creation of a PDF document: byte tmpData[] = new byte[1024]; int offs = 0; do { int readLen = bis.read(tmpData, offs, tmpData.length - offs); if(readLen == -1) break; offs += readLen; if(offs == tmpData.length) { byte newres[] = new byte[tmpData.length + 1024]; System.arraycopy(tmpData, 0, newres, 0,tmpData.length); tmpData = newres; } } while(true); To the seasoned developer it will be quite obvious that processing multiple megabytes with such a code leads to bad performance due to a lot of unnecessary allocations and ever growing copy operations. However a lot of times such a problem is not noticed during testing, but only once a certain level of concurrency is reached where the number of GCs and/or Page 362 amount of temporary memory needed, becomes a problem. When working with large documents, it is very important to optimize the processing logic and prevent it from being held completely in the memory. Memory-related Class Loader Issues Sometimes I think that the class loader is to Java what DLL hell was to Windows. When there are memory problems, one thinks primarily of objects that are located in the heap and occupy memory. In addition to normal objects, however, classes and constant values are also administered in the heap. In modern enterprise applications, the memory requirements for loaded classes can quickly amount to several hundred MB, and thus often contribute to memory problems. In the Hotspot JVM, classes are located in the so-called permanent generation or PermGen. It represents a separate memory area, and its size must be congured separately. If this area is full, no more classes can be loaded and an out-of-memory occurs in the PermGen. The other JVMs do not have a permanent generation, but that does not solve the problem. It is merely recognized later. Class loader problems are some of the most difcult problems to detect. Most developers never have to deal with this topic and tool support is also poorest in this area. I want to show some of the most common memory- related class loader problems: Large Classes It is important not to increase the size of classes unnecessarily. 
This is especially the case when classes contain a great many string constants, such as in GUI applications. Here all strings are held in constants. This is basically a good design approach, however it should not be forgotten that these constants also require space in the memory. On top of that, in the case of the Hotspot JVM, string constants are a part of the PermGen, which can then quickly become too small. In a concrete case the application had a separate class for every language it supported, where each class contained every single text constant. Each of these classes itself was actually too Page 363 large already. Due to a coding error that happened in a minor release, all languages, meaning all classes, were loaded into memory. The JVM crashed during start up no matter how much memory was given to it. Same Class in Memory Multiple Times Application servers and OSGi containers especially tend to have a problem with too many loaded classes and the resulting memory usage. Application servers make it possible to load different applications or parts of applications in isolation to one another. One feature is that multiple versions of the same class can be loaded in order to run different applications inside the same JVM. Due to incorrect conguration this can quickly double or triple the amount of memory needed for classes. One of our customers had to run his JVMs with a PermGen of 700MB - a real problem since he ran it on 32bit Windows where the maximum overall JVM size is 1500MB. In this case the SOA application was loaded in a JBoss application server. Each service was loaded into a separate class loader without using the shared classes jointly. All common classes, about 90% of them, were loaded up to 20 times, and thus regularly led to out-of-memory errors in the PermGen area. The solution here was strikingly simple: proper conguration of the class loading behavior in JBoss. The interesting point here is that it was not just a memory problem, but a major performance problem as well! The different applications did use the same classes, but as they came from different class loaders, the server had to view them as different. The consequence was that a call from one component to the next, inside the same JVM, had to serialize and de- serialize all argument objects. This problem can best be diagnosed with a heap dump or trending dump (jmap -histo). If a class is loaded multiple times, its instances are also counted multiple times. Thus, if the same class appears multiple times with a different number of instances, we have identied such a problem. The class loader responsible can be determined in a heap dump through simple reference tracking. We can also take a look at the variables of the class Page 364 loader and, in most cases, will nd a reference to the application module and the .jar le. This makes it possible to determine even if the same .jar le is being loaded multiple times by different application modules. Same Class Loaded Again and Again A rare phenomenon, but a very large problem when it occurs, is the repeated loading of the same class, which does not appear to be present twice in the memory. What many forget is that classes are garbage collected too, in all three large JVMs. The Hotspot JVM does this only during a major GC, whereas both IBM and JRockit can do so during every GC. Therefore, if a class is used for only a short time, it can be removed from the memory again immediately. Loading a class is not exactly cheap and usually not optimized for concurrency. 
Same Class Loaded Again and Again

A rare phenomenon, but a very large problem when it occurs, is the repeated loading of the same class, which does not appear to be present twice in memory. What many forget is that classes are garbage collected too, in all three large JVMs. The Hotspot JVM does this only during a major GC, whereas both IBM and JRockit can do so during every GC. Therefore, if a class is used for only a short time, it can be removed from memory again immediately. Loading a class is not exactly cheap and usually not optimized for concurrency. If the same class is loaded by multiple threads, Java synchronizes these threads. In one real-world case, the classes of a scripting framework (BeanShell) were loaded and garbage collected repeatedly because they were used for only a short time and the system was under load. Since this took place in multiple threads, the class loader was quickly identified as the bottleneck once analyzed under load. However, the development took place exclusively on the Hotspot JVM, so this problem was not discovered until it was deployed in production. On the Hotspot JVM this specific problem will only occur under load and memory pressure, as it requires a major GC, whereas in the IBM JVM or JRockit it can already happen under moderate load. The class might not even survive the first garbage collection!

Incorrect Implementation of Equals and Hashcode

The relationship between the hashcode method and memory problems is not obvious at first glance. However, if we consider where the hashcode method is of high importance, this becomes clearer. The hashcode and equals methods are used within hash maps to insert and find objects based on their key. However, if the implementation of these operators is faulty, existing entries are not found and new ones keep being added. While the collection responsible for the memory problem can be identified very quickly, it may be difficult to determine why the problem occurs. We had this case at several customers. One of them had to restart his server every couple of hours even though it was configured to run at 40 GB! After fixing the problem they ran quite happily with 800 MB. A heap dump, even if complete information on the objects is available, rarely helps in this case; one would simply have to analyze too many objects to identify the problem. In this case, the best variant is to test comparative operators proactively, in order to avoid such problems. There are a few free frameworks (such as http://code.google.com/p/equalsverifier/) that ensure that equals and hashcode conform to the contract.
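A minimal sketch of this pattern (class and field names are invented): the key class overrides equals but not hashCode, so logically identical keys land in different hash buckets, lookups miss, and the map grows with every request:

    import java.util.HashMap;
    import java.util.Map;

    class CacheKey {
        private final String id;
        CacheKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof CacheKey && ((CacheKey) o).id.equals(this.id);
        }
        // hashCode() is intentionally not overridden: the equals/hashCode
        // contract is broken, which is one variant of the bug described above.
    }

    public class LeakyCacheDemo {
        public static void main(String[] args) {
            Map<CacheKey, byte[]> cache = new HashMap<CacheKey, byte[]>();
            for (int i = 0; i < 3; i++) {
                // The "same" key is created for every request ...
                cache.put(new CacheKey("user-42"), new byte[1024]);
            }
            // ... but instead of replacing one entry we now hold three.
            System.out.println("entries: " + cache.size()); // prints 3
        }
    }

A contract checker like the EqualsVerifier framework mentioned above catches exactly this class of mistake in a unit test.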
Conclusion

High memory usage is still one of the most frequent problems that we see, and it often has performance implications. However, most of these issues can be identified rather quickly with today's tools. In the next installment of this series I will talk about how to tune your GC for optimal performance, provided you do not suffer from memory leaks or the problems mentioned in this blog. You might also want to read my other memory blogs:

The Top Java Memory Problems - Part 1
How Garbage Collection differs in the three big JVMs
Major GCs - Separating Myth from Reality
The impact of Garbage Collection on Java performance

Week 47

How to Manage the Performance of 1000+ JVMs
by Michael Kopp

Most production monitoring systems I have seen have one major problem: there are too many JVMs, CLRs and hosts to monitor. One of our bigger customers (a Fortune 500 company) mastered the challenge by concentrating on what really matters: the applications!

Ensure Health

The following dashboard is taken directly from the production environment of that customer:

High-level transaction health dashboard that shows how many transactions are performing badly

What it does is pretty simple: it shows the load of transactions in their two data centers. The first two charts show the transaction load over different periods of time. The third shows the total execution sum of all those transactions. If the execution time goes up but the transaction count does not, they know they have a bottleneck to investigate further. The pie charts to the right show the same information in a collapsed form. The color coding indicates the health of the transactions: green ones have a response time below a second, while red ones are over 3 seconds. In case of an immediate problem the red area in the five-minute pie chart grows quickly and they know they have to investigate. The interesting thing is that instead of looking at the health of hosts or databases, the primary indicators they use for health are their end users' experience and their business transactions. If the amount of yellow or red transactions increases, they start troubleshooting. The first lesson we learn from this is to measure health in terms that really matter to your business and end users. CPU and memory utilization do not matter to your users; response time and error rates do.

Define Your Application

Once they detect a potential performance or health issue, they first need to isolate the problematic application. This might sound simple, but they have hundreds of applications running in over 1000 JVMs in this environment. Each application spans several JVMs plus several C++ components. Each transaction in turn flows through a subset of all these processes. Identifying the application responsible is important to them, and for that purpose they have defined another simple dashboard that shows the applications that are responsible for the red transactions:

This dashboard shows which business transactions are the slowest and which are very slow most often

They are using dynaTrace business transaction technology to trace and identify all their transactions. This allows them to identify which specific business transactions are slow and which of them are slow most often. They actually show this on a big screen for all to see. So not only does Operations have an easy time identifying the team responsible, most of the time that team already knows by the time they get contacted! This is our second lesson learned: know and measure your application(s) first! This means:

You define and measure performance at the unique entry point to the application/business transaction
You know or can dynamically identify the resources, services and JVMs used by that application and measure those

Measure Your Application and its Dependencies

Once the performance problem or an error is identified, the real fun begins, as they need to identify where the problem originates in the distributed system. To do that we need to apply the knowledge that we have about the application and measure the response time on all involved tiers. The problem might also lie between two tiers, in the database or with an external service you call. You should not only measure the entry points but also the exit points of your services.
In large environments, like the one in question, it is not possible to know all the dependencies upfront. Therefore we need the ability to automatically discover the tiers and resources used instead. Page 370 Show the transaction ow of a single business transaction type At this point, we can isolate the fault domain down to the JVM or Database level. The logical next step is to measure the things that impact the application on those JVMs/CLRs. That includes the resources we use and the third party services we call. But in contrast to the usual utilization- based monitoring, we are interested in metrics that reect the impact these resources have on our application. For example: instead of only monitoring the connection usage of a JDBC connection it makes much more sense to look at the average wait duration and the number of threads waiting for a connection. These metrics represent the direct impact the resource pool has. The usage on the other hand explains why a thread is waiting but 100% usage does not imply that a thread is waiting! The downside with normal JMX-based monitoring of resource measures is that we still cannot directly relate their impact to a particular type of transaction or service. We can only do that if we measure the connection acquisition directly from within the service. This is similar to measuring the client side and server Page 371 side of a service call. The same thing can be applied to the execution of database statements itself. Our Fortune 500 company is doing exactly that and found that their worst-performing application is executing the following statements quite regularly This shows that a statement that takes 7 seconds on average is executed regularly While you should generally avoid looking at top 10 reports for analysis, in this case it is clear that the statements were at the core of their performance problem. Finally we also measure CPU and memory usage of a JVM/CLR. But we again look at the application as the primary context. We measure CPU usage of a specic application or type of transaction. It is important to remember that an application in the context of SOA is a logical entity and cannot be identied by its process or its class alone. It is the runtime context, e.g. the URI or the SOAP message that denes the application. Therefore, in order to nd the applications responsible for CPU consumption, we measure it on that level. Measuring memory on a transaction level is quite hard and maybe not worth the effort, but we can measure the impact that garbage collection (GC) has. The JVM TI interface informs us whenever a GC suspends the application threads. This can be directly related to response time impact on the currently executing transactions or applications. Our customer uses such a technique to investigate the transactions that consume the most CPU time or are impacted the most by garbage collection: Page 372 Execution time spent in fast, slow and very slow transactions compared with their respective volume This dashboard shows them that, although most execution time is spent in the slow transactions, they only represent a tiny fraction of their overall transaction volume. This tells them that much of their CPU capacity is spent in a minority of their transactions. They use this as a starting point to go after the worst transactions. At the same time it shows them on a very high level how much time they spend in GC and if it has an impact. This again lets them concentrate on the important issues. 
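Coming back to the resource example above: measuring connection acquisition from within the service is straightforward. The following is a minimal sketch (a hypothetical wrapper, not a dynaTrace API) that records how long each transaction actually waits for a JDBC connection, which is the impact metric, rather than pool utilization, which is only the explanation:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.sql.DataSource;

    public class TimedDataSource {
        private final DataSource delegate;
        private final AtomicLong totalWaitMillis = new AtomicLong();
        private final AtomicLong acquisitions = new AtomicLong();

        public TimedDataSource(DataSource delegate) {
            this.delegate = delegate;
        }

        public Connection getConnection() throws SQLException {
            long start = System.nanoTime();
            try {
                return delegate.getConnection();
            } finally {
                // Time spent here is time the current transaction waited for a
                // connection: a direct impact metric, unlike pool usage in percent.
                totalWaitMillis.addAndGet((System.nanoTime() - start) / 1000000L);
                acquisitions.incrementAndGet();
            }
        }

        public long averageWaitMillis() {
            long count = acquisitions.get();
            return count == 0 ? 0 : totalWaitMillis.get() / count;
        }
    }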
All this gives them a fairly comprehensive, yet still manageable, picture of where the application spends time, waits and uses resources. The only thing that is left to do is to think about errors. Monitoring for Errors As mentioned before, most error situations need to be put into the context of their trigger in order to make sense. As an example: if we get an exception telling us that a particular parameter for a web service is invalid we need to know how that parameter came into being. In other words we want to know which other service produced that parameter or if the user entered something wrong, which should have been validated Page 373 on the screen already. Our customer is doing the reverse which also makes a lot of sense. They have the problem that their clients are calling them and complaining about poor performance or errors happening. When a client calls them, they use a simple dashboard to look up the user/account and from there lter down to any errors that happened to that particular user. As errors are captured as part of the transaction they also identify the business transaction responsible and have the deep-dive transaction trace that the developer needs in order to x it. That is already a big step towards a solution. For their more important clients they are actually working on proactively monitoring those and actively call them up in case of problems. In short, when monitoring errors we need to know which application and which ow led to that error and which input parameters were given. If possible we would also like to have stack traces of all involved JVMs/CLRs and the user that triggered it. Making Sure an Optimization Works! There is one other issue that you have in such a large environment. Whenever you make changes it might have a variety of effects. You want to make sure that none are negative and that the performance actually improves. You can obviously do that in tests. You can also compare previously recorded performance data with the new one, but in such a large environment this can be quite a task, even if you automate it. Our customer came up with a very pragmatic way to do a quick check instead of a going through the more comprehensive analysis right away. The fact is all they really care about are the slow or very slow transactions, and not so much whether satisfactory performance got even better. Page 374 Transaction load performance breakdown that shows that the outliers are indeed reduced after the x The chart shows the transaction load (number of transactions) on one of their data centers, color coded for satisfactory, slow and very slow response time (actually they are split into several more categories). We see the outliers on the top of the chart (red portion of the bars). The dip in the chart represents the time that they diverted trafc to the other data center to apply the necessary changes. After the load comes back the outliers have been signicantly reduced. While this does not grantee that the change applied is optimal in all cases it tells them that overall it has the desired effect under full production load! What about Utilization Metrics? At this point you might ask if I have forgotten about utilization metrics like CPU usage and the like, or if I simply dont see their uses. No I have not forgotten them and they have uses. But they are less important than you might think. A utilization metric tells me if that resource has reached Page 375 capacity. 
In that regard it is very important for capacity planning, but as far as performance and stability go it only provides additional context. As an example: knowing that the CPU utilization is 99% does not tell me whether the application is stable or if that fact has a negative impact on the performance. It really doesnt! On the other hand if I notice that an application is getting slower and none of the measured response time metrics (database, other services, connection pools) increase while at the same time the machine that hosts the problematic service reaches 99% CPU utilization we might indeed have hit a CPU problem. But to verify that I would in addition look at the load average which, similar to the number of waiting threads on a connection pool, signies the number of threads waiting for a CPU and thus signies real impact. The value operating system level utilization metrics give gets smaller all the time. Virtualization and cloud technologies not only distort the measurement itself indeed by both running in a shared environment and having the ability to get more resources on demand, resources are neither nite nor dedicated and thus resource utilization metrics become dubious. At the same time application response time is unaffected if measured correctly, and remains the best and most direct indicator of real performance! Page 376 48 Week 48 Sun Mon Tue Wed Thu Fri Sat 4 11 18 25 5 12 19 26 6 13 20 27 7 14 21 28 1 8 15 22 29 2 9 16 23 30 3 10 17 24 31 Third Party Content Management Applied: Four Steps to Gain Control of Your Page Load Performance! by Klaus Enzenhofer Todays web sites are often cluttered with third party content that slows down page load and rendering times, hampering user experience. In my rst blog post, I presented how third party content impacts your websites performance and identied common problems with its integration. Today I want to share the experience I have gained as a developer and consultant in the management of third party content. In the following, I will show you best practices for integrating Third Party Content and for convincing your business that they will benet from establishing third party management. First the bad news: as a developer, you have to get the commitment for establishing third party management and changing the integration of third party content from the highest level of business management possible the best is CEO level. Otherwise you will run into problems implementing improvements. The good news is that, from my experience, this is not an unachievable goal you just have to bring the problems up the right way with hard facts. Lets start our journey towards implementing third party content management from two possible starting points I have seen in the past: the rst one is triggered if someone from the business has a bad user experience and wants to nd out who is responsible for the slow pages. The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com Subscribe to blog feed Subscribe by email Page 377 The second one is that you as the developer know that your page is slow. No matter where you are starting the rst step you should make is to get exact hard facts. Step 1: Detailed Third Party Content Impact Analysis For a developer this is nothing really difcult. The only thing we have to do is to use the web performance optimization tool of our choice and take a look at the page load timing. What we get is a picture like the screenshot below. 
We as developers immediately recognize that we have a problem but for the business this is a diagram that needs a lot of explanation. Timeline with third party content Page 378 As we want to convince them we should make it easier to understand for them. In my experience something that works well is to take the time to implement a URL parameter that turns off all the third party content for a webpage. Then we can capture a second timeline from the same page without the Third Party requests. Everybody can now easily see that there are huge differences: Timeline without third party content We can present these timelines to the business as well but we still have to explain what all the boxes, timings, etc. mean. We should invest some more minutes and create a table like the one below, where we compare some main key performance indicators (KPI). As a single page is not representative we prepare similar tables for the 5 most important pages. Which pages these are depends on your website. Landing pages, product pages and pages on the money path are potentially interesting. Our web analytics tool can help us to nd the most interesting pages. Page 379 Step 2: Inform Business about the Impact During step 1 we have found out that the impact is signicant, we have collected facts and we still think we have to improve the performance of our application. Now it is time to present the results of the rst step to the business. From my experience the best way to do this is a face to face meeting with high level business executives. CEO, CTO and other business unit executives are the appropriate attendees. The presentation we give during this meeting should cover the following three major topics: Case study facts from other companies The hard facts we have collected Recommendations for improvements Google, Bing, Amazon, etc. have done case studies that show the impact of slow pages to the revenue and the users interaction with the website. Amazon for example found out that a 100 ms slower page reduces revenue by 1%. I have attached an example presentation to this blog which should provide some guidance for your presentation and contains some more examples. After this general information we can show the hard facts about our system and as the business now is aware of the relationship between performance and revenue they normally listen carefully. Now we are no longer talking about time but money. At the end of our presentation we make some recommendations about how we can improve the integration of third party content. Dont be shy no third party content plugin is untouchable at this point. Some of the recommendations can only be decided by the business and not by the development. Our goals for this meeting are that we have the commitment to proceed, that we get support from the executives when discussing the implementation alternatives, and that we have a follow up meeting with Page 380 the same attendees to show improvements. What would be nice is the consent of the executives to the recommended improvements but from my experience they commit seldom. Step 3: Check Third Party Content Implementation Alternatives Now as we have the commitment, we can start thinking about integration alternatives. If we stick to the standard implementation the provider recommends, we wont be able to make any improvements. We have to be creative and always have to try to create win-win situations! Here in this blog I want to talk about 4 best practices I have encountered the past. 
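All of these comparisons, and the automated regression checks in Step 4 later on, rely on the switch introduced in Step 1 that turns third party content off per request. As a minimal sketch of how such a switch could look, assuming a Java servlet stack and using invented parameter and attribute names, a simple filter is enough; the page templates then only emit third party tags when the request attribute is set:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class ThirdPartySwitchFilter implements Filter {

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            // "?thirdparty=off" disables all third party tags for this request,
            // which lets us capture timelines with and without that content.
            boolean enabled = !"off".equals(req.getParameter("thirdparty"));
            req.setAttribute("thirdPartyEnabled", Boolean.valueOf(enabled));
            chain.doFilter(req, res);
        }

        public void init(FilterConfig config) throws ServletException { }

        public void destroy() { }
    }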
Best Practice 1: Remove It Every developer will now say OK, thats easy, and every businessman will say Thats not possible because we need it! But do you really need it? Lets take a closer look at social media plugins, tracking pixels and ads. A lot of websites have integrated social media plugins like those for Twitter or Facebook. They are very popular these days and a lot of webpages have integrated such plugins. Have you ever checked how often your users really use one of the plugins? A customer of ours had integrated ve plugins. After 6 months they checked how often each of them was used. They found out that only one was used, and only by other people than the QA department who checked that all of them were working after each release. With a little investigation they found out that four of the ve plugins could be removed as simply nobody was using them. What about tracking pixels? I have seen a lot of pages out there that have not only integrated one tracking pixel but ve, seven or even more. Again, the question is: do we really need all of them? It does not matter who we ask, we will always get a good explanation of why a special pixel is needed but stick to the goal of reducing it down to one pixel. Find out which one can deliver most or even all of the data that each department Page 381 needs and remove all the others. Problems we might run into will be user privileges and business objectives that are dened for teams and individuals on specic statistics. It costs you some effort to handle this but at the end things will get easier as you have only one source for your numbers and you will stop discussing which statistics delivers the correct values and from which statistics to take the numbers, as there is only one left. Once at a customer we have removed 5 tracking pixels with a single blow. As this led to incredible performance improvements, their marketing department made an announcement to let customers know they care about their experience. This is a textbook example of creating a win-win-situation as mentioned above. Other third party content that is a candidate for removal is banner ads. Businessmen will now say this guy is crazy to make such a recommendation, but if your main business is not earning money with displaying ads then it might be worth taking a look at it. Taking the numbers from Amazon that 100 ms additional page load time reduces revenue by one percent and think of the example page where ads consume round about 1000 ms of page load time 10 times that. This would mean that we lose 10 * 1% = 10% of our revenue just because of ads. The question now is: Are you really earning 10% or more of your total revenue with ads? If not you should consider removing ads from your page. Best Practice 2: Move Loading of Resources Back After the Onload Event As we now have removed all unnecessary third party content, we still have some work left. For user experience, apart from the total load time, the rst impression and the onload-time are the most important timings. To improve these timings we can implement lazy loading where parts of the page are loaded after the onload event via JavaScript; several libraries are available that help you implement this. There are two things you should be aware of: the rst thing is that you are just moving the starting point for the download of the resources so you are not reducing the download size of your page, or the number of requests. The second thing is that lazy loading Page 382 only works when JavaScript is available in the users browser. 
So you have to make sure that your page is useable without JavaScript. Candidates for moving the download starting point back are plugins that only work if JavaScript is available or are not vital to the usage of the page. Ads, social media plugins, maps, are in most cases such candidates. Best Practice 3: Load on User Click This is an interesting option if you want to integrate a social media plugin. The standard implementation for example of such a plugin looks like the picture below. It consists of a button to trigger the like/tweet action and the number of likes/tweets. Twitter and Facebook buttons integration example To improve this, the question that has to be answered is: do the users really need to know how often the page was tweeted, liked, etc.? If the answer is no we can save several requests and download volume. All we have to do is deliver a link that looks like the action button and if the user clicks on the link we can open a popup window or an overlay where the user can perform the necessary actions. Best Practice 4: Maps vs. Static Maps This practice focuses on the integration of maps like Google Maps or Bing Maps on our page. What can be seen all around the web are map integrations where the maps are very small and only used to give the user a hint about where the point of interest is located. To show the user this hint, several JavaScript les and images have to be downloaded. In most cases the user does not need to zoom or reposition the map, and as the map is small it is also hard to use. Why not use the static map implementation Bing Maps and Google Maps are offering? To gure out the advantages of the static implementation I have created two HTML pages which show the Page 383 same map. One uses the standard implementation and the other the static implementation. Find the source les here. After capturing the timings we get the following results: When we take a closer look at the KPIs we can see that every KPI for the static Google Maps implementation is better. Especially when we look at the timing KPIs we can see that the rst impression and the Onload time improve by 34% and 22%. The total load time decreases by 1 second which is 61% less, a really big impact on user experience. Some people will argue that this approach is not applicable as they want to offer the map controls to their customers. But remember Best Practice 3 load on user click: As soon as the user states his intention of interacting with the map by clicking on it, we can offer him a bigger and easier-to- use map by opening a popup, overlay or a new page. The only thing the development has to do is to surround the static image with a link tag. Page 384 Step 4: Monitor the Performance of your Web Application/Third Party Content As we need to show improvements in our follow-up meeting with business executives, it is important to monitor how the performance of our website evolves over the time. There are three things that should be monitored by the business, Operations and Development: 1. Third party content usage by customers and generated business value Business Monitoring 2. The impact of newly added third party content Development Monitoring 3. The performance of third party content in the client browser Operations Monitoring Business Monitoring: An essential part of the business monitoring should be a check whether the requested third party features contribute to business value. Is the feature used by the customer or does it help us to increase our revenue? 
We have to ask this question again and again, not only once at the beginning of development, but every time business, Development and Operations meet to discuss web application performance. If the answer ever is "No, the feature is not adding value", remove it as soon as possible!

Operations Monitoring: There are only a few tools that help us to monitor the impact of third party content on our users. What we need is either a synthetic monitoring system like Gomez provides, or a monitoring tool that really sits in our users' browsers and collects the data there, like dynaTrace User Experience Management (UEM). Synthetic monitoring tools allow us to monitor the performance from specified locations all over the world. The only downside is that we are not getting data from our real users. With dynaTrace UEM we can monitor the third party content performance of all our users wherever they are situated, and we get the timings as they actually experienced them. The screenshot below shows a dashboard from dynaTrace UEM that contains all the important data from the operations point of view. The pie chart and the table below it indicate which third party content provider has the biggest impact on the page load time and how that impact is distributed. The three line charts on the right side show the request trend, the total page load time and onload time, and the average time that third party content contributes to your page performance.

dynaTrace third party monitoring dashboard

Development Monitoring: A very important point is that Development has the ability to compare KPIs between two releases and view the differences between the pages with and without third party content. If we have already established functional web tests that integrate with a web performance optimization tool, that tool delivers the necessary values for the KPIs. We just have to reuse the switch we established during Step 1 and run automatic tests on the pages we identified as the most important. From this moment on we will always be able to automatically find regressions caused by third party content.

We may also consider enhancing our switch and making each of the third party plugins switchable individually. This allows us to check the overhead a new plugin adds to our page. It also helps us when we have to decide which feature we want to turn on if there are two or more similar plugins. Last but not least, as business, Operations and Development now have all the necessary data to improve the user experience, we should meet up regularly to check the performance trends of our page and find solutions to upcoming performance challenges.

Conclusion

It is not a big deal to start improving the third party content integration. If we want to succeed it is necessary that business executives, Development and Operations work together. We have to be creative, we have to make compromises and we have to be ready to go different ways of integration. Never stop aiming for a top performing website! If we take things seriously we can improve the experience of our users and therefore increase our business.

Week 49

Clouds on Cloud Nine: the Challenge of Managing Hybrid-Cloud Environments
by Andreas Grabner

Obviously, cloud computing is not just a fancy trend anymore. Quite a few SaaS offerings are already built on platforms like Windows Azure.
Others use Amazon's EC2 to host their complete infrastructure, or at least use it for additional resources to handle peak load or do number-crunching. Many also end up with a hybrid approach (running distributed across public and private clouds). Hybrid environments especially make it challenging to manage cost and performance, overall and in each of the clouds. In this blog post we discuss the reasons why you may want to move your applications into the cloud, why you may need a hybrid-cloud approach, or why it might be better to stay on-premise. If you choose a cloud or hybrid-cloud approach, the question of managing your apps in these silos comes up. You want to make sure your move to the cloud makes sense in terms of total cost of ownership while ensuring at least the same end user experience as compared to running your applications the way you do today.

Cloud or No Cloud: A Question You Have to Answer

The decision to move to the cloud is not easy and depends on multiple factors: is it technically feasible? Does it save cost, and how can we manage cost and performance? Is our data secure with the cloud provider? Can we run everything in a single cloud, do we need a hybrid-cloud approach, and how do we integrate our on-premise services? The question is also which parts of your application landscape benefit from a move into the cloud. For some it makes sense, for some it will not. Another often-heard question is the question of trust: we think it is beyond question that any cloud data center is physically and logically secured to the highest standards. In fact, a cloud data center is potentially more secure than many data centers of small or medium sized enterprises. It boils down to the amount of trust you have in your cloud vendor. Now, let's elaborate on the reasons why you may or may not move your applications to the cloud.

Reasons and Options for Pure Cloud

If you are a pure Microsoft shop and have your web applications implemented on ASP.NET using SQL Server as your data repository, Microsoft will talk you into using Windows Azure. You can focus on your application and Microsoft provides the underlying platform, with options to scale, optimize performance using content delivery networks (CDNs), leverage single sign-on and other services.

If you have a Java or Python application and don't want to care about the underlying hardware and deployment to application and web servers, you may want to go with Google AppEngine.

If you have any type of application that runs on Linux (or also Windows), then there is of course the oldest and most experienced player in the cloud computing field: Amazon Web Services. The strength of Amazon is not necessarily in PaaS (Platform as a Service) but more in IaaS, as they make it very easy to spawn new virtual machine instances through EC2.

There is a nice overview that compares these three cloud providers: Choosing from the major PaaS providers. (Remember: the landscape and offerings are constantly changing, so make sure to check with the actual vendors on pricing and services.) There are a lot more providers in the more traditional IaaS space, like Rackspace, GoGrid and others.

Reasons for Staying On-Premise

Not everybody is entitled to use the cloud, and sometimes it simply doesn't make sense to move your applications from your data centers to a cloud provider.
Here are three reasons: 1. It might be the case that regulatory issues or the law stop you from using cloud resources. For instance you are required to isolate and store data physically on premise, e.g. banking industry. 2. You have legacy applications requiring specic hardware or even software (operating system); it can be laborious and thus costly or simply impossible to move into the cloud. 3. It is simply not cheaper to run your applications in the cloud; the cloud provider doesnt offer all the services you require to run your application on their platform or it would become more complex to manage your applications through the tools provided by the cloud vendor. Reasons for Hybrid Clouds We have customers running their applications in both private and public clouds, sometimes even choosing several different public cloud providers. The common scenarios here are: Cover peak load: lets assume you operate your application within your own data center and deal with seasonal peaks (e.g. four weeks of Christmas business). You might consider additional hardware provisioning in the cloud to cover these peaks. Depending on your technologies used you may end up using multiple different public cloud providers. Page 391 Data center location constraints: in the gambling industry its required by law to have data centers in certain countries in order to offer these online services. In order to avoid building data centers around the globe and taking them down again when local laws change we have seen the practice of using cloud providers in these countries instead of investing a lot of money up-front to build your own data centers. Technically this is not different from choosing a traditional hosting company in that country, but a cloud-based approach provides more exibility. And here again it is possible to choose different cloud providers as not every cloud provider has data centers in the countries you need. Improve regional support and market expansion: when companies grow and expand to new markets they also want to serve these new markets with the best quality possible. Therefore its common practice to use cloud services such as CDNs or even to host the application in additional regional data centers of the current cloud providers. Externalize frontend and secure backend: we see this scenario a lot in eCommerce applications where the critical business backend services are kept in the private data center, with the frontend application hosted in the public cloud. During business/shopping hours it is easy to add additional resources to cover the additional frontend activity. During off-hours its easy and cost-saving to scale down instead of having many servers running idle in your own environment. A Unied View: The (Performance) Management Challenge in Clouds Combining your Azure usage reports with your Google Analytics statistics at the end of the month and correlating this with the data collected in your private cloud is tedious job and in most cases wont give you the answers to the question you have, which potentially include: How well do we leverage the resources in the cloud(s)? How much does it cost to run certain applications/business transactions especially when costs are distributed across clouds? Page 392 How can we identify problems our users have with the applications running across clouds? How can we access data from within the clouds to speed up problem resolution? 
Central Data Collection from All Clouds and Applications We at dynaTrace run our systems (Java, .NET and native applications) across Amazon EC2, Microsoft Windows Azure and on private clouds for the reasons mentioned above, and so do an increasing number of our customers. In order to answer the questions raised above we monitor these applications both from an infrastructure and cloud provider perspective as well as from a transactional perspective. To achieve this we Use the Amazon API to query information about Instance usage and cost Query the Azure Diagnostics Agent to monitor resource consumption Use dynaTrace User Experience Management (UEM) to monitor end user experience Use dynaTrace application performance management (APM) across all deployed applications and all deployed clouds Monitor business transactions to map business to performance and cost The following shows an overview of what central monitoring has to look like. End users are monitored using dynaTrace UEM, the individual instances in the cloud are monitored using dynaTrace Agents (Java, .NET, native) as well as dynaTrace Monitors (Amazon Cost, Azure Diagnostics, and so on). This data combined with data captured in your on-premise deployment is collected by the dynaTrace Server providing central application performance management: Page 393 Getting a unied view of application performance data by monitoring all components in all clouds Real-Life Cross Cloud Performance Management Now lets have a closer look at the actual benets we get from having all this data available in a single application performance management solution. Understand Your Cloud Deployment Following every transaction starting at the end-user all the way through your deployed application makes it possible to a) understand how your application actually works b) how your application is currently deployed in a this very dynamic environment and c) identify performance hotspots: Follow your end user transactions across your hybrid cloud environment: identify architectural problems and hotspots Page 394 Central Cost Control It is great to get monthly reports from Microsoft but it is better to monitor your costs online up-to-the-minute. The following screenshot shows the dashboard that highlights the number of Amazon instances we are using and the related costs giving us an overview of how many resources we are consuming right now: Live monitoring of instances and cost on Amazon Monitor End User Experience If you deploy your application across multiple cloud data centers you want to know how the users serviced by individual data centers do. The following screenshot shows us how end user experience is for our users in Europe they should mainly be serviced by the European data centers of Azure: Page 395 Analyze regional user experience and verify how well your regional cloud data centers service your users Root Cause Analysis In case your end users are frustrated because of bad performance or problems you want to know what these problems are and whether they are application or infrastructure related. 
Capturing transactional data from within the distributed deployed application allows us to pinpoint problems down to the method level: Page 396 Identify which components or methods are your performance hotspots on I/O, CPU, sync or wait For developers it is great to extract individual transactions including contextual information such as exceptions, log messages, web service calls, database statements and the information on the actual hosts (web role in Azure, JVM in EC2 or application server onpremise) that executed this transaction: dynaTrace PurePath works across distributed cloud applications making it easy for developers to identify and x problems Page 397 Especially the information from the underlying hostswhether virtual in one of your clouds or physical in your data centerallows you to gure out whether a slowdown was really caused by slow application code or an infrastructure/cloud provider problem. For our dynaTrace users If you want to know more about how to deploy dynaTrace in Windows Azure or Amazon EC2 check the following resources on the dynaTrace Community Portal: Windows Azure Best Practices and Tools, Amazon Account Monitor, Amazon EC2 FastPack Index Agile and continuous integration [20] Ajax troubleshooting guide [31] Architecture validation with dynaTrace [20] Automation AJAX Edition with Showslow [8] + AJAX Edition with Showslow [13] of cross-browser development [112] + of cross-browser development [183] of load test [245] of regression and scalability analysis [86] security with business transaction [122] to validate code in continuous integration [20] Azure hybrid with EC2 [388] Best Practices for Black Friday survival [332] for cross-browser testing [183] Microsoft not following [160] Business Transaction Management explained [275] over 1000+ JVMs [366] security testing with [122] Caching memory-sensitivity of [153] Cassandra garbage collection suspension [232] pagination with [352] Cloud and key-value stores [343] auto-scaling in [193] hybrid performance management [388] in the load test [249] + in the load test [250] inside horizontally scaling databases [232] pagination in horizontally horizontally scaling databases [353] public and private performance management [166] RDBMS versus horizontally scaling databases [261] Continuous Integration dynaTrace in [20] Cross-browser DOM case sensitivity [209] exceptional performance with browser plurality [304] Firefox versus Internet Explorer [202] Javascript implementation [160] page load optimization with UEM [238] stable functional testing [112] Database connection pool monitoring [315] DevOps APM in WebSphere [134] automatic error detection in production [213] business transaction management explained [275] control of page load performance [295] incorrect measurement of response times [177] managing performance of 1000+ JVMs [366] performance management in public and onpremise cloud [166] step-by-step guide to APM in production [77] + step-by-step guide to APM in production [98] top performance problems before Black Friday [320] troubleshooting response times [65] why SLAs on request errors do not work [258] why do APM in production [220] Dynamo key/value stores in [342] EC2 challenges of hybridizing [388] eCommerce top performance problems before Black Friday [320] user experience strong performers [332] Web 2.0 best practices in [183] Exception cost of [72] Firefox framework performance [40] versus Internet Explorer [202] Frameworks problems in Firefox with old versions [40] Garbage collection across JVMs [145] impact 
on Java performance [59] myths [37] Java major garbage collections in [145] memory problems in [92] + memory problems in [357] object caches and memory [153] performance management of 1000+ JVMs [366] serialization [25] jQuery in Internet Explorer [209] Load Testing cloud services [249] + cloud services 250 importance of [245] white box [86] Memory leaks [92] + leaks [357] sensitivity of object caches [153] Metrics incorrect [176] in production [98] of third party content [376] to deliver exceptional performance [304] Mobile server-side ramications on mobile [198] time to deliver exceptional performance [304] NoSQL Cassandra performance [231] or RDBMS [261] pagination with Cassandra [353] shard behavior [342] Page Load Time control of [295] reducing with caching [238] Production APM in WebSphere [134] automatic error detection in [213] managing 1000+ JVMs in [366] measuring a distributed system in [98] step-by-step guide to APM [77] why do APM in [220] RDBMS comparison with Cassandra [352] or NoSQL [262] Scalability automatically in the cloud [193] white box testing for [86] Security testing with business transactions [122] Selenium cross-browser functional web testing with [112] Serialization in Java [25] Server-side connection pool usage [315] performance in mobile [198] SLAs and synthetic monitoring [269] on request errors [257] Synthetic monitoring will it die [269] System metrics distributed [98] trustworthiness [65] Third-Party Content effect on page load [295] business impact of [310] minimizing effect on page load [376] Tuning connection pool usage [315] cost of an exception [72] garbage collection in the 3 big JVMs [145] myths about major garbage collections [37] serialization in Java [25] top Java memory problems [92] + top Java memory problems [357] why object caches need to be memory sensitive [153] worker threads under load [15] User Experience how to save 3.5 seconds of load time with [238] in ecommerce [310] on Black Friday [332] proactive management of [213] synthetic monitoring and [269] users as crash test dummies [245] Virtualization versus public cloud [166] Web Ajax troubleshooting guide [31] best practices dont work for single- page applications [45] eCommerce impact of address validation services [310] four steps to gain control of page load performance [376] frameworks slowing down Firefox [40] how case-sensitivity can kill page load time [209] how to save 3.5 seconds page load time [238] impact of garbage collection on Java performance [59] lessons from strong Black Friday performers [332] page load time of US Open across browsers [202] set up ShowSlow as web performance repository [8] why Firefox is slow on Outlook web access [160] will synthetic monitoring die [269] Web 2.0 automated optimization [183] testing and optimizing [46] WebSphere APM in [134]