Wednesday, March 7, 2012

APM for cloud apps: Next big challenge

As more and more companies push their apps to the cloud, the industry faces a new challenge: monitoring the performance of apps across various clouds and measuring SLAs for application performance in the cloud. I won’t be astounded if someone comes up with a new term, "ALAs - Application Level Agreements", in the near future. Because the cloud itself has to fulfill a composite set of expectations such as auto scaling, billing and on-demand provisioning, it is becoming difficult to isolate the root causes of application-level performance problems: the cloud has many layers to track, and applications are becoming more distributed, more loosely coupled and harder to manage. In such a complex ecosystem, tracking application performance is a real challenge for companies working in the sphere of APM.
How important is APM?
APM is not a revenue-generating solution for enterprises. However, it is at the core of IT services, as the thrust of APM is to fulfill SLAs and make sure the end-user experience is of the highest quality. In a highly competitive market, excellence plays the most important role. If you are not providing high-quality solutions to your customers, you are already out of the market. That is why APM plays a key role, and why companies are looking for proactive rather than reactive processes.
As one of Gartner's research reports puts it: The factors most responsible for the increased attention now being paid to the APM process and the tools and services supporting it do not come from IT, but from the business side of the enterprise, which has (during the past decade) fundamentally changed its attitude towards IT in general. Line of business and C-level executives now generally recognize that IT is not just infrastructure that supports background workflows, but is also, and more fundamentally, a direct generator of revenue and a key enabler of strategy.
Gartner has defined five dimensions of an APM solution:
1. End-User experience monitoring - The capture of data about how end-to-end application availability, latency, execution correctness and quality appeared to the end-user.
2. Runtime application architecture discovery, modeling and display - The discovery of the software and hardware components involved in application execution and the array of possible paths across which these components could communicate to enable that involvement.
3. User-defined transaction profiling - The tracing of events as they occur among the components or objects as they move across the paths discovered in the second dimension; this is generated in response to a user’s attempt to cause the application to execute what the user regards as a logical unit of work.
4. Component deep-dive monitoring in application context - The fine-grained monitoring of resources consumed by and events occurring within the components discovered in the second dimension.
5. Analytics - The marshaling of techniques, including behavior learning engines, complex-event processing (CEP) platforms, log analysis, and multidimensional database analysis to discover meaningful and actionable patterns in the typically large datasets generated by the first four dimensions of APM.
Suggested features for an APM supporting cloud apps
Besides the above dimensions of a typical APM solution, I suggest the following features for an APM solution supporting cloud apps:
1.     Support for multiple clouds and ability to provide a CMDB for cloud components.
2.     Agentless architecture to reduce performance monitoring side-effects.
3.     Integration capabilities with cloud-provided monitoring tools.
4.     Auto-sensing of application performance degradation and isolation of the cause from cloud.
5.     Integration capabilities with ITIL-compliant service desk/ IT operations management tools.
6.     Ability to define and translate application performance in terms of business SLAs.  
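Feature 4 above, auto-sensing of performance degradation, can be illustrated with a minimal sketch. It assumes a simple rolling baseline: a response time is flagged when it exceeds the mean of recent healthy samples by more than k standard deviations. The class name `DegradationDetector`, the window size and the value of k are illustrative assumptions, not taken from any particular APM product, and the sketch does not attempt the harder part of isolating the cause from the cloud.

```python
from collections import deque
from statistics import mean, pstdev


class DegradationDetector:
    """Flags response times that deviate strongly from a rolling baseline.

    Hypothetical illustration: window, k and min_samples are assumed values.
    """

    def __init__(self, window=20, k=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent healthy response times (ms)
        self.k = k
        self.min_samples = min_samples

    def observe(self, response_ms):
        """Return True if response_ms looks degraded versus the baseline."""
        degraded = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = pstdev(self.samples)
            # Use a floor of 1 ms so identical samples do not give a zero spread.
            limit = baseline + self.k * max(spread, 1.0)
            degraded = response_ms > limit
        if not degraded:
            # Only healthy samples update the baseline, so a sustained
            # slowdown does not quietly become the new "normal".
            self.samples.append(response_ms)
        return degraded


detector = DegradationDetector()
timings = [100, 105, 98, 102, 101, 99, 103, 400, 104]
flags = [detector.observe(t) for t in timings]
print(flags)  # only the 400 ms sample is flagged
```

A real tool would replace the fixed k with learned behavior and would correlate the flagged sample with events from the cloud layers to find the cause.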
To meet the five dimensions of Application Performance Management (End-User Experience Monitoring; Runtime Application Architecture Discovery, Modeling and Display; User-Defined Transaction Profiling; Component Deep-Dive Monitoring; and Analytics), a typical APM tool has the following component layers:
1.     Real Time Dashboard, Alerts and Notification Layer
2.     Application Monitoring Layer
3.     Network Monitoring and Management Layer
4.     Database Layer (Rule DB, Log DB, CMDB)
5.     Integration Layer (ESM integration, Infrastructure Integration and LDAP integration)
Following is the high level architecture diagram of a typical APM tool.

APM works on the FCAPS philosophy. FCAPS is an acronym for Fault Detection and Management, Configuration Management, Accounting, Performance Management and Security Management of the underlying infrastructure, which comprises various hardware, software and firmware components. It typically includes the central nervous system of your computing infrastructure (CNS = Compute, Network and Storage). APM becomes complex in a distributed environment where there are thousands of software and hardware components to manage, millions of events to monitor and thousands of rules to define.
A typical APM tool hooks into the underlying infrastructure by providing the relevant SDKs. For example, if your infrastructure uses VMware for virtualization, Oracle DB instances, IBM WebSphere, MS Exchange Server and so on, a good APM tool will be able to gather events from all these components and process the data against the defined rules to generate real-time dashboards, alerts and notifications. A good APM tool will also provide a facility to define performance indicators, since the definition of performance varies with context.
A good APM tool can be integrated with existing Enterprise Systems Management tools, so that it becomes part of the IT management ecosystem and doesn’t work in a silo.
An APM tool should also be able to generate dashboards periodically to address the needs of various business users such as business owners, technology owners and process owners.
Agent-Based and Agent-Less Architecture:
Most APM tools work on an agent-based architecture, wherein agents with a small footprint are deployed across distributed systems to listen for, capture and transmit event information to a central event DB. The agents are designed to keep the performance overhead as low as possible; otherwise there is a danger of the agents themselves becoming performance and security bottlenecks.
Agent-less architecture is based on a signal interception paradigm, wherein event information is gathered from existing system components by intercepting their events and logs and processing them to generate high-level dashboards.
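As a rough illustration of the agent-less idea, the sketch below derives events from log lines that a web server already writes, rather than from code deployed alongside the application. It assumes access-log lines in a common-log-style layout with the request time in seconds appended as the last field; that layout, the regular expression and the function names `parse_line` and `summarize` are assumptions made for illustration only.

```python
import re
from datetime import datetime

# Assumed line layout: host, timestamp, request, status, bytes, seconds taken.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ (?P<seconds>\d+(?:\.\d+)?)$'
)


def parse_line(line):
    """Turn one access-log line into an event dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "host": match["host"],
        "time": datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z"),
        "path": match["path"],
        "status": int(match["status"]),
        "response_ms": float(match["seconds"]) * 1000.0,
    }


def summarize(lines):
    """Aggregate events per path: request count, server errors, mean latency."""
    summary = {}
    for line in lines:
        event = parse_line(line)
        if event is None:
            continue  # skip lines in formats this sketch does not understand
        stats = summary.setdefault(
            event["path"], {"count": 0, "errors": 0, "total_ms": 0.0}
        )
        stats["count"] += 1
        stats["errors"] += int(event["status"] >= 500)
        stats["total_ms"] += event["response_ms"]
    for stats in summary.values():
        stats["mean_ms"] = stats["total_ms"] / stats["count"]
    return summary


sample = [
    '10.0.0.1 - - [07/Mar/2012:10:00:00 +0000] "GET /quote HTTP/1.1" 200 512 0.250',
    '10.0.0.2 - - [07/Mar/2012:10:00:01 +0000] "GET /quote HTTP/1.1" 500 128 0.750',
    'not a log line',
]
report = summarize(sample)
print(report["/quote"])
```

Nothing here runs inside the monitored application, which is the point of the approach; the trade-off is that the monitor only sees what the existing logs happen to record.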
What do various components of APM do?
1.     Network Monitoring Components:
a.     Fault Management Engine: Fault management is a set of functions that enable the detection, isolation, and correction of abnormal operation of the network.
b.     Configuration Management Engine: Configuration management provides functions to identify, collect configuration data from, exercise control over, and provide configuration data to network elements.
c.     Accounting Management Engine: Accounting management lets you measure the use of network services and determine costs to the service provider and charges to the customer for such use. It also supports the determination of charges for services.
d.     Performance Management Engine: Performance management provides functions to evaluate and report on the behavior of telecommunication equipment and the effectiveness of the network or network element. Its role is to gather and analyze statistical data for the purpose of monitoring and correcting the behavior and effectiveness of the network, network elements, or other equipment, and to aid in planning, provisioning, maintenance, and quality measurement.
e.     Security Management Engine: Security services provide authentication, access control, data confidentiality, data integrity, and nonrepudiation. Security event detection and reporting relays activities that may be construed as security violations (an unauthorized user, physical tampering with equipment) to higher layers of security applications.
2.     Information Collection and Transition Agents: These are listeners in various forms across the network which can intercept signals, process them and transmit them to a central repository for further processing. SNMP agents are one example. For those who do not know SNMP, it stands for Simple Network Management Protocol and is based on a simple request/response paradigm. http://www.wtcs.org/snmp4tpc/snmp.htm seems to be a good primer on this topic.
3.     Events Log and Data Processing Engine: Different systems generate logs in different formats. A log processor has to understand this variety of log formats, parse the logs, extract the required information and store it in the central DB for further processing.
4.     Rules Definition and Enforcement Engine: Dashboards, alerts and notifications have to be generated for various user groups. Each user group in the ecosystem will have its own way of seeing and interpreting data, for various business and technical reasons. To make log processing more meaningful, APM provides a way to define and enforce rules. For example, performance counters can be defined to benchmark the performance of the app against various business needs. A user trading shares online may want a response time of no more than 5 seconds, whereas a data entry operator at a university can afford to wait a few more seconds. The rule definition engine provides a way to create and edit rules dynamically, and can therefore drive alert/notification generation, SLA report generation and dashboard generation based on the defined benchmarks and thresholds.
5.     Dashboard, Notification and Alert Generation Engine: This is the top-level component in an APM tool, which generates user-friendly graphs, reports, alerts and notifications for the end users. It can also benchmark against its own history to generate comparison charts and projections of future performance, and to send warnings proactively.
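To make the rules engine in item 4 more concrete, here is a minimal sketch of per-user-group response-time rules. The 5-second limit for online traders comes from the example above; the 10-second limit for data entry operators, the severity labels, and the names `Rule` and `RuleEngine` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A response-time threshold for one user group (illustrative only)."""
    user_group: str
    max_response_s: float
    severity: str


class RuleEngine:
    def __init__(self):
        self._rules = {}

    def set_rule(self, rule):
        """Create or replace a rule at runtime, keyed by user group."""
        self._rules[rule.user_group] = rule

    def evaluate(self, user_group, response_s):
        """Return an alert message if the measurement breaks its rule, else None."""
        rule = self._rules.get(user_group)
        if rule is None or response_s <= rule.max_response_s:
            return None
        return (
            f"[{rule.severity}] {user_group}: {response_s:.1f}s exceeds "
            f"{rule.max_response_s:.1f}s"
        )

    def sla_compliance(self, user_group, measurements):
        """Percentage of measurements that met the group's threshold."""
        rule = self._rules[user_group]
        met = sum(1 for m in measurements if m <= rule.max_response_s)
        return 100.0 * met / len(measurements)


engine = RuleEngine()
engine.set_rule(Rule("online_trader", 5.0, "CRITICAL"))
engine.set_rule(Rule("data_entry_operator", 10.0, "WARNING"))  # assumed limit

trader_alert = engine.evaluate("online_trader", 7.0)
operator_alert = engine.evaluate("data_entry_operator", 7.0)
trader_sla = engine.sla_compliance("online_trader", [2.0, 4.5, 7.0, 3.0])
print(trader_alert)    # the same 7-second response breaks the trader rule...
print(operator_alert)  # ...but is acceptable for the data entry operator
print(trader_sla)
```

Keeping rules as data rather than code is what lets them be edited dynamically, and the same thresholds can feed both the alerting path and the SLA report.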

Why so much focus on APM?
The natural answer is market size and the complexity of cloud-based applications. APM is also, without doubt, expensive. Business owners need to understand which is more expensive: the cost of APM, or the cost of customers moving away from their business due to a lack of QoS. It’s a trade-off business owners need to understand, and it is surely going to be a headache for most of them. APM solutions are available in a wide range, in terms of both cost and features. There is no single answer to a question like "Which is the best APM tool for my needs?" Since I see the cloud as the future of IT, I see APM as the most critical area for IT managers to tame, and hence I expect companies to take a huge interest in APM, especially while moving their apps to the cloud. Moreover, IT services companies will have to rely heavily on good APM solutions to make sure they are delivering the right solution to their customers. I can foresee QA teams in IT and consulting services relying heavily on APM tools. In short, I can see a bright future for APM companies and for consultants working in this highly challenging area of specialization.

About Author: Satish Agrawal is Vice President - Cloud Computing at e-Zest Solutions Ltd. He has over 16 years of experience in the IT and software product engineering space and has built and implemented end-to-end cloud solutions for clients across geographies.
