APM for cloud apps: Next big challenge
As more and more companies push their apps to the cloud, the industry faces a
new challenge: monitoring the performance of apps across various clouds and
measuring the SLAs for application performance in the cloud. I won’t be
astounded if someone comes up with a new term like "ALAs - Application Level
Agreements" in the near future. Because the cloud itself has to fulfill a
composite set of expectations such as auto scaling, billing and on-demand
provisioning, it is becoming difficult to isolate the root causes of
application-level performance problems: the cloud has many layers to track, and
applications are becoming more distributed, more loosely coupled and harder to
manage. In such a complex ecosystem, tracking application performance is a real
challenge for companies working in the APM space.
How important is APM?
APM is not a revenue-generating solution for enterprises. However, it is at the
core of IT services, as the thrust of APM is to fulfill SLAs and make sure the
end-user experience is of the highest quality. In a highly competitive market,
excellence plays the most important role: if you are not providing high-quality
solutions to your customers, you are already out of the market. That is why APM
plays a key role, and why companies are looking for proactive processes instead
of reactive ones.
As one Gartner research report puts it, the factors most responsible for the
increased attention now being paid to the APM process and the tools and
services supporting it do not come from IT, but from the business side of the
enterprise, which has (during the past decade) fundamentally changed its
attitude towards IT in general. Line-of-business and C-level executives now
generally recognize that IT is not just infrastructure that supports background
workflows, but is also, and more fundamentally, a direct generator of revenue
and a key enabler of strategy.
Gartner has defined five dimensions of an APM solution:
1. End-User experience monitoring - The capture of data about how end-to-end
application availability, latency, execution correctness and quality appeared
to the end-user.
2. Runtime application architecture discovery, modeling and display - The
discovery of the software and hardware components involved in application
execution and the array of possible paths across which these components could
communicate to enable that involvement.
3. User-defined transaction profiling - The tracing of events as they occur
among the components or objects as they move across the paths discovered in the
second dimension; this is generated in response to a user’s attempt to cause
the application to execute what the user regards as a logical unit of work.
4. Component deep-dive monitoring in application context - The fine-grained
monitoring of resources consumed by and events occurring within the components
discovered in the second dimension.
5. Analytics - The marshaling of techniques, including behavior learning
engines, complex-event processing (CEP) platforms, log analysis, and multidimensional
database analysis to discover meaningful and actionable patterns in the
typically large datasets generated by the first four dimensions of APM.
Suggested features for an APM supporting cloud apps
Besides the above dimensions of a typical APM solution, I suggest the following
features for an APM solution supporting cloud apps:
1. Support for multiple clouds and ability to provide a CMDB for cloud
components.
2. Agentless architecture to reduce performance monitoring side-effects.
3. Integration capabilities with cloud-provided monitoring tools.
4. Auto-sensing of application performance degradation and isolation of the
cause from the cloud.
5. Integration capabilities with ITIL-compliant service desk/IT operations
management tools.
6. Ability to define and translate application performance in terms of business
SLAs.
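As a rough illustration of the last feature, here is a minimal Python sketch
of how raw response-time measurements might be translated into a business SLA
statement. The `SlaTarget` type, the `sla_compliance` function, the "checkout"
name, the sample timings and the 90% requirement are all assumptions made up
for this example; they are not taken from any particular APM product.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical business SLA: a share of requests must finish in time."""
    name: str
    max_response_seconds: float  # per-request response-time limit
    required_fraction: float     # e.g. 0.9 means 90% of requests must comply


def sla_compliance(response_times: Sequence[float],
                   target: SlaTarget) -> Tuple[float, bool]:
    """Return (fraction of requests within the limit, whether the SLA is met)."""
    if not response_times:
        raise ValueError("no measurements to evaluate")
    within = sum(1 for t in response_times if t <= target.max_response_seconds)
    fraction = within / len(response_times)
    return fraction, fraction >= target.required_fraction


# Made-up sample measurements, in seconds.
samples = [0.8, 1.2, 4.9, 6.3, 2.1, 3.0, 0.5, 7.5, 1.1, 2.2]
target = SlaTarget("checkout", max_response_seconds=5.0, required_fraction=0.9)

fraction, met = sla_compliance(samples, target)
print(f"{target.name}: {fraction:.0%} of requests within "
      f"{target.max_response_seconds}s, SLA met: {met}")
```

A real tool would evaluate this over a reporting window (per day or per month,
for example) rather than over a fixed list of samples.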
To meet the five dimensions of application performance management (end-user
experience monitoring; runtime application architecture discovery, modeling and
display; user-defined transaction profiling; component deep-dive monitoring;
and analytics), a typical APM tool has the following component layers:
1. Real-Time Dashboard, Alerts and Notification Layer
2. Application Monitoring Layer
3. Network Monitoring and Management Layer
4. Database Layer (Rule DB, Log DB, CMDB)
5. Integration Layer (ESM integration, infrastructure integration and LDAP
integration)
The following is the high-level architecture diagram of a typical APM tool.
APM works on the FCAPS philosophy. FCAPS is an acronym for fault detection and
management, configuration management, accounting, performance management and
security management of the underlying infrastructure, which comprises various
hardware, software and firmware components. This typically includes the central
nervous system of your computing infrastructure (CNS = compute, network and
storage). APM becomes complex in a distributed environment, where there are
thousands of software and hardware components to manage, millions of events to
monitor and thousands of rules to define.
A typical APM tool hooks into the underlying infrastructure by providing the
relevant SDKs. For example, if your infrastructure uses VMware for
virtualization, Oracle DB instances, IBM WebSphere, MS Exchange Server and so
on, a good APM tool will be able to gather events from all these components and
process the data against the configured rules to generate real-time dashboards,
alerts and notifications. A good APM tool will also provide a facility to
define performance indicators, as the definition of performance varies from
context to context.
A good APM tool will be able to integrate with existing enterprise systems
management (ESM) tools so that it becomes part of the IT management ecosystem
and doesn’t work in a silo.
An APM tool should also be able to generate dashboards periodically to address
the needs of various business users such as business owners, technology owners
and process owners.
Agent-Based and Agent-Less Architecture:
Most APM tools work on an agent-based architecture, in which agents with a
small footprint are deployed across distributed systems to listen for, capture
and transmit event information to a central event DB. The agents are designed
to keep the performance overhead as low as possible; otherwise there is a
danger of the agents themselves becoming performance and security bottlenecks.
Agent-less architecture is based on the signal interception paradigm: event
information is gathered from existing system components by intercepting their
events and logs, which are then processed to generate high-level dashboards.
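To make the agent-less idea a little more concrete, the Python sketch below
reads log lines that existing components already produce and normalizes them
into a common event record, without installing anything on the monitored
systems. Both log formats, the `Event` type and the function names are
hypothetical and invented purely for illustration; real products define their
own formats and collection mechanisms.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Event:
    source: str         # which kind of component produced the line
    operation: str      # what was executed
    duration_ms: float  # how long it took, normalized to milliseconds


# Hypothetical log formats, invented for this sketch:
#   web server: "GET /orders 200 153ms"
#   database:   "query=select_orders elapsed=0.25s"
WEB_PATTERN = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+(?:\.\d+)?)ms$"
)
DB_PATTERN = re.compile(r"^query=(?P<name>\S+) elapsed=(?P<sec>\d+(?:\.\d+)?)s$")


def parse_line(line: str) -> Optional[Event]:
    """Turn one raw log line into an Event, or None if it is not recognized."""
    line = line.strip()
    match = WEB_PATTERN.match(line)
    if match:
        operation = f"{match['method']} {match['path']}"
        return Event("web", operation, float(match["ms"]))
    match = DB_PATTERN.match(line)
    if match:
        return Event("db", match["name"], float(match["sec"]) * 1000.0)
    return None


def collect(lines: Iterable[str]) -> Iterator[Event]:
    """Yield events for all recognizable lines, skipping everything else."""
    for line in lines:
        event = parse_line(line)
        if event is not None:
            yield event


log_lines = [
    "GET /orders 200 153ms",
    "query=select_orders elapsed=0.25s",
    "a line in some unknown format",
]
events = list(collect(log_lines))
for event in events:
    print(event)
```

In a real deployment the lines would come from log files, syslog streams or
network taps rather than from an in-memory list.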
What do the various components of an APM tool do?
1. Network Monitoring Components:
a. Fault Management Engine: Fault management is a set of functions that enable
the detection, isolation, and correction of abnormal operation of the network.
b. Configuration Management Engine: Configuration management provides functions
to identify, collect configuration data from, exercise control over, and
provide configuration data to network elements.
c. Accounting Management Engine: Accounting management lets you measure the use
of network services and determine the costs to the service provider and the
charges to the customer for such use.
d. Performance Management Engine: Performance management provides functions to
evaluate and report on the behavior of telecommunication equipment and the
effectiveness of the network or network element. Its role is to gather and
analyze statistical data for the purpose of monitoring and correcting the
behavior and effectiveness of the network, network elements, or other
equipment, and to aid in planning, provisioning, maintenance, and quality
measurement.
e. Security Management Engine: Security management provides authentication,
access control, data confidentiality, data integrity, and nonrepudiation
services. It also provides security event detection and reporting, which
reports activities that may be construed as a security violation (such as an
unauthorized user or physical tampering with equipment) to higher layers of
security applications.
2. Information Collection and Transition Agents: These are listeners in various
forms across the network which can intercept signals, process them and transmit
them to a central repository for further processing. SNMP agents are one
example. For those who do not know SNMP, it stands for Simple Network
Management Protocol and is based on a simple request/response paradigm.
http://www.wtcs.org/snmp4tpc/snmp.htm seems to be a good primer on this topic.
3. Events Log and Data Processing Engine: Different systems generate logs in
different formats. A log processor has to understand this variety of log
formats: it parses the logs, extracts the required information and stores it in
the central DB for further processing.
4. Rules Definition and Enforcement Engine: Dashboards, alerts and
notifications have to be generated for various user groups. Each user group in
the ecosystem will have its own way of seeing and interpreting data for various
business and technical reasons. To make log processing more meaningful, APM
provides a way to define and enforce rules. For example, performance counters
can be defined for benchmarking the performance of the app against various
business needs. A user trading shares online may want a response time of no
more than 5 seconds, whereas a data entry operator at a university can afford
to wait a few more seconds. The rule definition engine provides a way to create
and edit rules dynamically, and so can drive alert/notification generation, SLA
report generation and dashboard generation based on the set benchmarks and
thresholds.
5. Dashboard, Notification and Alert Generation Engine: This is the top-level
component in an APM tool, which generates user-friendly graphs, reports, alerts
and notifications for the end users. It also has the capability to benchmark
itself in order to generate comparison charts and projections of future
performance, and to send warnings proactively.
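The rules engine described in item 4 can be sketched in a few lines of Python.
This is only a toy model under stated assumptions: the `Rule` and `RuleEngine`
names are hypothetical, the 5-second limit for the online trader comes from the
example above, and the 10-second limit for the data entry operator is an
assumed value standing in for "a few more seconds".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    user_group: str
    max_response_seconds: float  # threshold above which an alert is raised


class RuleEngine:
    """Toy rule store: one response-time rule per user group."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def set_rule(self, rule: Rule) -> None:
        """Create a rule, or replace the existing rule for the same group."""
        self._rules[rule.user_group] = rule

    def evaluate(self, user_group: str, response_seconds: float) -> List[str]:
        """Return alert messages for one measurement (empty if within limits)."""
        rule = self._rules.get(user_group)
        if rule is None or response_seconds <= rule.max_response_seconds:
            return []
        return [
            f"ALERT {user_group}: {response_seconds:.1f}s "
            f"exceeds {rule.max_response_seconds:.1f}s"
        ]


engine = RuleEngine()
engine.set_rule(Rule("online_trader", 5.0))
engine.set_rule(Rule("data_entry_operator", 10.0))  # assumed threshold

# The same 7-second response is a problem for one group but not the other.
trader_alerts = engine.evaluate("online_trader", 7.0)
operator_alerts = engine.evaluate("data_entry_operator", 7.0)
print(trader_alerts)
print(operator_alerts)

# Rules can be edited at runtime; tightening the limit changes the outcome.
engine.set_rule(Rule("data_entry_operator", 6.0))
tightened_alerts = engine.evaluate("data_entry_operator", 7.0)
print(tightened_alerts)
```

The point of the design is that the thresholds live in data rather than in
code, so each user group can get its own benchmark without changing the
monitoring logic.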
Why so much focus on APM?
The natural answer is the market size and the complexity of cloud-based
applications. APM is also, without doubt, an expensive undertaking. Business
owners need to understand which is more expensive: the cost of APM or the cost
of customers moving away from their business due to poor QoS. It’s a trade-off
business owners need to understand, and it is surely going to be a headache for
most of them. APM solutions are available in a wide range, in terms of both
cost and features. There is no single answer to the question "Which is the best
APM tool for my needs?" Since I see the cloud as the future of IT, I see APM as
the most critical area for IT managers to tame, and hence I expect companies to
take a huge interest in APM, especially while moving their apps to the cloud.
Moreover, IT services companies will have to rely heavily on good APM solutions
to make sure they are delivering the right solution to their customers. I can
foresee the QA teams in IT and consulting services relying heavily on APM
tools. In short, I can see a bright future for APM companies and the
consultants working in this highly challenging area of specialization.
About Author: Satish Agrawal is Vice President - Cloud Computing at e-Zest
Solutions Ltd. He has over 16 years of experience in the IT and software
product engineering space and has built and implemented end-to-end cloud
solutions for clients across geographies.