Beyond the hype: how do you really put AI to work for ITOps?


Best practices to get the most out of new machine
learning and data analytics tools
ITOps: All you do is save the day…
and deliver the future
The world runs on IT operations. When new digital experiences delight customers,
innovative business models disrupt industries, and big data makes organizations smarter
than ever, it’s because of your success in supporting the complex, dynamic infrastructure
and apps that power them.

Lately that success has become more challenging to achieve.

You’re too busy for false alarms and duplicate events


Traditional monitoring tools can’t tell the difference between a routine spike and a growing
problem—or when multiple events stem from the same core issue. You need a way to filter out
the noise, zero in on root causes, and resolve issues fast.

Your systems should be as smart as you are


You know how to make sense of machine data—how to find patterns, diagnose problems, and
prevent future disruptions. The challenge is to do it at the scale and speed of digital business. It’s
time for your IT operations management tools to step up.

AIOps to the rescue


Artificial intelligence takes ITOM beyond operational event management to become smarter and
more proactive. Data analytics and machine learning help you anticipate how data, apps, and
systems will evolve and interact under various conditions. Together, they let you respond more
quickly, efficiently, and accurately to whatever your digital environment throws at you.

So, how do you put AIOps to work?

Rubber, meet road:
What's really possible?
AIOps puts machine learning and data analytics to work in a
number of different contexts to enable simpler, faster, and more
efficient IT operations management—without relying on static
thresholds. These include:

APM (application performance monitoring) – Taking a service-aware, user-centric approach to application performance

Behavioral learning (dynamic baselining) – Understanding which issues are likely to become actual problems so you know what to fix first

Predictive event management – Getting an early warning for potential problems before they impact users

Probable cause analysis – Correlating related events and anomalies so you can troubleshoot more accurately and address root causes more quickly

Log analytics – Using day-to-day log files to make the system even smarter about baselines and anomalies

In the following pages we’ll explore each of these in a little more depth.

The basics: Getting smarter about APM
Any successful AIOps strategy requires a solid foundation of ITOps best practices—and that begins with application monitoring.

The way we deliver applications is changing. We need a new way to monitor their performance, too. Goodbye to simple,
centralized application delivery infrastructures and standardized endpoints—hello multi-source delivery, hybrid and multi-cloud
environments, and a dizzying diversity of user devices.

As app developers and owners embrace DevOps and agile to speed innovation, it’s no longer enough for you to monitor for
availability, errors, and job completion times. Now you’ve got to make sure data is being processed correctly, identify problem-
causing code, and diagnose slow application response times.

What does APM need to look like today? Focus on the perspectives that matter most:

Your applications and services

Your infrastructure doesn’t drive your business; your applications and services do. By unifying monitoring data across your environment into a service-aware, app-centric view, you can better troubleshoot issues, find root causes, and prevent incidents from recurring.

Learn more about service-aware, app-centric APM

Your end users

Customers don’t care about your servers or routers; they care about the quality of service they’re getting. Critical metrics to monitor to ensure a great end user experience include availability, responsiveness, and usability.

Learn more about end user experience monitoring

Behavioral learning:
Is this issue really an issue?
How can you tell the difference between a routine fluctuation in utilization and a
problem in the making? Behavioral learning puts utilization metrics in the context of
normal activity to separate actual problems from the ups and downs of a typical week.

When you’re relying solely on a static threshold—say, 80 percent CPU utilization—you’ll get alerts
even for spikes that are perfectly understandable, like high CPU usage first thing Monday morning.
With AIOps, machine learning algorithms establish a learned baseline that reflects typical utilization
patterns over the course of the week. For example:

Monday mornings: 85–95 percent utilization
Thursday afternoons: 65–75 percent utilization
And so on.
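
How might a system arrive at bands like these? The source does not name an algorithm, so the following is only a minimal Python sketch under an assumed approach: group historical utilization samples by time slot, then take the mean plus or minus two standard deviations as the learned band. The function name `learn_baselines`, the slot labels, and the `width` parameter are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_baselines(samples, width=2.0):
    """Build a (low, high) utilization band per time slot.

    samples: iterable of (slot, utilization_percent) pairs, where slot is any
             hashable label such as "Mon-am". The labels are an assumption.
    width:   how many standard deviations the band extends on each side
             (an assumed tuning knob, not something the source specifies).
    """
    by_slot = defaultdict(list)
    for slot, value in samples:
        by_slot[slot].append(value)

    bands = {}
    for slot, values in by_slot.items():
        center = mean(values)
        spread = pstdev(values)
        # Clamp to the valid percentage range.
        low = max(0.0, center - width * spread)
        high = min(100.0, center + width * spread)
        bands[slot] = (low, high)
    return bands


# Illustrative history only; real input would span many weeks of samples.
history = [("Mon-am", 88), ("Mon-am", 92), ("Thu-pm", 70), ("Thu-pm", 70)]
bands = learn_baselines(history)
```

A production system would likely account for trend and seasonality in more sophisticated ways; the point here is only that the band is derived from observed behavior rather than set by hand.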

You still set a static threshold for each resource, but now you’re only alerted if utilization is both over
the threshold and beyond the learned baseline. Here’s how it plays out.

Monday morning:

CPU utilization hits 90 percent. Before sending an alert, the system pauses: “Hmm, 90 percent is pretty typical for Monday morning. I’ll only alert if we go past that 95 percent baseline.”

Thursday afternoon:

CPU utilization hits 90 percent again. This time, the system takes notice: “This is unusual for this time of the week; something clearly needs to be looked at. I’ll send an alert.”
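
The Monday and Thursday scenarios reduce to a two-part rule: alert only when utilization is over the static threshold and also above the top of the learned band. Here is a minimal Python sketch of that rule. The baseline figures and the 80 percent threshold come from the example in the text; the names `Baseline`, `LEARNED`, and `should_alert`, and the fallback for unknown time slots, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    low: float   # lower edge of the learned band, in percent
    high: float  # upper edge of the learned band, in percent


# Learned bands keyed by (weekday, part of day). Values are from the text's
# example; the dictionary layout itself is hypothetical.
LEARNED = {
    ("Mon", "morning"): Baseline(85.0, 95.0),
    ("Thu", "afternoon"): Baseline(65.0, 75.0),
}

STATIC_THRESHOLD = 80.0  # the static CPU threshold used in the text


def should_alert(utilization, slot, static_threshold=STATIC_THRESHOLD,
                 baselines=LEARNED):
    """Alert only if utilization exceeds BOTH the static threshold and the
    upper edge of the learned baseline for this time slot."""
    baseline = baselines.get(slot)
    if baseline is None:
        # Assumed fallback: with no learned band, use the static threshold alone.
        return utilization > static_threshold
    return utilization > static_threshold and utilization > baseline.high


print(should_alert(90.0, ("Mon", "morning")))    # False: typical for Monday
print(should_alert(90.0, ("Thu", "afternoon")))  # True: unusual for Thursday
```

Note that a Thursday reading of 78 percent is above the learned band but still under the static threshold, so this rule stays quiet.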

Fewer false alarms, clearer issue prioritization: AIOps is already making life better for ITOps.

Learn more about behavioral learning

Predictive event management:
Your crystal ball
The best time to solve a problem is before it becomes a problem.

When your environment is constantly changing (and whose isn’t?), it’s inevitable for a new configuration to trigger unintended consequences now and then. When that means flooding a key resource at a peak time so that performance suffers, you’re going to hear about it from customers and the business.

Predictive event management makes behavioral learning actionable. We’ve already talked about using machine learning to set baselines based on typical usage patterns, then watching for deviations from those norms so you can be alerted well before critical thresholds are reached.

That foresight allows you to take a more predictive and proactive approach to problem resolution. The interval between a deviation from the normal baseline and a disruption in service, typically about three hours, can give you time to address the issue before users even notice.
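
To make the idea of lead time concrete, here is a minimal Python sketch that estimates how long remains before a metric reaches its critical threshold. It assumes a simple approach that the source does not prescribe: fit a straight line through recent readings by least squares and extrapolate. The function name `minutes_until_threshold` and the sample readings are hypothetical.

```python
def minutes_until_threshold(readings, threshold):
    """Project when a rising metric will reach a critical threshold.

    readings:  list of (minute, value) pairs, ordered by time.
    threshold: the critical value to project toward.

    Returns the projected minutes from the last reading until the fitted
    line reaches the threshold, or None if the trend is flat or falling
    or there is too little data to fit a line.
    """
    n = len(readings)
    if n < 2:
        return None

    xs = [t for t, _ in readings]
    ys = [v for _, v in readings]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # all readings share one timestamp

    # Ordinary least-squares slope and intercept.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    if slope <= 0:
        return None  # not heading toward the threshold
    intercept = y_bar - slope * x_bar

    crossing_minute = (threshold - intercept) / slope
    return max(0.0, crossing_minute - xs[-1])


# Illustrative data: utilization rising 5 points every 30 minutes.
lead = minutes_until_threshold([(0, 60), (30, 65), (60, 70)], threshold=90)
```

A straight-line fit is the simplest possible forecast. It is used here only to show how a deviation detected early translates into time to act.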

Learn more about predictive event management

Probable cause analysis:
Zeroing in
You can spend all day chasing down the effects of a problem—
or you can go right to its cause and eliminate all those effects
in one blow.

In probable cause analysis, a data analytics process considers all the factors that can impact the performance of an application or service to help you determine what’s really to blame. When an issue arises, relevant events are ranked by their relationship to the initial event, the timeframe in question, and any anomalies flagged through behavioral learning.

For example, let’s say a server in your environment is running slowly due to
excessive processing time. Probable cause analysis reveals related events
that suggest an issue with the server’s memory. In all likelihood, it’s this
memory issue that’s slowing the server’s response to data requests.

Instead of a drawn-out troubleshooting exercise, you can use this AIOps-derived insight to go right after the guilty party. You address the memory issue, processor time goes back to normal, the server performs as it should, and your SLAs are safe.
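
The ranking step can be pictured as a scoring function over candidate events. The Python sketch below is illustrative only: the three inputs (relationship to the initial event, position in the timeframe, and anomaly flags) come from the text, but the `Event` fields, the weights, and the 60-minute window are assumptions, not a description of how any particular product scores events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    related: bool          # shares a host or service with the initial event
    minutes_before: float  # how long before the initial event it occurred
    anomalous: bool        # flagged by behavioral learning


def score(event, window_minutes=60.0):
    """Combine relationship, time proximity, and anomaly status into one score.
    The weights (2.0, 1.0, 1.5) are arbitrary illustrative choices."""
    if not 0 <= event.minutes_before <= window_minutes:
        return 0.0  # outside the timeframe in question
    proximity = 1.0 - event.minutes_before / window_minutes
    return 2.0 * event.related + 1.0 * proximity + 1.5 * event.anomalous


def rank_probable_causes(events, window_minutes=60.0):
    """Return in-window events ordered from most to least likely cause."""
    scored = [(score(e, window_minutes), e) for e in events]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [event for value, event in ranked if value > 0]


# Hypothetical events surrounding the slow-server example in the text.
candidates = [
    Event("disk latency warning", related=False, minutes_before=50, anomalous=False),
    Event("memory page faults high", related=True, minutes_before=5, anomalous=True),
    Event("backup job started", related=False, minutes_before=240, anomalous=False),
]
for event in rank_probable_causes(candidates):
    print(event.name)
```

With these made-up inputs, the memory event ranks first because it is related, recent, and anomalous, mirroring the server example above.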

Learn more about probable cause analysis

Log analytics:
Listening to your log data
The ability of machine learning to learn baselines and detect
anomalies can be applied to log files as well.

Log entries tend to follow fairly consistent patterns when your IT environment is healthy. When the number of log entries that match this pattern changes by an unusual amount, either up or down, you’ve got an “out-of-bound” anomaly that signals an issue in your IT environment.

Maybe a large number of users have lost access to a particular machine, or maybe their access has increased unexpectedly. In either case, there may well be configuration changes related to access control lists (ACLs) that you need to address.

In this case, machine learning offers yet another way to discover issues that
need to be addressed. You haven’t started getting complaints from users; you
haven’t crossed any utilization thresholds; you haven’t seen any red lights—
but something’s not right. By investigating and addressing the situation
now, you can keep this anomaly from becoming a real problem that affects
your business.
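
As a rough illustration of an “out-of-bound” check, the Python sketch below counts log lines that match a pattern and compares the count with previous intervals. The detection rule (more than three standard deviations from the historical mean, in either direction) is an assumption, as are the function names, the sample log lines, and the regular expression.

```python
import re
from statistics import mean, pstdev


def count_matches(lines, pattern):
    """Count log lines that match a regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


def is_out_of_bound(current_count, history, width=3.0):
    """Flag a count that is unusually high OR unusually low compared with
    the counts seen in earlier intervals.

    history: non-empty list of match counts from previous intervals.
    width:   allowed deviation in standard deviations (assumed value).
    """
    center = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly steady history: any change at all is unusual.
        return current_count != center
    return abs(current_count - center) > width * spread


# Hypothetical log lines and pattern for successful logins.
sample = [
    "sshd: Accepted password for alice",
    "sshd: Failed password for bob",
    "cron: job done",
]
logins_now = count_matches(sample, r"Accepted password")

# Hypothetical hourly login counts from a healthy period.
hourly_history = [100, 104, 96, 100]
print(is_out_of_bound(40, hourly_history))   # True: access dropped sharply
print(is_out_of_bound(101, hourly_history))  # False: within the usual range
```

The two-sided comparison matters: a sharp drop in successful logins is as much a signal as a sudden surge.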

Learn more about log analytics

CUSTOMER CASE STUDY

The new face of ITOM at Boston Scientific

Boston Scientific, an $8.4 billion enterprise, has been a global medical technology leader for over 35 years. Before its ITOM modernization initiative:

Fragmented monitoring made it hard to maintain availability, uptime, and performance across over 650 applications
A reactive, email-based approach to ticket creation slowed responsiveness and lengthened MTTR
Development teams had to monitor their own applications, diverting their focus from innovation

To increase speed, efficiency, and insight, the company created a new centralized digital operations center built on AI-powered TrueSight solutions from BMC.

A smarter approach to ITOM

The digital operations center staff uses TrueSight dashboards to monitor infrastructure and key applications at facilities around the world. Rich data analytics and machine learning help the team become more focused, proactive, and informed to ensure consistent performance and uptime.

The impact

Deeper visibility into the layers of each application, including network and database, accelerates troubleshooting and resolution to minimize downtime
A lean, AI-powered process makes it possible to detect and correlate events, take corrective action, automatically generate tickets, and alert the right people
Approximately one-third of critical tickets are intercepted by the operations center and addressed proactively, a share that continues to rise
Event response time has dropped to 15–30 minutes, with the right people fully informed

Learn more about AIOps-powered ITOM at Boston Scientific

Ensuring operational excellence with built-in intelligence

As digital transformation pushes the speed and complexity of ITOps to new levels, artificial intelligence has become more than just a useful innovation; it’s now crucial for survival and success. By building data analytics and machine learning into your ITOM toolset, you can:

Proactively predict and prevent issues before they impact your business
Cut false alarms so you can focus on what matters
Troubleshoot problems faster and more accurately to speed MTTR
Gain deeper insight from activity and events across your complex environment

Most importantly, AIOps helps you deliver the performance and availability your business needs, when it needs it, no matter how complex your environment becomes. That elevates the strategic importance and visibility of ITOps so the business sees you as the heroes you are, as it should be.

TrueSight is an AIOps platform that helps complex and growing enterprises reinvent how IT operations delivers fast, secure, and cost-effective services.

Learn more here

Continue your AIOps education
It’s a great time to be in IT operations. New AI capabilities have
the potential to deliver increased value to the business, with
machine learning and analytics applied to big data to deliver rich,
actionable insights that can transform IT operations.

More resources to drive your AIOps journey

Analyst Report
Read the report from Enterprise Management
Associates based on a survey of over 300 IT
decision makers: ‘AIOps and IT Analytics
at the Crossroads – What’s real today and
what’s most needed for tomorrow?’

Read the Report

AIOps Video
Watch the short video ‘Elevate IT Operations
with AIOps’ on how the TrueSight AIOps
platform is helping customers reduce event
remediation times, transform ITOps, and drive
digital transformation.

Watch the Video

About BMC
BMC helps customers run and reinvent their businesses with open, scalable, and modular solutions to complex IT problems. Bringing
both unmatched experience in optimization and limitless passion for innovation to technologies from mainframe to mobile to cloud
and beyond, BMC helps more than 10,000 customers worldwide reinvent, grow, and build for the future success of their enterprises.

BMC—The Multi-Cloud Management Company. www.bmc.com



BMC, BMC Software, the BMC logo, and the BMC Software logo are the exclusive properties of BMC Software Inc., are registered or pending registration with the U.S.
Patent and Trademark Office, and may be registered or pending registration in other countries. All other BMC trademarks, service marks, and logos may be registered
or pending registration in the U.S. or in other countries. All other trademarks or registered trademarks are the property of their respective owners. © Copyright 2018 BMC Software, Inc.
