ITSI Seattle: Service Intelligence Hands-On Workshop


Setup Before You Can Play

1. Download this presentation slide deck: https://splunk.box.com/v/Chicago-HandsOn-Workshop

2. If you have not done so already, sign up for the FREE Splunk ITSI Online Sandbox:
• http://splunk.com/itsi
• Select "Free Online Sandbox"

3. Please test access to your sandbox:
• Chrome, Firefox, Safari are recommended
• IE is NOT recommended

4. After logging in, select IT Service Intelligence from the list of apps at the left

Copyright © 2016 Splunk Inc.

Building Business Service Intelligence with Splunk IT Service Intelligence
Thursday October 27, 2016

Tom Harrop, IT Operations Specialist
Michael Donnelly, ITOA Architect
Agenda
• Introductions and Set Up
• Splundamentals – IT Troubleshooting with Splunk
• What is IT Service Intelligence?
• Service Intelligence Design Practices
• Let's Play!
• What's Next?
• Happy Hour!
Safe Harbor Statement
During the course of this presentation, we may make forward looking statements regarding future events
or the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results could
differ materially. For important factors that may cause actual results to differ from those contained in our
forward-looking statements, please review our filings with the SEC. The forward-looking statements
made in this presentation are being made as of the time and date of its live presentation. If reviewed
after its live presentation, this presentation may not contain current or accurate information. We do not
assume any obligation to update any forward looking statements we may make. In addition, any
information about our roadmap outlines our general product direction and is subject to change at any
time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.

Defining Service Intelligence
• Enabling a business-aware IT: measuring and reporting on indicators that matter
• Unlocking operational efficiencies: collaborating across silos to improve service operations
• Data-based decision making: solving problems and anticipating pitfalls with sophisticated analytics and powerful insights
Key Takeaways

1 Build on what you are already doing with Splunk

2 Service Intelligence design and configuration practices

3 What is possible with Splunk IT Service Intelligence


Splundamentals – IT Troubleshooting with Splunk
Challenging Traditional Methods
[Diagram: the traditional monitoring stack]
• Layers: Business Layer, Service Layer, Application Layer, Infrastructure Layer (Server, Storage, Network)
• Techniques shown: Synthetic APM, Byte Code Instrumentation, Adaptive Thresholding (figure labels: 74%, -36%), Aggregation/Correlation/Visualization, Service Model definition & Correlation Engine
• Products shown: HP Run-Time Service Model, CA Service Operations Insight, IBM NetCool/Omnibus
Challenges
• Too many disparate components
• Difficult to define Service Model
• Labor intensive
• Most implementations fail
• Very important source is missing! (machine data)
Data-Defined & Driven Service Insights
[Diagram: MACHINE DATA feeding Service Intelligence on a Data Fabric Platform]
• Synthetic APM: Availability, Capacity, User Experience
• Application Layer (Byte Code Instrumentation): Usage, Experience, Performance, Quality
• Adaptive Thresholding (figure labels: 74%, -36%): Apps, Services, Systems
• Infrastructure Layer, Server: Performance, Usage, Dependency
• Infrastructure Layer, Storage: Utilization, Capacity, Performance
• Infrastructure Layer, Network: Packet, Payload, Traffic, Utilization, Perf
Splunk> is the missing link
• Data Fidelity
• Single Repository for ALL data
• Easier to Manage Services
• Reduced Integrations
• Reduced Point Solutions
• Collaborative Approach
• Quick time to value
Splunk Approach to Machine Data
Traditional: Schema on Write (SQL, ETL, RDBMS, structured data)
• Define static schema
• ETL into schema
• Enrich at write
• New data = new columns
• New questions = new columns
• “Data at rest” (delayed info)
• Labor intensive & time consuming
• Ideal for Reporting
Splunk: Schema on Read (Search, Universal Indexing, unstructured data; Volume, Velocity, Variety)
• “Schema-on-the-Fly”
• Data in native format
• Enrich at read
• New data = no changes needed
• New questions = no changes needed
• “Data in motion” (Real time)
• Fast time to value
• Ideal for Investigation
Listen to your data
Let’s take a closer look at IT troubleshooting with Splunk

Machine learning-powered analytics for real-time service insights, simplified operations and root-cause isolation
IT Service Intelligence Value Stack
ML
• Adaptive Threshold
• Behavior Anomaly
• Correlates Data into Knowledge
ITSI Service Model
• Visualizes entire stack
• View the entire Ecosystem
• 3 clicks to get the answer versus 10

• Accelerators
• Trend aggregation
• Multi KPI Alerts

• Time Series Index
• Schema on Read
• Data Model
The possibilities for Business…
The possibilities for IT Operations…

Service Health
Buttercup Games Example
What is a Service?

Service
Requests
Responses

In ITSI, a Service is a logical group of technology components that a user deems need to be monitored together.
It can often be generalized as a “black box” to which we send requests and from which we expect responses.
What is a Service?
Technical Services (each receives Requests and returns Responses)
• DNS
• Auth
• Web

Services can be lower level (technical) …
What is a Service?
Technical Services (each receives Requests and returns Responses)
• DNS
• Auth
• Web
Business Services
• Order Entry: Volume, Revenue
• Customer Care: Requests, SLA Compliance

Services can also be higher level (business) …
What is a Service?

[Diagram: tiers that one Service can span]
• Customer Transactions
• Business Function
• API Services
• API/Middleware
• Web Services
• Mobile
• RDBMSs
• Hypervisor and Hosts
• DNS
• Storage Tier
• Packet Network

Services can encompass multiple tiers of the IT domain.
Services may also depend upon other services.
What is a KPI?
DNS
• KPI: Request volume
• KPI: Error rate
• KPI: Average response time
• KPI: Server CPU load
• KPI: Configuration changes
Customer Transactions
• KPI: Transaction volume
• KPI: Error rate
• KPI: Average response time
• KPI: Max response time
• KPI: Count of Change records
Business Function
• KPI: Business volume
• KPI: Error rate
• KPI: Revenue rate
• KPI: Conversion rate
• KPI: Count of Incident tickets

KPIs and Health scores constitute the means by which Services are monitored.
Key Performance Indicators (KPIs)

A Key Performance Indicator (KPI) is powered by a Splunk search in ITSI that monitors a specific attribute like CPU utilization, Response Time, Number of Errors and so on. KPIs are contained within Services to measure their health.
Service Health Scores

A Health score is a score from 0 to 100 (0 being critical and 100 being normal) that measures the health of a Service. It is calculated once every minute from the importance and status (e.g. green, orange, red) of all of the Service's KPIs.
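
The deck does not give the formula ITSI uses, so the following is only a sketch of the idea: an importance-weighted average of per-KPI status scores. The status-to-score mapping, the weights and the function name are all assumptions made up for illustration.

```python
# Hypothetical mapping from KPI status color to a 0-100 score (assumed values).
STATUS_SCORE = {"green": 100, "orange": 50, "red": 0}


def health_score(kpis):
    """Importance-weighted average of KPI status scores, on a 0-100 scale.

    kpis: list of (status, importance) pairs; importance is a non-negative
    weight. Illustrative formula only, not ITSI's own calculation.
    """
    total_weight = sum(importance for _, importance in kpis)
    if total_weight == 0:
        return 100.0  # nothing weighted: treat the service as normal
    weighted = sum(STATUS_SCORE[status] * importance for status, importance in kpis)
    return weighted / total_weight


# Made-up example: three KPIs with different importance values.
SAMPLE_KPIS = [("green", 5), ("orange", 3), ("red", 2)]
print(health_score(SAMPLE_KPIS))
```

A heavily weighted red KPI pulls the score down much more than a lightly weighted one, which is the behavior the slide describes.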

Splunk IT Service Intelligence
Let’s take a closer look at Service Intelligence with Splunk

Service Intelligence Design Practices
Best Practices for Service Intelligence

• Start With a Problem Worth Solving
• Bring Subject Experts Together
• Design Before Configuring
Start With A Problem Worth Solving

Review your organization’s critical services

Identify a service that has impactful and measurable challenges
Buttercup Games – How Can We Help?

Manufacturer of toys and games

Desire to improve supply chain efficiency and customer satisfaction

New online store has issues that impact customer experience and revenue
The Business Problem for Buttercup Games
• Supply Chain: Limited Visibility
• ERP Systems: Frequent Bottlenecks
• Online Store: Poor Customer Satisfaction
• Failed Interactions: War rooms, 32 hrs/wk
• Business Impact: $48,000/wk in revenue loss
Bring Subject Experts Together

Identify stakeholders and support personnel for the selected service

Create awareness and invite their collaboration to solve the business challenge
Your Service Intelligence Collaborators
Service Owners
• Business functions
• Performance indicators
• Common business issues
• Frequency of issues
• Business impact of issues
Operations and Support
• Common issues
• Performance indicators
• Resolution processes
• Tools used for resolving issues
• Frequency of issues
• IT impact of issues
Enterprise Architecture
• Business processes
• Key inputs and outputs
• Technology architecture
• Data architecture
• Common issues
Administrators
• Current tools usage, and adoption levels
• Splunk expertise
• Environment expertise
• Personal pain
Design Before Configuring

Identify pains, performance indicators and measurement goals for the service

Identify components and data needed to drive service insights

Consolidate the mappings into an enterprise process/IT services map
Service Intelligence Goals for Buttercup Games
• Supply Chain: Limited Visibility
• ERP Systems: Frequent Bottlenecks
• Online Store: Poor Customer Satisfaction
• Failed Interactions: War rooms, 32 hrs/wk
• Business Impact: $48,000/wk in revenue loss

GOAL 1: Continuous improvement through visibility to key indicators of supply chain performance
GOAL 2: Increase customer satisfaction and reduce cost through fewer failures and restoration activities
Service Intelligence Design – Buttercup Games
Service Layer: Supply Chain
• ServiceHealth
• Incidents/Changes
• Customer Satisfaction
Business Layer: Order Entry, Manufacturing, Shipping Fulfillment
• Total Orders, Service Level, Total Revenue
• Unit Count, Delivery Time, Unit Failures
Application Layer: Online Store, EDI
• Online Orders, Online Revenue, Response Time
Application Layer: Web Tier, Middleware
• HTTP Hits, Error Rate
• Response Time, Error Rate
• Response Time, Storage Free
Infrastructure Layer
• CPU Load, Memory Used, Disk Used, IO Latency (shown twice in the diagram)
Service Decomposition
Service Layer
• Business Service
Business Layer
• Mail Transport, Order Processing, E-Commerce, Financials
Application Layer
• Middleware, Application Server, Database, Custom Apps
Infrastructure Layer
• Power / Cooling / Facilities
• Server, Networking, Storage
Service Intelligence Design in ITSI
1. High-value business services
• Buttercup Games Online Store and Supply Chain

2. Major business functions
• Order Entry, Manufacturing, Shipping Fulfillment

3. Supporting services
• Web, Middleware, Database

4. Relevant KPIs for each service
• (Database: errors, SQL hits, …)

5. Splunk search for each KPI
• (index=DB (warn* OR error*) | stats count)
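
To make the logic of that KPI search concrete, here is a rough Python sketch of what `(warn* OR error*) | stats count` computes over a handful of events. It is a simplification: Splunk's own tokenization and matching rules are richer, and the sample events are invented for the example.

```python
import re


def count_warn_or_error(events):
    """Count events that contain a word starting with 'warn' or 'error'.

    Mimics the idea of `(warn* OR error*) | stats count` with a simple,
    case-insensitive word match. Not a faithful model of Splunk's search.
    """
    prefixes = ("warn", "error")
    count = 0
    for event in events:
        words = re.findall(r"\w+", event.lower())
        if any(word.startswith(prefixes) for word in words):
            count += 1
    return count


# Invented sample events standing in for the contents of index=DB.
SAMPLE_EVENTS = [
    "2016-10-27 14:00:01 ERROR db timeout",
    "2016-10-27 14:00:02 INFO commit ok",
    "[warning] slow query detected",
    "connection closed normally",
]
print(count_warn_or_error(SAMPLE_EVENTS))
```

The single number the search returns is what a KPI then evaluates against its thresholds.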

Service Decomposition – Buttercup Games
Service Layer
• Supply Chain
Business Layer
• Order Entry, Manufacturing, Shipping Fulfillment
Application Layer
• Online Store, EDI
• Web Tier, Middleware
Infrastructure Layer
Putting It All Together
Service Layer: Supply Chain
• ServiceHealth
• Incidents/Changes
• Customer Satisfaction
Business Layer: Order Entry, Manufacturing, Shipping Fulfillment
• Total Orders, Service Level, Total Revenue
• Unit Count, Delivery Time, Unit Failures
Application Layer: Online Store, EDI
• Online Orders, Online Revenue, Response Time
Application Layer: Web Tier, Middleware
• HTTP Hits, Error Rate
• Response Time, Error Rate
• Response Time, Storage Free
Infrastructure Layer
• CPU Load, Memory Used, Disk Used, IO Latency (shown twice in the diagram)
Typical Data Sources
Service Layer (Supply Chain) and Business Layer (Order Entry, Manufacturing, Shipping Fulfillment)
• Application Logs
• Corporate Databases
• Service Management
Application Layer (Online Store, EDI, Web Tier, Middleware)
• Application Logs
• Webserver Logs
• DB Perf Counters
• Wire data
Infrastructure Layer
• Perf Counters
• Access Logs
• Network Logs

Let’s Play!

Setting up Service Intelligence


Service Visibility in ITSI

CLICK “Glass Tables”
Service Visibility in ITSI

CLICK (open in new tab) “Buttercup Games Business Process (IN PROGRESS)”
Service Visibility in ITSI

CLICK (open in new tab) “Buttercup Games Online Store”
Goal 1: Supply Chain Visibility

Goal 2: Online Store Process Flow

New Requirements!
● Create a new KPI for the DB Service: Network Utilization
● Modify the Executive Glass Table in order to show off the services you slave over

“WE only have about 15min TO DO WHAT ???!!???”
Think about how long this would take you today?
Configuration of DB Service

Click Configure > Services
Let’s Talk Entities

● Select DB Service
● Entities are the relevant things which support this service (usually hosts)
● Select the right entities with filters, ANDs, ORs
● Original Entity list can come from CMDB, spreadsheet, Splunk search, others
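
As a sketch of how AND/OR filters narrow an entity list, the snippet below treats each rule group as a set of field values that must all match (AND) and selects an entity if any group matches (OR). The field names, host names and rule format are hypothetical and do not reflect ITSI's internal representation.

```python
def matches(entity, rule_groups):
    """True if the entity satisfies ANY group; a group requires ALL its rules.

    rule_groups: list of dicts mapping a field name to a required value.
    Groups are ORed together; rules inside one group are ANDed.
    """
    return any(
        all(entity.get(field) == value for field, value in group.items())
        for group in rule_groups
    )


def select_entities(entities, rule_groups):
    """Return the host names of the entities selected by the rule groups."""
    return [e["host"] for e in entities if matches(e, rule_groups)]


# Invented entity list, e.g. as it might be imported from a CMDB or spreadsheet.
SAMPLE_ENTITIES = [
    {"host": "db-01", "role": "database", "site": "east"},
    {"host": "db-02", "role": "database", "site": "west"},
    {"host": "web-01", "role": "web", "site": "east"},
]
# (role=database AND site=east) OR (host=db-02)
SAMPLE_RULES = [{"role": "database", "site": "east"}, {"host": "db-02"}]
print(select_entities(SAMPLE_ENTITIES, SAMPLE_RULES))
```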

A KPI in 5 minutes? Absolutely!

Click New – Generic KPI
Call it “Network Utilization”, with your username up front

Select Data Model:
● Host Operating System
● Network
● # bytes
● Next
KPIs Continued….
● Select Yes for Split by & Filter options
● Select host for Entity Lookup & Alias options
● Click Next

Splunk Builds Searches for you. Oh Yeah, that’s happening :)
Almost There…
Select
● KPI Search Schedule: Every Minute
● Entity Calculation: Average
● Service/Agg Calculation: Average
● Calculation Window: Last Minute
● Click Next

● Unit: Bps
● Click Next
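
The settings above can be pictured with a small sketch: readings from the last minute are averaged per entity (host), and an aggregate value is computed for the service. The aggregate here is taken as the average of all readings, which is an assumption; the deck does not say how ITSI combines them. Host names and values are invented.

```python
from statistics import mean


def kpi_values(samples):
    """samples: list of (host, bytes_per_second) readings from the last minute.

    Returns (per_entity, aggregate): the average reading per host, and the
    average of all readings across the service (assumed aggregation rule).
    """
    by_host = {}
    for host, value in samples:
        by_host.setdefault(host, []).append(value)
    per_entity = {host: mean(values) for host, values in by_host.items()}
    aggregate = mean([value for _, value in samples])
    return per_entity, aggregate


# Invented readings for two database hosts.
SAMPLE_READINGS = [("db-01", 100), ("db-01", 300), ("db-02", 500)]
print(kpi_values(SAMPLE_READINGS))
```

The per-entity values are what "Per Entity" thresholds act on, while the aggregate value feeds the "Aggregate" threshold.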

Final Steps …
Set your thresholds:
● Aggregate (All)
● Per Entity
● Click “Add Threshold” TWICE
● Make the Neapolitan ice cream colors
Yellow, Green, Yellow
● Drag the sliders around in order to get
the current data graph entirely inside the
Green (normal) band
● Click Finish
● Other options are also available,
including adaptive thresholds and
anomaly detection
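
Adding two thresholds creates three bands. A minimal sketch of how a value would be classified against the Yellow, Green, Yellow layout follows; the boundary handling (lower bound inclusive, upper bound exclusive) and the example numbers are assumptions.

```python
def severity(value, lower, upper):
    """Classify a KPI value against two static thresholds.

    Values in [lower, upper) are 'green' (normal); anything below or above
    falls in a 'yellow' band, matching the Yellow, Green, Yellow layout.
    """
    if lower <= value < upper:
        return "green"
    return "yellow"


# Example with made-up thresholds of 10 and 100 Bps.
for reading in (5, 50, 150):
    print(reading, severity(reading, 10, 100))
```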

Adaptive Thresholds
What if your KPI data looks like this?

Adaptive Thresholds
Static thresholds will not work…

Adaptive Thresholds
Adaptive Thresholding works beautifully with cyclical (and other dynamic) data
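
One way to picture the idea, without claiming it is the algorithm ITSI uses, is a separate "normal" band per hour of day derived from history. The sketch below uses mean plus or minus k standard deviations; the choice of statistic, the value of k and the sample history are all assumptions.

```python
from statistics import mean, pstdev


def hourly_bands(history, k=2.0):
    """history: list of (hour_of_day, value) pairs from past data.

    Returns {hour: (low, high)}, where each band is the mean of the values
    seen at that hour plus or minus k population standard deviations.
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    bands = {}
    for hour, values in by_hour.items():
        center = mean(values)
        spread = pstdev(values)
        bands[hour] = (center - k * spread, center + k * spread)
    return bands


def is_normal(bands, hour, value):
    """True if the value lies inside the band learned for that hour."""
    low, high = bands[hour]
    return low <= value <= high


# Invented cyclical history: quiet at 03:00, busy at 15:00.
SAMPLE_HISTORY = [(3, 10), (3, 12), (3, 14), (15, 100), (15, 110), (15, 120)]
SAMPLE_BANDS = hourly_bands(SAMPLE_HISTORY)
print(is_normal(SAMPLE_BANDS, 15, 100), is_normal(SAMPLE_BANDS, 3, 100))
```

The same reading of 100 is normal in the afternoon and abnormal at night, which a single static threshold cannot express.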

Anomaly Detection

● Machine Learning
● Works well for data with patterns
● Requires some “training” (trial & error)
to zero in on best sensitivity
● More sophisticated capabilities coming!
(multivariate, more algorithms, etc)

Let’s Fix that Glass Table

Clone the Glass Table
Return to Saved Glass Tables page
(click on Glass Tables in the upper menu bar)

CLICK Edit for “Buttercup Games Business Process (IN PROGRESS)”
• Select Clone
• Title: Add your username to the front
• Permissions: Shared in App
• Click Clone Page

• Click on your new Glass Table from the list, to view it
Edit & Have Fun!
Click on Edit in the upper right corner of your Glass Table

Use the “Services” panel on the left to select Individual KPIs, or Aggregate Service Health Scores
• Choose 2 KPIs from Online Store that would be useful in the “Order Process” section
• Drag the selected widgets onto the canvas, positioning in the gray oval
• What’s the difference between the [icon] and [icon] tools at the top left?
More Fun with the Glass Table Editor…
Use the Configurations panel on the right to edit a
selected widget
• Can change the visualization type, drilldown
behavior, and other settings

• You should hit Save frequently
• Revert All Changes can be helpful, occasionally
Finishing up …
• Add a ServiceHealthScore widget for Online
Store under Buttercup
• Choose a Viz Type with a sparkline graph, then
resize to make it look pretty
• Modify the Custom Drilldown action to go to
the saved glass table,
Buttercup Games Online Store
• Bonus Points: Make the label bigger, more
readable

• Click Save
• View when done


Let’s Play!

A Troubleshooting Exercise
A Troubleshooting Exercise
Let’s use ITSI to troubleshoot an outage
● Start at your Glass Table, “<UserName> Buttercup Business Process”
● Customer Care reports that unhappy customers are complaining of failures
and long delays when trying to purchase
● The calls began coming in at around the top of the last hour.
● In the upper right corner of the Glass Table, change the time picker from Now
to XX:00:00.0, where XX is the previous hour. For example, if it is currently
14:05, set the time picker to 13:00:00.0, then Apply

● This is how we can “time travel” back to see conditions at a particular outage. Oh yeah!
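
The time-picker value described above is simply the top of the previous hour. A short sketch of that arithmetic, with a hypothetical function name:

```python
from datetime import datetime, timedelta


def top_of_previous_hour(now):
    """Return the start of the hour before `now` (14:05 becomes 13:00:00)."""
    top_of_this_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_this_hour - timedelta(hours=1)


# The example from the slide: it is currently 14:05.
print(top_of_previous_hour(datetime(2016, 10, 27, 14, 5)).strftime("%H:%M:%S"))
```

Subtracting a `timedelta` rather than decrementing the hour field keeps the result correct just after midnight.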

A Troubleshooting Exercise, cont’d
● The Online Store seems to be degraded, just as Customer Care reported.
Click on the widget under Buttercup to drill down further

A Troubleshooting Exercise, cont’d.
● The Online Store Glass Table shows a much more detailed view, including the impacted customer-facing KPIs
at the far left (Revenue, etc)
● Based on this view of all the relevant
services, where do you think the root cause
lies?
● Which service should we troubleshoot first?
● Click on Health widget for that service, to
drill down to a Deep Dive

Deep Dive
● Deep Dive shows multiple KPIs and Health Scores in parallel “swim
lanes”.
● The Health Score for this Service is the top swim lane. Can you see
when it begins to degrade from 100%?
● Mousing over this point in time, can you spot the KPI with the
leading fault indication, i.e., what failed first?

● To improve readability, make sure the Primary Time Range (lower left corner) is set to Presets > Last 60 minutes
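
What the eye does across swim lanes, finding the KPI that went bad first, can be sketched in a few lines. The KPI names, limits and data points below are invented; a real Deep Dive works on live KPI time series.

```python
def first_to_breach(series_by_kpi, limits):
    """Return (kpi_name, timestamp) of the earliest threshold breach, or None.

    series_by_kpi: {kpi_name: [(timestamp, value), ...]} sorted by time.
    limits: {kpi_name: max_normal_value}; a value above the limit is a breach.
    """
    earliest = None
    for name, series in series_by_kpi.items():
        for timestamp, value in series:
            if value > limits[name]:
                if earliest is None or timestamp < earliest[1]:
                    earliest = (name, timestamp)
                break  # only the first breach of each KPI matters
    return earliest


# Invented swim lanes: timestamps are minutes since the start of the window.
SAMPLE_SERIES = {
    "cpu_load": [(0, 40), (1, 45), (2, 95)],
    "db_errors": [(0, 0), (1, 12), (2, 30)],
    "response_time": [(0, 1), (1, 1), (2, 9)],
}
SAMPLE_LIMITS = {"cpu_load": 90, "db_errors": 5, "response_time": 5}
print(first_to_breach(SAMPLE_SERIES, SAMPLE_LIMITS))
```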

Multi-KPI Alerts and Notable Events
● Click on Notable Events Review
● Multiple KPIs and Health Scores can
be combined in sophisticated ways
to create Multi-KPI alerts
● When a Multi-KPI alert fires, one
of the outcomes is the creation of
a Notable Event
● Notable Events allow NOC
personnel and others to triage and
coordinate event management
efforts
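
A minimal sketch of the combining idea: the alert fires only when every listed KPI is at or above a chosen severity. The severity labels, their ordering and the all-conditions-must-hold rule are assumptions for illustration; ITSI offers richer ways to combine KPIs than this.

```python
# Hypothetical severity ordering, lowest to highest.
SEVERITY_RANK = {"normal": 0, "medium": 1, "high": 2, "critical": 3}


def multi_kpi_alert(statuses, conditions):
    """Fire only when every condition holds.

    statuses: {kpi_name: current severity label}
    conditions: {kpi_name: minimum severity label that counts as a trigger}
    """
    return all(
        SEVERITY_RANK[statuses[name]] >= SEVERITY_RANK[minimum]
        for name, minimum in conditions.items()
    )


# Invented current state of three KPIs.
SAMPLE_STATUSES = {"db_errors": "critical", "response_time": "high", "cpu_load": "normal"}
print(multi_kpi_alert(SAMPLE_STATUSES, {"db_errors": "high", "response_time": "high"}))
```

Requiring several KPIs to agree is one reason such alerts produce less noise than one alert per KPI.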

Service Analyzer
● Click on Service Analyzer > Default Service Analyzer

● Back where we started!
● This view shows a “no-frills” list of
services (top) and hottest KPIs
(bottom)
● Provides access into Service Details
● It is useful for NOCs and others
who need a high-level situational
view


Let’s Play!

Advanced Exercises
Summary
● High-value services can be decomposed and modeled in ITSI, using machine data
from the relevant systems
● Services and KPIs can be created in minutes, with sophisticated thresholding
techniques to distinguish “normal” from “not normal”
● Glass Tables allow service health and KPI metrics to be displayed in a way that
makes sense to specific groups, such as Executive Leadership, Business Service
Owners, the NOC, DevOps & Others
● Deep Dives allow KPIs to be compared side-by-side across any time range,
accelerating root cause analysis and significantly reducing MTTR
● Multi-KPI Alerts and Notable Events reduce alert noise, producing actionable
events and a means to manage them
● … and it’s fast+fun to build!

What our ITSI Customers are doing
Splunk IT Service Intelligence
Machine Learning-Powered, Analytics-Driven IT Operations
• Prioritize incidents with context: Deliver business & service context to prioritize incident investigation & action
• Redefine the role of IT: Support decisions & communicate results with powerful service-level insights
• Simplify service operations: Leverage machine learning to detect anomalies & highlight events that matter
• Unify siloed monitoring: Combine events & metrics across silos with ease, flexibility & scale in days
Splunk’s Solution: A lens could be multiple
processes…
These are Health Scores: a high-level aggregation of the health of the underlying processes.

This shows how ‘Glass Tables’ can visualize key performance indicators and health scores that combine data from diverse sources.

This example is an abbreviated ‘Book to Bill’ (sometimes called ‘Order to Cash’) business process.

All the scores are time-based KPIs or nested sub-processes that are searching in real time for some relevant condition of interest.

All the scores are color coded to convey if they are “normal” or “abnormal” based on your criteria OR Splunk’s Packaged Machine Learning, enabled with an ON/OFF switch.



[Glass Table example: Call Center Service. Labels shown: VOIP Service, Mail Support, Service Health, Transactions, Online Msg, Inbound Calls, Social Media, ACD Analysis – Core Splunk, Call Wait History, Inbound Analysis]
[Glass Table example: Money Transfer Services. Labels shown: Service Health, Corporate, Internal Transfer Service, Fed Exchange Service, Money Exchange Service, External Wire Service, Reconciliation Service, Online Transactions, Core Splunk Searches, Transaction History, System Investigation, Heat Map Analysis]
Continuous Operational Visibility
[Glass Table example: CIO Scorecard. Sections: Enterprise Service Status, Major Incidents, Major Changes. Each service row shows Service Health, Incidents and Changes, plus service-specific metrics such as Volume, Revenue, Ontime Delivery, Throughput and Container Util]
The Vision - Business Operations Center

SOC, NOC, BOC
• Splunk ITSI has the fundamentals to deliver on the promise of real time business visualizations
• Modeled after your Security, Network, and IT Operations Centers
• Monitoring and diagnosis of important ecommerce and brick and mortar operations
• Enhanced with process insight from end-to-end, alerts, machine learning and real-time response
Sign Up Now – We’re here to help!
Harness the creativity and domain knowledge of your
organization to unlock the value of data and solve an
important Business Service problem through a joint service
intelligence workshop with key stakeholders

What is it?
› 1 Day Onsite Workshop
› Tightly linked with value
› Collaborative approach
› Build your own Glass Table

Define methods for:
› Proactive service monitoring
› Reduced risk and failures
› Faster issue resolution
› Increased business performance
Our Workshop In Action
Your Mission, should you choose to accept it…

• Find a problem worth solving in your enterprise
• Bring your subject experts together
• Conduct a Service Intelligence workshop
Reference Stuff
● ITSI Guidebook: In your ITSI instance:
Search -> Dashboards -> ITSI Sandbox Guide

● ITSI Documentation:
http://docs.splunk.com/Documentation/ITSI

Thank You
Please fill out the Survey
https://www.surveymonkey.com/r/NBXBYCG
