
Alerting & Monitoring Standard in Shopee IDC

(SGDC)
Monitoring Standard
Minimum Required Dashboard
Compute Monitoring
Memory
Disk
CPU
Container Restart Count
Container Uptime
Incoming network
Outgoing network
Traffic Monitoring
Total incoming traffic
Total successful process
Total error received
Total incoming traffic per path
Total error received per path
Latency P50
Latency P75
Latency P95
Redis Monitoring
Kafka Monitoring
Streaming Application
Consumer Lag per-consumer-group
Throughput
Process Latency
Topic
Message In
Bytes In
Bytes Out
Start Offset
End Offset
Consumer
Bytes Consume Rate (byte/s)
Bytes Consume Rate per Topic (byte/s)
Rate of Record consumed
Rate of Record consumed per Topic
Avg lag per client and partition
Avg fetch size
Fetch request latency
Fetch request rate
Current connection count
Avg and max throttle time
Record lead per-partition
Node response rate
Failed auth rate
Producer
Outgoing message per second
Outgoing message per second per topic
Batch size average
Buffer size available
Request in flight
Request Queue avg time
Produce request latency
Produce request rate
Produce response rate
Connection count
Connection creation rate
Connection close rate
Select rate
Record error rate
Record retry rate
Failed authentication rate
Database Monitoring
Server Uptime
Inbound Connection
Outbound Connection
Max Thread Connected [%]
Max Connection and Thread Connected
Max Used Connection
Average Thread Connected
Slow Queries
Aborted Connection Attempt
Aborted Client Timeout
Alerting Standard
Minimum Required Alert
Compute Alert
Traffic Alert
Example Query Aggregation
query without aggregation
query with aggregation
Example Query Function
without using function
Increase
Rate
Arithmetic Query PromQL

Monitoring Standard

You can see the implementation in this dashboard: https://monitoring.infra.sz.shopee.io/grafana/d/ZlJft1HVk/000-general-webservice-monitoring?orgId=69

Minimum Required Dashboard

Compute Monitoring
CPU and memory usage should be expressed as a percentage, i.e. TOTAL RESOURCE USED / TOTAL RESOURCE LIMIT, calculated per container. Every VM or container should have all of these monitors; the only exception is disk usage in container environments, where only containers using a persistent volume need disk monitoring.

NOTE: different teams may use different metrics to build these monitors!

Memory

(pod:pod_memory_usage:rate{pod_name=~"adzan.*", pod_name!~".*-xx-.*"}
  / (pod:pod_memory_limit_bytes{pod_name=~"adzan.*", pod_name!~".*-xx-.*"} / 1024 / 1024)) * 100

Disk

(kubelet_volume_stats_used_bytes{} / kubelet_volume_stats_capacity_bytes{}) * 100

CPU

(rate(container_cpu_usage_seconds_total{pod=~"adzan-be.*", image!~".*pause.*", pod!~".*-xx-.*"}[60s])
  / max(kube_pod_container_resource_limits_cpu_cores{pod=~"adzan-be.*", pod!~".*-xx-.*"}) * 100)

Container Restart Count


increase(kube_pod_container_status_restarts_total{pod=~"adzan-be.*", pod!~".*-xx-.*"}[5m])

Container Uptime

kube_pod_status_ready{pod=~"adzan-be.*", condition="true", pod!~".*-xx-.*"}

Incoming network

pod:pod_network_receive_bytes_total:sum_rate{pod_name=~"adzan.*"}

Outgoing network

pod:pod_network_transmit_bytes_total:sum_rate{pod_name=~"adzan.*"}

Traffic Monitoring

Total incoming traffic

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant"}[1m]))

Total successful process

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant", status_code=~"2.*|3.*"}[1m]))

Total error received

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant", status_code=~"4.*|5.*"}[1m]))

Total incoming traffic per path

sum(increase(idgame_handled_total{service="adzan-be",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  path!~".*ping.*", region!="xx"}[1m])) by(path, method)

Total error received per path

sum(increase(idgame_handled_total{service="adzan-be",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  status_code!~"200", path!~".*ping.*", region!="xx"}[1m])) by(path, method, status_code)

Latency P50

histogram_quantile(0.50, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))

Latency P75
histogram_quantile(0.75, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))

Latency P95

histogram_quantile(0.95, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))

Redis Monitoring
(To be documented.)

Kafka Monitoring
References:

https://access.redhat.com/documentation/en-us/red_hat_amq/7.2/html/using_amq_streams_on_red_hat_enterprise_linux_rhel/monitoring-str#doc-wrapper
https://docs.confluent.io/platform/current/kafka/monitoring.html
https://medium.com/@yashwant.deshmukh23/kafka-real-time-streaming-application-monitoring-and-alerting-daa4a8796c61

Streaming Application

Consumer Lag per-consumer-group

Approximate lag of every consumer group by Topic/Partition. The value is the number of messages that have not been consumed yet.

kafka_consumergroup_lag{mon_proj=~"$kafka_cluster"} > 0

Throughput

The average number of bytes sent out per second for a topic.

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesOutPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Process Latency

Topic

Message In

Aggregate incoming message rate.

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="MessagesInPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Bytes In

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesInPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Bytes Out
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesOutPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Start Offset

Oldest offset still present in the topic, by Topic/Partition.

kafka_topic_partition_oldest_offset{cluster="$kafka_cluster",topic=~"$kafka_topic",
partition=~"$kafka_partition"}

End Offset

Current (latest) offset at the broker, by Topic/Partition.

kafka_topic_partition_current_offset{cluster="$kafka_cluster",topic=~"$kafka_topic",
partition=~"$kafka_partition"}

Consumer

Bytes Consume Rate (byte/s)

kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{cluster="$kafka_cluster"}

Bytes Consume Rate per Topic (byte/s)

sum(kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{cluster="$kafka_cluster",
topic=~"$kafka_topic"}) by (topic)

Rate of Record consumed

kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate{cluster="$kafka_cluster"}

Rate of Record consumed per Topic

sum(kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate{cluster="$kafka_cluster",
topic=~"$kafka_topic"}) by (topic)

Avg lag per client and partition

kafka_consumer_consumer_fetch_manager_metrics_records_lag{cluster="$kafka_cluster", topic=~"$kafka_topic",
partition=~"$kafka_partition"}

Avg fetch size

kafka_consumer_consumer_fetch_manager_metrics_fetch_size_avg{cluster="$kafka_cluster", topic=~"$kafka_topic"}
Fetch request latency

kafka_consumer_consumer_fetch_manager_metrics_fetch_latency_max{cluster="$kafka_cluster", topic=~"$kafka_topic"}

Fetch request rate

kafka_consumer_consumer_fetch_manager_metrics_fetch_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"}

Current connection count

The number of active connections that the consumer has to the Kafka broker.

kafka_consumer_consumer_metrics_connection_count{cluster="$kafka_cluster"}

Avg and max throttle time

kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_avg{mon_proj=~"$kafka_cluster"}

kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_max{mon_proj=~"$kafka_cluster"}

Record lead per-partition

kafka_consumer_consumer_fetch_manager_metrics_records_lead{cluster="$kafka_cluster"}

Node response rate

The average number of responses received per second from the broker.

kafka_consumer_consumer_node_metrics_response_rate{cluster="$kafka_cluster"}

Failed auth rate

kafka_consumer_consumer_metrics_failed_authentication_rate{cluster=~"$kafka_cluster"}

Producer

Outgoing message per second

sum(kafka_server_BrokerTopicMetrics_Count{cluster="$kafka_cluster",name="MessagesInPerSec"}>0) by (instance)

Outgoing message per second per topic

kafka_server_BrokerTopicMetrics_Count{cluster="$kafka_cluster",name="MessagesInPerSec", topic=~"$kafka_topic"}

Batch size average

kafka_producer_producer_metrics_batch_size_avg{cluster="$kafka_cluster"}

Buffer size available

kafka_producer_producer_metrics_buffer_available_bytes{cluster="$kafka_cluster"}

Request in flight

kafka_producer_producer_metrics_requests_in_flight{cluster=~"$kafka_cluster"}

Request Queue avg time

kafka_producer_producer_metrics_record_queue_time_avg{cluster=~"$kafka_cluster"}

Produce request latency

Produce latency is one major component of end-to-end latency: the total time taken to process a record and batch it with other records in the internal Kafka producer.

50P:

kafka_network_RequestMetrics_50thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

75P:

kafka_network_RequestMetrics_75thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

95P:

kafka_network_RequestMetrics_95thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

98P:

kafka_network_RequestMetrics_98thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

99P:

kafka_network_RequestMetrics_99thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

999P:

kafka_network_RequestMetrics_999thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

Produce request rate

sum(kafka_network_RequestMetrics_OneMinuteRate{cluster="$kafka_cluster",request="Produce", name="RequestsPerSec"}) by (instance)

Produce response rate

kafka_producer_producer_metrics_response_rate{cluster="$kafka_cluster"}

Connection count

kafka_producer_producer_metrics_connection_count{cluster="$kafka_cluster"}

Connection creation rate

kafka_producer_producer_metrics_connection_creation_rate{cluster="$kafka_cluster"}

Connection close rate

kafka_producer_producer_metrics_connection_close_rate{cluster=~"$kafka_cluster"}

Select rate

kafka_producer_producer_metrics_select_rate{cluster="$kafka_cluster"}

Record error rate

query A:
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="FailedProduceRequestsPerSec",
topic=~"$kafka_topic"}) by (topic)

query B:

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="FailedFetchRequestsPerSec",
topic=~"$kafka_topic"}) by (topic)

Record retry rate

sum(kafka_producer_producer_topic_metrics_record_retry_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"})
by (topic)

Failed authentication rate

sum(kafka_network_RequestMetrics_OneMinuteRate{cluster="$kafka_cluster", error=~"GROUP_AUTHORIZATION_FAILED|TOPIC_AUTHORIZATION_FAILED|TRANSACTIONAL_ID_AUTHORIZATION_FAILED|CLUSTER_AUTHORIZATION_FAILED"} > 0) by (instance)

Database Monitoring

Server Uptime

(avg by (instance, role) (mysql_global_status_uptime{db_cluster="db-shopee-gamerun"}) / 3600 / 24)

Inbound Connection

avg by (instance, role) (rate(mysql_global_status_bytes_received{db_cluster="db-shopee-gamerun"}[1m]))

Outbound Connection

avg by (instance, role) (rate(mysql_global_status_bytes_sent{db_cluster="db-shopee-gamerun"}[1m]))

Max Thread Connected [%]

avg by (instance, role) (max_over_time(mysql_global_status_threads_connected{db_cluster="db-shopee-gamerun"}[1m]))
  / avg by (instance, role) (max_over_time(mysql_global_variables_max_connections{db_cluster="db-shopee-gamerun"}[1m])) * 100

Max Connection and Thread Connected

avg(avg by (instance, role) (max_over_time(mysql_global_variables_max_connections{db_cluster="db-shopee-gamerun"}[1m])))

avg by (instance, role) (max_over_time(mysql_global_status_threads_connected{db_cluster="db-shopee-gamerun"}[1m]))

Max Used Connection

avg by (instance, role) (max_over_time(mysql_global_status_max_used_connections{db_cluster="db-shopee-gamerun"}[1m]))

Average Thread Connected

avg by (instance, role) (avg_over_time(mysql_global_status_threads_running{db_cluster="db-shopee-gamerun"}[1m]))

Slow Queries

avg by (instance, role) (rate(mysql_global_status_slow_queries{db_cluster="db-shopee-gamerun"}[1m]))

Aborted Connection Attempt

sum by (instance, role) (increase(mysql_global_status_aborted_connects{db_cluster="db-shopee-gamerun"}[1m]))

Aborted Client Timeout

irate(mysql_global_status_aborted_clients{db_cluster="db-shopee-gamerun"}[1m])

Alerting Standard

Minimum Required Alert

Compute Alert
Memory usage
CPU usage
Disk usage
Container restart
Container downtime
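
As an illustration, here is a minimal sketch of how one of these compute alerts could be written as a Prometheus alerting rule. The expression reuses the memory dashboard query from above; the group name, alert name, 80% threshold, 5m duration, and severity label are assumptions for illustration, not part of the standard:

groups:
  - name: compute-alerts          # hypothetical rule group name
    rules:
      - alert: HighMemoryUsage    # hypothetical alert name
        expr: |
          (pod:pod_memory_usage:rate{pod_name=~"adzan.*", pod_name!~".*-xx-.*"}
            / (pod:pod_memory_limit_bytes{pod_name=~"adzan.*", pod_name!~".*-xx-.*"} / 1024 / 1024)) * 100 > 80
        for: 5m                   # assumed duration before the alert fires
        labels:
          severity: warning       # assumed severity scheme
        annotations:
          summary: "Memory usage above 80% for pod {{ $labels.pod_name }}"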

Traffic Alert
Error rate
Latency
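
Similarly, a minimal sketch of an error-rate alert that could sit in the same rule group as the compute sketch above, reusing the arithmetic error-percentage query shown later in this document; the 5% threshold, 5m window, and severity are assumptions:

      - alert: HighErrorRate      # hypothetical alert name
        expr: |
          sum(increase(idgame_handled_total{service=~"adzan.*", status_code=~"4.*|5.*"}[5m]))
            / sum(increase(idgame_handled_total{service=~"adzan.*"}[5m])) * 100 > 5
        for: 5m                   # assumed duration before the alert fires
        labels:
          severity: critical      # assumed severity scheme
        annotations:
          summary: "Error rate above 5% for adzan"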

Example Query Aggregation


Prometheus supports the following built-in aggregation operators that can be used to aggregate the elements of a single instant vector, resulting in a
new vector of fewer elements with aggregated values:

sum (calculate sum over dimensions)
min (select minimum over dimensions)
max (select maximum over dimensions)
avg (calculate the average over dimensions)
group (all values in the resulting vector are 1)
stddev (calculate population standard deviation over dimensions)
stdvar (calculate population standard variance over dimensions)
count (count number of elements in the vector)
count_values (count number of elements with the same value)
bottomk (smallest k elements by sample value)
topk (largest k elements by sample value)
quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

query without aggregation

This query will only show the raw value for every wordpuzzle pod. To get more precise monitoring, we need to modify the query depending on what we want to monitor.

kube_pod_info{pod=~"wordpuzzle.*"}

query with aggregation

If you want to know how many wordpuzzle pods you have, you can use the sum aggregator.

sum(kube_pod_info{pod=~"wordpuzzle.*"})

As you can see, all of the pods are summarized into a single value, and with this query the graph also shows how the number of pods changes over time.

There are two methods of grouping, depending on your logic.

The first method groups inside a single query. With this method you relabel the original label into a custom label, combined with a regex that captures the specific value you want.
sum(
  label_replace(kube_pod_info{pod=~"wordpuzzle-be-live.*"}, "pod_group", "$1", "pod",
    "wordpuzzle-be-live-(\\w+(-\\w+)*).*")
) by (pod_group)

The second method is to use two queries inside one panel. This is easier because you don't have to think about the regex and relabel logic at all.

sum(kube_pod_info{pod=~"wordpuzzle-be-live-sg.*"})

sum(kube_pod_info{pod=~"wordpuzzle-be-live-xx.*"})

The difference is in the legend. With relabeling you can use the variable {{pod_group}} as the legend, and it will show the sg and xx values; without relabeling, you need to set the legend manually for each query.

Example Query Function


PromQL also provides functions that make calculations easier and monitoring more precise. See the link provided in this section to learn more about Prometheus functions.

without using function

(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"})

As an example, if we query without using a function and only filter the ping path, Grafana shows us two graphs with values over 1000k. This indicates that the ping endpoint has been hit more than 1000k times since the earliest record in the Prometheus DB.

Increase

calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically
adjusted for. The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer
result even if a counter increases only by integer increments.

increase(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m])

With the increase function we can calculate how many times this endpoint has been hit during the given window, in this case 1 minute.

Another point of view, from a different panel using the increase function:

With a different visualization we get more insight into what the increase function is capable of. The result is split into 4 graphs because adzan has 4 different host values.

But to calculate RPS, we don't need to know where the containers are deployed, so we can calculate the RPS for the ping endpoint using the sum aggregator.

sum(increase(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))

Using the sum aggregator and the increase function, we now know how many requests per minute the adzan ping endpoint receives.

If we sum this endpoint without the increase function, it will calculate the total number of hits from the earliest data point to the latest, and we can see the graph keep increasing over time.

Rate
calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target
restarts) are automatically adjusted for. Also, the calculation extrapolates to the ends of the time range, allowing for missed scrapes or imperfect
alignment of scrape cycles with the range's time period.

(rate(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))

Rate calculates the PER-SECOND AVERAGE from the result of the query. In this case, adzan's RPM is 6 from our previous section; with rate it is divided by 60 because we are evaluating this endpoint over one minute, so it works out to 6/60 = 0.1.

Rate can be used with any counter metric provided by Prometheus, and it is especially useful for resource metrics like CPU or memory. You can also sum the result of the rate function, as shown below.
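
For example, a minimal sketch that sums the per-second rate of the same ping endpoint across all pods, reusing the filters from the previous queries:

sum(rate(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))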

Some functions take a bracketed range value for the accumulation window. You can use your own value such as [30s], [1m], or [1h], or a global variable such as [$__range].

NOTE: be careful when you use [$__range], because it will accumulate values over the whole dashboard time range.

Arithmetic Query PromQL


PromQL also supports arithmetic queries to calculate and visualize data more accurately.

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  status_code=~"3.*|2.*"}[1m])) / sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env",
  cid=~"$cid", tenant=~"$tenant"}[1m])) * 100

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  status_code=~"4.*|5.*"}[1m])) / sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env",
  cid=~"$cid", tenant=~"$tenant"}[1m])) * 100

With an arithmetic query we can calculate our service SLA as a percentage; the first thing you need to do is know the formula.
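
For example, if in one minute the service received 10,000 requests and 9,950 of them returned a 2xx/3xx status code, the first query above returns 9,950 / 10,000 * 100 = 99.5%, while the error query returns 0.5%.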
