
Alerting & Monitoring Standard in Shopee IDC

(SGDC)
Monitoring Standard
Minimum Required Dashboard
Compute Monitoring
Memory
Disk
CPU
Container Restart Count
Container Uptime
Incoming network
Outgoing network
Traffic Monitoring
Total incoming traffic
Total successful process
Total error received
Total incoming traffic per path
Total error received per path
Latency P50
Latency P75
Latency P95
Redis Monitoring
Kafka Monitoring
Streaming Application
Consumer Lag per-consumer-group
Throughput
Process Latency
Topic
Message In
Bytes In
Bytes Out
Start Offset
End Offset
Consumer
Bytes Consume Rate (byte/s)
Bytes Consume Rate per Topic (byte/s)
Rate of Record consumed
Rate of Record consumed per Topic
Avg lag per client and partition
Avg fetch size
Fetch request latency
Fetch request rate
Current connection count
Avg and max throttle time
Record lead per-partition
Node response rate
Failed auth rate
Producer
Outgoing message per second
Outgoing message per second per topic
Batch size average
Buffer size available
Request in flight
Request Queue avg time
Produce request latency
Produce request rate
Produce response rate
Connection count
Connection creation rate
Connection close rate
Select rate
Record error rate
Record retry rate
Failed authentication rate
Database Monitoring
Server Uptime
Inbound Connection
Outbound Connection
Max Thread Connected [%]
Max Connection and Thread Connected
Max Used Connection
Average Thread Connected
Slow Queries
Aborted Connection Attempt
Aborted Client Timeout
Alerting Standard
Minimum Required Alert
Compute Alert
Traffic Alert
Example Query Aggregation
query without aggregation
query with aggregation
Example Query Function
without using function
Increase
Rate
Arithmetic Query PromQL

Monitoring Standard

You can see the implementation in this dashboard: https://monitoring.infra.sz.shopee.io/grafana/d/ZlJft1HVk/000-general-webservice-monitoring?orgId=69

Minimum Required Dashboard

Compute Monitoring
CPU and memory usage should be expressed as a percentage, i.e. TOTAL RESOURCE USED / TOTAL RESOURCE LIMIT, calculated per container. Every VM or container should have all of these monitors; the only exception is disk usage in container environments, where only containers using a persistent volume need disk monitoring.

NOTE: different teams may use different metrics to build these monitors!

Memory

(pod:pod_memory_usage:rate{pod_name=~"adzan.*", pod_name!~".*-xx-.*"}
  / (pod:pod_memory_limit_bytes{pod_name=~"adzan.*", pod_name!~".*-xx-.*"} / 1024 / 1024)) * 100

Disk

(kubelet_volume_stats_used_bytes{} / kubelet_volume_stats_capacity_bytes{}) * 100

CPU

(rate(container_cpu_usage_seconds_total{pod=~"adzan-be.*", image!~".*pause.*", pod!~".*-xx-.*"}[60s])
  / max(kube_pod_container_resource_limits_cpu_cores{pod=~"adzan-be.*", pod!~".*-xx-.*"}) * 100)

Container Restart Count


increase(kube_pod_container_status_restarts_total{pod=~"adzan-be.*", pod!~".*-xx-.*"}[5m])

Container Uptime

kube_pod_status_ready{pod=~"adzan-be.*", condition="true", pod!~".*-xx-.*"}

Incoming network

pod:pod_network_receive_bytes_total:sum_rate{pod_name=~"adzan.*"}

Outgoing network

pod:pod_network_transmit_bytes_total:sum_rate{pod_name=~"adzan.*"}

Traffic Monitoring

Total incoming traffic

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant"}[1m]))

Total successful process

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant", status_code=~"2.*|3.*"}[1m]))

Total error received

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant", status_code=~"4.*|5.*"}[1m]))

Total incoming traffic per path

sum(increase(idgame_handled_total{service="adzan-be",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  path!~".*ping.*", region!="xx"}[1m])) by(path, method)

Total error received per path

sum(increase(idgame_handled_total{service="adzan-be",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  status_code!~"200", path!~".*ping.*", region!="xx"}[1m])) by(path, method, status_code)

Latency P50

histogram_quantile(0.50, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))

Latency P75
histogram_quantile(0.75, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))

Latency P95

histogram_quantile(0.95, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))

Redis Monitoring
(To be documented.)

Kafka Monitoring
References:

https://access.redhat.com/documentation/en-us/red_hat_amq/7.2/html/using_amq_streams_on_red_hat_enterprise_linux_rhel/monitoring-str#doc-wrapper
https://docs.confluent.io/platform/current/kafka/monitoring.html
https://medium.com/@yashwant.deshmukh23/kafka-real-time-streaming-application-monitoring-and-alerting-daa4a8796c61

Streaming Application

Consumer Lag per-consumer-group

Approximate lag of every consumer group by Topic/Partition. The value is the number of messages that have not been consumed yet.

kafka_consumergroup_lag{mon_proj=~"$kafka_cluster"} > 0

Throughput

The average number of bytes sent out per second for a topic.

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesOutPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Process Latency

Topic

Message In

Aggregate incoming message rate.

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="MessagesInPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Bytes In

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesInPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Bytes Out
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesOutPerSec",
topic=~"$kafka_topic"}>0) by (topic)

Start Offset

Oldest offset still present in the topic, by Topic/Partition.

kafka_topic_partition_oldest_offset{cluster="$kafka_cluster",topic=~"$kafka_topic",
partition=~"$kafka_partition"}

End Offset

Current (latest) offset at the broker, by Topic/Partition.

kafka_topic_partition_current_offset{cluster="$kafka_cluster",topic=~"$kafka_topic",
partition=~"$kafka_partition"}

Consumer

Bytes Consume Rate (byte/s)

kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{cluster="$kafka_cluster"}

Bytes Consume Rate per Topic (byte/s)

sum(kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{cluster="$kafka_cluster",
topic=~"$kafka_topic"}) by (topic)

Rate of Record consumed

kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate{cluster="$kafka_cluster"}

Rate of Record consumed per Topic

sum(kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate{cluster="$kafka_cluster",
topic=~"$kafka_topic"}) by (topic)

Avg lag per client and partition

kafka_consumer_consumer_fetch_manager_metrics_records_lag{cluster="$kafka_cluster", topic=~"$kafka_topic",
partition=~"$kafka_partition"}

Avg fetch size

kafka_consumer_consumer_fetch_manager_metrics_fetch_size_avg{cluster="$kafka_cluster", topic=~"$kafka_topic"}
Fetch request latency

kafka_consumer_consumer_fetch_manager_metrics_fetch_latency_max{cluster="$kafka_cluster", topic=~"$kafka_topic"}

Fetch request rate

kafka_consumer_consumer_fetch_manager_metrics_fetch_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"}

Current connection count

The number of active connections that the consumer has to the Kafka broker.

kafka_consumer_consumer_metrics_connection_count{cluster="$kafka_cluster"}

Avg and max throttle time

kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_avg{mon_proj=~"$kafka_cluster"}

kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_max{mon_proj=~"$kafka_cluster"}

Record lead per-partition

kafka_consumer_consumer_fetch_manager_metrics_records_lead{cluster="$kafka_cluster"}

Node response rate

The average number of responses received per second from the broker.

kafka_consumer_consumer_node_metrics_response_rate{cluster="$kafka_cluster"}

Failed auth rate

kafka_consumer_consumer_metrics_failed_authentication_rate{cluster=~"$kafka_cluster"}

Producer

Outgoing message per second

sum(kafka_server_BrokerTopicMetrics_Count{cluster="$kafka_cluster",name="MessagesInPerSec"}>0) by (instance)

Outgoing message per second per topic

kafka_server_BrokerTopicMetrics_Count{cluster="$kafka_cluster",name="MessagesInPerSec", topic=~"$kafka_topic"}

Batch size average

kafka_producer_producer_metrics_batch_size_avg{cluster="$kafka_cluster"}

Buffer size available

kafka_producer_producer_metrics_buffer_available_bytes{cluster="$kafka_cluster"}

Request in flight

kafka_producer_producer_metrics_requests_in_flight{cluster=~"$kafka_cluster"}

Request Queue avg time

kafka_producer_producer_metrics_record_queue_time_avg{cluster=~"$kafka_cluster"}

Produce request latency

Produce latency is one major component of end-to-end latency: the total time taken to process a record and batch it with other records in the internal Kafka producer.

50P:

kafka_network_RequestMetrics_50thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

75P:

kafka_network_RequestMetrics_75thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

95P:

kafka_network_RequestMetrics_95thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

98P:

kafka_network_RequestMetrics_98thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

99P:

kafka_network_RequestMetrics_99thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

999P:

kafka_network_RequestMetrics_999thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}

Produce request rate

sum(kafka_network_RequestMetrics_OneMinuteRate{cluster="$kafka_cluster",request="Produce", name="RequestsPerSec"}) by (instance)

Produce response rate

kafka_producer_producer_metrics_response_rate{cluster="$kafka_cluster"}

Connection count

kafka_producer_producer_metrics_connection_count{cluster="$kafka_cluster"}

Connection creation rate

kafka_producer_producer_metrics_connection_creation_rate{cluster="$kafka_cluster"}

Connection close rate

kafka_producer_producer_metrics_connection_close_rate{cluster=~"$kafka_cluster"}

Select rate

kafka_producer_producer_metrics_select_rate{cluster="$kafka_cluster"}

Record error rate

query A:
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="FailedProduceRequestsPerSec",
topic=~"$kafka_topic"}) by (topic)

query B:

sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="FailedFetchRequestsPerSec",
topic=~"$kafka_topic"}) by (topic)

Record retry rate

sum(kafka_producer_producer_topic_metrics_record_retry_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"})
by (topic)

Failed authentication rate

sum(kafka_network_RequestMetrics_OneMinuteRate{cluster="$kafka_cluster", error=~"GROUP_AUTHORIZATION_FAILED|TOPIC_AUTHORIZATION_FAILED|TRANSACTIONAL_ID_AUTHORIZATION_FAILED|CLUSTER_AUTHORIZATION_FAILED"} > 0) by (instance)

Database Monitoring

Server Uptime

(avg by (instance, role) (mysql_global_status_uptime{db_cluster="db-shopee-gamerun"}) / 3600 / 24)

Inbound Connection

avg by (instance, role) (rate(mysql_global_status_bytes_received{db_cluster="db-shopee-gamerun"}[1m]))

Outbound Connection

avg by (instance, role) (rate(mysql_global_status_bytes_sent{db_cluster="db-shopee-gamerun"}[1m]))

Max Thread Connected [%]

avg by (instance, role) (max_over_time(mysql_global_status_threads_connected{db_cluster="db-shopee-gamerun"}[1m]))
  / avg by (instance, role) (max_over_time(mysql_global_variables_max_connections{db_cluster="db-shopee-gamerun"}[1m])) * 100

Max Connection and Thread Connected

avg(avg by (instance, role) (max_over_time(mysql_global_variables_max_connections{db_cluster="db-shopee-gamerun"}[1m])))

avg by (instance, role) (max_over_time(mysql_global_status_threads_connected{db_cluster="db-shopee-gamerun"}[1m]))

Max Used Connection

avg by (instance, role) (max_over_time(mysql_global_status_max_used_connections{db_cluster="db-shopee-gamerun"}[1m]))

Average Thread Connected

avg by (instance, role) (avg_over_time(mysql_global_status_threads_running{db_cluster="db-shopee-gamerun"}[1m]))

Slow Queries

avg by (instance, role) (rate(mysql_global_status_slow_queries{db_cluster="db-shopee-gamerun"}[1m]))

Aborted Connection Attempt

sum by (instance, role) (increase(mysql_global_status_aborted_connects{db_cluster="db-shopee-gamerun"}[1m]))

Aborted Client Timeout

irate(mysql_global_status_aborted_clients{db_cluster="db-shopee-gamerun"}[1m])

Alerting Standard

Minimum Required Alert

Compute Alert
Memory usage
CPU usage
Disk usage
Container restart
Container downtime
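
As an illustration, here is a minimal sketch of how one of these compute alerts could be written as a Prometheus alerting rule. The expression reuses the memory dashboard query from above; the group name, alert name, 80% threshold, 5m duration, and severity label are assumptions for illustration, not part of the standard:

groups:
  - name: compute-alerts          # hypothetical rule group name
    rules:
      - alert: HighMemoryUsage    # hypothetical alert name
        expr: |
          (pod:pod_memory_usage:rate{pod_name=~"adzan.*", pod_name!~".*-xx-.*"}
            / (pod:pod_memory_limit_bytes{pod_name=~"adzan.*", pod_name!~".*-xx-.*"} / 1024 / 1024)) * 100 > 80
        for: 5m                   # assumed duration before the alert fires
        labels:
          severity: warning       # assumed severity scheme
        annotations:
          summary: "Memory usage above 80% for pod {{ $labels.pod_name }}"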

Traffic Alert
Error rate
Latency
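
Similarly, a minimal sketch of an error-rate alert that could sit in the same rule group as the compute sketch above, reusing the arithmetic error-percentage query shown later in this document; the 5% threshold, 5m window, and severity are assumptions:

      - alert: HighErrorRate      # hypothetical alert name
        expr: |
          sum(increase(idgame_handled_total{service=~"adzan.*", status_code=~"4.*|5.*"}[5m]))
            / sum(increase(idgame_handled_total{service=~"adzan.*"}[5m])) * 100 > 5
        for: 5m                   # assumed duration before the alert fires
        labels:
          severity: critical      # assumed severity scheme
        annotations:
          summary: "Error rate above 5% for adzan"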

Example Query Aggregation


Prometheus supports the following built-in aggregation operators that can be used to aggregate the elements of a single instant vector, resulting in a
new vector of fewer elements with aggregated values:

sum (calculate sum over dimensions)
min (select minimum over dimensions)
max (select maximum over dimensions)
avg (calculate the average over dimensions)
group (all values in the resulting vector are 1)
stddev (calculate population standard deviation over dimensions)
stdvar (calculate population standard variance over dimensions)
count (count number of elements in the vector)
count_values (count number of elements with the same value)
bottomk (smallest k elements by sample value)
topk (largest k elements by sample value)
quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

query without aggregation

This query will only show the raw value for every wordpuzzle pod. To get more precise monitoring, we need to modify the query depending on what we want to monitor.

kube_pod_info{pod=~"wordpuzzle.*"}

query with aggregation

If you want to know how many wordpuzzle pods you have, you can use the sum aggregator.

sum(kube_pod_info{pod=~"wordpuzzle.*"})

As you can see, all of the pods are summarized into a single value, and with this query the graph also shows how the number of pods changes over time.

There are two methods of grouping, depending on your logic.

The first method groups inside a single query. With this method you relabel the original label into a custom label, combined with a regex that captures the specific value you want.
sum(
  label_replace(kube_pod_info{pod=~"wordpuzzle-be-live.*"}, "pod_group", "$1", "pod",
    "wordpuzzle-be-live-(\\w+(-\\w+)*).*")
) by (pod_group)

The second method is to use two queries inside one panel. This is easier because you don't have to think about the regex and relabel logic at all.

sum(kube_pod_info{pod=~"wordpuzzle-be-live-sg.*"})

sum(kube_pod_info{pod=~"wordpuzzle-be-live-xx.*"})

The difference is in the legend. With relabeling you can use the variable {{pod_group}} as the legend, and it will show the sg and xx values; without relabeling, you need to set the legend manually for each query.

Example Query Function


PromQL also provides functions that make calculations easier and monitoring more precise. See the link provided in this section to learn more about Prometheus functions.

without using function

(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"})

As an example, if we query without using a function and only filter the ping path, Grafana shows us two graphs with values over 1000k. This indicates that the ping endpoint has been hit more than 1000k times since the earliest record in the Prometheus DB.

Increase

calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically
adjusted for. The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer
result even if a counter increases only by integer increments.

increase(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m])

With the increase function we can calculate how many times this endpoint has been hit during the given window, in this case 1 minute.

Another point of view, from a different panel using the increase function:

With a different visualization we get more insight into what the increase function is capable of. The result is split into 4 graphs because adzan has 4 different host values.

But to calculate RPS, we don't need to know where the containers are deployed, so we can calculate the RPS for the ping endpoint using the sum aggregator.

sum(increase(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))

Using the sum aggregator and the increase function, we now know how many requests per minute the adzan ping endpoint receives.

If we sum this endpoint without the increase function, it will calculate the total number of hits from the earliest data point to the latest, and we can see the graph keep increasing over time.

Rate
calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target
restarts) are automatically adjusted for. Also, the calculation extrapolates to the ends of the time range, allowing for missed scrapes or imperfect
alignment of scrape cycles with the range's time period.

(rate(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))

Rate calculates the PER-SECOND AVERAGE from the result of the query. In this case, adzan's RPM is 6 from our previous section; with rate it is divided by 60 because we are evaluating this endpoint over one minute, so it works out to 6/60 = 0.1.

Rate can be used with any counter metric provided by Prometheus, and it is especially useful for resource metrics like CPU or memory. You can also sum the result of the rate function, as shown below.
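
For example, a minimal sketch that sums the per-second rate of the same ping endpoint across all pods, reusing the filters from the previous queries:

sum(rate(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))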

Some functions take a bracketed range value for the accumulation window. You can use your own value such as [30s], [1m], or [1h], or a global variable such as [$__range].

NOTE: be careful when you use [$__range], because it will accumulate values over the whole dashboard time range.

Arithmetic Query PromQL


PromQL also supports arithmetic queries to calculate and visualize data more accurately.

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  status_code=~"3.*|2.*"}[1m])) / sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env",
  cid=~"$cid", tenant=~"$tenant"}[1m])) * 100

sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env", cid=~"$cid", tenant=~"$tenant",
  status_code=~"4.*|5.*"}[1m])) / sum(increase(idgame_handled_total{service=~"adzan.*",env=~"$env",
  cid=~"$cid", tenant=~"$tenant"}[1m])) * 100

With an arithmetic query we can calculate our service SLA as a percentage; the first thing you need to do is know the formula.
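
For example, if in one minute the service received 10,000 requests and 9,950 of them returned a 2xx/3xx status code, the first query above returns 9,950 / 10,000 * 100 = 99.5%, while the error query returns 0.5%.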
