(SGDC)
Monitoring Standard
Minimum Required Dashboard
Compute Monitoring
Memory
Disk
CPU
Container Restart Count
Container Uptime
Incoming network
Outgoing network
Traffic Monitoring
Total incoming traffic
Total successful process
Total error received
Total incoming traffic per path
Total error received per path
Latency P50
Latency P75
Latency P95
Redis Monitoring
Kafka Monitoring
Streaming Application
Consumer Lag per-consumer-group
Throughput
Process Latency
Topic
Message In
Bytes In
Bytes Out
Start Offset
End Offset
Consumer
Bytes Consume Rate (byte/s)
Bytes Consume Rate per Topic (byte/s)
Rate of Record consumed
Rate of Record consumed per Topic
Avg lag per client and partition
Avg fetch size
Fetch request latency
Fetch request rate
Current connection count
Avg and max throttle time
Record lead per-partition
Node response rate
Failed auth rate
Producer
Outgoing message per second
Outgoing message per second per topic
Batch size average
Buffer size available
Request in flight
Request Queue avg time
Produce request latency
Produce request rate
Produce response rate
Connection count
Connection creation rate
Connection close rate
Select rate
Record error rate
Record retry rate
Failed authentication rate
Database Monitoring
Server Uptime
Inbound Connection
Outbound Connection
Max Thread Connected [%]
Max Connection and Thread Connected
Max Used Connection
Average Thread Connected
Slow Queries
Aborted Connection Attempt
Aborted Client Timeout
Alerting Standard
Minimum Required Alert
Compute Alert
Traffic Alert
Example Query Aggregation
query without aggregation
query with aggregation
Example Query Function
without using function
Increase
Rate
Arithmetic Query PromQL
Monitoring Standard
Compute Monitoring
CPU and memory usage should be expressed as a percentage, i.e. TOTAL RESOURCE USED / TOTAL RESOURCE LIMIT, calculated per container. Every VM should have all of these types of monitoring. The only exception is disk usage in a container environment: only containers using a persistent volume need disk monitoring.
NOTE: different teams might use different metrics to calculate these values!
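For reference, a minimal PromQL sketch of this USED / LIMIT calculation, assuming the standard cAdvisor and kube-state-metrics series are available (your cluster may expose recording rules under different names, as the network queries below do):
Memory usage [%] per container:
sum(container_memory_working_set_bytes{pod=~"adzan.*", container!=""}) by (pod, container) / sum(kube_pod_container_resource_limits{pod=~"adzan.*", resource="memory"}) by (pod, container) * 100
CPU usage [%] per container:
sum(rate(container_cpu_usage_seconds_total{pod=~"adzan.*", container!=""}[1m])) by (pod, container) / sum(kube_pod_container_resource_limits{pod=~"adzan.*", resource="cpu"}) by (pod, container) * 100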
Memory
Disk
CPU
Container Uptime
Incoming network
pod:pod_network_receive_bytes_total:sum_rate{pod_name=~"adzan.*"}
Outgoing network
pod:pod_network_transmit_bytes_total:sum_rate{pod_name=~"adzan.*"}
Traffic Monitoring
Latency P50
histogram_quantile(0.50, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))
Latency P75
histogram_quantile(0.75, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))
Latency P95
histogram_quantile(0.95, sum(increase(idgame_handled_seconds_bucket{service="adzan-be",env=~"$env",
tenant=~"$cid", path!~".*ping.*", region!="xx"}[1m])) by (le, path, method))
Redis Monitoring
(Redis dashboard items to be defined.)
Kafka Monitoring
References:
https://access.redhat.com/documentation/en-us/red_hat_amq/7.2/html/using_amq_streams_on_red_hat_enterprise_linux_rhel/monitoring-str#doc-wrapper
https://docs.confluent.io/platform/current/kafka/monitoring.html
https://medium.com/@yashwant.deshmukh23/kafka-real-time-streaming-application-monitoring-and-alerting-daa4a8796c61
Streaming Application
Consumer Lag per-consumer-group
Approximate lag of every consumer group by topic/partition. The value is the number of messages that have not been consumed yet.
kafka_consumergroup_lag{mon_proj=~"$kafka_cluster"} > 0
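If the per-partition breakdown is too noisy for a dashboard panel, the same metric can be aggregated per consumer group and topic (a sketch using the label names exposed by kafka_exporter):
sum(kafka_consumergroup_lag{mon_proj=~"$kafka_cluster"}) by (consumergroup, topic)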
Throughput
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesOutPerSec",
topic=~"$kafka_topic"}>0) by (topic)
Process Latency
Topic
Message In
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="MessagesInPerSec",
topic=~"$kafka_topic"}>0) by (topic)
Bytes In
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesInPerSec",
topic=~"$kafka_topic"}>0) by (topic)
Bytes Out
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="BytesOutPerSec",
topic=~"$kafka_topic"}>0) by (topic)
Start Offset
kafka_topic_partition_oldest_offset{cluster="$kafka_cluster",topic=~"$kafka_topic", partition=~"$kafka_partition"}
End Offset
kafka_topic_partition_current_offset{cluster="$kafka_cluster",topic=~"$kafka_topic", partition=~"$kafka_partition"}
Consumer
Bytes Consume Rate (byte/s)
kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{cluster="$kafka_cluster"}
Bytes Consume Rate per Topic (byte/s)
sum(kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"}) by (topic)
Rate of Record consumed
kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate{cluster="$kafka_cluster"}
Rate of Record consumed per Topic
sum(kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"}) by (topic)
Avg lag per client and partition
kafka_consumer_consumer_fetch_manager_metrics_records_lag{cluster="$kafka_cluster", topic=~"$kafka_topic", partition=~"$kafka_partition"}
Avg fetch size
kafka_consumer_consumer_fetch_manager_metrics_fetch_size_avg{cluster="$kafka_cluster", topic=~"$kafka_topic"}
Fetch request latency
kafka_consumer_consumer_fetch_manager_metrics_fetch_latency_max{cluster="$kafka_cluster", topic=~"$kafka_topic"}
kafka_consumer_consumer_fetch_manager_metrics_fetch_latency_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"}
Current connection count
The number of active connections that the consumer has to the Kafka broker.
kafka_consumer_consumer_metrics_connection_count{cluster="$kafka_cluster"}
Avg and max throttle time
kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_avg{mon_proj=~"$kafka_cluster"}
kafka_consumer_consumer_fetch_manager_metrics_fetch_throttle_time_max{mon_proj=~"$kafka_cluster"}
Record lead per-partition
kafka_consumer_consumer_fetch_manager_metrics_records_lead{cluster="$kafka_cluster"}
Node response rate
The average number of responses received per second from the broker.
kafka_consumer_consumer_node_metrics_response_rate{cluster="$kafka_cluster"}
Failed auth rate
kafka_consumer_consumer_metrics_failed_authentication_rate{cluster=~"$kafka_cluster"}
Producer
Outgoing message per second
kafka_server_BrokerTopicMetrics_Count{cluster="$kafka_cluster",name="MessagesInPerSec", topic=~"$kafka_topic"}
Batch size average
kafka_producer_producer_metrics_batch_size_avg{cluster="$kafka_cluster"}
Buffer size available
kafka_producer_producer_metrics_buffer_available_bytes{cluster="$kafka_cluster"}
Request in flight
kafka_producer_producer_metrics_requests_in_flight{cluster=~"$kafka_cluster"}
Request Queue avg time
kafka_producer_producer_metrics_record_queue_time_avg{cluster=~"$kafka_cluster"}
Produce request latency
Produce latency is one major component of end-to-end latency: the total time taken to process the record and batch it with other records in the internal Kafka producer.
50P:
75P:
95P:
98P:
kafka_network_RequestMetrics_98thPercentile{cluster="$kafka_cluster", request="Produce", name="TotalTimeMs", instance="$kafka_broker"}
99P:
999P:
Produce request rate
sum(kafka_network_RequestMetrics_OneMinuteRate{cluster="$kafka_cluster",request="Produce", name="RequestsPerSec"}) by (instance)
Produce response rate
kafka_producer_producer_metrics_response_rate{cluster="$kafka_cluster"}
Connection count
kafka_producer_producer_metrics_connection_count{cluster="$kafka_cluster"}
Connection creation rate
kafka_producer_producer_metrics_connection_creation_rate{cluster="$kafka_cluster"}
Connection close rate
kafka_producer_producer_metrics_connection_close_rate{cluster=~"$kafka_cluster"}
Select rate
kafka_producer_producer_metrics_select_rate{cluster="$kafka_cluster"}
Record error rate
query A:
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="FailedProduceRequestsPerSec", topic=~"$kafka_topic"}) by (topic)
query B:
sum(kafka_server_BrokerTopicMetrics_OneMinuteRate{mon_proj="$kafka_cluster",name="FailedFetchRequestsPerSec", topic=~"$kafka_topic"}) by (topic)
Record retry rate
sum(kafka_producer_producer_topic_metrics_record_retry_rate{cluster="$kafka_cluster", topic=~"$kafka_topic"}) by (topic)
Failed authentication rate
sum(kafka_network_RequestMetrics_OneMinuteRate{cluster="$kafka_cluster", error=~"GROUP_AUTHORIZATION_FAILED|TOPIC_AUTHORIZATION_FAILED|TRANSACTIONAL_ID_AUTHORIZATION_FAILED|CLUSTER_AUTHORIZATION_FAILED"} > 0) by (instance)
Database Monitoring
Server Uptime
Inbound Connection
Outbound Connection
Slow Queries
Aborted Client Timeout
irate(mysql_global_status_aborted_clients{db_cluster="db-shopee-gamerun"}[5m])
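The other dashboard items in this section can be built from the standard mysqld_exporter series. For example, a sketch of Server Uptime and Max Thread Connected [%] over the last 5 minutes, assuming the same db_cluster label is present:
mysql_global_status_uptime{db_cluster="db-shopee-gamerun"}
max_over_time(mysql_global_status_threads_connected{db_cluster="db-shopee-gamerun"}[5m]) / mysql_global_variables_max_connections{db_cluster="db-shopee-gamerun"} * 100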
Alerting Standard
Compute Alert
Memory usage
CPU usage
Disk usage
Container restart
Container downtime
Traffic Alert
Error rate
Latency
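These alerts reuse the dashboard queries above, wrapped in a threshold comparison inside an alerting rule's expr field. A sketch for one compute alert and one traffic alert; the 90% and 5% thresholds are illustrative placeholders, not mandated values, and the code label for the HTTP status is an assumption:
Memory usage above 90% of the container limit:
sum(container_memory_working_set_bytes{container!=""}) by (pod, container) / sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, container) > 0.9
Error rate above 5% of total traffic:
sum(increase(idgame_handled_total{application=~"adzan-be.*", code=~"5.."}[5m])) / sum(increase(idgame_handled_total{application=~"adzan-be.*"}[5m])) > 0.05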
Example Query Aggregation
query without aggregation
kube_pod_info{pod=~"wordpuzzle.*"}
This query will only show the raw value of every wordpuzzle pod. To gather more precise monitoring we need to modify the query, depending on what we need to monitor.
query with aggregation
If you want to know how many wordpuzzle pods you have, you can use the sum function.
sum(kube_pod_info{pod=~"wordpuzzle.*"})
As you can see, all of the pods are summarized into one value, and with this query the graph also shows the increase in the number of pods over time.
The first method is grouping inside one query. With this method you relabel the original label into a custom label, combined with a regex to extract the specific value that you want.
sum(
  label_replace(kube_pod_info{pod=~"wordpuzzle-be-live.*"}, "pod_group", "$1", "pod", "wordpuzzle-be-live-(\\w+(-\\w+)*).*")
) by (pod_group)
The second method is to use two queries inside one panel. This is easier because you do not have to think about the regex and relabel logic at all.
sum(kube_pod_info{pod=~"wordpuzzle-be-live-sg.*"})
sum(kube_pod_info{pod=~"wordpuzzle-be-live-xx.*"})
The difference is in the legend. With relabeling you can use the variable {{pod_group}} as the legend, and it will show the sg and xx values; without relabeling you need to enter the legend manually for each query you use.
Example Query Function
without using function
(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"})
As an example, if we query without using a function and only filter on the ping path, Grafana shows two graphs with values over 1000k, indicating that the ping endpoint has been hit more than 1000k times since the earliest data kept in the Prometheus database.
Increase
calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically
adjusted for. The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer
result even if a counter increases only by integer increments.
increase(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m])
With the increase function we can calculate how many times this endpoint has been hit, in this case over 1 minute, which gives us the requests per minute for the ping endpoint.
With a different visualization we can get more insight into what the increase function is capable of. The result is split into 4 graphs because adzan has 4 different host values.
But to calculate the request rate we do not need to know which host the containers are deployed on, so we can calculate the rate for the ping endpoint using the sum aggregator.
sum(increase(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))
Using the sum aggregator together with the increase function, we now know how many requests per minute the adzan ping endpoint receives.
If we sum this endpoint without the increase function, it calculates the total number of times the endpoint has been hit from the earliest data until the latest data, and the graph keeps increasing over time.
Rate
calculates the per-second average rate of increase of the time series in the range vector. Breaks in monotonicity (such as counter resets due to target
restarts) are automatically adjusted for. Also, the calculation extrapolates to the ends of the time range, allowing for missed scrapes or imperfect
alignment of scrape cycles with the range's time period.
(rate(idgame_handled_total{application=~"adzan-be.*", path=~".*ping.*"}[1m]))
rate calculates the PER SECOND AVERAGE over the range. In the previous section the adzan ping endpoint was at 6 requests per minute; with rate this is divided by the 60 seconds in the range, so the result is roughly 6/60 = 0.1.
rate can be applied to any counter metric exposed to Prometheus, and it is especially useful for cumulative counters such as CPU usage seconds. You can also sum the result of the rate function.
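For example, summing the rate of the CPU usage counter gives the total number of cores used by all adzan containers (a sketch assuming the standard cAdvisor metric name):
sum(rate(container_cpu_usage_seconds_total{pod=~"adzan.*", container!=""}[1m]))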
Some functions take a bracketed range value for the accumulation window. You can use your own value such as [30s], [1m], or [1h], or a Grafana global variable such as [$__range].
NOTE: be careful when you use [$__range], because it accumulates the value over the whole time range selected on the dashboard.
Arithmetic Query PromQL
With an arithmetic query we can calculate our service SLA percentage. The first thing you need to do is know the formula.
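A sketch of such a calculation, assuming the SLA is defined as the share of non-5xx responses and that the handled-requests counter carries a code label for the HTTP status; [$__range] makes it cover the dashboard's selected time range, per the note above:
sum(increase(idgame_handled_total{application=~"adzan-be.*", code!~"5.."}[$__range])) / sum(increase(idgame_handled_total{application=~"adzan-be.*"}[$__range])) * 100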