Understanding Linux on System z Performance Tools
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corp., registered in many jurisdictions worldwide. Other product and service
names might be trademarks of IBM or other companies. A current list of IBM trademarks is
available on the Web at www.ibm.com/legal/copytrade.shtml.
Agenda
Your swiss army knife for the complex cases
– Pidstat – per process statistics
– Slabtop – kernel memory pool consumption
– Lsof – check file flags of open files
– Blktrace – low level disk I/O analysis
– Hyptop – cross guest cpu consumption monitor
– Iptraf – network traffic monitor
– Dstat – very configurable live system overview
– Irqstats – check irq amount and cpu distribution
– Smem – per process/per mapping memory overview
– Jinsight – Java method call stack analysis
– Htop – top on steroids
– Strace – system call statistics
– Ltrace – library call statistics
– Kernel tracepoints – get in-depth timing inside the kernel
– Vmstat – virtual memory statistics
– Sysstat – full system overview
– Iostat – I/O related statistics
– Dasdstat – disk statistics
– scsi statistics – disk statistics
– Perf – hw counters, tracepoint based evaluations, profiling to find hotspots
– Valgrind – in depth memory/cache analysis and leak detection
– Java Health Center – high level java overview and monitoring
– Java Garbage Collection and Memory Visualizer – in depth gc analysis
– Netstat – network statistics and overview
– Socket Statistics – extended socket statistics
– top / ps – process overview
– Icastats / lszcrypt – check usage of crypto hw support
– Lsluns / multipath – check multipath setup
– Lsqeth – check hw checksumming and buffer count
– Ethtool – check offloading functions
– Collectl – full system monitoring
– Ftrace – kernel function tracing
– Lttng – complex latency tracing infrastructure
– Ziomon – analyze FCP setup and I/O
– Systemtap – another kernel tracing infrastructure
– Wireshark / Tcpdump – analyze network traffic in depth
– Iotop – order processes by disk I/O
– Iftop – per connection traffic overview
– … ever growing
Structured Agenda
Utilization
Utilization means that a CPU core is being used
Most of these basics are usually known, but covering them still has a purpose
– Clarify and synchronize the understanding between everybody
– Provide metaphors that make it easier to explain next time
Utilization - Metaphor
For all the clarifications on basic terms I will use metaphors
– Based on well known real world examples
– At the beginning of a topic the matching metaphor is provided
Utilization - USR
[Diagram: an application working on a CPU – that time is accounted as USR]
Utilization - SYS
[Diagram: the application issues a system call that is handled by the kernel]
For some tasks you need certain privileges
– so you call an administrator (System Call to the Kernel)
– He executes the privileged stuff for you (accounted as SYStem time)
– There are subcategories
• (H)IRQ: privileged work driven by interrupts (instead of the user)
• SIRQ: privileged work driven by soft interrupts and tasklets (instead of the user)
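The USR/SYS split can be seen with the shell's time builtin; a minimal sketch (the output numbers are illustrative):
#> time dd if=/dev/zero of=/dev/null bs=1M count=4096
real 0m0.4s user 0m0.0s sys 0m0.4s ← almost pure SYS time, the copying happens in the kernel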
Utilization - Idle
[Diagram: the application is not requesting work – the CPU is idle]
Utilization - IOWait
[Diagram: the application waits for a requested I/O to complete – otherwise idle CPU time with outstanding I/O is accounted as IOWait]
Utilization - Steal
[Diagram: the application thought it could work (it was scheduled), but the CPU seems to be non-existent]
Utilization - Steal
[Diagram: the application doesn't want to work anyway]
In case a CPU shouldn't run anyway the stealing isn't even recognized
– Still accounted as idle or iowait
Utilization - Steal
[Diagram: Linux thought it would transmit the network packets it got from the app and asked the kernel for I/O handling – but the CPU is busy with other things than Linux thought]
Linux does not know for which purpose the CPU was stolen
Steal can indicate issues
– Too high cpu or memory overcommitment
But steal also isn't always bad
– It could be work you requested, just like “USR->SYS in Linux alone”
Utilization - Steal examples
II: Driving sync reads from a file, mdisk does work for you
=> Steal? - No, IOWait
III: Driving async writes (AIO) by a DB, causing mdisk work in the HV
=> Steal? - Kind of, Idle if idle without steal
IV: Paging in the HV takes place while you were running a Java based BI load
utilizing all cores
=> Steal? - Yes
V: Paging in the HV takes place while your system is an idling Dev testbed
=> Steal? - No, Idle
Scheduling on Cores
Most systems have more programs than CPUs
So the OS will have to schedule them (time multiplexing)
The switching between the programs is called context switching
Scheduling on Cores
[Diagram: a single application keeps the CPU busy working]
Scheduling on Cores
[Diagram: two applications take turns working – the CPU is busy with their work and with the switching overhead]
With two programs the OS has to switch them every now and then
– Every program shall get its fair share
– Switching causes overhead
• Some CPU time is no longer used for the actual work done by the programs (red in the diagram)
Scheduling on Cores
[Diagram: many applications trying to work on one CPU]
With too many runnable programs per CPU the OS gets into trouble
– Actually this happens with any resource being shared a lot
Eventually there are two bad options to choose from
– Real throughput converging to zero (latency optimized)
– Individual applications have to wait longer (throughput optimized)
Scheduling on Cores
[Diagram: cooperative applications yield the CPU on their own – a voluntary context switch]
Scheduling on Cores
[Diagram: greedy applications have to be interrupted and switched by the OS – a non-voluntary context switch]
[Diagram: the dispatcher keeps applications queued in per-CPU local runqueues (e.g. 4 and 3 applications each); the scheduler migrates tasks between runqueues to balance them]
[Charts: CPU utilization (0-120%) over time versus the number of runnable processes (1 3 70 65 60 50 20 3 1 0 0 2 4 67 57 45 2 3 2 0 1); the last chart splits mean utilization into its USR and SYS shares]
Page Cache
Keeping disk data available in memory is caching
– Certain strategies take place
• What should be cached
• Read ahead of data that will likely be used
• Coalesce writes to issue a single disk write for several memory writes
– Proper management of caches can be complex (read: CPU intensive)
– As long as the kids just watch all their toys (read) things stay clean
– But when they really play with things (write) the room gets messy (dirty)
– How are things cleaned up?
[Diagram: dirty pages as a share of all pages – background writeback starts once dirty_background_ratio is exceeded; at dirty_ratio writing processes are throttled and have to clean pages themselves]
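Both thresholds can be inspected and tuned via sysctl; a minimal sketch (10/20 are common defaults, distributions differ):
#> sysctl vm.dirty_background_ratio vm.dirty_ratio
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
#> sysctl -w vm.dirty_background_ratio=5 ← start background writeback earlier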
Paging/Swapping
Spending more memory than available is overcommitment
In case the accessed memory exceeds the real memory paging has to
take place
– Paging (z/VM) is the same as swapping (Linux)
Paging/Swapping
[Diagram: the application writes to and reads from its memory pages]
Paging/Swapping
[Diagram: the application wants to write, but memory has no free page left]
Later the process might want to write something to the next page
But all pages the OS could provide are full
– This now requires swapping (OS level)
– Or paging (z/VM level)
Top
Shows
– CPU utilization
– Detailed memory usage
Hints
– Parameter -b enables writing the output of each interval into a file
– Use -p [pid1,pid2,...] to reduce the output to the processes of interest
– Configure displayed columns using 'f' key on the running top program
– Use the 'W' key to write current configuration to ~/.toprc
→ becomes the default
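For example, logging two processes of interest every 5 seconds for later analysis (the PIDs are hypothetical):
#> top -b -d 5 -p 1234,5678 > top.log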
top (cont.)
Output
top - 11:12:52 up 1:11, 3 users, load average: 1.21, 1.61, 2.03
Tasks: 53 total, 5 running, 48 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.0%us, 5.9%sy, 0.0%ni, 79.2%id, 9.9%wa, 0.0%hi, 1.0%si, 1.0%st
Mem: 5138052k total, 801100k used, 4336952k free, 447868k buffers
Swap: 88k total, 0k used, 88k free, 271436k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P SWAP DATA WCHAN COMMAND
3224 root 18 0 1820 604 444 R 2.0 0.0 0:00.56 0 1216 252 dbench
3226 root 18 0 1820 604 444 R 2.0 0.0 0:00.56 0 1216 252 dbench
2737 root 16 0 9512 3228 2540 R 1.0 0.1 0:00.46 0 6284 868 sshd
3225 root 18 0 1820 604 444 R 1.0 0.0 0:00.56 0 1216 252 dbench
3230 root 16 0 2652 1264 980 R 1.0 0.0 0:00.01 0 1388 344 top
1 root 16 0 848 304 256 S 0.0 0.0 0:00.54 0 544 232 select init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.00 0 0 0 migration migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 0 0 0 ksoftirqd ksoftirqd/0
4 root 10 5 0 0 0 S 0.0 0.0 0:00.13 0 0 0 worker_th events/0
5 root 20 5 0 0 0 S 0.0 0.0 0:00.00 0 0 0 worker_th khelper
Hints
– virtual memory: VIRT = SWAP + RES (unit: KB)
– physical memory used: RES = CODE + DATA (unit: KB)
– shared memory: SHR (unit: KB)
Linux ps command
Characteristics: very comprehensive, statistics data on process level
Objective: reports a snapshot of the current processes
Usage: “ps axlf”
Package: RHEL: procps SLES: procps
Shows
– IDs: Pid, Tid, User, …
– Status: stat and wchan
– Details: command, memory consumption and accumulated cpu time
PID TID NLWP POL USER TTY NI PRI PSR P STAT WCHAN START TIME %CPU %MEM VSZ SZ RSS COMMAND
871 871 1 TS root ? 5 29 0 * S< kauditd_thre 10:01 00:00:00 0.0 0.0 0 0 0 [kauditd]
2835 2835 1 TS root pts/2 0 23 0 * Ss+ read_chan 10:38 00:00:00 0.0 0.0 5140 824 2644 bash
3437 3437 1 TS root pts/1 0 23 0 * S+ wait4 11:39 00:00:00 0.0 0.0 1816 248 644 dbench 3
3438 3438 1 TS root pts/1 0 20 0 0 R+ 11:39 00:00:24 33.1 0.0 1820 252 604 dbench 3
…
Hints
– Do not specify blanks inside the -o format string
– Status is a one time shot, most interactive or I/O bound processes might sleep
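A format string along these lines produces output like the sample above (a sketch; remember: no blanks inside it):
#> ps -eo pid,tid,nlwp,policy,user,tname,ni,pri,psr,sgi_p,stat,wchan:12,start_time,time,pcpu,pmem,vsz,sz,rss,cmd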
Linux ps command
Understand what is running while the system is healthy
– One should be able to name all processes and what they do
– Watch out for things that should not appear multiple times, e.g. a cron job
• often long running jobs that stack up and make things worse
#> ps axlf
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 0 2 0 20 0 0 0 kthrea S ? 0:00 [kthreadd]
1 0 3 2 20 0 0 0 smpboo S ? 0:00 \_ [ksoftirqd/0]
5 0 4 2 20 0 0 0 worker S ? 0:00 \_ [kworker/0:0]
1 0 5 2 0 -20 0 0 worker S< ? 0:00 \_ [kworker/0:0H]
...
4 0 1 0 20 0 9672 5096 SyS_ep Ss ? 0:01 /sbin/init
4 0 2023 1 20 0 3584 1400 hrtime Ss ? 0:00 /usr/sbin/crond -n
4 0 2024 1 20 0 2840 892 pause Ss ? 0:00 /usr/sbin/atd -f
...
4 0 2280 1 20 0 10892 3344 poll_s Ss ? 0:00 /usr/sbin/sshd -D
4 0 2316 2280 20 0 14472 4700 poll_s Ss ? 0:00 \_ sshd: root@pts/0
4 0 2318 2316 20 0 106392 2216 wait Ss pts/0 0:00 \_ -bash
0 0 2350 2318 20 0 105576 1056 - R+ pts/0 0:00 \_ ps axlf
1 0 2351 2318 20 0 106392 796 sleep_ D+ pts/0 0:00 \_ -bash
...
0 0 2368 2318 20 0 2192 360 - R pts/0 110:04 /usr/bin/app
...
vmstat
Characteristics: Easy to use, high-level information
Objective: First and fast impression of the current state
Usage: vmstat [interval in sec]
Package: RHEL: sysstat.s390x SLES: sysstat
Output sample:
vmstat 1
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 2 0 4415152 64068 554100 0 0 4 63144 350 55 29 64 0 3 4
3 0 0 4417632 64832 551272 0 0 0 988 125 60 32 67 0 0 1
3 0 0 4411804 72188 549592 0 0 0 8984 230 42 32 67 0 0 1
3 0 0 4405232 72896 555592 0 0 0 16 105 52 32 68 0 0 0
Shows
– Data per time interval
– CPU utilization
– Disk I/O
– Memory usage/Swapping
Hints
– Shared memory usage is listed under 'cache'
End of Part I
The one you should always have → IBM System z Enterprise
Agenda
Non-legal Disclaimer
This is an introduction and cheat sheet
– Know what is out there
– What could be useful in which case
– How could I debug even further
Formulate a theory
Sysstat - sadc/sar
Characteristics: Very comprehensive, statistics data on device level
Objective: Suitable for permanent system monitoring and detailed analysis
Usage (recommended):
– monitor: /usr/lib64/sa/sadc [-S XALL] [interval in sec] [outfile]
– view: sar -A -f [outfile]
Package: RHEL: sysstat.s390x SLES: sysstat
Shows
– CPU utilization
– Disk I/O overview and on device level
– Network I/O and errors on device level
– Memory usage/Swapping
– … and much more
– Reports statistics data over time and creates average values for each item
Hints
– sadc parameter “-S XALL” enables the gathering of further optional data
– Shared memory is listed under 'cache'
– [outfile] is a binary file, which contains all values. It is formatted using sar
• enables the creation of item specific reports, e.g. network only
• enables the specification of a start and end time → time of interest
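A typical session could look like this (paths and times are illustrative):
#> /usr/lib64/sa/sadc -S XALL 10 /tmp/sa.bin & ← gather everything, 10 sec interval
#> sar -A -f /tmp/sa.bin ← full report
#> sar -n DEV -f /tmp/sa.bin -s 11:48:00 -e 11:50:00 ← network only, time of interest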
11:47:57 proc/s cswch/s
11:48:05 7.98 53189.03
11:48:09 8.00 54030.25
11:48:13 8.75 53599.25
11:48:17 7.25 54208.75
11:48:21 1.25 54054.75
11:48:25 8.02 54390.48
11:47:57 CPU %usr %sys %iowait %steal %irq %soft %guest %idle
11:48:05 all 70.66 23.47 0.00 0.00 0.12 5.74 0.00 0.00
11:48:05 0 70.32 23.44 0.00 0.00 0.25 5.74 0.00 0.25
11:48:05 1 70.75 23.50 0.00 0.00 0.25 5.50 0.00 0.00
11:48:09 all 71.84 22.53 0.00 0.00 0.00 5.51 0.00 0.13
11:48:09 0 72.00 22.25 0.00 0.25 0.00 5.50 0.00 0.00
11:48:09 1 71.75 22.50 0.00 0.00 0.00 5.50 0.00 0.25
11:48:13 all 71.38 23.12 0.00 0.00 0.00 5.50 0.00 0.00
11:48:13 0 70.75 23.75 0.00 0.00 0.00 5.50 0.00 0.00
11:48:13 1 71.93 22.56 0.00 0.00 0.00 5.51 0.00 0.00
11:47:57 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s
11:48:05 lo 52136.16 52136.16 45319.64 45319.64 0.00 0.00 0.00
11:48:05 eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:48:05 eth0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:48:09 lo 53254.25 53254.25 46308.95 46308.95 0.00 0.00 0.00
11:48:09 eth1 0.25 0.00 0.08 0.00 0.00 0.00 0.25
11:48:09 eth0 0.25 0.00 0.08 0.00 0.00 0.00 0.25
11:47:57 tps rtps wtps bread/s bwrtn/s
11:48:01 1023.06 1020.30 2.76 35689.22 92.23
11:48:05 2588.53 2587.03 1.50 89208.98 31.92
11:48:09 491.00 425.50 65.50 15352.00 880.00
11:48:13 278.50 276.00 2.50 9016.00 44.00
11:48:17 135.25 133.00 2.25 4560.00 42.00
11:48:21 698.75 152.00 546.75 5376.00 4566.00
Overview of
- operations per second
- transferred amount
11:47:57 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
11:48:05 dev253-0 165.34 5564.09 0.00 33.65 0.06 0.38 0.21 3.49
11:48:05 dev253-1 170.32 5564.09 0.00 32.67 0.06 0.37 0.20 3.49
11:48:05 dev8-128 18.95 576.56 0.00 30.42 0.01 0.39 0.39 0.75
11:48:05 dev8-144 19.20 674.31 0.00 35.12 0.00 0.26 0.26 0.50
11:48:05 dev8-160 19.20 724.19 0.00 37.71 0.01 0.39 0.39 0.75
11:48:05 dev8-176 18.95 652.37 0.00 34.42 0.01 0.53 0.53 1.00
11:47:57 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
11:48:01 41324 8211980 99.50 16992 7776084 145764 0.94
11:48:05 40844 8212460 99.51 17008 7776388 145964 0.94
11:48:09 40816 8212488 99.51 17036 7776800 145964 0.94
11:48:13 41088 8212216 99.50 17052 7777028 145968 0.94
11:48:17 42592 8210712 99.48 17072 7775396 146100 0.9
11:47:57 pswpin/s pswpout/s
11:48:01 2853.95 2658.26
11:48:05 2003.26 5399.80
11:48:09 88.59 9921.92
11:48:13 3199.30 53.15
11:47:57 pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s pgscand/s pgsteal/s %vmeff
11:48:01 4799.00 46.12 55571.93 1.00 12821.05 1859.40 0.00 1859.40 100.00
11:48:05 11153.12 15.96 22226.68 0.00 13414.46 2767.08 0.00 2767.08 100.00
11:48:09 1919.00 125.00 825.00 0.00 11345.75 461.75 0.00 461.75 100.00
11:48:13 1127.00 22.00 598.50 0.00 11227.00 262.25 0.00 262.25 100.00
11:48:17 570.00 21.00 52.75 0.00 11184.50 254.25 0.00 254.25 100.00
11:47:57 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:48:01 17 172 0.72 0.35 0.14
11:48:05 15 172 1.95 0.61 0.23
11:48:09 15 172 2.91 0.84 0.30
11:48:13 5 172 3.88 1.07 0.38
11:48:17 15 172 4.21 1.19 0.42
11:48:21 17 171 4.21 1.19 0.42
Sysstat - iostat
Shows
– Throughput
– Request merging
– Device queue information
– Service times
Hints
– Most critical parameter often is await
• average time (in milliseconds) for I/O requests issued to the device to be served.
• includes the time spent by the requests in queue and the time spent servicing them.
– Also suitable for network file systems
iostat
Output sample:
Time: 10:56:35 AM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
dasda 0.19 1.45 1.23 0.74 64.43 9.29 74.88 0.01 2.65 0.80 0.16
dasdb 0.02 232.93 0.03 9.83 0.18 975.17 197.84 0.98 99.80 1.34 1.33
Time: 10:56:36 AM
dasda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dasdb 0.00 1981.55 0.00 339.81 0.00 9495.15 55.89 0.91 2.69 1.14 38.83
Time: 10:56:37 AM
dasda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dasdb 0.00 2055.00 0.00 344.00 0.00 9628.00 55.98 1.01 2.88 1.19 41.00
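A sample like the above (timestamps, kB units, extended statistics) comes from an invocation along these lines:
#> iostat -dkxt [interval in sec]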
Sysstat - PIDSTAT
Shows
– -w context switching activity and whether it was voluntary
– -r memory statistics, especially minor/major faults per process
– -d disk throughput per process
Hints
– Also useful if run as a background log due to its low overhead
• Good extension to sadc in systems running different applications/services
– -p <pid> can be useful to track the activity of a specific process
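For example, all three views for one process every 5 seconds (the PID is hypothetical):
#> pidstat -w -r -d -p 1234 5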
Pidstat examples
12:46:18 PM PID cswch/s nvcswch/s Command
12:46:18 PM 3 2.39 0.00 smbd
12:46:18 PM 4 0.04 0.00 sshd
12:46:18 PM 1073 123.42 180.18 Xorg
12:47:51 PM PID minflt/s majflt/s VSZ RSS %MEM Command
12:47:51 PM 985 0.06 0.00 15328 3948 0.10 smbd
12:47:51 PM 992 0.04 0.00 5592 2152 0.05 sshd
12:47:51 PM 1073 526.41 0.00 1044240 321512 7.89 Xorg
12:49:18 PM PID kB_rd/s kB_wr/s kB_ccwr/s Command
12:49:18 PM 330 0.00 1.15 0.00 sshd
12:49:18 PM 2899 4.35 0.09 0.04 notes2
12:49:18 PM 3045 23.43 0.01 0.00 audacious2
Sysstat - mpstat
Shows
– -u utilization
– -I <CPU|SCPU|ALL> interrupts
Hints
– Can be restricted to selected processor(s) via -P
As one can see there are plenty of different (s)irq sources these days
– Ordered horizontally per type and vertically per cpu
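For example, hard and soft interrupt rates for all CPUs every 5 seconds:
#> mpstat -I ALL -P ALL 5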
IRQs
10:40:12 CPU EXT/s I/O/s AIO/s CLK/s EXC/s EMS/s TMR/s TAL/s
PFL/s DSD/s VRT/s SCP/s IUC/s CMS/s CMC/s CMR/s CIO/s QAI/s
DAS/s C15/s C70/s TAP/s VMR/s LCS/s CLW/s CTC/s APB/s ADM/s
CSC/s PCI/s MSI/s VIR/s VAI/s NMI/s RST/s
10:40:17 0 2.40 0.00 0.00 2.00 0.20 0.20 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:40:17 1 1.40 0.00 0.00 1.40 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00
[...]
CPU IRQ type Rate per second
SoftIRQs
10:40:26 CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s BLOCK_IOPOLL/s TASKLET/s SCHED/s
HRTIMER/s RCU/s
10:40:31 0 0.00 0.60 0.00 0.00 0.00 0.00 0.00 1.00
0.00 0.60
10:40:31 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00
[...]
DASD statistics
Characteristics: Easy to use, very detailed
Objective: Collects statistics of I/O operations on DASD devices
Usage:
– enable: echo on > /proc/dasd/statistics
– show:
• Overall cat /proc/dasd/statistics
• for individual DASDs: tunedasd -P /dev/dasda
Package: n/a for kernel interface, s390-tools for dasdstat
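A minimal session using the kernel interface directly:
#> echo on > /proc/dasd/statistics ← start gathering
(run the workload)
#> cat /proc/dasd/statistics ← show the histograms
#> echo off > /proc/dasd/statistics ← stop gathering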
Shows:
[Diagram: life of a request – Start: build the channel program and wait until the subchannel is free (histogram of I/O time till ssch); data transfer from/to the storage server (histogram between ssch and IRQ); tell the block device layer that data has arrived (histogram between IRQ and end); plus a histogram of total I/O times]
Hints
– The tool dasdstat (from the s390-tools package) handles all of that in one place
Histogram of sizes (512B secs)
0 0 9925 3605 1866 4050 4102 933 2700 2251 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Histogram of I/O times (microseconds)
0 0 0 0 0 0 0 1283 1249 6351 7496 3658 8583 805 7 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Histogram of I/O time till ssch ← look here for subchannel busy
2314 283 98 34 13 5 16 275 497 8917 5567 4232 7117 60 4 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Histogram of I/O time between ssch and irq ← look here for slow SAN
0 0 0 0 0 0 0 14018 7189 2402 1031 4758 27 4 3 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Histogram of I/O time between irq and end
2733 6 5702 9376 5781 940 1113 3781 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# of req in chanq at enqueuing (1..32)
0 2740 628 1711 1328 23024 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hints
– Also shows data per sector, which usually is only confusing
– The time consists of
• building the channel program (usually very fast)
• waiting for a free subchannel (can be long with no or too few HPAV aliases)
– Please be aware that all histogram columns in dasdstat are scaled by powers of two (2^n)
[Chart: histogram of I/O time till ssch, split by the number of requests already queued before the current one (0, 1-2, 2+)]
FCP statistics
Usage:
– enable
• CONFIG_STATISTICS=y must be set in the kernel config file
• debugfs is mounted at /sys/kernel/debug/
• For a certain LUN: in the directory
/sys/kernel/debug/statistics/zfcp-<device-bus-id>-<WWPN>-<LUN>
issue echo on=1 > definition (turn off with on=0, reset with data=reset)
– view
• cat /sys/kernel/debug/statistics/zfcp-<device-bus-id>-<WWPN>-<LUN>/data
Hint
– FCP and DASD statistics are not directly comparable, because in the FCP case many
I/O requests can be sent to the same LUN before the first response is given. There is
a queue at FCP driver entry and in the storage server
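A concrete session could look like this (device bus ID, WWPN and LUN are taken from the lsluns example later in this deck – adjust to your setup):
#> cd /sys/kernel/debug/statistics/zfcp-0.0.1700-0x500507630900c7c1-0x4020402100000000
#> echo on=1 > definition
(run the workload)
#> cat data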
FCP statistics
Shows:
– Request sizes in bytes (hexadecimal)
– Channel latency in ns Time spent on the FCP channel (internal transfer)
– Fabric latency in ns Time spent in the FCP fabric (outside transfer)
– (Overall) latencies whole time spent entry/exit of the zFCP layer in ms
[Diagram: request flow – SCSI stack preps I/O → zfcp preps I/O → I/O passes through QDIO and z/VM → queued for transmission → issued by the card (zfcp issues I/O = blktrace dispatch) → fabric → response received by the card → receive IRQ → IRQ handling by zfcp → completion handled by the SCSI stack. Channel latency covers the card-internal parts on both ends, fabric latency the outside transfer]
FCP statistics
On popular request – the “where to complain” guide
– █ App to zfcp driver: Linux developers / Distributor in general
– █ zfcp driver to hardware: zfcp/qdio (optional also z/VM) developers
– █ In system hardware to SAN: FCP card HW/FW stack
– █ External: SAN administrator
[Same request flow diagram as above, colored by area of responsibility]
per device latency statistics (f fabric; c channel)
disk rdfmin rdfmax rdfsum rdfavg rdcmin rdcmax rdcsum rdcavg rdcnt | wr* [..wr/cmd]
sda 92 130 1386 126.00 7 10 98 8.91 11
sdb 127 131 2072 129.50 7 10 140 8.75 16
sdc 126 140 2075 129.69 7 14 145 9.06 16
sdd 126 132 1160 128.89 7 8 74 8.22 9
[...]
sdbd n/a n/a 0 0.00 n/a n/a 0 0.00 0
sdbe n/a n/a 0 0.00 n/a n/a 0 0.00 0
per adapter statistics
adapter (subch/dev) rdcnt rdmb rdavgsz wrcnt wrmb wravgsz cmdcnt secactive
0.0.0004/0.0.1700 4899 18 3.76 11572 1249 110.52 240 17324
0.0.000c/0.0.1800 4901 16 3.34 11564 1265 112.02 236 17325
0.0.00d6/0.0.5100 4765 16 3.44 11595 1254 110.75 239 17318
0.0.00e2/0.0.5b00 1888 5 2.71 0 0 0.00 160 17309
iotop
Characteristics: simple, top like I/O monitor
Objective: Check which processes are doing I/O
Usage: iotop
Package: RHEL: iotop SLES: iotop
Shows
– Read/Write per thread
– Can accumulate (-a) for updating summaries instead of live views
• Useful for Disk I/O tests that don't account on their own
– Separate accounting for swap
Hints
– Can be restricted to certain processes via -p
– Has a batch mode like top
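For example, accumulated counters in batch mode, e.g. for the disk I/O tests mentioned above:
#> iotop -b -a -n 10 > iotop.log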
Lszcrypt / icastats
Characteristics: overview of s390 crypto HW and libica usage
Objective: am I really using my crypto hardware
Usage: “lszcrypt -VV[V]”, “cat /proc/icastats”
Package: RHEL: s390utils-base SLES: s390-tools
lszcrypt -VV
card02: CEX3C online hwtype=9 depth=8 request_count=443
card03: CEX3A offline hwtype=8 depth=8 request_count=0

cat /proc/icastats
function | # hardware | # software
---------+------------+-----------
SHA1     |          0 |          0
SHA224   |          0 |          0
SHA256   |          0 |          0
SHA384   |          0 |          0
SHA512   |          0 |          0
RANDOM   |     187109 |          0
MOD EXPO |          0 |          0
RSA CRT  |      93554 |          0
DES ENC  |          0 |          0
DES DEC  |          0 |          0
3DES ENC |          0 |          0
3DES DEC |          0 |          0
AES ENC  |    2574106 |          0
AES DEC  |    2075854 |          0
CMAC GEN |          0 |          0
CMAC VER |          0 |          0
lsqeth
Characteristics: overview of network devices
Objective: check your network devices basic setup
Usage: “lsqeth -p”
Package: RHEL: s390-utils-base SLES: s390-tools
lsqeth -p
devices CHPID interface cardtype port chksum prioq'ing rtr4 rtr6 lay'2 cnt
0.0.e000/0.0.e001/0.0.e002 x84 eth1 OSD_10GIG 0 sw always_q_2 n/a n/a 1 64
0.0.e100/0.0.e101/0.0.e102 x85 eth2 OSD_10GIG 0 sw always_q_2 n/a n/a 1 64
0.0.f200/0.0.f201/0.0.f202 x6B eth0 OSD_1000 0 hw always_q_2 no no 0 64
Ethtool I
Characteristics: overview of network device capabilities / offload settings
Objective: check your network device (offload) settings
Usage: “ethtool <dev>”, “ethtool -k <dev>”
Package: RHEL: ethtool SLES: ethtool
ethtool eth1
Settings for eth1:
Supported ports: [ FIBRE ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
10000baseT/Full
Supported pause frame use: No
Supports autonegotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
10000baseT/Full
Advertised pause frame use: No
Advertised autonegotiation: Yes
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Autonegotiation: on
Link detected: yes
Ethtool II
Offload Settings via “ethtool -k <dev>”
Changes via upper case “-K”
ethtool -k eth1 […]
Features for eth1:
rx-checksumming: off [fixed]
tx-checksumming: off
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: off
tx-scatter-gather: off [fixed]
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off [fixed]
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
[...]
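For example, to enable TCP segmentation offload on a device whose driver supports it:
#> ethtool -K eth1 tso on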
Prepare
– System and workload descriptions
– Healthy system data for comparison
Gather
– In case of emergency analysis
Report
– How to report a problem description
Solve
– Tools to start an analysis
[Diagram: Prepare → Gather → Report → Solve cycle – a problem occurs, assistance is needed, more info is gathered, lessons learned feed back into preparation]
End of Part II
The one you should always have → IBM System z Enterprise
Agenda
STRACE
Shows
– Identify kernel entries called too often or taking too long
• Can be useful if you search for increased system time
– Time in call (-T)
– Relative timestamp (-r)
Hints
– The option “-c” allows medium overhead by just tracking counters and durations
strace - example
% time reports shares, useful to rate the importance of a call
The columns “seconds”, “usecs/call”, “calls” and “errors” can reveal whether one has a lot of, slow, or failing calls
syscall references are available as “man pages”
strace -cf -p 26802
Process 26802 attached - interrupt to quit
^C Process 26802 detached
% time seconds usecs/call calls errors syscall
58.43 0.007430 17 450 read
24.33 0.003094 4 850 210 access
5.53 0.000703 4 190 10 open
4.16 0.000529 3 175 write
2.97 0.000377 2 180 munmap
1.95 0.000248 1 180 close
1.01 0.000128 1 180 mmap
0.69 0.000088 18 5 fdatasync
0.61 0.000078 0 180 fstat
0.13 0.000017 3 5 pause
100.00 0.012715 2415 225 total
LTRACE
Shows
– Identify library calls that happen too often or take too long
• Good if you search for additional user time
• Good if things changed after upgrading libraries
– Time in call (-T)
– Relative timestamp (-r)
Hints
– The option “-c” allows medium overhead by just tracking counters and durations
– The option “-S” allows combining ltrace and strace
ltrace - example
% time reports shares, useful to rate the importance of a call
The columns “seconds”, “usecs/call” and “calls” can reveal whether one has a lot of or slow calls
Most function references are available as “man pages”
ltrace -cf -p 26802
% time seconds usecs/call calls function
98.33 46.765660 5845707 8 pause
0.94 0.445621 10 42669 strncmp
0.44 0.209839 25 8253 fgets
0.08 0.037737 11 3168 __isoc99_sscanf
0.07 0.031786 20 1530 access
0.04 0.016757 10 1611 strchr
0.03 0.016479 10 1530 snprintf
0.02 0.010467 1163 9 fdatasync
0.02 0.008899 27 324 fclose
0.02 0.007218 21 342 fopen
0.01 0.006239 19 315 write
0.00 0.000565 10 54 strncpy
100.00 47.560161 59948 total
Example strace'ing a sadc data gatherer
0.000028 write(3, "\0\0\0\0\0\0\0\17\0\0\0\0\0\0\0"..., 680) = 680 <0.000007>
0.000027 write(3, "\0\0\0\0\0\0\0\17\0\0\0\0\0\0\0"..., 680) = 680 <0.000007>
0.000026 fdatasync(3) = 0 <0.002673>
0.002688 pause() = 0 <3.972935>
3.972957 SIGALRM (Alarm clock) @ 0 (0)
0.000051 rt_sigaction(SIGALRM, {0x8000314c, [ALRM], SA_RESTART}, 8) = 0 <0.000005>
0.000038 alarm(4) = 0 <0.000005>
0.000031 sigreturn() = ? (mask now []) <0.000005>
0.000024 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2309, ...}) = 0 <0.000007>
0.000034 open("/proc/uptime", O_RDONLY) = 4 <0.000009>
0.000024 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 <0.000005>
0.000029 mmap(NULL, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x3fffd20a000 <0.000006>
0.000028 read(4, "11687.70 24836.04\n", 1024) = 18 <0.000010>
0.000027 close(4) = 0 <0.000006>
0.000020 munmap(0x3fffd20a000, 4096) = 0 <0.000009>
lsof
Shows
– List of files including sockets, directories, pipes
– User, Command, Pid, Size, Device
– File Type and File Flags
Hints
– +fg reports file flags which can provide a good cross check opportunity
lsof - example
COMMAND PID TID USER FD TYPE FILEFLAG DEVICE SIZE/OFF NODE NAME
crond 16129 root mem REG 94,1 165000 881893
/usr/lib64/ld-2.16.so
crond 16129 root 0r CHR LG 1,3 0t0 2051 /dev/null
crond 16129 root 1u unix RW 0x0000001f1ba02000 0t0 106645 socket
crond 16129 root 2u unix RW 0x0000001f1ba02000 0t0 106645 socket
crond 16129 root 4r a_inode 0x80000 0,9 0 6675 inotify
crond 16129 root 5u unix RW,0x80000 0x0000001f5d3ad000 0t0 68545 socket
dd 17617 root cwd DIR 94,1 4096 16321 /root
dd 17617 root rtd DIR 94,1 4096 2 /
dd 17617 root txt REG 94,1 70568 1053994 /usr/bin/dd
dd 17617 root mem REG 94,1 165000 881893
/usr/lib64/ld-2.16.so
dd 17617 root 0r CHR LG 1,9 0t0 2055 /dev/urandom
dd 17617 root 1w REG W,DIR,LG 94,1 5103616 16423 /root/test
dd 17617 root 2u CHR RW,LG 136,2 0t0 5 /dev/pts/2
lsluns
Characteristics: overview of multipathing
Objective: check your multipath setup hierarchy
Usage: “lsluns -a”
Package: RHEL: s390utils-base SLES: s390-tools
lsluns -a
adapter = 0.0.1700
port = 0x500507630900c7c1
lun = 0x4020402100000000 /dev/sg0 Disk IBM:2107900
lun = 0x4020402200000000 /dev/sg1 Disk IBM:2107900
lun = 0x4020402300000000 /dev/sg2 Disk IBM:2107900
lun = 0x4021402100000000 /dev/sg3 Disk IBM:2107900
lun = 0x4021402200000000 /dev/sg4 Disk IBM:2107900
lun = 0x4021402300000000 /dev/sg5 Disk IBM:2107900
adapter = 0.0.1780
port = 0x500507630903c7c1
lun = 0x4020402100000000 /dev/sg17 Disk IBM:2107900
lun = 0x4020402200000000 /dev/sg23 Disk IBM:2107900
lun = 0x4020402300000000 /dev/sg32 Disk IBM:2107900
lun = 0x4021402100000000 /dev/sg39 Disk IBM:2107900
lun = 0x4021402200000000 /dev/sg43 Disk IBM:2107900
lun = 0x4021402300000000 /dev/sg46 Disk IBM:2107900
[...]
Lsluns provides a hierarchical view which often easily identifies missing paths,
adapters or similar imbalances
Adapter to WWPN associations can follow competing targets
– Low overhead, max fallback capability, best performance, ...
Multipath -ll
Characteristics: overview of multipathing
Objective: check your multipath setup configuration
Usage: “multipath -ll”
Package: RHEL: device-mapper-multipath SLES: multipath-tools
multipath -ll
swap3of6 (36005076309ffc7c10000000000002022) dm-2 IBM,2107900
size=256G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=active
  |- 0:0:20:1075986464 sdb  8:16   active ready running
  |- 1:0:22:1075986464 sdx  65:112 active ready running
  |- 2:0:21:1075986464 sdh  8:112  active ready running
  |- 3:0:20:1075986464 sdn  8:208  active ready running
  |- 4:0:26:1075986464 sdz  65:144 active ready running
  |- 5:0:19:1075986464 sdy  65:128 active ready running
  |- 7:0:25:1075986464 sdac 65:192 active ready running
  `- 6:0:24:1075986464 sdad 65:208 active ready running
[...]
Hyptop
Shows
– CPU load & Management overhead
– Memory usage (only under z/VM)
– Can show image overview or single image details
Hints
– Good “first view” tool for linux admins that want to look “out of their linux”
– Requirements:
• For z/VM the Guest needs Class B
• For LPAR “Global performance data control” checkbox in HMC
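Typical invocations (a sketch; the guest name is hypothetical):
#> hyptop ← image overview
#> hyptop -w sys -s LNXGUEST1 ← details of a single image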
DSTAT
Hints
– Powerful plug-in concept
• “--top-io” for example identifies the application causing the most I/O
– Colorization allows fast identification of deviations
----total-cpu-usage---- --inter- ---system-- -----virtual-memory----
usr sys idl wai hiq siq |    1   |  int  csw | majpf minpf alloc free
  0   0 100   0   0   0 |    9   |  145  279 |     0   307   195  689
  1   2  97   0   0   0 |    4   |  12k 1119 |   22k  150k  308k 229k  [... many more columns]
  1   3  96   0   0   0 |    3   |  892  960 |   69k  254k  304k 304k
  0   0 100   0   0   0 |    2   |  309  619 |     2     8    20    1
  1   3  96   0   0   0 |    1   |  594  875 |   79k  258k  282k 282k
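An invocation producing a similar view could look like this (a sketch – plugin and column names vary between dstat versions):
#> dstat -c -i -y --vm --top-io 5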
HTOP
Hints
– Htop can display more uncommon fields (in menu)
– Able to send signals out of its UI for administration purposes
– Processes can be sorted/filtered for a more condensed view
htop example
[Screenshot: configurable utilization visualization, process hierarchy, common process info, accumulated usage and I/O rates]
netstat
Characteristics: Easy to use, connection information
Objective: Lists connections
Usage: netstat -eeapn
Package: RHEL: net-tools SLES: net-tools
Shows
– Information about each connection
– Various connection states
Hints
– Inodes and program names are useful to reverse-map ports to applications
netstat -s
Characteristics: Easy to use, very detailed information
Objective: Display summary statistics for each protocol
Usage: netstat -s
Shows
– Information to each protocol
– Amount of incoming and outgoing packages
– Various error states, for example TCP segments retransmitted!
Hints
– Shows accumulated values since system start, therefore mostly the differences
between two snapshots are needed (see the sketch below)
– There is always a low amount of packets in error or resets
– Retransmits occur only when the system is sending data
• When the system is not able to receive, the sender shows the retransmits
– Use sadc/sar to identify the device
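A minimal way to get those differences:
#> netstat -s > snap1 ; sleep 10 ; netstat -s > snap2
#> diff snap1 snap2 ← counters that changed within the 10 seconds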
netstat -s
Output sample:
Tcp:
15813 active connections openings
35547 passive connection openings
305 failed connection attempts
0 connection resets received
6117 connections established
81606342 segments received
127803327 segments send out
288729 segments retransmitted
0 bad segments received.
6 resets sent
Socket statistics
Characteristics: Information on socket level
Objective: Check socket options and weird connection states
Usage: ss -aempi
Package: RHEL: iproute-2 SLES: iproute2
Shows
– Socket options
– Socket receive and send queues
– Inode, socket identifiers
Sample output
ss -aempi
State RecvQ SendQ Local Address:Port Peer Address:Port
LISTEN 0 128 :::ssh :::*
users:(("sshd",959,4)) ino:7851 sk:ef858000 mem:(r0,w0,f0,t0)
Hints
– Inode numbers can assist reading strace logs
– Check long outstanding queue elements
IPTRAF
Shows
– Details per Connection / Interface
– Statistical breakdown of ports / packet sizes
– LAN station monitor
Hints
– Can be used for background logging as well (see the sketch below)
• Use SIGUSR1 and logrotate to handle the growing amount of data
– Knowledge of packet sizes is important for the right tuning
– There are various other tools: iftop, bmon, …
• like with network benchmarks, no single one seems to fit all
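A background logging sketch (assuming the classic iptraf options; interface and path are illustrative):
#> iptraf -i eth0 -B -L /var/log/iptraf-eth0.log
#> kill -USR1 $(pidof iptraf) ← reopen the log, e.g. from a logrotate hook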
iptraf
Questions that usually can be addressed
– Connection behavior overview
– Do you have peaks in your workload characteristic
– Who does your host really communicate with
Comparison to wireshark
– Not as powerful, but much easier and faster to use
– Lower overhead and no sniffing needed (often prohibited)
[Screenshots: statistical breakdown by packet size and per-interface details]
Agenda
Perf
Shows
– Sampling for CPU hotspots
• Annotated source code along hotspots
• list functions according to their usage
– CPU event counters
– Further integrated non-sampling tools
Hints
– Without HW support only userspace can be reasonably profiled
– “successor” of oprofile that is available with HW support (SLES11-SP2)
– Perf HW support upstream but still experimental
Perf profiling
# Baseline Delta Symbol
# ........ .......... ................... ......
#
12.14% +8.07% [kernel.kallsyms] [k] lock_acquire
8.96% +5.50% [kernel.kallsyms] [k] lock_release
4.83% +0.38% reaim [.] add_long
4.22% +0.41% reaim [.] add_int
4.10% +2.49% [kernel.kallsyms] [k] lock_acquired
3.17% +0.38% libc-2.11.3.so [.] msort_with_tmp
3.56% -0.37% reaim [.] string_rtns_1
3.04% -0.38% libc-2.11.3.so [.] strncat
perf stat -B -e cycles,instructions,r20,r21,r3,r5,sched:sched_wakeup find / -iname "*foobar*"
Performance counter stats for 'find / iname *foobar*':
3,623,031,935 cycles # 0.000 GHz
1,515,404,340 instructions # 0.42 insns per cycle
1,446,545,776 PROBLEM_STATE_CPU_CYCLES
757,589,098 PROBLEM_STATE_INSTRUCTIONS
705,740,759 L1I_PENALTY_CYCLES
576,226,424 L1D_PENALTY_CYCLES
40,675 sched:sched_wakeup
6.156288957 seconds time elapsed
Still experimental
Events
– Cycles/Instructions globally
– Note: counters are now readable, but aliases can still be used
• e.g. r20 = PROBLEM_STATE_CPU_CYCLES
• list of all existing events: lscpumf -C
• counters available to you: lscpumf -c
– Not only HW events, you can use any of the currently 163 tracepoints
Slabtop
Shows
– Active / Total object number/size
– Objects per slab
– Object name and size
Hints
– -o is a one time output, e.g. to gather debug data
– No matter whether the kernel uses slab, slob or slub, the tool is always slabtop
Slabtop - example
Active / Total Objects (% used) : 2436408 / 2522983 (96.6%)
Active / Total Slabs (% used) : 57999 / 57999 (100.0%)
Active / Total Caches (% used) : 75 / 93 (80.6%)
Active / Total Size (% used) : 793128.19K / 806103.80K (98.4%)
Minimum / Average / Maximum Object : 0.01K / 0.32K / 8.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
578172 578172 100% 0.19K 13766 42 110128K dentry
458316 458316 100% 0.11K 12731 36 50924K sysfs_dir_cache
368784 368784 100% 0.61K 7092 52 226944K proc_inode_cache
113685 113685 100% 0.10K 2915 39 11660K buffer_head
113448 113448 100% 0.55K 1956 58 62592K inode_cache
111872 44251 39% 0.06K 1748 64 6992K kmalloc-64
54688 50382 92% 0.25K 1709 32 13672K kmalloc-256
40272 40239 99% 4.00K 5034 8 161088K kmalloc-4096
39882 39882 100% 0.04K 391 102 1564K ksm_stable_node
38505 36966 96% 0.62K 755 51 24160K shmem_inode_cache
37674 37674 100% 0.41K 966 39 15456K dm_rq_target_io
BLKTRACE
Hints
– Filter masks allow lower overhead if only specific events are of interest
– Has an integrated client/server mode to stream data away
• Avoids extra disk I/O on a system with disk I/O issues
Blktrace allows one to
– Analyze Disk I/O characteristics like sizes and offsets
• Maybe your I/O is split in a layer below
– Analyze the timing with details about all involved Linux layers
• Often useful to decide if HW or SW causes stalls
– Summaries per CPU / application can identify imbalances
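Typical usage (the device name dasda is illustrative):
#> blktrace -d /dev/dasda -o - | blkparse -i - ← live trace and parse
#> btrace /dev/dasda ← shortcut script doing the same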
Blktrace - events
Common:
A -- remap For stacked devices, incoming i/o is remapped to device below it in the i/o stack. The remap action details what exactly is being remapped to
what.
Q -- queued This notes intent to queue i/o at the given location. No real requests exists yet.
G -- get request To send any type of request to a block device, a struct request container must be allocated first.
I -- inserted A request is being sent to the i/o scheduler for addition to the internal queue and later service by the driver. The request is fully formed at this
time.
D -- issued A request that previously resided on the block layer queue or in the i/o scheduler has been sent to the driver.
C -- complete A previously issued request has been completed. The output will detail the sector and size of that request, as well as the success or failure
of it.
Special:
B -- bounced The data pages attached to this bio are not reachable by the hardware and must be bounced to a lower memory location. This causes a big
slowdown in i/o performance, since the data must be copied to/from kernel buffers. Usually this can be fixed with using better hardware -- either a better
i/o controller, or a platform with an IOMMU.
S -- sleep No available request structures were available, so the issuer has to wait for one to be freed.
X -- split On raid or device mapper setups, an incoming i/o may straddle a device or internal zone and needs to be chopped up into smaller pieces for
service. This may indicate a performance problem due to a bad setup of that raid/dm device, but may also just be part of normal boundary conditions. dm
is notably bad at this and will clone lots of i/o.
blktrace
Example Case
– The snippet shows a lot of 4k requests (8x512 byte sectors)
• We expected the I/O to be 32k
– Each one is dispatched separately (no merges)
• This caused unnecessary overhead and slow I/O
Maj/Min CPU Seqnr sec.nsec pid Action RWBS sect + size map source / task
94,4 27 21 0.059363692 18994 A R 20472832 + 8 <- (94,5) 20472640
94,4 27 22 0.059364630 18994 Q R 20472832 + 8 [qemu-kvm]
94,4 27 23 0.059365286 18994 G R 20472832 + 8 [qemu-kvm]
94,4 27 24 0.059365598 18994 I R 20472832 + 8 ( 312) [qemu-kvm]
94,4 27 25 0.059366255 18994 D R 20472832 + 8 ( 657) [qemu-kvm]
94,4 27 26 0.059370223 18994 A R 20472840 + 8 <- (94,5) 20472648
94,4 27 27 0.059370442 18994 Q R 20472840 + 8 [qemu-kvm]
94,4 27 28 0.059370880 18994 G R 20472840 + 8 [qemu-kvm]
94,4 27 29 0.059371067 18994 I R 20472840 + 8 ( 187) [qemu-kvm]
94,4 27 30 0.059371473 18994 D R 20472840 + 8 ( 406) [qemu-kvm]
blktrace
Example Case
– Analysis turned out that the I/O was from the swap code
• Same offsets were written by kswapd
– A recent code change there disabled the ability to merge I/O
– The summary below shows the difference after a fix
Total initially
Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB
Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB
Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB
IO unplugs: 149,614 Timer unplugs: 2,940
Total after Fix
Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB
Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB
Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB
IO unplugs: 337,130 Timer unplugs: 11,184
ziomon
Characteristics: in depth zfcp based I/O analysis
Objective: Analyze your FCP based I/O
Usage: “ziomon” → “ziorep*”
Package: RHEL: s390utils(-ziomon) SLES: s390-tools
[Diagram: ziomon collects data (.log/.agg/.cfg) on the target system; the ziorep tools (ziorep_config, ziorep_utilization, …) turn it into reports and .csv files]
Be aware that ziomon can be memory greedy if you have very memory constrained systems
The tool has many extra functions; please check out the live virtual class of Stefan Raspl
– PDF: http://www.vm.ibm.com/education/lvc/LVC0425.pdf
– Replay: http://ibmstg.adobeconnect.com/p7zvdjz0yye/
TCPDump
Characteristics: dumps network traffic to console/file
Objective: analyze packets of applications manually
Usage: “tcpdump ...”
Package: RHEL: tcpdump SLES: tcpdump
tcpdump host pserver1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:30:00.326581 IP pserver1.boeblingen.de.ibm.com.38620 > p10lp35.boeblingen.de.ibm.com.ssh: Flags [.], ack 3142, win
102, options [nop,nop,TS val 972996696 ecr 346994], length 0
13:30:00.338239 IP p10lp35.boeblingen.de.ibm.com.ssh > pserver1.boeblingen.de.ibm.com.38620: Flags [P.], seq 3142:3222,
ack 2262, win 2790, options [nop,nop,TS val 346996 ecr 972996696], length 80
13:30:00.375491 IP pserver1.boeblingen.de.ibm.com.38620 > p10lp35.boeblingen.de.ibm.com.ssh: Flags [.], ack 3222, win
102, options [nop,nop,TS val 972996709 ecr 346996], length 0
[...]
^C
31 packets captured
31 packets received by filter
0 packets dropped by kernel
Java Health Center
Differences
– Profiling a JVM might hide the Java methods
– Memory allocation of the JVM isn't the allocation of the Application
Shows
– Memory usage
– Method Profiling
– I/O Statistics
– Class loading
– Locking
Hints
– Low overhead, therefore even suitable for monitoring
– Agent activation -Xhealthcenter:port=12345
– Can trigger dumps or verbosegc for in-depth memory analysis
The method profiling view will show the name of the methods and their time
consumption as percentage
Java - Garbage Collection and Memory Visualizer
Shows
– Memory usage
– Garbage Collection activities
– Pauses
– Memory Leaks by stale references
Hints
– GCMV can also compare output of two runs
– Activate verbose logs: -verbose:gc -Xverbosegclog:<log_file>
Java - Jinsight
Shows
– Call Stack and time
Hints
– Significant slowdown, not applicable to production systems
– No more maintained, but so far still working
[Screenshot: threads across the horizontal axis, execution time running down]
– Many horizontal stages mean deep call stacks
– Long vertical areas mean long method execution
– Rectangles full of horizontal lines can be an issue
End of Part IV
The one you should always have → IBM System z Enterprise
Agenda
Cachestat
Characteristics: Simple per page views of caching
Objective: Detect what parts of a file are in page cache
Usage: Write your own – or search for example code
Package: n/a (pure code around the mincore system call)
Shows
– How much of a file is in cache
Hints
– We are now going from unsupported to non-existent packages
– Still the insight can be so useful, it is good to know
Cachestat usage
./cachestat -v ../Music/mysong.flac
pages in cache: 445/12626 (3.5%) [filesize=50501.0K, pagesize=4K]
cache map:
0: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
32: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
64: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
[...]
320: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
352: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
384: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
416: |x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x| | | | |
448: | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
480: | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
[...]
12576: | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
12608: | | | | | | | | | | | | | | | | | |x|
smem
Hints
– Has visual output (pie charts) and filtering options as well
– No support for huge pages or transparent huge pages (kernel interface missing)
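Typical invocations, matching the two tables below (a sketch):
#> smem ← per process view
#> smem -m ← per mapping view
#> smem --pie=name ← visual output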
PID User Command Swap VSS USS PSS RSS
1860 root /sbin/agetty -s sclp_line0 0 2.1M 92.0K 143.0K 656.0K
1861 root /sbin/agetty -s ttysclp0 11 0 2.1M 92.0K 143.0K 656.0K
493 root /usr/sbin/atd -f 0 2.5M 172.0K 235.0K 912.0K
1882 root /sbin/udevd 0 2.8M 128.0K 267.0K 764.0K
1843 root /usr/sbin/crond -n 0 3.4M 628.0K 693.0K 1.4M
514 root /bin/dbus-daemon --system 0 3.2M 700.0K 771.0K 1.5M
524 root /sbin/rsyslogd -n -c 5 0 219.7M 992.0K 1.1M 1.9M
2171 root ./hhhptest 0 5.7G 1.0M 1.2M 3.2M
1906 root bash 0 103.8M 1.4M 1.5M 2.1M
2196 root ./hhhptest 0 6.2G 2.0M 2.2M 3.9M
1884 root sshd: root@pts/0 0 13.4M 1.4M 2.4M 4.2M
1 root /sbin/init 0 5.8M 2.9M 3.0M 3.9M
2203 root /usr/bin/python /usr/bin/sm 0 109.5M 6.1M 6.2M 6.9M
Map Count PIDs Swap VSS USS RSS PSS AVGRSS AVGPSS
[stack:531] 1 1 0 8.0M 0 0 0 0 0
[vdso] 25 25 0 200.0K 0 132.0K 0 5.0K 0
/dev/zero 2 1 0 2.5M 4.0K 4.0K 4.0K 4.0K 4.0K
/usr/lib64/sasl2/libsasldb.so.2.0.23 2 1 0 28.0K 4.0K 4.0K 4.0K 4.0K 4.0K
/bin/dbus-daemon 3 1 0 404.0K 324.0K 324.0K 324.0K 324.0K 324.0K
/usr/sbin/sshd 6 2 0 1.2M 248.0K 728.0K 488.0K 364.0K 244.0K
/bin/systemd 2 1 0 768.0K 564.0K 564.0K 564.0K 564.0K 564.0K
/bin/bash 2 1 0 1.0M 792.0K 792.0K 792.0K 792.0K 792.0K
[stack] 25 25 0 4.1M 908.0K 976.0K 918.0K 39.0K 36.0K
/lib64/libc-2.14.1.so 75 25 0 40.8M 440.0K 9.3M 1.2M 382.0K 48.0K
/lib64/libcrypto.so.1.0.0j 8 4 0 7.0M 572.0K 2.0M 1.3M 501.0K 321.0K
[heap] 16 16 0 8.3M 6.4M 6.9M 6.6M 444.0K 422.0K
<anonymous> 241 25 0 55.7G 20.6M 36.2M 22.3M 1.4M 913.0K
smem - visualizations
[Screenshots: pie chart and bar visualizations of memory usage]
Valgrind
Shows
– Memory leaks
– Cache profiling
– Heap profiling
Hints
– Runs on binaries, therefore easy to use
– Debug Info not required but makes output more useful
Valgrind Overview
[Diagram: valgrind translates the binary into its intermediate representation (IR), adds instrumentation, and translates it back to machine code as a new binary – the original code never runs directly; library calls are wrapped via a preload library, some of the system calls by wrappers, and the kernel below stays unchanged]
# valgrind buggy_program
==2799== Memcheck, a memory error detector
==2799== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==2799== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==2799== Command: buggy_program
==2799==
==2799== HEAP SUMMARY:
==2799== in use at exit: 200 bytes in 2 blocks
==2799== total heap usage: 2 allocs, 0 frees, 200 bytes allocated
==2799==
==2799== LEAK SUMMARY:
==2799== definitely lost: 100 bytes in 1 blocks
==2799== indirectly lost: 0 bytes in 0 blocks
==2799== possibly lost: 0 bytes in 0 blocks
==2799== still reachable: 100 bytes in 1 blocks
==2799== suppressed: 0 bytes in 0 blocks
==2799== Rerun with --leak-check=full to see details of leaked memory
[...]
Important parameters:
– --leak-check=full
– --track-origins=yes
Valgrind - Tools
Several tools
– Memcheck (default): detects memory and data flow problems
– Cachegrind: cache profiling
– Massif: heap profiling
– Helgrind: thread debugging
– DRD: thread debugging
– None: no debugging (for valgrind JIT testing)
– Callgrind: codeflow and profiling
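Selecting a tool (memcheck is the default; myprog is a placeholder):
#> valgrind ./myprog ← memcheck
#> valgrind --tool=cachegrind ./myprog ← cache profiling
#> valgrind --tool=massif ./myprog ← heap profiling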
IRQ Statistics
Shows
– Which interrupts happen on which cpu
– Where softirqs and tasklets take place
Hints
– Recent Versions (SLES11-SP2) much more useful due to better naming
– If interrupts are unintentionally unbalanced
– If the amount of interrupts matches I/O
• This can point to non-working IRQ avoidance
– Essentially this is the data source for “mpstat -I” which we had before
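The underlying data comes straight from procfs:
#> cat /proc/interrupts ← hardware interrupts per CPU
#> cat /proc/softirqs ← softirqs per CPU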
IRQ Statistics
Example
– Network focused on CPU zero (in this case unwanted)
– Scheduler covered most of that avoiding idle CPU 1-3
– But caused a lot of migrations, IPIs and cache misses
CPU0 CPU1 CPU2 CPU3
EXT: 21179 24235 22217 22959
I/O: 1542959 340076 356381 325691
CLK: 15995 16718 15806 16531 [EXT] Clock Comparator
EXC: 255 325 332 227 [EXT] External Call
EMS: 4923 7129 6068 6201 [EXT] Emergency Signal
TMR: 0 0 0 0 [EXT] CPU Timer
TAL: 0 0 0 0 [EXT] Timing Alert
PFL: 0 0 0 0 [EXT] Pseudo Page Fault
DSD: 0 0 0 0 [EXT] DASD Diag
VRT: 0 0 0 0 [EXT] Virtio
SCP: 6 63 11 0 [EXT] Service Call
IUC: 0 0 0 0 [EXT] IUCV
CPM: 0 0 0 0 [EXT] CPU Measurement
CIO: 163 310 269 213 [I/O] Common I/O Layer Interrupt
QAI: 1541773 338857 354728 324110 [I/O] QDIO Adapter Interrupt
DAS: 1023 909 1384 1368 [I/O] DASD
[…] 3215, 3270, Tape, Unit Record Devices, LCS, CLAW, CTC, AP Bus, Machine Check
IRQ Statistics II
Also softirqs can be tracked which can be useful to
– check if tasklets execute as intended
– See if network, scheduling and I/O behave as expected
CPU0 CPU1 CPU2 CPU3
HI: 498 1522 1268 1339
TIMER: 5640 914 664 643
NET_TX: 15 16 52 32
NET_RX: 18 34 87 45
BLOCK: 0 0 0 0
BLOCK_IOPOLL: 0 0 0 0
TASKLET: 13 10 44 20
SCHED: 8055 702 403 445
HRTIMER: 0 0 0 0
RCU: 5028 2906 2794 2564
Wireshark
Characteristics: Analyzes captured network traffic
Objective: In depth analysis of handshakes, missing replies, protocols, ...
Usage: Dump in libpcap or pcap-ng format (tcpdump, dumpcap)
then analyze on remote system via “wireshark”
Package: RHEL: wireshark SLES: wireshark
Custom columns and profiles are important to visualize what you want to look for
For more details you might start at
– The SHARE sessions of Mathias Burkhard
https://share.confex.com/share/121/webprogram/Session13282.html
– Official documentation http://www.wireshark.org/docs/wsug_html/
Wireshark example
1. Dump via “tcpdump -w” or wireshark's “dumpcap”
2. analyze on remote system via “wireshark”
tcpdump host pserver1 -w traceme
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C40 packets captured
40 packets received by filter
0 packets dropped by kernel
[scp to my system]
wireshark traceme
Tracepoints (Events)
Shows
– Timestamp and activity name
– Tracepoints can provide event specific context data
– Infrastructure adds extra common context data like cpu, preempts depth, ...
Hints
– Very powerful and customizable, there are hundreds of tracepoints
• Some tracepoints have tools to access them – “perf sched” and “blktrace” are both based on them
• Others need custom postprocessing
– There are many more things you can handle with tracepoints – check out
Kernel Documentation/trace/tracepoint-analysis.txt (via perf stat)
Kernel Documentation/trace/events.txt (custom access)
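A minimal sketch of the custom access path from events.txt (the net_dev_xmit tracepoint also appears in the output below):
#> echo 1 > /sys/kernel/debug/tracing/events/net/net_dev_xmit/enable
#> cat /sys/kernel/debug/tracing/trace_pipe
#> echo 0 > /sys/kernel/debug/tracing/events/net/net_dev_xmit/enable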
– Output
#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /     delay
#  cmd    pid  ||||| time  |   caller
#     \   /    |||||  \    |   /
<...>-24116 0..s. 486183281us+: net_dev_xmit: dev=eth5 skbaddr=0000000075b7e3e8 len=67 rc=0
<idle>-0 0..s. 486183303us+: netif_receive_skb: dev=eth5 skbaddr=000000007ecc6e00 len=53
<idle>-0 0.Ns. 486183306us+: napi_poll: napi poll on napi struct 000000007d2479a8 for device eth
<...>-24116 0..s. 486183311us+: net_dev_queue: dev=eth5 skbaddr=0000000075b7e3e8 len=67
<...>-24116 0..s. 486183317us+: net_dev_xmit: dev=eth5 skbaddr=0000000075b7e3e8 len=67 rc=0
Example postprocessed
SUM COUNT AVERAGE MIN MAX STDDEV
P2Q: 8478724 1572635 5.39 4 2140 7.41
Q2S: 12188675 1572638 7.65 3 71 4.89
S2R: 38562294 1572636 24.42 1 2158 9.08
R2P: 4197486 1572633 2.57 1 43 2.39
SUM: 63427179 1572635 40.03
SUM COUNT AVERAGE MIN MAX STDDEV
P2Q: 7191885 1300897 5.53 4 171 1.31
Q2S: 10622270 1300897 8.17 3 71 5.99
S2R: 32078550 1300898 24.66 2 286 5.88
R2P: 3707814 1300897 2.85 1 265 2.59
SUM: 53600519 1300897 41.20
– Confirmed that ~all of the 1.2 ms were lost inside Linux (not in the fabric)
– Confirmed that it was not at/between specific function tracepoints
• Eventually it was an interrupt locality issue causing bad caching
Systemtap
Characteristics: tool to “tap” into the kernel for analysis
Objective: analyze in kernel values or behavior that otherwise would be
inaccessible or require a modification/recompile cycle
Usage (mini example): “stap -v -e 'probe vfs.read { printf("read performed\n"); exit() }'”
Package: RHEL: systemtap + systemtap-runtime SLES: systemtap
Also requires kernel debuginfo and source/devel packages
For more, check out “Using SystemTap on Linux on System z” from Mike O'Reilly
https://share.confex.com/share/118/webprogram/Handout/Session10452/atlanta.pdf
End of Part V
The one you should always have → IBM System z Enterprise
Questions
ehrhardt@de.ibm.com