Larry Shak Perf Summit1 2009 Final
Agenda
Section 1: System Overview
Section 2: Analyzing System Performance
Section 3: Tuning Red Hat Enterprise Linux
Section 4: Performance Analysis and Tuning Examples
References
Processors Supported/Tested
RHEL4 limitations:
  x86: 32
  x86_64: 8, 64 (LargeSMP)
  ia64: 64, 512 (SGI)
RHEL5 limitations:
  x86: 32
  x86_64: 255
  ia64: 64, 1024 (SGI)
Processor types
Uni-Processor
Symmetric Multi Processor
Multi-Core
Symmetric Multi-Threading (Hyper-Threading)
Combinations
physical id        <socket #>
siblings: 16       <logical cpus per socket>
core id: 0         <core # in socket>
cpu cores: 8       <physical cores per socket>

# cat /sys/devices/system/node/node*/cpulist
node0: 0-3
node1: 4-7
[Diagram: memory interleaving. "Interleaved (Non-NUMA)": memory is interleaved across Nodes 0-3; "Non-Interleaved (NUMA)": each of Nodes 0-3 keeps its memory local to its own CPUs (C0, C1).]
NUMA Support
RHEL4 NUMA Support
NUMA aware memory allocation policy
NUMA aware memory reclamation
Multi-core support
Memory Management
RHEL5 memory limits:
  x86: 4GB, 16GB
  x86_64: 512GB / 1TB
  ia64: 2TB
12
Memory Zones
32-bit:
  Highmem Zone: 896MB (or 3968MB) up to end of RAM (up to 64GB with PAE)
  Normal Zone: 16MB - 896MB (or 3968MB)
  DMA Zone: 0 - 16MB
64-bit:
  Normal Zone: 4GB up to end of RAM
  DMA32 Zone: 16MB - 4GB
  DMA Zone: 0 - 16MB
Zone usage, 32-bit:
  DMA: 24bit I/O
  Normal: Kernel Static; Kernel Dynamic (slabcache, bounce buffers, driver allocations); User Overflow
  Highmem (x86): User (Anonymous, Pagecache, Pagetables)
Zone usage, 64-bit:
  DMA: 24bit I/O
  DMA32: 32bit I/O, Normal overflow
  Normal: Kernel Static; Kernel Dynamic (slabcache, bounce buffers, driver allocations); User (Anonymous, Pagecache, Pagetables)
15
Per-Zone Resources
RAM
mem_map
Page lists: free, active and inactive
Page allocation and reclamation
Page reclamation watermarks
16
mem_map
17
18
Normal: 217*4kB 207*8kB 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 3468kB
HighMem: 847*4kB 409*8kB 17*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 7924kB
Memory allocation failures:
  Freelist exhaustion.
  Freelist fragmentation.
19
20
[Diagram: 64-bit NUMA memory zones. Node 0 contains the DMA Zone (0-16MB), the DMA32 Zone (16MB-4GB) and a Normal Zone above 4GB; Node 1 contains only a Normal Zone.]
21
32-bit
3G/1G address space
4G/4G address space(RHEL4 only)
64-bit
X86_64
IA64
22
[Diagrams: virtual address space layouts.
  3G/1G kernel (SMP): user space 0-3GB; kernel space 3GB-4GB maps RAM (DMA, Normal; HighMem accessed indirectly).
  4G/4G kernel (Hugemem): separate 4GB user and 4GB kernel virtual spaces; the kernel maps DMA and Normal up to 3968MB, HighMem above.
  x86_64: user space up to 128TB (2^47); the kernel maps all RAM (DMA, Normal).
  IA64: separate kernel and user virtual regions mapping all RAM.]
25
Memory Pressure
32-bit: kernel allocations come from the DMA and Normal zones; user allocations come from Highmem.
64-bit: kernel and user allocations both come from the DMA and Normal zones.
Pagecache allocations fill pagecache pages; page faults create anonymous pages.
Kernel Reclamation (kswapd):
  slabcache reaping
  inode cache pruning
  bufferhead freeing
  dentry cache pruning
User Reclamation (kswapd/pdflush):
  page aging
  pagecache shrinking
  swapping
32
Anonymous/pagecache reclaiming
Pagecache pages (from pagecache allocations) are freed by kswapd (bdflush/pdflush, kupdated) page reclaim, deletion of a file, or unmounting a filesystem.
Anonymous pages (from page faults) are freed by kswapd page reclaim (swapout), unmap, or exit.
Page lifecycle: ACTIVE -> page aging -> INACTIVE (dirty -> clean via pdflush (RHEL4/5) or swapout) -> FREE (reclaiming, user deletions).
34
35
FileSystem & Disk IO
Buffered read()/write(): data is memory-copied between the user-space buffer and pagecache pages; I/O (DMA) then moves data between the pagecache pages and the disk.
Dirty pagecache pages are written back to disk by pdflush and by write()'ing processes.
Direct IO filesystem read()/write(): DMA moves data directly between the user-space buffer and the disk, bypassing the pagecache.
39
Section 2: Analyzing System Performance
Performance Monitoring Tools
What to run under certain loads
Analyzing System Performance
What to look for
40
oprofile
Kernel Tools
Networking
Profiling
nmi_watchdog=1, profile=2
dprobe, kprobe
41
42
Monitoring Tools
  CPU Tools: top, vmstat, ps aux, mpstat -P all, sar -u, iostat, oprofile, gnome-system-monitor, KDE-monitor, /proc
  Memory Tools: top, vmstat -s, ps aur, ipcs, sar -r -B -W, free, oprofile, gnome-system-monitor, KDE-monitor, /proc
  Process Tools: top, ps -o pmem, gprof, strace, ltrace, sar
  Disk Tools: iostat -x, vmstat -D, sar -DEV #, nfsstat, NEED MORE!
mpstat reveals per cpu stats, Hard/Soft Interrupt usage
vmstat vm page info, context switch, total ints/s, cpu
netstat per nic status, errors, statistics at driver level
lspci
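Typical invocations of the tools just listed (the interval and count values are arbitrary examples, not recommendations):
# mpstat -P ALL 5      (per-cpu utilization, hard/soft interrupt usage, 5s interval)
# vmstat 5             (vm paging info, context switches, total interrupts/s, cpu)
# netstat -i           (per-NIC packet counters and errors)
# lspci -v             (PCI device and driver details)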
43
44
vmstat (paging vs swapping)
# vmstat 10
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy wa id
200548352420052423457600546315251303096
020169784020052439314400057850482108539941221463
300784420052457841090059330589463243144307321842
# vmstat 10
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy wa id
200548352420052423457600546315251303096
02016623402005242345760057850482108539941221463
3023567873842005242345761875423745193589463243144307321842
45
46
47
SAR
[root@localhost redhat]# sar -u 3 3
Linux 2.4.21-20.EL (localhost.localdomain)   05/16/2005
10:32:28 PM  CPU  %user  %nice  %system   %idle
10:32:31 PM  all   0.00   0.00     0.00  100.00
10:32:34 PM  all   1.33   0.00     0.33   98.33
10:32:37 PM  all   1.34   0.00     0.00   98.66
Average:     all   0.89   0.00     0.11   99.00

[root]# sar -n DEV
Linux 2.4.21-20.EL (localhost.localdomain)   03/16/2005
01:10:01 PM  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
01:20:00 PM  lo       3.49     3.49   306.16   306.16     0.00     0.00      0.00
01:20:00 PM  eth0     3.89     3.53  2395.34   484.70     0.00     0.00      0.00
01:20:00 PM  eth1     0.00     0.00     0.00     0.00     0.00     0.00      0.00
48
Networking tools
Tuning tools
ethtool
sysctl
49
ethtool
Works mostly at the HW level
ethtool -S provides HW level stats
Counters since boot time, create scripts to calculate diffs
ethtool -c - Interrupt coalescing
ethtool -g - provides ring buffer information
ethtool -k - provides hw assist information
ethtool -i - provides the driver information
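For example, against an interface named eth0 (a placeholder; substitute your own NIC):
# ethtool -S eth0      (HW-level statistics counters since boot)
# ethtool -c eth0      (interrupt coalescing settings)
# ethtool -g eth0      (RX/TX ring buffer sizes)
# ethtool -k eth0      (hw assist / offload settings)
# ethtool -i eth0      (driver name, version, firmware, bus info)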
50
ps
[root@localhost root]# ps aux
[root@localhost root]# ps aux | more
USER  PID %CPU %MEM   VSZ  RSS TTY STAT START TIME COMMAND
root    1  0.1  0.1  1528  516  ?  S    23:18 0:04 init
root    2  0.0  0.0     0    0  ?  SW   23:18 0:00 [keventd]
root    3  0.0  0.0     0    0  ?  SW   23:18 0:00 [kapmd]
root    4  0.0  0.0     0    0  ?  SWN  23:18 0:00 [ksoftirqd/0]
root    7  0.0  0.0     0    0  ?  SW   23:18 0:00 [bdflush]
root    5  0.0  0.0     0    0  ?  SW   23:18 0:00 [kswapd]
root    6  0.0  0.0     0    0  ?  SW   23:18 0:00 [kscand]
53
pstree
init/usr/bin/sealer
acpid
atd
auditdpython
{auditd}
automount6*[{automount}]
avahi-daemonavahi-daemon
bonobo-activati{bonobo-activati}
bt-applet
clock-applet
crond
cupsdcups-polld
3*[dbus-daemon{dbus-daemon}]
2*[dbus-launch]
dhclient
54
55
/proc/meminfo
RHEL4> cat /proc/meminfo
MemTotal:       32749568 kB
MemFree:        31313344 kB
Buffers:           29992 kB
Cached:          1250584 kB
SwapCached:            0 kB
Active:           235284 kB
Inactive:        1124168 kB
HighTotal:             0 kB
HighFree:              0 kB
LowTotal:       32749568 kB
LowFree:        31313344 kB
SwapTotal:       4095992 kB
SwapFree:        4095992 kB
Dirty:                 0 kB
Writeback:             0 kB
Mapped:          1124080 kB
Slab:              38460 kB
CommitLimit:    20470776 kB
Committed_AS:    1158556 kB
PageTables:         5096 kB
VmallocTotal:  536870911 kB
VmallocUsed:        2984 kB
VmallocChunk:  536867627 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB
56
/proc/slabinfo
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfsd4_delegations         0     0   656   6  1 : tunables   54  27  8 : slabdata    0    0  0
nfsd4_stateids            0     0   128  30  1 : tunables  120  60  8 : slabdata    0    0  0
nfsd4_files               0     0    72  53  1 : tunables  120  60  8 : slabdata    0    0  0
nfsd4_stateowners         0     0   424   9  1 : tunables   54  27  8 : slabdata    0    0  0
nfs_direct_cache          0     0   128  30  1 : tunables  120  60  8 : slabdata    0    0  0
nfs_write_data           36    36   832   9  2 : tunables   54  27  8 : slabdata    4    4  0
nfs_read_data            32    35   768   5  1 : tunables   54  27  8 : slabdata    7    7  0
nfs_inode_cache        1383  1389  1040   3  1 : tunables   24  12  8 : slabdata  463  463  0
nfs_page                  0     0   128  30  1 : tunables  120  60  8 : slabdata    0    0  0
fscache_cookie_jar        3    53    72  53  1 : tunables  120  60  8 : slabdata    1    1  0
ip_conntrack_expect       0     0   136  28  1 : tunables  120  60  8 : slabdata    0    0  0
ip_conntrack             75   130   304  13  1 : tunables   54  27  8 : slabdata   10   10  0
bridge_fdb_cache          0     0    64  59  1 : tunables  120  60  8 : slabdata    0    0  0
rpc_buffers               8     8  2048   2  1 : tunables   24  12  8 : slabdata    4    4  0
rpc_tasks                30    30   384  10  1 : tunables   54  27  8 : slabdata    3    3  0
57
/proc/cpuinfo
[lwoodman]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU 3060 @ 2.40GHz
stepping        : 6
cpu MHz         : 2394.070
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4791.41
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
58
32-bit /proc/<pid>/maps
[root@dhcp8336 proc]# cat 5808/maps
0022e000-0023b000 r-xp 00000000 03:03 4137068   /lib/tls/libpthread-0.60.so
0023b000-0023c000 rw-p 0000c000 03:03 4137068   /lib/tls/libpthread-0.60.so
0023c000-0023e000 rw-p 00000000 00:00 0
0037f000-00391000 r-xp 00000000 03:03 523285    /lib/libnsl-2.3.2.so
00391000-00392000 rw-p 00011000 03:03 523285    /lib/libnsl-2.3.2.so
00392000-00394000 rw-p 00000000 00:00 0
00c45000-00c5a000 r-xp 00000000 03:03 523268    /lib/ld-2.3.2.so
00c5a000-00c5b000 rw-p 00015000 03:03 523268    /lib/ld-2.3.2.so
00e5c000-00f8e000 r-xp 00000000 03:03 4137064   /lib/tls/libc-2.3.2.so
00f8e000-00f91000 rw-p 00131000 03:03 4137064   /lib/tls/libc-2.3.2.so
00f91000-00f94000 rw-p 00000000 00:00 0
08048000-0804f000 r-xp 00000000 03:03 1046791   /sbin/ypbind
0804f000-08050000 rw-p 00007000 03:03 1046791   /sbin/ypbind
09794000-097b5000 rw-p 00000000 00:00 0
b5fdd000-b5fde000 ---p 00000000 00:00 0
59
64-bit /proc/<pid>/maps
#cat/proc/2345/maps
004000000100b000rxp00000000fd:001933328/usr/sybase/ASE12_5/bin/dataserver.esd3
0110b00001433000rwp00c0b000fd:001933328/usr/sybase/ASE12_5/bin/dataserver.esd3
01433000014eb000rwxp0143300000:000
4000000040001000p4000000000:000
4000100040a01000rwxp4000100000:000
2a95f730002a96073000p0012b000fd:00819273/lib64/tls/libc2.3.4.so
2a960730002a96075000rp0012b000fd:00819273/lib64/tls/libc2.3.4.so
2a960750002a96078000rwp0012d000fd:00819273/lib64/tls/libc2.3.4.so
2a960780002a9607e000rwp2a9607800000:000
2a9607e0002a98c3e000rws0000000000:06360450/SYSV0100401e(deleted)
2a98c3e0002a98c47000rwp2a98c3e00000:000
2a98c470002a98c51000rxp00000000fd:00819227/lib64/libnss_files2.3.4.so
2a98c510002a98d51000p0000a000fd:00819227/lib64/libnss_files2.3.4.so
2a98d510002a98d53000rwp0000a000fd:00819227/lib64/libnss_files2.3.4.so
2a98d530002a98d57000rxp00000000fd:00819225/lib64/libnss_dns2.3.4.so
2a98d570002a98e56000p00004000fd:00819225/lib64/libnss_dns2.3.4.so
2a98e560002a98e58000rwp00003000fd:00819225/lib64/libnss_dns2.3.4.so
2a98e580002a98e69000rxp00000000fd:00819237/lib64/libresolv2.3.4.so
2a98e690002a98f69000p00011000fd:00819237/lib64/libresolv2.3.4.so
2a98f690002a98f6b000rwp00011000fd:00819237/lib64/libresolv2.3.4.so
2a98f6b0002a98f6d000rwp2a98f6b00000:000
35c7e0000035c7e08000rxp00000000fd:00819469/lib64/libpam.so.0.77
35c7e0800035c7f08000p00008000fd:00819469/lib64/libpam.so.0.77
35c7f0800035c7f09000rwp00008000fd:00819469/lib64/libpam.so.0.77
35c800000035c8011000rxp00000000fd:00819468/lib64/libaudit.so.0.0.0
35c801100035c8110000p00011000fd:00819468/lib64/libaudit.so.0.0.0
35c811000035c8118000rwp00010000fd:00819468/lib64/libaudit.so.0.0.0
35c900000035c900b000rxp00000000fd:00819457/lib64/libgcc_s3.4.420050721.so.1
35c900b00035c910a000p0000b000fd:00819457/lib64/libgcc_s3.4.420050721.so.1
35c910a00035c910b000rwp0000a000fd:00819457/lib64/libgcc_s3.4.420050721.so.1
7fbfff10007fc0000000rwxp7fbfff100000:000
ffffffffff600000ffffffffffe00000p0000000000:000
60
/proc/vmstat
cat /proc/vmstat
nr_anon_pages 98893
nr_mapped 20715
nr_file_pages 120855
nr_slab 23060
nr_page_table_pages 5971
nr_dirty 21
nr_writeback 0
nr_unstable 0
nr_bounce 0
numa_hit 996729666
numa_miss 0
numa_foreign 0
numa_interleave 87657
numa_local 996729666
numa_other 0
pgpgin 2577307
pgpgout 106131928
pswpin 0
pswpout 34
pgalloc_dma 198908
pgalloc_dma32 997707549
pgalloc_normal 0
pgalloc_high 0
pgfree 997909734
pgactivate 1313196
pgdeactivate 470908
pgfault 2971972147
pgmajfault 8047
61
CONTINUED...
pgrefill_dma 18338
pgrefill_dma32 1353451
pgrefill_normal 0
pgrefill_high 0
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgsteal_high 0
pgscan_kswapd_dma 7235
pgscan_kswapd_dma32 417984
pgscan_kswapd_normal 0
pgscan_kswapd_high 0
pgscan_direct_dma 12
pgscan_direct_dma32 1984
pgscan_direct_normal 0
pgscan_direct_high 0
pginodesteal 166
slabs_scanned 1072512
kswapd_steal 410973
kswapd_inodesteal 61305
pageoutrun 7752
allocstall 29
pgrotated 73
Alt Sysrq M
Free pages:     15809760kB (0kB HighMem)
Active:51550 inactive:54515 dirty:44 writeback:0 unstable:0 free:3952440 slab:8727 mapped-file:5064 mapped-anon:20127 pagetables:1627
Node 0 DMA free:10864kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:10460kB pages_scanned:0 all_unreclaimable? no
Node 0 DMA32 free:2643124kB min:2760kB low:3448kB high:4140kB active:0kB inactive:0kB present:2808992kB pages_scanned:0 all_unreclaimable? no
Node 0 Normal free:13155772kB min:13480kB low:16848kB high:20220kB active:206200kB inactive:218060kB present:13703680kB pages_scanned:0 all_unreclaimable? no
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Node 0 DMA: 4*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 2*4096kB = 10864kB
Node 0 DMA32: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 2*512kB 2*1024kB 3*2048kB 643*4096kB = 2643124kB
Node 0 Normal: 453*4kB 161*8kB 44*16kB 15*32kB 4*64kB 4*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 3210*4096kB = 13155772kB
Node 0 HighMem: empty
85955 pagecache pages
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap  = 2031608kB
Total swap = 2031608kB
Free swap:      2031608kB
4521984 pages of RAM
446612 reserved pages
21971 pages shared
0 pages swapcached
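The same reports can be produced without a console keyboard, assuming the sysrq interface is enabled:
# echo 1 > /proc/sys/kernel/sysrq      (enable sysrq)
# echo m > /proc/sysrq-trigger         (memory report, as above, written to the kernel log)
# echo t > /proc/sysrq-trigger         (task/stack-trace report, as on the next slide)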
62
63
Alt Sysrq T
gdmgreeterSffff8100090368000751174837489(NOTLB)
ffff81044ae05b38000000000000008200000000000000800000000000000000
0000000000000000000000000000000affff810432ed97a0ffff81010f387080
0000002a3a0d43980000000000003b57ffff810432ed99880000000600000000
CallTrace:
[<ffffffff8006380f>]schedule_timeout+0x1e/0xad
[<ffffffff80049b33>]add_wait_queue+0x24/0x34
[<ffffffff8002db7e>]pipe_poll+0x2d/0x90
[<ffffffff8002f764>]do_sys_poll+0x277/0x360
[<ffffffff8001e99c>]__pollwait+0x0/0xe2
[<ffffffff8008be44>]default_wake_function+0x0/0xe
[<ffffffff8008be44>]default_wake_function+0x0/0xe
[<ffffffff8008be44>]default_wake_function+0x0/0xe
[<ffffffff80012f1a>]sock_def_readable+0x34/0x5f
[<ffffffff8004a81a>]unix_stream_sendmsg+0x281/0x346
[<ffffffff80037c3a>]do_sock_write+0xc6/0x102
[<ffffffff801277da>]avc_has_perm+0x43/0x55
[<ffffffff80276a6e>]unix_ioctl+0xc7/0xd0
[<ffffffff8021f48f>]sock_ioctl+0x1c1/0x1e5
[<ffffffff800420a7>]do_ioctl+0x21/0x6b
[<ffffffff800302a0>]vfs_ioctl+0x457/0x4b9
[<ffffffff800b6193>]audit_syscall_entry+0x180/0x1b3
[<ffffffff8004c4f6>]sys_poll+0x2d/0x34
[<ffffffff8005d28d>]tracesys+0xd5/0xe0
64
65
P4: GLOBAL_POWER_EVENTS
66
IA64: CPU_CYCLES
opannotate options:
  -t [percentage]      threshold to view
  --event=:name:count
  -f /path/filename
  -d                   details
  -s /path/source
  -a /path/assembly
Example:
  # opcontrol --start
  # sleep 60
  # opcontrol --stop
  # opcontrol --dump
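After the dump, opreport summarizes the collected samples; a minimal sketch (the --vmlinux path is an assumption and must match the running kernel with debug info installed):
# opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux   (done once, before --start)
# opreport            (per-binary summary, like the output below)
# opreport -l         (per-symbol breakdown)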
67
397435971  84.6702  vmlinux
 19703064   4.1976  zeus.web
 16914317   3.6034  e1000
 12208514   2.6009  ld-2.5.so
 11711746   2.4951  libc-2.5.so
  5164664   1.1003  sim.cgi
  2333427   0.4971  oprofiled
  1295161   0.2759  oprofile
  1099731   0.2343  zeus.cgi
   968623   0.2064  ext3
   270163   0.0576  jbd
68
69
Dynamic instrumentation
Tool to take a deep look into a running system:
  Assists in identifying causes of performance problems
Flow: probe script -> elaborate (probe-set library) -> translate to C, compile * -> probe kernel object
71
72
kswapd05440150376791943715157430730
kswapd15450180678882434712117341408
memory254359975697573083604621115837
mixer_applet2768764180101333981
Xorg749151906283920382
gnometerminal71612103869512320
gnometerminal77015261422457172
cupsd7100192704128
73
memory25685284278440644082834840398981614048185
kswapd1545300753257000049884
kswapd054462025241000017568
mixer_applet27687302282700101241
sshd25051227000600
kjournald86320728300002149
Xorg74911698980000310
gnomepowerman76531520001800
avahidaemon7252150128000480160
irqbalance67251263641313180190
bash250531220001300
hald7264890008300
gconfd271638252600680116
74
Capacity Tuning
Memory:
  /proc/sys/vm/overcommit_memory
  /proc/sys/vm/overcommit_ratio
  /proc/sys/vm/max_map_count
  /proc/sys/vm/nr_hugepages
Kernel:
  /proc/sys/kernel/msgmax
  /proc/sys/kernel/msgmnb
  /proc/sys/kernel/msgmni
  /proc/sys/kernel/shmall
  /proc/sys/kernel/shmmax
  /proc/sys/kernel/shmmni
  /proc/sys/kernel/threads-max
Filesystems:
  /proc/sys/fs/aio-max-nr
  /proc/sys/fs/file-max
OOM kills
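These can all be inspected and set at runtime with sysctl; the values below are placeholders for illustration, not recommendations:
# sysctl vm.overcommit_memory             (query the current value)
# sysctl -w vm.overcommit_memory=2        (strict overcommit accounting)
# sysctl -w kernel.shmmax=4294967296      (example: 4GB max shared memory segment)
# echo 2048 > /proc/sys/vm/nr_hugepages   (reserve 2048 hugepages)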
81
82
2228223 pages of RAM
1867481 pages of HIGHMEM
150341 reserved pages
343042 pages shared
257 pages swapcached
kernel: Out of Memory: Killed process 3450 (hpsmhd).
83
Eliminating OOM kills
RHEL4
RHEL5
84
85
Performance Tuning
Kernel Selection
VM tuning
Processor related tuning
NUMA related tuning
Disk & IO tuning
Hugepages
KVM host and guests
86
X86_64
87
X86_64
IA64
88
VM: swappiness
Controls how aggressively the system reclaims mapped memory:
  Anonymous memory - swapping
  Mapped file pages - writing if dirty and freeing
  System V shared memory - swapping
89
/proc/sys/vm/swappiness
Sybase server with /proc/sys/vm/swappiness set to 60 (default)
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy id wa
51643644267883544323417888801204044749613022084625342516
Sybase server with /proc/sys/vm/swappiness set to 10
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy id wa
8302422867243228069600238886377612862002024381326
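A sketch of how such a change is typically applied (the value mirrors the example above; pick what suits the workload):
# cat /proc/sys/vm/swappiness
60
# echo 10 > /proc/sys/vm/swappiness        (takes effect immediately)
# sysctl -w vm.swappiness=10               (equivalent; add "vm.swappiness = 10" to /etc/sysctl.conf to persist)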
90
/proc/sys/vm/min_free_kbytes
Directly controls the page reclaim watermarks in KB
# echo 1024 > /proc/sys/vm/min_free_kbytes
----------------------------------------------------------
Node 0 DMA free:4420kB min:8kB low:8kB high:12kB
Node 0 DMA32 free:14456kB min:1012kB low:1264kB high:1516kB
----------------------------------------------------------
# echo 2048 > /proc/sys/vm/min_free_kbytes
----------------------------------------------------------
Node 0 DMA free:4420kB min:20kB low:24kB high:28kB
Node 0 DMA32 free:14456kB min:2024kB low:2528kB high:3036kB
----------------------------------------------------------
91
92
/proc/sys/vm/dirty_ratio
Absolute limit to percentage of dirty pagecache memory
Default is 40%
Lower means less dirty pagecache and smaller IO streams
Higher means more dirty pagecache and larger IO streams
93
/proc/sys/vm/dirty_background_ratio
Controls when dirty pagecache memory starts getting written.
Default is 10%
Lower means background writeback starts sooner
Higher means background writeback starts later
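Illustrative tuning of the two ratios (the percentages are arbitrary examples, not recommendations):
# sysctl -w vm.dirty_ratio=20
# sysctl -w vm.dirty_background_ratio=5
# grep -i dirty /proc/meminfo          (watch the Dirty: counter respond under write load)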
94
95
/proc/sys/vm/pagecache
Controls when pagecache memory is deactivated.
Default is 100%
Lower
Higher
96
Pagecache Tuning
Filesystem/pagecache allocation -> ACTIVE list; pages that are accessed (pagecache under limit) stay ACTIVE; page aging moves pages to the INACTIVE list (new -> old); reclaim moves them to FREE.
98
Slab:           415420 kB
Hugepagesize:     2048 kB

Slab:           218208 kB
Hugepagesize:     2048 kB
100
CPU Scheduler
Recognizes differences between logical and physical processors (i.e. multi-core, hyperthreaded chips/sockets)
Optimizes process scheduling to take advantage of shared on-chip cache and NUMA memory nodes
[Diagram: processes being scheduled across Socket 0, Socket 1 and Socket 2, each with Core 0/Core 1 and Thread 0/Thread 1 per core.]
Numastat
Numactl
Hugetlbfs
/sys/devices/system/node
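Illustrative use of the NUMA tools above (node numbers and the application name are placeholders):
# numastat                                      (per-node hit/miss counters, matching the numa_* fields in /proc/vmstat)
# numactl --cpunodebind=0 --membind=0 ./app     (bind the app's cpus and memory to node 0)
# numactl --interleave=all ./app                (interleave the app's memory across all nodes)
# cat /sys/devices/system/node/node0/meminfo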
102
103
TIP
104
Linux NUMA Evolution (NEWer)
[Chart: RHEL3, 4 and 5 Linpack Multistream on AMD64, 8-cpu dual-core (1/2 cpus loaded); performance in Kflops, default scheduler vs. taskset affinity, for RHEL3 U8, RHEL4 U5 and RHEL5 GOLD.]
Limitations:
  NUMA spill to different NUMA boundaries
  Process migrations - no way back
  Lack of page replication - text, read mostly
HugeTLBFS
The Translation Lookaside Buffer (TLB) is a small CPU cache of recently used virtual-to-physical address mappings
TLB misses are extremely expensive on today's very fast, pipelined CPUs
Large memory applications can incur high TLB miss rates
[Diagram: TLB mapping a virtual address space onto physical memory]
Hugepages - before
$ vmstat
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy id wa st
0001562365631044401120001871416375109720
$ cat /proc/meminfo
MemTotal:     16301368 kB
MemFree:      15623604 kB
...
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
107
Hugepages - reserving
$ echo 2000 > /proc/sys/vm/nr_hugepages
$ vmstat
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy id wa st
0001152663231168401780001291015663109810
$ cat /proc/meminfo
MemTotal:     16301368 kB
MemFree:      11526520 kB
...
HugePages_Total:  2000
HugePages_Free:   2000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
108
Hugepages - using
$ mount -t hugetlbfs hugetlbfs /huge
$ cp 1GBfile /huge/junk
$ vmstat
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy id wa st
0001052663231168140178000129101566310981 0
$ cat /proc/meminfo
LowTotal:     16301368 kB
LowFree:      11524756 kB
...
HugePages_Total:  2000
HugePages_Free:   1488
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
109
Hugepages - releasing
$ rm /huge/junk
$ cat /proc/meminfo
MemTotal:     16301368 kB
MemFree:      11524776 kB
...
HugePages_Total:  2000
HugePages_Free:   2000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
$ echo 0 > /proc/sys/vm/nr_hugepages
$ vmstat
procs  memory                        swap     io       system   cpu
 r  b  swpd  free  buff  cache       si  so   bi  bo   in  cs   us sy id wa st
00015620488315124019440071614959109810
$ cat /proc/meminfo
MemTotal:     16301368 kB
MemFree:      15620500 kB
...
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
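To make the reservation and the mount persistent across reboots, the usual approach (the /huge mount point and the page count mirror the example above) is:
Add to /etc/sysctl.conf:   vm.nr_hugepages = 2000
Add to /etc/fstab:         hugetlbfs  /huge  hugetlbfs  defaults  0 0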
110
NUMA Hugepages - reserving
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total:
Node 0 HugePages_Free:
Node 1 HugePages_Total:
Node 1 HugePages_Free:
111
NUMA Hugepages - using
[root@dhcp-100-19-50 ~]# mount -t hugetlbfs hugetlbfs /huge
[root@dhcp-100-19-50 ~]# /usr/tmp/mmapwrite /huge/junk 32 &
[1] 18804
[root@dhcp-100-19-50 ~]# Writing 1048576 pages of random junk to file /huge/junk
wrote 4294967296 bytes to file /huge/junk
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 2980
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free: 972
112
NUMA Hugepages - using (overcommit)
[root@dhcp-100-19-50 ~]# /usr/tmp/mmapwrite /huge/junk 33 &
[1] 18815
[root@dhcp-100-19-50 ~]# Writing 2097152 pages of random junk to file /huge/junk
wrote 8589934592 bytes to file /huge/junk
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 1904
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free:
113
NUMA Hugepages - reducing
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 2980
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free: 3020
[root@dhcp-100-19-50 ~]# echo 3000 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total:
Node 0 HugePages_Free:
114
NUMA Hugepages - freeing/reserving
[root@dhcp-100-19-50 ~]# echo 6000 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2982
Node 0 HugePages_Free: 2982
Node 1 HugePages_Total: 3018
Node 1 HugePages_Free: 3018
[root@dhcp-100-19-50 ~]# echo 0 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# echo 3000 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 1500
Node 0 HugePages_Free: 1500
Node 1 HugePages_Total: 1500
Node 1 HugePages_Free: 1500
115
JVM Tuning
Eliminate swapping
  Lower swappiness to 10% (or lower if necessary).
Promote pagecache reclaiming
  Lower dirty_background_ratio to 10%
  Lower dirty_ratio if necessary
Promote inode cache reclaiming
  Lower vfs_cache_pressure
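A sketch of those settings as sysctls; the specific values mirror the bullets above and are starting points, not mandates:
# sysctl -w vm.swappiness=10
# sysctl -w vm.dirty_background_ratio=10
# sysctl vm.vfs_cache_pressure          (default 100; adjust per the bullet above)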
116
117
[Chart: Base vs. Base HugePages vs. % Virt Huge KVM, plotted against #cpus.]
118
119
GeneralPerformanceTuningGuidelines
Use hugepages whenever possible.
Minimize swapping.
Maximize pagecache reclaiming
Place swap partition(s) on quiet device(s).
Direct IO if possible.
Beware of turning NUMA off.
120
fdisk /dev/sdX
raw /dev/raw/rawX /dev/sdX1
dd if=/dev/raw/rawX bs=64k
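A complete raw read test might look like this (device names and the block count are placeholders):
# raw /dev/raw/raw1 /dev/sda1
# dd if=/dev/raw/raw1 of=/dev/null bs=64k count=16384      (reads 1GB; watch the device with iostat -x 5)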
121
IOzone commands
iozone -a -f /perf1/t1    (in cache)
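Variations along the lines of the comparisons that follow; the flags are standard iozone options, and the file sizes are placeholders:
# iozone -I -f /perf1/t1 -s 1g -i 0 -i 1      (direct I/O: -I requests O_DIRECT; -i 0/-i 1 select write/rewrite and read/reread)
# iozone -f /perf1/t1 -s 64g -i 0 -i 1        (file larger than RAM, to defeat the pagecache)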
[Chart: in-cache IOzone performance (MB/sec) for EXT (EXT_inCache), GFS1 (GFS1 InCache) and NFS (NFS InCache) across all I/O types: initial write, re-write, read, re-read, random read, random write, backward read, record re-write, stride read, fwrite, fre-write, fread, fre-read.]
123
[Chart: Direct I/O IOzone performance for EXT (EXT_DIO), GFS1 (GFS1_DIO) and NFS (NFS_DIO) across all I/O types: initial write, re-write, read, backward read, record re-write, stride read.]
[Chart: percent relative to EXT3 for EXT4DEV, EXT4 BARRIER=0, XFS and XFS Barrier=0, for in-cache, direct I/O and larger-than-cache workloads.]
[Chart: RHEL5.3 Base, 8 cpus — ext3 vs. xfs vs. ext4 throughput at 10U through 100U simulated users.]
NUMA
Localized memory access for certain workloads improves performance
127
RHEL5.4 Oracle OLTP Performance
[Chart: OLTP (tpm) vs. #CPUs — Tigerton 2.93GHz/32GB mem and Nehalem 2.687GHz/36GB mem, for RHEL5.2 Base at 4, 8 and 16 CPUs.]
[Chart: throughput at 100U for RHEL5.3 Base with 6, 12 and 24 CPUs.]
129
RHEL5.2 Oracle 10.2 Hugepages
[Chart: relative performance of kernel 2.6.18-90.el5 vs. 2.6.18-90.el5 with HugePages at 40U, 60U, 80U and 100U, with % difference.]
Asynchronous I/O to File Systems
Allows application to continue processing while I/O is in progress
Synchronous I/O: the application issues an I/O request to the device driver and stalls until the I/O request completes.
Asynchronous I/O: the application issues the I/O request to the device driver and continues; there is no stall for completion, the I/O request completion is delivered later.
[Chart: relative performance at 100U for AIO+DIO, DIO only, AIO only, and no AIO or DIO.]
RHEL5 IO schedulers vs RHEL3 for Database
[Chart: Oracle 10G OLTP/DSS relative performance (%tran/min and %queries/hour) for CFQ, Deadline, AS, Noop and RHEL3.]
135
[Chart: Transactions/Minute — bare metal vs. 2 guests x 8 vCPUs, 4 guests x 4 vCPUs, and 8 guests x 2 vCPUs.]
[Chart: OLTP Trans/Min vs. Simulated Users (x100), 10U-100U — RHEL5.3 Base 8 cpus, RHEL5.3 Base 8 cpus SELinux Enabled, and RHEL5.3 Base 8 cpus SELinux Permissive.]
137
Benchmark Tuning
Use Hugepages.
Don't overcommit memory.
If memory must be overcommitted:
  Eliminate all swapping.
  Maximize pagecache reclaiming.
Place swap partition(s) on separate device(s).
Use Direct IO.
Don't turn NUMA off.
138
sysctl
sysctl -q
- queries a variable
sysctl -w
- writes a variable
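For example (vm.swappiness is used here only as an illustration):
# sysctl vm.swappiness              (read the current value)
# sysctl -w vm.swappiness=10        (set it at runtime)
# sysctl -p                         (reload settings from /etc/sysctl.conf)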
142
1GbE
10GbE
143
netperf
http://netperf.org
Feature Rich
Read documentation
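A minimal run might look like this; the hostname, duration and test type are placeholders:
# netserver                                            (on the receiving host)
# netperf -H server.example.com -l 60 -t TCP_STREAM    (60-second bulk-throughput test from the sender)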
144
Disable cpuspeed
default gov=ondemand; set governor to performance
Use affinity to maximize multi-core shared cache environments
Process affinity
Use taskset
Interrupt affinity
or MRGs Tuna
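Illustrative affinity commands (CPU numbers, the PID and the IRQ number are placeholders):
# taskset -c 0-3 ./app                   (launch app bound to cpus 0-3)
# taskset -p -c 2 1234                   (move existing PID 1234 to cpu 2)
# echo 4 > /proc/irq/90/smp_affinity     (bitmask: steer IRQ 90 to cpu 2)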
145
146
Questions?
147