Slides for MySQL Conference & Expo 2010: http://en.oreilly.com/mysql2010/public/schedule/detail/13519
SSD Deployment Strategies for MySQL
1. SSD Deployment Strategies for MySQL
Yoshinori Matsunobu
Lead of MySQL Professional Services APAC
Sun Microsystems
Yoshinori.Matsunobu@sun.com
2. What do you need to consider? (H/W layer)
• SSD or HDD?
• Interface
– SATA/SAS or PCI-Express?
• RAID
– H/W RAID, S/W RAID or JBOD?
• Network
– Is 1GbE enough?
• Memory
– Is 2GB RAM + PCI-E SSD faster than 64GB RAM + 8 HDDs?
• CPU
– Nehalem or older Xeon?
3. What do you need to consider?
• Redundancy
– RAID
– DRBD (network mirroring)
– Semi-Sync MySQL Replication
– Async MySQL Replication
• Filesystem
– ext3, xfs, or raw device?
• File location
– Data file, Redo log file, etc
• SSD specific issues
– Write performance deterioration
– Write endurance
4. Why SSD? IOPS!
• IOPS: Number of (random) disk i/o operations per second
• Almost all database operations require random access
– Selecting records by index scan
– Updating records
– Deleting records
– Modifying indexes
• Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow)
• SSD : 2,000+ (writes) / 5,000+ (reads) per drive
– highly dependent on the SSD and device driver
• Let’s start from basic benchmarks
5. Tested HDD/SSD for this session
• SSD
– Intel X25-E (SATA, 30GB, SLC)
– Fusion I/O (PCI-Express, 160GB, SLC)
• HDD
– Seagate 160GB SAS 15000RPM
6. Table of contents
• Basic Performance on SSD/HDD
– Random Reads
– Random Writes
– Sequential Reads
– Sequential Writes
– fsync() speed
– Filesystem difference
– IOPS and I/O unit size
• MySQL Deployments
7. Random Read benchmark
[Chart] Direct Random Read IOPS (single drive, 16KB, xfs): IOPS vs. number of I/O threads (1–200) for HDD, Intel SSD, and Fusion I/O
• HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads
• Intel : 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads
• Fusion I/O : 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads
• Single thread throughput on Intel is 16x better than on HDD, Fusion is 25x better
• SSD’s concurrency (4x) is much better than HDD’s (2.2x)
• Very strong reason to use SSD
8. High Concurrency
• A single SSD drive has multiple NAND flash memory chips
(e.g. 40 x 4GB flash memory = 160GB)
• Concurrency is highly dependent on the I/O controller and the application
– A single-threaded application cannot take advantage of this concurrency
9. PCI-Express SSD
[Diagram] A PCI-Express SSD attaches to the PCI-Express controller on the north bridge (2GB/s for PCI-Express x8), while a SATA/SAS SSD attaches to the SAS/SATA controller on the south bridge (300MB/s); in both cases an SSD I/O controller sits in front of the flash chips
• Advantage
– PCI-Express is a much faster interface than SAS/SATA
• (current) Disadvantages
– Most motherboards have limited # of PCI-E slots
– No hot swap mechanism
10. Write performance on SSD
[Chart] Random Write IOPS (16KB blocks), 1 vs. 100 I/O threads, for HDD (4-disk RAID10, xfs), Intel (xfs), and Fusion I/O (xfs)
• Very strong reason to use SSD
• But wait... can we get high write throughput *anytime*?
– Not always. Let’s check how data is written to flash memory
11. Understanding how data is written to SSD (1)
[Diagram] An SSD contains many flash memory chips; each chip contains blocks, and each block contains pages
• A single SSD drive consists of many flash memory chips (e.g. 2GB each)
• A flash memory chip internally consists of many blocks (e.g. 512KB)
• A block internally consists of many pages (e.g. 4KB)
• It is *not* possible to overwrite a non-empty block
– Reading from pages is possible
– Writing to pages in an empty block is possible
– Appending is possible
– Overwriting pages in a non-empty block is *not* possible
12. Understanding how data is written to SSD (2)
[Diagram] New data cannot overwrite pages in a non-empty block; it is written to pages in an empty block instead
• Overwriting a non-empty block is not possible
• New data is written to an empty block instead
• Writing to an empty block is fast (~200 microseconds)
• Even though applications write to the same positions in the same files (e.g. the InnoDB log file), the written pages/blocks are distributed across the device (wear leveling)
13. Understanding how data is written to SSD (3)
[Diagram] Updating a page in a full block: read all pages → ERASE the block → write all data back with the new page
• In the long run, almost all blocks will be fully used
– e.g. allocating 158GB of files on a 160GB SSD
• New empty block must be allocated on writes
• Basic steps to write new data:
– 1. Reading all pages from a block
– 2. ERASE the block
– 3. Writing all data w/ new data into the block
• ERASE is a very expensive operation (it takes a few milliseconds)
• At this stage, write performance becomes very slow because of
massive ERASE operations
14. Data space and reserved space
[Diagram] Writes are handled by reading the current pages and writing the data to a pre-erased block in the reserved space, while background jobs ERASE unused blocks to replenish it
• To keep write performance high enough, SSDs set aside “reserved space”
• The data size visible to applications is limited to the size of the data space
– e.g. 160GB SSD, 120GB data space, 40GB reserved space
• Fusion I/O provides a tool to change the reserved space size:
– # fio-format -s 96G /dev/fct0
15. Write performance deterioration
[Chart] Write IOPS deterioration (16KB random writes) under continuous write-intensive workloads: “Fastest” vs. “Slowest” observed IOPS for Intel, Fusion(150G), Fusion(120G), Fusion(96G), Fusion(80G); an annotation marks recovery after stopping writes for a while
• At the beginning, write IOPS was close to “Fastest” line
• When massive writes happened, write IOPS gradually deteriorated toward
“Slowest” line (because massive ERASE happened)
• Increasing reserved space improves steady-state write throughput
• Write IOPS recovered to “Fastest” when stopping writes for a long time
(Many blocks were ERASEd by background job)
• Highly dependent on the flash memory and the I/O controller (TRIM support, ERASE scheduling, etc.)
16. Sequential I/O
[Chart] Sequential read/write throughput (MB/s, 1MB consecutive reads/writes) for 4 HDD (RAID10, xfs), Intel (xfs), and Fusion I/O (xfs)
• Typical scenario: Full table scan (read), logging/journaling (write)
• SSD outperforms HDD for sequential reads, but the advantage is less significant
• HDD (4-disk RAID10) is fast enough for sequential i/o
• Sequential writes tend to transfer large volumes of data, so you need to watch for write deterioration on SSD
• No strong reason to use SSD for sequential writes
17. fsync() speed
[Chart] fsync operations per second for 1KB, 8KB, and 16KB writes on HDD (xfs), Intel (xfs), and Fusion I/O (xfs)
• 10,000+ fsync/sec is fine in most cases
• Fusion I/O was CPU bound (%system), not I/O bound
(%iowait).
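A rough way to sanity-check synced-write rates on your own hardware (not the benchmark used here; dd with oflag=dsync syncs every write, which approximates an fsync per write):
  # 8KB synchronous writes; count divided by elapsed time gives a rough fsync-like rate
  dd if=/dev/zero of=/data/fsync_test bs=8k count=10000 oflag=dsync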
18. HDD is fast for sequential writes / fsync
• Best Practice: writes can be boosted by using BBWC
(Battery Backed-up Write Cache), especially for REDO
logs (because they are written sequentially)
• No strong reason to use SSDs here
[Diagram] Without a write cache, each write pays disk seek & rotation time; with a battery-backed write cache in front of the disks, writes are acknowledged from the cache
19. Filesystem matters
[Chart] Random write IOPS (16KB blocks), 1 thread vs. 16 threads, on Fusion I/O with ext3, xfs, and a raw device
• On xfs, multiple threads can write to the same file when it is opened with O_DIRECT; on ext* they cannot
• Good concurrency on xfs, close to raw device
• ext3 is less optimized for Fusion I/O
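A minimal sketch of putting xfs on a Fusion I/O device (the device name /dev/fioa and the mount options are assumptions; check the driver documentation for your card):
  mkfs.xfs -f /dev/fioa
  mkdir -p /ssd
  mount -t xfs -o noatime,nobarrier /dev/fioa /ssd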
20. Changing I/O unit size
[Chart] Read IOPS vs. concurrency (1–200) on 4 HDD RAID10 for 1KB, 4KB, and 16KB I/O unit sizes
• On HDD, at most a 22% performance difference was found between 1KB and 16KB
• No big difference when concurrency < 10
21. Changing I/O unit size on SSD
[Chart] Read IOPS vs. concurrency (1–200) on Fusion I/O for 1KB, 4KB, and 16KB I/O unit sizes
• Huge difference
• On SSDs, not only IOPS but also the I/O transfer size matters
• It is worth considering storage engines that support a “configurable block size”
22. Let’s start MySQL benchmarking
• Base: Disk-bound application (DBT-2) running on:
– Sun Fire X4270
– Nehalem 8 Core
– 4 HDD
– RAID1+0, Write Cache with Battery
• What will happen if …
– Replacing HDD with Intel SSD (SATA)
– Replacing HDD with Fusion I/O (PCI-E)
– Moving log files and ibdata to HDD
– Not using Nehalem
– Using two Fusion I/O drives with Software RAID1
– Deploying DRBD protocol B or C
• Replacing 1GbE with 10GbE
– Using MySQL 5.5.4
23. DBT-2 condition
• SuSE Enterprise Linux 11, xfs
• MySQL 5.5.2M2 (InnoDB Plugin 1.0.6)
• 200 Warehouses (20GB – 25GB hot data)
• Buffer pool size
– 1GB
– 2GB
– 5GB
– 30GB (large enough to cache all data)
• 1000 seconds warm up time
• Running 3600 seconds (1 hour)
• Fusion I/O: 96GB data space, 64GB reserved space
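For reference, a minimal my.cnf sketch matching one of these buffer pool sizes (only the buffer pool value comes from this slide; the other options are common InnoDB Plugin settings and are assumptions, not the exact benchmark configuration):
  cat >> /etc/my.cnf <<'EOF'
  [mysqld]
  innodb_buffer_pool_size        = 5G
  innodb_flush_method            = O_DIRECT
  innodb_flush_log_at_trx_commit = 1
  innodb_file_per_table          = 1
  EOF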
24. HDD vs Intel SSD
                 HDD       Intel
Buffer pool 1G   1125.44   5709.06   (NOTPM: transactions per minute)
• Storing all data on HDD or Intel SSD
• Massive disk i/o happens
– Random reads for all accesses
– Random writes for updating rows and indexes
– Sequential writes for REDO log files, etc
• SSD is very good at these kinds of workloads
• 5.5 times performance improvement, without any application
change!
25. HDD vs Intel SSD vs Fusion I/O
                 HDD       Intel     Fusion I/O
Buffer pool 1G   1125.44   5709.06   15122.75
• Fusion I/O is a PCI-E based SSD
• PCI-E is much faster than SAS/SATA
• 14x improvement compared to 4HDDs
26. Which should we spend money on, RAM or SSD?
                 HDD        Intel     Fusion I/O
Buffer pool 1G   1125.44    5709.06   15122.75
Buffer pool 2G   1863.19
Buffer pool 5G   4385.18
Buffer pool 30G  36784.76   (caching all hot data)
• Increasing RAM (buffer pool size) reduces random disk reads
– Because more data are cached in the buffer pool
• If all data are cached, only disk writes (both random and
sequential) happen
• Disk writes happen asynchronously, so application queries can
be much faster
• Large enough RAM + HDD outperforms too small RAM + SSD
27. Which should we spend money on, RAM or SSD?
                 HDD        Intel      Fusion I/O
Buffer pool 1G   1125.44    5709.06    15122.75
Buffer pool 2G   1863.19    7536.55    20096.33
Buffer pool 5G   4385.18    12892.56   30846.34
Buffer pool 30G  36784.76   -          57441.64   (caching all hot data)
• It is not always possible to cache all hot data
• Fusion I/O + good amount of memory (5GB) was pretty good
• A basic rule can be:
– If you can cache all active data: large enough RAM + HDD
– If you can’t, or if you need extremely high throughput: spend on both RAM and SSD
28. Let’s think about MySQL file location
• SSD is extremely good at random reads
• SSD is very good at random writes
• HDD is good enough at sequential reads/writes
• No strong reason to use SSD for sequential writes
• Random I/O oriented:
– Data Files (*.ibd)
• Sequential reads if doing full table scan
– Undo Log, Insert Buffer (ibdata)
• UNDO tablespace (small in most cases, except when running long batch jobs)
• On-disk insert buffer space (small in most cases, except when InnoDB cannot keep up with updating secondary indexes)
• Sequential Write oriented:
– Doublewrite Buffer (ibdata)
• Write volume is equal to that of the *.ibd files: huge
– Binary log (mysql-bin.XXXXXX)
– Redo log (ib_logfile)
– Backup files
29. Moving sequentially written files into HDD
                 Fusion I/O                    Fusion I/O + HDD              Up
Buffer pool 1G   15122.75 (us=25%, wa=15%)     19295.94 (us=32%, wa=10%)     +28%
Buffer pool 2G   20096.33 (us=30%, wa=12.5%)   25627.49 (us=36%, wa=8%)      +28%
Buffer pool 5G   30846.34 (us=39%, wa=10%)     39435.25 (us=49%, wa=6%)      +28%
Buffer pool 30G  57441.64 (us=70%, wa=3.5%)    66053.68 (us=77%, wa=1%)      +15%
• Moving ibdata, ib_logfile (and binary logs) onto HDD (a configuration sketch follows below)
• High impact on performance
– Write volume to the SSD is roughly halved because the doublewrite area is allocated on HDD
– %iowait was significantly reduced
– You can delay write performance deterioration
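A hedged configuration sketch of this split (the paths are assumptions; the options are standard MySQL/InnoDB settings):
  cat >> /etc/my.cnf <<'EOF'
  [mysqld]
  # randomly accessed *.ibd files stay on SSD
  datadir                   = /ssd/mysql
  innodb_file_per_table     = 1
  # sequentially written files go to HDD
  innodb_data_home_dir      = /hdd/mysql
  innodb_log_group_home_dir = /hdd/mysql
  log-bin                   = /hdd/mysql/mysql-bin
  EOF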
30. Does CPU matter?
[Diagram] Nehalem: memory is attached directly to the CPUs, with QPI (25.6GB/s) linking the CPUs to the north bridge (IOH) and PCI-Express. Older Xeon: the CPUs reach memory through the north bridge (MCH) over the FSB (10.6GB/s), which also carries PCI-Express traffic
• Nehalem has two big advantages
1. Memory is directly attached to CPU : Faster for in-memory workloads
2. The interface between CPU and north bridge is 2.5x faster, and its traffic does not conflict with CPU<->memory traffic: faster for disk i/o workloads when using PCI-Express SSDs
31. Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD)
HDD              Harpertown X5470 (3.33GHz)   Nehalem X5570 (2.93GHz)   Up
Buffer pool 1G   1135.37 (us=1%)              1125.44 (us=1%)           -1%
Buffer pool 2G   1922.23 (us=2%)              1863.19 (us=2%)           -3%
Buffer pool 5G   4176.51 (us=7%)              4385.18 (us=7%)           +5%
Buffer pool 30G  30903.4 (us=40%)             36784.76 (us=40%)         +19%
us: userland CPU utilization
• CPU difference matters on CPU bound workloads
32. Harpertown X5470 vs Nehalem X5570 (Fusion)
Fusion I/O+HDD   Harpertown X5470 (3.33GHz)   Nehalem X5570 (2.93GHz)   Up
Buffer pool 1G   13534.06 (user=35%)          19295.94 (user=32%)       +43%
Buffer pool 2G   19026.64 (user=40%)          25627.49 (user=37%)       +35%
Buffer pool 5G   30058.48 (user=50%)          39435.25 (user=50%)       +31%
Buffer pool 30G  52582.71 (user=76%)          66053.68 (user=76%)       +26%
• The TPM difference was much larger than with HDD
• For disk i/o bound workloads (buffer pool 1G/2G), CPU utilization on Nehalem was lower, but TPM was much higher
– Verified that Nehalem is much more efficient for PCI-E workloads
• Benefit from high interface speed between CPU and PCI-Express
• Fusion I/O fits with Nehalem much better than with traditional CPUs
33. We need to think about redundancy overhead
• Single server + No RAID is meaningless in the real
database world
• Redundancy
– RAID 1 / 5 / 10
– Network mirroring (DRBD)
– Replication (Sync / Async)
• The relative overhead of redundancy will be (much) higher than in a traditional HDD environment
34. Fusion I/O + Software RAID1
• Fusion I/O itself has a RAID5 feature
– Writing parity bits into Flash Memory
– Flash Chips are not Single Point of Failure
– Controller / PCI-E Board is Single Point of Failure
• Right now no H/W RAID controller is provided for
PCI-E SSDs
• Using Software RAID1 (or RAID10)
– Two Fusion I/O drives in the same machine
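A minimal sketch of mirroring two Fusion I/O devices with Linux md (the device names and mount options are assumptions):
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/fioa /dev/fiob
  mkfs.xfs /dev/md0
  mount -t xfs -o noatime,nobarrier /dev/md0 /ssd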
35. Understanding how software RAID1 works
[Diagram] H/W RAID1: the application writes to files on /dev/sdX; the RAID controller’s battery-backed write cache returns the response and writes to both disks in the background (in parallel). S/W RAID1: the application writes to files on /dev/md0; the software RAID daemon (“md0_raid1” process) writes to both disks in parallel before responding
• Response time on Software RAID1 is
max(time-to-write-to-disk1, time-to-write-to-disk2)
• If either of the two takes time for ERASE, response time will be
longer
• On faster storage / faster write patterns (e.g. sequential write + fsync), the relative overhead of the software RAID process is higher
36. Random Write IOPS, S/W RAID1 vs No-RAID
[Chart] Random write IOPS (Fusion I/O 160GB SLC, 16KB I/O unit, xfs) over running time (minutes): No-RAID (120G), S/W RAID1 (120G), No-RAID (96G), S/W RAID1 (96G)
• 120GB data space = 40GB additional reserved space
• 96GB data space = 64GB additional reserved space
• On S/W RAID1, IOPS deteriorated more quickly than with No-RAID
• On S/W RAID1 with 96GB data space, the steady-state (“slowest”) IOPS was lower than with No-RAID
• A 20-25% performance drop can be expected on disk-write-bound workloads
37. What about Reads?
[Chart] Read IOPS (16KB blocks) vs. concurrency (1–200), No-RAID vs. S/W RAID1
• Theoretically, read IOPS can double with RAID1
• Peak IOPS was 43636 with No-RAID and 75627 with RAID1: 73% up
• Good scalability
38. DBT-2, No-RAID vs S/W RAID on Fusion I/O
                 Fusion I/O+HDD   RAID1 Fusion I/O+HDD   %iowait   Down
Buffer pool 1G   19295.94         15468.81               10%       -19.8%
Buffer pool 2G   25627.49         21405.23               8%        -16.5%
Buffer pool 5G   39435.25         35086.21               6-7%      -11.0%
Buffer pool 30G  66053.68         66426.52               0-1%      +0.56%
39. Intel SSDs with a traditional H/W raid controller
                 Single raw Intel   Four Intel (RAID5)   Down
Buffer pool 1G   5709.06            2975.04              -48%
Buffer pool 2G   7536.55            4763.60              -37%
Buffer pool 5G   12892.56           11739.27             -9%
• Raw SSD drives performed much better than using a traditional H/W
raid controller
– Even with RAID10, performance was worse than a single raw drive
– The H/W RAID controller seemed to be a serious bottleneck
– Make sure the SSD drives themselves have a write cache and a capacitor (Intel X25-V/M/E don’t have a capacitor); a quick cache check is sketched below
• Use JBOD + write cache + capacitor
• Research appliances such as Schooner, Gear6, etc
• Wait until H/W vendors release H/W RAID controllers that work well with SSDs
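One quick way to check whether a SATA drive’s volatile write cache is enabled (this does not tell you whether a capacitor protects it; that has to come from the vendor specification, and /dev/sdb is an assumed device name):
  # prints "write-caching = 1 (on)" or "0 (off)"
  hdparm -W /dev/sdb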
40. What about DRBD?
• Single server is not Highly Available
– Mother Board/RAID Controller/etc are Single Point of Failure
• Heartbeat + DRBD + MySQL is one of the most
common HA (Active/Passive) solutions
• Network might be a bottleneck
– 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc
• Replication level
– Protocol A (async)
– Protocol B (sync to remote drbd receiver process)
– Protocol C (sync to remote disk)
• The DRBD network channel is single threaded
– Storing all data under /data (a single DRBD partition) => single thread
– Storing log/ibdata under /hdd and *.ibd files under /ssd => two threads (two resources; a sketch follows below)
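A hedged sketch of two DRBD resources using protocol B, so the SSD and HDD partitions replicate over separate connections; hostnames, IP addresses, and disk devices are assumptions, and the syntax is DRBD 8.3-style:
  cat > /etc/drbd.d/mysql.res <<'EOF'
  resource ssd {
    protocol  B;
    device    /dev/drbd0;
    disk      /dev/fioa1;
    meta-disk internal;
    on db1 { address 192.168.0.1:7788; }
    on db2 { address 192.168.0.2:7788; }
  }
  resource hdd {
    protocol  B;
    device    /dev/drbd1;
    disk      /dev/sda3;
    meta-disk internal;
    on db1 { address 192.168.0.1:7789; }
    on db2 { address 192.168.0.2:7789; }
  }
  EOF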
41. DRBD Overheads on HDD
HDD              No DRBD    DRBD Protocol B, 1GbE   DRBD Protocol B, 10GbE
Buffer pool 1G   1125.44    1080.8                  1101.63
Buffer pool 2G   1863.19    1824.75                 1811.95
Buffer pool 5G   4385.18    4285.22                 4326.22
Buffer pool 30G  36784.76   32862.81                35689.67
• DRBD 8.3.7
• DRBD overhead (protocol B) was not big on disk i/o bound
workloads
• Network bandwidth difference was not big on disk i/o bound
workloads
42. DRBD Overheads on Fusion I/O
Fusion I/O+HDD   No DRBD    DRBD Protocol B, 1GbE   Down     DRBD Protocol B, 10GbE   Down
Buffer pool 1G   19295.94   5976.18                 -69.0%   12107.88                 -37.3%
Buffer pool 2G   25627.49   8100.5                  -68.4%   16776.19                 -34.5%
Buffer pool 5G   39435.25   16073.9                 -59.2%   30288.63                 -23.2%
Buffer pool 30G  66053.68   37974                   -42.5%   62024.68                 -6.1%
• DRBD overhead was not negligible
• 10GbE performed much better than 1GbE
• Still 6-10 times faster than HDD
• Note: DRBD supports faster interface such as InfiniBand SDP
and Dolphin Interconnect
43. Misc topic: Insert performance on InnoDB vs MyISAM (HDD)
[Chart] Time (seconds) to insert 1 million records vs. existing records (millions) on HDD, InnoDB vs. MyISAM; the MyISAM curve degrades to about 250 rows/s
• MyISAM doesn’t do any special i/o optimization like “insert buffering”, so a lot of random reads/writes happen, and performance depends heavily on the OS
• Disk seek & rotation overhead is really serious on HDD
44. Note: Insert Buffering (InnoDB feature)
• If non-unique secondary index blocks are not in memory, InnoDB inserts entries into a special buffer (the “insert buffer”) to avoid random disk i/o operations
– Insert buffer is allocated on both memory and innodb
SYSTEM tablespace
• Periodically, the insert buffer is merged into the
secondary index trees in the database (“merge”)
• Pros: reduces I/O overhead
– Reduces the number of disk i/o operations by merging i/o requests to the same block
– Some random i/o operations become sequential
• Cons:
– Additional merge operations are added
– Merging might take a very long time when many secondary indexes must be updated and many rows have been inserted
– Merging may continue after a server shutdown and restart
(A way to watch insert buffer activity is sketched below)
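One way to watch insert buffer size and merge activity (a sketch; the relevant section of the status output is “INSERT BUFFER AND ADAPTIVE HASH INDEX”, and its exact format varies by version):
  mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 5 'INSERT BUFFER'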
45. Insert performance: InnoDB vs MyISAM (SSD)
[Chart] Time (seconds) to insert 1 million records vs. existing records (millions) on SSD, InnoDB vs. MyISAM; the two curves level off at roughly 2,000 and 5,000 rows/s, with annotations marking where the index size exceeded the buffer pool size and where the filesystem cache was fully used and disk reads began
• MyISAM got much faster by just replacing HDD with SSD !
46. Try MySQL 5.5.4 !
Fusion I/O + HDD   MySQL 5.5.2   MySQL 5.5.4   Up
Buffer pool 1G     19295.94      24019.32      +24%
Buffer pool 2G     25627.49      32325.76      +26%
Buffer pool 5G     39435.25      47296.12      +20%
Buffer pool 30G    66053.68      67253.45      +1.8%
• Got 20-26% improvements for disk i/o bound workloads on
Fusion I/O
– Both CPU %user and %iowait were improved
• %user: 36% (5.5.2) to 44% (5.5.4) when buf pool = 2g
• %iowait: 8% (5.5.2) to 5.5% (5.5.4) when buf pool = 2g, but iops was
20% higher
– Could handle a lot more concurrent i/o requests in 5.5.4 !
– No big difference was found on 4 HDDs
• Works very well on faster storage such as Fusion I/O or many disks (an i/o settings sketch follows below)
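When moving to this class of storage, InnoDB’s i/o-related settings usually need raising as well; an illustrative my.cnf sketch (the values are assumptions, not taken from the benchmark):
  cat >> /etc/my.cnf <<'EOF'
  [mysqld]
  innodb_io_capacity      = 2000   # default 200 assumes HDD-class IOPS
  innodb_read_io_threads  = 8
  innodb_write_io_threads = 8
  EOF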
47. Conclusion for choosing H/W
• Disks
– PCI-E SSDs (i.e. Fusion I/O) perform very well
– SAS/SATA SSDs (i.e. Intel X25)
– Carefully research RAID controller. Many controllers do not
scale with SSD drives
– Keep enough reserved space if you need to handle massive write traffic
– HDD is good at sequential writes
• Use fast network adapter
– 1GbE will be saturated on DRBD
– 10GbE or Infiniband
• Use Nehalem CPUs
– Especially when using PCI-Express SSDs
48. Conclusion for database deployments
• Put sequentially written files on HDD
– ibdata, ib_logfile, binary log files
– HDD is fast enough for sequential writes
– Write performance deterioration can be mitigated
– Life expectancy of SSD will be longer
• Put randomly accessed files on SSD
– *ibd files, index files(MYI), data files(MYD)
– SSD is 10x–100x faster for random reads than HDD
• Archive less active tables/records to HDD
– SSD is still much more expensive than HDD
• Use InnoDB Plugin
– Higher scalability & concurrency matters on faster storage
49. What will happen in the real database world?
• These are just my thoughts..
• Less demand for NoSQL
– Isn’t it enough for many applications just to replace HDD with Fusion I/O?
– Functionality will become relatively more important
• Stronger demand for Virtualization
– Single server will have enough capacity to run two or more mysqld
instances
• I/O volume matters
– Not just IOPS
– Block size, disabling doublewrite, etc
• Concurrency matters
– Single SSD scales as well as 8-16 HDDs
– Concurrent ALTER TABLE, parallel query
50. Special Thanks To
• Koji Watanabe – Fusion I/O Japan
• Hideki Endo – Sumisho Computer Systems, Japan
– Lent me two Fusion I/O 160GB SLC drives
• Daisuke Homma, Masashi Hasegawa - Sun Japan
– Did benchmarks together
51. Thanks for attending!
• Contact:
– E-mail: Yoshinori.Matsunobu@sun.com
– Blog http://yoshinorimatsunobu.blogspot.com
– @matsunobu on Twitter