SSD Deployment Strategies
                         for MySQL


                                      Yoshinori Matsunobu

                              Lead of MySQL Professional Services APAC
                                          Sun Microsystems
                                    Yoshinori.Matsunobu@sun.com


Copyright 2010 Sun Microsystems inc                         The World’s Most Popular Open Source Database   1
What do you need to consider? (H/W layer)

       • SSD or HDD?
       • Interface
              – SATA/SAS or PCI-Express?
       • RAID
              – H/W RAID, S/W RAID or JBOD?
       • Network
              – Is 1GbE enough?
       • Memory
              – Is 2GB RAM + PCI-E SSD faster than 64GB RAM +
                8HDDs?
       • CPU
              – Nehalem or older Xeon?

What do you need to consider?

       • Redundancy
              –   RAID
              –   DRBD (network mirroring)
              –   Semi-Sync MySQL Replication
              –   Async MySQL Replication

       • Filesystem
              – ext3, xfs, raw device ?


       • File location
              – Data file, Redo log file, etc


       • SSD specific issues
              – Write performance deterioration
              – Write endurance
Why SSD? IOPS!
 •    IOPS: Number of (random) disk i/o operations per second

 •    Almost all database operations require random access
        –   Selecting records by index scan
        –   Updating records
        –   Deleting records
        –   Modifying indexes

 •    Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow)

 •    SSD : 2,000+ (writes) / 5,000+ (reads) per drive
        – depends heavily on the SSD and the device driver

 •    Let’s start from basic benchmarks



Tested HDD/SSD for this session

       • SSD
              – Intel X25-E (SATA, 30GB, SLC)
              – Fusion I/O (PCI-Express, 160GB, SLC)


       • HDD
              – Seagate 160GB SAS 15000RPM




Table of contents

       • Basic Performance on SSD/HDD
              –   Random Reads
              –   Random Writes
              –   Sequential Reads
              –   Sequential Writes
              –   fsync() speed
              –   Filesystem difference
              –   IOPS and I/O unit size


       • MySQL Deployments




Random Read benchmark
       [Chart: Direct Random Read IOPS (Single Drive, 16KB, xfs). X axis: # of I/O threads (1 to 200); Y axis: IOPS; series: HDD, Intel SSD, Fusion I/O]


•   HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads
•   Intel : 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads
•   Fusion I/O : 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads
•   By these figures, single thread throughput on Intel is about 18x that of HDD, and on Fusion I/O about 54x
•   SSDs also scale better with concurrency (about 4x from 1 to 100 threads) than HDD (2.2x)
•   Very strong reason to use SSD
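
The concurrency figures can be checked from the reads/s numbers listed above. A minimal sketch; the dictionary and function names are illustrative only and not part of the original benchmark tooling:

```python
# Read IOPS reported above: (1 I/O thread, 100 I/O threads).
READS_PER_SEC = {
    "HDD": (196, 443),
    "Intel": (3508, 14538),
    "Fusion I/O": (10526, 41379),
}


def concurrency_scaling(single_thread_iops, many_thread_iops):
    """Return how many times IOPS grows from 1 to 100 I/O threads."""
    return many_thread_iops / single_thread_iops


for device, (one, hundred) in READS_PER_SEC.items():
    print(f"{device}: {concurrency_scaling(one, hundred):.2f}x")
```

The two SSDs gain roughly 4x from added I/O threads, while the HDD gains a little over 2x.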
High Concurrency




   • A single SSD drive has multiple NAND Flash Memory chips
     (e.g. 40 x 4GB Flash Memory = 160GB)
   • Depends heavily on the I/O controller and applications
          – A single-threaded application cannot gain the concurrency advantage
PCI-Express SSD
                           CPU


                   North Bridge                              South Bridge
               PCI-Express Controller                     SAS/SATA Controller

                                2GB/s (PCI-Express x 8)                     300MB/s
                 SSD I/O Controller                        SSD I/O Controller

                          Flash                                   Flash


        •    Advantage
               – PCI-Express is much faster interface than SAS/SATA

        •    (current) Disadvantages
               – Most motherboards have limited # of PCI-E slots
               – No hot swap mechanism
Write performance on SSD
        [Chart: Random Write IOPS (16KB Blocks). Y axis: IOPS; groups: HDD (4 RAID10, xfs), Intel (xfs), Fusion (xfs); series: 1 i/o thread, 100 i/o threads]


        •    Very strong reason to use SSD
        •    But wait.. Can we get a high write throughput *anytime*?
               – Not always.. Let’s check how data is written to Flash Memory

Understanding how data is written to SSD (1)
 [Diagram: flash memory chips, each divided into blocks (some empty); a block holds pages]

 •    A single SSD drive consists of many flash memory chips (e.g. 2GB each)
 •    A flash memory chip internally consists of many blocks (e.g. 512KB each)
 •    A block internally consists of many pages (e.g. 4KB each)
 •    It is *not* possible to overwrite a non-empty block
        –   Reading from pages is possible
        –   Writing to pages in an empty block is possible
        –   Appending is possible
        –   Overwriting pages in a non-empty block is *not* possible
Understanding how data is written to SSD (2)
  [Diagram: new data cannot be written over the pages of a non-empty block (×); it goes to an empty block instead]

  •    Overwriting a non-empty block is not possible
  •    New data is written to an empty block instead
  •    Writing to an empty block is fast (~200 microseconds)
  •    Even when applications write to the same positions in the same files (e.g. InnoDB Log
       File), written pages/blocks are distributed (Wear-Leveling)




Understanding how data is written to SSD (3)
    [Diagram: blocks filled with pages (P). 1. Reading all pages, 2. Erasing the block, 3. Writing all data, including the new page]

    • In the long run, almost all blocks will be fully used
           – e.g. Allocating 158GB files on 160GB SSD
    • A new empty block must be allocated on writes
    • Basic steps to write new data:
           – 1. Read all pages from a block
           – 2. ERASE the block
           – 3. Write all data, including the new data, into the block
    • ERASE is a very expensive operation (takes a few milliseconds)
    • At this stage, write performance becomes very slow because of
      massive ERASE operations
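
The read / ERASE / write cycle above can be sketched as a toy model. This is only an illustration of the steps on the slide, not how any real SSD controller is implemented; the class name, method names and the 128 pages per block (512KB / 4KB) are assumptions:

```python
class FlashBlock:
    """Toy model of one flash block; a page is None while it is empty."""

    def __init__(self, pages_per_block=128):
        self.pages = [None] * pages_per_block
        self.erase_count = 0

    def program(self, page_no, data):
        """Write to an empty page; overwriting in place is not allowed."""
        if self.pages[page_no] is not None:
            raise ValueError("page is not empty; the block must be erased first")
        self.pages[page_no] = data

    def erase(self):
        """Erase the whole block (the expensive step)."""
        self.pages = [None] * len(self.pages)
        self.erase_count += 1

    def rewrite(self, page_no, data):
        """Replace one page using the three steps listed above."""
        snapshot = list(self.pages)           # 1. read all pages
        snapshot[page_no] = data
        self.erase()                          # 2. ERASE the block
        for i, page in enumerate(snapshot):   # 3. write all data back
            if page is not None:
                self.program(i, page)
```

In this model, changing a single 4KB page costs one full-block ERASE, which is why write performance drops once empty blocks run out.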
[Diagram: Data Space (blocks filled with pages) and Reserved Space (empty blocks). 1. Reading pages from the data space, 2. Writing data, including the new data, to an empty block in the reserved space. Background jobs ERASE unused blocks]

•    To keep write performance high enough, SSDs have a
     “reserved space” feature
•    Data size visible to applications is limited to
     the size of data space
       – e.g. 160GB SSD, 120GB data space, 40GB reserved space
•    Fusion I/O can change the reserved space size
       – # fio-format -s 96G /dev/fct0
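
The split between data space and reserved space is simple arithmetic. A small sketch, treating the 160GB figure as the whole drive as the example above does; the function name is made up for illustration:

```python
def reserved_space_gb(raw_capacity_gb, data_space_gb):
    """Return the space left for the SSD's internal use."""
    if data_space_gb > raw_capacity_gb:
        raise ValueError("data space cannot exceed the raw capacity")
    return raw_capacity_gb - data_space_gb


for data_space in (120, 96):
    reserved = reserved_space_gb(160, data_space)
    share = reserved / 160
    print(f"{data_space}GB data space -> {reserved}GB reserved ({share:.0%} of the drive)")
```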
Write performance deterioration
  [Chart: Write IOPS deterioration (16KB random write). Y axis: IOPS; groups: Intel, Fusion(150G), Fusion(120G), Fusion(96G), Fusion(80G); series: Fastest, Slowest. Annotations: continuous write-intensive workloads; stopping writing for a while]

  •    At the beginning, write IOPS was close to the “Fastest” line
  •    Under massive writes, write IOPS gradually deteriorated toward the
       “Slowest” line (because massive ERASE happened)
  •    Increasing reserved space improves steady-state write throughput
  •    Write IOPS recovered to “Fastest” after writes were stopped for a long time
       (many blocks were ERASEd by the background job)
  •    Depends heavily on flash memory and I/O controller (TRIM support,
       ERASE scheduling, etc)
Sequential I/O
       [Chart: Sequential Read/Write throughput (1MB consecutive reads/writes). Y axis: MB/s; groups: 4 HDD (raid10, xfs), Intel (xfs), Fusion (xfs); series: Seq read, Seq write]

       • Typical scenario: Full table scan (read), logging/journaling (write)
       • SSD outperforms HDD for sequential reads, but by a smaller margin than for random reads
       • HDD (4 RAID10) is fast enough for sequential i/o
       • Sequential writes tend to transfer huge amounts of data, so you
         need to watch for write deterioration on SSD
       • No strong reason to use SSD for sequential writes
fsync() speed
       [Chart: fsync speed. Y axis: fsync/sec; groups: HDD (xfs), Intel (xfs), Fusion I/O (xfs); series: 1KB, 8KB, 16KB]



       • 10,000+ fsync/sec is fine in most cases
       • Fusion I/O was CPU bound (%system), not I/O bound
         (%iowait).
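
A rough way to measure fsync() speed on your own hardware is sketched below. This is not the tool used for the chart above; the function name, the default count and the temporary-file approach are assumptions made for illustration:

```python
import os
import tempfile
import time


def fsync_rate(directory, write_size=1024, count=200):
    """Append `count` chunks of `write_size` bytes, calling fsync() after each.

    Returns the number of write+fsync pairs completed per second.
    """
    payload = b"x" * write_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return count / elapsed


if __name__ == "__main__":
    # Same write sizes as the chart: 1KB, 8KB and 16KB.
    for size in (1024, 8192, 16384):
        print(f"{size // 1024}KB: {fsync_rate('.', size):.0f} fsync/sec")
```

Run it from a directory on the device under test. Results depend on the filesystem, write cache settings and drivers, so treat them as indicative only.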

HDD is fast for sequential writes / fsync

 • Best Practice: Writes can be boosted by using BBWC
   (Battery Backed up Write Cache), especially for REDO
   Logs (because they are written sequentially)
 • No strong reason to use SSDs here

       [Diagram: writing directly to disk (seek & rotation time) vs writing through a write cache to disk (seek & rotation time)]


Filesystem matters
       [Chart: Random write iops (16KB Blocks). X axis: Filesystem (Fusion (ext3), Fusion (xfs), Fusion (raw)); Y axis: iops; series: 1 thread, 16 thread]


       • On xfs, multiple threads can write to the same file concurrently if it is
         opened with O_DIRECT, but they cannot on ext*
       • Good concurrency on xfs, close to raw device
       • ext3 is less optimized for Fusion I/O

Changing I/O unit size
        [Chart: Read IOPS and I/O unit size (4 HDD RAID10). X axis: concurrency (1 to 200); Y axis: IOPS; series: 1KB, 4KB, 16KB]




        • On HDD, the performance difference between 1KB and 16KB
          was 22% at most
        • No big difference when concurrency < 10

Changing I/O unit size on SSD
        [Chart: Read IOPS and I/O unit size (Fusion I/O). X axis: concurrency (1 to 200); Y axis: IOPS; series: 1KB, 4KB, 16KB]


        • Huge difference
        • On SSDs, not only IOPS but also I/O transfer size matters
        • It is worth considering “configurable block size” support
          in storage engines

Let’s start MySQL benchmarking
   • Base: Disk-bound application (DBT-2) running on:
          –   Sun Fire X4270
          –   Nehalem 8 Core
          –   4 HDD
          –   RAID1+0, Write Cache with Battery


   • What will happen if …
          –   Replacing HDD with Intel SSD (SATA)
          –   Replacing HDD with Fusion I/O (PCI-E)
          –   Moving log files and ibdata to HDD
          –   Not using Nehalem
          –   Using two Fusion I/O drives with Software RAID1
          –   Deploying DRBD protocol B or C
                 • Replacing 1GbE with 10GbE
          – Using MySQL 5.5.4

DBT-2 condition

       •    SuSE Enterprise Linux 11, xfs
       •    MySQL 5.5.2M2 (InnoDB Plugin 1.0.6)
       •    200 Warehouses (20GB – 25GB hot data)
       •    Buffer pool size
              –   1GB
              –   2GB
              –   5GB
              –   30GB (large enough to cache all data)


       • 1000 seconds warm up time
       • Running 3600 seconds (1 hour)
       • Fusion I/O: 96GB data space, 64GB reserved space
HDD vs Intel SSD
                                      HDD       Intel
   Buffer pool 1G                     1125.44   5709.06
   (NOTPM: Transactions per minute)



           •    Storing all data on HDD or Intel SSD
           •    Massive disk i/o happens
                  – Random reads for all accesses
                  – Random writes for updating rows and indexes
                  – Sequential writes for REDO log files, etc
           •    SSD is very good at these kinds of workloads
           •    About 5 times performance improvement, without any application
                change!


HDD vs Intel SSD vs Fusion I/O

                               HDD       Intel            Fusion I/O
  Buffer pool 1G               1125.44   5709.06          15122.75




              •    Fusion I/O is a PCI-E based SSD
              •    PCI-E is much faster than SAS/SATA
              •    About 13x improvement compared to 4HDDs




Which should we spend money on, RAM or SSD?
                                      HDD        Intel               Fusion I/O
  Buffer pool 1G                      1125.44    5709.06             15122.75
  Buffer pool 2G                      1863.19
  Buffer pool 5G                      4385.18
  Buffer pool 30G                     36784.76
  (Caching all hot
  data)
         •    Increasing RAM (buffer pool size) reduces random disk reads
                – Because more data are cached in the buffer pool
         •    If all data are cached, only disk writes (both random and
              sequential) happen
         •    Disk writes happen asynchronously, so application queries can
              be much faster
         •    Large enough RAM + HDD outperforms too small RAM + SSD
Which should we spend money on, RAM or SSD?
                                HDD           Intel                  Fusion I/O
  Buffer pool 1G                1125.44       5709.06                15122.75
  Buffer pool 2G                1863.19       7536.55                20096.33
  Buffer pool 5G                4385.18       12892.56               30846.34
  Buffer pool 30G               36784.76      -                      57441.64
  (Caching all hot
  data)

         •    It is not always possible to cache all hot data
         •    Fusion I/O + a good amount of memory (5GB) was pretty good

         •    A basic rule can be:
                – If you can cache all active data: large enough RAM + HDD
                – If you can’t, or if you need extremely high throughput: spend on
                  both RAM and SSD
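
The basic rule above can be written down as a tiny decision helper. This is only a sketch of the rule of thumb on this slide; the function and parameter names are invented for illustration, and real sizing decisions need measurement:

```python
def storage_recommendation(hot_data_gb, cacheable_ram_gb, needs_extreme_throughput=False):
    """Apply the rule of thumb: cache everything if you can, else add SSD."""
    fits_in_ram = hot_data_gb <= cacheable_ram_gb
    if fits_in_ram and not needs_extreme_throughput:
        return "large enough RAM + HDD"
    return "RAM + SSD"


# DBT-2 setup from these slides: up to 25GB of hot data.
print(storage_recommendation(25, 30))   # whole hot set fits in the buffer pool
print(storage_recommendation(25, 5))    # only a fraction can be cached
```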
Let’s think about MySQL file location
  •    SSD is extremely good at random reads
  •    SSD is very good at random writes
  •    HDD is good enough at sequential reads/writes
  •    No strong reason to use SSD for sequential writes

  •    Random I/O oriented:
         – Data Files (*.ibd)
                • Sequential reads if doing full table scan
         – Undo Log, Insert Buffer (ibdata)
                • UNDO tablespace (small in most cases, except when running long-running batches)
                • On-disk insert buffer space (small in most cases, except when InnoDB cannot
                  keep up with updating indexes)

  •    Sequential Write oriented:
         – Doublewrite Buffer (ibdata)
                • Write volume is equal to that of the *.ibd files. Huge
         – Binary log (mysql-bin.XXXXXX)
         – Redo log (ib_logfile)
         – Backup files
Moving sequentially written files onto HDD
                                  Fusion I/O           Fusion I/O + HDD                   Up
    Buffer pool 1G                15122.75             19295.94                           +28%
                                  (us=25%, wa=15%)     (us=32%, wa=10%)
    Buffer pool 2G                20096.33             25627.49                           +28%
                                  (us=30%, wa=12.5%)   (us=36%, wa=8%)
    Buffer pool 5G                30846.34             39435.25                           +28%
                                  (us=39%, wa=10%)     (us=49%, wa=6%)
    Buffer pool 30G               57441.64             66053.68                           +15%
                                  (us=70%, wa=3.5%)    (us=77%, wa=1%)

         •    Moving ibdata, ib_logfile, (+binary logs) onto HDD
         •    High impact on performance
                – Write volume to SSD is halved because the doublewrite area is
                  allocated on HDD
                – %iowait was significantly reduced
                – You can delay write performance deterioration
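
One way to express this file layout is through the MySQL options that control file locations. The sketch below only renders a `[mysqld]` fragment as text; the mount points `/ssd` and `/hdd` and the directory names are hypothetical, and the exact settings used in the benchmark are not given in the slides:

```python
def render_my_cnf(ssd_mount, hdd_mount):
    """Render a [mysqld] fragment that keeps *.ibd files on SSD and
    ibdata, redo logs and binary logs on HDD."""
    settings = {
        # Per-table data files (*.ibd) live under datadir: random I/O, on SSD.
        "datadir": f"{ssd_mount}/mysql",
        "innodb_file_per_table": "1",
        # ibdata (doublewrite buffer, undo, insert buffer) goes to HDD.
        "innodb_data_home_dir": f"{hdd_mount}/mysql",
        # Redo log files (ib_logfile*) go to HDD.
        "innodb_log_group_home_dir": f"{hdd_mount}/mysql",
        # Binary logs (mysql-bin.XXXXXX) go to HDD.
        "log-bin": f"{hdd_mount}/mysql/mysql-bin",
    }
    lines = ["[mysqld]"] + [f"{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


print(render_my_cnf("/ssd", "/hdd"))
```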
Does CPU matter?
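        [Diagram: Nehalem: memory attached directly to the CPUs; CPUs to North Bridge (IOH) over QPI at 25.6GB/s; PCI-Express below the IOH. Older Xeon: CPUs to North Bridge (MCH) over FSB at 10.6GB/s; memory attached to the MCH; PCI-Express below the MCH]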

 •      Nehalem has two big advantages
        1. Memory is directly attached to CPU : Faster for in-memory workloads
        2. Interface speed between CPU and North Bridge is 2.5x higher, and
           interface traffic does not conflict with CPU<->Memory traffic : Faster for
           disk i/o workloads when using PCI-Express SSDs
Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD)


   HDD                                Harpertown X5470,   Nehalem(X5570,                  Up
                                      3.33GHz             2.93GHz)
   Buffer pool 1G                     1135.37 (us=1%)     1125.44 (us=1%)                 -1%
   Buffer pool 2G                     1922.23 (us=2%)     1863.19 (us=2%)                 -3%
   Buffer pool 5G                     4176.51 (us=7%)     4385.18(us=7%)                  +5%
   Buffer pool 30G                    30903.4 (us=40%)    36784.76 (us=40%)               +19%

                                                              us: userland CPU utilization


           •    CPU difference matters on CPU bound workloads




Harpertown X5470 vs Nehalem X5570 (Fusion)
Fusion I/O+HDD                        Harpertown X5470,   Nehalem(X5570,                   Up
                                      3.33GHz             2.93GHz)
Buffer pool 1G                        13534.06 (user=35%) 19295.94 (user=32%) +43%
Buffer pool 2G                        19026.64 (user=40%) 25627.49 (user=37%) +35%
Buffer pool 5G                        30058.48 (user=50%) 39435.25 (user=50%) +31%
Buffer pool 30G                       52582.71 (user=76%) 66053.68 (user=76%) +26%


    • TPM difference was much larger than on HDD
    • For disk i/o bound workloads (buffer pool 1G/2G), CPU utilization
      on Nehalem was lower, but TPM was much higher
           – Verified that Nehalem is much more efficient for PCI-E workloads
    • Benefits from the high interface speed between CPU and PCI-Express
    • Fusion I/O fits Nehalem much better than traditional CPUs


We need to think about redundancy overhead

       • Single server + No RAID is meaningless in the real
         database world
       • Redundancy
              – RAID 1 / 5 / 10
              – Network mirroring (DRBD)
              – Replication (Sync / Async)
       • Relative overhead for redundancy will be (much)
         higher than in a traditional HDD environment




Fusion I/O + Software RAID1

       • Fusion I/O itself has a RAID5 feature
              – Writes parity bits into Flash Memory
              – Flash chips are not a Single Point of Failure
              – Controller / PCI-E board is a Single Point of Failure


       • Right now no H/W RAID controller is provided for
         PCI-E SSDs

       • Using Software RAID1 (or RAID10)
              – Two Fusion I/O drives in the same machine




Understanding how software RAID1 works
     [Diagram: H/W RAID1: App/DB writes to files on /dev/sdX; the RAID controller's write cache with battery returns the response, then background writes go to Disk1 and Disk2 (in parallel). S/W RAID1: App/DB writes to files on /dev/md0; the software RAID daemon (“md0_raid1” process) writes to Disk1 and Disk2 (in parallel) before the response is returned]

       • Response time on Software RAID1 is
           max(time-to-write-to-disk1, time-to-write-to-disk2)
       • If either of the two is delayed by an ERASE, response time will be
         longer
       • On faster storage / faster writes (e.g. sequential write + fsync),
         the relative overhead of the software RAID process is higher
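
The effect of taking the max of two write times can be illustrated with a small simulation. All numbers here are assumptions chosen for illustration (200 microseconds for a normal write, 2 milliseconds when an ERASE is hit, a 5% chance of hitting one); they are not measurements from these slides:

```python
import random


def simulate_mean_latency(writes=10000, erase_probability=0.05,
                          fast_us=200, slow_us=2000, seed=1):
    """Return (single-drive mean, mirrored mean) write latency in microseconds.

    A mirrored write completes only when both drives have finished, so its
    latency is max(disk1, disk2).
    """
    rng = random.Random(seed)

    def one_write():
        # A write occasionally has to wait for an ERASE.
        return slow_us if rng.random() < erase_probability else fast_us

    single_total = 0
    mirrored_total = 0
    for _ in range(writes):
        disk1 = one_write()
        disk2 = one_write()
        single_total += disk1
        mirrored_total += max(disk1, disk2)
    return single_total / writes, mirrored_total / writes


single, mirrored = simulate_mean_latency()
print(f"single drive: {single:.0f} us, S/W RAID1: {mirrored:.0f} us")
```

Because either drive can stall on an ERASE, the mirrored write hits the slow path about twice as often as a single drive does.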
Random Write IOPS, S/W RAID1 vs No-RAID
     [Chart: Random Write IOPS (Fusion I/O 160GB SLC, 16KB I/O unit, XFS). X axis: Running time (minutes), 1 to 481; Y axis: IOPS; series: No-RAID (120G), S/W RAID1 (120G), No-RAID (96G), S/W RAID1 (96G)]

•    120GB data space = 40GB additional reserved space
•    96GB data space = 64GB additional reserved space
•    On S/W RAID1, IOPS deteriorated more quickly than on No-RAID
•    On S/W RAID1 with 96GB data space, the slowest IOPS was lower than on No-RAID
•    A 20-25% performance drop can be expected on disk write bound workloads
What about Reads?
       [Chart: Read IOPS (16KB Blocks). X axis: concurrency (1 to 200); Y axis: IOPS; series: No-RAID, S/W RAID1]



       • Theoretically, RAID1 can double read IOPS
       • Peak IOPS was 43636 on No-RAID, 75627 on S/W RAID1, 73% up
       • Good scalability

DBT-2, No-RAID vs S/W RAID on Fusion I/O

                      Fusion I/O+HDD   RAID1 Fusion I/O+HDD   %iowait   Down
Buffer pool 1G        19295.94         15468.81               10%       -19.8%
Buffer pool 2G        25627.49         21405.23               8%        -16.5%
Buffer pool 5G        39435.25         35086.21               6-7%      -11.0%
Buffer pool 30G       66053.68         66426.52               0-1%      +0.56%
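
The "Down" column can be reproduced from the two throughput columns. The following is a minimal sketch in Python; the function name `percent_change` and the dictionary layout are illustrative choices, and the numbers are copied from the table above.

```python
def percent_change(baseline: float, measured: float) -> float:
    """Relative change of `measured` against `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Throughput values copied from the table: (No-RAID, S/W RAID1)
results = {
    "1G": (19295.94, 15468.81),
    "2G": (25627.49, 21405.23),
    "5G": (39435.25, 35086.21),
    "30G": (66053.68, 66426.52),
}

for pool, (no_raid, raid1) in results.items():
    print(f"Buffer pool {pool}: {percent_change(no_raid, raid1):+.1f}%")
```

The overhead shrinks as the buffer pool grows, which matches the falling %iowait column: the less the workload waits on disk, the less the S/W RAID1 write penalty shows up.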




Intel SSDs with a traditional H/W RAID controller

                                      Single raw Intel   Four RAID5 Intel                    Down
   Buffer pool 1G                     5709.06            2975.04                             -48%
   Buffer pool 2G                     7536.55            4763.60                             -37%
   Buffer pool 5G                     12892.56           11739.27                            -9%

         •    Raw SSD drives performed much better than SSDs behind a traditional
              H/W RAID controller
                – Even on RAID10, performance was worse than on a single raw drive
                – The H/W RAID controller seemed to be a serious bottleneck
                – Make sure the SSD drives themselves have a write cache and a
                  capacitor (Intel X25-V/M/E does not have a capacitor)
         •    Use JBOD + write cache + capacitor
         •    Research appliances such as Schooner, Gear6, etc.
         •    Wait until H/W vendors release good H/W RAID controllers that work
              well with SSDs

What about DRBD?
       • A single server is not highly available
              – The motherboard, RAID controller, etc. are single points of failure
       • Heartbeat + DRBD + MySQL is one of the most
         common HA (Active/Passive) solutions
       • The network might be a bottleneck
              – 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc.
       • Replication level
              – Protocol A (async)
              – Protocol B (sync to remote drbd receiver process)
              – Protocol C (sync to remote disk)
       • Network channel is single threaded
              – Storing all data under /data (single DRBD partition) => single
                thread
              – Storing log/ibdata under /hdd, *ibd under /ssd => two
                threads
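
The three replication levels differ in when a write is reported as complete. The sketch below is a deliberately simplified latency model in Python, not DRBD code: the function name, the assumption that the local write and the network transfer overlap, and all latency values are hypothetical. It ignores bandwidth saturation and the single-threaded network channel mentioned above.

```python
def drbd_ack_latency_ms(protocol, local_disk_ms, one_way_net_ms, remote_disk_ms):
    """Approximate time until a replicated write is acknowledged (toy model)."""
    if protocol == "A":
        # Async: the write completes once the local disk write is done.
        return local_disk_ms
    round_trip_ms = 2 * one_way_net_ms
    if protocol == "B":
        # Sync to the remote receiver process: wait for the peer to get the data.
        return max(local_disk_ms, round_trip_ms)
    if protocol == "C":
        # Sync to the remote disk: wait for the peer's disk write as well.
        return max(local_disk_ms, round_trip_ms + remote_disk_ms)
    raise ValueError(f"unknown DRBD protocol: {protocol!r}")


# Hypothetical latencies in milliseconds, for illustration only.
ONE_WAY_NET_MS = 0.1
for device, disk_ms in (("HDD", 5.0), ("SSD", 0.1)):
    for protocol in "ABC":
        latency = drbd_ack_latency_ms(protocol, disk_ms, ONE_WAY_NET_MS, disk_ms)
        print(f"{device} protocol {protocol}: {latency:.2f} ms")
```

With these made-up numbers the network round trip is hidden behind the slow HDD write but dominates the fast SSD write, which is one way to see why the relative redundancy overhead is higher on SSD.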
DRBD Overheads on HDD

            HDD                 No DRBD    DRBD Protocol B, 1GbE   DRBD Protocol B, 10GbE
            Buffer pool 1G      1125.44    1080.8                  1101.63
            Buffer pool 2G      1863.19    1824.75                 1811.95
            Buffer pool 5G      4385.18    4285.22                 4326.22
            Buffer pool 30G     36784.76   32862.81                35689.67



            •     DRBD 8.3.7
            •     DRBD overhead (protocol B) was small on disk i/o bound
                  workloads
            •     The difference between 1GbE and 10GbE was also small on disk
                  i/o bound workloads

DRBD Overheads on Fusion I/O
Fusion I/O+HDD     No DRBD    DRBD Protocol B, 1GbE   Down     DRBD Protocol B, 10GbE   Down
Buffer pool 1G     19295.94   5976.18                 -69.0%   12107.88                 -37.3%
Buffer pool 2G     25627.49   8100.5                  -68.4%   16776.19                 -34.5%
Buffer pool 5G     39435.25   16073.9                 -59.2%   30288.63                 -23.2%
Buffer pool 30G    66053.68   37974                   -42.5%   62024.68                 -6.1%



    •    DRBD overhead was not negligible
    •    10GbE performed much better than 1GbE
    •    Fusion I/O with DRBD was still 6-10 times faster than HDD
    •    Note: DRBD supports faster interconnects such as InfiniBand SDP
         and Dolphin Interconnect
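
The slide does not say which configurations the "6-10 times" figure compares. One possible reading, sketched below in Python, divides the Fusion I/O results by the HDD results for the same buffer pool and the same network, using the numbers from the two DRBD tables. The dictionary layout and the function name `speedup` are illustrative.

```python
# Throughput with DRBD protocol B, copied from the HDD and Fusion I/O tables.
hdd = {
    "1G": {"1GbE": 1080.80, "10GbE": 1101.63},
    "2G": {"1GbE": 1824.75, "10GbE": 1811.95},
    "5G": {"1GbE": 4285.22, "10GbE": 4326.22},
    "30G": {"1GbE": 32862.81, "10GbE": 35689.67},
}
fusion = {
    "1G": {"1GbE": 5976.18, "10GbE": 12107.88},
    "2G": {"1GbE": 8100.50, "10GbE": 16776.19},
    "5G": {"1GbE": 16073.90, "10GbE": 30288.63},
    "30G": {"1GbE": 37974.00, "10GbE": 62024.68},
}


def speedup(pool: str, network: str) -> float:
    """Fusion I/O throughput divided by HDD throughput for the same setup."""
    return fusion[pool][network] / hdd[pool][network]


for network in ("1GbE", "10GbE"):
    for pool in ("1G", "2G", "5G", "30G"):
        print(f"{network:>5}, buffer pool {pool:>3}: {speedup(pool, network):.1f}x")
```

Under this reading, the disk-bound buffer pools (1G to 5G) on 10GbE come out at roughly 7x to 11x, while the 1GbE ratios are lower because the network limits Fusion I/O far more than it limits HDD.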


Misc topic: Insert performance on InnoDB vs MyISAM (HDD)


       [Figure: Time to insert 1 million records (HDD); y-axis: seconds (0 to 5000); x-axis: existing records (1 to 145 million); series: InnoDB, MyISAM; chart annotation: 250 rows/s]


            • MyISAM doesn’t do any special i/o optimization like “Insert
              Buffering”, so a lot of random reads/writes happen, and performance
              depends heavily on the OS
            • Disk seek & rotation overhead is really serious on HDD

Note: Insert Buffering (InnoDB feature)
       • If non-unique secondary index blocks are not in memory, InnoDB inserts
         entries into a special buffer (the “insert buffer”) to avoid random
         disk i/o operations
              – The insert buffer is allocated both in memory and in the InnoDB
                SYSTEM tablespace

       • Periodically, the insert buffer is merged into the secondary index
         trees in the database (“merge”)

       • Pros: Reducing I/O overhead
              – Reducing the number of disk i/o operations by merging i/o
                requests to the same block
              – Some random i/o operations can become sequential

       • Cons:
              – Additional operations are added
              – Merging might take a very long time
                     • when many secondary indexes must be updated and many
                       rows have been inserted
                     • it may continue to happen after a server shutdown and
                       restart

       [Diagram: insert buffer -> optimized i/o]
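
The benefit of merging i/o requests to the same block can be illustrated with a toy model. The Python sketch below is not InnoDB's implementation: the class `InsertBufferSketch`, the page ids and the workload are hypothetical, and the unbuffered baseline assumes that an uncached index page is evicted again before the next insert touches it.

```python
from collections import defaultdict


class InsertBufferSketch:
    """Toy model: defer secondary index changes for pages that are not cached."""

    def __init__(self, cached_pages):
        self.cached_pages = set(cached_pages)  # index pages already in memory
        self.pending = defaultdict(list)       # page id -> buffered entries
        self.page_ios = 0                      # index pages read from disk

    def insert(self, page_id, entry):
        if page_id in self.cached_pages:
            return                             # applied in memory, no disk read
        self.pending[page_id].append(entry)    # buffered instead of reading the page

    def merge(self):
        """Apply buffered entries; each distinct page is read only once."""
        merged_entries = sum(len(entries) for entries in self.pending.values())
        self.page_ios += len(self.pending)
        self.pending.clear()
        return merged_entries


def unbuffered_page_ios(cached_pages, inserts):
    """Baseline: every insert into an uncached page costs one page read."""
    cached = set(cached_pages)
    return sum(1 for page_id, _ in inserts if page_id not in cached)


# Hypothetical workload: (index page id, index entry); page 3 is cached.
inserts = [(7, "a"), (9, "b"), (7, "c"), (3, "d"), (9, "e"), (7, "f")]

buf = InsertBufferSketch(cached_pages={3})
for page_id, entry in inserts:
    buf.insert(page_id, entry)
merged = buf.merge()

print("page reads without buffering:", unbuffered_page_ios({3}, inserts))
print("page reads with buffering:   ", buf.page_ios)
print("entries merged:              ", merged)
```

Five inserts hit uncached pages, but they touch only two distinct pages, so the merge needs two page reads instead of five. The cost is the deferred merge work itself, which is the "Cons" side of the slide.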
Insert performance: InnoDB vs MyISAM (SSD)
       [Figure: Time to insert 1 million records (SSD); y-axis: seconds (0 to 600); x-axis: existing records (1 to 103 million); series: InnoDB, MyISAM; chart annotations: 2,000 rows/s, 5,000 rows/s, “Index size exceeded buffer pool size”, “Filesystem cache was fully used, disk reads began”]


            • MyISAM got much faster by just replacing HDD with SSD !
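
Each point on the two insert charts is the time needed to insert 1 million records, so the rows/s annotations and the seconds axis are two views of the same quantity. A minimal Python sketch of the conversion follows; the function names are illustrative, and which exact levels the annotations mark on the charts is an assumption based on the axis values.

```python
ROWS_PER_BATCH = 1_000_000  # each chart point covers 1 million inserted records


def rows_per_second(seconds_per_batch: float) -> float:
    """Insert rate for a batch that took `seconds_per_batch` seconds."""
    return ROWS_PER_BATCH / seconds_per_batch


def seconds_for_batch(rate_rows_per_s: float) -> float:
    """Time to insert one batch at the given rate."""
    return ROWS_PER_BATCH / rate_rows_per_s


# The three rates annotated on the HDD and SSD charts.
for rate in (250, 2_000, 5_000):
    print(f"{rate:>5} rows/s -> {seconds_for_batch(rate):.0f} s per million rows")
```

So 250 rows/s on HDD corresponds to 4000 seconds per million rows, while 2,000 and 5,000 rows/s on SSD correspond to 500 and 200 seconds.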


Try MySQL 5.5.4 !
      Fusion I/O + HDD     MySQL 5.5.2   MySQL 5.5.4   Up
      Buffer pool 1G       19295.94      24019.32      +24%
      Buffer pool 2G       25627.49      32325.76      +26%
      Buffer pool 5G       39435.25      47296.12      +20%
      Buffer pool 30G      66053.68      67253.45      +1.8%

  • Got 20-26% improvements for disk i/o bound workloads on
    Fusion I/O
         – Both CPU %user and %iowait improved
                • %user: 36% (5.5.2) to 44% (5.5.4) when buf pool = 2g
                • %iowait: 8% (5.5.2) to 5.5% (5.5.4) when buf pool = 2g, even though
                  iops was 20% higher
         – 5.5.4 could handle a lot more concurrent i/o requests
         – No big difference was found on 4 HDDs
                • 5.5.4 works very well on faster storage such as Fusion I/O or many disks
Conclusion for choosing H/W
       • Disks
              – PCI-E SSDs (e.g. Fusion I/O) perform very well
              – SAS/SATA SSDs (e.g. Intel X25)
              – Carefully research RAID controllers. Many controllers do not
                scale with SSD drives
              – Keep enough reserved space if you need to handle massive
                write traffic
              – HDD is good at sequential writes

       • Use a fast network adapter
              – 1GbE will be saturated on DRBD
              – 10GbE or InfiniBand

       • Use a Nehalem CPU
              – Especially when using PCI-Express SSDs
Conclusion for database deployments
       • Put sequentially written files on HDD
              –   ibdata, ib_logfile, binary log files
              –   HDD is fast enough for sequential writes
              –   Write performance deterioration can be mitigated
              –   Life expectancy of SSD will be longer


       • Put randomly accessed files on SSD
              – *ibd files, index files (MYI), data files (MYD)
              – SSD is 10x-100x faster than HDD for random reads


       • Archive less active tables/records to HDD
              – SSD is still much more expensive than HDD


       • Use InnoDB Plugin
              – Higher scalability & concurrency matters on faster storage
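
The file placement rules above can be written down as a small lookup. The Python sketch below is only an illustration: the function name `suggested_device` is hypothetical, and the patterns assume default file names (`ibdata1`, `ib_logfile0`, and binary logs named `mysql-bin.NNNNNN`), which may differ on a given installation.

```python
from fnmatch import fnmatchcase

# Sequentially written files: system tablespace, redo logs, binary logs.
SEQUENTIAL_PATTERNS = ("ibdata*", "ib_logfile*", "mysql-bin.*")
# Randomly accessed files: InnoDB per-table data, MyISAM index and data files.
RANDOM_PATTERNS = ("*.ibd", "*.MYI", "*.MYD")


def suggested_device(filename: str) -> str:
    """Return "HDD", "SSD" or "unclassified" following the slide's rules."""
    if any(fnmatchcase(filename, pattern) for pattern in SEQUENTIAL_PATTERNS):
        return "HDD"
    if any(fnmatchcase(filename, pattern) for pattern in RANDOM_PATTERNS):
        return "SSD"
    return "unclassified"


for name in ("ibdata1", "ib_logfile0", "mysql-bin.000042",
             "orders.ibd", "t1.MYI", "t1.MYD", "my.cnf"):
    print(f"{name:<18} -> {suggested_device(name)}")
```

Anything the patterns do not cover is reported as unclassified rather than guessed, since the slide only names these file types.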
What will happen in the real database world?
    • These are just my thoughts...

    • Less demand for NoSQL
           – Isn’t it enough for many applications just to replace HDD with Fusion I/O?
           – Functionality will become relatively more important

    • Stronger demand for Virtualization
           – Single server will have enough capacity to run two or more mysqld
             instances

    • I/O volume matters
           – Not just IOPS
           – Block size, disabling doublewrite, etc

    • Concurrency matters
           – Single SSD scales as well as 8-16 HDDs
           – Concurrent ALTER TABLE, parallel query
Special Thanks To

       • Koji Watanabe – Fusion I/O Japan
       • Hideki Endo – Sumisho Computer Systems, Japan
              – Rent me two Fusion I/O 160GB SLC drives


       • Daisuke Homma, Masashi Hasegawa - Sun Japan
              – Did benchmarks together




Thanks for attending!

       • Contact:
              – E-mail: Yoshinori.Matsunobu@sun.com
              – Blog http://yoshinorimatsunobu.blogspot.com
              – @matsunobu on Twitter





More Related Content

SSD Deployment Strategies for MySQL

  • 1. SSD Deployment Strategies for MySQL Yoshinori Matsunobu Lead of MySQL Professional Services APAC Sun Microsystems Yoshinori.Matsunobu@sun.com Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 1
  • 2. What do you need to consider? (H/W layer) • SSD or HDD? • Interface – SATA/SAS or PCI-Express? • RAID – H/W RAID, S/W RAID or JBOD? • Network – Is 1GbE enough? • Memory – Is 2GB RAM + PCI-E SSD faster than 64GB RAM + 8HDDs? • CPU – Nehalem or older Xeon? Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 2
  • 3. What do you need to consider? • Redundancy – RAID – DRBD (network mirroring) – Semi-Sync MySQL Replication – Async MySQL Replication • Filesystem – ext3, xfs, raw device ? • File location – Data file, Redo log file, etc • SSD specific issues – Write performance deterioration – Write endurance Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 3
  • 4. Why SSD? IOPS! • IOPS: Number of (random) disk i/o operations per second • Almost all database operations require random access – Selecting records by index scan – Updating records – Deleting records – Modifying indexes • Regular SAS HDD : 200 iops per drive (disk seek & rotation is slow) • SSD : 2,000+ (writes) / 5,000+ (reads) per drive – highly depending on SSDs and device drivers • Let’s start from basic benchmarks Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 4
  • 5. Tested HDD/SSD for this session • SSD – Intel X25-E (SATA, 30GB, SLC) – Fusion I/O (PCI-Express, 160GB, SLC) • HDD – Seagate 160GB SAS 15000RPM Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 5
  • 6. Table of contents • Basic Performance on SSD/HDD – Random Reads – Random Writes – Sequential Reads – Sequential Writes – fsync() speed – Filesystem difference – IOPS and I/O unit size • MySQL Deployments Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 6
  • 7. Random Read benchmark Direct Random Read IOPS (Single Drive, 16KB, xfs) 45000 40000 35000 30000 25000 HDD IOPS 20000 Intel SSD 15000 Fusion I/O 10000 5000 0 1 2 3 4 5 6 8 10 15 20 30 40 50 100 200 # of I/O threads • HDD: 196 reads/s at 1 i/o thread, 443 reads/s at 100 i/o threads • Intel : 3508 reads/s at 1 i/o thread, 14538 reads/s at 100 i/o threads • Fusion I/O : 10526 reads/s at 1 i/o thread, 41379 reads/s at 100 i/o threads • Single thread throughput on Intel is 16x better than on HDD, Fusion is 25x better • SSD’s concurrency (4x) is much better than HDD’s (2.2x) • Very strong reason to use SSD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 7
  • 8. High Concurrency • Single SSD drive has multiple NAND Flash Memory chips (i.e. 40 x 4GB Flash Memory = 160GB) • Highly depending on I/O controller and Applications – Single threaded application can not gain concurrency advantage Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 8
  • 9. PCI-Express SSD CPU North Bridge South Bridge PCI-Express Controller SAS/SATA Controller 2GB/s (PCI-Express x 8) 300MB/s SSD I/O Controller SSD I/O Controller Flash Flash • Advantage – PCI-Express is much faster interface than SAS/SATA • (current) Disadvantages – Most motherboards have limited # of PCI-E slots – No hot swap mechanism Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 9
  • 10. Write performance on SSD Random Write IOPS (16KB Blocks) 20000 18000 16000 14000 12000 1 i/o thread 10000 100 i/o threads 8000 6000 4000 2000 0 HDD(4 RAID10 xfs) Intel(xfs) Fusion (xfs) • Very strong reason to use SSD • But wait.. Can we get a high write throughput *anytime*? – Not always.. Let’s check how data is written to Flash Memory Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 10
  • 11. Understanding how data is written to SSD (1) Block (empty) Block (empty) Block (empty) Block Page Page …. Flash memory chips • Single SSD drive consists of many flash memory chips (i.e. 2GB) • A flash memory chip internally consists of many blocks (i.e. 512KB) • A block internally consists of many pages (i.e. 4KB) • It is *not* possible to overwrite to a non-empty block – Reading from pages is possible – Writing to pages in an empty block is possible – Appending is possible – Overwriting to pages in a non-empty block is *not* possible Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 11
  • 12. Understanding how data is written to SSD (2) Block (empty) Block (empty) New data Block (empty) Block × Page Page …. • Overwriting to a non-empty block is not possible • Writing new data to an empty block instead • Writing to a non-empty block is fast (-200 microseconds) • Even though applications write to same positions in same files (i.e. InnoDB Log File), written pages/blocks are distributed (Wear-Leveling) Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 12
  • 13. Understanding how data is written to SSD (3) Block P Block P Block P P P P 1. Reading all pages Block P Block P Block P P P New P 2. Erasing the block Block Block P Block P P P P 3. Writing all data P P • In the long run, almost all blocks will be fully used New P – i.e. Allocating 158GB files on 160GB SSD • New empty block must be allocated on writes • Basic steps to write new data: – 1. Reading all pages from a block – 2. ERASE the block – 3. Writing all data w/ new data into the block • ERASE is very expensive operation (takes a few milliseconds) • At this stage, write performance becomes very slow because of massive ERASE operations Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 13
  • 14. Data Space Reserved Space Reserved Space Block P Block P Block P Block (empty) P P P Block P Block P Block P Block (empty) P P P Block Block P Block P 2. Writing data P P P 1. Reading pages P New data Background jobs ERASE unused blocks P • To keep high enough write performance, SSDs have a feature of “reserved space” • Data size visible to applications is limited to the size of data space – i.e. 160GB SSD, 120GB data space, 40GB reserved space • Fusion I/O has a functionality to change reserved space size – # fio-format -s 96G /dev/fct0 Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 14
  • 15. Write performance deterioration Write IOPS deterioration (16KB random write) 30000 Continuous write-intensive workloads 25000 20000 IOPS Fastest 15000 Slowest 10000 5000 Stopping writing for a while 0 Intel Fusion(150G) Fusion(120G) Fusion(96G) Fusion(80G) • At the beginning, write IOPS was close to “Fastest” line • When massive writes happened, write IOPS gradually deteriorated toward “Slowest” line (because massive ERASE happened) • Increasing reserved space improves steady-state write throughput • Write IOPS recovered to “Fastest” when stopping writes for a long time (Many blocks were ERASEd by background job) • Highly depending on Flash memory and I/O controller (TRIM support, ERASE scheduling, etc) Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 15
  • 16. Sequential I/O Sequential Read/Write throughput (1MB consecutive reads/writes) 600 500 400 MB/s Seq read 300 Seq write 200 100 0 4 HDD(raid10, xfs) Intel(xfs) Fusion(xfs) • Typical scenario: Full table scan (read), logging/journaling (write) • SSD outperforms HDD for sequential reads, but less significant • HDD (4 RAID10) is fast enough for sequential i/o • Data transfer size by sequential writes tends to be huge, so you need to care about write deterioration on SSD • No strong reason to use SSD for sequential writes Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 16
  • 17. fsync() speed fsync speed 20000 18000 16000 14000 fsync/sec 12000 1KB 10000 8KB 8000 16KB 6000 4000 2000 0 HDD(xfs) Intel (xfs) Fusion I/O(xfs) • 10,000+ fsync/sec is fine in most cases • Fusion I/O was CPU bound (%system), not I/O bound (%iowait). Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 17
  • 18. HDD is fast for sequential writes / fsync • Best Practice: Writes can be boosted by using BBWC (Battery Backed up Write Cache), especially for REDO Logs (because it’s sequentially written) • No strong reason to use SSDs here seek & rotation time Write cache disk disk seek & rotation time Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 18
  • 19. Filesystem matters Random write iops (16KB Blocks) 20000 18000 16000 14000 12000 1 thread iops 10000 8000 16 thread 6000 4000 2000 0 Fusion(ext3) Fusion (xfs) Fusion (raw) Filesystem • On xfs, multiple threads can write to the same file if opened with O_DIRECT, but can not on ext* • Good concurrency on xfs, close to raw device • ext3 is less optimized for Fusion I/O Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 19
  • 20. Changing I/O unit size Read IOPS and I/O unit size (4 HDD RAID10) 2500 2000 1KB 1500 IOPS 4KB 1000 16KB 500 0 1 2 3 4 5 6 8 10 15 20 30 40 50 100 200 concurrency • On HDD, maximum 22% performance difference was found between 1KB and 16KB • No big difference when concurrency < 10 Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 20
  • 21. Changing I/O unit size on SSD Read IOPS and I/O unit size (Fusion I/O) 200000 150000 1KB IOPS 100000 4KB 16KB 50000 0 1 2 3 4 5 6 8 10 15 20 30 40 50 100 200 concurrency • Huge difference • On SSDs, not only IOPS, but also I/O transfer size matters • It’s worth considering that Storage Engines support “configurable block size” functionality Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 21
  • 22. Let’s start MySQL benchmarking • Base: Disk-bound application (DBT-2) running on: – Sun Fire X4270 – Nehalem 8 Core – 4 HDD – RAID1+0, Write Cache with Battery • What will happen if … – Replacing HDD with Intel SSD (SATA) – Replacing HDD with Fusion I/O (PCI-E) – Moving log files and ibdata to HDD – Not using Nehalem – Using two Fusion I/O drives with Software RAID1 – Deploying DRBD protocol B or C • Replacing 1GbE with 10GbE – Using MySQL 5.5.4 Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 22
  • 23. DBT-2 condition • SuSE Enterprise Linux 11, xfs • MySQL 5.5.2M2 (InnoDB Plugin 1.0.6) • 200 Warehouses (20GB – 25GB hot data) • Buffer pool size – 1GB – 2GB – 5GB – 30GB (large enough to cache all data) • 1000 seconds warm up time • Running 3600 seconds (1 hour) • Fusion I/O: 96GB data space, 64GB reserved space Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 23
  • 24. HDD vs Intel SSD HDD Intel Buffer pool 1G 1125.44 5709.06 (NOTPM: Transactions per minute) • Storing all data on HDD or Intel SSD • Massive disk i/o happens – Random reads for all accesses – Random writes for updating rows and indexes – Sequential writes for REDO log files, etc • SSD is very good at these kinds of workloads • 5.5 times performance improvement, without any application change! Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 24
  • 25. HDD vs Intel SSD vs Fusion I/O HDD Intel Fusion I/O Buffer pool 1G 1125.44 5709.06 15122.75 • Fusion I/O is a PCI-E based SSD • PCI-E is much faster than SAS/SATA • 14x improvement compared to 4HDDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 25
  • 26. Which should we spend money, RAM or SSD? HDD Intel Fusion I/O Buffer pool 1G 1125.44 5709.06 15122.75 Buffer pool 2G 1863.19 Buffer pool 5G 4385.18 Buffer pool 30G 36784.76 (Caching all hot data) • Increasing RAM (buffer pool size) reduces random disk reads – Because more data are cached in the buffer pool • If all data are cached, only disk writes (both random and sequential) happen • Disk writes happen asynchronously, so application queries can be much faster • Large enough RAM + HDD outperforms too small RAM + SSD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 26
  • 27. Which should we spend money, RAM or SSD? HDD Intel Fusion I/O Buffer pool 1G 1125.44 5709.06 15122.75 Buffer pool 2G 1863.19 7536.55 20096.33 Buffer pool 5G 4385.18 12892.56 30846.34 Buffer pool 30G 36784.76 - 57441.64 (Caching all hot data) • It is not always possible to cache all hot data • Fusion I/O + good amount of memory (5GB) was pretty good • Basic rule can be: – If you can cache all active data, large enough RAM + HDD – If you can’t, or if you need extremely high throughput, spending on both RAM and SSD Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 27
  • 28. Let’s think about MySQL file location • SSD is extremely good at random reads • SSD is very good at random writes • HDD is good enough at sequential reads/writes • No strong reason to use SSD for sequential writes • Random I/O oriented: – Data Files (*.ibd) • Sequential reads if doing full table scan – Undo Log, Insert Buffer (ibdata) • UNDO tablespace (small in most cases, except for running long-running batch) • On-disk insert buffer space (small in most cases, except that InnoDB can not catch up with updating indexes) • Sequential Write oriented: – Doublewrite Buffer (ibdata) • Write volume is equal to *ibd files. Huge – Binary log (mysql-bin.XXXXXX) – Redo log (ib_logfile) – Backup files Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 28
  • 29. Moving sequentially written files into HDD Fusion I/O Fusion I/O + HDD Up Buffer pool 1G 15122.75 19295.94 +28% (us=25%, wa=15%) (us=32%, wa=10%) Buffer pool 2G 20096.33 25627.49 +28% (us=30%, wa=12.5%) (us=36%, wa=8%) Buffer pool 5G 30846.34 39435.25 +28% (us=39%, wa=10%) (us=49%, wa=6%) Buffer pool 30G 57441.64 66053.68 +15% (us=70%, wa=3.5%) (us=77%, wa=1%) • Moving ibdata, ib_logfile, (+binary logs) into HDD • High impact on performance – Write volume to SSD becomes half because doublewrite area is allocated in HDD – %iowait was significantly reduced – You can delay write performance deterioration Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 29
  • 30. Does CPU matter? Nehalem Older Xeon CPUs Memory CPUs QPI: 25.6GB/s FSB: 10.6GB/s North Bridge North Bridge (IOH) Memory (MCH) PCI-Express PCI-Express • Nehalem has two big advantages 1. Memory is directly attached to CPU : Faster for in-memory workloads 2. Interface speed between CPU and North Bridge is 2.5x higher, and interface traffics do not conflict with CPU<->Memory workloads : Faster for disk i/o workloads when using PCI-Express SSDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 30
  • 31. Harpertown X5470 (older Xeon) vs Nehalem X5570 (HDD) HDD Harpertown X5470, Nehalem(X5570, Up 3.33GHz 2.93GHz) Buffer pool 1G 1135.37 (us=1%) 1125.44 (us=1%) -1% Buffer pool 2G 1922.23 (us=2%) 1863.19 (us=2%) -3% Buffer pool 5G 4176.51 (us=7%) 4385.18(us=7%) +5% Buffer pool 30G 30903.4 (us=40%) 36784.76 (us=40%) +19% us: userland CPU utilization • CPU difference matters on CPU bound workloads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 31
  • 32. Harpertown X5470 vs Nehalem X5570 (Fusion) Fusion I/O+HDD Harportown X5470, Nehalem(X5570, Up 3.33GHz 2.93GHz) Buffer pool 1G 13534.06 (user=35%) 19295.94 (user=32%) +43% Buffer pool 2G 19026.64 (user=40%) 25627.49 (user=37%) +35% Buffer pool 5G 30058.48 (user=50%) 39435.25 (user=50%) +31% Buffer pool 30G 52582.71 (user=76%) 66053.68 (user=76%) +26% • TPM difference was much higher than HDD • For disk i/o bound workloads (buffer pool 1G/2G), CPU utilizations on Nehalem were smaller, but TPM were much higher – Verified that Nehalem is much more efficient for PCI-E workloads • Benefit from high interface speed between CPU and PCI-Express • Fusion I/O fits with Nehalem much better than with traditional CPUs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 32
  • 33. We need to think about redundancy overhead • Single server + No RAID is meaningless in the real database world • Redundancy – RAID 1 / 5 / 10 – Network mirroring (DRBD) – Replication (Sync / Async) • Relative overhead for redundancy will be (much) higher than on traditional HDD environment Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 33
  • 34. Fusion I/O + Software RAID1 • Fusion I/O itself has RAID5 feature – Writing parity bits into Flash Memory – Flash Chips are not Single Point of Failure – Controller / PCI-E Board is Single Point of Failure • Right now no H/W RAID controller is provided for PCI-E SSDs • Using Software RAID1 (or RAID10) – Two Fusion I/O drives in the same machine Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 34
  • 35. Understanding how software RAID1 works H/W RAID1 App/DB S/W RAID1 App/DB Writing to files Writing to files on /dev/sdX Response Response on /dev/md0 Write cache with battery Software RAID daemon RAID controller “md0_raid1” process Background writes Writing to disks (in parallel) (in parallel) Disk1 Disk2 Disk1 Disk2 • Response time on Software RAID1 is max(time-to-write-to-disk1, time-to-write-to-disk2) • If either of the two takes time for ERASE, response time will be longer • On faster storages / faster writes (i.e. sequential write + fsync), relative overheads of the software raid process are higher Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 35
  • 36. Random Write IOPS, S/W RAID1 vs No-RAID Random Write IOPS (Fusion I/O 160GB SLC, 16KB I/O unit, XFS) 50000 45000 40000 35000 No-RAID (120G) 30000 IOPS S/W RAID1 (120G) 25000 No-RAID (96G) 20000 15000 S/W RAID1 (96G) 10000 5000 0 1 61 121 181 241 301 361 421 481 Running time (minutes) • 120GB data space = 40GB additional reserved space • 96GB data space = 64GB additional reserved space • On S/W RAID1, IOPS deteriorated more quickly than on No-RAID • On S/W RAID1 with 96GB data space, the slowest line was smaller than No-RAID • 20-25% performance drop can be expected on disk write bound workloads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 36
  • 37. What about Reads? Read IOPS (16KB Blocks) 80000 70000 60000 50000 IOPS No-RAID 40000 S/W RAID1 30000 20000 10000 0 1 2 3 4 5 6 8 10 15 20 30 40 50 100 200 concurrency • Theoretically reads IOPS can be twice by RAID1 • Peak IOPS was 43636 on No-RAID, 75627 on RAID, 73% up • Good scalability Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 37
  • 38. DBT-2, No-RAID vs S/W RAID on Fusion I/O Fusion I/O+HDD RAID 1 Fusion %iowait Down I/O+HDD Buffer pool 1G 19295.94 15468.81 10% -19.8% Buffer pool 2G 25627.49 21405.23 8% -16.5% Buffer pool 5G 39435.25 35086.21 6-7% -11.0% Buffer pool 30G 66053.68 66426.52 0-1% +0.56% Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 38
  • 39. Intel SSDs with a traditional H/W raid controller Single raw Intel Four RAID5 Intel Down Buffer pool 1G 5709.06 2975.04 -48% Buffer pool 2G 7536.55 4763.60 -37% Buffer pool 5G 12892.56 11739.27 -9% • Raw SSD drives performed much better than using a traditional H/W raid controller – Even on RAID10 performance was worse than single raw drive – H/W Raid controller seemed serious bottleneck – Make sure SSD drives have write cache and capacitor itself (Intel X25- V/M/E doesn’t have capacitor) • Use JBOD + write cache + capacitor • Research appliances such as Schooner, Gear6, etc • Wait until H/W vendors release great H/R raid controllers that work well with SSDs Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 39
  • 40. What about DRBD? • Single server is not Highly Available – Mother Board/RAID Controller/etc are Single Point of Failure • Heartbeat + DRBD + MySQL is one of the most common HA (Active/Passive) solutions • Network might be a bottleneck – 1GbE -> 10GbE, InfiniBand, Dolphin Interconnect, etc • Replication level – Protocol A (async) – Protocol B (sync to remote drbd receiver process) – Protocol C (sync to remote disk) • Network channel is single threaded – Storing all data under /data (single DRBD partition) => single thread – Storing log/ibdata under /hdd, *ibd under /ssd => two threads Copyright 2010 Sun Microsystems inc The World’s Most Popular Open Source Database 40
  • 41. DRBD Overheads on HDD

       HDD                 No DRBD     DRBD Protocol B, 1GbE   DRBD Protocol B, 10GbE
       Buffer pool 1G      1125.44     1080.8                  1101.63
       Buffer pool 2G      1863.19     1824.75                 1811.95
       Buffer pool 5G      4385.18     4285.22                 4326.22
       Buffer pool 30G     36784.76    32862.81                35689.67

       • DRBD 8.3.7
       • DRBD overhead (protocol B) was not big on disk i/o bound workloads
       • Network bandwidth difference was not big on disk i/o bound workloads
  • 42. DRBD Overheads on Fusion I/O

       Fusion I/O+HDD      No DRBD     DRBD Protocol B, 1GbE   Down     DRBD Protocol B, 10GbE   Down
       Buffer pool 1G      19295.94    5976.18                 -69.0%   12107.88                 -37.3%
       Buffer pool 2G      25627.49    8100.5                  -68.4%   16776.19                 -34.5%
       Buffer pool 5G      39435.25    16073.9                 -59.2%   30288.63                 -23.2%
       Buffer pool 30G     66053.68    37974                   -42.5%   62024.68                 -6.1%

       • DRBD overhead was not negligible
       • 10GbE performed much better than 1GbE
       • Still 6-10 times faster than HDD
       • Note: DRBD supports faster interfaces such as InfiniBand SDP and Dolphin Interconnect
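The "Down" columns are plain relative changes against the No DRBD baseline. A short sketch that recomputes them from the throughput figures in the table (the helper name `percent_change` is ours):

```python
# Throughput figures copied from the table above, keyed by buffer pool size.
no_drbd = {"1G": 19295.94, "2G": 25627.49, "5G": 39435.25, "30G": 66053.68}
drbd_1gbe = {"1G": 5976.18, "2G": 8100.5, "5G": 16073.9, "30G": 37974}
drbd_10gbe = {"1G": 12107.88, "2G": 16776.19, "5G": 30288.63, "30G": 62024.68}


def percent_change(base: float, value: float) -> float:
    """Relative change of value against base, in percent."""
    return (value - base) / base * 100.0


down = {
    pool: (
        round(percent_change(base, drbd_1gbe[pool]), 1),
        round(percent_change(base, drbd_10gbe[pool]), 1),
    )
    for pool, base in no_drbd.items()
}

for pool, (one_gbe, ten_gbe) in down.items():
    print(f"Buffer pool {pool}: 1GbE {one_gbe:+.1f}%, 10GbE {ten_gbe:+.1f}%")
```

The overhead shrinks as the buffer pool grows because fewer operations reach the replicated device at all.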
  • 43. Misc topic: Insert performance on InnoDB vs MyISAM (HDD)
       [Figure: Time to insert 1 million records (HDD); y-axis: Seconds, x-axis: Existing records (millions); series: innodb, myisam; annotation: 250 rows/s]
       • MyISAM doesn’t do any special i/o optimization like “Insert Buffering”, so a lot of random reads/writes happen, and performance depends heavily on the OS
       • Disk seek & rotation overhead is really serious on HDD
  • 44. Note: Insert Buffering (InnoDB feature)
       • If non-unique, secondary index blocks are not in memory, InnoDB inserts entries to a special buffer (“insert buffer”) to avoid random disk i/o operations
              – Insert buffer is allocated on both memory and innodb SYSTEM tablespace
       • Periodically, the insert buffer is merged into the secondary index trees in the database (“merge”)
       [Diagram: Insert buffer -> Optimized i/o]
       • Pros: Reducing I/O overhead
              – Reducing the number of disk i/o operations by merging i/o requests to the same block
              – Some random i/o operations can be sequential
       • Cons: Additional operations are added. Merging might take a very long time
              – when many secondary indexes must be updated and many rows have been inserted
              – it may continue to happen after a server shutdown and restart
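The benefit of merging i/o requests to the same block can be illustrated with a toy model. This is not how InnoDB implements the insert buffer; it only counts block writes under two assumptions that are ours: each index block holds a fixed range of `keys_per_block` keys, and an unbuffered insert costs one block i/o while a merge costs one i/o per distinct block touched.

```python
from collections import defaultdict


def count_block_ios(keys, keys_per_block: int, buffered: bool) -> int:
    """Count index block i/o operations for a batch of inserts (toy model)."""
    if not buffered:
        # Every insert touches its index block immediately.
        return len(keys)

    # Buffered: group pending entries by target block, then merge once per block.
    pending = defaultdict(list)
    for key in keys:
        pending[key // keys_per_block].append(key)
    return len(pending)


inserts = [1, 2, 3, 101, 102, 205]   # hypothetical secondary index keys
print(count_block_ios(inserts, keys_per_block=100, buffered=False))
print(count_block_ios(inserts, keys_per_block=100, buffered=True))
```

In this example six inserts land in three distinct blocks, so buffering halves the number of block writes. The saving grows with the number of inserts that accumulate per block before the merge, at the cost of the merge work itself.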
  • 45. Insert performance: InnoDB vs MyISAM (SSD)
       [Figure: Time to insert 1 million records (SSD); y-axis: Seconds, x-axis: Existing records (millions); series: InnoDB, MyISAM; annotations: 2,000 rows/s, 5,000 rows/s, “Index size exceeded buffer pool size”, “Filesystem cache was fully used, disk reads began”]
       • MyISAM got much faster by just replacing HDD with SSD!
  • 46. Try MySQL 5.5.4!

       Fusion I/O + HDD    MySQL5.5.2   MySQL5.5.4   Up
       Buffer pool 1G      19295.94     24019.32     +24%
       Buffer pool 2G      25627.49     32325.76     +26%
       Buffer pool 5G      39435.25     47296.12     +20%
       Buffer pool 30G     66053.68     67253.45     +1.8%

       • Got 20-26% improvements for disk i/o bound workloads on Fusion I/O
              – Both CPU %user and %iowait were improved
                     • %user: 36% (5.5.2) to 44% (5.5.4) when buf pool = 2g
                     • %iowait: 8% (5.5.2) to 5.5% (5.5.4) when buf pool = 2g, but iops was 20% higher
              – Could handle a lot more concurrent i/o requests in 5.5.4!
              – No big difference was found on 4 HDDs
       • Works very well on faster storage such as Fusion I/O or lots of disks
  • 47. Conclusion for choosing H/W
       • Disks
              – PCI-E SSDs (e.g. Fusion I/O) perform very well
              – SAS/SATA SSDs (e.g. Intel X25)
              – Carefully research the RAID controller. Many controllers do not scale with SSD drives
              – Keep enough reserved space if you need to handle massive write traffic
              – HDD is good at sequential writes
       • Use a fast network adapter
              – 1GbE will be saturated on DRBD
              – 10GbE or InfiniBand
       • Use Nehalem CPU
              – Especially when using PCI-Express SSDs
  • 48. Conclusion for database deployments
       • Put sequentially written files on HDD
              – ibdata, ib_logfile, binary log files
              – HDD is fast enough for sequential writes
              – Write performance deterioration can be mitigated
              – Life expectancy of SSD will be longer
       • Put randomly accessed files on SSD
              – *ibd files, index files (MYI), data files (MYD)
              – SSD is 10x-100x faster for random reads than HDD
       • Archive less active tables/records to HDD
              – SSD is still much more expensive than HDD
       • Use InnoDB Plugin
              – Higher scalability & concurrency matters on faster storage
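The placement rule above can be written down as a small lookup by file name. This is a sketch for illustration only: the function `choose_device` is hypothetical, and the `mysql-bin.*` pattern assumes the binary logs use that base name, which depends on the server's log-bin setting.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Sequentially written files: keep on HDD.
HDD_PATTERNS = ("ibdata*", "ib_logfile*", "mysql-bin.*")  # binlog name is an assumption
# Randomly accessed files: keep on SSD.
SSD_PATTERNS = ("*.ibd", "*.MYI", "*.MYD")


def choose_device(path: str) -> str:
    """Suggest 'hdd' or 'ssd' for a MySQL file based on its name."""
    name = PurePosixPath(path).name
    if any(fnmatchcase(name, pattern) for pattern in HDD_PATTERNS):
        return "hdd"
    if any(fnmatchcase(name, pattern) for pattern in SSD_PATTERNS):
        return "ssd"
    return "unclassified"


for example in ("/data/ibdata1", "/data/ib_logfile0", "/data/shop/orders.ibd",
                "/data/shop/items.MYI", "/data/mysql-bin.000012"):
    print(example, "->", choose_device(example))
```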
  • 49. What will happen in the real database world?
       • These are just my thoughts
       • Less demand for NoSQL
              – Isn’t it enough for many applications just to replace HDD with Fusion I/O?
              – Functionality will become relatively more important
       • Stronger demand for Virtualization
              – A single server will have enough capacity to run two or more mysqld instances
       • I/O volume matters
              – Not just IOPS
              – Block size, disabling doublewrite, etc
       • Concurrency matters
              – Single SSD scales as well as 8-16 HDDs
              – Concurrent ALTER TABLE, parallel query
  • 50. Special Thanks To
       • Koji Watanabe – Fusion I/O Japan
       • Hideki Endo – Sumisho Computer Systems, Japan
              – Rented me two Fusion I/O 160GB SLC drives
       • Daisuke Homma, Masashi Hasegawa – Sun Japan
              – Did benchmarks together
  • 51. Thanks for attending!
       • Contact:
              – E-mail: Yoshinori.Matsunobu@sun.com
              – Blog: http://yoshinorimatsunobu.blogspot.com
              – @matsunobu on Twitter