Tuning ZFS for Database Products

Review the following considerations when using ZFS with a database product.

If the database uses a fixed disk block or record size for I/O, set the ZFS recordsize property to match it. You can do this on a per-file system basis, even though multiple file systems might share a single pool.
With ZFS's copy-on-write design, tuning down the recordsize is a way to improve OLTP performance at the expense of batch reporting queries.
ZFS checksums every block stored on disk. This alleviates the need for the database layer to checksum data an additional time. If checksums are computed by ZFS instead of at the database layer, any discrepancy can be caught and fixed before the data is returned to the application.
UFS direct I/O is used to overcome some of the design deficiencies of UFS and to eliminate double buffering of data. In ZFS, the UFS design deficiencies do not exist and ZFS uses the primarycache and secondarycache properties to manage buffering data in the ARC. Note that using the secondarycache (L2ARC) property to improve random reads also requires the primarycache property to be enabled.
Keep pool space under 90% utilization to maintain pool performance.

Tuning ZFS for an Oracle Database

ZFS is recommended for any Oracle database version in single instance mode. ZFS can be used with an Oracle RAC database when it is available as a NFS-shared file system.

Review the following recommendations below for tuning ZFS for an Oracle database:

Verify that you are running the latest Solaris release

Start with the latest Solaris 10 or Solaris 11 release, with the Solaris 10 9/10 release as a minimum starting point.
Create LUNs for your ZFS storage pools, if needed

Use your storage array tools to create LUNs that will be presented to the ZFS storage pool. Or, consider using whole disks for your mirrored ZFS storage pools. For more information, see Chapter 3, Managing Oracle Solaris ZFS Storage Pools, in Oracle Solaris 11.1 Administration: ZFS File Systems.

Create a storage pool for data files for tables, index, undo and temp data

Consider creating a mirrored storage pool to provide a higher level of data redundancy. For example:

# zpool status dbpool
  pool: dbpool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        dbpool                     ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t5000C500335F95E3d0  ONLINE       0     0     0
            c0t5000C500335F907Fd0  ONLINE       0     0     0
          mirror-1                 ONLINE       0     0     0
            c0t5000C500335BD117d0  ONLINE       0     0     0
            c0t5000C500335DC60Fd0  ONLINE       0     0     0

errors: No known data errors

For databases with high redo log activity, such as a typical OLTP database with many commits, use a separate LUN for a separate log device.

Create a storage pool for the archivelog

If available, a system's internal disk can handle this type of load. The archivelog file system can also be a file system in the dbpool.
```
# zpool create archivepool c0t5000C500335E106Bd0
```

Create the ZFS file systems and set the specific file system properties by using the following guidelines.

Create separate file systems for redo, archive, undo, and temp database components using the default record size of 128 KB. The general rule is to set the file system recordsize = db_block_size for the file systems that contains Oracle data files. For table data and index components, create a file system with an 8 KB record size. Also consider providing metadata caching hints for your database file systems by using the primarycache property. For more information about ZFS file system properties, see Introducing ZFS Properties in Oracle Solaris 11.1 Administration: ZFS File Systems.

Create file systems for the table data files and index data files with an 8 KB recordsize. Use the default value for primarycache.

# zfs create -o recordsize=8k -o mountpoint=/my_db_path/index dbpool/index
# zfs set logbias=throughput dbpool/index 
# zfs get primarycache,recordsize,logbias dbpool/index
NAME            PROPERTY      VALUE         SOURCE
dbpool/index    primarycache  all           default
dbpool/index    recordsize    8K            local
dbpool/index    logbias       throughput    local

Create file systems for temporary and undo table spaces, using the default recordsize and primarycache values.

# zfs create -o mountpoint=/my_db_path/temp dbpool/temp
# zfs set logbias=throughput dbpool/temp
# zfs create -o mountpoint=/my_db_path/undo dbpool/undo
# zfs set logbias=throughput dbpool/undo

Create a storage pool for redo logs with a separate log device. For databases with high redo log activity, such as a typical OLTP database with many commits, use a separate log device LUN.

Partition the disk into two slices, a small slice, s0, in the 64 to 150 MB range, for the separate log device. The s1 slice contains the remaining disk space for the redo log.

# zpool create redopool c0t50015179594B6F11d0s1 log c0t50015179594B6F11d0s0
# zpool status redopool
  pool: redopool
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        redopool                   ONLINE       0     0     0
          c0t50015179594B6F11d0s1  ONLINE       0     0     0
        logs
          c0t50015179594B6F11d0s0  ONLINE       0     0     0

errors: No known data errors

Create a file system for redo logs in the redo pool. Use default file system values for recordsize and primarycache.
```
# zfs create -o mountpoint=/my_db_path/redo redopool/redo
# zfs set logbias=latency redopool/redo
```

Create a file system for archivelog files in the archive pool, enabling compression, use the default value for recordsize and set primarycache to metadata.

# zfs create -o compression=on -o primarycache=metadata -o mountpoint=
/my_db_admin_path/archive archivepool/archive
# zfs get primarycache,recordsize,compressratio,compression,available,
used,quota archivepool/archive
NAME                 PROPERTY       VALUE         SOURCE
archivepool/archive  primarycache   metadata      local
archivepool/archive  recordsize     128K          default
archivepool/archive  compressratio  1.32x         -
archivepool/archive  compression    on            local
archivepool/archive  available      40.0G         -
archivepool/archive  used           10.0G         -
archivepool/archive  quota          50G           local

Consider tuning storage array I/O queues (for systems with HDS or EMC storage arrays)

ZFS aggregates read and write I/O and manages the priority of I/O before sending it to the driver level, which handles the device. The zfs_vdev_max_pending parameter defines the maximum number of I/Os that ZFS sends to any storage pool device.

In a legacy storage environment, the ssd_max_throttle and sd_max_throttle parameters define the maximum number of concurrent I/Os that the driver can send to the storage. By setting the zfs_vdev_max_pending default value equal to the value of the [s]sd_max_throttle parameter, we prevent ZFS from queuing I/O to yet another unnecessary SD layer.

If you have ssd:ssd_max_throttle or sd:sd_max_throttle in the /etc/system file in your existing environment, then set zfs:zfs_vdev_max_pending at the same value. For example, if the storage array administrator asked for the following setting:
```
set ssd:ssd_max_throttle=20 
```
Then, also set this parameter as follows:
```
set ssd:ssd_max_throttle=20 
set zfs:zfs_vdev_max_pending=20 
```
Setting this parameter allows ZFS to control each LUN queue. This means that the total number of pending I/Os in the storage can grow as follows:
```
number of LUNs * ZFS_VDEV_MAX_PENDING
```
Allocate sufficient memory and swap resources

You can reduce ZFS memory consumption by tuning the zfs_arc_max parameter to a low value, but we still recommend provisioning enough memory to cache metadata for the actively used portion of the database, which is estimated at 1.5% with an 8 KB ZFS record size and proportionately less or more with larger or smaller records. The file system that hold index files is the one that has the largest benefit from file system caching because it is the last one to invalidate in case of lack of memory. The zfs_arc_max parameter is in bytes and accepts decimal or hexadecimal values. The following example sets this parameter to 2 GB:
```
set zfs:zfs_arc_max=2147483648 
or 
set zfs:zfs_arc_max=0x80000000
```
To prevent applications from failing due to lack of memory, you must configure some amount of swap space. The amount of swap equivalent to all of system memory is always enough for this purpose. This swap space is not expected to be used, but is needed as a reservation area. For information about increasing swap space, see Managing Your ZFS Swap and Dump Devices in Oracle Solaris 11.1 Administration: ZFS File Systems.
Additional Oracle database configuration recommendations
- Configuring Your Oracle Database on ZFS File Systems in the following white paper:
  
  http://www.oracle.com/technetwork/server-storage/solaris/config-solaris-zfs-wp-167894.pdf
- Dynamic SGA Tuning of Oracle Database on Oracle Solaris with DISM white paper:
  
  http://www.oracle.com/technetwork/articles/systems-hardware-architecture/using-dynamic-intimate-memory-sparc-168402.pdf
- Oracle 11g Installation Guides
  - Oracle Database Quick Installation Guide 11g Release 2 (11.2) for Oracle Solaris on SPARC (64-Bit)
    
    http://docs.oracle.com/cd/E11882_01/install.112/e24349/toc.htm
  - Oracle Database Quick Installation Guide 11g Release 2 (11.2) for Oracle Solaris on x86-64 (64-Bit)
    
    http://docs.oracle.com/cd/E11882_01/install.112/e24351/toc.htm

Using ZFS with MySQL Considerations

Review the following considerations when using ZFS with MySQL.

ZFS recordsize

Match the ZFS recordsize property to the storage engine block size for better OLTP performance.
InnoDB
- With a known application memory footprint, such as for a database application, you might cap the ARC size so that the application will not need to reclaim its necessary memory from the ZFS cache.
Create a separate pool for the logs.
Set a different path for data and log in the my.cnf file.
Set the ZFS recordsize property to 16K for the InnoDB data files, and use the default recordsize value for InnoDB logs, prior to creating data files.

Skip Navigation Links
Exit Print View
	Oracle Solaris 11.1 Tunable Parameters Reference Manual Oracle Solaris 11.1 Information Library