Configure persistent memory (PMEM) for SQL Server on Linux

Applies to: SQL Server - Linux

This article describes how to configure the persistent memory (PMEM) for SQL Server 2019 (15.x) and later versions on Linux.

Overview

SQL Server 2019 (15.x) introduced many in-memory features that use persistent memory. This article covers the steps required to configure persistent memory for SQL Server on Linux.

Note

The term enlightenment was introduced to convey the concept of working with a persistent memory aware file system. Direct access to the file system from user-space applications is facilitated using memory mapping (mmap()). When a memory mapping for a file is created the application can issue load/store instructions bypassing the I/O stack completely. This is considered an "enlightened" file access method from the perspective of the host extension application (which is the code that allows SQLPAL interact with the Windows or Linux OS).

Create namespaces for PMEM devices

Configure the devices

In Linux, use the ndctl utility.

  • Install ndctl to configure PMEM device. You can find it here.
  • Use ndctl to create a namespace. Namespaces are interleaved across PMEM NVDIMMs and can provide different types of user-space access to memory regions on the device. fsdax is default and desired mode for SQL Server.
ndctl create-namespace -f -e namespace0.0 --mode=fsdax --map=dev

We have chosen fsdax mode and are using system memory to store per-page metadata. We recommend using --map=dev. This option stores the metadata on the namespace directly. Storing metadata in memory using --map=mem is experimental at this time.

Use ndctl to verify the namespace.

Sample output follows:

# ndctl list -N
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":4294967296,
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

Create and mount PMEM device

For example, with XFS

mkfs.xfs -f /dev/pmem0
mount -o dax,noatime /dev/pmem0 /mnt/dax
xfs_io -c "extsize 2m" /mnt/dax

For example, with EXT4

mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0
mount -o dax,noatime /dev/pmem0 /mnt/dax

Technical considerations

  • Block allocation of 2 MB for either XFS/EXT4, as described previously
  • Misalignment between block allocation and mmap results in silent fallback to 4 KB
  • File sizes should be a multiple of 2 MB (modulo 2 MB)
  • Don't disable transparent huge pages (THP) (enabled by default on most distributions)

Once the device is configured with ndctl, created, and mounted, you can place database files in it or create a new database.

You can store the SQL Server data files (MDFS, NDFS) and tempdb files on a PMEM device when configured with the mode fsdax using the following command. Don't use this to store the SQL Server log (LDFS) files, as transaction log needs to be on storage that provides sector atomic guarantees:

ndctl create-namespace -f -e namespace0.0 --mode=fsdax --map=dev

Before you set the map option in the preceding command, keep the following points in mind:

  • For best performance at accessing and updating these NVDIMM page entries for this device, it's preferable to use -map=mem
  • If the capacity of the NVDIMM is too large (greater than 512 GB), set the –map=dev, which would impact the IO throughput and stymie the performance

For SQL Server log files on PMEM devices, con the PMEM device(s) to use sector/Block Translation Table (BTT). This provides the needed sector atomicity for SQL Server logs files for this technology of storage devices. We also recommend that you perform workload performance validations. You can compare the SQL Server log performance for your workload between this solution and best-in-class NVMe SSDs, and then select the solution that best meets your needs and provides better performance.

ndctl create-namespace -f -e namespace0.0 --mode= sector

Disable forced flush behavior

Because PMEM devices are O_DIRECT (direct I/O) safe, you can disable the forced flush behavior.

Note

A storage system can make sure that any cached or staged writes are considered safe and durable, by guaranteeing that writes issued to the device are kept on a medium that will persist across system crashes, interface resets and power failures, and the medium itself is hardware redundant.

  • Database (.mdf and .ndf) and transaction log (.ldf) files don't use writethrough and alternatewritethrough by default in SQL Server 2017 (14.x) CU 6 and later versions, because they use the forced flush behavior. Trace Flag 3979 disables the use of the forced flush behavior for database and transaction log files, and uses the writethrough and alternatewritethrough logic.

  • Other files that are opened by using FILE_FLAG_WRITE_THROUGH in SQL Server, such as database snapshots, internal snapshots for database consistency checks (DBCC CHECKDB), profiler trace files, and extended event trace files, use the writethrough and alternatewritethrough optimizations.

For more information about the changes introduced in SQL Server 2017 (14.x) CU 6, see KB 4131496. For more information about forced unit access (FUA) internals, see FUA internals.

SQL Server and Forced Unit Access (FUA) I/O subsystem capability

Certain versions of supported Linux distributions provide support for FUA I/O subsystem capability, which provides data durability. SQL Server uses the FUA capability to provide highly efficient and reliable I/O for SQL Server workloads. For more information on FUA support by Linux distribution and its effect on SQL Server, see SQL Server On Linux: Forced Unit Access (FUA) Internals.

SUSE Linux Enterprise Server 12 SP5, Red Hat Enterprise Linux 8.0, and Ubuntu 18.04 introduced support for FUA capability in the I/O subsystem. If you're using SQL Server 2017 (14.x) CU 6 and later versions, you should use following configuration for high performing and efficient I/O implementation with FUA by SQL Server.

Use this recommended configuration if the following conditions are met.

  • SQL Server 2017 (14.x) CU 6 and later versions

  • Linux distribution and version that supports FUA capability (starting with Red Hat Enterprise Linux 8.0, SUSE Linux Enterprise Server 12 SP5, or Ubuntu 18.04)

  • XFS file system for SQL Server storage

  • Storage subsystem and/or hardware that supports and is configured for FUA capability

Recommended configuration:

  1. Enable Trace Flag 3979 as a startup parameter.

  2. Use mssql-conf to configure control.writethrough = 1 and control.alternatewritethrough = 0.

For almost all other configuration that doesn't meet the previous conditions, the recommended configuration is as follows:

  1. Enable Trace Flag 3982 as a startup parameter (which is the default for SQL Server in the Linux ecosystem), and make sure that Trace Flag 3979 isn't enabled as a startup parameter.

  2. Use mssql-conf to configure control.writethrough = 1 and control.alternatewritethrough = 1.

FUA support for SQL Server containers deployed in Kubernetes

  1. The SQL Server must use persisted mounted storage, and not overlayfs.

  2. The storage must use the XFS filesystem and should support FUA. Before enabling this setting, you should work with your Linux distribution and storage vendor, to ensure that the OS and storage subsystem supports FUA options. On Kubernetes, you can query for the filesystem type using the following command, where <pvc-name> is your PersistentVolumeClaim:

    kubectl describe pv <pvc-name>
    

    In the output, look for the fstype that is set to XFS.

  3. The worker node hosting the SQL Server pods, should be using a Linux distribution and version that supports FUA capability (starting with Red Hat Enterprise Linux 8.0, SUSE Linux Enterprise Server 12 SP5, or Ubuntu 18.04).

If the above conditions are met, then you can use the following recommended FUA settings.

  1. Enable Trace Flag 3979 as a startup parameter.

  2. Use mssql-conf to configure control.writethrough = 1 and control.alternatewritethrough = 0.