Advanced Data Deduplication settings

02/18/2022

This document describes how to modify advanced Data Deduplication settings. For recommended workloads, the default settings should be sufficient. The main reason to modify these settings is to improve Data Deduplication's performance with other kinds of workloads.

Modifying Data Deduplication job schedules

The default Data Deduplication job schedules are designed to work well for recommended workloads and be as non-intrusive as possible (excluding the Priority Optimization job that is enabled for the Backup usage type). When workloads have large resource requirements, it is possible to ensure that jobs run only during idle hours, or to reduce or increase the amount of system resources that a Data Deduplication job is allowed to consume.

Changing a Data Deduplication schedule

Data Deduplication jobs are scheduled via Windows Task Scheduler and can be viewed and edited there under the path Microsoft\Windows\Deduplication. Data Deduplication includes several cmdlets that make scheduling easy.

Get-DedupSchedule shows the current scheduled jobs.
New-DedupSchedule creates a new scheduled job.
Set-DedupSchedule modifies an existing scheduled job.
Remove-DedupSchedule removes a scheduled job.
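
As a quick orientation before changing anything, the following sketch lists the existing schedules through the Data Deduplication cmdlets and through Task Scheduler. It assumes the Data Deduplication feature is installed and the session is elevated. `Get-ScheduledTask` comes from the general ScheduledTasks module and is not specific to Data Deduplication.

```powershell
# List every Data Deduplication schedule with all of its properties.
Get-DedupSchedule | Format-List *

# Narrow the view to one job type, for example Optimization.
Get-DedupSchedule -Type Optimization

# The same jobs as seen by Task Scheduler, under the path mentioned above.
Get-ScheduledTask -TaskPath '\Microsoft\Windows\Deduplication\'
```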

The most common reason for changing when Data Deduplication jobs run is to ensure that jobs run during off hours. The following step-by-step example shows how to modify the Data Deduplication schedule for a sunny day scenario: a hyper-converged Hyper-V host that is idle on weekends and after 7:00 PM on weeknights. To change the schedule, run the following PowerShell cmdlets in an Administrator context.

Disable the scheduled hourly Optimization jobs.

 Set-DedupSchedule -Name BackgroundOptimization -Enabled $false
 Set-DedupSchedule -Name PriorityOptimization -Enabled $false

Remove the currently scheduled Garbage Collection and Integrity Scrubbing jobs.

 Get-DedupSchedule -Type GarbageCollection | ForEach-Object { Remove-DedupSchedule -InputObject $_ }
 Get-DedupSchedule -Type Scrubbing | ForEach-Object { Remove-DedupSchedule -InputObject $_ }

Create a nightly Optimization job that runs at 7:00 PM with high priority and all the CPUs and memory available on the system.
```
 New-DedupSchedule -Name "NightlyOptimization" -Type Optimization -DurationHours 11 -Memory 100 -Cores 100 -Priority High -Days @(1,2,3,4,5) -Start (Get-Date "2016-08-08 19:00:00")
```
Note

The date part of the System.DateTime provided to -Start is irrelevant (as long as it's in the past), but the time part specifies when the job should start.

Create a weekly Garbage Collection job that runs on Saturday starting at 7:00 AM with high priority and all the CPUs and memory available on the system.

 New-DedupSchedule -Name "WeeklyGarbageCollection" -Type GarbageCollection -DurationHours 23 -Memory 100 -Cores 100 -Priority High -Days @(6) -Start (Get-Date "2016-08-13 07:00:00")

Create a weekly Integrity Scrubbing job that runs on Sunday starting at 7:00 AM with high priority and all the CPUs and memory available on the system.

 New-DedupSchedule -Name "WeeklyIntegrityScrubbing" -Type Scrubbing -DurationHours 23 -Memory 100 -Cores 100 -Priority High -Days @(0) -Start (Get-Date "2016-08-14 07:00:00")
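
Once the steps above have run, it may be worth confirming the result. The following is a minimal check, assuming the three schedule names used in this example; the commented-out lines sketch how the Optimization change could be undone.

```powershell
# Show the three schedules created above.
Get-DedupSchedule -Name "NightlyOptimization", "WeeklyGarbageCollection", "WeeklyIntegrityScrubbing"

# Confirm that the hourly Optimization jobs are still present but disabled.
Get-DedupSchedule -Type Optimization | Format-List *

# To undo the Optimization change, re-enable the default job and remove the custom one.
# Set-DedupSchedule -Name BackgroundOptimization -Enabled $true
# Remove-DedupSchedule -Name "NightlyOptimization"
```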

Available job-wide settings

You can toggle the following settings for new or scheduled Data Deduplication jobs:

Parameter name	Definition	Accepted values	Why would you want to set this value?
Type	The type of the job that should be scheduled	Optimization, GarbageCollection, Scrubbing	This value is required because it is the type of job that you want to schedule. This value cannot be changed after the task has been scheduled.
Priority	The system priority of the scheduled job	High, Normal, Low	This value helps the system determine how to allocate CPU time. High uses more CPU time; Low uses less.
Days	The days that the job is scheduled	An array of integers 0-6 representing the days of the week: 0 = Sunday, 1 = Monday, 2 = Tuesday, 3 = Wednesday, 4 = Thursday, 5 = Friday, 6 = Saturday	Scheduled tasks have to run on at least one day.
Cores	The percentage of cores on the system that a job should use	Integers 0-100 (indicates a percentage)	To control what level of impact a job will have on the compute resources on the system.
DurationHours	The maximum number of hours a job should be allowed to run	Positive integers	To prevent a job from running into a workload's non-idle hours.
Enabled	Whether the job will run	True/false	To disable a job without removing it
Full	For scheduling a full Garbage Collection job	Switch (true/false)	By default, every fourth job is a full Garbage Collection job. With this switch, you can schedule full Garbage Collection to run more frequently.
InputOutputThrottle	Specifies the amount of input/output throttling applied to the job	Integers 0-100 (indicates a percentage)	Throttling ensures that jobs don't interfere with other I/O-intensive processes.
Memory	The percentage of memory on the system that a job should use	Integers 0-100 (indicates a percentage)	To control what level of impact the job will have on the memory resources of the system
Name	The name of the scheduled job	String	A job must have a uniquely identifiable name.
ReadOnly	Indicates that the scrubbing job processes and reports on corruptions that it finds, but does not run any repair actions	Switch (true/false)	You want to manually restore files that sit on bad sections of the disk.
Start	Specifies the time a job should start	`System.DateTime`	The date part of the `System.DateTime` provided to Start is irrelevant (as long as it's in the past), but the time part specifies when the job should start.
StopWhenSystemBusy	Specifies whether Data Deduplication should stop if the system is busy	Switch (True/False)	This switch lets you control the behavior of Data Deduplication, which is especially important if you want to run Data Deduplication while your workload is not idle.
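
To illustrate how several of these parameters combine, here is a sketch of a low-impact daytime Optimization job. The job name `DaytimeOptimization` and the specific percentages are hypothetical, chosen only for illustration. The start time is computed as yesterday at 9:00 AM so that the date part is always in the past.

```powershell
# Yesterday at 09:00: only the time of day matters, but the date must be in the past.
$start = (Get-Date).Date.AddDays(-1).AddHours(9)

# Hypothetical low-impact job: weekdays, at most 8 hours, a quarter of the
# cores and memory, low priority, and it stops when the system is busy.
$jobSettings = @{
    Name               = "DaytimeOptimization"
    Type               = "Optimization"
    Days               = @(1, 2, 3, 4, 5)
    Start              = $start
    DurationHours      = 8
    Cores              = 25
    Memory             = 25
    Priority           = "Low"
    StopWhenSystemBusy = $true
}
New-DedupSchedule @jobSettings

# Disable the job later without removing it.
Set-DedupSchedule -Name "DaytimeOptimization" -Enabled $false
```

Splatting the parameters from a hashtable keeps a long command readable; passing them directly on one line, as in the earlier examples, is equivalent.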

Modifying Data Deduplication volume-wide settings

Toggling volume settings

You can set the volume-wide default settings for Data Deduplication via the usage type that you select when you enable Data Deduplication for a volume. Data Deduplication includes cmdlets that make editing volume-wide settings easy:

Get-DedupVolume shows the current volume settings.
Set-DedupVolume modifies the volume settings.

The main reasons to modify the volume settings from the selected usage type are to improve read performance for specific files (such as multimedia or other file types that are already compressed) or to fine-tune Data Deduplication for better optimization for your specific workload. The following example shows how to modify the Data Deduplication volume settings for a workload that most closely resembles a general-purpose file server workload, but uses large files that change frequently.

See the current volume settings for Cluster Shared Volume 1.

 Get-DedupVolume -Volume C:\ClusterStorage\Volume1 | Select-Object *

Enable OptimizePartialFiles on Cluster Shared Volume 1 so that the MinimumFileAgeDays policy applies to sections of the file rather than the whole file. This ensures that the majority of the file gets optimized even though sections of the file change regularly.
```
 Set-DedupVolume -Volume C:\ClusterStorage\Volume1 -OptimizePartialFiles
```
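
A short follow-up sketch, assuming the same volume path as above: confirm that the setting took effect, and turn it off again if needed. Passing `:$false` to a switch is general PowerShell syntax rather than anything specific to Data Deduplication.

```powershell
# Check only the settings touched in this example.
Get-DedupVolume -Volume C:\ClusterStorage\Volume1 |
    Select-Object Volume, OptimizePartialFiles, MinimumFileAgeDays

# Revert the change if the workload does not benefit from it.
Set-DedupVolume -Volume C:\ClusterStorage\Volume1 -OptimizePartialFiles:$false
```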

Available volume-wide settings

Setting name	Definition	Accepted values	Why would you want to modify this value?
ChunkRedundancyThreshold	The number of times that a chunk is referenced before a chunk is duplicated into the hotspot section of the Chunk Store. The value of the hotspot section is that so-called "hot" chunks that are referenced frequently have multiple access paths to improve access time.	Positive integers	The main reason to modify this number is to increase the savings rate for volumes with high duplication. In general, the default value (100) is the recommended setting, and you shouldn't need to modify this.
ExcludeFileType	File types that are excluded from optimization	Array of file extensions	Some file types, particularly multimedia or files that are already compressed, do not benefit very much from being optimized. This setting allows you to configure which types are excluded.
ExcludeFolder	Specifies folder paths that should not be considered for optimization	Array of folder paths	If you want to improve performance or keep content in particular paths from being optimized, you can exclude certain paths on the volume from consideration for optimization.
InputOutputScale	Specifies the level of IO parallelization (IO queues) for Data Deduplication to use on a volume during a post-processing job	Positive integers ranging 1-36	The main reason to modify this value is to decrease the impact on the performance of a high IO workload by restricting the number of IO queues that Data Deduplication is allowed to use on a volume. Note that modifying this setting from the default may cause Data Deduplication's post-processing jobs to run slowly.
MinimumFileAgeDays	Number of days after the file is created before the file is considered to be in-policy for optimization.	Positive integers (inclusive of zero)	The Default and Hyper-V usage types set this value to 3 to maximize performance on hot or recently created files. You may want to modify this if you want Data Deduplication to be more aggressive or if you do not care about the extra latency associated with deduplication.
MinimumFileSize	Minimum file size that a file must have to be considered in-policy for optimization	Positive integers (bytes) greater than 32 KB	The main reason to change this value is to exclude small files that may have limited optimization value to conserve compute time.
NoCompress	Whether the chunks should be compressed before being put into the Chunk Store	True/False	Some types of files, particularly multimedia files and already compressed file types, may not compress well. This setting allows you to turn off compression for all files on the volume. This would be ideal if you are optimizing a dataset that has a lot of files that are already compressed.
NoCompressionFileType	File types whose chunks should not be compressed before going into the Chunk Store	Array of file extensions	Some types of files, particularly multimedia files and already compressed file types, may not compress well. This setting allows compression to be turned off for those files, saving CPU resources.
OptimizeInUseFiles	When enabled, files that have active handles against them will be considered as in-policy for optimization.	True/false	Enable this setting if your workload keeps files open for extended periods of time. If this setting is not enabled, a file would never get optimized if the workload has an open handle to it, even if it's only occasionally appending data at the end.
OptimizePartialFiles	When enabled, the MinimumFileAgeDays value applies to segments of a file rather than to the whole file.	True/false	Enable this setting if your workload works with large, often edited files where most of the file content is untouched. If this setting is not enabled, these files would never get optimized because they keep getting changed, even though most of the file content is ready to be optimized.
Verify	When enabled, if the hash of a chunk matches a chunk already in the Chunk Store, the chunks are compared byte-by-byte to ensure they are identical.	True/false	This is an integrity feature that guards against hash collisions, where two chunks of data that are actually different have the same hash and would otherwise be treated as identical. In practice, it is extremely improbable that this would ever happen. Enabling the verification feature adds significant overhead to the optimization job.
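
As an illustration of the table above, the following sketch tunes a volume that mostly holds large, already compressed files. The drive letter `E:` and the chosen values are hypothetical; the parameter names come from the table.

```powershell
# Hypothetical volume holding large, already compressed files.
$volume = "E:"

# Skip files under 64 KB and do not spend CPU time compressing chunks.
# (64KB is a PowerShell numeric literal equal to 65536 bytes.)
Set-DedupVolume -Volume $volume -MinimumFileSize 64KB -NoCompress $true

# Consider files for optimization one day after creation.
Set-DedupVolume -Volume $volume -MinimumFileAgeDays 1

# Review the result.
Get-DedupVolume -Volume $volume |
    Select-Object Volume, MinimumFileSize, MinimumFileAgeDays, NoCompress
```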

Modifying Data Deduplication system-wide settings

Data Deduplication has additional system-wide settings that can be configured via the registry. These settings apply to all of the jobs and volumes that run on the system. Take extra care whenever you edit the registry.

For example, you may want to disable full Garbage Collection. More information about why this may be useful for your scenario can be found in Frequently asked questions. To edit the registry with PowerShell:

If Data Deduplication is running in a cluster:

  Set-ItemProperty -Path HKLM:\System\CurrentControlSet\Services\ddpsvc\Settings -Name DeepGCInterval -Type DWord -Value 0xFFFFFFFF
  Set-ItemProperty -Path HKLM:\CLUSTER\Dedup -Name DeepGCInterval -Type DWord -Value 0xFFFFFFFF

If Data Deduplication is not running in a cluster:

  Set-ItemProperty -Path HKLM:\System\CurrentControlSet\Services\ddpsvc\Settings -Name DeepGCInterval -Type DWord -Value 0xFFFFFFFF
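
Before changing the value, it can help to check whether it is already set. This is a minimal sketch using the same registry path as above. `-ErrorAction SilentlyContinue` is used on the assumption that the value may not exist on a system where it has never been configured.

```powershell
$path = 'HKLM:\System\CurrentControlSet\Services\ddpsvc\Settings'

# Read the current DeepGCInterval, if one has been set.
$current = Get-ItemProperty -Path $path -Name DeepGCInterval -ErrorAction SilentlyContinue

if ($null -eq $current) {
    "DeepGCInterval is not set under $path"
} else {
    "DeepGCInterval = $($current.DeepGCInterval)"
}
```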

Available system-wide settings

Setting name	Definition	Accepted values	Why would you want to change this?
WlmMemoryOverPercentThreshold	This setting allows jobs to use more memory than Data Deduplication judges to actually be available. For example, a setting of 300 would mean that the job would have to use three times the assigned memory to get canceled.	Positive integers (a value of 300 means 300% or 3 times)	If you have another task that will stop if Data Deduplication takes more memory
DeepGCInterval	This setting configures the interval at which regular Garbage Collection jobs become full Garbage Collection jobs. A setting of n means that every nth job is a full Garbage Collection job. Note that full Garbage Collection is always disabled (regardless of the registry value) for volumes with the Backup Usage Type. `Start-DedupJob -Type GarbageCollection -Full` may be used if full Garbage Collection is desired on a Backup volume.	Integers (-1 indicates disabled)	See the frequently asked question "Why would I want to disable full Garbage Collection?"

Frequently asked questions

I changed a Data Deduplication setting, and now jobs are slow or don't finish, or my workload performance has decreased. Why?

These settings give you a lot of power to control how Data Deduplication runs. Use them responsibly, and monitor performance.

I want to run a Data Deduplication job right now, but I don't want to create a new schedule. Can I do this?

Yes, all jobs can be run manually.
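
For example, a manual Optimization job might be started as follows. This is a sketch: the volume `E:` is hypothetical, and the resource values are arbitrary.

```powershell
# Start an Optimization job on one volume right away, with explicit resource limits.
Start-DedupJob -Type Optimization -Volume "E:" -Memory 50 -Cores 50 -Priority Normal

# List queued and running jobs to follow progress.
Get-DedupJob

# Check the resulting savings once the job has finished.
Get-DedupStatus -Volume "E:"
```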

What is the difference between full and regular Garbage Collection?

There are two types of Garbage Collection:

Regular Garbage Collection uses a statistical algorithm to find large unreferenced chunks that meet certain criteria (low in memory and IOPS). Regular Garbage Collection compacts a chunk store container only if a minimum percentage of the chunks is unreferenced. This type of Garbage Collection runs much faster and uses fewer resources than full Garbage Collection. The default schedule of the regular Garbage Collection job is to run once a week.
Full Garbage Collection does a much more thorough job of finding unreferenced chunks and freeing more disk space. Full Garbage Collection compacts every container even if just a single chunk in the container is unreferenced. Full Garbage Collection will also free space that may have been in use if there was a crash or power failure during an Optimization job. Full Garbage Collection jobs will recover 100 percent of the available space that can be recovered on a deduplicated volume at the cost of requiring more time and system resources compared to a regular Garbage Collection job. The full Garbage Collection job will typically find and release up to 5 percent more of the unreferenced data than a regular Garbage Collection job. The default schedule of the full Garbage Collection job is to run every fourth time Garbage Collection is scheduled.

Why would I want to disable full Garbage Collection?

Garbage Collection could adversely affect the volume's lifetime shadow copies and the size of incremental backups. High-churn or I/O-intensive workloads may see a degradation in performance caused by full Garbage Collection jobs.
You can manually run a full Garbage Collection job from PowerShell to clean up leaks if you know your system crashed.
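
A sketch of such a manual run, using the `-Full` switch mentioned in the system-wide settings table; the volume `E:` is hypothetical.

```powershell
# Run a one-off full Garbage Collection job on a single volume.
Start-DedupJob -Type GarbageCollection -Full -Volume "E:"

# Monitor it until it completes.
Get-DedupJob -Type GarbageCollection
```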
