Batch normalization
BatchNormalization(input, scale, bias, runMean, runVariance, spatial,
normalizationTimeConstant = 0,
blendTimeConstant = 0, epsilon = 0.00001, useCntkEngine = true)
Parameters
input: the input of the batch normalization node.
scale: a ParameterTensor{} that holds the learned componentwise-scaling factors (the gamma term in the equation below).
bias: a ParameterTensor{} that holds the learned bias (the beta term). scale and bias must have the same dimensions, which must equal the input dimensions in the case of spatial = false, or the number of output convolution feature maps in the case of spatial = true.
runMean: the running mean, which is used during the evaluation phase and may also be used during training. You must pass a ParameterTensor{} with the same dimensions as scale and bias, initial value 0, and learningRateMultiplier=0 set (see the example after this list).
runVariance: the running variance. It is represented the same way as runMean.
spatial: a flag that specifies whether to compute the mean/variance for each feature in a minibatch independently or, in the case of convolutional layers, per feature map.
normalizationTimeConstant (default 0): time constant for computing the running average of mean and variance as a low-pass-filtered version of the batch statistics. Note: the default is typically not what you want.
blendTimeConstant (default 0): allows smoothing batch estimates with the running statistics.
epsilon: a conditioner constant added to the variance when computing the inverse standard deviation.
useCntkEngine (default: true): set this to false to select the GPU-only cuDNN implementation.
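To make the parameter requirements concrete, here is a minimal BrainScript sketch of declaring these parameters and calling BatchNormalization. The names dim, x, y, runVar, the dimension 256, the scale initialization of 1, and the time constant of 5000 are illustrative placeholders, not values prescribed by CNTK:
# dim is a hypothetical feature dimension; x is assumed to be an existing node of that dimension
dim = 256
# learned scale (gamma) and bias (beta)
scale = ParameterTensor {dim, initValue = 1}
bias  = ParameterTensor {dim, initValue = 0}
# running statistics: same dimensions as scale/bias, initial value 0, learningRateMultiplier = 0
runMean = ParameterTensor {dim, initValue = 0, learningRateMultiplier = 0}
runVar  = ParameterTensor {dim, initValue = 0, learningRateMultiplier = 0}
# spatial is false for a fully-connected layer; the time constant is set explicitly (see below)
y = BatchNormalization (x, scale, bias, runMean, runVar, false, normalizationTimeConstant = 5000)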
Return value
The batch-normalized input.
Description
BatchNormalization
implements the technique described in the paper
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Sergey Ioffe, Christian Szegedy).
In short, it normalizes layer outputs for every minibatch, for each output (feature) independently, and applies an affine transformation to preserve the representation of the layer. That is, for the layer input:
m = mean (input)
var = variance (input)
inputNorm = (input - m)/sqrt (epsilon + var)
output = gamma * inputNorm + beta
where gamma
and beta
are trainable parameters.
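As a quick numeric illustration (not from the paper; it assumes the biased/population variance over the minibatch and ignores epsilon for readability): for a single feature whose values in the current minibatch are 1, 2, 3,
m = 2
var = 2/3
inputNorm ≈ (-1.22, 0, 1.22)
output ≈ gamma * (-1.22, 0, 1.22) + beta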
mean
and variance
are estimated from training data. In the simplest case, they are the mean and variance
of the current minibatch during training. In inference, a long-term estimate is used instead.
The long-term estimates are a low-pass-filtered version of the minibatch statistics, with the time constant
(in samples) given by the normalizationTimeConstant
parameter.
A value of 0
means there will be no exponential smoothing and running mean/variance
will always be equal to those of the last seen minibatch.
This is often undesirable.
Instead, it is recommended to use a value of a few thousand here.
The BatchNormalizationLayer{}
wrapper has a default of 5000.
For more information about time constants and exponential smoothing, see: https://en.wikipedia.org/wiki/Exponential_smoothing#Time_Constant
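Conceptually, the update behaves like standard exponential smoothing. The sketch below illustrates that idea only; it is not the literal CNTK update code, and the names N (number of samples in the minibatch), batchMean, batchVar, and smoothingWeight are assumptions made for this example:
smoothingWeight = exp (-N / normalizationTimeConstant)
runMean = smoothingWeight * runMean + (1 - smoothingWeight) * batchMean
runVariance = smoothingWeight * runVariance + (1 - smoothingWeight) * batchVar
In the limit of a time constant of 0 the smoothing weight goes to 0, which recovers the behavior described above (the running statistics simply track the last minibatch), while a time constant of a few thousand samples averages over many minibatches.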
Because minibatch statistics can be noisy,
CNTK also allows the use of a MAP (maximum-a-posteriori) estimate during training,
where the running long-term estimate is taken as the prior.
The weight of the prior is controlled by the blendTimeConstant
parameter.
However, this has not been found useful so far in our experiments.
Note that during inference, CNTK sets both time constants automatically such that only the existing running mean is used and it is not updated. No explicit action is needed by the user.
cuDNN implementation
By default, this function uses a CNTK implementation that works with both GPUs and CPUs. You can choose to use the cuDNN implementation instead, which is more performant. Note, however, that the cuDNN implementation does not support all options, and training requires a GPU (CNTK does provide a CPU emulation for inference even if you select the cuDNN implementation).
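For example, reusing the placeholder names from the sketch above, the cuDNN engine could be selected as follows (shown only to illustrate the flag):
# select the GPU-only cuDNN batch-normalization engine
y = BatchNormalization (x, scale, bias, runMean, runVar, false, normalizationTimeConstant = 5000, useCntkEngine = false)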