# Times and TransposeTimes

CNTK matrix product.

```
A * B
Times (A, B, outputRank=1)
TransposeTimes (A, B, outputRank=1)
```
## Parameters

* `A`: first argument of the matrix product. Can be a time sequence.
* `B`: second argument of the matrix product. Can be a time sequence.
* `outputRank` (default: 1): number of axes of `A` that constitute the output dimension. See 'Extended interpretation for tensors' below.
## Return Value

Resulting matrix product (tensor). This is a time sequence if either input was a time sequence.
## Description

The `Times()` function implements the matrix product, with extensions for tensors. The `*` operator is a short-hand for it. `TransposeTimes()` transposes the first argument.

If `A` and `B` are matrices (rank-2 tensors) or column vectors (rank-1 tensors), `A * B` computes the common matrix product, just as one would expect.

`TransposeTimes (A, B)` computes the matrix product `A^T * B`, where `^T` denotes transposition. `TransposeTimes (A, B)` has the same result as `Transpose (A) * B`, but is more efficient, as it avoids a temporary copy of the transposed version of `A`.
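As a minimal sketch (dimensions are made up for illustration), both formulations below yield the same value, but the first avoids the temporary copy:

```
# illustrative dimensions only
A = ParameterTensor {(20:10)}   # [20 x 10]
B = ParameterTensor {(20:30)}   # [20 x 30]
p1 = TransposeTimes (A, B)      # [10 x 30], computed without materializing A^T
p2 = Transpose (A) * B          # same value, but creates a temporary transposed copy of A
```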
### Time sequences

Both `A` and `B` can be either single matrices or time sequences. A common case for recurrent networks is that `A` is a weight matrix, while `B` is a sequence of inputs.

Note: If `A` is a time sequence, the operation is not efficient, as it will launch a separate GEMM invocation for every time step. The exception is `TransposeTimes()` where both inputs are column vectors, for which a special optimization exists.
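A minimal sketch of the common case (names and dimensions are made up for illustration):

```
# illustrative names and dimensions
x = Input (128)                  # a time sequence of vectors of dimension 128
W = ParameterTensor {(256:128)}  # single weight matrix, [256 x 128]
h = W * x                        # applies W to every time step of x; h is a sequence of [256] vectors
```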
### Sparse support

`Times()` and `TransposeTimes()` support sparse matrices. The result is a dense matrix unless both inputs are sparse. The two most important use cases are:

* `B` being a one-hot representation of an input word (or, more commonly, an entire sequence of one-hot vectors). Then, `A * B` denotes a word embedding, where the columns of `A` are the embedding vectors of the words. The following is the recommended way of realizing embeddings in CNTK:

  ```
  Embedding (x, dim) = Parameter (dim, 0/*inferred*/) * x
  e = Embedding (input, 300)
  ```

* `A` being a one-hot representation of a label word. The popular cross-entropy criterion and the error counter can be written using `TransposeTimes()` as follows, respectively, where `z` is the input to the top-level `Softmax()` classifier, and `L` is the label sequence, which may be sparse:

  ```
  CrossEntropyWithSoftmax (L, z) = ReduceLogSum (z) - TransposeTimes (L, z)
  ErrorPrediction (L, z) = BS.Constants.One - TransposeTimes (L, Hardmax (z))
  ```
### Multiplying with a scalar

The matrix product cannot be used to multiply a matrix with a scalar; you will get an error regarding mismatching dimensions. To multiply with a scalar, use the element-wise product `.*` instead. For example, the weighted average of two matrices can be written like this:

```
z = Constant (alpha) .* x + Constant (1-alpha) .* y
```
### Multiplying with a diagonal matrix

If your input matrix is diagonal and stored as a vector, do not use `Times()` but an element-wise multiplication (`ElementTimes()` or the `.*` operator). For example:

```
dMat = ParameterTensor {(100:1)}
z = dMat .* v
```

This leverages broadcasting semantics to multiply every element of `v` with the respective row of `dMat`.
### Extended interpretation of matrix product for tensors of rank > 2

If `A` and/or `B` are tensors of higher rank, the `*` operation denotes a generalized matrix product where all but the first dimension of `A` must match the leading dimensions of `B`, and are interpreted by flattening. For example, a product of an `[I x J x K]` tensor and a `[J x K x L]` tensor (which we will abbreviate henceforth as `[I x J x K] * [J x K x L]`) gets reinterpreted by reshaping the two tensors as matrices as `[I x (J * K)] * [(J * K) x L]`, for which the matrix product is defined and yields a result of dimension `[I x L]`. This makes sense if one considers the rows of a weight matrix to be patterns that activation vectors are matched against. The above generalization allows these patterns themselves to be multi-dimensional, such as images or running windows of speech features.
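To make the flattening concrete, a sketch with made-up dimensions:

```
# illustrative dimensions: [4 x 5 x 6] * [5 x 6 x 7]
A = ParameterTensor {(4:5:6)}
B = ParameterTensor {(5:6:7)}
C = A * B   # reinterpreted as [4 x 30] * [30 x 7]; C has dimension [4 x 7]
```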
It is also possible to have more than one non-matched dimension in `B`. For example, `[I x J] * [J x K x L]` is interpreted as the matrix product `[I x J] * [J x (K * L)]`, which thereby yields a result of dimensions `[I x K x L]`. For example, this allows applying a matrix to all vectors inside a rolling window of `L` speech features of dimension `J`.
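Again as a minimal sketch with made-up dimensions:

```
# illustrative dimensions: [4 x 5] * [5 x 6 x 7]
A = ParameterTensor {(4:5)}
B = ParameterTensor {(5:6:7)}
C = A * B   # interpreted as [4 x 5] * [5 x 42]; C has dimensions [4 x 6 x 7]
```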
If the result of the product should have multiple dimensions (such as arranging a layer's activations as a 2D field), then instead of using the `*` operator, one must write `Times (A, B, outputRank=m)`, where `m` is the number of dimensions in which the 'patterns' are arranged and which are kept in the output. For example, `Times (tensor of dim [I x J x K], tensor of dim [K x L], outputRank=2)` will be interpreted as the matrix product `[(I * J) x K] * [K x L]` and yield a result of dimensions `[I x J x L]`.
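A corresponding sketch, again with made-up dimensions:

```
# illustrative dimensions: outputRank=2 keeps the first two axes of A in the output
A = ParameterTensor {(4:5:6)}
B = ParameterTensor {(6:7)}
C = Times (A, B, outputRank=2)   # [(4*5) x 6] * [6 x 7]; C has dimensions [4 x 5 x 7]
```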