# Times and TransposeTimes

CNTK matrix product.

```
A * B
Times (A, B, outputRank=1)
TransposeTimes (A, B, outputRank=1)
```
## Parameters

* `A`: first argument of the matrix product. Can be a time sequence.
* `B`: second argument of the matrix product. Can be a time sequence.
* `outputRank` (default: 1): number of axes of `A` that constitute the output dimension. See 'Extended interpretation for tensors' below.
## Return Value

Resulting matrix product (tensor). This is a time sequence if either input was a time sequence.
## Description

The `Times()` function implements the matrix product, with extensions for tensors. The `*` operator is a short-hand for it. `TransposeTimes()` transposes the first argument.

If `A` and `B` are matrices (rank-2 tensors) or column vectors (rank-1 tensors), `A * B` computes the common matrix product, just as one would expect.

`TransposeTimes (A, B)` computes the matrix product `A^T * B`, where `^T` denotes transposition. `TransposeTimes (A, B)` has the same result as `Transpose (A) * B`, but is more efficient, as it avoids a temporary copy of the transposed version of `A`.
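As a minimal sketch (dimensions are made up for illustration), both formulations below yield the same value, but the first avoids the temporary copy:

```
# illustrative dimensions only
A = ParameterTensor {(20:10)}   # [20 x 10]
B = ParameterTensor {(20:30)}   # [20 x 30]
p1 = TransposeTimes (A, B)      # [10 x 30], computed without materializing A^T
p2 = Transpose (A) * B          # same value, but creates a temporary transposed copy of A
```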
### Time sequences

Both `A` and `B` can be either single matrices or time sequences. A common case for recurrent networks is that `A` is a weight matrix, while `B` is a sequence of inputs.

Note: If `A` is a time sequence, the operation is not efficient, as it will launch a separate GEMM invocation for every time step. The exception is `TransposeTimes()` where both inputs are column vectors, for which a special optimization exists.
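A minimal sketch of the common case (names and dimensions are made up for illustration):

```
# illustrative names and dimensions
x = Input (128)                  # a time sequence of vectors of dimension 128
W = ParameterTensor {(256:128)}  # single weight matrix, [256 x 128]
h = W * x                        # applies W to every time step of x; h is a sequence of [256] vectors
```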
### Sparse support

`Times()` and `TransposeTimes()` support sparse matrices. The result is a dense matrix unless both inputs are sparse. The two most important use cases are:

* `B` being a one-hot representation of an input word (or, more commonly, an entire sequence of one-hot vectors). Then, `A * B` denotes a word embedding, where the columns of `A` are the embedding vectors of the words. The following is the recommended way of realizing embeddings in CNTK:

  ```
  Embedding (x, dim) = Parameter (dim, 0/*inferred*/) * x
  e = Embedding (input, 300)
  ```

* `A` being a one-hot representation of a label word. The popular cross-entropy criterion and the error counter can be written using `TransposeTimes()` as follows, respectively, where `z` is the input to the top-level `Softmax()` classifier, and `L` is the label sequence, which may be sparse:

  ```
  CrossEntropyWithSoftmax (L, z) = ReduceLogSum (z) - TransposeTimes (L, z)
  ErrorPrediction (L, z) = BS.Constants.One - TransposeTimes (L, Hardmax (z))
  ```
### Multiplying with a scalar

The matrix product cannot be used to multiply a matrix with a scalar; you will get an error regarding mismatching dimensions. To multiply with a scalar, use the element-wise product `.*` instead. For example, the weighted average of two matrices can be written like this:

```
z = Constant (alpha) .* x + Constant (1-alpha) .* y
```
### Multiplying with a diagonal matrix

If your input matrix is diagonal and stored as a vector, do not use `Times()` but an element-wise multiplication (`ElementTimes()` or the `.*` operator). For example:

```
dMat = ParameterTensor {(100:1)}
z = dMat .* v
```

This leverages broadcasting semantics to multiply every element of `v` with the respective row of `dMat`.
### Extended interpretation of matrix product for tensors of rank > 2

If `A` and/or `B` are tensors of higher rank, the `*` operation denotes a generalized matrix product where all but the first dimension of `A` must match the leading dimensions of `B`, and are interpreted by flattening. For example, a product of an `[I x J x K]` tensor and a `[J x K x L]` tensor (which we will abbreviate henceforth as `[I x J x K] * [J x K x L]`) gets reinterpreted by reshaping the two tensors as matrices as `[I x (J * K)] * [(J * K) x L]`, for which the matrix product is defined and yields a result of dimension `[I x L]`. This makes sense if one considers the rows of a weight matrix to be patterns that activation vectors are matched against. The above generalization allows these patterns themselves to be multi-dimensional, such as images or running windows of speech features.
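To make the flattening concrete, a sketch with made-up dimensions:

```
# illustrative dimensions: [4 x 5 x 6] * [5 x 6 x 7]
A = ParameterTensor {(4:5:6)}
B = ParameterTensor {(5:6:7)}
C = A * B   # reinterpreted as [4 x 30] * [30 x 7]; C has dimension [4 x 7]
```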
It is also possible to have more than one non-matched dimension in `B`. For example, `[I x J] * [J x K x L]` is interpreted as the matrix product `[I x J] * [J x (K * L)]`, which thereby yields a result of dimensions `[I x K x L]`. For example, this allows applying a matrix to all vectors inside a rolling window of `L` speech features of dimension `J`.
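Again as a minimal sketch with made-up dimensions:

```
# illustrative dimensions: [4 x 5] * [5 x 6 x 7]
A = ParameterTensor {(4:5)}
B = ParameterTensor {(5:6:7)}
C = A * B   # interpreted as [4 x 5] * [5 x 42]; C has dimensions [4 x 6 x 7]
```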
If the result of the product should have multiple dimensions (such as arranging a layer's activations as a 2D field), then instead of using the `*` operator, one must write `Times (A, B, outputRank=m)`, where `m` is the number of dimensions in which the 'patterns' are arranged and which are kept in the output. For example, `Times (tensor of dim [I x J x K], tensor of dim [K x L], outputRank=2)` will be interpreted as the matrix product `[(I * J) x K] * [K x L]` and yield a result of dimensions `[I x J x L]`.
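A corresponding sketch, again with made-up dimensions:

```
# illustrative dimensions: outputRank=2 keeps the first two axes of A in the output
A = ParameterTensor {(4:5:6)}
B = ParameterTensor {(6:7)}
C = Times (A, B, outputRank=2)   # [(4*5) x 6] * [6 x 7]; C has dimensions [4 x 5 x 7]
```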