whydunit?
As part of an attempt to try inferencing machine learning models with my own code, knowing how to use cblas_sgemm can be very useful as it can easily be used to simulate many different ML operators. For example, a layer of neurons, the “GEMM” node itself, Convolutional neural network, and LSTM. The reason this story only focuses on row order is because it is naturally what I use in C.
wheredunit?
Most of my projects are on Windows, and I use an Intel CPU. So the simplest way for me to get a cblas_sgemm implementation is using OneMKL. Simply download it with NuGet(intelmkl.dev), and it can come with other functions like DFT to fulfill other needed dependencies for me to run my model as a part of my entire solution.
Data presentation
I prefer to present data as follow:
float a[10] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f, 8.f,9.f,10.f};
and the dimension of the array as
int a_dim[2] = {2, 5};
For dimension[2,5]. It should look like this in numpy terms.
[[1,2,3,4,5]
[6,7,8,9,10]]
I will use “a_dim[0]” to denote the first dimension of the matrix “a” and “a_dim[1]” to denote the second dimension of the matrix “a”. Much easier for my programmer brain to process.
howdunit?
While I can always refer to Intel documentation for the function, I just wasn't able to understand what the documentation means very fast. What even is m, n, k? Are the numbers before transpose or after? And there is just oh so many lines, that I lost track of the whole picture.
To correctly use cblas_sgemm, 6 parameters have to be correct, Namely, M, N, K, lda, ldb, ldc. Let’s start from the back with the lds.
lda, ldb, ldc
According to the intel documentation, it says for a non-transposed matrix A with dimension M by K, lda is K, and for a transposed matrix A the lda is M.
The confusing part for me is that there are also M, N, and K in the documentation, is this M the same M as prior? A good story writer should never use the same name for different people in the same story unless that is a major plot of the story, and I can’t understand why the documentation has to do this.
Why not just simply say that all the lds are simply the second dimension of the matrix? That is to say, lda is a_dim[1], ldb is b_dim[1], and ldc is c_dim[1], no matter it is transposed or not.
M, N, K
These parameters will be affected by whether the matrices are transposed or not.
According to the documentation:
M is the row of matrix op(A) and C.
N is the Column of matrix op(B) and C.
K is the column of matrix op(A) and row of matrix op(B).
The op() means whether the matrix has been transposed or not, and the matrix dimension after transpose has to be used.
Thus, M is a_dim[0] if matrix A is not transposed, it is a_dim[1] otherwise.
Next, N is b_dim[1] if matrix B is not transposed, it is b_dim[0] otherwise.
Finally, K s a_dim[1] if matrix A is not transposed, it is a_dim[0] otherwise.
To prove this is correct, I wrote some code that will produce the same results in Python and C .
import numpy as np
a = np.arange(15).reshape(3, 5).astype(np.float32)
b = np.arange(35).reshape(5, 7).astype(np.float32)
c = np.matmul(a,b)
print(c)
a = np.arange(15).reshape(5, 3).astype(np.float32)
a = np.transpose(a)
b = np.arange(35).reshape(5, 7).astype(np.float32)
c = np.matmul(a,b)
print(c)
a = np.arange(15).reshape(5, 3).astype(np.float32)
b = np.arange(21).reshape(7, 3).astype(np.float32)
b = np.transpose(b)
c = np.matmul(a,b)
print(c)
a = np.arange(24).reshape(8, 3).astype(np.float32)
a = np.transpose(a)
b = np.arange(48).reshape(6, 8).astype(np.float32)
b = np.transpose(b)
c = np.matmul(a,b)
print(c)
The output of this script
[[ 210. 220. 230. 240. 250. 260. 270.]
[ 560. 595. 630. 665. 700. 735. 770.]
[ 910. 970. 1030. 1090. 1150. 1210. 1270.]]
[[ 630. 660. 690. 720. 750. 780. 810.]
[ 700. 735. 770. 805. 840. 875. 910.]
[ 770. 810. 850. 890. 930. 970. 1010.]]
[[ 5. 14. 23. 32. 41. 50. 59.]
[ 14. 50. 86. 122. 158. 194. 230.]
[ 23. 86. 149. 212. 275. 338. 401.]
[ 32. 122. 212. 302. 392. 482. 572.]
[ 41. 158. 275. 392. 509. 626. 743.]]
[[ 420. 1092. 1764. 2436. 3108. 3780.]
[ 448. 1184. 1920. 2656. 3392. 4128.]
[ 476. 1276. 2076. 2876. 3676. 4476.]]
For C.
//Demo for using cblas_sgemm
#include "stdio.h"
#include "string.h"
#include "mkl.h"
/// <summary>
/// cblass_sgemm helper function. Dimensions should have a size of 2.
/// </summary>
void gemmf(float* A, float* B, float* C, int* a_dim, int* b_dim, int* c_dim, int transA, int transB) {
int m = 0, n = 0, k = 0;
int transpose_A = CblasNoTrans, transpose_B = CblasNoTrans;
int lda = a_dim[1], ldb = b_dim[1], ldc = c_dim[1];
if (transA) {
transpose_A = CblasTrans;
m = a_dim[1];
k = a_dim[0];
}
else {
m = a_dim[0];
k = a_dim[1];
}
if (transB) {
transpose_B = CblasTrans;
n = b_dim[0];
}
else {
n = b_dim[1];
}
cblas_sgemm(CblasRowMajor, transpose_A, transpose_B, m, n, k, 1.0f, A, lda, B, ldb, 1.0f, C, ldc);
}
int main() {
float A[100] = { 0 }, B[100] = { 0 }, C[100] = { 0 };
int a_dim[2] = { 0 }, b_dim[2] = { 0 }, c_dim[2] = { 0 };
// Set array with numbers
for (int i = 0; i < 100; i++) {
A[i] = i;
}
for (int i = 0; i < 100; i++) {
B[i] = i;
}
// A(3,5)* B(5,7) = C(3,7)
a_dim[0] = 3; a_dim[1] = 5;
b_dim[0] = 5; b_dim[1] = 7;
c_dim[0] = 3; c_dim[1] = 7;
gemmf(A, B, C, a_dim, b_dim, c_dim, 0, 0);
// print out result
for (int i = 0; i < c_dim[0] * c_dim[1]; i++) {
printf("%f ", C[i]);
}
printf("\n\n");
memset(C, 0, 100 * sizeof(float));
// 'A(5,3)* B(5,7) = C(3,7)
a_dim[0] = 5; a_dim[1] = 3;
b_dim[0] = 5; b_dim[1] = 7;
c_dim[0] = 3; c_dim[1] = 7;
gemmf(A, B, C, a_dim, b_dim, c_dim, 1, 0);
for (int i = 0; i < c_dim[0] * c_dim[1]; i++) {
printf("%f ", C[i]);
}
printf("\n\n");
memset(C, 0, 100 * sizeof(float));
// A(5,3)* 'B(7,3) = C(5,7)
a_dim[0] = 5; a_dim[1] = 3;
b_dim[0] = 7; b_dim[1] = 3;
c_dim[0] = 5; c_dim[1] = 7;
gemmf(A, B, C, a_dim, b_dim, c_dim, 0, 1);
for (int i = 0; i < c_dim[0] * c_dim[1]; i++) {
printf("%f ", C[i]);
}
printf("\n\n");
memset(C, 0, 100 * sizeof(float));
// 'A(8,3)* 'B(6,8) = C(3,6)
a_dim[0] = 8; a_dim[1] = 3;
b_dim[0] = 6; b_dim[1] = 8;
c_dim[0] = 3; c_dim[1] = 6;
gemmf(A, B, C, a_dim, b_dim, c_dim, 1, 1);
for (int i = 0; i< c_dim[0] * c_dim[1]; i++) {
printf("%f ", C[i]);
}
printf("\n\n");
memset(C, 0, 100 * sizeof(float));
}
The output is:
210.000000 220.000000 230.000000 240.000000 250.000000 260.000000 270.000000 560.000000 595.000000 630.000000 665.000000 700.000000 735.000000 770.000000 910.000000 970.000000 1030.000000 1090.000000 1150.000000 1210.000000 1270.000000
630.000000 660.000000 690.000000 720.000000 750.000000 780.000000 810.000000 700.000000 735.000000 770.000000 805.000000 840.000000 875.000000 910.000000 770.000000 810.000000 850.000000 890.000000 930.000000 970.000000 1010.000000
5.000000 14.000000 23.000000 32.000000 41.000000 50.000000 59.000000 14.000000 50.000000 86.000000 122.000000 158.000000 194.000000 230.000000 23.000000 86.000000 149.000000 212.000000 275.000000 338.000000 401.000000 32.000000 122.000000 212.000000 302.000000 392.000000 482.000000 572.000000 41.000000 158.000000 275.000000 392.000000 509.000000 626.000000 743.000000
420.000000 1092.000000 1764.000000 2436.000000 3108.000000 3780.000000 448.000000 1184.000000 1920.000000 2656.000000 3392.000000 4128.000000 476.000000 1276.000000 2076.000000 2876.000000 3676.000000 4476.000000
Clearly, the results are the same, and I will not be bothered with this problem again.