SIMD Optimization Case Study

Row–column dot products computed in parallel lanes

A compact study of matrix multiplication in C++ with a focus on SIMD (AVX).
The page is visual-first and code-light; details are in the report.

Left: dot products executed in parallel lanes. Middle: 32-byte register pipeline. Right: compact die thumbnail.

AVX Core

```cpp
void* MultiplyWorker(void* arg) {
    ThreadData* data = (ThreadData*)arg;
    int start = data->start_row;
    int end   = data->end_row;

    for (int i = start; i < end; ++i) {
        for (int j = 0; j < N; ++j) {
            // Accumulate eight partial products per lane.
            __m256 sum = _mm256_setzero_ps();

            int k = 0;

            // Main loop: B_T is B transposed, so both operands are read
            // from contiguous memory in 32-byte (8-float) chunks.
            for (; k + 7 < N; k += 8) {
                __m256 a = _mm256_loadu_ps(&A[i * N + k]);
                __m256 b = _mm256_loadu_ps(&B_T[j * N + k]);
                sum = _mm256_add_ps(sum, _mm256_mul_ps(a, b));
            }

            // Horizontal reduction: spill the vector and sum its lanes.
            float temp[8];
            _mm256_storeu_ps(temp, sum);

            float total = 0.0f;
            for (int t = 0; t < 8; ++t) {
                total += temp[t];
            }

            // Scalar tail for N not divisible by 8.
            for (; k < N; ++k) {
                total += A[i * N + k] * B_T[j * N + k];
            }
            C[i * N + j] = total;
        }
    }

    return nullptr;  // returning from the worker is equivalent to pthread_exit(nullptr)
}
```

Build

```bash
g++ -O2 -mavx -pthread main.cpp -o program
```