You could use vector intrinsics to make it faster:
Code:
__m128i broadcast = _mm_set1_epi32(source[ii]); /* same value in all four 32-bit lanes */
surface[ii] = surface1[ii] = surface2[ii] = surface3[ii] = broadcast;
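For a fuller picture of the intrinsics approach, here is a minimal self-contained sketch. It assumes 32-bit pixels, a 4x enlargement in both directions, and an x86 target with SSE2. The function name `upscale4x_sse2` and the `dst` / `stride` parameters are made up for illustration and are not from your code.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_set1_epi32, _mm_storeu_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expands each of the n source pixels into a 4x4 block.
   dst must hold 4 rows of `stride` pixels each, with stride >= 4 * n. */
static void upscale4x_sse2(uint32_t *dst, size_t stride,
                           const uint32_t *source, size_t n)
{
    assert(stride >= 4 * n);
    for (size_t ii = 0; ii < n; ++ii) {
        /* Copy the pixel into all four 32-bit lanes. The cast to int only
           reinterprets the bits on gcc and clang. */
        __m128i broadcast = _mm_set1_epi32((int)source[ii]);
        for (size_t row = 0; row < 4; ++row) {
            /* Unaligned store, so dst needs no 16-byte alignment. */
            _mm_storeu_si128((__m128i *)(dst + row * stride + 4 * ii),
                             broadcast);
        }
    }
}
```

If your rows are known to be 16-byte aligned, `_mm_store_si128` could be used instead of the unaligned store. Whether that makes a measurable difference depends on the CPU.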
If you don't want to use intrinsics, it is hard to get compilers to fully optimize code like this.
Neither gcc nor clang will apply SIMD optimizations to your code.
If you switch around the column / row ordering like this, at least clang will use SSE or AVX; gcc still won't.
Code:
surface_00[0] = surface_00[1] = surface_00[2] = surface_00[3] =
surface_08[0] = surface_08[1] = surface_08[2] = surface_08[3] =
surface_16[0] = surface_16[1] = surface_16[2] = surface_16[3] =
surface_24[0] = surface_24[1] = surface_24[2] = surface_24[3] = value;
clang will then generate assembly like this:
Code:
vbroadcastss xmm0, dword ptr [rdx + 4*rsi]
inc rsi
vmovups xmmword ptr [rcx + 4*r8], xmm0
vmovups xmmword ptr [rcx + 4*rdi], xmm0
vmovups xmmword ptr [rcx + 4*rax], xmm0
vmovups xmmword ptr [rcx], xmm0
Whether keeping a separate pointer for each line is a good idea depends heavily on the surrounding code, architecture and compiler. As a rule of thumb, however, compilers usually optimize copying code in a loop better if you use indexing and row-major ordering, e.g.
Code:
for (int i = 0; i < n; ++i) { surface[i * 4 + w * 0 + 0] = surface[i * 4 + w * 0 + 1] = .... = source_image[i]; }
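Written out in full, the indexed loop above might look like the sketch below. It assumes 32-bit pixels, that `w` is the width of `surface` in pixels, and that each source pixel becomes a 4x4 block. The function name `upscale4x_indexed` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes each of the n source pixels into a 4x4 block.
   surface must hold 4 rows of w pixels each, with w >= 4 * n. */
static void upscale4x_indexed(uint32_t *surface, size_t w,
                              const uint32_t *source_image, size_t n)
{
    assert(w >= 4 * n);
    for (size_t i = 0; i < n; ++i) {
        const uint32_t value = source_image[i];
        /* Row-major: the inner loop walks 4 adjacent pixels of one row. */
        for (size_t row = 0; row < 4; ++row) {
            for (size_t col = 0; col < 4; ++col) {
                surface[i * 4 + w * row + col] = value;
            }
        }
    }
}
```

The two inner loops have fixed trip counts, so the compiler is free to unroll them completely. Whether it then emits vector stores still depends on the compiler version and flags, so check the generated assembly.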