Example of CUDA Bayer or CFA demosaicing

I developed a Bayer demosaicing routine using CUDA 4, but on a GTS 250 (16 multiprocessors) it runs slower than a single-threaded CPU implementation. The block size is (16, 16) and the image dimensions are multiples of 16; changing these parameters doesn't seem to speed it up.

Am I making some obvious mistakes?

— Calling Routine —

uchar4 *d_output;
size_t num_bytes;

cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);

// Perform conversion and store result in PBO for display
kernel_wrapper(imageWidth, imageHeight, blockSize, gridSize, d_output);

cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);

— CUDA Kernel —

texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;

__global__ void demosaicKernel(uchar4 *d_output, uint width, uint height) {
    uint x = blockIdx.x * blockDim.x + threadIdx.x;
    uint y = blockIdx.y * blockDim.y + threadIdx.y;
    uint i = y * width + x;

    if (x < width && y < height) {
        if (y % 2) {
            if (x % 2) {
                // Odd row, odd column: green site; red left/right, blue above/below
                d_output[i].x = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y)) / 2;
                d_output[i].y = tex2D(tex, x, y);
                d_output[i].z = (tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 2;
            } else {
                // Odd row, even column: red site; green at the 4 neighbors, blue at the diagonals
                d_output[i].x = tex2D(tex, x, y);
                d_output[i].y = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y) + tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 4;
                d_output[i].z = (tex2D(tex, x + 1, y + 1) + tex2D(tex, x + 1, y - 1) + tex2D(tex, x - 1, y + 1) + tex2D(tex, x - 1, y - 1)) / 4;
            }
        } else {
            if (x % 2) {
                // Even row, odd column: blue site; red at the diagonals, green at the 4 neighbors
                d_output[i].x = (tex2D(tex, x + 1, y + 1) + tex2D(tex, x + 1, y - 1) + tex2D(tex, x - 1, y + 1) + tex2D(tex, x - 1, y - 1)) / 4;
                d_output[i].y = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y) + tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 4;
                d_output[i].z = tex2D(tex, x, y);
            } else {
                // Even row, even column: green site; red above/below, blue left/right
                d_output[i].x = (tex2D(tex, x, y + 1) + tex2D(tex, x, y - 1)) / 2;
                d_output[i].y = tex2D(tex, x, y);
                d_output[i].z = (tex2D(tex, x + 1, y) + tex2D(tex, x - 1, y)) / 2;
            }
        }
        d_output[i].w = 255;  // set alpha so the PBO displays as opaque pixels
    }
}

— Texture Initialization —

void initializeTexture(int width, int height, uchar *imageData) {
    // 8-bit single-channel format for the raw Bayer data
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaMallocArray(&d_imageArray, &channelDesc, width, height);
    size_t size = width * height * sizeof(uchar);
    cudaMemcpyToArray(d_imageArray, 0, 0, imageData, size, cudaMemcpyHostToDevice);
    free(imageData);  // host copy no longer needed once uploaded

    // Configure texture settings
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false;

    cudaBindTextureToArray(tex, d_imageArray);
}

Hey there, Ethan85! I’m really intrigued by your CUDA demosaicing project. Have you considered using shared memory in your kernel? It could be a game-changer for performance.
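
Something like this is what I have in mind — just a sketch, and it assumes the raw Bayer frame is also available as a plain global buffer (d_bayer below is hypothetical, one byte per pixel). Each block stages a 16x16 tile plus a 1-pixel apron in shared memory, then interpolates from the tile instead of issuing a texture fetch per tap:

#define TILE 16

__global__ void demosaicSharedSketch(const unsigned char *d_bayer,
                                     uchar4 *d_output, uint width, uint height) {
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    // Cooperative load: the 256 threads stride over the (TILE+2)^2 region,
    // clamping coordinates at the image border like cudaAddressModeClamp does.
    for (uint idx = threadIdx.y * TILE + threadIdx.x;
         idx < (TILE + 2) * (TILE + 2); idx += TILE * TILE) {
        int tx = idx % (TILE + 2);
        int ty = idx / (TILE + 2);
        int gx = min(max((int)(blockIdx.x * TILE) + tx - 1, 0), (int)width - 1);
        int gy = min(max((int)(blockIdx.y * TILE) + ty - 1, 0), (int)height - 1);
        tile[ty][tx] = d_bayer[gy * width + gx];
    }
    __syncthreads();

    uint x = blockIdx.x * TILE + threadIdx.x;
    uint y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    int lx = threadIdx.x + 1;  // +1 skips past the apron
    int ly = threadIdx.y + 1;

    // Only the "odd row, odd column" green case from your kernel is shown;
    // the other three sites read the tile the same way.
    if ((y % 2) && (x % 2)) {
        uchar4 p;
        p.x = (tile[ly][lx + 1] + tile[ly][lx - 1]) / 2;
        p.y = tile[ly][lx];
        p.z = (tile[ly + 1][lx] + tile[ly - 1][lx]) / 2;
        p.w = 255;
        d_output[y * width + x] = p;
    }
}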

I’m curious, have you tried profiling your kernel with the NVIDIA Visual Profiler? It’s a fantastic tool that might uncover some hidden bottlenecks. Sometimes it’s the little things like memory access patterns that can slow things down.
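
For a quick first number you don't even need the full profiler — CUDA events around the launch give you the raw kernel time. A sketch, reusing the names from your calling routine:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel_wrapper(imageWidth, imageHeight, blockSize, gridSize, d_output);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("demosaic kernel: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);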

Oh, and speaking of memory, how’s your data transfer between the host and device looking? I’ve found that sometimes what seems like slow computation is actually just data movement eating up time.
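
You could time the upload the same way to see where the milliseconds actually go — a sketch, reusing size, imageData, and d_imageArray from your initializeTexture(). Pinned (page-locked) host memory via cudaHostAlloc usually raises copy throughput too:

unsigned char *h_pinned;
cudaHostAlloc((void **)&h_pinned, size, cudaHostAllocDefault);
memcpy(h_pinned, imageData, size);

cudaEvent_t copyStart, copyStop;
cudaEventCreate(&copyStart);
cudaEventCreate(&copyStop);

cudaEventRecord(copyStart, 0);
cudaMemcpyToArray(d_imageArray, 0, 0, h_pinned, size, cudaMemcpyHostToDevice);
cudaEventRecord(copyStop, 0);
cudaEventSynchronize(copyStop);

float copyMs = 0.0f;
cudaEventElapsedTime(&copyMs, copyStart, copyStop);
printf("upload: %.3f ms\n", copyMs);

cudaFreeHost(h_pinned);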

By the way, what kind of optimizations have you applied to your CPU version? I’ve been surprised before by how competitive a well-optimized CPU implementation can be, especially with SIMD instructions.
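
For instance, SSE2 can do the two-tap (a + b) / 2 interpolations sixteen pixels at a time. A sketch of the idea — note that _mm_avg_epu8 rounds up, while your GPU code truncates:

#include <emmintrin.h>  // SSE2

// Averages 16 vertical-neighbor pairs per instruction; the same trick covers
// the two-tap interpolations in the demosaic ((above + below) / 2).
void avg_rows_sse2(const unsigned char *above, const unsigned char *below,
                   unsigned char *out, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(above + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(below + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_avg_epu8(a, b));
    }
    for (; i < n; ++i)  // scalar tail, same rounding as _mm_avg_epu8
        out[i] = (unsigned char)((above[i] + below[i] + 1) / 2);
}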

Would love to hear more about your approach and what you’ve tried so far. Maybe we can brainstorm some ideas to speed things up!

I’ve encountered similar issues with CUDA implementations being slower than expected. Have you considered using shared memory to optimize your kernel? This can significantly reduce global memory accesses and improve performance.

Also, make sure you’re profiling your kernel with tools like NVIDIA Visual Profiler. It can highlight bottlenecks you might not spot otherwise. In my experience, seemingly small issues like uncoalesced memory accesses or thread divergence can have a huge impact on performance.
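
On the divergence point specifically: because adjacent pixels in a warp sit on different Bayer sites, every warp in your kernel executes all four branches. One way around that, assuming the raw Bayer frame is also accessible as a plain global buffer (the in pointer below is hypothetical), is to let each thread produce a whole 2x2 quad so all threads run the same straight-line code. A sketch, using the GBRG layout implied by your kernel and skipping border quads for brevity:

__global__ void demosaicQuad(const unsigned char *in, uchar4 *out,
                             uint width, uint height) {
#define P(xx, yy) in[(yy) * width + (xx)]
    uint x = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    uint y = 2 * (blockIdx.y * blockDim.y + threadIdx.y);
    // Interior quads only, so every neighbor tap stays in range.
    if (x < 2 || y < 2 || x + 2 >= width || y + 2 >= height) return;

    // Green site (x, y): red above/below, blue left/right.
    out[y * width + x] = make_uchar4(
        (P(x, y - 1) + P(x, y + 1)) / 2, P(x, y),
        (P(x - 1, y) + P(x + 1, y)) / 2, 255);

    // Blue site (x+1, y): red from the diagonals, green from the 4 neighbors.
    out[y * width + x + 1] = make_uchar4(
        (P(x, y - 1) + P(x + 2, y - 1) + P(x, y + 1) + P(x + 2, y + 1)) / 4,
        (P(x, y) + P(x + 2, y) + P(x + 1, y - 1) + P(x + 1, y + 1)) / 4,
        P(x + 1, y), 255);

    // Red site (x, y+1): green from the 4 neighbors, blue from the diagonals.
    out[(y + 1) * width + x] = make_uchar4(
        P(x, y + 1),
        (P(x - 1, y + 1) + P(x + 1, y + 1) + P(x, y) + P(x, y + 2)) / 4,
        (P(x - 1, y) + P(x + 1, y) + P(x - 1, y + 2) + P(x + 1, y + 2)) / 4, 255);

    // Green site (x+1, y+1): red left/right, blue above/below.
    out[(y + 1) * width + x + 1] = make_uchar4(
        (P(x, y + 1) + P(x + 2, y + 1)) / 2, P(x + 1, y + 1),
        (P(x + 1, y) + P(x + 1, y + 2)) / 2, 255);
#undef P
}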

Another thing to check is your data transfer between host and device. Sometimes, the actual computation is fast, but excessive data movement becomes the bottleneck. Try to minimize these transfers where possible.

Lastly, don’t overlook CPU optimizations. A well-optimized CPU version using SIMD instructions can be surprisingly competitive for certain tasks. It might be worth revisiting your CPU implementation to ensure it’s fully optimized before comparing.