When doing (possibly heavy) pixel processing on a large image, multithreading becomes a must. The standard practice is to run a loop whose indices are partitioned across multiple threads in a thread pool. The performance benefits become apparent immediately, provided the appropriate thread-safety measures are taken to ensure correct results.
However, there are several ways to partition the indices. The most common are partitioning by row and partitioning by pixel. Here is my interpretation of the advantages and drawbacks of each:
By row:
- Less thread creation overhead
- Thread load may be uneven, because the number of rows may not be divisible by the number of threads. This can cause an image that is wide but not tall to be processed inefficiently across multiple cores
By pixel:
- More thread creation overhead
- Thread load can be distributed more evenly, because the time taken to process the leftover indices (those not evenly divisible by the number of threads) is relatively small
Is my interpretation correct, or is there more to the story? Should I always choose one over the other?
For reference, I am using the Parallel.For() function in C#.
I use an approach where each task gets an LTRB (left, top, right, bottom) rect along with the pixels, representing a rectangular region of the image to process.
That lets me handle images that are much wider than they are tall and still split them into rectangular chunks, with, say, 1024 pixels to process per thread. For small images with fewer than 1024 pixels in total, I don't even bother with a parallel for loop, since I've found it's generally cheaper to just use a single-threaded for loop in those cases.
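The chunking described above could be sketched in C# roughly as follows. This is only an illustration under assumptions: an 8-bit grayscale image stored row-major in a byte array, a 32 x 32 tile shape (1024 pixels), and a simple invert operation as the per-pixel work. `Ltrb`, `TiledProcessing`, `MakeTiles`, `Invert` and `InvertRegion` are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper type: a left/top/right/bottom rect.
// Right and Bottom are exclusive.
public readonly struct Ltrb
{
    public readonly int Left, Top, Right, Bottom;

    public Ltrb(int left, int top, int right, int bottom)
    {
        Left = left; Top = top; Right = right; Bottom = bottom;
    }
}

public static class TiledProcessing
{
    // Split a width x height image into tiles of at most tileW x tileH pixels.
    // Tiles on the right and bottom edges may be smaller.
    public static List<Ltrb> MakeTiles(int width, int height, int tileW, int tileH)
    {
        var tiles = new List<Ltrb>();
        for (int top = 0; top < height; top += tileH)
            for (int left = 0; left < width; left += tileW)
                tiles.Add(new Ltrb(left, top,
                                   Math.Min(left + tileW, width),
                                   Math.Min(top + tileH, height)));
        return tiles;
    }

    // Invert an 8-bit grayscale image stored row-major in 'pixels'.
    public static void Invert(byte[] pixels, int width, int height)
    {
        const int MinParallelPixels = 1024; // assumed cutoff, taken from the text

        // Small image: a plain single-threaded loop over the whole rect.
        if ((long)width * height < MinParallelPixels)
        {
            InvertRegion(pixels, width, new Ltrb(0, 0, width, height));
            return;
        }

        // 32 x 32 = 1024 pixels per full tile (assumed tile shape).
        List<Ltrb> tiles = MakeTiles(width, height, 32, 32);

        // Tiles do not overlap, so no two tasks write to the same pixel.
        Parallel.ForEach(tiles, tile => InvertRegion(pixels, width, tile));
    }

    private static void InvertRegion(byte[] pixels, int stride, Ltrb r)
    {
        for (int y = r.Top; y < r.Bottom; y++)
        {
            int rowStart = y * stride;
            for (int x = r.Left; x < r.Right; x++)
                pixels[rowStart + x] = (byte)(255 - pixels[rowStart + x]);
        }
    }
}
```

Calling `TiledProcessing.Invert(pixels, width, height)` processes the buffer in place. The tile size and the cutoff are tuning knobs; the right values depend on how expensive the per-pixel work is.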
Typically you won't get good performance by assigning one pixel per task. At least with libraries like OpenMP and TBB, each task needs a sufficient amount of work, or else the overhead of scheduling the tasks will outweigh the benefits of multithreading, to the point where you can easily end up slower than single-threaded code.
Also, unless your image algorithms don't care about the positions of the pixels they're processing, per-pixel tasks carry the additional overhead of passing (or recomputing) the pixel coordinate for every pixel.
So I recommend processing in rectangular chunks like I do. Having each thread process a scanline isn't bad either, and will generally be good enough for the common cases.
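To illustrate the coordinate overhead: with a flat pixel index, recovering (x, y) costs a division and a modulo. One way to amortize that is to hand each task a contiguous range of indices and compute the starting coordinate once per range. The sketch below uses `Partitioner.Create` from `System.Collections.Concurrent` for the ranges; the class name, the `FillDiagonal` operation and the default range size of 1024 are assumptions made for the example.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class FlatRangeProcessing
{
    // Writes (x + y) mod 256 into every pixel of a row-major 8-bit image.
    // The operation itself is arbitrary; it just needs the pixel position.
    public static void FillDiagonal(byte[] pixels, int width, int height, int rangeSize = 1024)
    {
        int total = width * height;
        if (total == 0) return; // Partitioner.Create rejects an empty range

        // Each task receives a half-open index range [Item1, Item2).
        var ranges = Partitioner.Create(0, total, rangeSize);

        Parallel.ForEach(ranges, range =>
        {
            // One division and one modulo per range instead of per pixel.
            int y = range.Item1 / width;
            int x = range.Item1 % width;

            for (int i = range.Item1; i < range.Item2; i++)
            {
                pixels[i] = (byte)((x + y) & 0xFF);

                // Advance the coordinate incrementally, wrapping at row end.
                if (++x == width) { x = 0; y++; }
            }
        });
    }
}
```

A true one-task-per-pixel version would pay both the scheduling cost and the coordinate computation on every pixel, which is the situation described above.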
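For comparison, the scanline variant is the shortest to write with `Parallel.For`: the outer loop over rows is parallel and the inner loop over columns is a plain serial loop. This is a minimal sketch assuming the same row-major 8-bit buffer; `ScanlineProcessing` and `Threshold` are hypothetical names, and the thresholding is only a stand-in for real per-pixel work.

```csharp
using System.Threading.Tasks;

public static class ScanlineProcessing
{
    // Sets each pixel to 255 if it is >= cutoff, otherwise to 0.
    public static void Threshold(byte[] pixels, int width, int height, byte cutoff)
    {
        // One loop iteration per row; rows are disjoint, so writes never collide.
        Parallel.For(0, height, y =>
        {
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
            {
                int i = rowStart + x;
                pixels[i] = pixels[i] >= cutoff ? (byte)255 : (byte)0;
            }
        });
    }
}
```

Here both x and y are available for free inside the loop body, and each row gives a task `width` pixels of work, which is usually enough unless the image is very narrow.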