2

I'm using this code to find the highest temperature pixel in a thermal image and the coordinates of the pixel.

void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
    int temp = 0;

    for (int i = sz; i > 0; i--)
    {
        if (returnPixel->temperature < *image)
        {
            returnPixel->temperature = *image;
            temp = i;
        }
        image++;
    }

    returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
    returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}

With an image size of 640x480 it takes around 35ms to run through this function, which is too slow for what I need it for (under 10ms ideally).

This is executing on an ARM A9 processor running Linux.

The compiler I'm using is ARM v8 32-Bit Linux gcc compiler.

I'm using optimize -O3 and the following compile options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.

This is the output from the compiler:

000127f4 <_findMax>:
    for(int i = sz; i > 0; i--)
   127f4:   e3510000    cmp r1, #0
{
   127f8:   e52de004    push    {lr}        ; (str lr, [sp, #-4]!)
    for(int i = sz; i > 0; i--)
   127fc:   da000014    ble 12854 <_findMax+0x60>
   12800:   e1d2c0b0    ldrh    ip, [r2]
   12804:   e2400002    sub r0, r0, #2
    int temp = 0;
   12808:   e3a0e000    mov lr, #0
        if(returnPixel->temperature < *image)
   1280c:   e1f030b2    ldrh    r3, [r0, #2]!
   12810:   e153000c    cmp r3, ip
            returnPixel->temperature = *image;
   12814:   81a0c003    movhi   ip, r3
   12818:   81a0e001    movhi   lr, r1
   1281c:   81c230b0    strhhi  r3, [r2]
    for(int i = sz; i > 0; i--)
   12820:   e2511001    subs    r1, r1, #1
   12824:   1afffff8    bne 1280c <_findMax+0x18>
   12828:   e30c3ccd    movw    r3, #52429  ; 0xcccd
   1282c:   e34c3ccc    movt    r3, #52428  ; 0xcccc
   12830:   e0831e93    umull   r1, r3, r3, lr
   12834:   e1a034a3    lsr r3, r3, #9
   12838:   e0831103    add r1, r3, r3, lsl #2
   1283c:   e6ff3073    uxth    r3, r3
   12840:   e04ee381    sub lr, lr, r1, lsl #7
   12844:   e6ffe07e    uxth    lr, lr
    returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
   12848:   e1c2e0b4    strh    lr, [r2, #4]
    returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
   1284c:   e1c230b6    strh    r3, [r2, #6]
}
   12850:   e49df004    pop {pc}        ; (ldr pc, [sp], #4)
    for(int i = sz; i > 0; i--)
   12854:   e3a03000    mov r3, #0
   12858:   e1a0e003    mov lr, r3
   1285c:   eafffff9    b   12848 <_findMax+0x54>

For clarity after comments:

Each pixel is a unsigned 16 bit integer, image[0] would be the pixel with coordinates 0,0, and the last in the array would have the coordinates 639,479.

18
  • 1
    You could check every second line and every second column, which would make it roughly 4 times faster. You loose precision but it might be good enough for your needs. Commented Nov 5, 2020 at 9:20
  • 1
    a) Unfold for loop a little. e.g. process several pixels each iteration. See: duffs device -> en.wikipedia.org/wiki/Duff%27s_device, b) Use some gradient path algorithm: en.wikipedia.org/wiki/Gradient_descent Commented Nov 5, 2020 at 9:57
  • 1
    How are you reading the pixels? Finding the hottest at that time may be trivial. Commented Nov 5, 2020 at 10:13
  • 1
    @pqans -I added the option -funroll-loops which got it down to 13ms Commented Nov 5, 2020 at 11:14
  • 1
    Yep so that algorithm isn't useful, unless you can find a way to split the image into several "hot spot squares". If you can do that, then gradient descent is much faster than "brute force" which is the current algorithm.
    – Lundin
    Commented Nov 5, 2020 at 12:54

6 Answers 6

3

This is executing on an ARM A9 processor running Linux.

ARM Cortex-A9 supports Neon.

With this in mind the goal should be to load 8 values (128 bits of pixel data) into a register, then do "compare with the current maximums for each of the 8 places" to get a mask, then use the mask and its inverse to mask out the "too small" old maximums and the "too small" new values; then OR the results to merge the new higher values into the "current maximums for each of the 8 places".

Once that has been done for all pixels (using a loop); you'd want to find the highest value in the "current maximums for each of the 8 places".

However; to find the location of the hottest pixel (rather than just how hot it is) you'd want to split the image into tiles (e.g. maybe 8 pixels wide and 8 pixels tall). This allows you to find the max. temperature within each tile (using Neon); then find the pixel within the hottest tile. Note that for huge images this lends itself to a "multi-layer" approach - e.g. create a smaller image containing the maximum from each tile in the original image; then do the same thing again to create an even smaller image containing the maximum from each "group of tiles", then ...

Making this work in plain C means trying to convince the compiler to auto-vectorize. The alternatives are to use compiler intrinsics or inline assembly. In any of these cases, using Neon to do 8 pixels in parallel (without any branches) could/should improve performance significantly (how much depends on RAM bandwidth).

3
  • Re “get a mask, then use the mask”: I believe the Cortex-A9 has vmax instructions that can update a maximum without using multiple instruction to generate a mask and use it. If one is doing an initial loop to find the maximum, it should use vmax. Commented Nov 5, 2020 at 12:16
  • Re “split the image into tiles”: What benefit is there to tiling the image two-dimensionally? The one-dimensional search has no impediments for cache or memory access. If the concern is the cost of reloading lines evicted in the first loop, one can strip mine the two loops (find the maximum, then find its location) one-dimensionally (limiting the combined loop to using a little under S bytes of image, where S is the total cache size). Commented Nov 5, 2020 at 12:20
  • @EricPostpischil: A "tile" could be any rectangle (including "1 high, X wide" rectangle). For the "multi-layer" approach square tiles (e.g. 8 high, 8 wide) are much easier to visualize (because a hot spot in one place of the original image will be in the same place in smaller image/s); which makes it easier for the original asker to understand. Of course once the original asker has understood the basic concept and begins trying to implement it they'll probably figure out what is the most optimal "tile" themselves, so there's little point in "prematurely de-optimizing the learning process".
    – Brendan
    Commented Nov 5, 2020 at 14:37
3

You should minimize memory access, especially in loops.
Every * or -> could cause unnecessary memory access resulting in serious performance hits.
Local variables are your best friend:

void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
    int temp = 0;
    uint16_t temperature = returnPixel->temperature;
    uint16_t pixel;

    for (int i = sz; i > 0; i--)
    {
        pixel = *image++;
        if (temperature < pixel)
        {
            temperature = pixel;
            temp = i;
        }
    }

    returnPixel->temperature = temperature;
    returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
    returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}

Below is how this can be optimized by utilizing neon:

#include <stdint.h>
#include <arm_neon.h>
#include <assert.h>

static inline void findMax128_neon(uint16_t *pDst, uint16x8_t *pImage)
{
    uint16x8_t in0, in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12, in13, in14, in15;
    uint16x4_t dmax;
    in0 = vld1q_u16(pImage++);
    in1 = vld1q_u16(pImage++);
    in2 = vld1q_u16(pImage++);
    in3 = vld1q_u16(pImage++);
    in4 = vld1q_u16(pImage++);
    in5 = vld1q_u16(pImage++);
    in6 = vld1q_u16(pImage++);
    in7 = vld1q_u16(pImage++);
    in8 = vld1q_u16(pImage++);
    in9 = vld1q_u16(pImage++);
    in10 = vld1q_u16(pImage++);
    in11 = vld1q_u16(pImage++);
    in12 = vld1q_u16(pImage++);
    in13 = vld1q_u16(pImage++);
    in14 = vld1q_u16(pImage++);
    in15 = vld1q_u16(pImage);

    in0 = vmaxq_u16(in1, in0);
    in2 = vmaxq_u16(in3, in2);
    in4 = vmaxq_u16(in5, in4);
    in6 = vmaxq_u16(in7, in6);
    in8 = vmaxq_u16(in9, in8);
    in10 = vmaxq_u16(in11, in10);
    in12 = vmaxq_u16(in13, in12);
    in14 = vmaxq_u16(in15, in14);

    in0 = vmaxq_u16(in2, in0);
    in4 = vmaxq_u16(in6, in4);
    in8 = vmaxq_u16(in10, in8);
    in12 = vmaxq_u16(in14, in12);

    in0 = vmaxq_u16(in4, in0);
    in8 = vmaxq_u16(in12, in8);

    in0 = vmaxq_u16(in8, in0);

    dmax = vmax_u16(vget_high_u16(in0), vget_low_u16(in0));

    dmax = vpmax_u16(dmax, dmax);

    dmax = vpmax_u16(dmax, dmax);

    vst1_lane_u16(pDst, dmax, 0);
}

void _findMax_neon(uint16_t *image, int sz, sPixelData *returnPixel)
{
    assert((sz % 128) == 0);
    const uint32_t nSector = sz/128;
    uint16_t max[nSector];
    uint32_t i, s, nMax;
    uint16_t *pImage;

    for (i = 0; i < nSector; ++i)
    {
        findMax128_neon(&max[i], (uint16x8_t  *) &image[i*128]);
    }

    s = 0;
    nMax = max[0];

    for (i = 1; i < nSector; ++i)
    {
        if (max[i] > nMax)
        {
            s = i;
            nMax = max[i];
        }
    }

    if (nMax < returnPixel->temperature)
    {
        returnPixel->x_location = 0;
        returnPixel->y_location = 0;
        return;
    }

    pImage = &image[s];
    i = 0;

    while(1) {
        if (*pImage++ == nMax) break;
        i += 1;
    }

    i += 128 * s;

    returnPixel->temperature = nMax;
    returnPixel->x_location = i % IMAGE_HORIZONTAL_SIZE;
    returnPixel->y_location = i / IMAGE_HORIZONTAL_SIZE;
}

Beware that the function above assumes sz being a multiple of 128.
And yes, it will run in less than 10ms.

2

The culprit here is the slow linear search for the highest "temperature". I'm not quite sure how to improve that search algorithm with the information given, if at all possible (could you sort the data in advance?), but you could start with this:

uint16_t max = 0;
size_t found_index = 0;

for(size_t i=0; i<sz; i++)
{
  if(max < image[i])
  {
    max = image[i];
    found_index = sz - i - 1; // or whatever makes sense here for the algorithm
  }
}

returnPixel->temperature = max;
returnPixel->x_location = found_index % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = found_index / IMAGE_HORIZONTAL_SIZE;

This might give a very slight performance gain because of the top to bottom iteration order and not touching unrelated memory returnPixel in the middle of the loop. max should get stored in a register and with luck you might get slightly better cache performance overall. Still, this comes with a branch like the original code, so it is a minor improvement.

Another micro-optimization is to change the parameter to const uint16_t* image - this might give slightly better pointer aliasing, in case returnPixel happens to contain a uint16_t too. image should be const regardless of performance, since const correctness is always good practice.

Further more obscure optimization tricks might be possible if you read image 32 or 64 bits at a time, then come up with a fast look-up method to find the largest image inside that 32/64 bit chunk.

1

If you have to find the hottest pixel in the image and if there is no structure to the image data itself then I think your stuck with iterating through the pixels. If so you have a number of different ways to make this faster:

  • As suggested above try loop unrolling and other micro-optimisation tricks, this might give you the performance boost you need
  • Go parallel, split the array into N chunks and find the MAX[N] for each chunk and then find the largest of the MAX[N] values. You have to be careful here as setting up the parallel processes can take longer than doing the work.

If there is some structure to the image, lots of cold pixels and a hot spot (larger than 1 pixel) that your trying to find say, then there are other techniques you could use.

One approach could be to split the image up into N boxes and then sample each box, The hottest pixel in the hottest box (and maybe in the boxes adjacent too it) would then be your result. However this depends on their being some structure to the image which you can rely on.

1

The assembly language reveals the compiler is storing in returnPixel->temperature each time a new maximum is found, with the instruction strhhi r3, [r2]. Eliminate this unnecessary store by caching the maximum in a local object and only updating returnPixel->temperature after the loop ends:

uint16_t maximum = returnPixel->temperature;
for (int i = sz; i > 0; i--)
    {
        if (maximum < *image)
        {
            maximum = *image;
            temp = i;
        }
        image++;
    }
returnPixel->temperature = maximum;

That is unlikely to reduce the execution time as much as you need, but it might if there is some bad cache or memory interaction occurring. It is a very simple change, so try it before moving on to the SIMD vectorizations suggested in other answers.

Regarding vectorization, two approaches are:

  • Iterate through the image using a vmax instruction to update the maximum value seen so far in each SIMD lane. Then consolidate the lanes to find the overall maximum. Then iterate through the image again looking for that maximum in any lane. (I forget what that architecture has for instructions that would assist in testing whether a comparison produced true in any lane.)
  • Iterate through the image maintaining three registers: One with the maximum seen so far in each lane, one with a counter of position in the image, and one with, in each lane, a record of the counter value at the time each new maximum was seen. The first can be updated with vmax, as above. The second can be updated with vadd. The third can be updated with vcmp and vbit. After the loop, figure out which lane has the maximum, and get the position of the maximum from the recorded counter for that lane.

Depending on the performance of the necessary instructions, a hybrid approach may be faster:

  • Set some strip size S. Partition the image into strips of that size. For each strip in the image, find the maximum (using the fast vmax loop described above). If the maximum is greater than seen in previous strips, remember it and the current strip number. After processing the whole image, the task has been reduced to finding the location of the maximum in a particular strip. Use the second loop from the first approach above for that. (For large images, further refinements may be possible, possibly refining the location using a shorter strip size before finding the exact location, depending on cache behavior and other factors.)
0

In my opinion you cannot improve that algorithim, because any array element can hold the maximum, so you need to do at least one pass through the data, I don't believe you can improve this without going multithreading. you can start several threads (as many as cores/processors you may have) and give each a subset of your image. Once they are finished, you will have as many local maximum values as the number of threads you started. just do a second pass on those vaues to get the maximum total value, and you are finished. But consider the extra workload of creating threads, allocating stack memory for them, and scheduling, as that can be higher if the number of values is low than the workload of running all in a single thread. If you have a thread pool somewhere to provide ready to run threads, and that is something you can get on, then probably you'll be able to finished in one Nth part of the time to run all the loop in one single processor (where N is the number of cores you have on the machine)

Note: using a dimension that is a power of two will save you the job of calculating a quotient and a remainder by solving the division problem with a bit shift and a bit mask. You use it only once in your function, but it's an improving, anyway.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.