Hello everyone! Here's a bit of what's going on this week.
We were looking again at dotproduct.c from the neon_test archive. This time, we were referring to it as a template for creating a version of dotproduct that takes 16-bit signed integer inputs and returns a 32-bit signed integer output. This is in order to port a demodulator Phil Karn is working on to ARM/NEON. This forced us to think more carefully aboutregister allocation, and we ran into a subtlety we didn't fully understand.
In the original dotproduct.c, half of the elements are multiply-accumulated into q8, and half into q9. Then at the end, q9 is added into q8. Like this:
In the original dotproduct.c, half of the elements are multiply-accumulated into q8, and half into q9. Then at the end, q9 is added into q8. Like this:
...
vld1.32 {d0,d1,d2,d3}, [%1]!
vld1.32 {d4,d5,d6,d7}, [%2]!
vmla.f32 q8, q0, q2
vmla.f32 q9, q1, q3
bgt 1b
vadd.f32 q8, q8, q9
...
We thought, why not change this to accumulate directly into q8? This would be logically equivalent:
...
vld1.32 {d0,d1,d2,d3}, [%1]!
vld1.32 {d4,d5,d6,d7}, [%2]!
vmla.f32 q8, q0, q2
vmla.f32 q8, q1, q3
bgt 1b
# no vadd.f32 required
...
We ran this on the Beagleboard, and it does indeed get the same answer. But it's about 30% slower.
The two vmla instructions in the original code are independent in both inputs and outputs, whereas the two vmla instructions in our new version share the destination register q8. It would appear that the execution hardware in the processor on the Beagleboard is smart enough to do those two operations largely in parallel if the inputs and outputs are independent. That's very cool.
So, the questions are:
1. Is that the correct explanation of what's happening?
2. How were we supposed to know that such parallel execution was possible? Put another way, how can we find out whether the same is true of any other two operations, short of coding up multiple versions and trying them out?
3. There must be a limit on how many operations (such as vmla's) can proceed in parallel. I think there are enough registers available to do more in the dotproduct.c code, but the original code stopped at two. So I'm guessing the author was able to know that two was the maximum for parallel execution. Is this in the documentation somewhere or did the author have to run experiments?
Apologies to Philip Balister, who will get a slightly longer version of this email!
We're making progress learning ARM and NEON code and enjoying ourselves in the process.
-Michelle W5NYV, Paul KB5MU
No comments:
Post a Comment