How to get lower and higher 32 bits of a 64-bit integer for gcc inline asm? (ARMV5 platform)

I have a project on an ARMv5TE platform, and I have to rewrite some functions in assembly to use the enhanced DSP instructions.
I use int64_t accumulators a lot, but I don't know how to pass one to the ARM SMULL instruction (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).

How can I pass the lower or upper 32 bits of a 64-bit variable to a 32-bit register? (I know I can use an intermediate int32_t variable, but that does not look good.)

I know the compiler would do this for me; I wrote this small function only as an example.

int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
   int64_t tmp_acc;

   asm("SMULL %0, %1, %2, %3"
      : "=r"(tmp_acc), "=r"(tmp_acc) // no idea how to pass tmp_acc;
      : "r"(x), "r"(y)
      );

   return tmp_acc + acc;
}

edited Dec 31 '18 at 7:32 by Peter Cordes
asked Dec 28 '18 at 15:42 by Yevhen Tsyba

"does not look good" might be an important aesthetic concern, but doesn't particularly matter for inline assembly.

– EOF
Dec 28 '18 at 15:49

took roughly 30-90 seconds to find the answer on google, an additional 2-5 minutes of reading the gcc docs and googling to find why that answer works.

– old_timer
Dec 30 '18 at 7:03

Template modifiers for AArch32 state. You might also want to do a bit more googling. I've heard gcc will use dsp instructions if it believes they are valuable (and available). If you can avoid using inline asm, you will almost certainly be happier.

– David Wohlferd
Dec 30 '18 at 22:42

Tags: c gcc assembly arm inline-assembly


1 Answer

You don't need and shouldn't use inline asm for this. The compiler can do even better than smull, and use smlal to multiply-accumulate with one instruction:

int64_t accum(int64_t acc, int32_t x, int32_t y) {
    return acc + x * (int64_t)y;
}

With gcc 8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer (ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list), this compiles to:

accum:
    smlal   r0, r1, r3, r2        @, y, x
    bx      lr  @

As a bonus, this pure C also compiles efficiently for AArch64.

https://gcc.gnu.org/wiki/DontUseInlineAsm
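
The same pure-C pattern should carry over to a loop. Here is a minimal sketch, assuming a hypothetical helper dot_i32 (the name and signature are mine, not from the question). I have not checked which instruction sequence each compiler picks for it; the only claim is that the C is well-defined as long as the running sum fits in int64_t:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): 64-bit dot product of
   two int32_t arrays, written in plain C so the compiler is free to pick
   a multiply-accumulate instruction itself. */
int64_t dot_i32(const int32_t *a, const int32_t *b, size_t n)
{
    /* Precondition: both pointers must be valid unless n is 0. */
    assert(n == 0 || (a != NULL && b != NULL));

    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen one operand first so the multiply is done in 64 bits. */
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}
```

The cast on one operand is the important part: without it the multiply would be done in 32 bits and could overflow before the result is widened.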

If you insist on shooting yourself in the foot and using inline asm:

Or, more generally, there might be other instructions where you really do want this.

First, beware that smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint on the output operand(s) will do the trick of telling the compiler it can't have inputs in those registers. I don't see a clean way to tell the compiler that the 2nd input can be in the same register as an output.

This restriction is lifted in ARMv6 and later (the Keil documentation says "Rn must be different from RdLo and RdHi in architectures before ARMv6"), but for ARMv5 compatibility you need to make sure the compiler doesn't violate it when filling in your inline-asm template.

Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.

So you can specify a 64-bit input or output as a pair of 32-bit variables.

#include <stdint.h>

int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
    uint32_t prod_lo, prod_hi;

    asm("SMULL %0, %1, %2, %3"
       : "=&r" (prod_lo), "=&r"(prod_hi)  // early clobber for pre-ARMv6
       : "r"(x), "r"(y)
       );

    int64_t prod = ((int64_t)prod_hi) << 32;
    prod |= prod_lo;        // + here won't optimize away, but | does, with gcc
    return acc + prod;
}

Unfortunately the early-clobber means we need 6 total registers, but the ARM calling convention only has 6 call-clobbered registers (r0..r3, lr, and ip (aka r12)). And one of them is LR, which has the return address so we can't lose its value. Probably not a big deal when inlined into a regular function that already saves/restores several registers.

Again from Godbolt:

@ gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
    str     lr, [sp, #-4]!    @,         Save return address (link register)
    SMULL ip, lr, r2, r3      @ prod_lo, prod_hi, x, y
    adds    r0, ip, r0        @, prod, acc
    adc     r1, lr, r1        @, prod, acc
    ldr     pc, [sp], #4      @          return by popping the return address into PC

@ gcc -O3 output without early-clobber (&) on output constraints:
@ valid only for ARMv6 and later
testFunc:
    SMULL r3, r2, r2, r3      @ prod_lo, prod_hi, x, y
    adds    r0, r3, r0        @, prod, acc
    adc     r1, r2, r1        @, prod, acc
    bx      lr  @
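
To convince yourself that the split/recombine step in testFunc is lossless, you can try it in isolation on any host. This is only a sketch: lo32, hi32, combine64 and check_roundtrip are hypothetical names, and it shifts in uint64_t rather than int64_t (a deliberate change from the code above, to avoid shifting into the sign bit of a signed type). I have not checked whether gcc optimizes this variant away as cleanly as the | version above.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a 64-bit value. */
static inline uint32_t lo32(int64_t v)
{
    return (uint32_t)(uint64_t)v;
}

/* High 32 bits of a 64-bit value (shift done on the unsigned type). */
static inline uint32_t hi32(int64_t v)
{
    return (uint32_t)((uint64_t)v >> 32);
}

/* Recombine two 32-bit halves. The final conversion to int64_t is
   implementation-defined in C for large values; gcc documents it as
   reduction modulo 2^64, which is the behaviour assumed here. */
static inline int64_t combine64(uint32_t hi, uint32_t lo)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}

/* Sanity check: splitting and recombining must give back the input. */
static inline void check_roundtrip(int64_t v)
{
    assert(combine64(hi32(v), lo32(v)) == v);
}
```

On a 32-bit target each half already lives in its own register, so the expectation is that these helpers cost nothing after inlining, but that is worth confirming in the compiler output for your own build.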

Or you can use a "=r"(prod64) constraint and use modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm for some reason, saving more registers (and maintaining 8-byte stack alignment): gcc saves 2 registers instead of 1, and clang saves 4 instead of 2.

// using an int64_t directly with inline asm, using the %Q0 and %R0 operand modifiers
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
    int64_t prod;    // gcc and clang seem to want more free registers this way

    asm("SMULL %Q0, %R0, %1, %2"
       : "=&r" (prod)         // early clobber for pre-ARMv6
       : "r"(x), "r"(y)
       );

    return acc + prod;
}

Again compiled with gcc -O3 -mcpu=arm10e (clang saves/restores 4 registers):

@ gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
    push    {r4, r5}          @
    SMULL r4, r5, r2, r3      @ prod, x, y
    adds    r0, r4, r0        @, prod, acc
    adc     r1, r5, r1        @, prod, acc
    pop     {r4, r5}  @
    bx      lr  @

So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.
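
The same %Q / %R modifiers should also work on a read-write operand, which is what a multiply-accumulate needs. A hedged sketch, not tested on real ARMv5 hardware: mac64 and mac64_selftest are hypothetical names, the "+&r" constraint is my assumption for keeping the accumulator pair out of the registers holding x and y, and the #else branch is a plain C fallback so the function still builds on non-ARM hosts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns acc + (int64_t)x * y. */
int64_t mac64(int64_t acc, int32_t x, int32_t y)
{
#if defined(__arm__) && (!defined(__thumb__) || defined(__thumb2__))
    /* %Q0 names the register holding the low half of acc, %R0 the high
       half. "+&r" marks acc as read-write and early-clobbered, so the
       compiler should not place x or y in either of its registers
       (assumed to be needed before ARMv6, as with SMULL). */
    __asm__("smlal %Q0, %R0, %1, %2"
            : "+&r"(acc)
            : "r"(x), "r"(y));
#else
    /* Portable fallback for hosts without the ARM instruction. */
    acc += (int64_t)x * y;
#endif
    return acc;
}

/* Quick self-check against the plain C expression. */
void mac64_selftest(void)
{
    assert(mac64(10, 3, -4) == 10 + (int64_t)3 * -4);
}
```

Even if this assembles as intended, the pure C version at the top of this answer remains the better choice; the sketch is only meant to show how a 64-bit read-write operand can be named in the template.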

edited Dec 31 '18 at 8:54
answered Dec 31 '18 at 2:14 by Peter Cordes
