How to get lower and higher 32 bits of a 64-bit integer for gcc inline asm? (ARMV5 platform)












I have a project on an ARMv5TE platform, and I have to rewrite some functions in assembly to use the enhanced DSP instructions.
I use a lot of int64_t accumulators, but I don't know how to pass one to the ARM instruction SMULL (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).

How can I pass the lower or upper 32 bits of a 64-bit variable to a 32-bit register? (I know I can use an intermediate int32_t variable, but that does not look good.)

I know the compiler would do this for me; I just wrote this small function as an example.



int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
    int64_t tmp_acc;

    asm("SMULL %0, %1, %2, %3"
        : "=r"(tmp_acc), "=r"(tmp_acc) // no idea how to pass tmp_acc;
        : "r"(x), "r"(y)
    );

    return tmp_acc + acc;
}









  • "does not look good" might be an important aesthetic concern, but doesn't particularly matter for inline assembly.
    – EOF, Dec 28 '18 at 15:49

  • took roughly 30-90 seconds to find the answer on google, an additional 2-5 minutes of reading the gcc docs and googling to find why that answer works.
    – old_timer, Dec 30 '18 at 7:03

  • Template modifiers for AArch32 state. You might also want to do a bit more googling. I've heard gcc will use dsp instructions if it believes they are valuable (and available). If you can avoid using inline asm, you will almost certainly be happier.
    – David Wohlferd, Dec 30 '18 at 22:42
















c gcc assembly arm inline-assembly






edited Dec 31 '18 at 7:32 by Peter Cordes
asked Dec 28 '18 at 15:42 by Yevhen Tsyba








1 Answer
You don't need and shouldn't use inline asm for this. The compiler can do even better than smull, and use smlal to multiply-accumulate with one instruction:



int64_t accum(int64_t acc, int32_t x, int32_t y) {
    return acc + x * (int64_t)y;
}


This compiles (with gcc 8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer; ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list) to this asm:



accum:
        smlal   r0, r1, r3, r2    @, y, x
        bx      lr                @


As a bonus, this pure C also compiles efficiently for AArch64.



https://gcc.gnu.org/wiki/DontUseInlineAsm
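As a slightly larger illustration of the same pure-C idiom, here is a minimal sketch of a 64-bit dot product. The function name `dot_i32` is hypothetical (it is not from the question), and whether a given compiler turns the loop body into smlal depends on the compiler version and flags, so check the generated asm rather than assuming it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original post: accumulate the
 * products of two int32_t arrays into a 64-bit sum, in plain C. */
int64_t dot_i32(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen one operand first so the multiply is done in 64 bits;
         * a 32x32 multiply would overflow before being widened. */
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}
```

The cast on one operand is the important part: it is what makes this a widening 32x32 => 64-bit multiply, which is the operation smull/smlal provide.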





If you insist on shooting yourself in the foot and using inline asm (or, more generally, if you hit a case with other instructions where you really do want this):



First, beware that the smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint (&) on the output operands does the trick: it tells the compiler it can't put inputs in those registers. I don't see a clean way to tell the compiler that the second input can share a register with an output.

This restriction is lifted in ARMv6 and later (the Keil documentation says "Rn must be different from RdLo and RdHi in architectures before ARMv6"), but for ARMv5 compatibility you need to make sure the compiler doesn't violate it when filling in your inline-asm template.



Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.



So you can specify a 64-bit input or output as a pair of 32-bit variables.



#include <stdint.h>

int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
    uint32_t prod_lo, prod_hi;

    asm("SMULL %0, %1, %2, %3"
        : "=&r" (prod_lo), "=&r"(prod_hi) // early clobber for pre-ARMv6
        : "r"(x), "r"(y)
    );

    int64_t prod = ((int64_t)prod_hi) << 32;
    prod |= prod_lo; // + here won't optimize away, but | does, with gcc
    return acc + prod;
}


Unfortunately, the early-clobber means we need 6 registers in total (two for acc, one each for x and y, and two for the product), and the ARM calling convention only has 6 call-clobbered registers: r0..r3, lr, and ip (aka r12). One of them is LR, which holds the return address, so we can't lose its value. This is probably not a big deal when the code is inlined into a regular function that already saves/restores several registers.



Again from Godbolt:



@ gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
        str     lr, [sp, #-4]!    @, Save return address (link register)
        SMULL   ip, lr, r2, r3    @ prod_lo, prod_hi, x, y
        adds    r0, ip, r0        @, prod, acc
        adc     r1, lr, r1        @, prod, acc
        ldr     pc, [sp], #4      @ return by popping the return address into PC


@ gcc -O3 output without early-clobber (&) on output constraints:
@ valid only for ARMv6 and later
testFunc:
        SMULL   r3, r2, r2, r3    @ prod_lo, prod_hi, x, y
        adds    r0, r3, r0        @, prod, acc
        adc     r1, r2, r1        @, prod, acc
        bx      lr                @
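The pair-of-halves approach should also cover a 64-bit value that is both read and written by the instruction. The following is a sketch only, not something from the Godbolt session above: the function name `mac_halves` is hypothetical, the asm path uses smlal with read-write early-clobber operands ("+&r") and is compiled only for GCC-compatible compilers targeting ARM (non-Thumb) state, and everywhere else a plain C fallback is used. The recombining cast relies on GCC's modulo behaviour for out-of-range unsigned-to-signed conversion.

```c
#include <stdint.h>

/* Hypothetical example, not from the original post. */
int64_t mac_halves(int64_t acc, int32_t x, int32_t y)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    /* Split the 64-bit accumulator into two 32-bit variables. */
    uint32_t lo = (uint32_t)acc;
    uint32_t hi = (uint32_t)((uint64_t)acc >> 32);

    /* "+&r": read-write operand with early clobber, so the compiler
     * keeps x and y out of these registers (needed before ARMv6). */
    __asm__("smlal %0, %1, %2, %3"
            : "+&r"(lo), "+&r"(hi)
            : "r"(x), "r"(y));

    /* Glue the halves back together with shift/OR. */
    return (int64_t)(((uint64_t)hi << 32) | lo);
#else
    /* Portable fallback: same result in plain C, as long as the sum
     * does not overflow int64_t. */
    return acc + (int64_t)x * y;
#endif
}
```

As with the smull version, the pure C expression is the better choice in real code; this only shows how an in/out 64-bit operand could be passed as two 32-bit ones.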


Or you can use a single 64-bit operand with an "=r"(prod64) constraint and use operand modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm this way for some reason, saving more registers (and maintaining 8-byte stack alignment): 2 instead of 1 for gcc, 4 instead of 2 for clang.



// using an int64_t directly with inline asm, using the %Q0 and %R0 operand modifiers
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
    int64_t prod; // gcc and clang seem to want more free registers this way

    asm("SMULL %Q0, %R0, %1, %2"
        : "=&r" (prod) // early clobber for pre-ARMv6
        : "r"(x), "r"(y)
    );

    return acc + prod;
}


Again compiled with gcc -O3 -mcpu=arm10e (clang saves/restores 4 registers):



@ gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
        push    {r4, r5}          @
        SMULL   r4, r5, r2, r3    @ prod, x, y
        adds    r0, r4, r0        @, prod, acc
        adc     r1, r5, r1        @, prod, acc
        pop     {r4, r5}          @
        bx      lr                @


So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.
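For reference, the manual split and recombine of the two halves can be written portably. This is a minimal sketch with hypothetical helper names (`lo32`, `hi32`, `join64`) that do not appear in the question; it assumes the usual two's-complement behaviour when converting an out-of-range unsigned value back to int64_t, which is what GCC implements.

```c
#include <stdint.h>

/* Hypothetical helpers, not from the original post. */

/* Low 32 bits of a 64-bit value. */
uint32_t lo32(int64_t v)
{
    return (uint32_t)(uint64_t)v;
}

/* High 32 bits of a 64-bit value. */
uint32_t hi32(int64_t v)
{
    return (uint32_t)((uint64_t)v >> 32);
}

/* Recombine two halves. Both are unsigned, so OR-ing in the low half
 * cannot sign-extend into the high half. */
int64_t join64(uint32_t hi, uint32_t lo)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

Keeping the low half unsigned matters: if it were an int32_t with its top bit set, converting it to 64 bits before the OR would sign-extend it and overwrite the high half with ones.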






edited Dec 31 '18 at 8:54
answered Dec 31 '18 at 2:14 by Peter Cordes





























