The Difficulty of ARM Assembly


Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

The Difficulty of ARM Assembly

Post by Drew Sebastino »

I recently started programming an STM32 microcontroller (ARM Cortex-M0 processor) for college and was naïve enough to try programming it in assembly. There are few ways to work with immediate values, and the limits seem to change heavily from one instruction to the next (sometimes it's an 8-bit value that can be shifted, sometimes it's a regular 16-bit number, and sometimes it's even a 12-bit number). There's never any absolute addressing either, because of the 32-bit instruction size, and the limits on relative addressing appear just as random as those on immediate values. It's confusing enough for a human (or me at least) that I wouldn't be surprised if a compiler generated substantially faster code.
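
To make the "8-bit value that can be shifted" case concrete, here is a minimal Python sketch of the classic 32-bit ARM (A32) data-processing immediate rule: a constant is encodable only if it equals an 8-bit value rotated right by an even amount. That this is the rule being described is an assumption, and the helper names (`ror32`, `rol32`, `a32_immediate_encoding`) are hypothetical. It does not model the narrower immediates of the Cortex-M0's ARMv6-M instruction set.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    return ror32(value, (32 - amount) % 32)


def a32_immediate_encoding(constant):
    """Return (imm8, rotation) such that constant == imm8 ROR rotation.

    Returns None when no even rotation of an 8-bit value produces the
    constant, i.e. when it cannot be an A32 data-processing immediate.
    """
    constant &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):
        # Undo the rotation: if what remains fits in 8 bits, it encodes.
        imm8 = rol32(constant, rotation)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


if __name__ == "__main__":
    for value in (0xFF, 0xFF000000, 0x3FC, 0x101, 0x12345678):
        print(hex(value), a32_immediate_encoding(value))
```

Values such as 0x101 or 0x12345678 come back as `None`, which is the situation where a literal pool or a multi-instruction sequence is needed.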
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: The Difficulty of ARM Assembly

Post by lidnariq »

For exactly this reason, ARM assemblers usually allocate a region in the CODE segment near any given routine, called a "constant pool".
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: The Difficulty of ARM Assembly

Post by Oziphantom »

Yeah, ARM and RISC in general are designed for compilers. The idea is that a smaller instruction set rules out complex code paths, which reduces the search space for a compiler, so the compiler produces code about as good as hand-written assembly. That said, when I was working on the M0, the code gcc was producing was horrendous.

Basically, the instructions are a fixed bit length, so some bits encode the instruction and the parameters get whatever is left. As the ARM instruction set has moved more and more toward CISC, they have started to need instruction packing.

Typically, when you hand-write assembly for a RISC, you use a "higher-level asm". I don't know of one for ARM, but on MIPS you have MAL, which is a helper for TAL.
Jarhmander
Formerly ~J-@D!~
Posts: 569
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: The Difficulty of ARM Assembly

Post by Jarhmander »

ARMv6-M (the core of Cortex-M0) is a bastardization; it's a stripped-down, not very orthogonal copy of ARMv7-M (Cortex-M3). For instance, on ARMv6-M, many of the instructions can only act on r0-r7, and the destination is forced to be the same as one of the operands; very few instructions can use all registers. And indeed, you can only move an 8-bit literal, or load a 32-bit constant from a constant pool. If you want something more pleasant to work with, consider a Cortex-M3: far fewer restrictions, more fun to work with.
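
As a rough illustration of what the 8-bit limit costs, the Python sketch below builds a 32-bit constant one byte at a time using only `movs`, `lsls`, and `adds` with 8-bit immediates, then simulates the result. It is a sketch under assumptions rather than what any particular assembler emits, and the function names (`build_constant_sequence`, `run_sequence`, `format_sequence`) are hypothetical.

```python
def build_constant_sequence(constant):
    """Return (op, operand) steps that build `constant` from 8-bit pieces."""
    constant &= 0xFFFFFFFF
    # Split into bytes, most significant first.
    chunks = [(constant >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    # Leading zero bytes need no instructions.
    while len(chunks) > 1 and chunks[0] == 0:
        chunks.pop(0)
    steps = [("movs", chunks[0])]
    for chunk in chunks[1:]:
        steps.append(("lsls", 8))          # make room for the next byte
        if chunk:
            steps.append(("adds", chunk))  # drop the next byte in
    return steps


def run_sequence(steps):
    """Simulate the steps on one 32-bit register and return its value."""
    value = 0
    for op, operand in steps:
        if op == "movs":
            value = operand
        elif op == "lsls":
            value = (value << operand) & 0xFFFFFFFF
        elif op == "adds":
            value = (value + operand) & 0xFFFFFFFF
        else:
            raise ValueError(f"unknown op: {op}")
    return value


def format_sequence(steps, reg="r0"):
    """Render the steps as assembly-like text for reading."""
    lines = []
    for op, operand in steps:
        if op == "lsls":
            lines.append(f"lsls {reg}, {reg}, #{operand}")
        else:
            lines.append(f"{op} {reg}, #0x{operand:02X}")
    return "\n".join(lines)


if __name__ == "__main__":
    steps = build_constant_sequence(0x12345678)
    print(format_sequence(steps))
    print(hex(run_sequence(steps)))
```

For 0x12345678 this takes seven 16-bit instructions (14 bytes), against one 16-bit load plus a 4-byte pool entry, which is why the constant pool is usually the better deal.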
((λ (x) (x x)) (λ (x) (x x)))
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: The Difficulty of ARM Assembly

Post by tepples »

But how does ARMv6-M compare to ARMv4 Thumb, as used in Game Boy Advance? Wikipedia says it supports "most" Thumb instructions and "some" Thumb-2 instructions. Does Thumb cause more of a problem than it did on GBA?
Drew Sebastino
Formerly Espozo
Posts: 3496
Joined: Mon Sep 15, 2014 4:35 pm
Location: Richmond, Virginia

Re: The Difficulty of ARM Assembly

Post by Drew Sebastino »

This is probably a stupid question, but for what reason would a processor be forced to use a certain instruction width? (Well, THUMB is either 16 or 32 bit, but cannot be larger than that.) ...I was just about to say that I'm confused as to why there would be a 16-bit instruction that contains the address of a 32-bit number that still needs to be loaded, but I suppose with a 16-bit data bus, the waste of having that address included in the instruction doesn't really matter...

And I was running into problems before I wrote "processor cpu32_6m". :|

And what did Thumb do on the GBA?
lidnariq
Posts: 11432
Joined: Sun Apr 13, 2008 11:12 am

Re: The Difficulty of ARM Assembly

Post by lidnariq »

Drew Sebastino wrote: This is probably a stupid question, but for what reason would a processor be forced to use a certain instruction width?
Theoretically, it makes things a lot simpler, because you don't need special lookup tables or handling to deal with variable length instructions.

In practice ... it turns out that a fixed length of 32-bit instructions is actually kinda lousy. Cache pressure and memory bandwidth are often the biggest hindrances to any modern CPU, and the simplest way to address that is to make your instructions shorter. (edit: and having to refer to a constant pool that's not in the literal flow of instructions means you need to special-case cache prefetch anyway, so there's less benefit) And the cost of evaluating where the program counter needs to be isn't large.

Hence SuperH, THUMB, and MIPS16e. (btw, THUMB's always 16-bit)
Drew Sebastino wrote: And what did Thumb do on the GBA?
Basically exactly what I said above: more instructions can fit into the GBA's internal 32KB of 32-bit RAM and 256KB of 16-bit RAM, take less time to execute from the 256KB internal RAM, and take less time to fetch or execute from the cart.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: The Difficulty of ARM Assembly

Post by Oziphantom »

Basically, in the old days RAM was small and slow, so things like the Z80 and the 68K would spend large amounts of die area on fancy FSMs and variable-length instructions to get the most out of small RAM, and they had clock cycles to spare where they didn't touch the bus to work out what to do next.

RAM got cheaper and faster, and did so faster than transistors got smaller. So RISC ditched the fancy FSMs and microcode to just hit RAM hard and fast, as this pushed more data through the CPU. Since it knows it's going to get data each clock, the pipeline was simplified and more of the chip could be used for instructions.

Then cooling got better, and process shrank faster than external bus speed could be sped up. Cache became the way around the slow 'FSB', which puts us back to RAM being precious, where packing more in is a lot better, as the CPU can hit cache at 100 MHz but RAM at 33 MHz. Thus CISC started to come back, and now even ARM is CISC, ditching RISC purity in favor of power.

The really dumb aspect is making a 32-bit CPU with a 32-bit instruction size; if you have a 32-bit bus, a 16-bit CPU would make a lot more sense. This is what Thumb is: it drops you to a 16-bit CPU but still has a 32-bit bus. Thus it can get instruction + data every clock, which really boosts your speed. You just can't go over 64K anymore. Personally I think going to a 24-bit CPU would be the sweet spot. Honestly, when I'm coding I very rarely need more than 65,536 values; normally I'm dealing with <1000, but I can see that for things like spreadsheets 65,000 is not enough. However, 16,777,216 is probably plenty 98% of the time. The issue then becomes that it limits you to a measly 16MB. Doubling it up to get 48-bit pointers, however, gets you to 281,474,976,710,656 bytes (256 TiB), which I think is overkill.
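
The sizes quoted for each width are easy to sanity-check. A minimal Python sketch, treating each width as a count of byte addresses (the helper name `describe_width` is hypothetical):

```python
def describe_width(bits):
    """Return (number of distinct values, size in bytes as a binary unit)."""
    count = 1 << bits
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = count
    index = 0
    # Divide down by 1024 while the result stays a whole number.
    while size >= 1024 and size % 1024 == 0 and index < len(units) - 1:
        size //= 1024
        index += 1
    return count, f"{size} {units[index]}"


if __name__ == "__main__":
    for bits in (16, 24, 32, 48):
        count, size = describe_width(bits)
        print(f"{bits}-bit: {count:,} values, {size} if byte-addressed")
```

Under that byte-addressing assumption, 16 bits gives 64 KiB, 24 bits gives 16 MiB, 32 bits gives 4 GiB, and 48 bits gives 256 TiB.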
nocash
Posts: 1405
Joined: Fri Feb 24, 2012 12:09 pm
Contact:

Re: The Difficulty of ARM Assembly

Post by nocash »

I like coding ARM in ASM. The instructions, addressing modes, and register set are much more powerful than 6502 or the like. Needing the literal pool for 32-bit immediates might be a bit unfamiliar at first. But once you get familiar with it, you have 32-bit maths, memory accesses with auto-incrementing addresses, and ALU opcodes that can do obscure things like "IF equal THEN r0=r2 xor (r3*8)" in a single opcode and a single clock cycle, and there are enough registers to keep operands, pointers, and loop counters in registers instead of RAM.
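
In 32-bit ARM syntax, that "IF equal" example would be written along the lines of `eoreq r0, r2, r3, lsl #3`. The Python sketch below models what such an instruction does to the registers. It is a simplified model with a hypothetical function name (`eoreq_lsl`), and it ignores everything about the real CPU except the Z flag test, the shift, and the XOR.

```python
def eoreq_lsl(regs, z_flag, rd, rn, rm, shift):
    """Model `EOREQ rd, rn, rm, LSL #shift` on a dict of 32-bit registers.

    The instruction only takes effect when the Z flag is set ("equal");
    otherwise the registers are returned unchanged.
    """
    result = dict(regs)
    if z_flag:
        shifted = (regs[rm] << shift) & 0xFFFFFFFF  # r3 * 8 when shift == 3
        result[rd] = regs[rn] ^ shifted
    return result


if __name__ == "__main__":
    regs = {"r0": 0, "r2": 0xF0, "r3": 0x3}
    print(eoreq_lsl(regs, True, "r0", "r2", "r3", 3))   # condition passes
    print(eoreq_lsl(regs, False, "r0", "r2", "r3", 3))  # condition fails
```

The condition, the shift, and the ALU operation are all packed into one 32-bit opcode, which is the kind of density the 16-bit THUMB encodings mostly give up.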

At least ARM can do that. THUMB should be able to do most of that, too, but it comes with some confusing restrictions, and its syntax has confusing rules about whether and which opcodes update flags. I don't know if THUMB-2 has fixed some of those restrictions and syntax issues.

Using compiler code: What I have seen in commercial games on GBA and NDS consoles isn't optimized at all. You would need to be really confused to create anything equivalent in ASM.
Oziphantom wrote: This is what Thumb is, it drops you to a 16bit CPU but still has a 32bit bus. Thus it can get instruction + data every clock
Uh, that is vice versa and still not quite right.
The CPU is 32bit no matter if using THUMB or ARM (it can do 32bit maths and has 32bit address space).

THUMB 16bit opcodes can be faster than 32bit opcodes if your memory is "uncached memory with 16bit databus" (if your memory doesn't have that restriction then THUMB is just smaller, but not actually faster).
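
A back-of-the-envelope way to see this is to count how many bus accesses one opcode fetch needs. The Python sketch below is a deliberately crude model with a hypothetical function name (`bus_accesses_per_opcode`); it ignores wait states, prefetch, and caches entirely.

```python
def bus_accesses_per_opcode(opcode_bits, bus_bits):
    """Number of bus accesses needed to fetch one opcode (ceiling division)."""
    if opcode_bits <= 0 or bus_bits <= 0:
        raise ValueError("widths must be positive")
    return -(-opcode_bits // bus_bits)


if __name__ == "__main__":
    for bus_bits in (16, 32):
        arm = bus_accesses_per_opcode(32, bus_bits)
        thumb = bus_accesses_per_opcode(16, bus_bits)
        print(f"{bus_bits}-bit bus: ARM opcode {arm} access(es), "
              f"THUMB opcode {thumb} access(es)")
```

On a 16-bit bus a 32-bit opcode takes two accesses and a 16-bit opcode takes one; on a 32-bit bus both take one, so THUMB is only smaller there, not faster.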

If you think that 16bit opcode and 16bit data can be transferred through 32bit databus within a single clock cycle: No, they can't. What you mean might be memory systems with separate data cache and code cache, that might work in a single clock cycle - but that's unrelated to using 32bit ARM opcodes or 16bit THUMB opcodes.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: The Difficulty of ARM Assembly

Post by Oziphantom »

The GBA has a 16-bit bus and no cache, right?

I also thought that it made it more practical to do 16-bit operations, in that you ignore the upper half and just focus on the lower half of the registers. But it has been a long time, and a lot of ARM variants, since :D maybe it was 16 registers, not 16 bits...
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: The Difficulty of ARM Assembly

Post by tepples »

Game Boy Advance has a 32-bit bus to BIOS, IWRAM, and MMIO, and a 16-bit bus to most other memory (ROM, EWRAM, VRAM, CGRAM, and OAM). IWRAM is also fairly small (32768 bytes) yet with fewer wait states than EWRAM or ROM, so if ARM in IWRAM is too big, Thumb in IWRAM may make sense.
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: The Difficulty of ARM Assembly

Post by psycopathicteen »

If I were programming the GBA in assembly, I'd probably dedicate a register as an index into a table of constants.
Dwedit
Posts: 4924
Joined: Fri Nov 19, 2004 7:35 pm
Contact:

Re: The Difficulty of ARM Assembly

Post by Dwedit »

You don't need an indexed table of constants, you just use the program counter for that.
There's even a pseudo-instruction for that: `ldr r0,=0x12345678`, which transforms to a PC-relative load to a local literal pool.

Now an indexed table of global variables, that's far more useful.
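
To show what that PC-relative load looks like once assembled, here is a small Python sketch of the 16-bit Thumb `LDR Rt, [PC, #imm]` form: the base is the instruction address plus 4, rounded down to a word boundary, and the offset is an 8-bit count of words. The function name and the example addresses are hypothetical, and only the 16-bit encoding is modeled.

```python
def thumb_ldr_literal(instr_addr, literal_addr, rt=0):
    """Return (imm8, encoding) for a 16-bit Thumb `LDR Rt, [PC, #imm8*4]`."""
    if instr_addr % 2:
        raise ValueError("Thumb instructions are halfword aligned")
    if literal_addr % 4:
        raise ValueError("literal must be word aligned")
    if not 0 <= rt <= 7:
        raise ValueError("this encoding only reaches r0-r7")
    # PC reads as the instruction address + 4, then is aligned down to 4.
    base = (instr_addr + 4) & ~3
    offset = literal_addr - base
    if offset < 0 or offset > 1020:
        raise ValueError("literal pool out of range for this encoding")
    imm8 = offset // 4
    encoding = 0x4800 | (rt << 8) | imm8
    return imm8, encoding


if __name__ == "__main__":
    # Hypothetical layout: instruction at 0x08000000, literal 8 bytes later.
    imm8, encoding = thumb_ldr_literal(0x08000000, 0x08000008, rt=0)
    print(imm8, hex(encoding))
```

The range check is the practical constraint: with this encoding the pool entry has to sit within 1020 bytes after the load, which is why pools end up close to the routines that use them.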
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!
Jarhmander
Formerly ~J-@D!~
Posts: 569
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: The Difficulty of ARM Assembly

Post by Jarhmander »

... and that's essentially what a Global Offset Table (GOT) is.
((λ (x) (x x)) (λ (x) (x x)))
psycopathicteen
Posts: 3140
Joined: Wed May 19, 2010 6:12 pm

Re: The Difficulty of ARM Assembly

Post by psycopathicteen »

Dwedit wrote: You don't need an indexed table of constants, you just use the program counter for that.
There's even a pseudo-instruction for that: `ldr r0,=0x12345678`, which transforms to a PC-relative load to a local literal pool.

Now an indexed table of global variables, that's far more useful.
How does the assembler know where to put the table?