PostPosted: Wed Jun 19, 2019 12:31 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7528
Location: Canada
DRW wrote:
I'm just a bit confused why even the assigned value makes a difference in calculating the array itself.

Well, honestly, it's not like the operation of cc65's optimizer is at all obvious. It generates the same initial assembly code whether or not -O is used, and then -O runs a series of pattern-matching passes that rewrite the already generated assembly. It's a bit of a backward approach to the problem, and really weird. It would be much better to start optimizing at a higher level, but I'm not going to fantasize too much about that. This is what the compiler we have does.

If you want to understand what it does, as I said just above, you can use --debug-opt-output to observe the process. It will show you exactly what the initial generated assembly code is, and every step it takes to optimize it.
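
To make that concrete, here is a minimal, self-contained test file you could feed it, something like "cc65 -O --debug-opt-output test.c" (check cc65 --help for the exact invocation; the variable names just mirror the example below):
Code:
/* test.c - minimal file to watch the optimizer work on.
** Assumed invocation: cc65 -O --debug-opt-output test.c
*/
unsigned char oam[256];
unsigned char oam_pos;

void test(unsigned char x)
{
    (oam+3)[oam_pos] = x + 5;
}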

If you want an example:
Code:
// starting C line

(oam+3)[oam_pos] = x+5;

; 1. initial generated assembly:

   lda     #<(_oam+3)
   ldx     #>(_oam+3)
   clc
   adc     _oam_pos
   bcc     L02DF
   inx
L02DF:   jsr     pushax
   ldy     #$04
   ldx     #$00
   lda     (sp),y
   jsr     incax5 ; this is the +5 operation, which critically breaks up the optimization pattern (see below)
   ldx     #$00
   ldy     #$00
   jsr     staspidx

;  2. OptAdd5

   lda     #<(_oam+3)
   ldx     #>(_oam+3)
   clc
   adc     _oam_pos
   bcc     L02DF
   inx
L02DF:   jsr     pushax
   ldy     #$04
   ldx     #$00
   lda     (sp),y
   clc
   adc     #$05 ; jsr incax5 replaced with an inline add
   ldx     #$00
   ldy     #$00
   jsr     staspidx

; 3. OptUnusedLoads

   lda     #<(_oam+3)
   ldx     #>(_oam+3)
   clc
   adc     _oam_pos
   bcc     L02DF
   inx
L02DF:   jsr     pushax
   ldy     #$04 ; unused ldx #00 eliminated
   lda     (sp),y
   clc
   adc     #$05
   ldy     #$00 ; unused ldx #00 eliminated
   jsr     staspidx

; 4. OptStackOps

   lda     #<(_oam+3)
   ldx     #>(_oam+3)
   clc
   adc     _oam_pos
   bcc     L02DF
   inx
L02DF:   sta     ptr1 ; jsr pushax / staspidx (temporary pointer on stack) replaced by keeping it in ptr1
   stx     ptr1+1
   ldy     #$02
   lda     (sp),y
   clc
   adc     #$05
   ldy     #$00
   sta     (ptr1),y


At that point, it's out of patterns to match and stops optimizing.

For comparison, without the intervening +5:
Code:
// starting C line

(oam+3)[oam_pos] = x;

; 1. initial generated assembly

   lda     #<(_oam+3)
   ldx     #>(_oam+3)
   clc
   adc     _oam_pos
   bcc     L02DF
   inx
L02DF:   jsr     pushax
   ldy     #$04
   ldx     #$00
   lda     (sp),y
   ldy     #$00
   jsr     staspidx

;  2. OptPtrStore2

   lda     #<(_oam+3)
   ldx     #>(_oam+3)
   ldy     #$02
   ldx     #$00
   lda     (sp),y
   ldy     _oam_pos
   sta     _oam+3,y ; the temporary pointer on stack is eliminated, and the index add is replaced with Y index

; 3. OptUnusedLoads

   ldy     #$02 ; unused ldx #00 eliminated
   lda     (sp),y
   ldy     _oam_pos
   sta     _oam+3,y


So you can see the pattern that failed to apply is called OptPtrStore2.

Each optimization pattern is a function in the cc65 source, and each tends to have a good explanation of its operation in the header comment. Looking up OptPtrStore2:

Code:
unsigned OptPtrStore2 (CodeSeg* S)
/* Search for the sequence:
**
**      clc
**      adc     xxx
**      bcc     L
**      inx
** L:   jsr     pushax
**      ldy     yyy
**      ldx     #$00
**      lda     (sp),y
**      ldy     #$00
**      jsr     staspidx
**
** and replace it by:
**
**      sta     ptr1
**      stx     ptr1+1
**      ldy     yyy-2
**      ldx     #$00
**      lda     (sp),y
**      ldy     xxx
**      sta     (ptr1),y
**
** or by
**
**      ldy     yyy-2
**      ldx     #$00
**      lda     (sp),y
**      ldy     xxx
**      sta     (zp),y
**
** or by
**
**      ldy     yyy-2
**      ldx     #$00
**      lda     (sp),y
**      ldy     xxx
**      sta     label,y
**
** or by
**
**      ldy     yyy-2
**      ldx     #$00
**      lda     (sp),y
**      ldy     xxx
**      sta     $xxxx,y
**
** depending on the code preceeding the sequence above.
*/


As you can see, this pattern was written to match specifically a simple fetch from the stack followed by a store. There are two other variations of this type of pattern; you can read about them all in their header comments.

The problem is simply that each of them can only match a very simple pattern. It's looking for a specific beginning, middle, and end. You can't just start adding arbitrary extra code in the middle (i.e. the expression to be resolved); that new stuff in the middle has to fit a pattern that's known to be safe to optimize. It's probably possible to write such a thing, but it's complicated and difficult... so instead only this bunch of simpler cases was written.

When you put an expression in a temporary variable first, that code gets generated before the array access portion, so it doesn't interfere with the simple "generate address, fetch, store" pattern that these are capable of matching.
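
In other words, something like this keeps the store itself simple enough for the pattern to match:
Code:
void test(unsigned char x)
{
    unsigned char temp;

    temp = x + 5;             /* the expression is evaluated up here... */
    (oam+3)[oam_pos] = temp;  /* ...so this line is a plain fetch-and-store,
                                 the shape OptPtrStore2 knows how to rewrite */
}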


Anyhow, that's just one example of how to analyze cc65's optimizer. It's really all spelled out by --debug-opt-output, so if you want to know about any specific cases you're dealing with, that's the tool to use.

Still probably easier to just rewrite offending code in assembly as needed* than it is to try and understand whatever this byzantine optimizer is doing, though.

* ...and don't worry about it when it's not needed.


PostPosted: Wed Jun 19, 2019 1:17 pm 

Joined: Sat Sep 07, 2013 2:59 pm
Posts: 1884
Thanks for the explanation. I'll have a more detailed look at it to understand how optimization in compilers works.

rainwarrior wrote:
Still probably easier to just rewrite offending code in assembly as needed* than it is to try and understand whatever this byzantine optimizer is doing, though.

* ...and don't worry about it when it's not needed.

Yeah, that's what I'm doing anyway.
Although, in this case, I don't even need Assembly. I simply need to put a temporary variable in between and it's good enough.

Changing it into Assembly would mean that I also have to write the calculation itself in Assembly, so it would be different code for different situations. And sometimes, it is stuff like this:
array[index] = Tile16XToAbsoluteX(heroTile16X) + (width16Block - 1) * 16 + 1;
Which I don't feel at all like turning into inline Assembly.

But with my C macro, I can still write:
Code:
SetArrayValueFromCalculatedValue(
    array, index,
    Tile16XToAbsoluteX(heroTile16X) + (width16Block - 1) * 16 + 1,
    temp);

The only downside is a single additional STA temp in the generated output code.
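
A minimal definition with that shape looks something like this (a sketch assuming the obvious two-step expansion; the real macro may differ in details):
Code:
/* sketch: assign through a temporary so the final store stays simple enough
** for the optimizer's fetch/store patterns */
#define SetArrayValueFromCalculatedValue(array, index, calculation, temp) \
    do {                               \
        (temp) = (calculation);        \
        (array)[(index)] = (temp);     \
    } while (0)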

_________________
Available now: My game "City Trouble".
Website: https://megacatstudios.com/products/city-trouble
Trailer: https://youtu.be/IYXpP59qSxA
Gameplay: https://youtu.be/Eee0yurkIW4
German Retro Gamer article: http://i67.tinypic.com/345o108.jpg


PostPosted: Wed Jun 19, 2019 1:25 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7528
Location: Canada
So, like, to address this specific case, you might be able to augment each of the OptPtrStore functions to look for one or two extra instructions between the fetch and the store (INC for +1, DEC for -1, CLC/ADC for +, SEC/SBC for -) and, if found, leave them in place. Of course you might look for EOR, ORA, etc. as well; there are a lot of possibilities here.

That might take care of most expressions that are just two terms. Three-term expressions will still be unmatchable, but maybe that optimization would be worth writing, since adding a constant is such a common case.
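
For example (y here is just an arbitrary second variable for illustration):
Code:
(oam+3)[oam_pos] = x + 5;     /* two terms: one CLC/ADC between fetch and store,
                                 which an extended pattern could allow for */
(oam+3)[oam_pos] = x + y + 5; /* three terms: two additions in the middle, still
                                 too much variation for a fixed pattern to match */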

Similarly for the [a+1] case you could try to pattern match the "generate address + add a constant to an index" thing, and other similar cases...

Of course, after writing these you must make sure they still pass all the unit tests, and try them out on whatever you can to make sure they don't have some unexpected interaction with one of the other 100 optimization patterns that breaks stuff. Testing itself is a lot of work, and if you do happen to find a conflict, every attempt to resolve it multiplies your testing too. :P


Maybe you can see how this creates a combinatorial explosion of work for a fairly marginal gain. In a lot of ways, even if you could write these pattern matchers correctly and they worked, it's just not worth adding that much code to the optimizer to solve the problem backwards like this. The better solution is high-level optimization before the assembly is generated, but that's a whole project of its own.


PostPosted: Wed Jun 19, 2019 1:29 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7528
Location: Canada
DRW wrote:
Changing it into Assembly would mean that I also have to write the calculation itself in Assembly, so it would be different code for different situations. And sometimes, it is stuff like this:
array[index] = Tile16XToAbsoluteX(heroTile16X) + (width16Block - 1) * 16 + 1;
Which I don't feel at all like turning into inline Assembly.

The quick way is: write it in C, compile, copy paste the generated assembly, then just rewrite the bits you're interested in improving.
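
A rough sketch of that workflow (flags and file names approximate):
Code:
/* 1. Keep the statement in plain C: */
array[index] = Tile16XToAbsoluteX(heroTile16X) + (width16Block - 1) * 16 + 1;
/* 2. Compile with the source interleaved as comments, e.g.
**      cc65 -O -T game.c
**    (-T / --add-source puts each C line above its generated code in game.s).
** 3. Find the lines generated for this statement, paste them into an .s file
**    or an __asm__ block, and hand-rewrite only the parts worth improving. */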


PostPosted: Thu Jun 20, 2019 12:18 pm 

Joined: Sat Sep 07, 2013 2:59 pm
Posts: 1884
rainwarrior wrote:
Maybe you can see how this creates a combinatorial explosion of work for a fairly marginal gain.

Sure. Besides, I never intended that anybody should actually write an optimization in the compiler. I merely asked about this to find out whether there are tricks to make the output smaller by doing some stuff in the C code that I haven't thought of yet.

rainwarrior wrote:
The quick way is: write it in C, compile, copy paste the generated assembly, then just rewrite the bits you're interested in improving.

Yeah, but I also want to keep my code readable. If I intended to clutter everything with inline Assembly just to save a byte or two, I would write the whole thing in Assembly in the first place.

I mean, if this code:
Code:
Pointer[Index] = Value;
gets turned into this:
Code:
   lda     _Pointer
   ldx     _Pointer+1
   clc
   adc     _Index
   bcc     L0005
   inx
L0005:   sta     ptr1
   stx     ptr1+1
   lda     _Value
   ldy     #$00
   sta     (ptr1),y
then it really pays off if I have a general purpose inline Assembly macro that simply does this:
Code:
LDA Value
LDY Index
STA (Pointer), Y

Especially if I can reuse it many times and the code is still readable if I put it into a macro called
Code:
AsmSetVarFromPtrAtIdxVar(variable, pointer, indexVariable)
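
In cc65 that could be written with the inline assembler, something like the sketch below. It mirrors the three instructions above and assumes the pointer is a zero-page variable, since (ptr),y addressing requires that:
Code:
/* rough sketch; %v expands to the assembler name of a global C symbol, and
** the pointer must live in the zero page for the indirect store to assemble */
#define AsmSetVarFromPtrAtIdxVar(variable, pointer, indexVariable) \
    do {                                     \
        __asm__("lda %v", variable);         \
        __asm__("ldy %v", indexVariable);    \
        __asm__("sta (%v),y", pointer);      \
    } while (0)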

But I don't think it's a good idea to analyze every little code line to see if I can turn it into slightly better Assembly than the C compiler does.
After all, there's a reason why I write in C. Because this:
Code:
Value = heroTile16X * 16 + (width16Block - 1) * 16 + 1;
is much more readable than its Assembly version.

Also, I wouldn't even know how to start optimizing the output.

For example, if I had to write the above code in Assembly, I would probably do it like this:
Code:
LDA heroTile16X
ASL
ASL
ASL
ASL
STA tmp1
LDX width16Block
DEX
TXA
ASL
ASL
ASL
ASL
CLC
ADC tmp1
TAX
INX
STX Value

On the other hand, this is what the compiler creates:
Code:
   ldx     #$00
   lda     _heroTile16X
   jsr     shlax4
   sta     ptr1
   stx     ptr1+1
   ldx     #$00
   lda     _width16Block
   jsr     decax1
   jsr     shlax4
   clc
   adc     ptr1
   pha
   txa
   adc     ptr1+1
   pla
   clc
   adc     #$01
   sta     _Value

This doesn't look even slightly like my code, so I have no idea how I should take this one as the base and then convert it into something better. And if I had the patience to write my own code above for everything, I wouldn't have used C in the first place.

So, yeah, this is not really the place to manually fiddle with one-time code expressions that never get repeated.
I have a bunch of inline Assembly macros for common performance issues, like pointer and array access. And for most of the rest, I simply swallow the CPU and ROM penalty that comes with C. I mean, at the moment, I can have about 12 characters on screen without any lag, so it looks like the C code isn't that bad.



PostPosted: Thu Jun 20, 2019 12:42 pm 

Joined: Fri May 08, 2015 7:17 pm
Posts: 2536
Location: DIGDUG
Quote:
This doesn't look even slightly like my code


Implicit promotion to int on many C library functions.

Quote:
at the moment, I can have about 12 characters on screen without any lag, so it looks like the C code isn't that bad.


Depends on what kind of game you want. When I made a card game, C greatly sped up dev time. If you want a shoot 'em up with dozens of objects on screen, I would do it in ASM.

_________________
nesdoug.com -- blog/tutorial on programming for the NES


PostPosted: Thu Jun 20, 2019 1:55 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7528
Location: Canada
dougeff wrote:
Quote:
This doesn't look even slightly like my code

Implicit promotion to int on many C library functions.

Yeah, compounded by the fact that it does not attempt to identify that it only needs an 8-bit result at the high level. That reduction is sometimes made after the fact by assembly pattern recognition, but like we saw above it can't really identify 8-bit result patterns from expressions that aren't rather simple.

In many cases breaking an expression into intermediate results might actually help:
Code:
// C code, value/ht/wb are char
value = ht * 16;
value += (wb - 1) * 16;
value += 1;

;
; value = ht * 16;
;
   lda     _ht
   asl     a
   asl     a
   asl     a
   asl     a
   sta     _value
;
; value += (wb - 1) * 16;
;
   lda     _wb
   sec
   sbc     #$01
   asl     a
   asl     a
   asl     a
   asl     a
   clc
   adc     _value
   sta     _value
;
; value += 1;
;
   inc     _value
;
; }
;
   rts


There's not much rhyme or reason to this, though. What works or doesn't isn't very easy to predict.


PostPosted: Thu Jun 20, 2019 2:55 pm 

Joined: Sat Sep 07, 2013 2:59 pm
Posts: 1884
dougeff wrote:
When I made a card game, C greatly sped up dev time. If you want a shoot em up with dozens of objects on screen, I would do it in ASM.

This comparison comes up so often: "If you want to use C, you might be able to program a 'Pac-Man' clone. For anything else, you should use Assembly."

Well, I programmed a lagless platformer. And I'm in the middle of programming a "Zelda"-like action adventure.
Yes, I had to use some Assembly functions, usually when it comes to array-based stuff.
However, the general gameplay algorithm is still mostly in C.

And I'm pretty sure a shoot 'em up would be doable in C, especially since enemies in these games mostly move in fixed patterns and don't even care about the environment. (Unlike in many other games, where each character always has to check for walls.)

So, yeah, whoever is fluent in Assembly should write the game directly in that language. However, it is not true that C only enables you to write simple 1985-style games. I'm pretty positive that a game like "Ninja Gaiden" would largely be possible in C.



PostPosted: Thu Jun 20, 2019 3:30 pm 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 151
rainwarrior wrote:
Yeah, compounded by the fact that it does not attempt to identify that it only needs an 8-bit result at the high level. That reduction is sometimes made after the fact by assembly pattern recognition, but like we saw above it can't really identify 8-bit result patterns from expressions that aren't rather simple.


The authors of the Standard noted in the Rationale that on commonplace implementations, if the result of an additive, multiplicative, or bitwise operator is coerced to a type which is not larger, it won't matter whether the operation is performed as signed or unsigned(*). For the same reason, it wouldn't matter if the operation were performed on a type larger than the result type. A good compiler for 8-bit machines really should use a production rule for assignment operators that evaluates the right side as an 8-bit quantity if the left side is one, and a production rule for 8-bit expressions that processes additive, multiplicative, and bitwise operators by evaluating their operands as 8-bit quantities.

(*) The authors of the Standard expected that commonplace implementations would do this even for values between INT_MAX+1u and UINT_MAX, but did not want to require such behavior on platforms where it would be impractical. They didn't think it necessary to say that commonplace implementations should handle such values in the expected fashion, but gcc exploits the Standard's laxity by generating code that doesn't.
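
A trivial illustration of the kind of case being described (arbitrary names):
Code:
unsigned char a, b, c;

void f(void)
{
    /* C's promotion rules say a and b*16 are computed as int, but because the
    ** result is assigned to an 8-bit object, only the low byte matters, and
    ** the low byte of an 8-bit add/multiply equals the low byte of the 16-bit
    ** one, so a compiler applying the production rules described above could
    ** legitimately do the whole thing in 8 bits. */
    c = a + b * 16;
}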


PostPosted: Thu Jun 20, 2019 3:38 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7528
Location: Canada
supercat wrote:
The authors of the Standard noted... A good compiler for 8-bit machines really should...

What I said has nothing to do with the standard. I was talking about cc65. cc65 just doesn't have the capability to recognize it, and the many cases that it manages to generate succinctly are kind of miraculous given what it's working with as an optimizer. What it "should" do is something completely different than what it does, but I doubt this will change any time soon, unfortunately. (Fortunately, it's still pretty useful in this compromised state anyway.)



PostPosted: Thu Jun 20, 2019 3:45 pm 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 151
DRW wrote:
Yes, I had to use some Assembly functions, usually when it comes to array-based stuff.
However, the general gameplay algorithm is still mostly in C.


If 25% of the program's time is spent waiting for vblank, 70% is spent in 10% of the code, and the remaining 5% is spent running the other 90% of the code, then the performance of that 90% of the code simply doesn't matter unless something causes it to run more than about five times as slowly (even five times slower only raises total CPU use from 75% to 95% of the frame). Some games have their execution time spread out among large parts of the program, but others have it concentrated in a few tiny places. Micro-optimizing the parts of the code that dominate execution time can be much easier and more effective than worrying about the performance of anything else.


PostPosted: Fri Jun 21, 2019 7:36 am 

Joined: Sat Jan 09, 2016 9:21 pm
Posts: 619
Location: Central Illinois, USA
DRW wrote:
So, yeah, whoever is fluent in Assembly should write the game directly in that language. However, it is not true that C only enables you to write simple 1985-style games. I'm pretty positive that a game like "Ninja Gaiden" would largely be possible in C.


As somebody who likes using C, I partially agree. If you're willing to optimize a few routines in assembly, write your C carefully and evaluate bottlenecks, and use a library of assembly routines for interacting with the hardware, then you could do Ninja Gaiden in C.

That said, you can definitely run into trouble using C if you aren't careful. Both my Robo-Ninja Climb and Super Homebrew War were written mostly in C (for the sake of development time), but my first pass was too slow in both of them, requiring some optimization (both rewriting some C parts to better take advantage of how cc65 does things, and rewriting some critical bits in assembly).

_________________
My games: http://www.bitethechili.com


PostPosted: Fri Jun 21, 2019 8:18 am 

Joined: Fri Nov 24, 2017 2:40 pm
Posts: 168
I've skimmed some of the cc65 optimizer code, and there's really not that much to it. It simply doesn't recognize many cases where it can optimize, and those are the only places where it really seems to relax the spec.

I'm not going to tell you "just write it all in assembly", because I don't. I just do what you do: Spend some minimal effort to trick the compiler into doing what you want. Once you know what you really want, rewrite parts using inline or regular assembly. Rinse and repeat.

