It is currently Sat Sep 21, 2019 4:38 am

All times are UTC - 7 hours





Post new topic Reply to topic  [ 24 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: A simple optimization
PostPosted: Wed May 08, 2019 9:10 am 
Offline
User avatar

Joined: Mon Mar 13, 2017 5:21 pm
Posts: 67
I'm sure this has probably been covered before, but wanted to share anyway.

I was analyzing my code to see what routines were using the most cycles overall. I was surprised to see that one of my most cycle hungry routines was just a simple loop I was using to clear out my shadow OAM before reloading sprite data into it:

Code:
   LDX #0
   LDA #$FE
   .Loop:
      STA spriteTable, X
      INX
      BNE .Loop

   RTS


I wanted to see if I could reduce the amount of cycles used here, and tried the following instead:

Code:
   LDX #31
   LDA #$FE
   .Loop:
      STA spriteTable, X
      STA spriteTable + 32, X
      STA spriteTable + 64, X
      STA spriteTable + 96, X
      STA spriteTable + 128, X
      STA spriteTable + 160, X
      STA spriteTable + 192, X
      STA spriteTable + 224, X
      DEX
      BPL .Loop

   RTS


I was really surprised to see that this uses roughly half as many cycles, for just a handful of extra bytes of code. I understood in theory that a loop which makes 256 comparisons would take more cycles that a loop that only makes 32, but I had never actually tried it and looked at the difference in speed.

I know this is probably old-hat to many of you, but I'm excited about it. it's enough of a difference to visibly reduce lag in some areas, and I'm very pleased with it.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 9:27 am 
Offline

Joined: Thu Apr 18, 2019 9:13 am
Posts: 161
gravelstudios wrote:
I'm sure this has probably been covered before, but wanted to share anyway.
I was really surprised to see that this uses roughly half as many cycles, for just a handful of extra bytes of code. I understood in theory that a loop which makes 256 comparisons would take more cycles that a loop that only makes 32, but I had never actually tried it and looked at the difference in speed.


Many short loops have roughly 50% overhead. If a loop has 50% overhead, unrolling it even to 2x will cut execution time by 25%. To 4x would offer another 17% (12.5 percentage points from the original time). The benefits going from 4X to e.g. 64X would be fairly slight, but going from 1x to 2x is huge.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 9:53 am 
Offline
User avatar

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 4211
Meanwhile, I just helped Snes9x get a tiny performance boost by getting rid of an unrolled loop. If it's modern, unrolling doesn't help. If it's ancient, it does.

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


Top
 Profile  
 
PostPosted: Wed May 08, 2019 10:25 am 
Offline
User avatar

Joined: Mon Mar 13, 2017 5:21 pm
Posts: 67
I would assume the compiler for just about any modern language could convert a for loop into the most optimized assembly code possible without too much fuss on the part of the programmer. There's really no reason to do anything convoluted.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 10:26 am 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21595
Location: NE Indiana, USA (NTSC)
LDX #28 and doing four DEX at a time will speed it up even more, as only the Y coordinate really needs to be cleared.

_________________
Pin Eight | Twitter | GitHub | Patreon


Top
 Profile  
 
PostPosted: Wed May 08, 2019 11:09 am 
Offline
User avatar

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
gravelstudios wrote:
I would assume the compiler for just about any modern language could convert a for loop into the most optimized assembly code possible without too much fuss on the part of the programmer. There's really no reason to do anything convoluted.

Unrolling loops is a well-known optimisation technique for 65xx CPUs (and sometimes others). It's sometimes more efficient than a rolled loop, and other times the opposite is better. Every situation is different/needs vary.

I'm kind of confused by this statement, though. Where is a compiler involved? You wrote native 6502 code, so you were using an assembler, not a compiler. There is a substantial difference.

Not to mention, you didn't even disclose *what* assembler you were using, re: assumptions that an assembler/tool would know how/when to turn a rolled loop into an unrolled loop (hint: most assemblers will not do this for you, *ever*. If that surprises you, respectfully of course, pause for a moment and think about the ramifications of automatically changing someone's 6502 instructions around on them. You're working with the CPU at the lowest level possible from a programming perspective; this is not an "intermediate" higher-level language like C).


Top
 Profile  
 
PostPosted: Wed May 08, 2019 11:12 am 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7582
Location: Canada
Unrolled loops are a good concept. If the assembly has a .repeat directive, this is often a good place to use it, too.
Code:
LDA #31
LDA #$FE
@loop:
  .repeat 8, I
    STA spriteTable + (32 * I), X
  .endrepeat
  DEX
  BPL @loop
RTS

Another general thought is to avoid overlapping work. In this case any sprite you clear that gets overwritten becomes a redundancy. Instead you might clear only what's left over after filling up the used portion of the sprites.

Though, once it's only taking up the slack in this way, it probably becomes less necessary to optimize it at all. Typically if there's fewer sprites on the screen, there's more CPU time remaining, and when there's more sprites on the screen, less time will be spent in the clearing loop anyway, so the load might be balanced in such a way that this particular task is never becomes a performance issue. (This is not always the case, but I think it is typical.)


Also, not that you need it for this situation, but you can still unroll a loop even if the number of steps is variable: C example (same idea applies in assembly)


Just as a funny analogue, the strange way that out of bounds areas typically look in 3D games tends to come from a similar reason to what I suggested for sprites. 3D games often don't clear the screen before they start drawing to it, because they expect it to be entirely covered up by the time everything else has drawn on top. It's another attempt to avoid redundancy.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 12:49 pm 
Offline
User avatar

Joined: Mon Mar 13, 2017 5:21 pm
Posts: 67
Quote:
LDX #28 and doing four DEX at a time will speed it up even more, as only the Y coordinate really needs to be cleared.
I never thought about this. that's a really good idea.
Quote:
Where is a compiler involved? You wrote native 6502 code, so you were using an assembler, not a compiler.
I wasn't referring to my code here. I was referring to Dwedit's comment about Snes9x, which I assume is written in C++ or some other compiled language. I would assume that in C++ there's no need to unroll loops. the compiler will take care of optimizing it for you.
Quote:
Instead you might clear only what's left over after filling up the used portion of the sprites.
I tried doing something like that, but with the way I was already handling transferring sprite data from ROM to RAM and also sprite cycling, it ended up being more convoluted than it was worth. Maybe in a future project.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 1:00 pm 
Offline
User avatar

Joined: Fri Sep 30, 2016 8:57 pm
Posts: 113
thanks guys, i just sped my loop up 80% and it makes quite a bit of difference on the boss battles where there are a lot of objects flying around at once.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 1:07 pm 
Offline
User avatar

Joined: Mon Mar 13, 2017 5:21 pm
Posts: 67
So here is what I think will be the final version:
Code:
ClearSpriteTable:
   LDX #60            ;note: only overwriting Y coordinate to put sprites off screen.
   LDA #$FE
   .Loop:
      STA spriteTable, X
      STA spriteTable + 64, X
      STA spriteTable + 128, X
      STA spriteTable + 192, X
      DEX
      DEX
      DEX
      DEX
      BPL .Loop

   RTS
I can't believe I never thought about the fact that only the Y coordinate has to be changed to hide sprites. Thanks for pointing that out tepples. I think this version is my preferred balance between size and speed. It's turned a routine that was a cycle-eater into something trivial.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 1:22 pm 
Offline
User avatar

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7741
Location: Chexbres, VD, Switzerland
tepples wrote:
LDX #28 and doing four DEX at a time will speed it up even more, as only the Y coordinate really needs to be cleared.

Not only that, but I'd go even further and say only the Y coordinate of unused sprites needs to be cleared. If many sprites are on the screen, mazing them takes time so you'll be very glad that the routine that clears the reast of them will be faster by clearing a smaller rest. This will not compensate for the larger amount of sprites mazed (and the time it took to maze them), but this will still help getting the overall CPU load under control.


Last edited by Bregalad on Thu May 09, 2019 12:40 am, edited 1 time in total.

Top
 Profile  
 
PostPosted: Wed May 08, 2019 1:25 pm 
Offline
User avatar

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 4208
Location: A world gone mad
gravelstudios wrote:
I wasn't referring to my code here. I was referring to Dwedit's comment about Snes9x, which I assume is written in C++ or some other compiled language. I would assume that in C++ there's no need to unroll loops. the compiler will take care of optimizing it for you.

Ah, I see. Thanks for clarifying.

No, this is a common misconception about software today -- that somehow the compiler (or PL in general) always knows what's best. I think in 2019 compilers do a pretty good job of making intelligent choices based on tons of heuristics, but there are still cases where doing something manually (ex. unrolling a loop) can result in more performant code. Godbolt (for C/C++) can sometimes come in handy for analysing stuff like this on a per-compiler and per-compiler-version basis, if the routine/thing is simple. I've run into this situation with Perl, Python, etc. as well ("edge cases" where doing something in a loop one way is worse than another way, but both still worse than doing it unrolled). Every situation varies.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 1:55 pm 
Offline
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 11412
Location: Rio de Janeiro - Brazil
Bregalad wrote:
Not only that, but I'd go even further and say only the Y coordinate of unused sprites needs to be cleared.

I'm surprised it took this long for someone to mention this... But yeah, don't hide all the sprites beforehand, hide only the ones you haven't used once you're done generating OAM data.

Like Bregalad said, this will also help balance the time you spend on OAM manipulation on busy frames and not so busy ones... If you hide all sprites on a busy frame (lots of sprites), all the time you'll spend hiding them will be for nothing, since you'll end up bringing all sprites back, which will take even more time, increasing the chances of slowdowns.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 2:16 pm 
Offline
User avatar

Joined: Mon Mar 13, 2017 5:21 pm
Posts: 67
The way my sprite copying/cycling routine works right now relies on all of the sprites being cleared off the screen first. I tried to patch it up to only move the sprites off-screen that aren't being used, but it would have involved reworking a lot of other code to make it work right, and I decided it wasn't worth the hassle. Now that I know more, maybe I will try to engineer things that way from the start on my next project.


Top
 Profile  
 
PostPosted: Wed May 08, 2019 3:38 pm 
Offline
User avatar

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 11412
Location: Rio de Janeiro - Brazil
Really? You are able to claim OAM slots for actual sprites, obviously... how is that any different from claiming them for "dummy" sprites which actually just put the Y coordinate off-screen? Whatever method you're using for selecting OAM slots for use, do the exact same thing for putting the remaining Y coordinates off-screen, until all 64 slots have been claimed.

You'd have to give up some of the optimizations suggested in this thread if you can't clear the slots sequentially, but you'd avoid that worst case where all slots are written to twice.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 24 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: Google Adsense [Bot] and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group