
All times are UTC - 7 hours





Posted: Fri Jan 26, 2018 11:44 am

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19665
Location: NE Indiana, USA (NTSC)
In this post about the Spectre attacks on branch prediction, Dwedit mentioned ROP, or return-oriented programming. Before ROP became famous for use in stack-smashing attacks, this technique of storing a list of subroutines to be called was known as threaded code.

The name Popslide was derived from NOP slide and clockslide, both of which involve a jump into an unrolled loop, like a computed Duff's device.

The technique discussed here combines storing a subroutine sequence in the return stack (the "ROP") with jumping into an unrolled copy loop (like the "slides"). So could we call it ROPslide?


Posted: Fri Jan 26, 2018 3:30 pm

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Sounds good :D

I've come up with another neat little trick. Just like with unrolled loops, it can sometimes be beneficial to jump partway into a subroutine to skip unnecessary code. If every subroutine sets the PPU "increment mode" at its beginning, you can keep track of the mode yourself as you build your video stack outside of vblank, and simply add 5 to the subroutine's address to skip over the initial LDA #imm / STA $2000 (2 + 3 bytes) when you know the increment mode is already correct.

Unlike having "set increment mode to 1" and "set increment mode to 32" subroutines, this trick doesn't carry any extra overhead of jumping around additional times in the video stack.
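
One way the entry-point trick could look in ca65-style syntax is sketched below. Everything named here is an assumption for illustration (the routine Video_CopyColumn, the CTRL_INC32 value, and the video_stack / video_index / video_inc32 variables); only the 5-byte offset comes from the post.

```asm
PPUCTRL = $2000
PPUADDR = $2006
PPUDATA = $2007

CTRL_INC32  = $84            ; assumed value: NMI enabled + increment by 32
video_index = $10            ; hypothetical: next free offset in the buffer
video_inc32 = $11            ; hypothetical: nonzero if +32 mode is already set
video_stack = $0100          ; hypothetical: buffer inside the stack page

; Video routine, reached during vblank through RTS.
Video_CopyColumn:
  lda #CTRL_INC32            ; 2 bytes
  sta PPUCTRL                ; 3 bytes
Video_CopyColumn_KeepInc:    ; same address as Video_CopyColumn + 5
  pla                        ; VRAM address, high byte first
  sta PPUADDR
  pla
  sta PPUADDR
  pla                        ; two tiles of a column
  sta PPUDATA
  pla
  sta PPUDATA
  rts                        ; on to the next queued routine

; Fail the build if someone changes the prologue size.
.assert Video_CopyColumn_KeepInc - Video_CopyColumn = 5, error, "entry offset changed"

; Outside vblank: queue the routine's address minus 1 (RTS adds 1),
; low byte first. The caller then appends the VRAM address and tiles.
QueueCopyColumn:
  ldx video_index
  lda video_inc32
  bne @keep
  lda #<(Video_CopyColumn - 1)         ; full entry: sets +32 mode
  sta video_stack, x
  lda #>(Video_CopyColumn - 1)
  sta video_stack + 1, x
  lda #1
  sta video_inc32                      ; later routines can skip the setup
  jmp @done
@keep:
  lda #<(Video_CopyColumn_KeepInc - 1) ; skip the 5-byte prologue
  sta video_stack, x
  lda #>(Video_CopyColumn_KeepInc - 1)
  sta video_stack + 1, x
@done:
  inx
  inx
  stx video_index
  rts
```

Skipping the prologue saves the 6 cycles of LDA #imm and STA $2000 each time the tracked mode is already correct.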


Posted: Sun Jan 28, 2018 10:29 pm

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10299
Location: Rio de Janeiro - Brazil
Drakim wrote:
Actually, re-reading a couple of times, what do you mean "But the extra JMP and the bogus VRAM address are used only once", how does that work? What exactly do you do to end your video subroutines if you are only using the JMP one time during the entire sequence? Don't you have to do it one time per subroutine?

I went looking for the posts I made about this, and apparently this is the last piece of code I posted on the subject. The basic idea was to have 2 unrolled loops, both copying up to 32 bytes: one sets the VRAM address before RTS'ing to the next update, while the other is meant for the final update, so it doesn't set the VRAM address at the end. With this setup there's no overhead at all; even the JMP at the end can be eliminated if the 2nd unrolled loop flows directly into the rest of the NMI handler.

But then I decided that having two unrolled loops was too much trouble, and that it'd be better to just let the last update set a bogus VRAM address and RTS to the remainder of the NMI handler. Here's what the code could look like:

As soon as possible in the NMI handler:
Code:
  ;(swap the stack pointer)

  ;execute the first update
  pla
  sta $2006
  pla
  sta $2006
  rts

RestoreStackPointer:

  ;(restore the stack pointer)


The unrolled data transfer loop, which has 32 possible entry points:
Code:
Copy32Bytes:

  pla
  sta $2007

Copy31Bytes:

  pla
  sta $2007

  ;(...)

Copy1Byte:

  pla
  sta $2007

  ;execute the next update
  pla
  sta $2006
  pla
  sta $2006
  rts

In this case, the only overhead is setting up a bogus VRAM address that won't be used for anything, something that happens only once per vblank, after the final data transfer (can be during the pre-render scanline, so it doesn't waste any actual vblank time).
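
To make the data flow concrete, here is one possible shape for the elided stack swap and for the buffer the routines consume, in ca65-style syntax. This is a sketch under assumptions, not the code from the original post: saved_sp, video_stack, the buffer location at $0100, and the register saving are all hypothetical, and Copy2Bytes stands for one of the entry points elided by the (...) in the listing above.

```asm
PPUADDR = $2006
PPUDATA = $2007

saved_sp    = $10            ; hypothetical zero-page variable
video_stack = $0100          ; hypothetical: buffer at the bottom of page 1

NMI:
  pha                        ; save A and X on the real stack
  txa
  pha
  tsx
  stx saved_sp               ; remember the real stack pointer
  ldx #<(video_stack - 1)    ; PLA increments S before reading,
  txs                        ; so start one byte below the buffer

  pla                        ; first update: VRAM address, high byte first
  sta PPUADDR
  pla
  sta PPUADDR
  rts                        ; "return" into the first queued routine

RestoreStackPointer:         ; reached by the final RTS
  ldx saved_sp
  txs
  pla                        ; restore X and A
  tax
  pla
  rti

Copy2Bytes:
  pla
  sta PPUDATA
Copy1Byte:
  pla
  sta PPUDATA
  pla                        ; set up the next update's VRAM address
  sta PPUADDR
  pla
  sta PPUADDR
  rts

; Buffer contents for two updates, shown as static data for clarity.
; A real game would write these bytes to video_stack at run time.
ExampleBuffer:
  .byte $20, $00             ; VRAM address $2000
  .word Copy2Bytes - 1       ; RTS adds 1 to the address it pulls
  .byte $01, $02             ; two tiles
  .byte $23, $C0             ; VRAM address $23C0
  .word Copy1Byte - 1
  .byte $55                  ; one attribute byte
  .byte $00, $00             ; bogus VRAM address, never used
  .word RestoreStackPointer - 1
```

Because the last pair of writes leaves a junk address in the PPU, the rest of the handler would still have to set the scroll afterwards, as any handler that writes $2006 does.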

Like I said before, this is mostly for those looking to use generic VRAM update code without giving up on speed or lots of ROM space. I personally prefer to use specialized update routines so I can take advantage of certain characteristics of the NES architecture (e.g. palette mirroring, changing only the low byte of the VRAM address, etc.) to make updates even faster.


Posted: Mon Jan 29, 2018 1:29 am

Joined: Mon Apr 04, 2016 3:19 am
Posts: 67
Oh, I understand finally!

Your technique is cool in that you shave off the RTS in the subroutine that sets the address before the mass copy:

Quote:
Video_SetAddress:
pla
sta $2006
pla
sta $2006
rts


By baking the address setting into the previous mass copy. Very clever!

While this is probably the fastest way of pushing raw bytes from the faux stack to VRAM in a "video command" manner (probably only beaten by doing it all statically every vblank and not having any "command" system at all), I think you were right to switch to specialized routines: there are just too many situations on the NES that don't involve mindlessly copying enormous amounts of bytes. Plus, specialized routines allow for micro-optimizations here and there, like using "LDA #" instead of "PLA" when the VRAM address is static, skipping over the "set increment" code when the increment mode is already correct, or having specialized unrolled loops that don't PLA when the same value appears multiple times in a row. The sum of these optimizations might very well end up being faster.
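
As a rough illustration of those micro-optimizations, here is a hypothetical specialized routine in ca65-style syntax. The routine name, the target address $2040, and the run length of 4 are made up for the example, and it assumes the increment mode is already +1.

```asm
PPUADDR = $2006
PPUDATA = $2007

; Fills four consecutive tiles of a fixed row with one value.
Video_FillFixedRow:
  lda #$20                   ; static address: LDA #imm takes 2 cycles,
  sta PPUADDR                ; versus 4 for PLA
  lda #$40
  sta PPUADDR
  pla                        ; pull the repeated tile once (4 cycles)
  sta PPUDATA                ; each write costs 4 cycles, with no
  sta PPUDATA                ; PLA in between
  sta PPUDATA
  sta PPUDATA
  rts                        ; continue with the next queued routine
```

This comes to 12 + 4 + 16 + 6 = 38 cycles, whereas the generic path (pulled address, four pulled bytes, RTS) takes 16 + 32 + 6 = 54.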

Still, I'm going to have a think about the possibility of using this in my own specialized system in the case of several mass-copying routines being queued up in a row. :beer:


Posted: Mon Jan 29, 2018 9:49 am

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 10299
Location: Rio de Janeiro - Brazil
Yeah, there are several opportunities for optimizations we can do with specialized update routines. You already mentioned how you can use a hardcoded VRAM address for palette updates, but there are many other optimizations you can do in this particular case:

- start writing at $3F01 rather than $3F00, and set the background color later, when you reach $3F10 (a mirror of $3F00);

- don't update any of the "non-displayable" entries ($3F04, $3F08, $3F0C) or their mirrors ($3F14, $3F18, $3F1C), just bit $2007 instead, to advance the VRAM address and skip these;

- read the palette data straight from the place where it's normally stored, not from the stack, saving the time it'd take to copy it there.

Let's see how much time we could save:

- set VRAM address to $3F01 (12 cycles);
- update 3 colors (24 cycles);
- skip $3F04 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F08 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F0C (4 cycles);
- update 3 colors (24 cycles);
- update the background color via $3F10 (8 cycles);
- update 3 colors (24 cycles);
- skip $3F14 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F18 (4 cycles);
- update 3 colors (24 cycles);
- skip $3F1C (4 cycles);
- update 3 colors (24 cycles);
- call the next update (6 cycles);

That's a total of 242 cycles, compared to the 278 cycles that a "raw transfer" of 32 bytes would take, and don't forget the time you save outside of vblank, by not copying the palette data to the stack.
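
The breakdown above maps directly onto code. The following ca65-style sketch is one way to write it; the routine name, the copy3 macro, and a 32-byte palette array at $0300 (laid out like $3F00-$3F1F, and kept outside zero page so that LDA takes 4 cycles) are assumptions for illustration.

```asm
PPUADDR = $2006
PPUDATA = $2007
palette = $0300              ; hypothetical 32-byte palette copy in RAM

; Copies three colors starting at palette+first: 3 x (4 + 4) = 24 cycles.
.macro copy3 first
  lda palette + first
  sta PPUDATA
  lda palette + first + 1
  sta PPUDATA
  lda palette + first + 2
  sta PPUDATA
.endmacro

Video_UpdatePalette:
  lda #$3F                   ; start at $3F01: 12 cycles
  sta PPUADDR
  lda #$01
  sta PPUADDR
  copy3 1
  bit PPUDATA                ; skip $3F04: 4 cycles
  copy3 5
  bit PPUDATA                ; skip $3F08
  copy3 9
  bit PPUDATA                ; skip $3F0C
  copy3 13
  lda palette                ; background color via $3F10: 8 cycles
  sta PPUDATA
  copy3 17
  bit PPUDATA                ; skip $3F14
  copy3 21
  bit PPUDATA                ; skip $3F18
  copy3 25
  bit PPUDATA                ; skip $3F1C
  copy3 29
  rts                        ; call the next update: 6 cycles
```

Adding up the comments gives 12 + 8 x 24 + 6 x 4 + 8 + 6 = 242 cycles, matching the total above.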

