6502 ASM trick

Discuss technical or other issues relating to programming the Nintendo Entertainment System, Famicom, or compatible systems.

Moderator: Moderators

Celius
Posts: 2157
Joined: Sun Jun 05, 2005 2:04 pm
Location: Minneapolis, Minnesota, United States
Contact:

Post by Celius » Tue Nov 13, 2007 3:55 pm

I'll probably do some improving in the future, but for now, I think it's pretty okay. I'm very comfortable using it. I think I will first do something where instead of having the unused ones going to "Return", I think I'll have the first one that's not used go to a routine that just does RTI, and not waste NMI time. However, as it is, I can simply tell it to do what I want whenever, and I won't have to look at an ugly NMI routine when debugging.

User avatar
tokumaru
Posts: 11692
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Tue Nov 13, 2007 10:01 pm

I understand why you needed such a dynamic way to handle your NMIs, because I had the same problem. My game has a LOT of things to do during VBlank, but the time is really short and there is no way to do everything every frame. Because of this, some tasks are interchangeable. For example: I have reserved some time for updating a row of metatiles, but if the screen did not scroll there is no need to, and I'll update the palette instead. I'm pretty sure that a similar situation inspired you to come up with that mode.

But I see a serious problem with your solution: The insane amount of valuable zero page RAM you set aside just for holding pointers that most of the time you won't even use.

My solution to the same problem, as I said above was to define interchangeable tasks. My solution ends up looking a lot like yours, but I waste less memory by defining which tasks replace which.

As I said before, I use the concept of multiple NMIs. In the main game, for example, which has it's own NMI, I have defined that during VBlank I need to:
1. Update a row of metatiles;
2. Update a column of metatiles;
3. Update the palette;
4. Update sparse metatiles;
5. Update the OAM through a sprite DMA;
6. Update some patterns;

There is no time to do it all every frame, but I can right away tell what will be done every frame: OAM updates. The rest of the tasks I handle just like you:

Code: Select all

	jsr CopySlot1
	jsr CopySlot2
	jsr CopySlot3

	;Do OAM transfer here
	;Handle music here;

	rti
This would be the NMI for the main game. I used a different trick to compensate for the lack of an indirect JSR instruction:

Code: Select all

CopySlot1:
	jmp (CopyRout1)	;column or palette
CopySlot2:
	jmp (CopyRout2) ;pattern block or palette and metatiles
CopySlot3:
	jmp (CopyRout3) ;row or metatiles
With the concept of interchangeable tasks, I defined just a few time slots that I allocate to subroutines as needed. And in the cases where you need to do 2 things during the time otherwise used by a slower routine, you can just make a wrapper routine to call the other two:

Code: Select all

UpdatePalette:
	(...)
	rts

UpdateMetatiles:
	(...)
	rts

UpdatePAletteAndMetatiles:
	jsr UpdatePalette
	jsr UpdateMetatiles
	rts
Anyway, I'm just pointing out a different solution for the same problem. Your way is probablymore dynamic, because you can just use the same NMI routine for the whole game. But when combined, the two techniques I use (multiple NMIs with interchangeable tasks) are almost as dynamic, without using so much RAM.

User avatar
Bregalad
Posts: 7879
Joined: Fri Nov 12, 2004 2:49 pm
Location: Chexbres, VD, Switzerland

Post by Bregalad » Wed Nov 14, 2007 9:35 am

Well, Tokumaru's solution also looks clever, and definitely more optimal.
My simple solution so far has always been to just give a priority to tasks. If something like the palette should absolutely be updataed, the main programm knowns that it should do it when all request to update metatiles are done, else the palette will be ignored, so I just programm things with this in mind. My game doesn't scroll very often and does it pretty slowly, so it is dratiscally differnt from a sonic platformer wich scrolls very fasts and almost always.
I will post the NMI I use (and that worked fine for me) for global information, but be sure to not use it as it or anything, because that's a piece of code I actually use (not just an example).

Code: Select all

NMI
	pha
	txa
	pha
	tya
	pha			;Saves the registers
	bit $2002.w		;VBlank flag clear

	lda SpriteDMAFlag
	beq _skipSpriteDMA	;Is the sprite buffer ready ?
	lda #$00
	sta SpriteDMAFlag	;Sprite buffer is read
	sta $2003.w
	lda #$02
	sta $4014.w		;Do sprite DMA at $200
_skipSpriteDMA
	lda BlockFlag
	beq +
	lda #$00
	sta BlockFlag
	jsr WriteBlock		;Need to write some block data to the PPU ?
	jmp _skipPPUUpload
+	lda PPUStringSize	;Need to upload a name-table string ?
	beq +
	jsr WriteNamStrings
	jmp _skipPPUUpload
+	lda PalFlag		;Need to upload the palette ?
	beq _skipPPUUpload
	lda #$00
	sta PalFlag
	jsr WritePalette
_skipPPUUpload
	lda #$3f
	sta $2006.w
	lda #$00
	sta $2006.w
	sta $2006.w
	sta $2006.w
	sta $2005.w		;PPUScrollH is always 0 (exept when changed midframe)
	lda PPUScrollV
	sta $2005.w		;Upload PPU regs
	lda M2000
	sta $2000.w
	jsr PlaySound
	jsr NumberLogic
	lda #$00	;Clears the flag to stop waiting the frame
	sta NMIFlag
	pla		;Restore registers
	tay
	pla
	tax
	pla
IRQ	rti
So yeah this is exactly the routine I use, and it's absolutely simple, and works absolutley fine. The Sprite DMA is updated every frame, unless the main programm doesn't do this or if the game lags. Then the other PPU updates are sorted by priority, first comes the "block" update (block is just the word I used for metatiles, where the routine write a single "big" 32x32 metatile on the screen, the format I use in my game). If no block write is requested, it will check if a regular PPU string update should be done (there is a universal format my game uses for all non-map graphics). If none of this is to be done, the palette is updated if needed. If all 3 of these things are requested by the main program at the exact same time, 3 NMIs are needed in order to complete everything. Finally the music is handled every frame (no matter if we're still in VBlank or not), and "Number Logic" is applied (this routine increase/decrease some counters/timers and re-seed the random number generator). Then the NMI returns. I could do some improvement, because I guess it could be possible to update both a PPU String and the palette at the same time, or it could be possible to update only the BG or sprite palette to save time, however I keep things as simple as possible in the NMI (as long as it doesn't limit too much the main programm), and try to work with it as it.

By the way I noted I used a coule of times (4 exactly) code looking like this :

Code: Select all

asl A
tax
lda JumpTable,X
sta PointerL
lda JumpTable+1,X
sta PointerH
jmp [Pointer]

JumpTable:
 .dw Adress1
 .dw Adress2
 etc...
Would it be possible to just replace it by this and save space without corrupting anything ?

Code: Select all

asl A
tax
lda JumpTable,X
pha
lda JumpTable+1,X
pha
rts

JumpTable:
 .dw Adress1-1
 .dw Adress2-1
etc...
This saves 4 bytes and the use of the word Pointer, which is free for use anyways since I already wrote the code with that in mind and since I use that variable for other fonctions as well.

By the way, since Tokumary mentionned this, I have a complete game engine (without much running on it for now) that uses zero page bytes $00 to $9e (62% of zero page taken), uses BSS RAM bytes $300 to $583 (plus the stack and OAM taking $100-$2ff), making the BSS RAM (50.2% of BSS RAM taken, without counting the OAM nor the stack).
So even if your game engine is more complex that mine, and even if the my game isn't compltetly completed, I guess RAM is still somewhere you have enough space to work what you must do.

Celius
Posts: 2157
Joined: Sun Jun 05, 2005 2:04 pm
Location: Minneapolis, Minnesota, United States
Contact:

Post by Celius » Wed Nov 14, 2007 10:01 am

tokumaru wrote:But I see a serious problem with your solution: The insane amount of valuable zero page RAM you set aside just for holding pointers that most of the time you won't even use.
I really don't see how I'm using an insane amount of zero page. I just think I'm using a moderately sized section of it. I have plenty of RAM to work with. I can store a variable or two in something that isn't zero page. I'm sure it'll take more time, and the cycles will add up. But, for the flexibility that I have now, I think it's well worth the zero page. I'll change it in the future to use less.

User avatar
Bregalad
Posts: 7879
Joined: Fri Nov 12, 2004 2:49 pm
Location: Chexbres, VD, Switzerland

Post by Bregalad » Wed Nov 14, 2007 12:34 pm

I tried the pha/pha/rts thing, and it works fine ! In fact I think this is the standard way to do the jmp() thing on some other processors who doesn't have the jmp() isntruction.

About the NMI I trough of a programm who can automatically write an adress to $2006 and a string to $2007, while computing the time it will take to itself so it can automatically maximize the amount of data written to the PPU. Then the main programm doesn't have to worry much about things, as things are getting updated as soon as possible, and in the order specified, with as much data as possible on one single frame, the rest being automatically left for the next frame. Thing of it as a FIFO between the main programm and the NMI, where the buffer is emptied by a computed size automatically. This should be somewhat complicated to implement, tough. I'd prefer this to have the programm manually change the adress where the NMI is going to update things, tough, as the NMI can happen anywhere and you're always set for it, and the code will be overall more clearer, and easier to use as view from the main programm.

User avatar
tokumaru
Posts: 11692
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Sat Apr 12, 2008 1:02 am

Soooo... I was thinking of a way to generalize the fast updating of Name Tables I talked about in this post, and I ended up coding some sort of pseudo-DMA thing for writing data to the PPU.

This uses a lot of ROM space, because of the big jump table and the big unrolled loop (about 1KB total, not counting the ByteTable I use for X and Y operations), but it optimizes a lot the whole process of writing data to the Name Tables. Here is the code I used to test the idea (it runs on the 6502 simulator):

Code: Select all

TempAddress = $00
Buffer = $0480

	.start Reset

	.org $8000

ByteTable:
ByteValue .set 0
	.rept 256
	.db ByteValue
ByteValue .set ByteValue + 1
	.endr

JumpTableLo:
CopyAddress .set CopyEnd - 6
	.rept 128
	.db <CopyAddress
CopyAddress .set CopyAddress - 6
	.endr
JumpTableHi:
CopyAddress .set CopyEnd - 6
	.rept 128
	.db >CopyAddress
CopyAddress .set CopyAddress - 6
	.endr

Reset:
	ldx #$7f
FillBuffer:
	txa
	sta Buffer, x
	dex
	bpl FillBuffer

	lda #$0f	;data length
	ldx #$00	;data offset
	jsr UpdateData

	brk

UpdateData:
	;INPUT:
	;A: length of the string to copy -1
	;X: where in the buffer the string starts;
	tay
	clc
	adc ByteTable, x
	tax
	lda JumpTableLo, y
	sta TempAddress+0
	lda JumpTableHi, y
	sta TempAddress+1
	jmp (TempAddress)

CopyData:
Displacement .set 128
	.rept 128
Displacement .set Displacement - 1
	lda Buffer-Displacement, x
	sta $2007
	.endr
CopyEnd:
	rts
Just watch as the values are written to $2007 to make sure the right values are being fetched. This code is pretty incomplete, of course, (I can't even say if it's 100% bug-free) it just demonstartes the idea, but you could very well implement a list of updates to perform during VBlank, composed of the destination address, the increment mode (1 or 32), the length of the data and it's offset into the buffer. Then, during VBlank, you simply scan this list and execute all the requested copying.

I limited the buffer to 128 bytes because that's the most you can do without crossing pages (which would make reads 1 cycle slower), and the buffer must be placed in the upper half of a memory page for the same reason. However, 128 bytes should be enough for data that is witten to unpredictable locations. I mean, you wouldn't use this to update the palette, since you always know where the palette goes, this is mostly for Name Table writes, and hardly you'll need to write more than 128 bytes to them when scrolling.
Last edited by tokumaru on Sat Apr 12, 2008 1:18 am, edited 1 time in total.

User avatar
Bregalad
Posts: 7879
Joined: Fri Nov 12, 2004 2:49 pm
Location: Chexbres, VD, Switzerland

Post by Bregalad » Sat Apr 12, 2008 1:17 am

One thing that bothers me is that useless byte table :

Code: Select all

stx Temp
clc
adc Temp
tax
Would take about the same time as :

Code: Select all

   tay
   clc
   adc ByteTable, x
   tax 
and would save 256 bytes.

Also, why use indirect adressing in the Displacement routine ? This could potentially waste cycles if a page boundary is crossed, but well... anyway that was just for the idea. It looks good for really fast transfer, as it would be almost as fast as self-modified code but wouldn't waste a lot of RAM.
Using $300-$7ff (or $200-$6ff or anything else) you're able to store a chain of exactly 256 lda#$xx/sta $2007 instruction which is cool.

User avatar
tokumaru
Posts: 11692
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Sat Apr 12, 2008 1:40 am

Bregalad wrote:One thing that bothers me is that useless byte table
In this case, yeah, there is no advantage in using it, but I used because I have this table in my game and got used to using it for X and Y operations (as described in the first post of this topic). Since I have the table anyway, and in some cases it really does reduce the complexity of the code a lot, I used it here too, just to avoid the temp.
Also, why use indirect adressing in the Displacement routine ? This could potentially waste cycles if a page boundary is crossed, but well...
Indirect? Where? Do you mean indexed? If you think about this code a bit more, you'll see that LDA absolute can't be used for this, as it wouldn't allow for random access to the buffer. X is used to displace the start of the buffer around. As far as I know, the way I did it would be the only way to have access to any part of the buffer, starting anywhere in the list of LDAs STAs (depending on the amount of bytes to copy) so that the copying process ends with the last byte you want to copy. If you know of a better way to do the same thing, I'd like to know.

Also, as long as you respect the rules I talked about (the limit of 128 bytes and the location of the buffer), there will be no page crossing.

EDIT: Just to make it clear, the beauty of this is that you can copy a variable amount of data from anywhere in the buffer, without having to manage an index register for every byte, and without having to check for an ending condition. This is what makes it fast, it's a flexible unrolled loop. With absolute addressing you'd only be able to read hardcoded positions of the buffer.

doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden
Contact:

Post by doynax » Sat Apr 12, 2008 4:59 am

I've actually been using something similar in my tile copying loop, it needs to be able to start and stop at arbitrary indices because a single transfer usually takes several frames. In my case I need to transfer up to 240 different characters from two separate buffers which would correspond to about 53k of unrolled code, perhaps a bit hefty ;)

Thankfully with a bit of arithmetic and you can get much the same benefits from partially unrolled loop. Now the following codes tries to copy the buffer to VRAM starting from x up continuing up to (but not including) limit. Unfortunately you lose as many bytes as the unrolling factor at the end of the array, so in this case we could only copy up to 248 bytes, and be sure to place those "wasted" bytes at the start of the page such that buffer-8 doesn't cross a page. Furthermore the initialization code and especially the computed goto could be sped up a bit but I'm thinking that it might as well be prepared outside of vblank anyway. Oh and "identity" is tokumaru's byte table which I've actually been using in my own code. It's damned nice having bytes to spare on such things for once :)

Code: Select all

copy:
	lda limit
	clc
	sbc identity,x
	and #$07
	tay
	adc identity,x
	tax

	lda .enter_lo,y
	sta trampoline+0
	lda .enter_hi,y
	sta trampoline+1

	lda #$ff
	jmp (trampoline)

.enter_lo:
	.byte <.enter7,<.enter6,<.enter5,<.enter4
	.byte <.enter3,<.enter2,<.enter1,<.enter0
.enter_hi:
	.byte >.enter7,>.enter6,>.enter5,>.enter4
	.byte >.enter3,>.enter2,>.enter1,>.enter0

.loop:
	sbx #-8

.enter0:
	ldy buffer-8,x
	sty $2007
.enter1:
	ldy buffer-7,x
	sty $2007
.enter2:
	ldy buffer-6,x
	sty $2007
.enter3:
	ldy buffer-5,x
	sty $2007
.enter4:
	ldy buffer-4,x
	sty $2007
.enter5:
	ldy buffer-3,x
	sty $2007
.enter6:
	ldy buffer-2,x
	sty $2007
.enter7:
	ldy buffer-1,x
	sty $2007

	cpx limit
	bne .loop

User avatar
tokumaru
Posts: 11692
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Sat Apr 12, 2008 12:18 pm

Yeah, the idea is very similar... I understand that you have to do things a bit differently, as the purposes aren't the same.

My motivation for this was my Sonic game, since because of the fast scrolling, I must be able to write up to 128 bytes of Name Table data, for the new columns and rows of metatiles that scroll in. I doubt it would be possible to write as much data within standard VBlank time, considering that there are still other things to write, such as Attribute Tables, sprite DMA, and so on.

doynax
Posts: 162
Joined: Mon Nov 22, 2004 3:24 pm
Location: Sweden
Contact:

Post by doynax » Sat Apr 12, 2008 12:40 pm

tokumaru wrote:Yeah, the idea is very similar... I understand that you have to do things a bit differently, as the purposes aren't the same.
I'd say it's exactly the same thing, the only difference is that I'm copying up to 8k which makes complete unrolling tad impractical. But my "innerloop" works exactly the same way as your raw copying, well except for that fact that you're copying downwards IIRC.
Perhaps even 768 bytes of code might become too some day if the interrupt code has to run from the fixed bank.
My motivation for this was my Sonic game, since because of the fast scrolling, I must be able to write up to 128 bytes of Name Table data, for the new columns and rows of metatiles that scroll in. I doubt it would be possible to write as much data within standard VBlank time, considering that there are still other things to write, such as Attribute Tables, sprite DMA, and so on.
Just how are you extending the blanking period by the way? I had all kinds of issues with this if you recall.

User avatar
tokumaru
Posts: 11692
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Sat Apr 12, 2008 1:16 pm

doynax wrote:Perhaps even 768 bytes of code might become too some day if the interrupt code has to run from the fixed bank.
I see what you mean... I'm not worried about this because I have an 8KB ROM bank dedicated to the VBlank routines, enough to unroll everything.
Just how are you extending the blanking period by the way? I had all kinds of issues with this if you recall.
I'm not. I was when I was using the MMC1 with CHR-RAM, so that I could copy 256 bytes to it every frame, in addition to all the other PPU operations. I simply had all my VBlank code use a fixed amount of cycles (which was a bit hard, because I alternated some tasks, so I had to make sure that alternating tasks took the same amount of time), so I always enabled rendering at the same point, 16 scanlines past the start of the frame. It worked fine, except for a few sprite problems (I couldn't seem to find the exact moment to turn sprite rendering on, so I always had a few glitchy pixels at the top left corner of the screen), and the different dot crawl pattern that results from skipping the pre-render scanline, which may or may not bother people.

Because of how hard it was to keep the Vblank code constant-timed and because of the sprite glitches, I ended up switching to the MMC3 with CHR-ROM, and all PPU updates fit in the standard Vblank time now. I didn't have to sacrifice much, and ended up getting more usable frame time from not having to manually copy tiles to the PPU. I still blank the top 16 scanlines though, by having blank tiles switched in, and only switching the actual patterns when a mapper IRQ fires 16 scanlines into the frame. This is to avoid scrolling glitches usually visible when you use vertical mirroring and scroll vertically.

I know a lot of commercial games had visual glitches, but I just couldn't live with them in my game(s). If it is possible to get rid of them, I'll do everything I can! =)

User avatar
Bregalad
Posts: 7879
Joined: Fri Nov 12, 2004 2:49 pm
Location: Chexbres, VD, Switzerland

Post by Bregalad » Sat Apr 12, 2008 1:40 pm

Well, time-optimising is definitely the exact opposite to byte optimising.
These times I had to do a lot of or byte-optimising, where I could save 4 bytes or so about everywhere in my game engine, and now you guys talk about 8k of unrolled code that makes me crazy.

That identity thing is a bit fun, it's true it allows easy and fast operations with index registers (adding, substract, and even and, or etc..), and each time you use it you save 1 byte, but still you have to use this trick more than 256 times for it to be really worth it.

Speaking of 6502 tricks, I just figured out how many times I had to do 4 consecutive ASL A or LSR A, so I just made 2 routine that does them and return, and call them, saving one byte everytime this trick is used. I was able to save about 30 bytes doing that, which is great.

However I had to do a lot of time-optimising for my mode-7 demo, as I perform raster-timed code and calculations at the same time.
Put some code in RAM and modify the argument of the instructions rocks, as you can add one level of indirection and then save time (lda #$xx can be equivalent to lda $xxxx, and lda $xxxx,Y can be the equivalent to lda [$xx],Y, plus you get the equivalent of the inexistant lda [$xx],X).

User avatar
tokumaru
Posts: 11692
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Post by tokumaru » Sat Apr 12, 2008 2:01 pm

I know you are not using any kind of PRG bankswitching in your current project, so these tricks probably do not interest you much.

These tricks are for games with much larger PRG-ROM, which are usually more complex anyway. You'll hardly see an NROM or CNROM with complex scrolling and animations, large sprites, and so on...

User avatar
MottZilla
Posts: 2832
Joined: Wed Dec 06, 2006 8:18 pm

Post by MottZilla » Sat Apr 12, 2008 3:18 pm

You can definitely make a very fun game with NROM or CNROM. You may have heard of a little game called Super Mario Bros. :p

But seriously there are alot of fun games that use no mapper. Atleast I find them fun.

Post Reply