
All times are UTC - 7 hours





PostPosted: Sun May 05, 2019 5:29 pm 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
Bananmos wrote:
I sort of struggle to see what MMC2 / MMC4 would provide that a scanline IRQ couldn't, except for a modest amount of saved CPU cycles.


Using a scanline IRQ would require a relatively complex mapper chip. While MMC2/MMC4 are somewhat fancy, a game designed for MMC2 could be adapted for a discrete-logic board more readily than one that required a scanline interrupt timed to suit the purpose.


PostPosted: Sun May 05, 2019 6:36 pm

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
supercat wrote:
Bananmos wrote:
I sort of struggle to see what MMC2 / MMC4 would provide that a scanline IRQ couldn't, except for a modest amount of saved CPU cycles.


Using a scanline IRQ would require a relatively complex mapper chip. While MMC2/MMC4 are somewhat fancy, a game designed for MMC2 could be adapted for a discrete-logic board more readily than one that required a scanline interrupt timed to suit the purpose.
Either approach could work, but if the scan line IRQ isn't triggered at the optimal time it may be necessary to waste a fair number of CPU cycles.


PostPosted: Mon May 06, 2019 3:18 am

Joined: Wed Mar 09, 2005 9:08 am
Posts: 464
tokumaru wrote:
Bananmos wrote:
And if you can live without the DMC channel in your game's music, you may not even need a mapper-based IRQ, but could use the DMC channel's IRQ as explained on the wiki: https://wiki.nesdev.com/w/index.php/APU_DMC

I'm pretty sure you'd lose a lot of CPU time if you used this trick multiple times per frame, as one of the key elements of this technique is to waste CPU time to compensate for timing errors.


Well, let's try to do the numbers then :)

For playback rate $C, we have a period of $06A = 106 cycles on NTSC. This means 106*8 = 848 cycles until an interrupt hits. Or ~7.46 scanlines.

This leaves us with 0.54 of a scanline, to do the work in our IRQ handler, or 61.3 CPU cycles. This should be plenty of time to enter the IRQ handler, write the bank-switching register, and leave the IRQ handler.

We'll need to repeat this IRQ handling up to 30 times in the frame, which means wasting ~1840 cycles in an NTSC frame, or just 6.1% of CPU cycles. I'd dare to call this "modest", considering that games like SMB1 wasted more than that just waiting for a sprite#0 hit. I'd say anything below 10% is an acceptable trade-off for lower hardware costs.
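A quick sketch of the arithmetic above, for anyone who wants to check it. The NTSC constants (341 PPU dots per scanline, 3 dots per CPU cycle, an average of 29780.5 CPU cycles per frame) are assumptions filled in here, not figures from the post, and the calculation keeps the post's simplifying assumption that each IRQ fires exactly one sample byte (8 bits) after it is armed.

```python
from fractions import Fraction

# Assumed NTSC constants (not stated in the post).
DOTS_PER_SCANLINE = 341
DOTS_PER_CPU_CYCLE = 3
CPU_CYCLES_PER_FRAME = Fraction(59561, 2)   # 29780.5 on average

# Figures taken from the post.
DMC_PERIOD_RATE_C = 0x06A    # 106 CPU cycles per output bit at rate $C
BITS_PER_BYTE = 8
IRQS_PER_FRAME = 30
SCANLINES_BETWEEN_SWITCHES = 8

cycles_per_scanline = Fraction(DOTS_PER_SCANLINE, DOTS_PER_CPU_CYCLE)
cycles_until_irq = DMC_PERIOD_RATE_C * BITS_PER_BYTE            # 848
scanlines_until_irq = cycles_until_irq / cycles_per_scanline

# Cycles left between the IRQ firing and the 8-scanline boundary.
slack_cycles = SCANLINES_BETWEEN_SWITCHES * cycles_per_scanline - cycles_until_irq
wasted_per_frame = IRQS_PER_FRAME * slack_cycles
wasted_fraction = wasted_per_frame / CPU_CYCLES_PER_FRAME

print(f"cycles until IRQ:    {cycles_until_irq}")
print(f"scanlines until IRQ: {float(scanlines_until_irq):.2f}")
print(f"slack per IRQ:       {float(slack_cycles):.1f} cycles")
print(f"wasted per frame:    {float(wasted_per_frame):.0f} cycles")
print(f"share of the frame:  {float(wasted_fraction):.2%}")
```

With these constants the slack comes out at 184/3 cycles per IRQ and exactly 1840 cycles per frame, a little over 6% of the frame.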

Speaking of sprite#0 hit, there is of course a problem with this 1-DMC-IRQ-per-8-scanlines-proposal: After enough of these, the delay introduced when IRQ fires inside an instruction might have skewed your writes and pushed them outside of hblank into the visible portion of the screen.
For these reasons, I'd also suggest having a sprite#0 hit somewhere in the middle of the screen, on an IRQ-firing scanline, which can be waited on inside the half-time IRQ handler, in order to "recalibrate" your writes.

As a bonus, ensuring that a sprite#0 hit always occurs allows you to wait for the end of vblank by polling for the sprite#0 hit flag being cleared, which can be used to activate/deactivate masking of the top portion of the screen to avoid scrolling artifacts.

Were this 20 years ago, I'd say you'd have to painstakingly do all this IRQ aligning on a real system. But the Event Viewer Sour added to Mesen makes these things very easy to develop in an emulator, and even highly enjoyable.
(just keep in mind that if you're targeting PAL, then not only are the periods different, but no emulator accurately emulates the DMC cycle steals AFAIK, so you'd need to be back to doing this on real hardware again, like in the old days...)


PostPosted: Mon May 06, 2019 3:34 am

Joined: Wed Mar 09, 2005 9:08 am
Posts: 464
supercat wrote:
Bananmos wrote:
I sort of struggle to see what MMC2 / MMC4 would provide that a scanline IRQ couldn't, except for a modest amount of saved CPU cycles.


Using a scanline IRQ would require a relatively complex mapper chip. While MMC2/MMC4 are somewhat fancy, a game designed for MMC2 could be adapted for a discrete-logic board more readily than one that required a scanline interrupt timed to suit the purpose.


True. And I certainly don't want to dissuade you from either using the MMC2, or designing your own simplified mapper if that's an important part of your project goals. We're all doing this odd hobby for fun, and if thinking up/implementing new hardware boards is what excites you, then by all means go ahead.

In fact, Never-obsolete isn't the only one who's designed an FPGA mapper prototype for extended attributes. I did this as more of a hypothetical "what's the minimum HW you need for 8x8 attributes" exercise, which might be a more useful starting point for you. Have a look at this thread, where I've posted sources for both PowerPak and Everdrive: Mapper30 8x8 attributes on Everdrive
While it is made for 8x8 color attributes, it'd be trivial to change it to use the latched A0/A5 to control upper bits for CHR instead of the attribute table.

I haven't actually tried to implement it with discrete logic, but can't imagine it would be a lot of chips or dollars to add it to any discrete logic board. I just didn't pursue that further, because - as lidnariq would have put it - it was more of a solution looking for a problem. Very few game ideas actually *require* 8x8 color attributes.

That said, it still sounds to me like designing a new mapper is a bit overkill for your game idea, and could easily end up being a bit of a distraction from designing the game itself. It's incredibly common with NES development in particular to go down the rabbit hole of hardware selection/design and end up not having time to focus on the software itself. Which is why I think the software-only solution with a DMC IRQ is neater, and a fun challenge to tackle in its own right - if you can live without DMC samples in your game, that is.


PostPosted: Mon May 06, 2019 5:03 am

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 11348
Location: Rio de Janeiro - Brazil
Bananmos wrote:
For playback rate $C, we have a period of $06A = 106 cycles on NTSC. This means 106*8 = 848 cycles until an interrupt hits. Or ~7.46 scanlines.

This leaves us with 0.54 of a scanline, to do the work in our IRQ handler, or 61.3 CPU cycles. This should be plenty of time to enter the IRQ handler, write the bank-switching register, and leave the IRQ handler.

We'll need to repeat this IRQ handling up to 30 times in the frame, which means wasting ~1840 cycles in an NTSC frame, or just 6.1% of CPU cycles.

Unfortunately, I don't think it's that simple. DMC IRQs don't fire a constant number of cycles after you start playback (it would be amazing if that was the case!), what really happens is that there's always an unpredictable delay, which you have to measure and compensate for in subsequent IRQs. It's this error measuring and compensation that wastes CPU time, and you have to compensate the error (which can be several scanlines) on every IRQ.


PostPosted: Mon May 06, 2019 6:23 am

Joined: Wed Mar 09, 2005 9:08 am
Posts: 464
tokumaru wrote:
Bananmos wrote:
For playback rate $C, we have a period of $06A = 106 cycles on NTSC. This means 106*8 = 848 cycles until an interrupt hits. Or ~7.46 scanlines.

This leaves us with 0.54 of a scanline, to do the work in our IRQ handler, or 61.3 CPU cycles. This should be plenty of time to enter the IRQ handler, write the bank-switching register, and leave the IRQ handler.

We'll need to repeat this IRQ handling up to 30 times in the frame, which means wasting ~1840 cycles in an NTSC frame, or just 6.1% of CPU cycles.

Unfortunately, I don't think it's that simple. DMC IRQs don't fire a constant number of cycles after you start playback (it would be amazing if that was the case!), what really happens is that there's always an unpredictable delay, which you have to measure and compensate for in subsequent IRQs. It's this error measuring and compensation that wastes CPU time, and you have to compensate the error (which can be several scanlines) on every IRQ.


Oops. I really should have read that wiki more carefully myself!
I was under the impression that you could actually control the phase of this timer by starting a new sample at some variable point in the NMI. But yes, it does appear this was a misunderstanding of mine. To be honest, I've only ever tried the simple DMC IRQ variant in practice (use DMC IRQ for coarse timing, then use sprite#0 for sync).

So I guess that means the documented method would be closer to wasting ~50% of CPU time to consistently bank-switch every 8th scanline, and the OP's suggestion of MMC2 / a purpose-built mapper latching A5 easily wins out.

Though it does make me wonder if it's practically possible to jiggle the phase of the counter a bit in the vblank period to make the CPU use a bit less wasteful...
Because the places where the IRQs happen are reasonably deterministic if you're attempting to trigger them systematically in a frame, I suppose you should also be able to predict what "phase shift" you'll get in each NMI from one frame to the other, and possibly set off a few dummy IRQs, just to align the IRQs in the rendered frame a bit closer to your ideal IRQ firing? And then alternate playback rates $C and $B to keep the delay to your bank-switch write minimal.
Of course, it would make the programming effort way, way bigger and perhaps not very practical to do in anything but a tech demo... though maybe it's a good use-case for going crazy with Mesen's event viewer... :P


PostPosted: Mon May 06, 2019 7:16 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
Bananmos wrote:
Oops. I really should have read that wiki more carefully myself!
I was under the impression that you could actually control the phase of this timer by starting a new sample at some variable point in the NMI. But yes, it does appear this was a misunderstanding of mine. To be honest, I've only ever tried the simple DMC IRQ variant in practice (use DMC IRQ for coarse timing, then use sprite#0 for sync).

So I guess that means the documented method would be closer to wasting ~50% of CPU time to consistently bank-switch every 8th scanline, and the OP's suggestion of MMC2 / a purpose-built mapper latching A5 easily wins out.


I've not really settled on the pros/cons of using interrupts vs MMC2 vs a custom mapper.

The first potential advantage I can see to an interrupt would be the ability to cleanly blank the top and bottom at consistent positions. I don't think a "load seam" would be visible on a typically-calibrated NTSC set, but on an NTSC set configured to show all visible lines it would be. Balancing this might be the need to special-case the scenario where a row character flip would occur just after the top of the screen, but I don't think the game will ever need to scroll by an odd number of scan lines, so I could avoid that issue by picking my scroll amounts vs. the top line placement.

The second advantage to an interrupt, which would be huge *if* it were workable, would be an ability to have pairs of rows use the same bytes of nametable RAM, so as to cut in half the number of writes required to draw the screen. Unfortunately, I don't think there's any way to disable rendering, switch the PPU address, and re-enable rendering, without needing at least one blank scan line.

Using MMC2 would eliminate the need for interrupt-handling code, and would "just work" on NTSC and PAL--a useful attribute since I don't have any real PAL hardware to test with. The downside would be an unclean top and bottom edge in PAL mode.

Using a custom mapper would allow almost 256 tiles rather than 128, though the advantage going beyond 128 is much less than that of going from 64 to 128. Using a CPLD-based mapper could offer a bigger advantage of cutting the time to update each row of metatiles from 448 cycles down to 142. If I use CHR RAM wired so that the same chip also replaces the NTRAM, and all four cells of each metatile map to the same address, then if my understanding of the PPU is correct, writing a row of metatiles would simply be:
Code:
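    ; Assume Y is zero, so $2007,y still addresses $2007
    lda rowZeroAddrH
    sta $2006
    lda rowZeroAddrL
    sta $2006
    lda rowZeroTile+0
    sta $2007,y     ; indexed store: dummy read of $2007, then the write
    lda rowZeroTile+1
    sta $2007,y
    ; ... 14 more tiles, same lda/sta pattern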

each indexed STA would read PPU data, ignoring the result, but incrementing the address, so each tile write would be reduced to a 3-cycle load and 5-cycle store, as distinct from having to use:
Code:
    lda rowZeroTile+0
    sta $2007
    eor #$02
    sta $2007

for the top half of all sixteen tiles and then having to do:
Code:
    lda rowZeroTile+0
    eor #$01
    sta $2007
    eor #$02
    sta $2007

for the bottom half. Using a CPLD for the CPU side as well as the PPU side could improve update speeds even further: I think a pair of even minimal-cost CPLDs, a bunch of resistors to isolate the cartridge buses from those within the NES, a 74HC373, and a 32K WRAM could probably speed up even the enhanced version by another factor of four. With the fast-transfer mode latch enabled, have any address in the range $7000-$7FFF enable the WRAM with the corresponding address in some 4K block until M2 goes high; when M2 goes high, strobe the 373 and PPU CPLD and switch the WRAM address to $7000-$7FFF. If the upper range of addresses contains mostly do-something-immediate and jmp instructions that target the next address, the act of running code from those addresses would copy the "parallel" region of WRAM to the PPU-side RAM via the 373. I don't think that latter speedup would be needed for Ruby Runner, but it would make Elite-style graphics practical on NTSC systems.
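The 448 and 142 figures can be reproduced by summing standard 6502 instruction timings. The sketch below assumes zero-page operands for the loads (the post's "3-cycle load") and 16 metatiles per row. On my reading, which the post does not spell out, the 142 figure includes one $2006 address setup while the 448 figure counts only the tile writes for the top and bottom halves.

```python
# Standard 6502 timings; the zero-page LDA operand is an assumption.
CYCLES = {
    "lda zp": 3,
    "eor #imm": 2,
    "sta abs": 4,
    "sta abs,y": 5,   # indexed stores always take 5 cycles (dummy read included)
}
TILES_PER_ROW = 16


def cost(*ops):
    """Total cycles for a sequence of instructions."""
    return sum(CYCLES[op] for op in ops)


# Two writes to $2006 to set the PPU address.
set_address = cost("lda zp", "sta abs", "lda zp", "sta abs")

# Proposed mapper: one load and one indexed store per metatile.
fast_row = set_address + TILES_PER_ROW * cost("lda zp", "sta abs,y")

# Plain approach: two tile writes per metatile for each half of the row.
top_half = TILES_PER_ROW * cost("lda zp", "sta abs", "eor #imm", "sta abs")
bottom_half = TILES_PER_ROW * cost(
    "lda zp", "eor #imm", "sta abs", "eor #imm", "sta abs"
)
plain_row = top_half + bottom_half

print("address setup:", set_address)
print("fast row:     ", fast_row)
print("plain row:    ", plain_row)
```

Adding the two address setups the plain approach would also need (one per half) would raise its total by another 28 cycles.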


PostPosted: Mon May 06, 2019 11:10 am

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8367
Location: Seattle
supercat wrote:
If I use CHR RAM wired so that the same chip also replaces the NTRAM,
A significant number of Famiclones don't let you disable NTRAM. There aren't very many historical games that rely on being able to – roughly 10 – and Memblers decided that compatibility with these consoles wasn't a concern for his new GTROM board. But it could possibly influence your decisions.


PostPosted: Mon May 06, 2019 11:38 am

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21392
Location: NE Indiana, USA (NTSC)
lidnariq wrote:
A significant number of Famiclones don't let you disable NTRAM.

Is there a reliable way to test for these in software? Would it work, say, to write different values to $2000, $2400, $2800, and $2C00, and then try to read them all back? And if so, is there a reliable way to display a message if both the internal nametable memory and the cartridge are responding to reads and writes of $2000-$2FFF?

Code:
THIS GAME IS FOR NINTENDO
ENTERTAINMENT SYSTEM AND
FULLY COMPATIBLE THIRD-PARTY
GAME CONSOLES.

A PIN ON THE GAME PAK
CONNECTOR USED FOR EXPANDING
VIDEO MEMORY IS NOT PRESENT
IN YOUR CONSOLE.  IT WOULD
ALSO SHOW PROBLEMS WITH
GAUNTLET, RAD RACER II,
AND CASTLEVANIA III.

PostPosted: Mon May 06, 2019 12:42 pm

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
tepples wrote:
lidnariq wrote:
A significant number of Famiclones don't let you disable NTRAM.

Is there a reliable way to test for these in software? Would it work, say, to write different values to $2000, $2400, $2800, and $2C00, and then try to read them all back? And if so, is there a reliable way to display a message if both the internal nametable memory and the cartridge are responding to reads and writes of $2000-$2FFF?

If internal and external memory are both responding in typical fashion to reads and writes in the range $2000-$2FFF, writing and reading back data in the range $2000-$23FF without any intervening accesses to $2400-$2FFF should behave normally without bus contention. If resistors are used to isolate the cart's internal bus from that of the console, then the console's NTRAM would reliably "win" in case of conflict--a condition that could be detected. Given that the ability to use external NTRAM was part of the design intention of the NES, however, I don't see much reason that a console that can't support it shouldn't be considered "broken".
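The write-and-read-back test asked about above can be modeled in a few lines. This is only a toy model in Python, not NES code: the `Nametables` class, its `four_screen_works` flag, and the marker values are made up for illustration. It assumes that on a console which cannot disable its internal NTRAM, the internal 2 KiB simply wins every conflict (the resistor-isolated case described above), so the console behaves as if it were plainly mirrored. Real code would go through $2006/$2007 and deal with the buffered read.

```python
class Nametables:
    """Toy model of the PPU $2000-$2FFF region (hypothetical, for illustration)."""

    def __init__(self, four_screen_works, mirroring="vertical"):
        self.four_screen_works = four_screen_works
        self.mirroring = mirroring
        self.mem = {}

    def _key(self, addr):
        table = ((addr - 0x2000) >> 10) & 3   # which of the four nametables
        offset = addr & 0x3FF
        if self.four_screen_works:
            return (table, offset)            # cart RAM: four distinct pages
        # Assumption: internal NTRAM wins, so only two pages really exist.
        if self.mirroring == "vertical":
            page = table & 1                  # PPU A10 selects the page
        else:
            page = table >> 1                 # PPU A11 selects the page
        return (page, offset)

    def write(self, addr, value):
        self.mem[self._key(addr)] = value

    def read(self, addr):
        return self.mem.get(self._key(addr), 0)


def four_screen_detected(nt):
    """Write a distinct marker to each nametable base, then read all four back."""
    bases = (0x2000, 0x2400, 0x2800, 0x2C00)
    for i, base in enumerate(bases):
        nt.write(base, 0xA0 + i)
    return all(nt.read(base) == 0xA0 + i for i, base in enumerate(bases))


print(four_screen_detected(Nametables(True)))
print(four_screen_detected(Nametables(False, "vertical")))
print(four_screen_detected(Nametables(False, "horizontal")))
```

All four writes have to finish before the first read-back. Otherwise a mirrored page would still hold the value just written and the test would pass when it should fail.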


PostPosted: Mon May 06, 2019 4:59 pm

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8367
Location: Seattle
supercat wrote:
Given that the ability to use external NTRAM was part of the design intention of the NES, however, I don't see much reason that a console that can't support it shouldn't be considered "broken".
Regardless of reasoning, my point is just that a game that relies on disabling nametables has very little company, and will be broken on a significant minority of consoles. I strongly suspect that more famiclones have been sold with this brokenness than licensed PAL consoles. (You could also suspect – I think rightly – that more of those famiclones have been discarded due to being manufactured for disposability.)

It's perfectly ok to choose to be incompatible, but that's all it is.


PostPosted: Mon May 06, 2019 6:19 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21392
Location: NE Indiana, USA (NTSC)
A bunch of people including koitsu and myself were discussing this in Kaydus's NESdev Discord server.

Sometimes people try to solve a technical problem with the NES using a custom mapper. The problem often takes this form:

  • NES is just barely too weak for the application.
  • Super NES is too strong, and players would expect production values that a solo or duo can't deliver on time and on a ramen budget.
  • TurboGrafx-16 is just right but too obscure. Goldilocks likes Little Bear's TG16 console but realizes there might not be a chance of actually selling any HuCards.

[Attachment: Goldilocks_and_Little_Bear.jpg]


The NES sits in a sweet spot:

  • 2C02 PPU is capable enough for things to be recognizable but limited enough to be practical for 1 or 2 people making the assets. The former rules out the Atari 2600, the latter the Super NES.
  • The NES has a substantial installed base, meaning enough consoles in the field that people are willing to dig out of their closets and use. This rules out the TG16.

As for famiclone incompatibility, there are probably more Famicom, NES, and famiclone consoles that are compatible with 4-screen than there are TG16 consoles. And even with the engineering needed to build a custom mapper, you might still sell more copies of an NES game than an equivalent TG16 game.

PostPosted: Mon May 06, 2019 7:05 pm

Joined: Sat Feb 12, 2005 9:43 pm
Posts: 11348
Location: Rio de Janeiro - Brazil
And don't forget that most of us do this for fun, not for profit, even when money is involved, so we choose the NES because we love this specific machine, not because we're looking for the ideal retro console to do what we want.

The NES was designed with a pretty versatile cartridge slot, and during the life of the console people designed not only software for the machine, but a multitude of hardware improvements to go along with the games as well. Mapper design has the potential to be pretty fun, provided you know what you're doing and are willing to go through all the steps to get your designs out there and widely supported.


PostPosted: Tue May 07, 2019 7:05 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
tepples wrote:
Sometimes people try to solve a technical problem with the NES using a custom mapper. The problem often takes this form:

  • NES is just barely too weak for the application.
  • Super NES is too strong, and players would expect production values that a solo or duo can't deliver on time and on a ramen budget.
  • TurboGrafx-16 is just right but too obscure. Goldilocks likes Little Bear's TG16 console but realizes there might not be a chance of actually selling any HuCards.


The NES was designed to allow for the possibility of custom hardware in cartridges. The routing of /PPUA13 and CIRAM /CE wouldn't make any sense otherwise. Loading up a cartridge with something that would have been impractical in the day (e.g. using an on-cartridge microcontroller to run all the game logic and simply use DMA to feed the CPU and PPU buses while the main CPU simply executes:
Code:
loop:
    lda patchpoint1
    sta patchpoint2
    jmp loop

as needed to operate the controllers, sound, and sprites) would be cheating, but including hardware that could have been built, but wasn't, isn't.

There is much greater satisfaction in designing a game that pushes the limits of what would have been possible back in the day, than in designing a game which would be considered unexceptional on the target platform. If logic can be fit on a couple of CPLDs, that would suggest that it would probably have cost no more to make than something like the MMC3 chip that was, in fact, considered practical for use in production cartridges, with the provisos that RAM and ROM would have cost significant money. If one needed 128KB+32K of ROM and 8K+8K of RAM to produce a super awesome game that couldn't possibly be done with fewer resources, so be it, but a programmer who used such resources on a game that could have been implemented as 8K+0K of ROM and 0K+0K of RAM would not have been appreciated (though I don't know that any historical games ever went below 32K+8K).


PostPosted: Tue May 07, 2019 7:12 am

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
tokumaru wrote:
The NES was designed with a pretty versatile cartridge slot, and during the life of the console people designed not only software for the machine, but a multitude of hardware improvements to go along with the games as well. Mapper design has the potential to be pretty fun, provided you know what you're doing and are willing to go through all the steps to get your designs out there and widely supported.


The reason I'd like to see a "universal emulator" is to facilitate exactly this. A CPLD-based cart could allow programmers a lot of versatility while being reasonably practical and cheap to manufacture; there's no reason a programmer using such a cart should need to be confined to the limits of existing CPLD fusemaps.

With regard to things like the PowerPak, I would expect that if a cart has a reprogrammable FPGA, it would probably be relatively straightforward to produce a web page that could accept CPLD fusemaps targeting a particular cart, along with any necessary jumper configurations, and produce a Verilog or VHDL implementation thereof (assuming the existence of a clock which is sufficiently fast relative to the CPU and PPU clocks). Essentially, one would simply generate a block which, on each clock edge, would compute the level at each macrocell relative to the current or previous values of all macrocells (the converter should apply logic propagation when possible without loops, and introduce a one-clock delay in each loop that's detected). CPLDs should be sufficiently simple relative to the routing capacity of FPGAs as to allow synthesis of any reasonable design.
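A minimal sketch of that per-clock evaluation, written in Python rather than an HDL. The sum-of-products representation, the `step` function, and the latch example are all hypothetical; no real fusemap format is parsed. It is also the simplest possible variant: every macrocell gets a one-clock delay, rather than propagating combinational logic where there are no loops.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]      # (signal name, True if the signal is inverted)
ProductTerm = List[Literal]     # AND of literals
Macrocell = List[ProductTerm]   # OR of product terms


def step(cells: Dict[str, Macrocell],
         state: Dict[str, bool],
         inputs: Dict[str, bool]) -> Dict[str, bool]:
    """Compute every macrocell's next level from the previous levels."""
    seen = {**state, **inputs}
    return {
        name: any(
            all(seen[sig] != inverted for sig, inverted in term)
            for term in terms
        )
        for name, terms in cells.items()
    }


# Example: a set/reset latch, Q = (S AND NOT R) OR (Q AND NOT R).
cells = {
    "Q": [
        [("S", False), ("R", True)],
        [("Q", False), ("R", True)],
    ]
}

state = {"Q": False}
state = step(cells, state, {"S": True, "R": False})
after_set = state["Q"]
state = step(cells, state, {"S": False, "R": False})
after_hold = state["Q"]
state = step(cells, state, {"S": False, "R": True})
after_reset = state["Q"]
print(after_set, after_hold, after_reset)
```

The feedback term `Q AND NOT R` is the kind of loop that needs the one-clock delay. Here it reads the previous value of `Q` from `state`.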

