#3 is unworkable since most games will at least split the screen somewhere for a status bar -- meaning this approach will fail miserably for those games.
The "logging writes" approach sounds good, until you realize that it doesn't solve the problem of $2002 reads for things like sprite-0 hit... and trying to work those into a "logging writes" system ends up being really hard (I've tried it).
The two ways I've tried with great success are:
1) Effectively your #1. Run the CPU for one cycle, then run all other subsystems enough to catch up. Easy, reliable, but slow.
2) Use the tried-and-true 'catch up' system. Run the CPU until it does something 'interesting' like a register read or write -- then run the appropriate subsystems enough to 'catch up' to the CPU timestamp. IRQs/NMIs, and other things that cut into CPU behavior (like cycles stolen by the DMC), can be predicted... so you'd run the CPU up until the next interesting event (end of frame, or NMI, or whatever).
This can be accomplished with a timestamp system, where you scale all subsystems up to a common time base. On NTSC, you can give your CPU cycles a time base of 3 (every CPU cycle increments the timestamp by 3), and give the PPU a corresponding time base of 1. On PAL, you can use CPU=16 and PPU=5, giving the appropriate 3.2:1 ratio.
For example (on NTSC): you run the CPU for a while, and it writes to $2000 after 100 cycles, putting it at a timestamp of 300. Your $2000 handler will then run the PPU up to that timestamp (effectively running it for 300 PPU cycles) to catch up, then the write is performed, then you continue with the CPU.
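Here's a rough sketch of that $2000 example in Python, just to show the shape of it. All the names (`Ppu`, `run_to`, `write_2000`, `cycles_run`) are made up for illustration, and the "PPU" here only counts cycles instead of rendering anything.

```python
# Timestamp units per cycle, as described above (NTSC 3:1, PAL 16:5).
CPU_BASE = {"ntsc": 3, "pal": 16}
PPU_BASE = {"ntsc": 1, "pal": 5}


class Ppu:
    """Hypothetical stand-in for a PPU: it only tracks how far it has run."""

    def __init__(self, base):
        self.base = base        # timestamp units per PPU cycle
        self.timestamp = 0      # how far the PPU has been emulated so far
        self.cycles_run = 0     # PPU cycles emulated (a real PPU would render)
        self.ctrl = 0           # last value written to $2000

    def run_to(self, target):
        # Run whole PPU cycles until we have caught up to the target timestamp.
        while self.timestamp + self.base <= target:
            self.timestamp += self.base
            self.cycles_run += 1

    def write_2000(self, cpu_timestamp, value):
        self.run_to(cpu_timestamp)  # catch up to the CPU first...
        self.ctrl = value           # ...then perform the write


ppu = Ppu(PPU_BASE["ntsc"])
cpu_timestamp = 100 * CPU_BASE["ntsc"]  # CPU has run 100 cycles
ppu.write_2000(cpu_timestamp, 0x80)
```

Every other register handler would follow the same pattern: catch up first, then do the read or write. That's what makes $2002 reads (sprite-0 hit and so on) come out right without any special logging.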
Tricky part is IRQ/NMI prediction. And the skipped PPU cycle on odd NTSC frames. But those are all workable with some effort.
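To give an idea of what the simplest kind of prediction looks like, here's a sketch for NTSC that works out how far the CPU can run before the next NMI or end of frame. It assumes the frame timestamp starts at dot 0 of scanline 0 and that the VBlank flag goes up at dot 1 of scanline 241. It ignores the odd-frame skipped cycle, NMI latency, and mapper IRQs entirely. The function names are hypothetical.

```python
DOTS_PER_LINE = 341
LINES_PER_FRAME = 262   # NTSC
VBLANK_LINE = 241       # VBlank flag is set at dot 1 of this scanline
CPU_BASE = 3            # NTSC: timestamp units per CPU cycle (PPU cycle = 1)

FRAME_END = DOTS_PER_LINE * LINES_PER_FRAME     # timestamp where the frame ends
NMI_TIME = VBLANK_LINE * DOTS_PER_LINE + 1      # timestamp where NMI would fire


def next_stop(now, nmi_enabled):
    """Timestamp (relative to frame start) the CPU can safely run up to."""
    if nmi_enabled and now < NMI_TIME:
        return NMI_TIME
    return FRAME_END


def cpu_cycles_until(now, stop):
    # Round up, since the CPU can only stop on a whole-cycle boundary.
    return -(-(stop - now) // CPU_BASE)
```

Keep in mind that a $2000 write in the middle of a run can change whether NMI is enabled, so the stop point has to be re-predicted after writes like that.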