Maybe it's obvious, but I'd like to share: When short of frame time, I usually measure (visually, via toggling the grayscale bit) individual, short tasks, and make a list of them and the max amount of scanlines they take, so I can move some of them before the sprite 0 hit wait. That will earn you some precious time after it, and you will be throwing away less cycles just waiting. If your hud is, for example, 32 scanlines tall, I'm sure you can stuff some minor tasks right after VBLANK and before the sprite 0 hit wait.
There are some other things you can do to save time. For example, when checking bullets vs enemies collisions, I tend to run the checks in the enemies processing loop, right after I move them, 'cause I tend to cut short the checks - there are more bullets than enemies, and if a collision is registered, you can cut the loop short by discarding every other possible bullet<->enemy collision after you register one for the current enemy. The total amount of checks seems to be shorter this way (looping all the bullets for every enemy) than the way around (looping all the enemies for every bullet). Or maybe I'm talking nonsense. Also, check first for the condition which is more prone to fail, so you can discard subsequent checks: the screen is wider than taller, so checking for X first is a better idea than checking for Y first. There are more chances that the X check fails so you can move to the next entity.
PS - I'm sure there's a better way to measure the amount of scanlines individual tasks take, but I'm still learning
