First, thanks for the great input.
Yeah, this chip could always use more memory. Instructions are 4 bytes, and the shared RAM is 32kB. There could be another chip in the series, but who knows if/when that will be.
Spin looks OK, but I want to program it in assembly since that's the only way I ever do anything. And actually your assembly code needs to fit in 2kB, which is 512 instructions. Sounds like fun to me. The instruction set is well suited to self-modifying code. Some people have written programs that read out of the shared memory, and something similar to that is how I imagine a 6502 emulator would work.
A multicore 6502 is very practical if each processor can have some private memory and there is some shared memory or device between them. It helps greatly if there is a way to cheaply synchronize them, either through a message (IRQ) or a shared high-speed FIFO. Shared memory alone is rough because, without a CPU-level lock operation, it's hard (but not impossible!) to reliably read and write a mutex or semaphore without creating a race.
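To make the race concrete, here is a small deterministic Python toy (not real multiprocessor code). It plays out one unlucky interleaving of two CPUs going for the same lock flag, first with separate read and write steps, then with a test-and-set that is assumed to be indivisible. The names `SharedFlag`, `naive_interleaving`, and `atomic_interleaving` are made up for illustration.

```python
class SharedFlag:
    """One word of shared memory used as a lock flag (0 = free, 1 = taken)."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def write(self, value):
        self.value = value

    def test_and_set(self):
        # Assumption: the hardware performs this as one indivisible step.
        # It returns the old value and always leaves the flag set.
        old = self.value
        self.value = 1
        return old


def naive_interleaving():
    """Both CPUs read the flag before either one writes it."""
    flag = SharedFlag()
    a_saw = flag.read()  # CPU A sees 0 (free)
    b_saw = flag.read()  # CPU B also sees 0, because A has not written yet
    if a_saw == 0:
        flag.write(1)    # CPU A takes the lock
    if b_saw == 0:
        flag.write(1)    # CPU B takes it too: this is the race
    return a_saw == 0, b_saw == 0


def atomic_interleaving():
    """Same two CPUs, but each uses the indivisible test-and-set."""
    flag = SharedFlag()
    a_got = flag.test_and_set() == 0  # CPU A sees the old value 0 and wins
    b_got = flag.test_and_set() == 0  # CPU B sees 1 and has to retry later
    return a_got, b_got
```

With plain reads and writes both CPUs come away believing they hold the lock. With the test-and-set only one of them can see the old value 0.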
For the emulator I had considered making some portion (or all) of zero page local to the cog, with all the other memory in the main shared RAM. I figured some custom 6502 instructions could be added (as macros in the assembler) to use the native locking instructions.
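A rough Python sketch of that memory split, assuming the whole zero page (addresses 0x00 to 0xFF) is private to each emulated core and everything else goes to a single 32kB shared array. The class name `EmulatedMemory` and the choice to leave addresses past 32kB unmapped are assumptions for illustration only.

```python
ZERO_PAGE_SIZE = 0x100        # 6502 zero page: addresses 0x00..0xFF
SHARED_RAM_SIZE = 32 * 1024   # the 32kB shared RAM mentioned above


class EmulatedMemory:
    """Memory view for one emulated 6502 core."""

    def __init__(self, shared_ram):
        self.zero_page = bytearray(ZERO_PAGE_SIZE)  # private to this core
        self.shared = shared_ram                    # same object for every core

    def read(self, addr):
        if addr < ZERO_PAGE_SIZE:
            return self.zero_page[addr]
        # Addresses at or past SHARED_RAM_SIZE are unmapped in this sketch
        # and raise IndexError.
        return self.shared[addr]

    def write(self, addr, value):
        value &= 0xFF  # the 6502 stores single bytes
        if addr < ZERO_PAGE_SIZE:
            self.zero_page[addr] = value
        else:
            self.shared[addr] = value


shared_ram = bytearray(SHARED_RAM_SIZE)
core0 = EmulatedMemory(shared_ram)
core1 = EmulatedMemory(shared_ram)

core0.write(0x0010, 0xAA)  # zero page: only core0 sees this
core0.write(0x0200, 0x55)  # shared RAM: both cores see this
```

The first 256 bytes of the shared array simply go unused here, which keeps the address arithmetic trivial.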
If you have indexed addressing, then you have a stack. If you also have an instruction to jump to the address in a register, you have a call stack. I thought ARM already demonstrated that (mov pc, lr, anyone?). Or does the Propeller not even have those?
There's no stack, and there's no indexing either. Each 32-bit word (or whatever it'd be called) contains an opcode, source address, and destination address. The remaining bits are quite interesting: some enable or disable the instruction's effect on the status flags, and others make execution conditional on the status flags. So you modify the 'source' field in the instruction to do something like indexing.
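Here is a minimal Python sketch of that source-patching trick. The field layout is an assumption: from the top bit down, a 6-bit opcode, 4 flag-effect bits, 4 condition bits, a 9-bit destination, and a 9-bit source (9 bits is enough to address 512 words). The function names and the opcode value used below are placeholders, not real Propeller encodings.

```python
SRC_MASK = 0x1FF       # low 9 bits: source address (assumed layout)
WORD_MASK = 0xFFFFFFFF


def encode(opcode, effects, cond, dest, src):
    """Pack the fields into one 32-bit instruction word."""
    return (((opcode & 0x3F) << 26)
            | ((effects & 0xF) << 22)
            | ((cond & 0xF) << 18)
            | ((dest & 0x1FF) << 9)
            | (src & SRC_MASK))


def source_of(word):
    """Extract the source field."""
    return word & SRC_MASK


def with_source(word, src):
    """Return the same instruction with only its source field replaced."""
    return (word & ~SRC_MASK & WORD_MASK) | (src & SRC_MASK)


def indexed(word, table_base, index):
    """Emulate indexing: point the source field at table_base + index."""
    return with_source(word, table_base + index)


# Placeholder instruction: opcode 0x28, source initially at the table base.
template = encode(0x28, 0b0010, 0b1111, 0x1F0, 0x100)
patched = indexed(template, 0x100, 5)
```

In a real program the patched word would be written back over the instruction in memory before it executes. The sketch only shows the bit manipulation.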