PostPosted: Tue Jun 12, 2012 1:43 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
Well, technically yes, it appears to save and load SP.

If you want a more in-depth explanation than I can give:
http://en.wikipedia.org/wiki/Setjmp.h#C ... imitations
Quote:
If the function in which setjmp was called returns, it is no longer possible to safely use longjmp with the corresponding jmp_buf object. This is because the stack frame is invalidated when the function returns.
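
To make the quoted limitation concrete, here's a deliberately broken sketch (not from the article): the longjmp targets a frame that has already returned, which is exactly the undefined behavior described above.

Code:
#include <csetjmp>

static std::jmp_buf buf;

void remember() { setjmp(buf); }   // records a context whose stack frame is about to die

void broken()
{
    remember();                    // remember() has returned; its frame is gone
    std::longjmp(buf, 1);          // undefined behavior: jumps into the invalidated frame
}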


To be honest, the setjmp/longjmp version in libco was written by Nach. I'm not a true expert on that.

But ultimately, there's a reason why there are dozens, if not hundreds, of cooperative threading implementations, rather than people just using a simple alloca() + longjmp() pair.

These are true threads. You don't have setjmp+longjmp, you just have switch_to(target); and you can easily have any number of them, switching in any order from any source to any target, and they don't have conditions where they become 'invalidated.'


PostPosted: Tue Jun 12, 2012 7:14 am
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Okay, I think I finally get it @ byuu. I'm starting to agree with you that cooperative threading would be better for an NES emu. This is very interesting.

I might still try it both ways and see which I like better. I'm not entirely ready to rule out preemptive because I still think it might get better performance on multicore machines.

Detecting multicore machines (in theory) is simple with C++11 (std::thread::hardware_concurrency), but I don't think that has been widely implemented yet. I suppose I could fall back to boost for it.
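
A minimal sketch of that detection (the core-count threshold is just an assumption; boost::thread::hardware_concurrency() would be the boost fallback):

Code:
#include <thread>
#include <cstdio>

int main()
{
    unsigned cores = std::thread::hardware_concurrency();   // may return 0 if unknown
    bool usePreemptive = (cores >= 2);                       // threshold is arbitrary
    std::printf("%u hardware threads -> %s threading\n",
                cores, usePreemptive ? "preemptive" : "cooperative");
}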

Allowing for both in the same emu wouldn't be complicated at all. You'd just have to abstract the switching mechanism.

I'll have to play around with it. But this is definitely sounding like a very fun project.


@ tepples: The problem with rendering being enabled/disabled that I was trying to illustrate before is that you have to wrap each cycle in an if statement:

Code:
tick();
if( rendering_on )
{
  // fetch tiles, render pixel normally, adjust scroll, etc
}
else
{
  // render bg pixel, don't fetch anything, no scroll updates
}


Might not seem like a big deal, but if you have to do this for every cycle it will get ugly very fast.

Interestingly, it doesn't look like byuu is doing this in the source he posted. How are you handling mid-scanline rendering disables/enables @ byuu?


PostPosted: Tue Jun 12, 2012 8:10 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21560
Location: NE Indiana, USA (NTSC)
Disch wrote:
The problem with rendering being enabled/disabled that I was trying to illustrate before is that you have to wrap each cycle in an if statement:

Code:
tick();
if( rendering_on )
{
  // fetch tiles, render pixel normally, adjust scroll, etc
}
else
{
  // render bg pixel, don't fetch anything, no scroll updates
}

The PPU can be split into a chain of several stages. One generates indices into $3F00-$3F1F; the operation of this phase varies greatly based on whether or not rendering is turned on. The rest of the stages are fairly self-contained: palette lookup, the monochrome bit, the tint bits, NTSC filtering and/or other scaling, and Zapper light detection. None of these stages depends at all on whether rendering is turned on for a particular dot, and thus they're prime candidates to run in another preemptive thread.
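
A minimal sketch of that split (names and the placeholder stages are assumptions, not tepples' code): the rendering-independent back end takes a scanline's worth of palette indices and can run concurrently with the dot generator.

Code:
#include <array>
#include <cstddef>
#include <cstdint>
#include <future>

using PaletteLine = std::array<uint8_t, 256>;    // indices into $3F00-$3F1F, one scanline
using RgbLine     = std::array<uint32_t, 256>;

uint32_t palette_lookup(uint8_t idx)  { return idx * 0x010101u; }  // placeholder LUT
uint32_t apply_emphasis(uint32_t rgb) { return rgb; }              // monochrome/tint bits

// None of this looks at the rendering flag, so it can run while the
// dot generator is already producing the next scanline.
RgbLine back_end(PaletteLine in)
{
    RgbLine out{};
    for (std::size_t x = 0; x < in.size(); ++x)
        out[x] = apply_emphasis(palette_lookup(in[x]));
    return out;
}

int main()
{
    PaletteLine line{};                                    // produced by the front stage
    auto pending = std::async(std::launch::async, back_end, line);
    // ... front stage generates the next scanline here ...
    RgbLine rgb = pending.get();                           // collect when needed
    (void)rgb;
}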


PostPosted: Tue Jun 12, 2012 8:55 am
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
I don't really see the benefit to moving everything after palette lookup/tint to yet another thread (other than *maybe* whole-image filtering, but that's probably better left to a pixel shader anyway).

Besides, that doesn't solve the issue of one thread (the one actually determining which pixel the PPU is outputting, before palette/tint/etc) having to run two very different flows of logic.


PostPosted: Tue Jun 12, 2012 9:04 am
Joined: Sun Sep 19, 2004 11:12 pm
Posts: 21560
Location: NE Indiana, USA (NTSC)
Disch wrote:
I don't really see the benefit to moving everything after palette lookup/tint to yet another thread (other than *maybe* whole-image filtering, but that's probably better left to a pixel shader anyway).

Which illustrates my point. A pixel shader itself is another thread, albeit one that runs on the GPU if your GPU is sufficiently powerful.

Quote:
Besides that doesn't solve the issue of one thread (the one actually determining which pixel the PPU is outputting, before pallet/tint/etc) having to run two very different flows of logic.

That's the same problem whether you use threads or not. The idea is that you run one flow of logic in a loop until the timestamped event that causes the switch from rendering logic to not-rendering logic or vice versa. If anything, this will be the switch to not-rendering logic at the start of line 240, the post-render scanline.


PostPosted: Tue Jun 12, 2012 9:28 am
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Quote:
That's the same problem whether you use threads or not.


The difference is, with a state machine you can write two entirely different functions, and only call the one which is appropriate.

With a separate thread, you can't make a "rendering disabled" thread and a "rendering enabled" thread, because the whole point is that you want to be able to write the code without having to enter/exit at any cycle.
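
For concreteness, the state-machine side of that looks roughly like this (hypothetical names, not anyone's actual code):

Code:
#include <cstdio>

struct Ppu
{
    bool rendering_on = true;

    void RenderingCycle() { std::puts("fetch tiles, output pixel, update scroll"); }
    void IdleCycle()      { std::puts("output backdrop pixel only"); }

    // Two entirely different functions; the dispatcher just picks one per cycle.
    void RunCycle()       { rendering_on ? RenderingCycle() : IdleCycle(); }
};

int main()
{
    Ppu ppu;
    ppu.RunCycle();               // rendering-enabled path
    ppu.rendering_on = false;
    ppu.RunCycle();               // rendering-disabled path
}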


PostPosted: Tue Jun 12, 2012 11:28 am
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
> I might still try it both ways and see which I like better. I'm not entirely ready to rule out preemptive because I still think it might get better performance on multicore machines.

My preference for cooperative threading is based on dozens of hours of testing each model.

But I'm far from perfect, and I've never actually implemented a preemptive threaded emulator. So I say go for it! Perhaps you'll end up convincing me to implement a preemptive model, too :D

I'd love to see a model that allows cooperative and preemptive as a compile-time switch. It's definitely something I've thought about. Keep in mind again that although the underlying code can likely be identical, there will have to be some differences.

Preemptive means "threads can start and stop on their own," and they may not each get their own core. Make sure when you use preemptive that you have a -real core- for each thread; otherwise fall back to cooperative, which will be faster. You may also want to wrap your atomic accesses. Your preemptive code will have deadlocks and race conditions that the cooperative code won't. But once the code works like a cooperative model, with appropriate locks to maintain synchronization, making it preemptive should be easier.
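
To illustrate the "wrap your atomic accesses" point (the alias and the SCHPUNE_PREEMPTIVE macro are assumptions, not anything from this thread): under the cooperative build a shared timestamp can stay a plain integer, while the preemptive build swaps in std::atomic.

Code:
#include <atomic>
#include <cstdint>

#ifdef SCHPUNE_PREEMPTIVE
template <typename T> using shared_t = std::atomic<T>;
#else
template <typename T> using shared_t = T;     // cooperative: only one thread runs at a time
#endif

struct Timing
{
    shared_t<uint64_t> cpuTimestamp { 0 };    // written by the CPU thread, read by the PPU thread
};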

[In a sense: preemptive was really made for the era where most people had one core and didn't want a hung cooperative program to crash their OS. We really need a new name for the model of one true core per thread. We really don't -want- preemption in this case. Although there's no way to 100% own a core yet, that I know of.]

Just be sure to test context switching heavily with a skeleton emulator first. Understand the speed of each approach before you write a full emulator, so that you know how much overhead is spent there.
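
A skeleton benchmark in that spirit might look like this, using libco's co_create/co_switch (iteration count and stack size arbitrary):

Code:
#include <libco.h>
#include <chrono>
#include <cstdio>

static cothread_t host, worker;
static long switches = 0;

static void worker_entry()
{
    for (;;) {                    // never return from a cothread entry point
        ++switches;
        co_switch(host);          // bounce straight back to the host
    }
}

int main()
{
    host   = co_active();
    worker = co_create(64 * 1024, worker_entry);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000000; ++i)
        co_switch(worker);        // each iteration is a host->worker->host round trip
    double dt = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    std::printf("%ld round trips in %.3f s\n", switches, dt);
    co_delete(worker);
}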

> How are you handling mid-scanline rendering disables/enables @ byuu?

raster_pixel(), raster_sprite(), scroll(xy)_increment(), etc bail out early or blank the pixel when raster_enable() [bg+sprites on] is false.
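
In other words, something along these lines -- not byuu's actual source, just a sketch of the early-bail/blank-pixel pattern with hypothetical helper names:

Code:
#include <cstdint>

struct PPU
{
    bool     bg_enable = false, sprite_enable = false;
    uint32_t output[256 * 240] = {};

    bool     raster_enable() const  { return bg_enable || sprite_enable; }
    uint32_t backdrop_color() const { return 0; }          // palette entry $3F00 in reality

    void raster_pixel(unsigned x, unsigned y)
    {
        if (!raster_enable()) {                            // rendering disabled mid-scanline
            output[y * 256 + x] = backdrop_color();        // blank the pixel: no fetches, no scroll updates
            return;
        }
        // ... normal fetch / priority / scroll path goes here ...
    }
};

int main()
{
    PPU ppu;
    ppu.raster_pixel(0, 0);        // takes the blank-pixel path since rendering is off
}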


PostPosted: Tue Jun 12, 2012 2:11 pm
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Quote:
I'd love to see a model that allows cooperative and preemptive as a compile-time switch


I was even thinking of making it a runtime option, rather than a compile-time switch. If you abstract the thread switching behind a common interface, the only thing you'd have to change is which implementation you're going to use. That can be easily accomplished at runtime.

But of course you'd have to write the code under the assumption that it will be running as preemptive, which might hinder coop performance.

But this is all stuff I want to try out. This is kind of exciting.


Thanks for all the ideas and feedback everyone!


PostPosted: Thu Jun 14, 2012 9:49 pm
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Okay I have a basic plan outlined, but haven't actually tried it out yet.

Context switching is abstracted behind a ContextManager class, with 2 implementations: PreemptiveContext and CooperativeContext.

Since I haven't really looked at byuu's coop lib yet, I kind of made some assumptions about how it worked. Maybe that will bite me in the ass, but I think I should be alright:

(note of course the classes are incomplete (missing ctors), but hopefully the idea is illustrated)
Code:

#ifndef SCHPUNE_CONTEXTMANAGER_H_INCLUDED
#define SCHPUNE_CONTEXTMANAGER_H_INCLUDED

#include <functional>           // for function<bool()>
using std::function;            // or boost::function under the boost fallback

namespace schpune
{

class ContextManager
{
public:
    virtual ~ContextManager()       { }

    virtual void Wait() = 0;
    virtual void WaitUntil(ContextManager* switchto, function<bool()> pred) = 0;
    virtual void Suspend() = 0;
    virtual void Resume() = 0;
    // virtual void Join() = 0;

private:
    ContextManager(const ContextManager&) { }   // no copying
    void operator = (const ContextManager&) { } // no assigning
};

}

#endif // SCHPUNE_CONTEXTMANAGER_H_INCLUDED

Code:

#ifndef SCHPUNE_PREEMPTIVECONTEXT_H_INCLUDED
#define SCHPUNE_PREEMPTIVECONTEXT_H_INCLUDED

#include "contextmanager.h"

namespace schpune
{

class PreemptiveContext : public ContextManager
{
public:
    virtual void Wait()
    {
        mParent->mCV.notify_all();
        if(mSuspended)
            WaitUntil( mParent, [this] () { return !mSuspended; } );
        else
            threadlib::this_thread::yield();
    }

    virtual void WaitUntil(ContextManager* switchto, function<bool()> pred)
    {
        static_cast<PreemptiveContext*>(switchto)->mCV.notify_all();
        threadlib::unique_lock<threadlib::mutex> lock(mMutex);
        mCV.wait( lock, pred );   // wait takes the lock and the predicate as separate arguments
    }

    virtual void Suspend()
    {
        mSuspended = true;
    }

    virtual void Resume()
    {
        mSuspended = false;
    }

private:
    PreemptiveContext*              mParent;
    threadlib::mutex                mMutex;  // note, 'threadlib' is std if C++11 threads supported, or boost otherwise
    threadlib::condition_variable   mCV;
    volatile bool                   mSuspended;
};

}

#endif // SCHPUNE_PREEMPTIVECONTEXT_H_INCLUDED

Code:

#ifndef SCHPUNE_COOPERATIVECONTEXT_H_INCLUDED
#define SCHPUNE_COOPERATIVECONTEXT_H_INCLUDED

#include "contextmanager.h"

namespace schpune
{

class CooperativeContext : public ContextManager
{
public:
    virtual void Wait()
    {
        // TODO: switch to parent
    }
    virtual void WaitUntil(ContextManager* switchto, function<bool()> pred)
    {
        // TODO: while(!pred()) switch to 'switchto'
    }
    virtual void Suspend()
    {
        // do nothing - suspension is not required with cooperative threading
    }
    virtual void Resume()
    {
        // do nothing - suspension is not required with cooperative threading
    }

private:
    // TODO - look up byuu's cooperative threading lib and implement
};

}

#endif // SCHPUNE_COOPERATIVECONTEXT_H_INCLUDED




For purposes of this discussion, there are 3 threads: NES, CPU, and PPU. The NES is the one the user will be interfacing with, and the CPU/PPU will only be referenced inside of the NES.

PPU's context parent is CPU
CPU's context parent is NES
and NES's context parent is null.

The intended flow is like so:

When the NES is called to emulate a frame:
- Call CPU::FrameStart (not shown above)
- WaitUntil( CPU_context, cpu_is_finished_with_frame && ppu_is_finished )


CPU logic on FrameStart:
- Resume current context
- Resume PPU context

CPU logic (in CPU's thread):
- emulate up to a given timestamp
- on writes where it's necessary to sync with PPU, WaitUntil( PPU_Context, ppu_is_caught_up )
- when frame is complete, Suspend()


PPU logic:
- emulate up to CPU's current (realtime) timestamp
- if PPU catches up to CPU, Wait()
- when frame is complete, Suspend()




Additionally, the Suspend/Resume bits can be used for implementing a debugger with breakpoints (when a breakpoint is hit, Suspend() all contexts except for NES's)



But like I said it's so far untested. Against byuu's advice I'm kind of jumping right in with this rather than doing smaller tests first. It's mostly an educational/experimental project anyway, so if it doesn't turn out I won't feel so bad.

Anyway, thought you guys might be interested. If you spot any problems with the format let me know. Other feedback/critique also welcome.


PostPosted: Thu Jun 14, 2012 10:29 pm
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
Well, here's the entire libco API:

Code:
void* co_active();
void* co_create(unsigned stack_size, void (*entrypoint)());
void co_switch(void *target);
void co_delete(void *target);


I know it's really huge and complicated ... but bear with me here :P

co_active() returns a handle to the active thread. This works even for the main program thread (and is in fact the only way you can jump back to it later; store its handle when your program starts).

co_create() makes a new stack, with however many bytes you want. 1M or less is more than enough; C++ apps usually allocate 512K or 1M. Go too small (256 bytes or something absurd) and expect crashes. Do not return from the entrypoint; that won't destroy the thread. It can't, because that thread is active.

co_switch() will save the context in the active thread, then load the context from the target thread, and resume processing there. Don't switch to the already active context. It'll probably be a no-op, but don't be stupid.

co_delete() will free the memory of a created context. Do not try and delete the main context, do not try and delete the current context. If you do, things will blow up, and you will deserve it.

libco doesn't have any concept of parent<>child relationships. They are all siblings. It also doesn't pass parameters to entry points, and there is no scheduler built in. You can do all of this with wrapper code that you create. (Note: parameters to entry points are easy; use a std::map<cothread_t, ParameterList> keyed by the thread handle.) The idea is for you to do things your way, and leave libco simple and easy to port to other targets. There's also a handy-dandy typedef void* cothread_t; ... you should NEVER stab at the raw memory, of course.
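
For example, a parameter map along those lines might look like this (the ThreadParams contents and the helper function are made up for illustration):

Code:
#include <libco.h>
#include <map>
#include <string>

struct ThreadParams { std::string name; int clockDivider; };

static std::map<cothread_t, ThreadParams> params;

static void cpu_entry()
{
    const ThreadParams& p = params[co_active()];   // look up this thread's own parameters
    (void)p;                                       // ... run the CPU using p.clockDivider ...
}

cothread_t make_cpu_thread()
{
    cothread_t t = co_create(512 * 1024, cpu_entry);
    params[t] = { "cpu", 12 };                     // safe: the thread doesn't run until co_switch(t)
    return t;
}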

The way I do my scheduler:

cothread_t host; //this is the thread that the GUI runs in
cothread_t active; //this is the currently executing -emulation thread-
cothread_t cpu, ppu, apu;

On reset, co_delete() all threads; co_create() all threads [gets us to entry point on all threads]; active = cpu; reset all variables in all classes
When you call Emulator::runForOneFrame(), it does co_switch(active), and then the core keeps switching between threads as it needs to.
Once an entire frame has been rendered, it does active = co_active() [so it knows where to re-enter], co_switch(host), and the host thread calls videoRefresh(ppuVideoData); and returns.
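
Put together as code, that scheduler is roughly the following (a sketch, not byuu's actual source; the stub entry points just hand a "frame" straight back):

Code:
#include <libco.h>
#include <cstdio>

static cothread_t host, active;
static cothread_t cpu, ppu, apu;

// Stand-ins for the real cores: at the end of a frame a core records where
// to re-enter and switches back to the host, exactly as described above.
static void frame_done() { active = co_active(); co_switch(host); }
static void cpu_run() { for (;;) { /* emulate CPU, co_switch(ppu/apu) as needed */ frame_done(); } }
static void ppu_run() { for (;;) { frame_done(); } }
static void apu_run() { for (;;) { frame_done(); } }

void power_on()
{
    cpu = co_create(512 * 1024, cpu_run);
    ppu = co_create(512 * 1024, ppu_run);
    apu = co_create(512 * 1024, apu_run);
    active = cpu;                       // first entry into emulation goes through the CPU
}

void run_for_one_frame()
{
    host = co_active();                 // remember the GUI thread
    co_switch(active);                  // cores switch among themselves until a frame is done
    std::printf("frame complete\n");    // here the host would call videoRefresh(...)
}

int main()
{
    power_on();
    for (int i = 0; i < 3; ++i) run_for_one_frame();
}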

You're free to do things however you want, of course.


PostPosted: Fri Jun 15, 2012 7:00 pm
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
I figured it would be simple like that. Very nice.

I'm going to give preemptive a run first and see how it goes. Then I'll work on getting your coop lib in there (should be VERY easy to drop in, I've already abstracted around it). Once both are in, side-by-side comparisons will be easy. I gotta say I'm very curious as to how each approach is going to go.


PostPosted: Sun Jun 17, 2012 4:46 pm
Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
Dwedit wrote:
Really? Longjmp is just a PC reassignment? Never seen that before. On Newlib for ARM, longjmp swaps all registers, including the stack pointer, but not r0-r3. I'm not as familiar with other implementations of setjmp/longjmp.

No, you're correct, it's not just PC reassignment. longjmp saves the current registers on the stack, then swaps PC with another thread's PC and restores its registers off the stack. This is what libco does as well. longjmp()-based threading can work on some platforms, since it's doing the same thing, though it's undefined behavior since you're using longjmp() in an illegal way. The main trick is getting each thread a separate stack. "Portably" you can allocate a large char array on the stack, as you mentioned, then divide it up with a recursive function with a smaller char array on its stack, noting where each stack starts.

One snag is that it doesn't preserve locals that live in registers, which means local variables not declared volatile may lose changes made to their values between the setjmp() and the longjmp(). byuu's library preserves all local variables even if not declared volatile.


PostPosted: Tue Jun 19, 2012 8:00 pm
Joined: Wed Nov 10, 2004 6:47 pm
Posts: 1849
Well, I got preemptive and cooperative multithreading in as a compile-time switch now. Can't speak to performance on either one of them yet, as my preemptive implementation still has a few kinks I need to work out.

libco was EXTREMELY easy to drop in, though. Took me all of 10 minutes.

Kudos, byuu.


In other news, basic CPU/PPU are working. No sprites or attributes currently. I'll probably do sprites next.


PostPosted: Wed Jun 20, 2012 9:30 pm
Joined: Mon Mar 27, 2006 5:23 pm
Posts: 1518
Awesome job! Can't wait to see cooperative vs preemptive numbers on the full emulator. Especially on quad cores and higher. If it's a huge improvement, I may want to ask for your help in a wrapper library to allow cooperative or preemptive as a switch, so I can use it too :D


PostPosted: Thu Jun 21, 2012 9:03 am
Joined: Mon Mar 06, 2006 3:42 pm
Posts: 94
Location: Montreal, Canada
blargg wrote:
[full quote of the previous post snipped]


Wow, it's been a long time since I dropped by here, and I'm so happy to see the same crowd still obsessed with this stuff. :lol:

An important note about the co-operative multithreading: if you allocate your own stack area (as libco's co_create does) then you MUST BE CAREFUL not to make any OS calls while running that co-thread! Some of them may work, but there is no guarantee. On at least some versions of Windows (95 through XP at least, I think), any Win32 API function that goes into kernel space will check that the user-mode stack pointer is inside the stack area that Windows allocated for the thread. If you have allocated your own stack space elsewhere and set the stack pointer to it, it will raise an SEH exception or something. (I encountered this ages ago while working on a Win32 emulator for a mobile device which ran a cooperatively-threaded OS.)

I don't know if this still applies to newer versions of Windows (Vista+) or to other systems like OSX, but the safest thing to do is never make OS calls while your stack pointer has been swapped out to some other stack area besides the OS-allocated stack supplied for that OS thread. To be safe, all you have to do is use the original host cothread (from co_active) for running the GUI and use your other co_create-d cothreads just for emulator code. Any file access, etc. should also be done from the host co-thread.
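
A sketch of that "post the request, let the host do it" pattern (none of this is from the post; the save-file example is made up):

Code:
#include <libco.h>
#include <cstdio>
#include <string>

static cothread_t host, emuThread;
static std::string pendingSavePath;      // I/O request posted by the emulated side
static bool frameDone = false;

static void emu_entry()
{
    for (;;) {
        // ... emulation on the co_create()-d stack, no OS calls here ...
        pendingSavePath = "save.srm";    // need file I/O: post a request instead
        co_switch(host);                 // host performs it on the OS-provided stack
        frameDone = true;                // resumed after the write; keep going
        co_switch(host);
    }
}

int main()
{
    host      = co_active();
    emuThread = co_create(512 * 1024, emu_entry);

    while (!frameDone) {
        co_switch(emuThread);
        if (!pendingSavePath.empty()) {                  // service any pending I/O here
            if (std::FILE* f = std::fopen(pendingSavePath.c_str(), "wb"))
                std::fclose(f);                          // real write omitted
            pendingSavePath.clear();
        }
    }
    co_delete(emuThread);
}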

Also: I am a big fan of cooperative multithreading for emulators, and I am highly skeptical of this preemptive multithreading idea you guys are discussing. Preemptive thread switches are thousands of times more expensive than a libco co-operative switch, and Windows thread switching has all sorts of gotchas that will surprise you.

Read this post, for example. High-performance multi-threaded programming is painful and complicated. Here's an index of that guy's posts on the topic.

Be EXTREMELY careful if you decide to try and implement your own lock-free data structures or primitives! This is a subtle art filled with unexpected gotchas, and you will waste weeks or months trying to get this right (or more likely, you won't get it right and there will be one-in-a-million crash bugs in every future version of your preemptive emulator).

And be even more careful if you decide to use somebody else's. DO NOT trust any lock-free implementations you find on the Internet or even in academic research papers--nearly all of them have bugs or race conditions, or make assumptions about memory-ordering effects that are not always true on real hardware, etc. In my day job as a game engine programmer, I've watched an extremely smart programmer develop a set of working lock-free primitives for our game engine. It was a six-month process: he would be positive he had a working implementation, and for around a month it would apparently work, until finally a rare-and-weird crash would be traced back to the lock-free stuff, he'd find yet another extremely obscure and impossible-to-reproduce race condition that none of us had thought of, and he'd redesign the thing yet again. This happened at least five times, and every single published paper we read about lock-free data structures contained errors of this type. Believe me, lock-free stuff can be a real minefield and sensible programmers should stay the hell out of it! ;)

