It is currently Sat Jun 23, 2018 6:41 am

All times are UTC - 7 hours





Post new topic Reply to topic  [ 24 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Sun May 06, 2018 5:18 pm 
Offline
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3339
Location: Nacogdoches, Texas
With how just about every modern processor is multi core, this is something that should be trivial to find, but it isn't for whatever reason. :| From what I've seen, most every multi core processor has at least two levels of cache; L1 is for each core, and L2 is shared by all of them. What I'm wondering, is how is L2 shared? Assuming we're talking about a dual core processor, is it that even cycles go to one core, and odd cycles go to the other? Or does each processor generally handle this differently? Of course, there's the issue of general external ram then; can each core even interact with ram directly, or must it perform a dma transfer to move the data in and out of cache in chunks?


Top
 Profile  
 
PostPosted: Mon May 07, 2018 12:10 am 
Offline
User avatar

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7438
Location: Chexbres, VD, Switzerland
Quote:
Assuming we're talking about a dual core processor, is it that even cycles go to one core, and odd cycles go to the other?

No. Only L1 cache is accessed randomly, L2 cache is read in bursts, which are copied to L1 caches before being acessed randomly, and then when the data is no longer needed it's written back to L2 cache (or further down). So basically when either core needs data which is not in L1 cache, it's code execution halt and there's a burst read from L2 to L1 before it can resume it's activity. I suppose that if the other core needs data at the same time it has to wait the burst from the other core then it's own burst before continuing. This only work because execution can be so much faster after the data is here in L1 cache.

This is an extremely complex topic which is subject to research at Master and PHD levels, and actually this is the source of the infamous spectre and meltdown vulnerabilities.

Quote:
Of course, there's the issue of general external ram then; can each core even interact with ram directly, or must it perform a dma transfer to move the data in and out of cache in chunks?

Multiple models of caching exists and several have advantages and drawbacks. For instance, write-through cache is simpler because writes are performed to cache and RAM simultaneously, while only reads are truly cached, this is simpler and avoids the problem of data in RAM being out of sync with reality (note, this is really a huge problem when having multiple cores), however it really halfs the efficiency of caching in the 1st place.


Top
 Profile  
 
PostPosted: Mon May 07, 2018 5:37 am 
Offline

Joined: Tue Feb 07, 2017 2:03 am
Posts: 449
the Cache is transparent to the CPU, the CPU just says "I want address X", it never engages in cache management. So if its in L1/L2/L3 RAM makes no difference to the CPU.

The cache managers deal with what pages should and shouldn't be swapped in and out of each cache. This has a variety of techniques and methods over the years, based upon co-locality and recent usage hits etc. The X86 actually form read and flush lanes ( at at least they use to, might have changed by now) and once a lane gets full, it flushes the data out and down. There are way to command the cache controllers to force a RAM read, with special addresses and commands etc, its only something you need to do very rarely, I think I've done it a whole 5 times in my career.

There is nothing to stop both CPUs trying to write the same RAM at the same time, this is known as a race condition, while the logic for the RAM will have a priority system to handle such clashes, the cache logic is unable to determine which data is the right one. This is called a Race Condition, and the onus is on the programmer to ensure that none happen. Ensuring this is what makes multi-core or even multi-thread programming very tricky.

the ARM multi cores are typically not connected via Cache and you have to use special "pigeon hole" transfers to get data from one to the other..


Top
 Profile  
 
PostPosted: Mon May 07, 2018 3:05 pm 
Offline
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3339
Location: Nacogdoches, Texas
Bregalad wrote:
this is the source of the infamous spectre and meltdown vulnerabilities.

I thought it was due to some sort of out of execution mumbo jumbo, unless that's what this is.

Bregalad wrote:
write-through cache is simpler because writes are performed to cache and RAM simultaneously

So there's no discrepancy between cache and ram I guess?

Oziphantom wrote:
the Cache is transparent to the CPU, the CPU just says "I want address X", it never engages in cache management. So if its in L1/L2/L3 RAM makes no difference to the CPU.

Interesting... Do you have any idea where cache and ram are located in the address space, unless I'm not understanding this correctly?

Oziphantom wrote:
There is nothing to stop both CPUs trying to write the same RAM at the same time, this is known as a race condition, while the logic for the RAM will have a priority system to handle such clashes

Wouldn't it be the logic for the memory controller managing the processor's FSB?



So what I'm getting from all of this is that to the programmer, a multi core processor isn't much different than a single core?


Top
 Profile  
 
PostPosted: Mon May 07, 2018 3:38 pm 
Offline

Joined: Sun Mar 27, 2011 10:49 am
Posts: 252
Location: Seattle
Cache doesn't live in the address space. When the CPU executes a memory read, it checks to see if the address(es) it's accessing are in the cache. If it is, it fetches the data from cache; if not, it loads the data from RAM into the cache in order to fetch it. From the programmer's perspective, the only real difference is that one is way slower than the other. Also, things can get weird if different cores are reading/writing to memory at the same time, because operations can happen out of order and the CPU won't necessarily write back data written on one CPU's cache to main memory before another reads it. Maybe give this paper a gander if you're curious about that.

There's no standard for where memory is mapped into the physical address space, AFAIK. The BIOS/EFI tells the OS kernel that.

Applications get a virtual address space. The OS configures the CPU so that when an application tries to read from a certain address, it maps onto certain physical addresses - or doesn't map onto anything, which triggers an interrupt (page fault) that lets the OS, say, load the desired data out of swap on the hard disk, or crash the program for its bad access. The OS dynamically allocates physical address space for processes, so they can be basically anywhere in physical memory, and they don't really know anything about where.

Even a process' virtual address space tends to be dynamically arranged on modern OSes (see address space layout randomization).


Top
 Profile  
 
PostPosted: Mon May 07, 2018 3:50 pm 
Offline

Joined: Fri Jul 04, 2014 9:31 pm
Posts: 926
(ninja'd, but oh well)

Cache isn't a separate address space; it's just a way for the CPU to remember what the data (or code) at some address was last time it saw it, so it doesn't incur the latency of hitting RAM every time it needs it. At least, that's the idea for reads; writes are a similar principle in that it's faster to pretend to operate on RAM and then just have the cache make sure RAM is actually updated at some point.

The Super FX has an instruction cache. Normally, instruction execution is 3 cycles per byte in 10.7 MHz mode, or 5 in 21.4 MHz mode (plus any internal processing that happens after the instruction is loaded, such as with multiplication), and it has to wait for data loads through the ROM buffer (or reads/writes/pixel plotting through the RAM buffer, if you're executing from RAM. Don't execute from RAM if you can avoid it). But if you throw in a cache instruction, the GSU starts copying executed code to the instruction cache, so the next time you loop through that code it's only one cycle per byte and loads in parallel with both bus buffers. There's no way for the programmer to access the instruction cache specifically (at least, not with the GSU - it's mapped to the A bus for some reason, so the S-CPU can write to it); it just does its thing in the background. The only difference is that execution magically gets faster.

Modern CPUs have data caching as well as code caching, write as well as read, and the specifics are far more sophisticated. But the basic idea is the same, that being that it's transparent to the programmer. (Also, I suspect the use of an explicit start-caching-here instruction is archaic.) This is apparently why some people object to the term "cache" being used for TMEM on the N64, because unlike the PSX's texture cache, you have to explicitly load data into TMEM in order to use it.


Last edited by 93143 on Mon May 07, 2018 4:47 pm, edited 1 time in total.

Top
 Profile  
 
PostPosted: Mon May 07, 2018 4:47 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 6348
Location: Canada
93143 wrote:
Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.


Top
 Profile  
 
PostPosted: Mon May 07, 2018 4:55 pm 
Offline
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3339
Location: Nacogdoches, Texas
93143 wrote:
This is apparently why some people object to the term "cache" being used for TMEM on the N64, because unlike the PSX's texture cache, you have to explicitly load data into TMEM in order to use it.

I'd have thought this would be irrelevant; what's the textbook definition of cache in computer science? It'd have thought it'd just be "smaller but faster ram."

rainwarrior wrote:
93143 wrote:
Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.

You just reiterated what he said. :lol:

Anyway, if the processor itself does so much abstracting already, then do you still need to program for each core specifically, or does the processor somehow handle that during runtime as well?


Top
 Profile  
 
PostPosted: Mon May 07, 2018 5:06 pm 
Offline

Joined: Fri Jul 04, 2014 9:31 pm
Posts: 926
rainwarrior wrote:
93143 wrote:
Also, I suspect the use of an explicit start-caching-here instruction is archaic.
It's not archaic. Cache control instructions are very useful in high performance development.

Interesting. Is that the only thing I got wrong?

Espozo wrote:
You just reiterated what he said. :lol:

No, he... OH SNAP

Quote:
Anyway, if the processor itself does so much abstracting already, then do you still need to program for each core specifically, or does the processor somehow handle that during runtime as well?

I've used MPI, and you do need to pay attention. Even in C++, there's a bunch of functions involving broadcasting data, communicating between specific processors, checks to see which processor this particular instance of the code is running on, and blocking execution at checkpoints until all processors have finished what they need to get done before the next step starts. It's all scalable to any number of processors, but it's not entirely transparent to the programmer.

To be fair, the version I've been using is over ten years old...


Top
 Profile  
 
PostPosted: Mon May 07, 2018 5:16 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 6348
Location: Canada
Espozo wrote:
93143 wrote:
This is apparently why some people object to the term "cache" being used for TMEM on the N64, because unlike the PSX's texture cache, you have to explicitly load data into TMEM in order to use it.

I'd have thought this would be irrelevant; what's the textbook definition of cache in computer science? It'd have thought it'd just be "smaller but faster ram."

Cache is a more generic term, it's not memory-specific. A cache optimization stores some result and reuses it, instead of having to do it the "long" way. That long way could refer to a slow main memory fetch, or it could be some kind of computation, etc.

For instance, a search engine will cache recent search results so that when someone else asks for the same thing, they can just reuse that result instead of doing the whole search process again.

Outside the computer context a cache is just a place where you store stuff, or a collection of stored stuff. In computer science "cache" means "use a cache as an optimization". Store something to save effort.

Espozo wrote:
rainwarrior wrote:
93143 wrote:
Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.

You just reiterated what he said. :lol:

I don't think I did, but I guess 93143 really just meant that the SuperFX's mechanism is archaic, or maybe even more specifically that the SuperFX's instruction caching doesn't happen unless you use an instruction to enable it? Modern CPUs do most of their caching in an automated way, without programmer or compiler intervention. For high performance code, though, and occasionally for other reasons, you have some explicit control instructions that can be taken advantage of. Ultimately, you can know more about your code and data than the compiler or CPU does; in the right place doing it "by hand" will outperform the automated version.

Espozo wrote:
Anyway, if the processor itself does so much abstracting already, then do you still need to program for each core specifically, or does the processor somehow handle that during runtime as well?

I think controlling which thread goes on which core is normally the operating system's job. It's got to multitask all running programs, not just yours. If you have 4 cores and create 4 threads and do heavy activity on each of them, the OS will generally automatically spread that load over the 4 cores. That's its job, and it will normally manage to do it fairly well. (Threads can move across cores, too. They don't have to stay where they began.)

...but again, there's ways to explicitly control this too. You can ask an OS how many cores you have, and you can tell it to put a thread on a specific core. On a console where you're not really multitasking in the same way, and the computer architecture is guaranteed, it's often very appropriate to do this. It's a refinement that can be used if needed.

See: Wikipedia: Processor Affinity


Top
 Profile  
 
PostPosted: Mon May 07, 2018 5:28 pm 
Offline
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3339
Location: Nacogdoches, Texas
rainwarrior wrote:
I don't think I did

It was a poor joke; I was saying that "high performance development" was another way of saying "archaic." Never mind. :lol:

Yeah, for running multiple programs, the operating system definitely can spread the load onto multiple cores (a program or two per core), but I don't know how it would do this for a single intensive program that wants to use the system's full resource; I wouldn't think you could just generically divide up the load, not only because how would you even divide it in the first place, but because obviously programs are going to assume things are going to be processed linearly.


Top
 Profile  
 
PostPosted: Mon May 07, 2018 6:40 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 6348
Location: Canada
Espozo wrote:
Yeah, for running multiple programs, the operating system definitely can spread the load onto multiple cores (a program or two per core), but I don't know how it would do this for a single intensive program that wants to use the system's full resource; I wouldn't think you could just generically divide up the load, not only because how would you even divide it in the first place, but because obviously programs are going to assume things are going to be processed linearly.

If you want to control it, you query the OS about how many cores there are, create a thread that's bound to each core, and run what you want to on each thread.

If a program doesn't create multiple threads, it will only ever use one core. The OS can't split it up.

Though, you could just create a few threads, assign them independent work as you can, and let the OS sort out balancing them for you. That's probably easier to write, and will translate to a variety of systems better. Doing it manually is a little more common on consoles where you can make a lot stronger assumptions about the architecture.


Top
 Profile  
 
PostPosted: Mon May 07, 2018 6:52 pm 
Offline

Joined: Fri Jul 04, 2014 9:31 pm
Posts: 926
rainwarrior wrote:
I guess 93143 really just meant that the SuperFX's mechanism is archaic, or maybe even more specifically that the SuperFX's instruction caching doesn't happen unless you use an instruction to enable it?

Kinda, but I also didn't know that manual was an option on modern CPUs. I never dive deeper than C++ on modern processors.


Last edited by 93143 on Tue May 08, 2018 2:49 am, edited 1 time in total.

Top
 Profile  
 
PostPosted: Mon May 07, 2018 7:24 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 6348
Location: Canada
The main case I've seen it used for performance is just that the automated caching only helps with repeated use of the same block, because it doesn't know what memory blocks to cache until the CPU actually uses it for the first time. Since a cache fetch can happen in parallel with other execution (that doesn't depend on it), you can give the CPU advance notice to have the block cached before your code needs it, and bypass that first fetch penalty altogether.

There are a bunch of other reasons to use manual cache control, for threading or communication between devices. The XBox 360 GPU shared some memory with the main CPU, but they had separate caches that didn't know about each other. If you didn't manually flush the cache, there was no guarantee about when that data would migrate to the GPU. I once had a weird bug because of this where a character would load in as a ball of random spikes that would slowly turn into the character as you kept playing.


Top
 Profile  
 
PostPosted: Mon May 07, 2018 7:47 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 7227
Location: Seattle
(x86 has had the PREFETCH instruction since pentium 3 days.)


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 24 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 5 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group