It is currently Wed Dec 12, 2018 4:00 am

All times are UTC - 7 hours





Post new topic Reply to topic  [ 43 posts ]  Go to page 1, 2, 3  Next
PostPosted: Wed Nov 28, 2018 10:41 pm 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3404
Location: Richmond, Virginia
Well, I finally got around to learning C, and while I think I have a good understanding of how most features get translated into machine code, I either don't understand how multidimensional arrays work, or I do and they're really just as slow as I think they are. It appears to me that, say, in a two-dimensional array, to generate the offset, the processor would multiply the number in the left set of square brackets by the length of a row, then add this to the number in the right set of square brackets? I figure this can't actually be what's going on, because this is stupidly inefficient for anything other than a single random access.

Kind of a shame there doesn't seem to be anything between C and assembly. Cross compatibility is great, but not at the cost of an unnecessarily large performance loss.


PostPosted: Wed Nov 28, 2018 10:49 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 7817
Location: Seattle
Why do you think integer multiplication is meaningfully slow on modern computers? (By which I mean everything newer than the pentium 2)


PostPosted: Wed Nov 28, 2018 11:06 pm 
Offline
User avatar

Joined: Thu Mar 31, 2016 11:15 am
Posts: 441
It takes more time to read from memory nowadays than it does to multiply two numbers together. Besides, most of those multiplications will get optimized by the compiler.

Quote:
Cross compatibility is great, but not at the cost of an unnecessarily large performance loss.

You don't know enough to hold such an opinion :) It's okay; just learn a bit more. I think you'll see things differently when you do.


PostPosted: Wed Nov 28, 2018 11:16 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7011
Location: Canada
Drew Sebastino wrote:
...or I do and they're really just as slow as I think they are.

If you want to know how fast or slow they are, time it.

However...
Drew Sebastino wrote:
It appears to me that, say, in a two-dimensional array, to generate the offset, the processor would multiply the number in the left set of square brackets by the length of a row, then add this to the number in the right set of square brackets?

What are you comparing against? What's the alternative here? If you were to time how "slow" this thing is or isn't, what is the alternative that you would also time and compare with?

If you've got something that's conceptually a 2D grid, what's your alternative way of dealing with it? You need to turn 2 coordinates into an address. How else do you propose to do that?

You should also understand that in most C compilers (even going back 20 years) these 3 things would generate exactly the same code:
Code:
a[(y<<8)+x] // 1D array
a[(y*256)+x] // 1D array
a[y][x] // 2D array


Also, like others said above, we have pipelines and latency such that the relevant instructions are usually executed while waiting for other things like memory access anyway.


PostPosted: Wed Nov 28, 2018 11:47 pm 
Offline
User avatar

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3722
Location: Mountain View, CA
lidnariq wrote:
Why do you think integer multiplication is meaningfully slow on modern computers? (By which I mean everything newer than the pentium 2)

Going off of what I know of the OP: probably because when it comes to CPU architectures and assembly, all he's familiar with is the 65816 -- the final member of the 65xx family, which never had native multiplication/division instructions for integers of any size**. What's easily unknown to "younger folks" is that other CPUs from the era the 6502 came out in also lacked such capability (read: Intel 8080, Zilog Z80). The only exceptions I know of from that same time frame are the VAX-11 and IBM's System/360 mainframe (a bloody behemoth), neither of which was a microprocessor (they were a minicomputer and a mainframe, with CPUs that were physically enormous). Today people are "spoiled" by mul/imul/div/idiv (x86, starting with the 8086), muls/mulu/divs/divu (68K), or whatever ARM offers (AFAIK, multiplication only up until roughly ARMv6 or v7, I forget, which added division).

I mention age because when I started doing x86 (80286/80386) for the first time in 1993 or so -- I was 16 years old -- the native imul/idiv instructions were a godsend. So was having more than 3 registers (particularly having more than one general-purpose accumulator), subsequently followed by 32-bit registers and linear addressing (real mode/segmented memory is AWFUL). I believe the OP is equally young and may be equally unaware.

** -- I have to say this because some pedant will come along and pee in my corn flakes blabbing about bit shifts.


PostPosted: Thu Nov 29, 2018 1:40 am 
Offline

Joined: Tue Oct 06, 2015 10:16 am
Posts: 845
Sample values for a Phenom II: integer multiplication 1-4 cycles, RAM access 121+ cycles.


PostPosted: Thu Nov 29, 2018 2:19 am 
Offline

Joined: Fri Jul 04, 2014 9:31 pm
Posts: 991
Which is also (so I hear) why it's good to know whether the language stores multidimensional arrays in row-major or column-major order, because if you write your inner loop so that it increments the wrong index, you can end up skipping all over memory and loading yourself down with cache misses.

Maybe a modern compiler would optimize that out; I don't know. It would be quite a trick to optimize it out if you have multiple loops that don't use the same order...
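For reference, C is row-major (the rightmost index varies fastest in memory), so the inner loop should walk that index. A sketch of the two orderings (sizes and function names are arbitrary):

```c
#include <stddef.h>

#define ROWS 512
#define COLS 512

static int grid[ROWS][COLS];

/* Cache-friendly: the inner loop touches consecutive addresses. */
long sum_row_major(void)
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

/* Cache-hostile: each inner step jumps COLS * sizeof(int) bytes. */
long sum_column_major(void)
{
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}
```

Both return the same result; only the access pattern differs, and on arrays much larger than the cache the column-major version can be several times slower.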


PostPosted: Thu Nov 29, 2018 4:37 am 
Offline

Joined: Sat Jun 01, 2013 11:55 am
Posts: 33
Location: Maine, U.S.A.
There is a slight advantage to manually designing the array, say to pre-compute a position in an array for later use, if it doesn't change throughout an operation.

Example:

Code:
const int timings[5*8] = {...};   /* 8 frequencies, 5 frames per row */
int frame;
int ticks;
int current_frequency;
const int *current_timings;

void set_freq(int freq)
{
  current_frequency = freq;
  current_timings = timings + freq*5;   /* row start, computed once */
}

void tick_frame()
{
  ticks += current_timings[frame];   /* instead of timings[frame + freq*5] */
  ++frame;
}


As for multiplication versus shifts or tables for performance, it depends on the CPU/RAM being used. PowerPC had particularly fast arithmetic compared to its slow system RAM, and today's chips can execute several instructions per cycle.

The compiler tends to take care of these things well enough that manual improvements are trivial in most cases. If there's anything I've learned, it's that it's better to write code the way you like or understand it, so you can get to the prototype stage faster. Time/profile it afterward, if it matters. Making optimizations can be particularly time-consuming.


PostPosted: Thu Nov 29, 2018 8:55 am 
Offline
User avatar

Joined: Tue Jun 24, 2008 8:38 pm
Posts: 2124
Location: Fukuoka, Japan
@Koitsu
Yes, some bits can shift in your corn flakes... what were we talking about, condiments? :lol:

@Drew Sebastino

On a serious note: unless you're targeting a very specific CPU with low specs, or you're just starting to learn plain C, you don't have to worry about those things yet. Those are the kinds of things you worry about later, once you have a specific bottleneck and more experience on the subject. These days you don't have to worry much about them anymore.


PostPosted: Thu Nov 29, 2018 9:29 am 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20854
Location: NE Indiana, USA (NTSC)
Banshaku wrote:
@Koitsu
Yes, some bits can shift in your corn flakes... what were we talking about, condiments? :lol:

I assume koitsu was referring to my habit of pointing out edge cases, particularly if they are believed not to change the conclusion meaningfully. Shift instructions are good for multiplication and division by powers of 2 but not for arbitrary multipliers and divisors.

Edge case II: Digital eventually turned the VAX into a microprocessor several years later (CVAX, 1987). But by then, the market for microprocessors with a PDP-11-inspired 32-bit ISA was favoring Motorola's 68K, and a few years later Digital pivoted to Alpha.

Banshaku wrote:
On a serious note, unless you target a very specific CPU that have low specs

Such as 6502, LR35902, or what have you.


PostPosted: Thu Nov 29, 2018 10:06 am 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3404
Location: Richmond, Virginia
lidnariq wrote:
Why do you think integer multiplication is meaningfully slow on modern computers? (By which I mean everything newer than the pentium 2)

Well, it's not like it's never been "meaningfully slow", but it'll always be slower than integer addition, I'd imagine. Not to mention it still has to do an extra addition to generate the offset into a multidimensional array that it wouldn't have to do otherwise.

pubby wrote:
It takes more time to read from memory nowadays than it does to multiply two numbers together.

Even if in cache? (Depends on what level of cache, but you get the point)

rainwarrior wrote:
If you want to know how fast or slow they are, time it.

With what, a stopwatch? :lol:

rainwarrior wrote:
If you've got something that's conceptually a 2D grid, what's your alternative way of dealing with it? You need to turn 2 coordinates into an address. How else do you propose to do that?

The starting offset will be generated the same way no matter what, which is why I said it wouldn't be slower for a single access. Say if you wanted to increment all the data on the fifth [4] row of a 20x20 array though; this:

Code:
i = 4 * 20;
j = i + 20;

for(; i < j; i++)
  array[i]++;

should be faster than this:

Code:
j = 4;

for(i = 0; i < 20; i++)
  array[j][i]++;

Edit: I think ap9 was talking about the same thing?

The obvious thing would be just don't use a multidimensional array for this then. Except, I think there are many people who are not aware that the above code would be faster, never mind not caring; it looks uglier, for one thing. Not saying that programming needs to be fool-proofed though...

koitsu wrote:
I believe the OP is equally young and may be equally unaware.

Actually 19 now (kind of pathetic at this point...) The NEC V33 / 80186 in the Irem M92 has multiplication and division instructions, but I think I remember they're still a fair bit slower than addition/subtraction. Maybe I'm just thinking of the 68000, where they're infamously slow.

calima wrote:
RAM access 121+ cycles.

Shit...


PostPosted: Thu Nov 29, 2018 10:13 am 
Offline
User avatar

Joined: Sat Jan 09, 2016 9:21 pm
Posts: 518
Location: Central Illinois, USA
Quote:
Code:
i = 4 * 20;
j = i + 20;

for(; i < j; i++)
  array[i]++;

should be faster than this:

Code:
j = 4;

for(i = 0; i < 20; i++)
  array[j][i]++;



Maybe. Maybe not. Optimizers are so crazy these days that it's hard to guess what will be faster without actually checking. But for most things like this, the difference is irrelevant outside of the most extreme situations (a bottleneck of code that has to run bajillions of times, or something extremely timing-sensitive).

_________________
My games: http://www.bitethechili.com


PostPosted: Thu Nov 29, 2018 10:23 am 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7011
Location: Canada
Drew Sebastino wrote:
lidnariq wrote:
Why do you think integer multiplication is meaningfully slow on modern computers? (By which I mean everything newer than the pentium 2)

Well, it's not like it's never been "meaningfully slow", but it'll always be slower than integer addition, I'd imagine.

How do you make a 1 cycle instruction faster than another 1 cycle instruction?

Drew Sebastino wrote:
pubby wrote:
It takes more time to read from memory nowadays than it does to multiply two numbers together.

Even if in cache? (Depends on what level of cache, but you get the point)

Yes.

Drew Sebastino wrote:
rainwarrior wrote:
If you want to know how fast or slow they are, time it.

With what, a stopwatch? :lol:

That's not an invalid way of timing a test that takes long enough, but of course you should probably use something designed for the purpose of timing software.

Drew Sebastino wrote:
rainwarrior wrote:
If you've got something that's conceptually a 2D grid, what's your alternative way of dealing with it? You need to turn 2 coordinates into an address. How else do you propose to do that?

The starting offset will be generated the same way no matter what, which is why I said it wouldn't be slower for a single access. Say if you wanted to increment all the data on the fifth [4] row of a 20x20 array though; this:

Code:
i = 4 * 20;
j = i + 20;

for(; i < j; i++)
  array[i]++;

should be faster than this:

Code:
j = 4;

for(i = 0; i < 20; i++)
  array[j][i]++;

Edit: I think ap9 was talking about the same thing?

No, in that case it won't be faster at all. ap9 was saying that in that example iterating over j will be slower than iterating over i. Iterating directly on the row won't give any meaningful improvement. For a loop like that, the compiler would definitely know that j is a loop invariant here, and will operate accordingly.

It would generally be capable of doing this even if you put j=4 inside the loop. (Identifying loop invariants is something compilers usually do very well.) But even if it didn't, it might not matter for performance due to latency/pipelines, and if it did, there are certainly lower-impact ways to fix the problem than changing your data structure everywhere else it's used.

Drew Sebastino wrote:
The obvious thing would be just don't use a multidimensional array for this then. Except, I think there are many people who are not aware that the above code would be faster...

...except it isn't faster. There's a reason I suggest that you time these things. What seems "obvious" to you is factually wrong. You won't be able to correct your intuition about this without finding some way to actually verify what you believe.


Also, talking about shifts vs. multiplies: even though the instruction speed is probably no different, there are often still advantages to keeping the rows of your structures a power of 2 in length. Sometimes it helps with caching, sometimes it helps with other optimizations. (For most of them there is no reason to do anything "by hand".)

Even on older systems where, e.g., a multiply by 10 might have been a slower instruction than two shifts and an add, compilers would automatically compile * 10 into those shift and add instructions. Writing * does not specify imul; the compiler is allowed to get that operation done in whatever way it knows how. The point of optimizing compilers is to keep the programmer from having to make brainless transpositions of the code like this, and let them write something easier to read.


PostPosted: Thu Nov 29, 2018 10:44 am 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 7817
Location: Seattle
93143 wrote:
Which is also (so I hear) why it's good to know whether the language stores multidimensional arrays in row-major or column-major order, because if you write your inner loop so that it increments the wrong index, you can end up skipping all over memory and loading yourself down with cache misses
A friend of mine was in a competition to write an AI for Laser Chess ≈ Khet. He managed to fit the entire board state into a single slice of L1 cache, and as such, in the time limit allotted per turn, his less-clever AI was able to search 3 or 4 more moves into the future ("ply"). That got him to second place out of 20-ish contestants (first place was an AI that really had a better intuition about how the game worked).


PostPosted: Thu Nov 29, 2018 10:59 am 
Offline

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 20854
Location: NE Indiana, USA (NTSC)
rainwarrior wrote:
Drew Sebastino wrote:
lidnariq wrote:
Why do you think integer multiplication is meaningfully slow on modern computers? (By which I mean everything newer than the pentium 2)

Well, it's not like it's never been "meaningfully slow", but it'll always be slower than integer addition, I'd imagine.

How do you make a 1 cycle instruction faster than another 1 cycle instruction?

By a constraint of a superscalar microarchitecture requiring it to be scheduled on an execution unit that another instruction is already using. If you have one and a half units (the U-V pipeline in the Pentium and Pentium MMX, as described in Jon Stokes' article), or one full unit and two half units (the 4-1-1 pipeline in the Pentium Pro, Pentium II, and Pentium III), it's best for the compiler to interleave instructions that can run on the half units with instructions that need the full unit. Lately, even different threads have started to share execution resources, as seen in Intel processors with Hyper-Threading Technology and AMD processors with Clustered Multithreading.

