It is currently Tue Jul 16, 2019 5:50 pm

All times are UTC - 7 hours





Post new topic Reply to topic  [ 20 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Wed Jun 26, 2019 12:12 am 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3501
Location: Richmond, Virginia
Sorry if this is a silly question, but I haven't seen much literature on this subject, other than double condition checking with subtraction. I've been playing with x86-64 Linux assembly and was making code that would search through a buffer until it finally reached ascii '0'-'9', 'A'-'F', or 'a'-'f' (for hexadecimal conversion). While I could accomplish this with a "jump if less than" and a "jump if greater than" instruction for each range, I figure there must be a faster way of doing this, particularly on a modern pipelined processor where you generally want to avoid branching.

About branching, it's unfortunate branch prediction doesn't appear to be too standardized. I understand this is largely do processor manufacturers trying to develop smarter branch prediction algorithms, but this also makes it more difficult to optimize code (not for any specific processor) dealing with branches. (I don't really understand why the Pentium 4 branch prediction hints later went unused; I figure even a compiler has more knowledge than a CPU about program flow...)


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 1:10 am 
Offline

Joined: Tue Feb 07, 2017 2:03 am
Posts: 733
apart from using an AND to convert A-F -> a-f and avoid a 3rd check range, not really that much you can do. You might be able to use Sub from original, as you have a byte operand the system will have a few 8bits worth of ALU to spare, so it might be able to do original - X,Y,Z in parallel vs doing original -x, result - y, result -z


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 1:17 am 
Offline
User avatar

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 4191
Subtract 0x30
First condition: <= 9
AND ~0x20
Subtract 0x11
Second condition: <= 5

This gets it down to two comparisons...

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 2:14 am 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3501
Location: Richmond, Virginia
Wow, thank you both! :) I don't know why, but the link in the example I posted was a bit hard for me to understand, but I now realize the whole point of the subtraction is to shift the range of values to where the first number in the range is 0. (Conversely, I suppose you could add until the largest number in the range is 0xFFFF or whatever and check for greater than / equal to, but then you have to worry about register size.)

I had thought it was silly that seemingly no processors (at least not ones I know of) are unable to perform an OR operation on processor flags rather than only being able to overwrite them, but it occurs to me that for the majority of cases, this wouldn't gain much.


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 9:14 am 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7526
Location: Canada
Drew Sebastino wrote:
I don't really understand why the Pentium 4 branch prediction hints later went unused

Because it wasn't very useful.

Even 6502 had something equivalent. Any branch instruction takes more time if taken than if not, so the optimization: put the most common case in the not-taken path. (In practice a little more complicated, but I'm illustrating the idea.)

Intel branch instructions have a similar natural difference between branch taken or not. One way is slower than the other. Even without a hint, a compiler could usually reorganize code to put the common case on the fast side of the branch.

So... all the hint did was let you reverse the natural speed order of a branch instruction. Since in many cases you can just reorganize the code, the times where a hint prefix on the branch could really help are relatively rare. It just wasn't that great a tool.

Then you get to a later generation CPU with dynamic branch prediction, and it blows static prediction out of the water. The hint which was now only marginally useful is now entirely useless. The dynamic prediction will do a better job, and can't make use of the hint anyway. The hint is just a wasted byte at that point.

Oziphantom wrote:
I figure even a compiler has more knowledge than a CPU about program flow...

Compilers know a little bit of relevant useful information. For example they are likely to know when something is a simple loop counter, and can optimize that accordingly. There are a lot of cases where the compiler doesn't know much at all about which branch is going to be taken more, though.

Consider a utility that loads a file, processes its bytes, outputs some other file. There's no place in most high level languages to tell the compiler much about what kind of data is expected to be found in the file. You might know that e.g. your files are filled with mostly 0 bytes but the compiler won't.

Profile-guided optimzation can help with this a bit. You record some runs of the program and save the analysis, and then the compiler can use that information to inform its own optimizer. This generally does improve performance a little bit, but it tends to require a lot of setup work (e.g. re-running the profile test every time you build).

This is actually one of the areas where "hand optimization" on the part of the programmer can help most: you can know more about the incoming data than a compiler can, and that knowledge can be applied, whether by revising the algorithm at a higher level, reorganizing the data itself, tweaking the high level code to get a more appropriate result from the compiler, rewriting something critical in assembly, etc.

Drew Sebastino wrote:
About branching, it's unfortunate branch prediction doesn't appear to be too standardized... ...this also makes it more difficult to optimize code (not for any specific processor) dealing with branches.

This is not really a problem. They're not standardized because successive generations are generally just better at predicting than previous ones. Nothing is lost there, and even if you somehow optimized your code for a previous generation in a way that doesn't carry forward (extremely unlikely), the newer CPU would be faster anyway. It won't matter.

Optimizing branches is a more or less generic problem: the rule of thumb is just to try and set up branches so that the same direction is taken in clusters. Randomly going left and right is the worst case. Going left a little more than right is better. Going only left many times, then switching and going only right many times is the best case.

There really isn't much need (or ability) to get CPU-specific with this kind of optimization, for the most part. There are isolated cases but the bulk of the problem is generic.


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 12:56 pm 
Offline
User avatar

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7720
Location: Chexbres, VD, Switzerland
rainwarrior wrote:
Because it wasn't very useful.

Even 6502 had something equivalent. Any branch instruction takes more time if taken than if not, so the optimization: put the most common case in the not-taken path. (In practice a little more complicated, but I'm illustrating the idea.

The 6502 has nothing equivalent. In the case of a loop, you have to branch backwards in all cases, even though this is by far the most common case. The 6502 "branch-prediction" would be always-non-taken which is awful for loops. In the case of an if/else statement you'd be correct, but another instruction would have to be used to skip the "else" part at the end of the "if" part, so the optimisation of having the most common case non taken could sometimes be reversed. In practice we often don't mind this when coding 6502 code because that's already enough trouble.


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 1:55 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7526
Location: Canada
rainwarrior wrote:
(In practice a little more complicated, but I'm illustrating the idea.)

My whole point was to appeal to a platform OP is more familiar with. By choosing what is in or out of a branch, you save a cycle on 6502. That's the only important part of the illustration.

6502 branch optimization is more complicated than just inverting a branch (the savings are too small, it's inadequate for loops, an if/else needs a jump that inverts the savings, etc.) but I thought the principle was clear. That kind branch optimization did work better for the CPU s we're really talking about.

On a Pentium the branch penalty is from a pipeline stall rather than just a question of branch taken or not, hence prediction is the only relevant factor. For loops, earlier static branching implementations already treated a backward branch (i.e. the most common for a loop) as a prediction that it will take the branch. The added hint did add some nuanced optimization possibilities, but only subtly so compared to what we already had without the hint.

That's the point: the hint didn't gain much, and dynamic prediction made it obsolete.


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 4:23 pm 
Offline
User avatar

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 4191
There was also the time I used bit twiddling hacks to eliminate branches from Saturated Arithmetic code used in Snes9x. To implement saturated addition, you only concern yourself with the High Bit becoming set. Shift right, multiply your mask, then OR with the mask, then you've saturated the number. When I did it for Snes9x, I combined the Red and Blue channels into the same 32 bit register to improve speed.

Example: Branchless 8 bit saturated add:
//add the numbers
C = A + B;
//look at high bit only
D = C >> 8;
//Multiply by a mask (0xFF)
D *= 0xFF
//Restrict range to 00-FF, then OR with FF if the high bit had become set before.
C = (C & 0xFF) | D;

Another Example: Branchless 8 bit saturated subtract:
//subtract the numbers
C = A - B
//If number became negative, isolate the sign only, and invert it
D = ((C >> 31) & 1) ^ 1;
//Multiply by FF to make an AND mask (value 0 if the subtraction went negative, FF otherwise)
D *= 0xFF;
//Do the AND mask
C = C & D

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


Top
 Profile  
 
PostPosted: Wed Jun 26, 2019 7:08 pm 
Offline
User avatar

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7526
Location: Canada
Oh yeah, that's the other generic way to optimize branches: get rid of them!

Actually you'll find compilers do this automatically for a lot of simple things these days. Sometimes it's better to do e.g. a multiply and an add than a branch. A pipeline stall might easily be worth a few arithmetic instructions.


Top
 Profile  
 
PostPosted: Thu Jun 27, 2019 1:17 am 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3501
Location: Richmond, Virginia
Dwedit wrote:
There was also the time I used bit twiddling hacks to eliminate branches from Saturated Arithmetic code used in Snes9x.

Hmm... Is there a particular reason you did not use any of the SSE instructions for this? I think saturated arithmetic is the only type of arithmetic available here.


Top
 Profile  
 
PostPosted: Mon Jul 01, 2019 12:13 pm 
Online

Joined: Sun Mar 27, 2016 7:56 pm
Posts: 217
Dwedit wrote:
There was also the time I used bit twiddling hacks to eliminate branches from Saturated Arithmetic code used in Snes9x.

On the other hand, I've done similar stuff to avoid a branch, only to have the optimizing compiler put the branch back in.

It does make me suspect that blindly removing branches isn't always faster, since the branchless code tends to be more complicated than branching code. I didn't think to benchmark it at the time, though.


Top
 Profile  
 
PostPosted: Mon Jul 01, 2019 3:12 pm 
Offline
User avatar

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 4191
Note that use of comparison operators (==, !=, <, <=, >, >=), logical operators (&& ||), as well as the Question Colon operator ( ? : ) are all branches, so if you used them, you added in a branch.

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


Top
 Profile  
 
PostPosted: Mon Jul 01, 2019 3:33 pm 
Offline

Joined: Sun Apr 13, 2008 11:12 am
Posts: 8478
Location: Seattle
Dwedit wrote:
Note that use of comparison operators (==, !=, <, <=, >, >=)
That's not quite true. LAHF allows the flags register to be used in arithmetic. Whether the flags are in a useful encoding, or whether there's any way to coax the compiler to emit LAHF is another question. (A few earlier x86-64 Intel cores don't support it, but everything 32bit only, less than ten years old, or made by AMD, does. GCC's manual says that it will use the related SAHF to optimize a few math builtins)


Top
 Profile  
 
PostPosted: Mon Jul 01, 2019 11:12 pm 
Offline
User avatar

Joined: Thu Mar 31, 2016 11:15 am
Posts: 520
Don't forget CMOV 8-)

Quote:
I've been playing with x86-64 Linux assembly and was making code that would search through a buffer until it finally reached ascii '0'-'9', 'A'-'F', or 'a'-'f' (for hexadecimal conversion).

There's probably a sse/avx version that blows everything else out of the water.


Top
 Profile  
 
PostPosted: Tue Jul 02, 2019 2:59 am 
Offline
Formerly Espozo
User avatar

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3501
Location: Richmond, Virginia
Rather surprisingly, there doesn't seem to be; text processing probably just isn't a big enough resource hog for this sort of thing.

I did, however, design a function that converts a 64 bit hexadecimal number to 16 ascii characters using several different sse instructions. I don't think this could be much, if any, faster without utilizing newer instructions.

Code:
  bswap rax                  ;reverse the order of bytes (number needs to be backwards)
  movq xmm1, rax            ;move low nibble qword into low qword of xmm1
  shr rax, 4
  movq xmm0, rax             ;move high nibble qword into low qword of xmm0
  punpcklbw xmm0, xmm1         ;interleave bytes
  pand xmm0, [lowNibbleMask]   ;and with mask to get low nibbles of xmm0
  movapd xmm1, xmm0
  pcmpgtb xmm1,[checkIfAtoF]   ;find what numbers are represened as A through F
  psubusb xmm1,[correctAtoF]   ;subtract with saturation to yield 7 for these numbers
  paddb   xmm1, [asciiNumStart]   ;add the ascci base number offset
  paddb xmm0, xmm1            ;add xmm1 to xmm0 to obtain correct ascii values


Also, unrelated, but does anyone know if there's any way for nasm/yasm to know the data size of macro arguments? (For example, al, ax, eax, rax.) I have macros that generate slightly different code based on one argument and have this argument's size is a separate argument, but it would be nice if I could omit this.



Going off on a tangent now, I remember first getting into x86 and thinking: "why on earth do new instructions keep being added? How are there not enough already?" but now, I keep wanting to use instructions that were introduced only a few years ago, lol. However, I've restricted myself to only using nothing more recent than SSE2; Notepad++ (the version I have, anyway) doesn't even recognize anything newer and won't format it.

I had heard x86 (especially modern x86-64) was a nighmare to program for, but I find it rather fun; there are plenty of registers that you often don't even need to use because immediate performance is so good, and while there are a bazillion different instructions, once you get to learn most of them, they can really make solving various different problems much easier. (Of course, if I cared about ease of development, I wouldn't touch assembly at all, lol.) Not to mention the documentation is excellent; meanwhile, even with ARM, you need to look for several different documents. As for PowerPC and many other (dead) ISAs; good luck.


Last edited by Drew Sebastino on Tue Jul 02, 2019 10:45 am, edited 1 time in total.

Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 20 posts ]  Go to page 1, 2  Next

All times are UTC - 7 hours


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB® Forum Software © phpBB Group