It is currently Wed May 22, 2019 1:44 am

All times are UTC - 7 hours





PostPosted: Fri May 10, 2019 2:20 am 
Formerly Espozo

Joined: Mon Sep 15, 2014 4:35 pm
Posts: 3459
Location: Richmond, Virginia
...Is something that is almost always mentioned in any discussion about assembly programming these days. I never see any code comparisons, statistics, or citations; I assume it's just supposed to be obvious enough that it doesn't require an in-depth explanation. Maybe this is due to me being biased (even after learning C and much of C++), but I find this difficult to believe. I don't see how a compiler could generate more efficient code when it has no real intuition about how any specific program operates, and (at least from what I've seen on x86) function calls will always be expensive no matter what, because seemingly no assumptions are made about where or how functions are called.

I then thought that, with how complex instruction timing has become and all the other headaches of modern computing, a compiler might be able to write more efficient code; but that level of optimization would have to change with just about every different processor, or at least every different microarchitecture, although I don't know how drastically.


PostPosted: Fri May 10, 2019 2:41 am 

Joined: Wed Feb 13, 2008 9:10 am
Posts: 690
Location: Estonia, Rapla city (50 and 60Hz compatible :P)
A compiler can have intimate knowledge of how to order instructions so that various latencies inside modern CPUs won't add up but get overlapped instead. I know I'll be able to beat a compiler when writing 486 and Pentium code, since there's very little that gets in the way, but with a P4 and most current x86 CPUs I definitely won't be. As far as other modern high-performance architectures go, I don't know.

_________________
http://www.tmeeco.eu


PostPosted: Fri May 10, 2019 6:37 am 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
Drew Sebastino wrote:
Maybe this is due to me being biased (even after learning C and much of C++), but I find this difficult to believe. I don't see how a compiler could generate more efficient code when it has no real intuition about how any specific program operates, and (at least from what I've seen on x86) function calls will always be expensive no matter what, because seemingly no assumptions are made about where or how functions are called.


Generating optimal code often requires knowing some things that compilers can know but programmers generally won't, but also requires knowing some things that programmers will know but compilers can't. Further, it requires recognizing that any "optimizations" predicated upon the notion that a programmer won't do X will be counter-productive if a programmer needs to do X.

Unfortunately, development on C compilers like clang and gcc has been taken over by people who mistakenly believe "clever" and "stupid" are antonyms. The authors of the C Standard didn't think it necessary to say "This Standard makes no attempt to forbid obtuse compiler writers from doing stupid things that make their compilers unsuitable for many purposes". They acknowledged that principle in the published Rationale, but not within the Standard itself. Consequently, there's an increasing divergence between the language that became popular in the 1980s and the dialect that clang and gcc process with optimizations enabled, with the latter being suitable only for a few specialized applications that do not involve processing any potentially-invalid input.


PostPosted: Fri May 10, 2019 8:03 am 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
TmEE wrote:
A compiler can have intimate knowledge of how to order instructions so that various latencies inside modern CPUs won't add up but get overlapped instead. I know I'll be able to beat a compiler when writing 486 and Pentium code since there's very little that gets in the way but with a P4 and most current x86 CPU I definitely won't be. As far as other modern high performance architectures go, I don't know.


While it's possible for modern compilers to exploit things about architectures that programmers wouldn't be able to with any reasonable amount of effort, programmers know more than compilers about what needs to be done. Consider, for example, a function which is supposed to accept pointers to two 4096-byte blocks of memory and return 0 if they contain matching contents or any non-zero value if they don't. One approach would be to simply compare the two blocks a byte at a time, and stop after all 4096 bytes have been compared or a mismatch has been found. Another would be to check the pointers for alignment, select one of 64 comparison methods based upon the pointer alignments, and then process eight bytes at a time, using whatever shifts are required. The first approach may be more than double the speed of almost anything else in cases where it's necessary to compare many blocks that differ in the first byte. The second approach may be almost eight times as fast as the first in cases where the two blocks match. If the Standard were to include alternate memory-comparison functions which implementations could simply chain to `memcmp` if desired, but which implementations would be encouraged to optimize for different cases, then picking the right routine in various situations could do more to improve performance than anything a clever implementation might hope to accomplish.
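To make the trade-off concrete, here is a minimal C sketch of the two strategies described above (the function names and the fixed 4096-byte size are my own, not from any standard library):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time: cheap to start, exits on the first mismatch,
   so it wins when blocks typically differ early. */
int blocks_differ_bytewise(const void *a, const void *b)
{
    const unsigned char *pa = a, *pb = b;
    for (size_t i = 0; i < 4096; i++)
        if (pa[i] != pb[i])
            return 1;
    return 0;
}

/* Word-at-a-time: four times fewer loop iterations, so it wins
   when the blocks usually match.  memcpy() sidesteps alignment
   and aliasing rules; compilers turn it into a plain load. */
int blocks_differ_wordwise(const void *a, const void *b)
{
    const unsigned char *pa = a, *pb = b;
    for (size_t i = 0; i < 4096; i += sizeof(uint32_t)) {
        uint32_t wa, wb;
        memcpy(&wa, pa + i, sizeof wa);
        memcpy(&wb, pb + i, sizeof wb);
        if (wa != wb)
            return 1;
    }
    return 0;
}
```

Only the programmer knows which input distribution the program will actually see, and that is exactly the information the compiler lacks when choosing between the two.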

While some compiler writers might suggest "profile-driven optimization" as a solution to such issues, that fails to consider the fact that a compiler will generally have no way of distinguishing cases where performance matters from those where it doesn't. If an operation needs to be performed 1000 times, it may be better to have it take 10us 99% of the time and 1000us 1% of the time, than to have it take 100us every time, provided the time for all 1000 operations never exceeds 200ms. If the time for 1000 operations could sometimes reach a full second, however, that may be unacceptable. Such issues can be especially important in cases where performance can depend upon the data a program receives from potentially-malicious sources. If a language has no way of expressing such concepts, compilers cannot be expected to generate optimal code that meets them.


PostPosted: Fri May 10, 2019 1:54 pm 

Joined: Wed Nov 30, 2016 4:45 pm
Posts: 135
Location: Southern California
I suspect the idea that compilers can outperform handwritten assembly comes from modern 32- and 64-bit multi-core high-end processors that have super-complex instructions, deep pipelines that must be kept full, multi-level cache, out-of-order execution, and things that simply don't apply to the 6502. I've never heard of a 6502 compiler that could remotely compete with hand assembly. Perhaps a great compiler could be written if there were the commercial interest to invest the necessary man-hours, but I still don't think it would produce code that could match, let alone exceed, the performance of code from any skilled 6502 assembly-language programmer.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


PostPosted: Fri May 10, 2019 2:49 pm 

Joined: Sun Sep 19, 2004 11:07 pm
Posts: 167
The idea post-dates most of the 32-bit era.

Compilers started getting competitive with hand-written code in the 486/Pentium era, and have gotten better as the used subset of x86 shrinks and the chips become a bit more regular as far as timing goes. A hefty amount is due to the microarchitecture doing a lot of the work for them.

For modern stuff, the only real reason to drop to inline assembly anymore is when you're trying to express concepts that C/C++ really don't want to do, or possibly SIMD. But those areas are shrinking rapidly, and SIMD is far, far easier to deal with via intrinsics.
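As an illustration of the intrinsics point: SIMD can be expressed directly in C without inline assembly, and the compiler still handles register allocation and scheduling around it. A minimal sketch, assuming an x86 target with SSE; the function name is mine:

```c
#include <immintrin.h>

/* Add four pairs of floats with one SSE instruction.  The
   intrinsics map one-to-one onto movups/addps, but unlike inline
   assembly the compiler can still schedule and inline freely. */
void add4(float *dst, const float *a, const float *b)
{
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```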

Compilers are generally pretty good about tedious, mechanical, predictable, repetitive operations like register allocation, loop unrolling, and inlining (though unrolling is a lot more situational now, and arguably has been since branch prediction became a thing). They're at best tied on subexpression elimination, but their hands are tied a bit there by language semantics.
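A common case where language semantics tie the compiler's hands is potential aliasing. In this hypothetical sketch (names are mine), `*k` must be re-read from memory on every iteration of the first function, because `dst` might legally point at `*k`; `restrict` removes that obligation:

```c
/* The compiler must assume dst could alias *k, so *k is reloaded
   from memory on every pass through the loop. */
void scale_maybe_alias(int *dst, const int *src, const int *k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * *k;
}

/* restrict promises no aliasing, so the load of *k can be hoisted
   out of the loop (loop-invariant code motion / CSE). */
void scale_no_alias(int *restrict dst, const int *restrict src,
                    const int *restrict k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * *k;
}
```

Both functions compute the same result on non-overlapping arrays; only the generated code differs.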

It's been a rather long while since I last ran into a situation where the compiler output wasn't mostly what I would have written anyway, for anything where it mattered, and even in those cases there was usually a way to tweak the C/C++ code to make the compiler do the better thing without having to go to assembly.


PostPosted: Fri May 10, 2019 2:50 pm 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
Garth wrote:
I suspect the idea that compilers can outperform handwritten assembly comes from modern 32- and 64-bit multi-core high-end processors that have super-complex instructions, deep pipelines that must be kept full, multi-level cache, out-of-order execution, and things that simply don't apply to the 6502. I've never heard of a 6502 compiler that could remotely compete with hand assembly. Perhaps a great compiler could be written if there were the commercial interest to invest the necessary man-hours, but I still don't think it would produce code that could match, let alone exceed, the performance of code from any skilled 6502 assembly-language programmer.


Actually, the idea substantially predates the 6502. Many optimization techniques would result in code that was very brittle, and would thus need to be substantially reworked if requirements changed. If one were trying to write an assembly-language program to process a two-dimensional array, one would likely have to either write code that would only work for a power-of-two array size, or write less efficient code that would work with any size. If one were instead writing the code in FORTRAN, however, one wouldn't have to change anything other than the array dimension to make the compiler automatically generate whatever code was appropriate for the specified size. Hand-written assembly code that was specifically tailored for a particular array size might outperform the compiler-generated machine code for that size, but assembly code that needed to be easily maintainable could not outperform the machine code generated by a compiler that was not thus encumbered.
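The brittleness in question is visible even in C: with a power-of-two row length, the compiler can address `a[r][c]` with a shift, while any other length needs a multiply, and changing a single constant regenerates the right code. A hypothetical sketch (names are mine):

```c
#define COLS 64   /* power of two: the row offset compiles to (r << 6) */

int sum2d(int a[][COLS], int rows)
{
    int total = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < COLS; c++)
            total += a[r][c];   /* shift-based addressing, no multiply */
    return total;
}
```

Change `COLS` to 63 and the compiler silently switches to multiply-based addressing; hand-written assembly hard-coded around the shift would have to be rewritten.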


PostPosted: Fri May 10, 2019 3:30 pm 

Joined: Sun Sep 19, 2004 11:07 pm
Posts: 167
Fair 'nuff. My experience is almost entirely on 486+ PC hardware, retro consoles, and PS2/GC-to-present consoles.

I could certainly see the promise of a sufficiently smart compiler being closer to reality on the sort of machines people wrote FORTRAN code for. It definitely wasn't the case for most of the PC compilers (Watcom excepted).

Console compilers have been a wide range of hit and miss, with a particularly painful era of false promises in the 360/PS3 era when IBM managed to convince everyone that out-of-order execution was a passing fad and the compiler would totally be able to schedule code properly.


PostPosted: Fri May 10, 2019 6:14 pm 

Joined: Wed Nov 30, 2016 4:45 pm
Posts: 135
Location: Southern California
supercat wrote:
Garth wrote:
I suspect the idea that compilers can outperform handwritten assembly comes from modern 32- and 64-bit multi-core high-end processors that have super-complex instructions, deep pipelines that must be kept full, multi-level cache, out-of-order execution, and things that simply don't apply to the 6502. I've never heard of a 6502 compiler that could remotely compete with hand assembly. Perhaps a great compiler could be written if there were the commercial interest to invest the necessary man-hours, but I still don't think it would produce code that could match, let alone exceed, the performance of code from any skilled 6502 assembly-language programmer.

Actually, the idea substantially predates the 6502. Many optimization techniques would result in code that was very brittle, and would thus need to be substantially reworked if requirements changed. If one were trying to write an assembly-language program to process a two-dimensional array, one would likely have to either write code that would only work for a power-of-two array size, or write less efficient code that would work with any size. If one were instead writing the code in FORTRAN, however, one wouldn't have to change anything other than the array dimension to make the compiler automatically generate whatever code was appropriate for the specified size. Hand-written assembly code that was specifically tailored for a particular array size might outperform the compiler-generated machine code for that size, but assembly code that needed to be easily maintainable could not outperform the machine code generated by a compiler that was not thus encumbered.

I think that's a very small part of programming though; and even then, you could write assembly macros that would generate the optimum code for the needed array size. If you decide to change the array size, there's no need to re-write the code that calls the macros. I've had macro definitions that were 50 lines or more yet only laid down two ML instructions, depending on all the conditions the macros watched for.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources


PostPosted: Sat May 11, 2019 9:42 am 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
Garth wrote:
I think that's a very small part of programming though; and even then, you could write assembly macros that would generate the optimum code for the needed array size. If you decide to change the array size, there's no need to re-write the code that calls the macros. I've had macro definitions that were 50 lines or more yet only laid down two ML instructions, depending on all the conditions the macros watched for.


A macro assembler that can let the programmer specify and handle all of the proper cases will generally be more complicated than a compiler that can simply generate the proper code for the proper cases, and even when using such tools the level of programmer effort required to handle such cases will be greater than would have been needed in a higher-level language.

I gave array sizes as a simple example of something I know compilers could do even in the 1960s. Register usage and constant reuse would actually be even more significant issues, but I'm not sure what the state of the art was when the 6502 was introduced. If one writes something in FORTRAN equivalent to the C code:
Code:
   #define SCALE1 2.5f
   #define SCALE2 1.5f

    double adjust(float x, float y)
    {
      return x*SCALE1 + y*(SCALE2+1.0f);
    }

a compiler could exploit the fact that the same value 2.5f would be used in both places, but exploiting that in assembly code while still having the program work if the constants changed would be harder. And given something equivalent to:
Code:
    void someFunction(void);

    int array1[1000];

    void test(void)
    {
      for (int i=0; i<100; i++)
      {
        array1[i] += 1;
        someFunction();
        array1[i]--;
      }
    }

it would likely be worthwhile to keep the address of array1[i] in a "marching pointer" register if it wouldn't get disturbed by the call to someFunction(), but it wouldn't be worth putting it into a register that would need to be saved before the function call and restored afterward. I don't think compilers of the 1960s would have spent the effort to examine the callers of someFunction() to decide what registers should be left alone, but I would expect they could have had earlier-compiled functions report any registers that they would "naturally" leave undisturbed.


PostPosted: Sat May 11, 2019 10:55 am 

Joined: Mon Feb 07, 2011 12:46 pm
Posts: 1040
"Modern compilers will outperform handwritten assembly." I think it depends on the program being written, on the compiler, on the computer (or VM) being targeted, and on the programmer. (In some cases I know, because I have tried it, that handwritten assembly code will work better.)

_________________
.


PostPosted: Sat May 11, 2019 2:42 pm 

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7711
Location: Chexbres, VD, Switzerland
I think the cases where people even try to hand-write assembly, and to optimize it, are now so rare that basically no engineers are able to do that anymore. The cost is not just the code itself but the time it takes to hand-write highly performant assembly. In industry it's not worth paying an engineer to spend two weeks optimizing an assembly routine when you could just use sub-optimal code and better hardware instead. The hardware might be more expensive, but the shorter time to market ultimately makes this the better-performing solution.

In the case of retro consoles it's different, because the hardware is fixed and we have to push it to its limits.


PostPosted: Sat May 11, 2019 4:07 pm 

Joined: Thu Apr 18, 2019 9:13 am
Posts: 103
Bregalad wrote:
I think the cases where people even try to hand-write assembly, and to optimize it, are now so rare that basically no engineers are able to do that anymore. The cost is not just the code itself but the time it takes to hand-write highly performant assembly. In industry it's not worth paying an engineer to spend two weeks optimizing an assembly routine when you could just use sub-optimal code and better hardware instead. The hardware might be more expensive, but the shorter time to market ultimately makes this the better-performing solution.


Even on modern microcontrollers, there are many cases where spending a few hours on hand-written assembly may offer major performance improvements. Most of the improvements that would be achievable within a couple of weeks wouldn't take more than a few hours to achieve. A bigger issue is determining what dialect of C makes the most sense, and ensuring that one's compiler will actually process it without jumping the rails.

For example, if one needs a function that takes a pointer to a 32-bit-aligned byte address and interprets the four bytes there as a big-endian 32-bit unsigned int, such a function might be written a couple of ways for, e.g., the Cortex-M0:
Code:
  #include <stdint.h>

  uint32_t fetch32le_v1(void *p)
  {
    uint8_t *pp = p;
    return ((uint32_t)pp[0]<<24) | (pp[1]<<16) | (pp[2]<<8) | pp[3];
  }

  uint32_t fetch32le_v2(void *p)
  {
    uint32_t result = *(uint32_t*)p;
    return (result<<24) | ((result<<8) & 0xFF0000) | ((result>>8) & 0xFF00) | (result >> 24);
  }

A compiler that has no way of knowing that `p` will be word-aligned couldn't process fetch32le_v1 using less than four byte reads, three shifts, and three "or" instructions. GCC's optimizer uses those operations to assemble the value in little-endian order and then uses a byte-reverse instruction. When using the -fno-strict-aliasing dialect, the fetch32le_v2 function could achieve the required behavior using two instructions plus the return rather than ten (or eleven), but in gcc's "preferred" dialect the latter function won't be reliable.
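For what it's worth, gcc and clang offer a third option that stays within defined behavior: `memcpy` for the load (compiled down to a single load instruction) plus the `__builtin_bswap32` intrinsic (a single REV on ARMv6-M and later). Like fetch32le_v2, this sketch assumes a little-endian target; the function name is my own:

```c
#include <stdint.h>
#include <string.h>

uint32_t fetch32_v3(const void *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);      /* compiles to a single word load */
    return __builtin_bswap32(w);  /* compiles to a single byte-reverse */
}
```

Unlike the `*(uint32_t*)p` cast, the `memcpy` form doesn't violate the strict-aliasing rules, so it stays reliable even in gcc's "preferred" dialect.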


PostPosted: Sun May 12, 2019 12:22 pm 

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7711
Location: Chexbres, VD, Switzerland
This seems more like a problem with the C language being inappropriate for describing the program, rather than the compiler failing to optimize the assembly.


PostPosted: Sun May 12, 2019 5:55 pm 

Joined: Sun Jan 22, 2012 12:03 pm
Posts: 7461
Location: Canada
Bregalad wrote:
This seems more like a problem with the C language being inappropriate for describing the program, rather than the compiler failing to optimize the assembly.

Standard C maybe, but that's also why most compilers have extensions like __declspec(align) or __attribute__((aligned)). There are a lot of very practical ways to address domain-specific problems when you actually allow domain-specific solutions.

Also, if you wait long enough, things that everybody implements anyway usually do eventually make it into the standard spec.





Powered by phpBB® Forum Software © phpBB Group