Strategies for implementing macro functionality in assemblers

You can talk about almost anything that you want to on this board.

Moderator: Moderators

User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Strategies for implementing macro functionality in assemblers

Post by tokumaru »

When thinking about writing my own assembler, one of the things that constantly baffles me is macros. Macro arguments, more specifically. In theory this shouldn't be so complicated, macro arguments could be treated like regular text substitutions (i.e. equates/defines), except that they'd only be active during one expansion of the macro they belong to. In practice though, having temporary text substitutions can be more complicated than it sounds.

In a language like C, where you have to manage all the memory you use manually, the annoyance and overhead of using malloc/free for each individual argument of each macro call of each pass sounds like something to be avoided. For most types of structures I prefer to use the concept of memory pools, where I allocate memory for several objects at a time and then manage the distribution of those objects, allocating more whenever they run out.

One alternative I considered was to cache the original source code and reference the text in the actual macro calls when replacing parameter names with actual arguments, avoiding the need to allocate temporary strings, but the fact that text substitutions can be nested and not all of them are global (e.g. if one macro calls another and passes along any arguments it received, the substitutions must be made *before* the call, because those arguments won't be visible from the second macro). Because of this, I believe that each line of source code must be fully expanded before it's processed, meaning that macro arguments should always be considered dynamic.

In this case, one solution would be to limit the number of nested macro calls that can be made and buffer the expanded source code lines containing the macro calls, so that the arguments can be copied from the fully expanded macro calls. The downside is the limitation imposed on macro nesting, and limitations are never cool.

Am I overthinking this? Am I overlooking something? Can anyone think of a simpler solution for implementing macro arguments in an assembler?
User avatar
pubby
Posts: 583
Joined: Thu Mar 31, 2016 11:15 am

Re: Strategies for implementing macro functionality in assemblers

Post by pubby »

Everything depends on the semantics and evaluation strategy of your macro system. Typically though you would identify all the macro arguments before you expand anything. Everything that isn't a macro argument would just be strings to the original code.

For example:

Code: Select all

.define FOO(X, Y) 3 * X + Y
would be represented as:

Code: Select all

STRING "3 * "
ARG 0
STRING " + "
ARG 1
Each STRING would be a pointer to the original text along with a substring length, and each ARG would be the integer index of whatever argument it defined.

Then when you expand macros, you end up with a list of STRINGs. For example, this text:

Code: Select all

print FOO(1, 7);
print 5;
Would expand to:

Code: Select all

STRING "print " 
STRING "3 * "
STRING "1"
STRING " + "
STRING "7"
STRING "; print 5;"
Of course, instead of using strings of characters, you ought to use strings of tokens.
User avatar
Bregalad
Posts: 8055
Joined: Fri Nov 12, 2004 2:49 pm
Location: Divonne-les-bains, France

Re: Strategies for implementing macro functionality in assemblers

Post by Bregalad »

I always found C's #define word processing macros to be silly and are probably an exeption among the rule in the world of programing languages. Almost everything else would expand macro arguments first and use the result in the macro, which is the behavior who makes common sense.

C language is poorly designed by ended up very popular just because of the snowball effect.
Useless, lumbering half-wits don't scare us.
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 568
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: Strategies for implementing macro functionality in assemblers

Post by Jarhmander »

You seem to worry too much about optimizations at this time; even worse, you're worrying about how to manage memory down to the macro-handling code. Your time will be much better spent on how to structure things and separate concerns. Strings, for instance, are best managed in objects with a minimal but suffisent interface. Your macro-handing code deals with string objects, string buffers and higher data structures; it should never manage blocks of memory with malloc and free calls interspersed throughout the entire code. Doing this will complexify the code for no good reason, and you'll end up doing needless and useless optimizations. Note that I repeatedly talk about objects, but I do not imply going full OOP; while it's a good paradigm, there much more to it than putting interfaces everywhere and have objects with deep inheritance chains that people tend to assume for no good reason. Moreover, since it appears you're programming in C, that will be needlessly painful to do anyway ―ever tried to do polymorphism in C? Doable, but certainly much more boilerplate than doing it in a OOP language. Plain structs with functions (or macros!) to create, manipulate and destroy them are perfectly fine. Just never manipulate implementation details directly from you code (wrap that in functions/macros), and you're good to go. Memory pool handling code, if deemed useful, could be added later with not so much additional effort; I don't imply there would be no change to the code, but that would be not as big as you might think. In any case, the C programmer always end up reinventing the wheel, might as well make it do it precisely what you want: dynamic strings, string builders, string pools, hash tables etc. that works good enough for your needs, not more than that.

Also, I don't think assembler macros should manipulate text strings directly, but tokens and/or expressions. That would be no that much more work: your assembler can parse basic arithmetic expressions, right? Then you likely have what is needed for that kind of implementation. Text-based macro substitution is sometimes useful but finicky and lends to bad or cryptic error messages, because these kind of macro expanders operate at a too low level. You can end up with problems with multiple expansions, or problem separating tokens from source because of lack of whitespace (!) in macro expansion, or commas that means something different when reparsed in a new context, or all sort of silly things like that. Save yourself the hassle, make your macro expander works with tokens and/or expressions, not raw strings.

Then there's the problem of recursive macros. They are useful, but of course there's the possibility of infinite recursion. There's no way around that: use a recursion limit. No, not some of stupid magic constant that limits your recursion stack: a variable holding the maximum recursion level. Then do the obvious: when macro-expanding, check the current recursion level, bail out with an error when it exceeded the limit. That variable should not dictate directly the size of that recursion stack, this stack should just be dynamic and grow on demand. Then that variable could be tweaked by a command-line argument, if really needed. Make the default large enough (100? 500? who knows). That will efficiently trap any runaway recursion.

Now, some useful... "idiom"? If you want some kind of array dynamic, but not necessarily dynamically allocated, one may find that kind of code useful:

Code: Select all

void some_func(... size_t num ...)
{
     // Some define, enum or constant dictating the initial size of the array; the value is your pick.
     // For illustration purposes, it's directly in the function, but could be elsewhere of course.
     enum { OBJ_ARRAY_INITIAL_SIZE = 16 };
     
     Obj stack_storage[OBJ_ARRAY_INITIAL_SIZE];
     Obj *storage = stack_storage, *dynamic_storage = NULL;
     
     // If we need more storage, allocate it.
     // numel => (sizeof(X)/sizeof(*X))
     if (numel(stack_storage) > num) {
         // Ideally, check the return value of malloc
         storage = dynamic_storage = malloc(sizeof (Obj) * num);
     }
     
     // Do something with _storage_, never manipulate the other variables
     // (dynamic_storage could be NULL, stack_storage may not be the actual storage).
     // ...
     
     // Clean-up.
     free(dynamic_storage);
}
Of course, as shown here, it is only valid for some storage needed in the scope of the function. Also, for illustration I put the code directly in the body of the function, but you would want to wrap that in functions/macros. If the static storage and pointers were in a struct however, then the storage is valid as long as the object is alive; this is kinda like small buffer optimization (SBO), except normally when people talk about SBO, it's a more extreme variant with an union that shares storage between various pointers and a local buffer.

Rather than using a fully contiguous data structure, you may find yourself wanting some more dynamic structure, like a deque, where you have multiple smaller arrays linked together. That needs no major reallocation; for a stack-like structure, you don't even need a contiguous data structure anyway.
((λ (x) (x x)) (λ (x) (x x)))
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Strategies for implementing macro functionality in assemblers

Post by tokumaru »

I started this thinking I'd use tokens all the way, but then found that macros and defines would make that really hard, since tokens can mean different things when used in different places. For example:

Code: Select all

  .define whatever z:

  ;in this case, "whatever" expands to a label definition
whatever JMP z

  ;but here, it expands to an address size modifier (forcing ZP addressing)
  lda whatever$1b
So how would I tokenize the "z:" as the expansion of "whatever", considering that it can be interpreted as different things depending on where it's used later on? To avoid problems like these, I figured it'd be better to delay the tokenization until each line of code was fully expanded, so there'd be no ambiguity to worry about.

But still, strings, tokens, whatever you go with, the problem of nested temporary expansions is one that appears to complicate the memory management of the assembler significantly. Most other things are permanent (e.g. labels, variables, literals, etc.), once created they remain in use until the assembler is done, but the replacement of macro parameters is temporary, which really complicated things from where I'm seeing.
User avatar
aa-dav
Posts: 219
Joined: Tue Apr 14, 2020 9:45 pm
Location: Russia

Re: Strategies for implementing macro functionality in assemblers

Post by aa-dav »

I belive that powerful macros must be turing-complete compile-time tool capable of anything.
That is macros must behave as functions returning strings.
And string allocation questions are not worth of narrowing of this design to simple string substitution.
However implementation of turing-complete language inside macros could be more powerfull task than the rest of assembler.
So I was interested in simple designs like Tcl language which is built around string substitution and passes code to if/for/while constructions as... strings! Code inside sting of code inside string of code and so on - it's possible due to special string syntax with start/end markers, so strings can be nested. So if/for/while could be implemented in language itself using built-in 'apply' function or something like that.
Interesting thing to implement, however I don't have time to make this experiment yet.
calima
Posts: 1745
Joined: Tue Oct 06, 2015 10:16 am

Re: Strategies for implementing macro functionality in assemblers

Post by calima »

You *can* use the cpp or m4 preprocessors if you don't feel like reimplementing #define.
User avatar
Movax12
Posts: 541
Joined: Sun Jan 02, 2011 11:50 am

Re: Strategies for implementing macro functionality in assemblers

Post by Movax12 »

tokumaru wrote: Thu Aug 20, 2020 3:15 pm I started this thinking I'd use tokens all the way, but then found that macros and defines would make that really hard, since tokens can mean different things when used in different places. For example:

Code: Select all

  .define whatever z:

  ;in this case, "whatever" expands to a label definition
whatever JMP z

  ;but here, it expands to an address size modifier (forcing ZP addressing)
  lda whatever$1b
So how would I tokenize the "z:" as the expansion of "whatever", considering that it can be interpreted as...
Did you try this in ca65? I'm 99% sure it won't work, as z: is read as a single token. Everything in ca65 is tokenized always. I would test it myself, but I'm away from my computer.
Oziphantom
Posts: 1565
Joined: Tue Feb 07, 2017 2:03 am

Re: Strategies for implementing macro functionality in assemblers

Post by Oziphantom »

the string substitution, just stack alloc the temp buffer, that should easily handle every thing a sane person would through at a 6502 assembler. So make a large buffer, then start from the back and expand as you need to doing find replaces. Keep the src and convert one into the buffer going forwards as you find things. These are nothing to worry about on a machine made in the last 20 years.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Strategies for implementing macro functionality in assemblers

Post by tokumaru »

Movax12 wrote: Fri Aug 21, 2020 5:16 amDid you try this in ca65? I'm 99% sure it won't work, as z: is read as a single token.
I didn't try it, but regardless of how ca65 handles this, from a logical point of view it should work, and one popular assembler not supporting it shouldn't be an excuse to not do this properly.
I would test it myself, but I'm away from my computer.
Yeah, I'm away from my computer so much these days that I've even been coding on my phone (I can compile and run C programs using Termux).
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 568
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: Strategies for implementing macro functionality in assemblers

Post by Jarhmander »

Oziphantom wrote: Fri Aug 21, 2020 7:08 am the string substitution, just stack alloc the temp buffer, that should easily handle every thing a sane person would through at a 6502 assembler. So make a large buffer, then start from the back and expand as you need to doing find replaces. Keep the src and convert one into the buffer going forwards as you find things. These are nothing to worry about on a machine made in the last 20 years.
Pretty much my thinking regarding memory.
tokumaru wrote: Thu Aug 20, 2020 3:15 pm I started this thinking I'd use tokens all the way, but then found that macros and defines would make that really hard, since tokens can mean different things when used in different places. For example:

Code: Select all

  .define whatever z:

  ;in this case, "whatever" expands to a label definition
whatever JMP z

  ;but here, it expands to an address size modifier (forcing ZP addressing)
  lda whatever$1b
So how would I tokenize the "z:" as the expansion of "whatever", considering that it can be interpreted as different things depending on where it's used later on? To avoid problems like these, I figured it'd be better to delay the tokenization until each line of code was fully expanded, so there'd be no ambiguity to worry about.
Well, it all depends on what you are trying to achieve. When I see your example, I shiver a bit, but if that's what you want... Then obviously, "whatever" is not associated with the token (LABEL="z"), but several lower level tokens like [(id="z"), (colon)]; then when it's expanded, the tokens will tell what they means according to context. In whatever JMP z, id + colon means it's a label; in lda whatever$1b, id + colon is an address specifier. That's because in your "reduction rules", in the first case you match a "maybe-label", that is a label that may be omitted for a regular ASM "statement", and in the second case, it's a "memory-operand", which contains a "maybe-address-specifier". Note that your reduction rules may well be represented by functions, if you're doing a recursive-descent parser, which is simply about the simplest thing that just works.

Working with tokens like this will still be better than in raw strings IMHO, because you will not have to worry about token boundaries so much in expansions. For the record, the C preprocessor does exactly this: it tokenizes its input, and except for the "token paste" and "stringify" operators, nothing really changes the interpretation of what is read wrt tokens. The C compiler just reads tokens from the preprocessor.
((λ (x) (x x)) (λ (x) (x x)))
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Strategies for implementing macro functionality in assemblers

Post by tokumaru »

Oziphantom wrote: Fri Aug 21, 2020 7:08 amthe string substitution, just stack alloc the temp buffer
Yeah, I figured I'd try something like this: fully expand each line to a local (stack) buffer, and process the result. If it's a macro call, recursively process it so that the original line will remain in memory, allowing arguments to be copied from it. This imposes no specific limit on the number of nested macro calls, but the stack space will eventually run out.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Strategies for implementing macro functionality in assemblers

Post by tokumaru »

Jarhmander wrote: Fri Aug 21, 2020 8:30 amWhen I see your example, I shiver a bit, but if that's what you want...
It's not so much what I want, but what I think makes sense... defines are presented as "text replacements", and if you don't have the freedom to have the same text interpreted differently in different contexts, then that's false advertising.
Then obviously, "whatever" is not associated with the token (LABEL="z"), but several lower level tokens like [(id="z"), (colon)];
I see. So your suggestion is to make the tokens as basic as necessary so they can still be interpreted in different contexts without the awkwardness of having to "mutate" or be "reinterpreted". I guess that's a fair strategy.
Working with tokens like this will still be better than in raw strings IMHO, because you will not have to worry about token boundaries so much in expansions.
Tokens sound really efficient when you have to do multiple passes over the source code, or when you need to evaluate the same expressions over and over. This was my main motivation for going with tokens from the start.
For the record, the C preprocessor does exactly this: it tokenizes its input, and except for the "token paste" and "stringify" operators, nothing really changes the interpretation of what is read wrt tokens. The C compiler just reads tokens from the preprocessor.
I did not know that.
User avatar
tokumaru
Posts: 12427
Joined: Sat Feb 12, 2005 9:43 pm
Location: Rio de Janeiro - Brazil

Re: Strategies for implementing macro functionality in assemblers

Post by tokumaru »

If you were coding an assembly program and did this:

Code: Select all

.define first second
.define second third

first:
  ;what's the name of the label above?
What would you expect the name of that label to be? Is it "second" or is it "third"? I'm under the impression that in C it would be "second", but without using a pre-processor I believe that the implementation where that becomes "third" would be simpler to write.
User avatar
Jarhmander
Formerly ~J-@D!~
Posts: 568
Joined: Sun Mar 12, 2006 12:36 am
Location: Rive nord de Montréal

Re: Strategies for implementing macro functionality in assemblers

Post by Jarhmander »

It is "third". A C preprocessor just expands until 1) there's nothing left to expand, or 2) while expanding, it encounters a macro that was already expanded before (in the same expansion context), and let it as it is, to prevent any recursion.

That means

Code: Select all

#define foo bar, baz
#define bar foo, bar
"foo" expands to "foo, bar, baz".

It's a bit limitative, and about any other macro system allows recursion. Some (like much Lisps I've checked) don't bother explicitly limiting the recursion level at all, they either do an infinite loop (Scheme systems, probably happily tail recursing the expansion, and SBCL also, surprisingly) or let the stack overflow (CLisp, and it does recover gracefully from that).
((λ (x) (x x)) (λ (x) (x x)))
Post Reply