Strategies for implementing macro functionality in assemblers
Moderator: Moderators
Strategies for implementing macro functionality in assemblers
When thinking about writing my own assembler, one of the things that constantly baffles me is macros. Macro arguments, more specifically. In theory this shouldn't be so complicated, macro arguments could be treated like regular text substitutions (i.e. equates/defines), except that they'd only be active during one expansion of the macro they belong to. In practice though, having temporary text substitutions can be more complicated than it sounds.
In a language like C, where you have to manage all the memory you use manually, the annoyance and overhead of using malloc/free for each individual argument of each macro call of each pass sounds like something to be avoided. For most types of structures I prefer to use the concept of memory pools, where I allocate memory for several objects at a time and then manage the distribution of those objects, allocating more whenever they run out.
One alternative I considered was to cache the original source code and reference the text in the actual macro calls when replacing parameter names with actual arguments, avoiding the need to allocate temporary strings, but the fact that text substitutions can be nested and not all of them are global (e.g. if one macro calls another and passes along any arguments it received, the substitutions must be made *before* the call, because those arguments won't be visible from the second macro). Because of this, I believe that each line of source code must be fully expanded before it's processed, meaning that macro arguments should always be considered dynamic.
In this case, one solution would be to limit the number of nested macro calls that can be made and buffer the expanded source code lines containing the macro calls, so that the arguments can be copied from the fully expanded macro calls. The downside is the limitation imposed on macro nesting, and limitations are never cool.
Am I overthinking this? Am I overlooking something? Can anyone think of a simpler solution for implementing macro arguments in an assembler?
In a language like C, where you have to manage all the memory you use manually, the annoyance and overhead of using malloc/free for each individual argument of each macro call of each pass sounds like something to be avoided. For most types of structures I prefer to use the concept of memory pools, where I allocate memory for several objects at a time and then manage the distribution of those objects, allocating more whenever they run out.
One alternative I considered was to cache the original source code and reference the text in the actual macro calls when replacing parameter names with actual arguments, avoiding the need to allocate temporary strings, but the fact that text substitutions can be nested and not all of them are global (e.g. if one macro calls another and passes along any arguments it received, the substitutions must be made *before* the call, because those arguments won't be visible from the second macro). Because of this, I believe that each line of source code must be fully expanded before it's processed, meaning that macro arguments should always be considered dynamic.
In this case, one solution would be to limit the number of nested macro calls that can be made and buffer the expanded source code lines containing the macro calls, so that the arguments can be copied from the fully expanded macro calls. The downside is the limitation imposed on macro nesting, and limitations are never cool.
Am I overthinking this? Am I overlooking something? Can anyone think of a simpler solution for implementing macro arguments in an assembler?
Re: Strategies for implementing macro functionality in assemblers
Everything depends on the semantics and evaluation strategy of your macro system. Typically though you would identify all the macro arguments before you expand anything. Everything that isn't a macro argument would just be strings to the original code.
For example:
would be represented as:
Each STRING would be a pointer to the original text along with a substring length, and each ARG would be the integer index of whatever argument it defined.
Then when you expand macros, you end up with a list of STRINGs. For example, this text:
Would expand to:
Of course, instead of using strings of characters, you ought to use strings of tokens.
For example:
Code: Select all
.define FOO(X, Y) 3 * X + Y
Code: Select all
STRING "3 * "
ARG 0
STRING " + "
ARG 1
Then when you expand macros, you end up with a list of STRINGs. For example, this text:
Code: Select all
print FOO(1, 7);
print 5;
Code: Select all
STRING "print "
STRING "3 * "
STRING "1"
STRING " + "
STRING "7"
STRING "; print 5;"
Re: Strategies for implementing macro functionality in assemblers
I always found C's #define word processing macros to be silly and are probably an exeption among the rule in the world of programing languages. Almost everything else would expand macro arguments first and use the result in the macro, which is the behavior who makes common sense.
C language is poorly designed by ended up very popular just because of the snowball effect.
C language is poorly designed by ended up very popular just because of the snowball effect.
Useless, lumbering half-wits don't scare us.
- Jarhmander
- Formerly ~J-@D!~
- Posts: 568
- Joined: Sun Mar 12, 2006 12:36 am
- Location: Rive nord de Montréal
Re: Strategies for implementing macro functionality in assemblers
You seem to worry too much about optimizations at this time; even worse, you're worrying about how to manage memory down to the macro-handling code. Your time will be much better spent on how to structure things and separate concerns. Strings, for instance, are best managed in objects with a minimal but suffisent interface. Your macro-handing code deals with string objects, string buffers and higher data structures; it should never manage blocks of memory with malloc and free calls interspersed throughout the entire code. Doing this will complexify the code for no good reason, and you'll end up doing needless and useless optimizations. Note that I repeatedly talk about objects, but I do not imply going full OOP; while it's a good paradigm, there much more to it than putting interfaces everywhere and have objects with deep inheritance chains that people tend to assume for no good reason. Moreover, since it appears you're programming in C, that will be needlessly painful to do anyway ―ever tried to do polymorphism in C? Doable, but certainly much more boilerplate than doing it in a OOP language. Plain structs with functions (or macros!) to create, manipulate and destroy them are perfectly fine. Just never manipulate implementation details directly from you code (wrap that in functions/macros), and you're good to go. Memory pool handling code, if deemed useful, could be added later with not so much additional effort; I don't imply there would be no change to the code, but that would be not as big as you might think. In any case, the C programmer always end up reinventing the wheel, might as well make it do it precisely what you want: dynamic strings, string builders, string pools, hash tables etc. that works good enough for your needs, not more than that.
Also, I don't think assembler macros should manipulate text strings directly, but tokens and/or expressions. That would be no that much more work: your assembler can parse basic arithmetic expressions, right? Then you likely have what is needed for that kind of implementation. Text-based macro substitution is sometimes useful but finicky and lends to bad or cryptic error messages, because these kind of macro expanders operate at a too low level. You can end up with problems with multiple expansions, or problem separating tokens from source because of lack of whitespace (!) in macro expansion, or commas that means something different when reparsed in a new context, or all sort of silly things like that. Save yourself the hassle, make your macro expander works with tokens and/or expressions, not raw strings.
Then there's the problem of recursive macros. They are useful, but of course there's the possibility of infinite recursion. There's no way around that: use a recursion limit. No, not some of stupid magic constant that limits your recursion stack: a variable holding the maximum recursion level. Then do the obvious: when macro-expanding, check the current recursion level, bail out with an error when it exceeded the limit. That variable should not dictate directly the size of that recursion stack, this stack should just be dynamic and grow on demand. Then that variable could be tweaked by a command-line argument, if really needed. Make the default large enough (100? 500? who knows). That will efficiently trap any runaway recursion.
Now, some useful... "idiom"? If you want some kind of array dynamic, but not necessarily dynamically allocated, one may find that kind of code useful:
Of course, as shown here, it is only valid for some storage needed in the scope of the function. Also, for illustration I put the code directly in the body of the function, but you would want to wrap that in functions/macros. If the static storage and pointers were in a struct however, then the storage is valid as long as the object is alive; this is kinda like small buffer optimization (SBO), except normally when people talk about SBO, it's a more extreme variant with an union that shares storage between various pointers and a local buffer.
Rather than using a fully contiguous data structure, you may find yourself wanting some more dynamic structure, like a deque, where you have multiple smaller arrays linked together. That needs no major reallocation; for a stack-like structure, you don't even need a contiguous data structure anyway.
Also, I don't think assembler macros should manipulate text strings directly, but tokens and/or expressions. That would be no that much more work: your assembler can parse basic arithmetic expressions, right? Then you likely have what is needed for that kind of implementation. Text-based macro substitution is sometimes useful but finicky and lends to bad or cryptic error messages, because these kind of macro expanders operate at a too low level. You can end up with problems with multiple expansions, or problem separating tokens from source because of lack of whitespace (!) in macro expansion, or commas that means something different when reparsed in a new context, or all sort of silly things like that. Save yourself the hassle, make your macro expander works with tokens and/or expressions, not raw strings.
Then there's the problem of recursive macros. They are useful, but of course there's the possibility of infinite recursion. There's no way around that: use a recursion limit. No, not some of stupid magic constant that limits your recursion stack: a variable holding the maximum recursion level. Then do the obvious: when macro-expanding, check the current recursion level, bail out with an error when it exceeded the limit. That variable should not dictate directly the size of that recursion stack, this stack should just be dynamic and grow on demand. Then that variable could be tweaked by a command-line argument, if really needed. Make the default large enough (100? 500? who knows). That will efficiently trap any runaway recursion.
Now, some useful... "idiom"? If you want some kind of array dynamic, but not necessarily dynamically allocated, one may find that kind of code useful:
Code: Select all
void some_func(... size_t num ...)
{
// Some define, enum or constant dictating the initial size of the array; the value is your pick.
// For illustration purposes, it's directly in the function, but could be elsewhere of course.
enum { OBJ_ARRAY_INITIAL_SIZE = 16 };
Obj stack_storage[OBJ_ARRAY_INITIAL_SIZE];
Obj *storage = stack_storage, *dynamic_storage = NULL;
// If we need more storage, allocate it.
// numel => (sizeof(X)/sizeof(*X))
if (numel(stack_storage) > num) {
// Ideally, check the return value of malloc
storage = dynamic_storage = malloc(sizeof (Obj) * num);
}
// Do something with _storage_, never manipulate the other variables
// (dynamic_storage could be NULL, stack_storage may not be the actual storage).
// ...
// Clean-up.
free(dynamic_storage);
}
Rather than using a fully contiguous data structure, you may find yourself wanting some more dynamic structure, like a deque, where you have multiple smaller arrays linked together. That needs no major reallocation; for a stack-like structure, you don't even need a contiguous data structure anyway.
((λ (x) (x x)) (λ (x) (x x)))
Re: Strategies for implementing macro functionality in assemblers
I started this thinking I'd use tokens all the way, but then found that macros and defines would make that really hard, since tokens can mean different things when used in different places. For example:
So how would I tokenize the "z:" as the expansion of "whatever", considering that it can be interpreted as different things depending on where it's used later on? To avoid problems like these, I figured it'd be better to delay the tokenization until each line of code was fully expanded, so there'd be no ambiguity to worry about.
But still, strings, tokens, whatever you go with, the problem of nested temporary expansions is one that appears to complicate the memory management of the assembler significantly. Most other things are permanent (e.g. labels, variables, literals, etc.), once created they remain in use until the assembler is done, but the replacement of macro parameters is temporary, which really complicated things from where I'm seeing.
Code: Select all
.define whatever z:
;in this case, "whatever" expands to a label definition
whatever JMP z
;but here, it expands to an address size modifier (forcing ZP addressing)
lda whatever$1b
But still, strings, tokens, whatever you go with, the problem of nested temporary expansions is one that appears to complicate the memory management of the assembler significantly. Most other things are permanent (e.g. labels, variables, literals, etc.), once created they remain in use until the assembler is done, but the replacement of macro parameters is temporary, which really complicated things from where I'm seeing.
Re: Strategies for implementing macro functionality in assemblers
I belive that powerful macros must be turing-complete compile-time tool capable of anything.
That is macros must behave as functions returning strings.
And string allocation questions are not worth of narrowing of this design to simple string substitution.
However implementation of turing-complete language inside macros could be more powerfull task than the rest of assembler.
So I was interested in simple designs like Tcl language which is built around string substitution and passes code to if/for/while constructions as... strings! Code inside sting of code inside string of code and so on - it's possible due to special string syntax with start/end markers, so strings can be nested. So if/for/while could be implemented in language itself using built-in 'apply' function or something like that.
Interesting thing to implement, however I don't have time to make this experiment yet.
That is macros must behave as functions returning strings.
And string allocation questions are not worth of narrowing of this design to simple string substitution.
However implementation of turing-complete language inside macros could be more powerfull task than the rest of assembler.
So I was interested in simple designs like Tcl language which is built around string substitution and passes code to if/for/while constructions as... strings! Code inside sting of code inside string of code and so on - it's possible due to special string syntax with start/end markers, so strings can be nested. So if/for/while could be implemented in language itself using built-in 'apply' function or something like that.
Interesting thing to implement, however I don't have time to make this experiment yet.
Re: Strategies for implementing macro functionality in assemblers
You *can* use the cpp or m4 preprocessors if you don't feel like reimplementing #define.
Re: Strategies for implementing macro functionality in assemblers
Did you try this in ca65? I'm 99% sure it won't work, as z: is read as a single token. Everything in ca65 is tokenized always. I would test it myself, but I'm away from my computer.tokumaru wrote: ↑Thu Aug 20, 2020 3:15 pm I started this thinking I'd use tokens all the way, but then found that macros and defines would make that really hard, since tokens can mean different things when used in different places. For example:
So how would I tokenize the "z:" as the expansion of "whatever", considering that it can be interpreted as...Code: Select all
.define whatever z: ;in this case, "whatever" expands to a label definition whatever JMP z ;but here, it expands to an address size modifier (forcing ZP addressing) lda whatever$1b
-
- Posts: 1565
- Joined: Tue Feb 07, 2017 2:03 am
Re: Strategies for implementing macro functionality in assemblers
the string substitution, just stack alloc the temp buffer, that should easily handle every thing a sane person would through at a 6502 assembler. So make a large buffer, then start from the back and expand as you need to doing find replaces. Keep the src and convert one into the buffer going forwards as you find things. These are nothing to worry about on a machine made in the last 20 years.
Re: Strategies for implementing macro functionality in assemblers
I didn't try it, but regardless of how ca65 handles this, from a logical point of view it should work, and one popular assembler not supporting it shouldn't be an excuse to not do this properly.
Yeah, I'm away from my computer so much these days that I've even been coding on my phone (I can compile and run C programs using Termux).I would test it myself, but I'm away from my computer.
- Jarhmander
- Formerly ~J-@D!~
- Posts: 568
- Joined: Sun Mar 12, 2006 12:36 am
- Location: Rive nord de Montréal
Re: Strategies for implementing macro functionality in assemblers
Pretty much my thinking regarding memory.Oziphantom wrote: ↑Fri Aug 21, 2020 7:08 am the string substitution, just stack alloc the temp buffer, that should easily handle every thing a sane person would through at a 6502 assembler. So make a large buffer, then start from the back and expand as you need to doing find replaces. Keep the src and convert one into the buffer going forwards as you find things. These are nothing to worry about on a machine made in the last 20 years.
Well, it all depends on what you are trying to achieve. When I see your example, I shiver a bit, but if that's what you want... Then obviously, "whatever" is not associated with the token (LABEL="z"), but several lower level tokens like [(id="z"), (colon)]; then when it's expanded, the tokens will tell what they means according to context. In whatever JMP z, id + colon means it's a label; in lda whatever$1b, id + colon is an address specifier. That's because in your "reduction rules", in the first case you match a "maybe-label", that is a label that may be omitted for a regular ASM "statement", and in the second case, it's a "memory-operand", which contains a "maybe-address-specifier". Note that your reduction rules may well be represented by functions, if you're doing a recursive-descent parser, which is simply about the simplest thing that just works.tokumaru wrote: ↑Thu Aug 20, 2020 3:15 pm I started this thinking I'd use tokens all the way, but then found that macros and defines would make that really hard, since tokens can mean different things when used in different places. For example:
So how would I tokenize the "z:" as the expansion of "whatever", considering that it can be interpreted as different things depending on where it's used later on? To avoid problems like these, I figured it'd be better to delay the tokenization until each line of code was fully expanded, so there'd be no ambiguity to worry about.Code: Select all
.define whatever z: ;in this case, "whatever" expands to a label definition whatever JMP z ;but here, it expands to an address size modifier (forcing ZP addressing) lda whatever$1b
Working with tokens like this will still be better than in raw strings IMHO, because you will not have to worry about token boundaries so much in expansions. For the record, the C preprocessor does exactly this: it tokenizes its input, and except for the "token paste" and "stringify" operators, nothing really changes the interpretation of what is read wrt tokens. The C compiler just reads tokens from the preprocessor.
((λ (x) (x x)) (λ (x) (x x)))
Re: Strategies for implementing macro functionality in assemblers
Yeah, I figured I'd try something like this: fully expand each line to a local (stack) buffer, and process the result. If it's a macro call, recursively process it so that the original line will remain in memory, allowing arguments to be copied from it. This imposes no specific limit on the number of nested macro calls, but the stack space will eventually run out.Oziphantom wrote: ↑Fri Aug 21, 2020 7:08 amthe string substitution, just stack alloc the temp buffer
Re: Strategies for implementing macro functionality in assemblers
It's not so much what I want, but what I think makes sense... defines are presented as "text replacements", and if you don't have the freedom to have the same text interpreted differently in different contexts, then that's false advertising.Jarhmander wrote: ↑Fri Aug 21, 2020 8:30 amWhen I see your example, I shiver a bit, but if that's what you want...
I see. So your suggestion is to make the tokens as basic as necessary so they can still be interpreted in different contexts without the awkwardness of having to "mutate" or be "reinterpreted". I guess that's a fair strategy.Then obviously, "whatever" is not associated with the token (LABEL="z"), but several lower level tokens like [(id="z"), (colon)];
Tokens sound really efficient when you have to do multiple passes over the source code, or when you need to evaluate the same expressions over and over. This was my main motivation for going with tokens from the start.Working with tokens like this will still be better than in raw strings IMHO, because you will not have to worry about token boundaries so much in expansions.
I did not know that.For the record, the C preprocessor does exactly this: it tokenizes its input, and except for the "token paste" and "stringify" operators, nothing really changes the interpretation of what is read wrt tokens. The C compiler just reads tokens from the preprocessor.
Re: Strategies for implementing macro functionality in assemblers
If you were coding an assembly program and did this:
What would you expect the name of that label to be? Is it "second" or is it "third"? I'm under the impression that in C it would be "second", but without using a pre-processor I believe that the implementation where that becomes "third" would be simpler to write.
Code: Select all
.define first second
.define second third
first:
;what's the name of the label above?
- Jarhmander
- Formerly ~J-@D!~
- Posts: 568
- Joined: Sun Mar 12, 2006 12:36 am
- Location: Rive nord de Montréal
Re: Strategies for implementing macro functionality in assemblers
It is "third". A C preprocessor just expands until 1) there's nothing left to expand, or 2) while expanding, it encounters a macro that was already expanded before (in the same expansion context), and let it as it is, to prevent any recursion.
That means
"foo" expands to "foo, bar, baz".
It's a bit limitative, and about any other macro system allows recursion. Some (like much Lisps I've checked) don't bother explicitly limiting the recursion level at all, they either do an infinite loop (Scheme systems, probably happily tail recursing the expansion, and SBCL also, surprisingly) or let the stack overflow (CLisp, and it does recover gracefully from that).
That means
Code: Select all
#define foo bar, baz
#define bar foo, bar
It's a bit limitative, and about any other macro system allows recursion. Some (like much Lisps I've checked) don't bother explicitly limiting the recursion level at all, they either do an infinite loop (Scheme systems, probably happily tail recursing the expansion, and SBCL also, surprisingly) or let the stack overflow (CLisp, and it does recover gracefully from that).
((λ (x) (x x)) (λ (x) (x x)))