C(++) and binary data annoyance

NewRisingSun
Posts: 1510
Joined: Thu May 19, 2005 11:30 am

C(++) and binary data annoyance

Post by NewRisingSun »

In C (and C++), you can define structures with arbitrary data types inside, and you can even define structures with packed bitfields. For example, you could do something like this:

Code: Select all

union {
	struct {
		unsigned grayscale: 1;
		unsigned leftBorderBackground: 1;
		unsigned leftBorderSprites: 1;
		unsigned showBackground: 1;
		unsigned showSprites: 1;
		unsigned emphasizeRed: 1;
		unsigned emphasizeGreen: 1;
		unsigned emphasizeBlue: 1;
	};
	uint8_t data;
} PPUMask;

void PPU::writeHandler (uint16_t address, uint8_t value) {
	switch(address &0x0007) {
		case 1: PPUMask.data = value; break;
	}
}

void PPU::draw (void) {
	// ...
	if (PPUMask.showBackground) {
		// ...
	}
}
By naming bitfields while also being able to access the entire PPUMask structure at once, one can write very flexible and readable code. But every C person will tell you that this is bad, because the C standard lets compilers pad structures and fill up bitfields however they want, and you should not assume anything about bit ordering, byte order and so on for the sake of portability. So why not add additional keywords to the C standard that let me specify what kind of bit or byte ordering I expect, if I expect anything, and whether I can accept structure padding in any given case, and let the compiler do the additional work? Something like this stylized example:

Code: Select all

struct IFFHeader {
	char		ID[4];
	be_uint32_t	length;	// big endian chunk size
} binary; // don't pad anything

union {
	struct {
		unsigned grayscale: 1;
		unsigned leftBorderBackground: 1;
		unsigned leftBorderSprites: 1;
		unsigned showBackground: 1;
		unsigned showSprites: 1;
		unsigned emphasizeRed: 1;
		unsigned emphasizeGreen: 1;
		unsigned emphasizeBlue: 1;
	} binary lsbfirst; // bits are packed LSB-first, from the 0x01 bit up to the 0x80 bit
	uint8_t data;
} PPUMask;
But no. If you look at stackoverflow.com and similar sites, as well as most "portable" code that processes binary data in any form, you are expected to do something horrible like this in the name of portability:

Code: Select all

#define PPUMASK_SHOWBG 0x08

if (PPUMask &PPUMASK_SHOWBG) {
	// ...
}
// read big-endian chunk size
uint32_t iffLength = ((uint32_t)chunkHeader[4] <<24) | ((uint32_t)chunkHeader[5] <<16) | ((uint32_t)chunkHeader[6] <<8) | chunkHeader[7];
Basically, when processing binary data, you're supposed to eschew high-level structures and do everything by hand in the name of portability: masking and shifting bits out of packed bytes, assembling and disassembling multibyte fields from binary structures, and so on, with all the potential for additional error that this hard-to-read code brings about. All of that instead of just being able to tell the compiler what I want and letting it do the work, including any optimizations, as needed on the particular platform. And most C/C++ programmers, and certainly the people doing the standard seem to be perfectly fine with it.
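
Sure, one can bury the shifting in a little helper (a sketch; "read_be32" is a name I just made up), but that only hides the noise; the endianness still gets restated at every call site instead of once in a declaration:

Code: Select all

#include <stdint.h>

// the usual "portable" workaround, buried in a helper
static inline uint32_t read_be32 (const uint8_t *p) {
	return ((uint32_t)p[0] <<24) | ((uint32_t)p[1] <<16) |
	       ((uint32_t)p[2] <<8)  |  (uint32_t)p[3];
}

// then, wherever the chunk header was read:
// uint32_t iffLength = read_be32(&chunkHeader[4]);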
koitsu
Posts: 4201
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: C(++) and binary data annoyance

Post by koitsu »

Myria might have a field day with this one. ;-) (reference)
dougeff
Posts: 3079
Joined: Fri May 08, 2015 7:17 pm

Re: C(++) and binary data annoyance

Post by dougeff »

Interesting side note.

The Wikipedia page on bit fields has NES-specific example code.

https://en.m.wikipedia.org/wiki/Bit_field
nesdoug.com -- blog/tutorial on programming for the NES
LightStruk
Posts: 45
Joined: Sat May 04, 2013 6:44 am

Re: C(++) and binary data annoyance

Post by LightStruk »

I use bitfields all over the place in my emulator, in a way very similar to yours. I use them in the core CPU and PPU emulation, I use them in mapper registers, I use them in iNES and FDS header parsers.

That advice about non-portable or compiler-specific behavior? Yeah, the standard might say that, but in practice, which compilers are you going to use? My emulator works on Windows, macOS, and Linux, using MSVC, clang/llvm, and gcc respectively, and the bitfields behave exactly the same way on those three compilers. What more do you want?

Bitfields aren't perfect, however. You have to pay close attention to bit alignment within bytes, byte alignment within words, endianness, and more. My bitfields look like this:

Code: Select all

	enum NametableSource : uint8_t {
		CIRAM,
		CHRROM
	};

	union VRC6PpuBankingStyle {
		VRC6PpuBankingStyle(uint8_t val) : value(val) {}
		struct {
#if __BYTE_ORDER == __LITTLE_ENDIAN
			uint8_t ppuBankingMode : 2;
			uint8_t mirroring : 2;
			NametableSource nametableSource : 1;
			uint8_t chrA10Rule : 1;
			uint8_t unused : 1;
			uint8_t prgRamEnable : 1;
#else // __BIG_ENDIAN
			uint8_t prgRamEnable : 1;
			uint8_t unused : 1;
			uint8_t chrA10Rule : 1;
			NametableSource nametableSource : 1;
			uint8_t mirroring : 2;
			uint8_t ppuBankingMode : 2;
#endif
		};
		uint8_t value;
	};
All of my bitfields are endian-safe by repeating their members in reverse order for the other endianness. Boost.Endian provides the compile-time endianness detection. You can see I use a typesafe enum inside this bitfield, and I declare the underlying integer type on my enums, which might be necessary to use them the way I do.
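
Using one of these is then a one-liner; hypothetical usage (not lifted from my actual code):

Code: Select all

void onRegisterWrite (uint8_t value) {
	VRC6PpuBankingStyle reg(value);    // wrap the raw register byte
	if (reg.prgRamEnable) {
		// ... enable PRG RAM
	}
	uint8_t mode = reg.ppuBankingMode; // always bits 0-1 of the register
	// ...
}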

I often deal with 16-bit fields like this:

Code: Select all

#pragma pack(push, 1)
struct DiskInfo
{
	uint8_t BlockCode;
	char DiskVerification[14];
	// ... snip ...
private:
	uint8_t diskWriterSerialNumberLo;
	uint8_t diskWriterSerialNumberHi;
	// ... snip ...
public:
	uint16_t GetDiskWriterSerialNumber() const { return diskWriterSerialNumberLo | (uint16_t(diskWriterSerialNumberHi) << 8); }
};
#pragma pack(pop)
Doing it this way means I don't even have to think about byte alignment / word alignment within the structure - I can just pluck a pointer out of the middle of an arbitrary buffer, cast it to the type I want, and get 16-bit integers out, regardless of the architecture or endianness. It's not like this is performance-sensitive code.
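
Reading one of these out of a disk image then looks like this (a sketch; "image" and "offset" are stand-ins for wherever the data lives):

Code: Select all

// "image" points at the raw disk dump, "offset" at the start of the DiskInfo block
uint16_t readSerial (const uint8_t* image, size_t offset) {
	const DiskInfo* info = reinterpret_cast<const DiskInfo*>(image + offset);
	return info->GetDiskWriterSerialNumber(); // correct on any endianness
}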
pubby
Posts: 583
Joined: Thu Mar 31, 2016 11:15 am

Re: C(++) and binary data annoyance

Post by pubby »

Many years ago I wrote some code which would handle serialization automatically, but you had to write structs inside a macro. It looked like this:

Code: Select all

struct header_t
{
    SERIALIZED_POD
    (
        ((std::uint16_t, op))
        ((std::uint32_t, length))
    )
};
So it is possible to solve the problem generically; it just requires a library.

@LightStruk
Why do you have to swap the bits with different endians? Shouldn't you only be swapping the bytes?
Banshaku
Posts: 2417
Joined: Tue Jun 24, 2008 8:38 pm
Location: Japan
Contact:

Re: C(++) and binary data annoyance

Post by Banshaku »

So to allow portability you just destroy readability... :? That doesn't seem like a good compromise. I feel your pain. I would react the same way.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)
Contact:

Re: C(++) and binary data annoyance

Post by tepples »

dougeff wrote:The Wikipedia page on bit fields has NES-specific example code.
The diff that added it is mine. The previous bit order didn't match any real-world bit order, and I thought reflecting real-world use would make the example more meaningful to readers.
LightStruk
Posts: 45
Joined: Sat May 04, 2013 6:44 am

Re: C(++) and binary data annoyance

Post by LightStruk »

pubby wrote:Why do you have to swap the bits with different endians? Shouldn't you only be swapping the bytes?
I think the compiler developers could have decided to keep bit-order within bytes the same between little and big endian code, but they decided instead to make the bit-order behave like the byte-order. In other words, for little endian, bytes and bits are declared from low to high within a given integer type, even though high-to-low bit order is a lot more legible. Since byte order works like this:

Code: Select all

union shortword {
  struct {
#if __BYTE_ORDER == __LITTLE_ENDIAN
    uint8_t byteLow;
    uint8_t byteHigh;
#else // __BIG_ENDIAN
    uint8_t byteHigh;
    uint8_t byteLow;
#endif
  };
  uint16_t value;
};
bit order has to work like this:

Code: Select all

union nibbles {
  struct {
#if __BYTE_ORDER == __LITTLE_ENDIAN
    uint8_t nibbleLow:4;
    uint8_t nibbleHigh:4;
#else // __BIG_ENDIAN
    uint8_t nibbleHigh:4;
    uint8_t nibbleLow:4;
#endif
  };
  uint8_t value;
};
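You can convince yourself with a quick test (assuming a little-endian machine):

Code: Select all

void check (void) {
	nibbles n;
	n.value = 0xAB;
	// now n.nibbleHigh == 0xA and n.nibbleLow == 0xB:
	// nibbleLow, declared first, landed in the low 0x0F bits
}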
NewRisingSun
Posts: 1510
Joined: Thu May 19, 2005 11:30 am

Re: C(++) and binary data annoyance

Post by NewRisingSun »

koitsu wrote:Myria might have a field day with this one. ;-) (reference)
Oh, I have absolutely no doubt that the sadists on that C standard committee are always keeping a "gotcha" whip ready for everybody about something or other.
LightStruk wrote:That advice about non-portable or compiler-specific behavior? Yeah, the standard might say that, but in practice, which compilers are you going to use? My emulator works on Windows, macOS, and Linux, using MSVC, clang/llvm, and gcc respectively, and the bitfields behave exactly the same way on those three compilers. What more do you want?
Supposedly, it will fail on a Solaris machine or some other big-endian crap platform, because apparently, GCC packs bitfields backwards for these target platforms.

But before I torture myself with code from hell like this:

Code: Select all

int PRGMask = ((0x3F | (Reg[1] &0x40) | ((Reg[1] &0x20) <<2)) ^ ((Reg[0] &0x40) >>2)) ^ ((Reg[1] &0x80) >>2);
... I have decided to become a portability refusenik and to embrace bit fields whole-heartedly, devil-may-care. In the unlikely event that some poor soul has to adapt my code, reversing the lines of a bit field definition will be easier than trying to understand bit shifts from hell such as the above.

Also, there apparently is Boost.Endian, which allows specifying a "big_uint32_t/little_uint32_t" instead of just a plain "uint32_t", with automatic implicit conversion where necessary. Of course, installing Boost can be a major pain. "The plan is to submit Boost.Endian to the C++ standards committee for possible inclusion in a Technical Specification or the C++ standard itself." Wow! 44 years (that quote is from 2016) after the introduction of the C language in 1972, and 27 years after its standardization in 1989, somebody has the bright idea that C itself might benefit from the ability to specify endianness for a variable! What other wonders will they think of next?
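
If I read the documentation correctly, the IFF header from my first post would come out something like this (untested sketch; I have not actually installed the thing):

Code: Select all

#include <boost/endian/arithmetic.hpp>

struct IFFHeader {
	char                        ID[4];
	boost::endian::big_uint32_t length; // stored big-endian, converts on access
};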
rainwarrior
Posts: 8732
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: C(++) and binary data annoyance

Post by rainwarrior »

Banshaku wrote:So to allow portability you just destroy readability... :? That doesn't seem like a good compromise. I feel your pain. I would react the same way.
There are plenty of "readable" ways to make bitfields.

Most commonly I've seen macros used for it. Some people have general opposition to macros. YMMV.
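
e.g. the classic style (a sketch; the names are made up):

Code: Select all

#define PPUMASK_GRAYSCALE 0x01
#define PPUMASK_SHOW_BG   0x08

#define FLAG_TEST(reg, f)  (((reg) & (f)) != 0)
#define FLAG_SET(reg, f)   ((reg) |= (f))
#define FLAG_CLEAR(reg, f) ((reg) &= ~(f))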

You can also make bitfields with template classes. std::bitset is one such implementation.
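
For example (a sketch; note that bits are addressed by index, not by name):

Code: Select all

#include <bitset>
#include <cstdint>

void example (void) {
	std::bitset<8> mask(0x08);     // construct from a raw byte
	bool showBackground = mask[3]; // test a bit by position
	mask.set(0);                   // e.g. grayscale on
	uint8_t raw = static_cast<uint8_t>(mask.to_ulong()); // back to a byte
}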

The bitfield part of the language spec is also unusable for lots of purposes. As you've hit on before, the C standard has always allowed compilers to pad structures arbitrarily. Most compilers have a #pragma pack extension, but I don't think C bitfields have really gotten that kind of treatment, partly because they aren't very widely used.

And yes, Banshaku, you can write explicit shift / and / or / complement / etc. everywhere too. It has a disadvantage of verbosity and maintenance, but I would also say that approach has an advantage of being very explicit about what is happening.


Every approach has compromises. The reason the C language feature isn't well used is just that in most cases where you really want a bitfield, you care specifically how those bits are packed. Its primary use case is more or less an optional compiler optimization of space. There's just no way to specify even "I want these 8 bits to fit into a byte", which kinda defeats the usual point of bitfields.

I don't think portability is really the biggest problem here; it's just that, in a lot of cases, most implementations will not create a bitfield structure that is packed the way you want.

Same deal with std::bitset: since the actual implementation is hidden by the library, it's more or less a compiler-optional space optimization only. (...but std::bitset at least works pretty well in practice when e.g. you want to store a million flag variables.)


So... these are optional memory usage optimizations. In a lot of cases memory isn't so constrained. In cases where it is very constrained, you may very well need to "roll your own" instead of hoping the compiler will do what you want with those bits... and that's the reason most people do.
rainwarrior
Posts: 8732
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: C(++) and binary data annoyance

Post by rainwarrior »

NewRisingSun wrote:Wow! 44 years (that quote is from 2016) after the introduction of the C language in 1972, and 27 years after its standardization in 1989, somebody has the bright idea that C itself might benefit from the ability to specify endianness for a variable! What other wonders will they think of next?
For portability's sake, the one thing that's really been missing is a compile-time way to know the native endianness of the target system, and yeah, that's still not in. Coming in C++20.
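
(Specifically, std::endian in the C++20 <bit> header; a minimal sketch:)

Code: Select all

#include <bit> // C++20

constexpr bool isLittleEndian = (std::endian::native == std::endian::little);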

The solution to that one is generally just to add a line or two to some header whenever a new kind of target platform is added. There are some compiler defines (e.g. GCC's __BYTE_ORDER__) that can cover a whole compiler family at once. When your product is a library that's supposed to build with any compiler out there, like Boost, this solution gets a little cumbersome, but aside from universal library projects I think most applications can get all their targets covered in a handful of #define lines? (Or already depend on a library that has done it for them.) It's a problem, but not a very hard one.
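
Something like this covers the GCC/Clang family and MSVC in one go (a sketch; MY_LITTLE_ENDIAN is a made-up name):

Code: Select all

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#  define MY_LITTLE_ENDIAN 1 // GCC/Clang family
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#  define MY_LITTLE_ENDIAN 0
#elif defined(_MSC_VER)
#  define MY_LITTLE_ENDIAN 1 // all current MSVC targets are little-endian
#else
#  error "unknown target endianness; add a line for this platform"
#endif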

Though, ironically, finally adding it to C++20 won't help cases like Boost which will still have a mandate to support every compiler under the sun. ;P

As far as Boost's endian types go, I dunno. It's one way to solve a particular problem, but there are other very trivial solutions for the same thing. I could take it or leave it; I don't really think it needs to be part of the spec, but I wouldn't kick it out of bed, I suppose. This doesn't seem like a glaring omission to me at all.
NewRisingSun
Posts: 1510
Joined: Thu May 19, 2005 11:30 am

Re: C(++) and binary data annoyance

Post by NewRisingSun »

Everybody seems to come up with insular solutions to cope with the problem of compilers doing whatever they want, because the standard allows it. Some of these solutions may well be elegant and workable, and of course, they have been used for decades. But they are still workarounds, hacks, kludges, coping strategies.

My point is, one should not have to. One should not have to use workarounds to cope with a compiler doing whatever it wants. One should have the ability to tell the compiler what it should do, and when necessary how it should do it, in a way that every standards-conforming compiler understands. And that's a fundamental, almost philosophical premise when writing or updating a standard, one that seems to be missing, and whose lack will not be remedied by submitting any particular workaround for standardization.
rainwarrior
Posts: 8732
Joined: Sun Jan 22, 2012 12:03 pm
Location: Canada
Contact:

Re: C(++) and binary data annoyance

Post by rainwarrior »

NewRisingSun wrote:And most C/C++ programmers, and certainly the people doing the standard seem to be perfectly fine with it.
I mean, I can agree that it would be nice if there was a way to specify a well defined data structure in C or C++.

The reality of it is that you can't though. If you're interested in ways to practically do it, there are many I could suggest, but by your response it sounds like you don't want to discuss that.

Am I "perfectly fine" with it? I guess. I try to accept the things I can't change. Joining an ISO national committee and working on the standard is an option, I suppose, but that's not a casual engagement. I'd probably rather spend my time writing code than arguing with people about how it could have been written. (Even less so with people with no capacity to change it.)

I wouldn't assume that "the people doing the standard" are perfectly fine with it either. I'm sure this particular issue has been debated many times, with probably a great many people unhappy that it hasn't reached a solution.


Also, just to point out another way the existing C/C++ bitfield feature comes up short: try making that union in your first post and taking the size of it. (This is kind of what I meant by portability not being the biggest problem. It's not that some compilers might do a different thing; it's that you probably won't find any compiler for a given platform that does what you want here.) They're pretty good at saving some data space if used carefully, but that's about it.

As another thought, though, is there a language that implements bitfields in a better way? Not meant as an argument that C/C++ should not improve; I'm just curious if any do.
NewRisingSun
Posts: 1510
Joined: Thu May 19, 2005 11:30 am

Re: C(++) and binary data annoyance

Post by NewRisingSun »

Well, this is more of a rant thread than a practical solution thread. :)

I do think, however, that the problem of well-defined data structures has in fact not been adequately considered by the standards bodies, rather than having been considered with a deliberate decision to keep things as they are. And I think that is exactly because people are too willing to help themselves with practical yet insular solutions, so it is not seen as something that needs addressing. That is what I meant by people liking it that way. We shall see what becomes of that Boost.Endian proposal, if there is one.

The union size is of course that of an unsigned int (typically 32 bits), but that does not make it useless. I use constructs like the quoted one solely for the purpose of accessing individual hardware register bits or bit groups in an extremely readable manner, without having to clutter up my code by explicitly writing out ANDs and ORs, or CheckBit/ClearBit/SetBit macros/helper functions, all the time. It is quite useful for that, and works well enough when I restrict myself to compilers targeting little-endian platforms.

Edit: I've thought some more about why exactly I so dislike bit-setting macros and std::bitset templates, let alone manually ANDing and ORing all the time: they basically require me to explicitly specify the storage details every time I access a bitfield member, instead of just once when I declare the bitfield. The equivalent with normal variables would be never being able to declare variables as floats, ints of various sizes or strings, but instead having to declare everything as void* or uint8_t, and then having to cast every single time I access the variable. Nobody would want to do that for normal variables, yet that is exactly what all "portable" solutions for accessing well-defined data structures amount to, regardless of the amount of syntactic sugar that they use. C++ bit fields, and Boost.Endian arithmetic, are the only solutions that do not suffer from that drawback.
thefox
Posts: 3134
Joined: Mon Jan 03, 2005 10:36 am
Location: 🇫🇮
Contact:

Re: C(++) and binary data annoyance

Post by thefox »

Just say "fuck it", and say that your code is only supported on compilers/platforms where bitfields are laid out in the order you expect. You can add compile-time or runtime asserts to make sure that your expectations hold true.

It's perfectly reasonable to not support every compiler (and platform) in existence.
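
e.g. a one-time startup check against the layout you rely on (a sketch, reusing the PPUMask union from the first post):

Code: Select all

#include <cassert>

// call once at startup: verify the compiler laid the bits out as expected
void checkBitfieldLayout (void) {
	PPUMask.data = 0;
	PPUMask.showBackground = 1;
	assert(PPUMask.data == 0x08 && "bitfield layout differs from expectation");
}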
Download STREEMERZ for NES from fauxgame.com! — Some other stuff I've done: fo.aspekt.fi