Determining a file's content type without extension

tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Determining a file's content type without extension

Post by tepples »

anikom15 wrote: Thu Dec 10, 2020 12:20 am It’s not uncommon for extensions to mean fuckall anyway. The fact Windows uses extensions to decide what program to launch is [careless] in the first place.
I'd be interested to discuss a replacement for the file extension (or file name suffix). Let me first define what problems there are so we know what we need to solve.

User double-clicked a file. Which app starts?
Say I have a file whose name ends with .bin (generic ROM image), .cue (index of multitrack optical disc image), or .iso (ISO 9660 or UDF file system image), and I want to launch the appropriate emulator. What steps should an operating system take to determine this? I imagine there would be two steps: first determine the platform for which the image was made, and then determine the preferred emulator for that platform. In a well-designed operating system, how would an application go about registering rules to recognize a format and its ability to play that format? Would the registration have to be system-wide or per user?

Web client requested a file. What's its Content-type?
Or say I have a web server from which the user has requested a particular file. I want the web server to determine which Content-type value to send along with the file's contents. What steps should the web server take to determine this? Or say the user drags a file into a mail user agent's compose window as an attachment. What should the attachment's Content-type be once the user sends the email? Per the previous topic Internet media types (MIME types) for retro file formats, I'm aware that even having a correct value to send in the first place is a serious undertaking because of formal documentation that needs to be prepared and submitted to IETF.
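A minimal sketch of the name-based lookup that a server might start from, assuming Python's standard mimetypes module. The helper name content_type_for is hypothetical and not taken from any particular server.

```python
import mimetypes

def content_type_for(name, default="application/octet-stream"):
    """Guess a Content-Type from the file name alone.

    Falls back to a generic binary type when the name gives no hint,
    which is exactly the case this thread is asking about.
    """
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed if guessed is not None else default
```

A name with no suffix, or with an unregistered one, lands on the generic fallback, so anything smarter has to look at the content or at per-file configuration.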

User opened an assembly file. What instruction set is it in?
Or say I have a text editor that can edit assembly language for 6502/65816, Z80, 68000, x86, SM83 (8080-like CPU in Game Boy), SPC700, MIPS, SuperH, or ARM. A single project may have code for two or more instruction sets, such as 65816 and SPC700 (Super NES), or 68000, Z80, and SuperH (Genesis 32X), or SuperH and ARM (Dreamcast). How would an editor know what set of syntax highlighting rules to apply for a given .s or .asm file?
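One possible approach, sketched under heavy assumptions: count how many lines start with a mnemonic from each instruction set and pick the best score. The mnemonic lists below are deliberately tiny and hypothetical, and several mnemonics overlap between real instruction sets, so this is a heuristic and not a reliable detector.

```python
import re

# Hypothetical, deliberately tiny mnemonic lists; a real editor would need far more.
MNEMONICS = {
    "6502": {"lda", "sta", "ldx", "ldy", "jsr", "rts", "bne", "beq"},
    "z80": {"ld", "jp", "jr", "djnz", "call", "ret"},
    "68000": {"move", "lea", "bsr", "bra", "dbra", "moveq"},
}

def guess_instruction_set(source):
    """Return the instruction set whose mnemonics appear most often, or None."""
    scores = {name: 0 for name in MNEMONICS}
    for line in source.splitlines():
        code = line.split(";", 1)[0]  # drop ';' comments
        # Skip an optional "label:" and capture the first alphabetic word.
        match = re.match(r"\s*(?:\w+:)?\s*([A-Za-z]+)", code)
        if not match:
            continue
        word = match.group(1).lower()
        for name, words in MNEMONICS.items():
            if word in words:
                scores[name] += 1
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In a mixed project, an editor would probably still want a per-file or per-directory setting to win over a guess like this.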

I'm at the combination Pizza Hut and Taco Bell
Should polyglot files, which are valid as multiple content types, receive any special treatment? Even apart from lightweight markup languages, such as Markdown, being intended for legibility as plain UTF-8 text, there are pairs of formats for which building a polyglot is straightforward. These pairs can arise by design, such as producing a zipfile with a prepended extraction program. Or they can arise by accident, such as a Game Boy ROM that is also a valid PNG image by putting the entire program part in a chunk, or a ROM for Super NES and Game Boy where each platform sees its own program at a separate header ($7FC0 or $0100).
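To make the Super NES and Game Boy case concrete, here is a rough sketch of a detector that reports every platform whose header check passes instead of stopping at the first. It assumes a LoROM layout for the Super NES side (checksum complement at $7FDC, checksum at $7FDE) and uses only the Game Boy header checksum at $014D. The function names are hypothetical, and a one-byte checksum is a weak test on its own, so a real detector would check more fields.

```python
def gb_header_ok(rom):
    """Check the Game Boy header checksum stored at $014D."""
    if len(rom) < 0x150:
        return False
    total = 0
    for byte in rom[0x134:0x14D]:  # header bytes $0134-$014C
        total = (total - byte - 1) & 0xFF
    return total == rom[0x14D]

def snes_lorom_header_ok(rom):
    """Check that checksum and complement at $7FDC-$7FDF XOR to $FFFF."""
    if len(rom) < 0x8000:
        return False
    complement = int.from_bytes(rom[0x7FDC:0x7FDE], "little")
    checksum = int.from_bytes(rom[0x7FDE:0x7FE0], "little")
    return (checksum ^ complement) == 0xFFFF

def matching_platforms(rom):
    """Return every platform whose header check passes, not just the first."""
    checks = [("Game Boy", gb_header_ok), ("Super NES", snes_lorom_header_ok)]
    return [name for name, check in checks if check(rom)]
```

Returning a list leaves the "which app starts?" decision to the user or to an override when more than one platform matches.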

These specific examples motivated this:
  • Use of .bin in the Atari 2600 scene, the Mega Drive scene, and the early Game Boy Advance scene
  • Use of .iso and .bin/.cue by multiple disc-based consoles
  • Use of .spc by music in SPC700 save state format when it was already taken by Authenticode software publisher certificates
  • Use of .wad for Wii channel packaging when it was already taken by Id Software's Doom
  • No-Intro's use of .md instead of .gen for Mega Drive ROMs when .md was already taken by Markdown
  • Use of .deb for both Debian GNU/Linux application packaging and FCEUX debug files
  • Use of .gg for both Game Gear ROM images and lists of ROM patches for Game Genie
  • Disputes in the gbdev Discord server as to the appropriate extension for SM83 assembly language source code files
anikom15
Posts: 22
Joined: Mon Nov 30, 2020 2:41 am

Re: Determining a file's content type without extension

Post by anikom15 »

There are so many goddamn threads you can take out of this discussion. What is data? What is information? What do we keep? What do we discard?

How do you delineate formats? Is it based on physical media? If so, then that’s easy. Use ‘.cart’ for cartridges, ‘.tape’ for cassettes, etc. The way a cartridge’s data is laid out is going to be different enough from other types to make it recognizable.

What if it’s based on system? Then you have things like .nes and .gen, but what if a system has multiple formats or capabilities you want to keep distinct (e.g. Game Boy)?

Or it could be based on how the data is laid out, e.g. .iso, .wav, etc. It’s sort of organization by data, but has all sorts of potential issues coupled to it as well. You wouldn’t normally play Jetpac on your Walkman, but nothing is going to stop you. Is naming a tape file ‘.wav’ really a bad thing?

My opinion? Figure out how to determine what your data is without relying on its filename. Most data is obviously distinct enough for this to work, but it’s going to be easier if everyone is playing by the same rules. A file browser should be able to launch applications based on the content, not the filename. It should also be able to have overrides, both on a per-file and per-directory basis. Applications should always validate the data they are given before doing anything. It shouldn’t trust the filename. In fact, the application shouldn’t even need to know what the filename is at all.

When programmers think of data as bytes and not as ‘files’ and ‘objects’ the result is always better.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Determining a file's content type without extension

Post by tepples »

anikom15 wrote: Thu Dec 10, 2020 12:51 pm My opinion? Figure out how to determine what your data is without relying on its filename. Most data is obviously distinct enough for this to work, but it’s going to be easier if everyone is playing by the same rules. A file browser should be able to launch applications based on the content, not the filename.
Again, "how would an application go about registering rules to recognize a format" based on its content?
anikom15 wrote: Thu Dec 10, 2020 12:51 pm It should also be able to have overrides, both on a per-file and per-directory basis.
Does copying a file copy its overrides? And unfortunately, standard C and C++ offer no way to access the "alternate data streams" in which these overrides would be stored.
anikom15
Posts: 22
Joined: Mon Nov 30, 2020 2:41 am

Re: Determining a file's content type without extension

Post by anikom15 »

The filebrowser has to sniff the content to get a sense of its data. Most data has at least some kind of descriptor in its early bytes. The filebrowser can then determine the datatype, like PNG or MP4 with video content. The user gets to choose how to associate those programs. The user can also tell the filebrowser that it’s wrong and should consider a particular file or contents of a directory to be a certain type.

The application should do some verification before attempting to process the data, depending on the nature of the data.
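As a small illustration of sniffing the early bytes for the two types mentioned, assuming the caller has already read the first few bytes of the file (the function name sniff is hypothetical):

```python
def sniff(head):
    """Guess a type name from the first bytes of a file, or return None."""
    # PNG files begin with a fixed 8-byte signature.
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG image"
    # ISO base media files (MP4 and relatives) carry an 'ftyp' box type
    # at offset 4, after a 4-byte box size.
    if head[4:8] == b"ftyp":
        return "ISO base media (MP4 family)"
    return None
```

Telling whether an MP4 actually contains video would need parsing beyond these first bytes.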
Quietust
Posts: 1920
Joined: Sun Sep 19, 2004 10:59 pm

Re: Determining a file's content type without extension

Post by Quietust »

anikom15 wrote: Thu Dec 10, 2020 7:36 pm The filebrowser has to sniff the content to get a sense of its data. Most data has at least some kind of descriptor in its early bytes. The filebrowser can then determine the datatype, like PNG or MP4 with video content. The user gets to choose how to associate those programs. The user can also tell the filebrowser that it’s wrong and should consider a particular file or contents of a directory to be a certain type.
Are you suggesting that all file browsers should be required to independently implement their own filetype detection logic, as opposed to just asking the operating system to do it for them?

Also, some file types are indistinguishable from each other if you just look at the early bytes - for example, the "docx", "xlsx", "pptx", and "jar" file formats are just ZIP files with special files inside them, but you generally don't want to open them in your archive utility.
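A sketch of how a detector could look past the ZIP signature, assuming each of these formats can be recognized by a marker member inside the archive. The marker list is an assumption for illustration (a jar's manifest is conventional rather than guaranteed), and classify_zip is a hypothetical name.

```python
import io
import zipfile

# Marker members assumed here for each container-based format.
MARKERS = [
    ("word/document.xml", "docx"),
    ("xl/workbook.xml", "xlsx"),
    ("ppt/presentation.xml", "pptx"),
    ("META-INF/MANIFEST.MF", "jar"),
]

def classify_zip(data):
    """Return a more specific type for ZIP data, 'zip' if none, None if not ZIP."""
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return None
    with zipfile.ZipFile(stream) as archive:
        names = set(archive.namelist())
    for member, kind in MARKERS:
        if member in names:
            return kind
    return "zip"
```

Note that this needs the archive's directory, which sits at the end of the file, so looking at the early bytes alone is indeed not enough.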
Quietust, QMT Productions
P.S. If you don't get this note, let me know and I'll write you another.
tepples
Posts: 22708
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: Determining a file's content type without extension

Post by tepples »

anikom15 wrote: Thu Dec 10, 2020 7:36 pm The filebrowser has to sniff the content to get a sense of its data. Most data has at least some kind of descriptor in its early bytes.
It's a matter of having some way for applications on a system to tell file browsers on the same system what descriptors exist.
anikom15
Posts: 22
Joined: Mon Nov 30, 2020 2:41 am

Re: Determining a file's content type without extension

Post by anikom15 »

Quietust wrote: Thu Dec 10, 2020 7:43 pm
anikom15 wrote: Thu Dec 10, 2020 7:36 pm The filebrowser has to sniff the content to get a sense of its data. Most data has at least some kind of descriptor in its early bytes. The filebrowser can then determine the datatype, like PNG or MP4 with video content. The user gets to choose how to associate those programs. The user can also tell the filebrowser that it’s wrong and should consider a particular file or contents of a directory to be a certain type.
Are you suggesting that all file browsers should be required to independently implement their own filetype detection logic, as opposed to just asking the operating system to do it for them?

Also, some file types are indistinguishable from each other if you just look at the early bytes - for example, the "docx", "xlsx", "pptx", and "jar" file formats are just ZIP files with special files inside them, but you generally don't want to open them in your archive utility.
What do you mean by ‘asking the operating system’? The filebrowser is usually part of the operating system. What particular part of the system does the detection is inconsequential.
tepples wrote: It's a matter of having some way for applications on a system to tell file browsers on the same system what descriptors exist.
That’s just a matter of using a common database to store the information. How you implement it is a matter of opinion. The filebrowser would be configured to use the database. You could have some standardized verification as part of it.
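A toy sketch of such a common database, assuming applications register two kinds of facts: byte patterns that identify a type, and which applications can open that type. The class and method names are hypothetical, and nothing here corresponds to an existing OS facility.

```python
class TypeRegistry:
    """Toy shared database: magic rules map bytes to types, handlers map types to apps."""

    def __init__(self):
        self.rules = []     # (offset, magic, type_name)
        self.handlers = {}  # type_name -> list of application names

    def register_rule(self, offset, magic, type_name):
        """Record that `magic` at `offset` identifies `type_name`."""
        self.rules.append((offset, magic, type_name))

    def register_handler(self, type_name, app):
        """Record that `app` can open files of `type_name`."""
        self.handlers.setdefault(type_name, []).append(app)

    def detect(self, head):
        """Return the first registered type whose magic matches, or None."""
        for offset, magic, type_name in self.rules:
            if head[offset:offset + len(magic)] == magic:
                return type_name
        return None

    def apps_for(self, head):
        """Two steps: detect the type, then look up its registered apps."""
        return self.handlers.get(self.detect(head), [])
```

Whether this lives system-wide or per user, and who is allowed to write to it, is the part that stays a matter of opinion.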

UNIX already handles filetypes without extensions. It’s not completely new territory.

That said, an application really shouldn’t need to think of something as a file. It should just read and write data. The application should say ‘save this chunk of data somewhere’ and the OS takes care of the rest (including user interaction). What’s an example of an application that really needs to manage files and not just data, besides a filebrowser?
Fiskbit
Posts: 891
Joined: Sat Nov 18, 2017 9:15 pm

Re: Determining a file's content type without extension

Post by Fiskbit »

anikom15 wrote: A file browser should be able to launch applications based on the content, not the filename. It should also be able to have overrides, both on a per-file and per-directory basis. Applications should always validate the data they are given before doing anything. It shouldn’t trust the filename. In fact, the application shouldn’t even need to know what the filename is at all.
anikom15 wrote: That said, an application really shouldn’t need to think of something as a file. It should just read and write data. The application should say ‘save this chunk of data somewhere’ and the OS takes care of the rest (including user interaction).
I'm seeing a lot of assertions about what software should do, but you've made no argument as to why they should do any of this, which makes this conversation wholly uninteresting. Aside from this behavior being your preference, how is it actually better?
anikom15 wrote: When programmers think of data as bytes and not as ‘files’ and ‘objects’ the result is always better.
If it's always better, then you should have no trouble explaining why.
Pokun
Posts: 2681
Joined: Tue May 28, 2013 5:49 am
Location: Hokkaido, Japan

Re: Determining a file's content type without extension

Post by Pokun »

The OS should always have the option to let the user choose the program to launch the file, as you may want to use several programs for the same file.

I use .bin for files of binary data that I don't have a need to give a specific name to. Likewise .rom is a generic extension for a ROM image. I've set them to open in a hex editor by default, so it's unfortunate that some Sega and GBA systems use these extensions. I prefer to rename GBA ROMs to .gba and SMC, FIG etc to .sfc (especially if they really are just raw non-interleaved SFC ROM dumps without a copier header).

For assembly source files I generally use an extension such as .z80 so that the editor recognizes it and loads up the correct syntax highlighting. For SM83 I just use .z80 since they use the same syntax family. I've added the SM83-specific stuff to my Z80 syntax highlighting definition. For anything in the 65xx-family, I use .x65 (I stole that nomenclature from Nintendo), for 68000 I use .68k (though maybe .m68 or something is better if there is a reason to avoid starting on a number), for PIC I use .pic and so on. Otherwise I just use a generic extension like .asm or .s, if there is some reason I can't or don't want to give them a unique extension.


Recently I had problems with Windows not remembering, in the "Open with" list, which programs I have previously used to open a file. It remembers to open SFC files in Mesen-S and SNES9x, but not in bsnes, bsnes plus or No$SNS, and those require three extra clicks using the "Choose another app" dialogue. I searched the internet and it seems this is some bug in Windows.
anikom15
Posts: 22
Joined: Mon Nov 30, 2020 2:41 am

Re: Determining a file's content type without extension

Post by anikom15 »

Fiskbit wrote: Fri Dec 11, 2020 2:49 am
anikom15 wrote: A file browser should be able to launch applications based on the content, not the filename. It should also be able to have overrides, both on a per-file and per-directory basis. Applications should always validate the data they are given before doing anything. It shouldn’t trust the filename. In fact, the application shouldn’t even need to know what the filename is at all.
anikom15 wrote: That said, an application really shouldn’t need to think of something as a file. It should just read and write data. The application should say ‘save this chunk of data somewhere’ and the OS takes care of the rest (including user interaction).
I'm seeing a lot of assertions about what software should do, but you've made no argument as to why they should do any of this, which makes this conversation wholly uninteresting. Aside from this behavior being your preference, how is it actually better?
anikom15 wrote: When programmers think of data as bytes and not as ‘files’ and ‘objects’ the result is always better.
If it's always better, then you should have no trouble explaining why.
If the conversation is wholly uninteresting, you don't need to participate in it. These reasons are off-topic:

Identifying files without using the extension is a very old concept. On UNIX there is the file command. Mac OS had its own split-filesystem method for keeping track of filetypes. Other operating systems have come and gone and used different approaches. My biggest issues with extensions are security issues (.jpg.exe) and naming restrictions (naming something xxx.yyy.zzz can break some applications). Another problem is that files with the same extension can refer to different kinds of data, and files with different extensions can refer to the same kind of data.

A filebrowser having overrides is for a specific use-case where you may have application-specific files in a directory that you want to associate with that application, but another set of files somewhere that is the same type should be associated with something else. Consider a directory with a user's music that they listen to, and a directory of audio recordings that a user has made.
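Roughly, the override lookup could work like this sketch: a per-file override wins, then the nearest enclosing directory's override, then whatever content detection said. The function name, the dictionary layout, and the example type names are all hypothetical.

```python
from pathlib import PurePosixPath

def resolve_type(path, detected, file_overrides, dir_overrides):
    """Pick a type: per-file override, then nearest directory override, then detection."""
    p = PurePosixPath(path)
    if str(p) in file_overrides:
        return file_overrides[str(p)]
    for parent in p.parents:  # nearest directory first
        if str(parent) in dir_overrides:
            return dir_overrides[str(parent)]
    return detected
```

With an override on the recordings directory only, the same audio data in the music directory keeps its detected type and its usual player.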

Consider a process that translates data from one domain to another. The process takes some time T. If invalid data is inputted, then the output data is invalid and T is wasted. If the data can be validated in time much less than T, then validating the data before processing can prevent wasted time, unless the input data is absolutely known to be correct.

Consider a word processor. The processor can handle documents that contain pages, text, images, graphs, and other elements. The processor organizes these elements into sections and bundles them as a document. This document may be saved in many ways. It may be printed, or it may be stored on the computer, or on the internet, or on some network drive. The word processor doesn't need to know anything about how the data is stored to do its job. Perhaps the images need to be stored in a different location from the text. Perhaps the sections need to be stored as individual files and not the whole document. The word processor doesn't need to know about any of this to do its job. Some other parts of the system should handle this.

Why should other parts handle it? Because otherwise the author of the word processor has to implement it. This means that every software author has to write his own routines for handling files (or use a library). This is not only redundant, but can lead to errors and security risks. The application will also be easier to sandbox, as it won't depend on the structure of a certain filesystem. It's not an unprecedented idea either. Applications used to use their own routines for managing memory. Very few applications do this now.
Fiskbit
Posts: 891
Joined: Sat Nov 18, 2017 9:15 pm

Re: Determining a file's content type without extension

Post by Fiskbit »

Well, you started this conversation by calling the use of file extensions for program association 'stupid' and then kept making prescriptive statements about how things should be done, which isn't exactly useful. Explaining why lets us understand the problems with this system and the value of your solution, and thus has a chance of actually convincing someone you're right. Thank you for providing more information.

I just don't see much difference between a file's type being determined by its extension or by, say, a magic number (assuming it's universally present across the multitude of file types, which it's not). I agree that having a meaningful extension makes filename handling more complex and increases the possibility of the program mishandling it. I can't think of examples in which I've encountered this in practice, suggesting it hasn't happened recently or in a way that caused me any significant trouble, so I'm not convinced this is much of a real problem that people encounter in typical computer use. Extensions provide value in being a user-visible indication of the file's type (if I see a filename, I have a good idea of what the file is, even if I don't have the file), while file type otherwise has to be determined by the software used to view the file lists and by actually examining the data (which means you need to actually have some amount of its data). Depending on the complexity of the method used to examine a file's contents to determine type, that introduces risk that simply querying the file type can be exploited by an attacker. With extensions, a user or program could be tricked into opening a file of a type they didn't expect, but with file typing requiring inspection, a program could still be made to open a file that isn't or doesn't conform to the type it expects.

Having program association overrides sounds like a useful feature, but file extensions get us closer to that than inspection by allowing the same type to have different extensions for the different programs you may want to use it with. In fact, if a program allows opening of files with arbitrary extensions (programs I use tend to allow this), a user can even invent his own file extension for a type that already has an extension in order to associate it with a specific program.
Quietust
Posts: 1920
Joined: Sun Sep 19, 2004 10:59 pm

Re: Determining a file's content type without extension

Post by Quietust »

anikom15 wrote: Fri Dec 11, 2020 11:21 am My biggest issues with extensions are security issues (.jpg.exe)
The only reason that's a problem is because some people at Microsoft decided that file extensions were "confusing" and made Windows hide them by default.
anikom15 wrote: Fri Dec 11, 2020 11:21 am and naming restrictions (naming something xxx.yyy.zzz can break some applications).
I would be curious to see an application written in the last 20 years which cannot handle a filename containing multiple periods.
Pokun
Posts: 2681
Joined: Tue May 28, 2013 5:49 am
Location: Hokkaido, Japan

Re: Determining a file's content type without extension

Post by Pokun »

Quietust wrote: Sat Dec 12, 2020 7:23 am
anikom15 wrote: Fri Dec 11, 2020 11:21 am My biggest issues with extensions are security issues (.jpg.exe)
The only reason that's a problem is because some people at Microsoft decided that file extensions were "confusing" and made Windows hide them by default.
Indeed, and it causes new confusion when people don't realize this and try to rename a file extension.
zzo38
Posts: 1096
Joined: Mon Feb 07, 2011 12:46 pm

Re: Determining a file's content type without extension

Post by zzo38 »

I mostly just open files in the programs I want by specifying the program myself each time; I don't use a graphical file manager at all, since the command line is more useful. Often I want to use multiple programs with the same file anyway.

For determining MIME types, it would be helpful for a web browser to allow the user to override the MIME type of each file being loaded, and on the server side, you can have .htaccess or other configuration files to configure it. For a web browser with file: URLs, it may be helpful to allow the user to add a file (which the user may digitally sign if necessary, depending on how the user has configured it) to the directory containing the file, from which to figure out MIME types. For email attachments, let the user specify the type themselves (although it would be useful to have a default setting, in case the user does not know what to put).

However, there are ways to guess a file's content type without the extension (or in combination with the extension), in some cases. For example, a Game Boy ROM image will have the Nintendo logo at the correct offset, and the header checksum must also be correct. SQLite databases have a common header, and there is one header field (the "application ID") which can be used to distinguish different things that use it. Many file formats (including iNES format) have a header to identify them.
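Two of these checks are short enough to sketch, assuming the caller passes in the first 100 or so bytes of the file. The function names are hypothetical; the offsets are the documented SQLite header layout (magic string at offset 0, application ID at offset 68) and the 4-byte iNES signature.

```python
import struct

SQLITE_MAGIC = b"SQLite format 3\x00"

def sqlite_application_id(head):
    """Return the application ID from a SQLite header, or None if not SQLite."""
    if len(head) < 72 or not head.startswith(SQLITE_MAGIC):
        return None
    # The application ID is a 4-byte big-endian integer at offset 68.
    return struct.unpack_from(">I", head, 68)[0]

def is_ines(head):
    """iNES ROM images start with the bytes 'NES' followed by 0x1A."""
    return head.startswith(b"NES\x1a")
```

An application ID of 0 just means the creating program never set one, so it cannot distinguish anything by itself.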

Yes there are "polyglot" files, and this is one kind of thing where you may likely want to use multiple kinds of programs to open it. I do this myself too, in order to include compilation instructions in C programs (the file is both a C program and a shell script).

Many formats are based on others, too. Many formats use ZIP, and you may also want to just extract or list the files; many formats use SQLite, and you may wish to make SQL queries, so you will open them in SQLite, etc.

Sometimes a shebang line can be used. Some formats can already use it (such as text formats where # indicates a comment; some don't have this in general but do allow a shebang line as a special case; I think this is also possible with Game Boy if the lowest ROM addresses are unused by that program), while others don't work with such a thing.
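For the shebang case, a launcher could do something like this sketch, which only extracts the interpreter path from the first line and ignores any arguments after it. The function name is hypothetical.

```python
def shebang_interpreter(head):
    """Return the interpreter path named on a '#!' first line, or None."""
    if not head.startswith(b"#!"):
        return None
    # Take the rest of the first line, tolerating a space after '#!'.
    first_line = head[2:].split(b"\n", 1)[0].strip()
    if not first_line:
        return None
    return first_line.split()[0].decode("utf-8", "replace")
```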

I think what Amiga does is it has a separate icon file to indicate how the GUI deals with the file. I suppose that might help for someone who does use the GUI.
(Free Hero Mesh - FOSS puzzle game engine)