The environment variable isn't much better; both are akin to using a global variable in reentrant code, but at least STDDATA_FD is less likely to collide than fd 3.
Can't wait for scripts using this variable for something unrelated to break when they call my scripts.
This should be a parameter or argv[0]-based.
That doesn't work reliably either. No existing code scrubs STDDATA_FD from their environment variables, and there's no way to know if anyone uses STDDATA_FD in the wild. Why not just use a command line parameter like everyone else? Different isn't better in a situation like this.
This is a larger concern I've started to see in a certain class of younger developer, where existing conventions are just ignored without any attempt to understand why they exist. Things are only going to get worse as naive vibe coders start flinging more AI-generated garbage out into the world. I pity the poor folks trying to maintain these systems a couple of decades from now.
That's what I really meant by saying a parameter, it should be an option/flag that's given explicitly at invocation, or just a different program name.
Just go for `--json-output=filename` rather than playing games.
Why a filename? It doesn't need to know how to write files; that's what `>` is for. Do --output=json.
This is long overdue. PowerShell has long supported passing structured output (objects) via pipes and this is the closest attempt to approximate that without breaking the world.
I don't know, Nushell does a pretty good job.
https://www.nushell.sh/
It's a shame that stdX streams were never spec'd as sockets, with appropriate handling available in the various shells.
Also, file handle inheritance by default was such a big mistake.
Yeah, POSIX made choices that looked sane and even elegant at the time, but nowadays I think it is fair to say that they have not aged well. Like it's not just FDs getting inherited by default, almost everything gets inherited by default:
Working dir, env vars, uid/gid, socket handles, file descriptors, (some) file locks, message queues. AFAIK the only exception is the argv, everything else is inherited on fork or exec.
Sometimes this makes sense, but programmers always forget about this, resulting in security incidents. Eventually most programming languages gave up and updated their stdlibs to set CLOEXEC when opening files and sockets, knowing that it would break POSIX compatibility and API compatibility on their stdlibs. Python is one example: https://peps.python.org/pep-0446/
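For reference, in C the safer behavior has to be opted into per open(2) call; a minimal sketch of what that looks like (the filename and the spawned command are arbitrary):
    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* Without O_CLOEXEC this fd would silently leak into every exec'd child. */
        int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644);
        if (fd < 0) return 1;

        if (fork() == 0) {
            /* The exec'd program will not inherit fd, thanks to O_CLOEXEC. */
            execlp("ls", "ls", (char *)NULL);
            _exit(127);
        }
        close(fd);
        return 0;
    }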
The "inherit by default" behavior also makes it very difficult to evolve the shell interface. The nushell devs are looking for a reliable way to request JSON output/input on processes spawned by the shell (if supported by the program). Naively passing env vars or FDs to the process causes problems because if the process spawns any children of it's own, they too would also inherit those env vars or FDs.
Process inheritance was the best invention, because it models reality quite closely. You don't have new things just sitting in an empty universe all alone, initializing everything themselves from ... somewhere ... because everything around them was reset.
The environment (in a broader sense: not just environment variables, but also CWD, file handles, uid/gid, security context, namespaces) is there for a reason: to be used. If you don't want your child processes to read stdin in your place, don't give it to them. It's the parent process's responsibility to set up the environment for its children.
Although subprocesses were invented to do (some of) the parent's job: delegate the smaller steps and leave the details to them. For example, an HTTP server would read the (first) request line, then delegate the rest of the input to a subprocess (worker) depending on who is free, who handles which type of request, etc. That is the original idea behind inheritance, IMO.
> Okay, apparently the stddata addition is causing havoc (who knew how many scripts just haphazardly hand programs random file descriptors, that's surely not a problem.)
I knew, and I've known since reading the "C shell considered harmful" paper, which offhandedly mentioned that sh-based shells can use an arbitrary number of file descriptors (maybe they have to be one-digit integers though). csh can't, of course.
It's discussed in the first section here
https://harmful.cat-v.org/software/csh
This brings back memories - university, first Unix exposure, Sun Ray terminals, "tcsh" as the default shell, and me doing "find / -name ..." a lot.
I always wanted to ignore all the errors from this (there was a lot of "permission denied"), but tcsh just didn't have a simple way to do so. This taught me a valuable lesson about some software simply being better than other software. And to this day, I keep wondering why people would choose to use csh/tcsh voluntarily.
Tcsh originally was more user-friendly for interactive use. The rest is inertia.
Tangential but I was surprised to see that tree(1), at least the popular implementation, is made in Terre Haute (which is where I'm from). Maybe I should invite the author for lunch or something :)
I've never heard of stddata. What distro/environment provides it?
It's a local invention of TFA's, AFAIK. It's not "std".
stdout would be the canonical location for putting JSON output (and the "data" of a command, generally). Then things like `| jq` just work.
Nor have I; I think it is just what the developer of tree has chosen to call file descriptor 3, rather than being a wider convention or standard thing provided by the environment.
> As of version 2.0.0, in Linux, tree will attempt to automatically output a compact JSON tree on file descriptor 3 (what I call stddata,) if present
https://github.com/Old-Man-Programmer/tree/blob/d501b58ff9cb...
offtopic: why does the Copyright © icon shake like crazy at the bottom of the page?
Edit: Oh I guess it seems to be intentional, I clicked around and I like the rgbcube site map.
<copyright intensifies>
> who knew how many scripts just haphazardly hand programs random file descriptors, that's surely not a problem.
Oh for fuck's sake! Why are you using random file descriptors nobody told you about? Those open fds are there for a reason, thank you: I've put one end of an open pipe there specifically so I could notice when it gets closed.
If the user set up the environment of your application in a specific way, that means he wants your application to run in such an environment. If you were invoked with 10 non-standard file descriptors open and two injected threads — you'll have to live with it. Because, believe it or not, your application's purpose is to serve the user's goals. So don't break composability that the user relies on, please.
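For anyone unfamiliar with the trick being described: the parent keeps the read end of a pipe, lets the child inherit the write end, and polls for hangup, which fires once every copy of the write end is closed (i.e. the child, and anything it passed the fd to, is gone). A minimal sketch, with sleep standing in for the real child:
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int pfd[2];
        pipe(pfd);                     /* pfd[0] = read end, pfd[1] = write end */

        pid_t pid = fork();
        if (pid == 0) {
            close(pfd[0]);             /* child keeps only the (never-written) write end */
            execlp("sleep", "sleep", "2", (char *)NULL);
            _exit(127);
        }
        close(pfd[1]);                 /* parent keeps only the read end */

        struct pollfd p = { .fd = pfd[0], .events = POLLIN };
        poll(&p, 1, -1);               /* returns with POLLHUP once all write ends are closed */
        printf("child (and anything it handed the fd to) has exited\n");
        return 0;
    }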
This is the first I've heard of using an open pipe to poll for subprocess termination. Don't get me wrong, I don't hate it, but you could just as easily have a SIGCHLD handler write to your pipe (or do nothing, since poll(2) will fail with EINTR), and you don't have to worry about the subprocess closing the pipe or considering it some weird stddata fd like tree does here.
`SIGCHLD` is extremely unreliable in a lot of ways, `pidfd` is better (but Linux-specific), though it doesn't handle the case of wanting to be notified of all grandchildren's terminations after the direct child dies early.
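A sketch of the pidfd approach (Linux 5.3+; going through syscall() in case the libc lacks a pidfd_open wrapper):
    #define _GNU_SOURCE
    #include <poll.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            execlp("sleep", "sleep", "2", (char *)NULL);
            _exit(127);
        }

        /* pidfd_open(2) returns a pollable fd that reports POLLIN when the process exits. */
        int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
        struct pollfd p = { .fd = pidfd, .events = POLLIN };
        poll(&p, 1, -1);

        waitpid(pid, NULL, 0);   /* still need to reap the child to avoid a zombie */
        printf("child exited\n");
        return 0;
    }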
Can somebody explain what's going on here? It seems I'm missing some important piece of background info. Why don't they just add a -J flag for everyone who wants JSON output? Oh, wait, tree already has a -J flag to output JSON. So WTF are they doing here?
I am especially confused by this:
> Surely, nothing will happen if I just assume that the existence of a specific file descriptor implies something, as nobody is crazy or stupid enough to hardcode such a thing?
Wait, what? But "you" (tree authors) just hardcoded such a thing. Do "you" have some special permission to do this nonsense?
If only there was a variant of execve() / posix_spawn() that simply took a literal array of which file descriptors would need to be present in the new process. So that you can say:
    int subprocess_stdin = open("/dev/null", O_RDONLY);
    int subprocess_stdout = open("some_output", O_WRONLY);
    int subprocess_stderr = STDERR_FILENO; // Let the subprocess use the same stderr as me.
    int subprocess_fds[] = {subprocess_stdin, subprocess_stdout, subprocess_stderr};
    posix_spawn_with_fds("my process", [...], subprocess_fds, 3);
Never understood why POSIX makes all of this so hard.
It's something trivial to write (~20 lines of code); there is no point for the standard library to provide that kind of function, in my opinion.
After the fork() (or clone, on Linux) you do a for loop that closes every FD except the ones you want to keep. On Linux there is a close_range system call to close a range of FDs in one call.
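Roughly something like this (close_range(2) needs Linux 5.9+ and glibc 2.34+; the exec'd command is just a placeholder):
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: keep only fds 0-2, close everything else before exec.
               Portable fallback: loop close(fd) up to sysconf(_SC_OPEN_MAX). */
            close_range(3, ~0U, 0);
            execlp("ls", "ls", "-l", (char *)NULL);
            _exit(127);   /* only reached if exec fails */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }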
POSIX is an API designed to be a small layer on the operating system, and designed to make as few assumptions as possible about the underlying system. This is the reason why POSIX is nowadays implemented even on low-resource embedded devices and similar stuff.
At a higher level it's possible to use higher-level abstractions to manipulate processes (e.g. a C++ library that does all of the above with a modern interface).
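Incidentally, posix_spawn's file actions already get fairly close to the interface wished for upthread; a minimal sketch (program path and output filename made up). Note that any other inherited fds not marked O_CLOEXEC still leak through:
    #include <fcntl.h>
    #include <spawn.h>
    #include <unistd.h>

    extern char **environ;

    int main(void) {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);
        /* stdin from /dev/null, stdout to a file; stderr is left alone and inherited. */
        posix_spawn_file_actions_addopen(&fa, STDIN_FILENO, "/dev/null", O_RDONLY, 0);
        posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "some_output",
                                         O_WRONLY | O_CREAT | O_TRUNC, 0644);

        pid_t pid;
        char *argv[] = {"my_process", NULL};
        posix_spawn(&pid, "/usr/bin/my_process", &fa, NULL, argv, environ);
        posix_spawn_file_actions_destroy(&fa);
        return 0;
    }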
… what POSIX API gets you the open FDs? (Or even just the maximum open FD, and we'll just cause a bunch of errors closing non-existent FDs.)
That's `sysconf(_SC_OPEN_MAX)`, but it is always a bug to close FDs you don't know the origin of. You should be specifying `O_CLOEXEC` by default if you want FDs closed automatically.
That won't return the maximum open file descriptor. You could perhaps use that value in lieu of the maximum open file descriptor and loop through a crap ton more FDs than even my previous post implied, I suppose, and this is getting less efficient and more terribly engineered with each comment, which I think proves the point…
> but it is always an bug to close FDs you don't know the origin of.
And I would agree. I'm replying to the poster above me, who is staking the claim that POSIX permits closing all open file descriptors other than a desired set.
So, I suppose it can, at a cost of a few thousand syscalls that'll all be pointless…
It is always a bug to call `close_range` since you never know if a parent process has deliberately left a file descriptor open for some kind of tracing. If the parent does not want this, it must use `O_CLOEXEC`. Maybe if you clear the entire environment you'll be fine?
That said, it is trivial to write a loop that takes a set of known old and new fd numbers (including e.g. swapping) and produces a set of calls to `dup2` and `fcntl` to give them the new numbers, while correctly leaving all other open fds open.
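A sketch of one way to write that loop: first park every source fd above the highest destination with fcntl(F_DUPFD) so swaps can't clobber each other, then dup2 each one into place (fd numbers and the fixed-size array are just for illustration):
    #include <fcntl.h>
    #include <unistd.h>

    /* Move the fd in src[i] onto the number dst[i] for every i, handling overlaps
       such as swapping 3 and 4. The originals stay open at their old numbers too;
       close them afterwards if that's not wanted. */
    static int remap_fds(const int *src, const int *dst, int n) {
        int max_dst = -1;
        for (int i = 0; i < n; i++)
            if (dst[i] > max_dst) max_dst = dst[i];

        int tmp[16];                                      /* assume n <= 16 for this sketch */
        for (int i = 0; i < n; i++) {
            tmp[i] = fcntl(src[i], F_DUPFD, max_dst + 1); /* park out of the target range */
            if (tmp[i] < 0) return -1;
        }
        for (int i = 0; i < n; i++) {
            if (dup2(tmp[i], dst[i]) < 0) return -1;      /* land on the requested number */
            close(tmp[i]);
        }
        return 0;
    }
For example, remap_fds((int[]){pipe_r, log_fd}, (int[]){0, 1}, 2) would put a pipe on stdin and a log file on stdout before exec, without touching any other open fds.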
> Never understood why POSIX makes all of this so hard
I honestly can't say in this particular instance, but my (unpopular?) instinct in such a situation is always to assume there is a good reason and I just haven't understood it yet. It may have become irrelevant in the meantime, but I can't know until I understand, and it's served me well to give the patriarchs the benefit of the doubt in such cases.
It's not hard, just a bit too long:
For this the key would be to eliminate serialization and deserialization between steps in the pipeline.
I wouldn't have said this is anything new.
FreeBSD has libxo[0] integrated into some of its tools:
[0] https://github.com/Juniper/libxo
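For the curious, the libxo model is that the program describes its fields once and the caller picks the encoding at run time via --libxo; a minimal sketch from memory (check the libxo docs for the exact field syntax):
    #include <libxo/xo.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        /* xo_parse_args strips and handles the --libxo option (text/json/xml/html). */
        argc = xo_parse_args(argc, argv);
        if (argc < 0)
            exit(EXIT_FAILURE);

        /* One emit call; the output encoding is whatever the caller asked for. */
        xo_emit("{:name/%s} has {:size/%d} bytes\n", "example.txt", 1234);

        xo_finish();
        return 0;
    }
Run with --libxo json, the same call produces a JSON object with those fields instead of the plain text line.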
Except they went with a --libxo command-line option, which is extremely unlikely to cause any problems in existing scripts.
Every time I see the output of nushell I get so disappointed; they got the formatting so wrong, and all the extra delimiters make it hard to actually read the data. PowerShell got it right, using alignment. If you look at virtually all shell programs until the last few years, you'll see similar, alignment-based output. Only recently, with the rise of ligature abuse, did we start seeing this kind of incomprehensible blob surrounding our text.
The author states they're using nushell's `markdown` table style because of issues with their font rendering certain characters. `rounded` is the default and indeed, `markdown` looks truly horrible in comparison.
Nushell's front page [1] shows an example of rounded, and here's an example of an even further customized version [2].
I think these are very readable. There is alignment too, but it's "local" alignment to cells in the same sub-table, not "global" to the entire table -- this is good for fitting more stuff into your terminal width without wrapping.
A supporting font is required though, yes.
[1]: https://www.nushell.sh/
[2]: https://i.imgur.com/U4MnYLe.png
The nushell front page is exactly what I was referring to. Compare the legibility of the ls command on the front page to a regular ls command; it's insane how much more cluttered the nushell version is.
Why not use a saner protocol than JSON, e.g. CBOR?
Is CBOR as popularly supported as JSON?
Also, to answer your question with a guess, I would suppose it’s because they wanted to use JSON and they wrote the feature.
I do a lot of very low-level programming with awful performance-maintenance trade-offs. Here's a great trick for a "binary" JSON: remove all of the extra whitespace, normalize your numbers, and then LZ4 the resulting string.
UTF-8 is already a great wire format.
I've never found a "binary JSON" that's significantly better than this; I mean you can beat it, but you need awkward encodings (prefix indices & other weird shit). You end up burning nearly a byte for any particularly clever integer encoding.
Most data structures are just nested arrays of integers. If you need an integer keyed OBJECT you're SOL, but I just play fiddly games with astral plane UTF-8 characters. (Yeah yeah yeah ad hoc encodings are nasty news.)
If you've got a BUTT LOAD of data just fire up a compressing SQLite DB like a normal human.
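A sketch of that pipeline, assuming liblz4 is available (link with -llz4); the minification step is elided and the sample string is already whitespace-free:
    #include <stdio.h>
    #include <string.h>
    #include <lz4.h>

    int main(void) {
        /* Already-minified JSON: no extra whitespace, numbers normalized. */
        const char *json = "{\"ids\":[1,2,3,4],\"name\":\"example\",\"ok\":true}";
        int src_len = (int)strlen(json);

        char compressed[256];
        int c_len = LZ4_compress_default(json, compressed, src_len, (int)sizeof compressed);
        printf("json: %d bytes, lz4: %d bytes\n", src_len, c_len);

        /* Round-trip to show nothing is lost; a payload this small won't actually
           shrink, the win shows up on real, repetitive data. */
        char back[256];
        int d_len = LZ4_decompress_safe(compressed, back, c_len, (int)sizeof back);
        printf("round-trip ok: %d\n", d_len == src_len && memcmp(json, back, src_len) == 0);
        return 0;
    }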
If you're interested in performance, what about all the number conversion (to decimals, presumably) that is incurred with JSON?
If I'm interested in performance I'll build my data out of offset handles and lay everything down into a block and mmap() it around. That's parsing free, up to an htons() — but that's only a worst case scenario. Everything else is about not inventing something custom & being able to use easily vendored high-trust 3rd party tools. (In this case: a JSON library, LZ4, and/or SQLite.)
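A toy version of the offset-handle layout (made-up record format, error handling dropped): "pointers" are byte offsets from the start of the mapping, so reading the file back is just mmap plus pointer arithmetic, with no parse step. Endianness and padding are assumed to match between writer and reader, which is where the htons()-style fixups mentioned above would come in:
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct header { uint32_t count; uint32_t records_off; };  /* offsets, not pointers */
    struct record { uint32_t id; uint32_t name_off; };        /* name_off points into the blob */

    int main(void) {
        /* Write a tiny blob: header, one record, then the string data it references. */
        int fd = open("data.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
        struct header h = { 1, sizeof h };
        struct record r = { 42, sizeof h + sizeof r };
        write(fd, &h, sizeof h);
        write(fd, &r, sizeof r);
        write(fd, "hello", 6);

        /* Map it back and chase the offsets; there is no deserialization step. */
        struct stat st;
        fstat(fd, &st);
        const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        const struct header *hp = (const struct header *)base;
        const struct record *rp = (const struct record *)(base + hp->records_off);
        printf("id=%u name=%s\n", rp->id, base + rp->name_off);

        munmap((void *)base, st.st_size);
        close(fd);
        return 0;
    }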
Do you have a CBOR implementation that you like? Ideally one with decent schema support? I was looking into CBOR as a replacement for Protobufs for an embedded system I work on and it's got a lot going for it but every implementation I looked at seemed to support a very different subset of the schema spec and it was brutal to try to find a pair of libraries (C for the embedded side, C++ for the host side) that could actually share a set of schema files.
Sorry, but this is going to be very dangerous because much code will close unwanted FDs then open others. It's 50 years too late to add this convention.
Instead maybe we need new system calls that return dups of a hidden stddata FD or create/replace it.
The company I work for is guilty of abusing 3. We use it for debug output of user-supplied scripts that are meant to implement monitoring / metrics :'(
This is the first time I've heard about stddata, though. Is this a thing that's going into a standard? Is there one already? Or is it just a name someone gave to it and it's not a real thing?