Programming
file unix text-files newline
Updated Wed, 10 Aug 2022 05:48:59 GMT

Why should text files end with a newline?


I assume everyone here is familiar with the adage that all text files should end with a newline. I've known of this "rule" for years, but I've always wondered: why?




Solution

Because that's how the POSIX standard defines a line:

3.206 Line
A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

Therefore, lines not ending in a newline character aren't considered actual lines. That's why some programs have problems processing the last line of a file if it isn't newline terminated.
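To see the practical effect (a minimal sketch; the file name and contents are made up), bash's read builtin reports end-of-file when it runs out of input without finding a newline, so a naive read loop silently drops the final, unterminated line:

$ printf 'first\nsecond' > demo.txt    # "second" has no terminating newline
$ while IFS= read -r line; do echo "got: $line"; done < demo.txt
got: first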

There's at least one hard advantage to this guideline when working on a terminal emulator: All Unix tools expect this convention and work with it. For instance, when concatenating files with cat, a file terminated by newline will have a different effect than one without:

$ more a.txt
foo
$ more b.txt
bar$ more c.txt
baz
$ cat {a,b,c}.txt
foo
barbaz

And, as the previous example also demonstrates, when displaying the file on the command line (e.g. via more), a newline-terminated file results in a correct display. An improperly terminated file might be garbled (second line).
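If you want to reproduce the example, printf is a convenient way to create the files, since it only emits a newline where you write one (a sketch; the contents simply mirror the listing above):

$ printf 'foo\n' > a.txt    # newline-terminated
$ printf 'bar' > b.txt      # no trailing newline
$ printf 'baz\n' > c.txt    # newline-terminated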

For consistency's sake, it's very helpful to follow this rule; doing otherwise will incur extra work when dealing with the default Unix tools.
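One everyday instance of that extra work (again with made-up file names): wc -l counts newline characters, not visual lines, so a file whose last line lacks the terminating newline appears one line shorter than it looks:

$ printf 'one\ntwo\nthree' > unterminated.txt    # three lines of text, only two newlines
$ wc -l unterminated.txt
2 unterminated.txt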


Think about it differently: if lines aren't terminated by a newline, making commands such as cat useful is much harder. How do you design a command to concatenate files such that

  1. it puts each file's start on a new line, which is what you want 95% of the time; but
  2. it allows merging the last and first line of two files, as in the example above between b.txt and c.txt?

Of course this is solvable, but you need to make the usage of cat more complex (by adding positional command-line arguments, e.g. cat a.txt --no-newline b.txt c.txt), and now the command, rather than each individual file, controls how it is pasted together with other files. This is almost certainly not convenient.

Or you need to introduce a special sentinel character to mark a line that is supposed to be continued rather than terminated. Well, now you're stuck with the same situation as on POSIX, except inverted (a line continuation character rather than a line termination character).


Now, on non-POSIX-compliant systems (nowadays that's mostly Windows), the point is moot: files don't generally end with a newline, and the (informal) definition of a line might for instance be "text that is separated by newlines" (note: separated, not terminated). This is entirely valid. However, for structured data (e.g. programming code) it makes parsing minimally more complicated: it generally means that parsers have to be rewritten. If a parser was originally written with the POSIX definition in mind, it might be easier to modify the token stream rather than the parser; in other words, add an artificial newline token to the end of the input.
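On a POSIX system the same normalization can happen in the shell before the parser ever sees the input: awk re-terminates every record with a newline, including an unterminated last one, so piping the input through awk 1 is often the only "rewriting" required. A sketch with made-up file names (the wc -l formatting assumes GNU coreutils):

$ printf 'last line, no newline' > broken.txt
$ awk 1 broken.txt > fixed.txt    # awk prints each record followed by ORS, i.e. "\n"
$ wc -l broken.txt fixed.txt
0 broken.txt
1 fixed.txt
1 total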





Comments (5)

  • +0 – Although now quite impractical to rectify, clearly POSIX made a mistake when defining the line -- as evidenced by the number of questions regarding this issue. A line should have been defined as zero or more characters terminated by <eol>, <eof>, or <eol><eof>. Parser complexity is not a valid concern. Complexity, wherever possible, should be moved from the programmer's head and into the library. — Dec 06, 2018 at 18:11
  • +0 – @DougCoburn This answer used to have an exhaustive, technical discussion explaining why this is wrong, and why POSIX did the right thing. Unfortunately, these comments were apparently recently deleted by an overzealous moderator. Briefly, it's not about parsing complexity; rather, your definition makes it much harder to author tools such as cat in a way that's both useful and consistent. — Dec 06, 2018 at 18:22
  • +0 – @Leon The POSIX rule is all about reducing edge cases. And it does so beautifully. I'm actually somewhat at a loss as to how people fail to understand this: it's the simplest possible, self-consistent definition of a line. — Feb 12, 2019 at 11:30
  • +0 – @BT I think you're assuming that my example of a more convenient workflow is the reason behind the decision. It's not; it's just a consequence. The reason is that the POSIX rule is the rule that's simplest, and which makes handling lines in a parser the easiest. The only reason we're even having this debate is that Windows does it differently, and that, as a consequence, there are numerous tools which fail on POSIX files. If everybody did POSIX, there wouldn't be any problem. Yet people complain about POSIX, not about Windows. — Feb 12, 2019 at 11:32
  • +0 – @BT I'm only referring to Windows to point out the cases where POSIX rules don't make sense (in other words, I was throwing you a bone). I'm more than happy never to mention it in this discussion again. But then your claim makes even less sense: on POSIX platforms it simply makes no sense to discuss text files with different line ending conventions, because there's no reason to produce them. What's the advantage? There is literally none. In summary, I really don't understand the hatred this answer (or the POSIX rule) is engendering. To be frank, it's completely irrational. — Feb 14, 2019 at 10:33

