Why are flat text files the state of the art for representing source code?

Sure - the preprocessor and compiler need to see a flat file representation of the file, but that's easily created.

It seems to me that some form of XML or binary data could represent lots of ideas that are very difficult to track, otherwise.

For instance, you could embed UML diagrams right into your code. They could be generated semi-automatically, and annotated by the developers to highlight important aspects of the design. Interaction diagrams in particular. Heck, embedding any user drawing might make things more clear.

Another idea is to embed comments from code reviews right into the code.

There could be all sorts of aids to make merging multiple branches easier.

Something I'm passionate about is not just tracking code coverage, but also looking at the parts of code covered by an automated test. The hard part is keeping track of that code, even as the source is modified. For instance, moving a function from one file to another, etc. This can be done with GUIDs, but they're rather intrusive to embed right in the text file. In a rich file format, they could be automatic and unobtrusive.

So why are there no IDEs (to my knowledge, anyway) which allow you to work with code in this way?

EDIT: On October 7th, 2009.

Most of you got very hung up on the word "binary" in my question. I retract it. Picture XML, very minimally marking up your code. The instant before you hand it to your normal preprocessor or compiler, you strip out all of the XML markup, and pass on just the source code. In this form, you could still do all of the normal things to the file: diff, merge, edit, work with it in a simple and minimal editor, feed it into thousands of tools. Yes, diffing, merging, and editing directly with the minimal XML markup does get a tad more complicated. But I think the value could be enormous.

If an IDE existed which respected all of the XML, you could add so much more than what we can do today.

For instance, your DOxygen comments could actually look like the final DOxygen output.

When someone wanted to do a code review, like Code Collaborator, they could mark up the source code, in place.

The XML could even be hidden behind comments.

// <comment author="mcruikshank" date="2009-10-07">
// Please refactor to Delegate.
// </comment>

And then if you want to use vi or emacs, you can just skip over the comments.

If I want to use a state-of-the-art editor, I can see that in about a dozen different helpful ways.
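Here is a minimal sketch, in Python, of how a tool might separate the comment-hidden markup from the code. The `<comment>` element and its `author` and `date` attributes come from the example above; the function name `extract_review_comments` and the rule that the opening and closing tags each sit on their own `//` line are assumptions made for illustration, not an existing tool.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = '''\
int total = 0;
// <comment author="mcruikshank" date="2009-10-07">
// Please refactor to Delegate.
// </comment>
total += compute();
'''

def extract_review_comments(source):
    """Split source into plain code lines and parsed review annotations."""
    plain, annotations, buffer = [], [], None
    for line in source.splitlines():
        is_comment = line.lstrip().startswith('//')
        # Drop the leading "//" so the remainder can be read as XML.
        text = re.sub(r'^\s*//\s?', '', line) if is_comment else None
        if buffer is None and is_comment and text.startswith('<comment'):
            buffer = [text]                      # annotation starts
        elif buffer is not None and is_comment:
            buffer.append(text)
            if text.startswith('</comment>'):    # annotation ends
                element = ET.fromstring('\n'.join(buffer))
                annotations.append({
                    'author': element.get('author'),
                    'date': element.get('date'),
                    'text': (element.text or '').strip(),
                })
                buffer = None
        else:
            plain.append(line)                   # ordinary code or comment
    return plain, annotations

plain_lines, review_notes = extract_review_comments(SOURCE)
```

The compiler would see only `plain_lines`, while a richer editor could render `review_notes` however it likes. Ordinary `//` comments pass through untouched.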

So, that's my rough idea. It's not "building blocks" of pictures that you drag on the screen... I'm not that nuts. :)

+7  A: 

A colleague of mine wrote this blog about this subject

Lou Franco
The text layout of this blog gave me a headache for some reason.
Martin Cote
I don't know if it's because you planted the idea in my head, but me too.
It's the background color; it makes the font look really weird. I don't know, it's just really weird.
He fixed the font -- looks normal now.
Lou Franco
+124  A: 
  • you can diff them
  • you can merge them
  • anyone can edit them
  • they are simple and easy to deal with
  • they are universally accessible to thousands of tools
No one owns the format either.
Jeff Yates
"they are universally accessible to thousands of tools" - bingo. Chicken and egg problem, really.
It's all about the source control, baby!
Richard Morgan
You can just as easily diff tree nodes in memory as you can lines in a text file. Though the biggest advantage would be you could more easily ignore formatting and comments in the diffs. But yeah, chicken and egg problem still.
This doesn't answer the question at all. A tree data-structure could just as easily be serialized just before diffing, merging, or interacting with other tools.
Jonathan Tran
And they are robust -- how many times have binary file formats been broken due to upgrades or other glitches in programs that parse them? Number of times vi has damaged my plain text LaTeX document: 0. Number of times Word has corrupted the binary document structure: $\infty$.
+12  A: 

Why are essays written in text? Why are legal documents written in text? Why are fantasy novels written in text? Because text is the single best form - for people - of persisting their thoughts.

Text is how people think about, represent, understand, and persist concepts - and their complexities, hierarchies, and interrelationships.

That's not true at all. Some things are best described using diagrams. Think of a flow chart: you can create a textual representation of it, but it would be much easier to understand as a diagram.
I think the question is about flat files and not just about the usage of text.
nickf, I disagree; pseudo-code is generally a clearer way to show an algorithm than a flowchart. Things like for loops can't be directly represented in a flowchart so when you read it you have to work out which bits are really a loop etc.
Mark Baker
Most research papers are published in PS or PDF, not ASCII. You're confusing representation with presentation. The diagrams in rich formats are very helpful - not to mention the nice presentation of equations.
Matt Cruikshank
+21  A: 

In my opinion, any possible benefits are outweighed by being tied to a particular tool.

With plain-text source (that seems to be what you're discussing, rather than flat files per se) I can paste chunks into an email, use simple version control systems (very important!), write code into comments on Stack Overflow, use one of a thousand text editors on any number of platforms, etc.

With some binary representation of code, I need to use a specialized editor to view or edit it. Even if a text-based representation can be produced, you can't trivially roll back changes into the canonical version.

+12  A: 

Smalltalk is an image-based environment. You are no longer working with code in a file on disk. You are working with and modifying the real objects in runtime. It still is text but classes are not stored in human readable files. Instead the whole object memory (the image) is stored on a file in binary format.

But the biggest complaint of those trying out Smalltalk is that it doesn't use files. Most of the file-based tools that we have (vim, emacs, Eclipse, Unix tools) will have to be abandoned in favor of Smalltalk's own tools. Not that the tools provided in Smalltalk are inferior. They are just different.

Is Smalltalk interpreted realtime?
Jon Limjap
Squeak smalltalk is interpreted. Other smalltalks are compiled to bytecode
Are you sure Squeak is interpreted ? Running `(Number>>#asInteger) inspect` opens an inspector on a `CompiledMethod`, where you can see the bytecode.
Sébastien RoccaSerra
"" lists "interpreted" as one of the features. But now I'm curious as to what that actually means. Compiled into bytecode but interpreted by the VM? I know VisualWorks also compiled to bytecode but it has a JITter. Not sure about Squeak though.
Some "interpreted" languages (CPython for one) are actually compiled to a very high-level bytecode:
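For what it's worth, CPython makes this easy to check for yourself. A small sketch using the standard `dis` module; the function `add_one` is just a made-up example.

```python
import dis

def add_one(value):
    return value + 1

# The compiled form is stored on the function object as raw bytes.
raw_bytecode = add_one.__code__.co_code

# dis decodes those bytes into named instructions for the interpreter loop.
instructions = [instruction.opname for instruction in dis.get_instructions(add_one)]

dis.dis(add_one)  # prints a human-readable listing
```

The exact instruction names vary between CPython versions, so treat the listing as illustrative rather than stable.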
+5  A: 

Ironically there ARE programming constructs that use precisely what you describe.

For example, SQL Server Integration Services, which involve coding logic flow by dragging components into a visual design surface, are saved as XML files describing precisely that back end.

On the other hand SSIS is pretty difficult to source-control. It is also fairly difficult to design any sort of complex logic into it: if you need a little bit more "control", you'll need to code VB.NET code into the component, which brings us back to where we started.

I guess that, as a coder, you should consider the fact that for every solution to a problem there are consequences that follow. Not everything could (and some argue, should) be represented in UML. Not everything could be visually represented. Not everything could be simplified enough to have a consistent binary file representation.

That being said, I would posit that the disadvantages of relegating code to binary formats (most of which will also tend to be proprietary) far outweigh the advantages of having them in plain text.

Jon Limjap
Fine, so let me drop <img src="UMLDiagram.png"> into my comments, and make an IDE smart enough to show me that image. It's still just text. There's no magic. It's just that the IDE is too stupid to do it right now.
Matt Cruikshank
Or better yet, let me embed things like Google Chart API objects, all with clever text comments, and have the IDE resolve it and display it. For cripes sake, the IDE should at least understand DOxygen comments, and be as smart and navigable as the DOxygen output!!!
Matt Cruikshank
Roll your own DOxygen parser, perhaps? What IDEs are we talking about?
Jon Limjap
Visual Studio for me.
Matt Cruikshank
Wait for VS10 then write it yourself :)
Simon Buchan
+3  A: 

IMHO, XML and binary formats would be a total mess and wouldn't give any significant benefit.

OTOH, a related idea would be to write into a database, maybe one function per record, or maybe a hierarchical structure. An IDE created around this concept could make navigating source more natural, and easier to hide anything not relevant to the code you're reading at a given moment.
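A rough sketch of the one-function-per-record idea, using Python's standard `sqlite3` and `ast` modules. The table name `functions`, its two columns, and the helper names are assumptions for illustration; a real tool would need to handle classes, imports, and ordering as well.

```python
import ast
import sqlite3

SOURCE = '''\
def area(width, height):
    return width * height

def perimeter(width, height):
    return 2 * (width + height)
'''

def store_functions(connection, source):
    """Store each top-level function of `source` as one database row."""
    connection.execute(
        'CREATE TABLE IF NOT EXISTS functions (name TEXT PRIMARY KEY, body TEXT)'
    )
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            body = ast.get_source_segment(source, node)
            connection.execute(
                'INSERT OR REPLACE INTO functions VALUES (?, ?)', (node.name, body)
            )
    connection.commit()

def load_function(connection, name):
    """Fetch the text of a single function, or None if it is not stored."""
    row = connection.execute(
        'SELECT body FROM functions WHERE name = ?', (name,)
    ).fetchone()
    return row[0] if row else None

def export_flat_file(connection):
    """Rebuild a flat file for the compiler or interpreter."""
    rows = connection.execute('SELECT body FROM functions ORDER BY name').fetchall()
    return '\n\n'.join(body for (body,) in rows) + '\n'

connection = sqlite3.connect(':memory:')
store_functions(connection, SOURCE)
```

An IDE could call `load_function` to show only the function being read, and `export_flat_file` just before the build.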


The code of your program defines the structure that would be created with XML or the binary format. Your programming language is a more direct representation of your program's structure than an XML or binary representation would be. Have you ever noticed how Word misbehaves on you as you give structure to your document? WordPerfect at least would 'reveal codes' to allow you to see what lay beneath your document. Flat files do the same thing for your program.

You're missing the point entirely. Picture your source code. Now imagine that you can embed a picture in your comments. That's the only difference I'm talking about. Even as an <img src="UMLDiagram.png"> in the comments. I'm NOT talking about converting { into <scope> or something like that.
Matt Cruikshank
+4  A: 

It's a good question. FWIW, I'd love to see a Wiki-style code management tool. Each functional unit would have its own wiki page. The build tools pull together the source code out of the wiki. There would be a "discuss" page linked to that page, where people can argue about algorithms, APIs and such like.

Heck, it wouldn't be that hard to hack one up from a pre-existing Wiki implementation. Any takers...?

See, the neat thing is, because code is stored in plain text, it won't be hard to write a wiki system which does this. If the code was XML in some predefined structure, this would be at best much harder, and more likely, not possible at all.
Orion Edwards
+1  A: 

You mention that we should use "some form of XML"? What do you think XHTML and XAML are?

Also XML is still just a flat file.

Chris Pietschmann

Neat ideas. I have myself wondered on a smaller scale ... much smaller, why can't IDE X generate this or that.

I don't know if I am capable enough as a programmer yet to develop something as cool and complex as what you're talking about or what I am thinking about, but I would be interested in trying.

Maybe start out with some plugins for .NET, Eclipse, Netbeans, and so on? Show off what can be done, and start a new trend in coding.

+9  A: 

Lisp programs are not flat files. They are serializations of data structures. This code-as-data is an old idea, and actually one of the greatest ideas in computer science.

I guess these ideas hadn't caught up yet. :)
I'm a day-to-day Lisper, and I have to disagree with you to an extent. Lisp source code contains a lot of information that's not preserved when parsed -- reader macros and comments, for example. You could also say that C source code is a serialization of a C compiler's parse tree.
I have to agree with Rich -- all computer languages provide some sort of depth in a parse tree. Lisp's greatness is the clarity of this concept and the ability of the programs to work on their own parse tree.
+5  A: 

Here's why:

  • Human readable. That makes it a lot easier to spot a mistake, in both the file and the parsing method. It can also be read out loud. That's one thing you just cannot get with XML, and it might make a difference, especially in customer support.

  • Insurance against obsolescence. As long as regex exist, it is possible to write a pretty good parser in just a few lines of code.

  • Leverage. Almost everything there is, from revision control systems to editors to filters, can inspect, merge and operate on flat files. Merging XML can be a mess.

  • Ability to integrate them rather easily with UNIX tools, such as grep, cut or sed.
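
To illustrate the "few lines of code" claim above, here is a small sketch of a regex-based parser for a simple flat format. The `key = value` layout with `#` comments and the name `parse_flat_file` are assumptions chosen for the example, not a format mentioned in this thread.

```python
import re

# One "key = value" pair per line; keys may contain dots.
LINE = re.compile(r'^\s*(?P<key>[A-Za-z_][\w.]*)\s*=\s*(?P<value>.*?)\s*$')

def parse_flat_file(text):
    """Parse `key = value` lines, skipping blank lines and `#` comments."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith('#'):
            continue
        match = LINE.match(line)
        if match:
            settings[match.group('key')] = match.group('value')
    return settings

EXAMPLE = '# build settings\nname = demo\n\nopt.level=2\n  path = /usr/local/bin  \n'
settings = parse_flat_file(EXAMPLE)
```

This only works for line-oriented formats, of course. As the comment below notes, a language like C++ cannot be parsed this way.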

Edu Felipe
There are languages which are not parsable this way (C++).
+4  A: 

People have tried for a long time to create an editing environment that goes beyond the flat file and everyone has failed to some extent. The closest I've seen was a prototype for Charles Simonyi's Intentional Programming but then that got downgraded to a visual DSL creation tool.

No matter how the code is stored or represented in memory, in the end it has to be presentable and modifiable as text (without the formatting changing on you) since that's the easiest way we know to express most of the abstract concepts that are needed for solving problems by programming.

With flat files you get this for free and any plain old text editor (with the correct character encoding support) will work.

Mark Cidade

I think another aspect of this is that the code is what is important. It is what is going to be executed. For example, in your UML example, I would think that having UML (presumably created in some editor, not directly related to the "code") included in your "source blob" would be almost useless. Much better would be to have the UML generated directly from your code, so it describes the exact state the code is in as a tool for understanding the code, rather than as a reminder of what the code should have been.

We've been doing this for years with automated doc tools. While the actual programmer-generated comments in the code might get out of sync with the code, tools like JavaDoc and the like faithfully represent the methods on an object, return types, arguments, etc. They represent them as they actually exist, not as some artifact that came out of endless design meetings.

It seems to me that if you could arbitrarily add random artifacts to some "source blob", these would likely be out of date and less than useful right away. If you can generate such artifacts directly from the code, then the small effort to get your build process to do so is vastly better than the previously mentioned pitfalls of moving away from plain text source files.

Related to this, an explanation of why you'd want to use a plain-text UML tool (UMLGraph) seems to apply nearly equally as well to why you want plain-text source files.

+1  A: 

Old habits die hard, I guess.

Until recently, there weren't many good-quality, high-performing, widely-available libraries for general storage of structured data. And I would emphatically not put XML in that category even today--too verbose, too intensive to process, too finicky.

Nowadays, my favorite thing to use for data that doesn't need to be human-readable is SQLite: just make a database. It's so incredibly easy to embed a full-featured SQL database into any app... there are bindings for C, Perl, Python, PHP, etc... and it's open-source and really fast and reliable and lightweight.

I <3 SQLite.

I don't really think the medium programming tools work in is a matter of 'habit'. :)
+6  A: 

<?xml version="1.0" encoding="UTF-8"?><code>Flat files are easier to read.</code></xml>

Mark Stock
+3  A: 

Steve McConnell has it right, as always - you write programs for other programmers (including yourself), not for computers.

That said, Microsoft Visual Studio must internally manage the code you write in a very structured format, or you wouldn't be able to do such things as "Find All References" or rename or re-factor variables and methods so readily. I'd be interested if anyone had links to how this works.
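
I don't know how Visual Studio does it internally, but the general idea can be sketched in Python with the standard `ast` module: parse the text into a tree and search the tree instead of the characters. The function name `find_references` and the sample source are made up for illustration, and this toy version ignores scoping and function parameters.

```python
import ast

SOURCE = '''\
def scale(value, factor):
    return value * factor

result = scale(3, 2)
print(scale(result, 10))
'''

def find_references(source, name):
    """Return sorted (line, column) pairs where `name` is defined or used."""
    positions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id == name:
            positions.append((node.lineno, node.col_offset))
        elif isinstance(node, ast.FunctionDef) and node.name == name:
            # col_offset points at the `def` keyword, not at the name itself.
            positions.append((node.lineno, node.col_offset))
    return sorted(positions)

scale_references = find_references(SOURCE, 'scale')
```

Unlike a plain text search, this would not report a match for `scale` inside a string literal or a comment, because those never become `Name` nodes.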

David Grigg
+1  A: 

The trend we are seeing with DSLs is the first thing that comes to mind when reading your question. The problem has been that there is no 1-to-1 relationship between models (like UML) and an implementation. Microsoft, among others, is working on getting there, so that you can create your app as something UML-like and then generate code from it. And the important thing: as you opt to change your code, the model will reflect the change.

Windows Workflow Foundation is a pretty good example. Of course there are flat files and/or XML in the background, but you usually end up defining your business logic in the orchestration tool. And that is pretty cool!

We need more of the "software factories" thinking, and will see a richer IDE experience in the future, but as long as computers run on zeroes and ones, flat text files can and (probably) will always be an intermediate stage. As stated by several people already, simple text files are very flexible.

+3  A: 

Actually, roughly 10 years ago, Charles Simonyi's early prototype for intentional programming attempted to move beyond the flat file into a tree representation of code that can be visualized in different ways. Theoretically, a domain expert, a PM, and a software engineer could all see (and piece together) application code in ways that were useful to them, and products could be built on a hierarchy of declarative "intentions", digging down to low-level code only as needed.

ETA (per request in the questions) There's a copy of one of his early papers on this at the Microsoft research web site. Unfortunately, since Simonyi left MS to start a separate company several years ago, I don't think the prototype is still available for download. I saw some demos back when I was at Microsoft, but I'm not sure how widely his early prototype was distributed.

His company, IntentSoft, is still a little quiet about what they're planning to deliver to the market, if anything, but some of the early stuff that came out of MSR was pretty interesting.

The storage model was some binary format, but I'm not sure how much of those details were disclosed during the MSR project, and I'm sure some things have changed since the early implementations.

Can you point out site(s) with more info on what you describe in your first paragraph. I would very much appreciate it. Thanks.

It's pretty obvious why plain text is king. But it is equally obvious why a structured format would be even better.

Just one example: If you rename a method, your diff/merge/source control tool would be able to tell that only one thing had changed. The tools we use today would show a long list of changes, one for every place and file that the method was called or declared.
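
To make the rename example concrete, here is a sketch that compares a line-based diff with a token-based check. It uses Python's standard `difflib` and `tokenize` modules; the helper names and the sample `fetch`/`download` rename are assumptions for illustration, and the rename detector is deliberately naive (it only handles files that differ by nothing but identifier names).

```python
import difflib
import io
import tokenize

BEFORE = '''\
def fetch(url):
    return url

a = fetch("x")
b = fetch("y")
'''
AFTER = BEFORE.replace('fetch', 'download')

def changed_line_count(before, after):
    """Count the added and removed lines a line-based diff would show."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line[:2] in ('- ', '+ '))

def tokens(source):
    """Return (type, text) pairs, ignoring positions and layout."""
    reader = io.StringIO(source).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(reader)]

def detect_renames(before, after):
    """Return {old: new} if the files differ only by consistent renames, else None."""
    old_tokens, new_tokens = tokens(before), tokens(after)
    if len(old_tokens) != len(new_tokens):
        return None
    renames = {}
    for (old_type, old), (new_type, new) in zip(old_tokens, new_tokens):
        if (old_type, old) == (new_type, new):
            continue
        if old_type != tokenize.NAME or new_type != tokenize.NAME:
            return None                      # something other than a name changed
        if renames.setdefault(old, new) != new:
            return None                      # same name renamed two different ways
    return renames

line_changes = changed_line_count(BEFORE, AFTER)
rename_map = detect_renames(BEFORE, AFTER)
```

For this sample, the line-based count grows with every call site, while the token-based check reports a single `fetch` to `download` mapping.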

(By the way, this post doesn't answer the question as you might have noticed)

Arne Evertsson

This might not exactly answer your question, but here is an editor that allows a higher-level view of code:


I've wistfully wondered the same thing, as described in the answer to: What tool/application/whatever do you wish existed?

While it's easy to imagine a great number of benefits I think the biggest hurdle that would have to be addressed is that no-one has produced a viable alternative.

When people think of alternatives to storing source as text they seem to often immediately think in terms of graphical representations (I'm referring here to the commercial products that have been available - eg. HP-vee). And if we look at the experience of people like the FPGA designers, we see that programming (exclusively) graphically just doesn't work - hence languages like Verilog and VHDL.

But I don't see that the storage of source necessarily needs to be bound to the method of writing it in the first place. Entry of source can be largely done as text, which means that copying and pasting would still work. But I also see that by allowing merges and rollbacks to be done on the basis of tokenised meta-source we could achieve more accurate and more powerful manipulation tools.

Andrew Edgecombe
+1  A: 

For an example of a language that does away with traditional text-programming, see the Lava Language.

Another nifty thing I just recently discovered is subtext2 (video demo).

David Schmitt

I think the reason text files are used in development is that they are universal across various development tools. You can look inside or even fix some errors using a simple text editor (you can't do it in a binary file because you never know how a fix might destroy other data). It doesn't mean, however, that text files are best for all those purposes.

Of course, you can diff and merge them. But it doesn't mean that the diff/merge tool understands the distinct structure of the data encoded by this text file. You can do the diff/merge, but (especially with XML files) the diff tool won't show you the differences correctly; that is, it will show you where the files differ and which parts of the data the tool "thinks" are the same. But it will not show you the differences in the structure of the XML file. It will just match lines that look the same.

Regardless of whether we're using binary files or text files, it's always better for the diff/merge tools to work on the data structure the file represents rather than on lines and characters. For C++ or Java files, for example, they could report that some identifier changed its name or that some section was surrounded with an additional if(){}, but ignore changes in indentation or EOL characters. The best approach would be to read a file into internal structures and dump it using specific format rules. This way the diffing is done on the internal structures, and the merge result is generated from the merged internal structure.
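
A minimal sketch of comparing "internal structures" rather than lines. Python and its standard `ast` module stand in here for the C++ or Java parser the answer has in mind; the function name `same_structure` and the three sample snippets are made up for illustration.

```python
import ast

def same_structure(left, right):
    """True if two Python sources parse to the same syntax tree."""
    # ast.dump omits line and column numbers by default, so layout,
    # blank lines, and comments do not affect the comparison.
    return ast.dump(ast.parse(left)) == ast.dump(ast.parse(right))

ORIGINAL = '''\
def total(items):
    # add everything up
    return sum(items)
'''
REFORMATTED = '''\
def total( items ):

        return sum( items )
'''
WRAPPED = '''\
def total(items):
    if items:
        return sum(items)
'''

only_formatting_changed = same_structure(ORIGINAL, REFORMATTED)
logic_changed = not same_structure(ORIGINAL, WRAPPED)
```

A real structural diff would go further and report what changed (for example, that the return statement was wrapped in an `if`), but even this equality check is enough to filter out pure whitespace and comment edits.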

+1  A: 

Why do text files rule? Because of McIlroy's test. It is vital to have the output of one program be acceptable as the source code for another, and text files are the simplest thing that works.

Michael Dorfman
You've just successfully argued databases out of existence. Congratulations.
Matt Cruikshank
Thanks, but credit goes to McIlroy. Naturally, "the simplest thing that works" gets more complicated if one needs atomicity of transactions, two-phase commit, etc., on source code. Do you know of any programming languages that use databases as the primary way to represent source code?
Michael Dorfman

Flat files rock.

+1  A: 

Anyone ever tried Mathematica?

The pic above is from an old version but it was the best Google could give me. Compare the first equation there to Math.Integrate(1/(Math.Pow("x",3)-1), "x"), which is what you would have to write if you were coding with plain text in most common languages. IMO the mathematical representation is much easier to read, and that is still a pretty small equation.

And yes, you can both input and copy-paste the code as plain text if you want.

See it as the next generation of syntax highlighting. I bet there is a lot of other stuff besides math that could benefit from this kind of representation.


Modern programs are composed of flat pieces, but are they flat? There are usings, and includes, and libraries of objects, etc. An ordinary function call is a peek into a different place. The logic isn't flat, due to having multiple threads, etc.

+3  A: 

Labview and Simulink are two graphical programming environments. They are both popular in their fields (interfacing to hardware from a PC, and modeling control systems, respectively), but not used much outside of those fields. I've worked with people who were big fans of both, but never got into them myself.


I have the same vision! I really wish this existed.

You might want to take a look at Fortress, a research language by Sun. It has special support for formulas in source code. The quote below is from Wikipedia

Fortress is being designed from the outset to have multiple syntactic stylesheets. Source code can be rendered as ASCII text, in Unicode, or as a prettied image. This will allow for support of mathematical symbols and other symbols in the rendered output for easier reading.

The major reason for the persistence of text as source is the lack of power tools, e.g. version control, for non-text data. This is based on my experience working with Smalltalk, where plain byte-code is kept in a core dump all the time. In a non-text system, with today's tools, team development is a nightmare.

+1  A: 

Visual FoxPro uses dbf table structures to store code and metadata for forms, reports, class libs, etc. These are binary files. It also stores code in prg files that are actual text files...

The only advantage I see is being able to use the built-in VFP data language to do code searches on those files... other than that it is a liability IMO. At least once every few months, one of these files will become corrupted for no apparent reason. Integration with source control and diffs is very painful as well. There are workarounds for this, but they involve converting the file to text temporarily!

Brian Vander Plaats

Who works with flat files?

Eclipse gives you views into your source so that you can see inner classes, methods and data, all sorted and grouped. If I want to edit the inner class I click on it. While technically there is a flat file underlying it all, I almost never navigate it like that.


One thing not touched on is that some languages have the concept of a source file builtin with respect to things like variable scoping. Changing to something else (like storing functions in a database) would require you to alter the language itself.


While having a drink tonight with my friends (programmers too), one of them told me that they use UML to generate code. But he said that they still need to manually edit the generated code; there are some problem domains that can't be easily described with UML.

With all the LINQ-goodness, lambda and all, some problem domains cannot be represented by UML, we still need to make our way around the generated code for the computer to do our bidding.

How could we represent in UML, let alone XML, the following problem?

The number of answers to that simple problem is very telling: UML, SQL (the most important assembly language, whatever those ORM guys tell you), and XML are not an XOR proposition. We will still use combinations of these technologies, not just one of them to the exclusion of the others.

Michael Buen

It's still flat files because maybe that's how they can sell software tools :D

Source code should itself be object oriented, that is, encapsulated as a member. There is only one product I know of that does so; it has existed for a very long time (since Windows 3.0) and was designed by Paul Allen himself. It was originally inspired by HyperCard on the Mac, but as Bill Gates put it:

"It's generations beyond HyperCard," says Gates.

Unfortunately they didn't target the right people:

"In pursuing (interests of) software developers," says Alsop, "Asymetrix may have made ToolBook too complex for the little guy."

They should have targeted professional programmers instead of hobbyists.

Even today, at the concept level, it's still beyond other languages, except Rebol of course ;)

Rebol Tutorial