There are a few threads about the relevance of the Cobol programming language on this forum, e.g. this thread links to a collection of them. What I am interested in here is a frequently repeated claim based on a study by Gartner from 1997: that there were around 200 billion lines of Cobol code in active use at that time!

I would like to ask some questions to verify or falsify a couple of related points. My goal is to understand if this statement has any truth to it or if it is totally unrealistic.

I apologize in advance for being a little verbose in presenting my line of thought and my own opinion on the things I am not sure about, but I think it might help to put things in context and thus highlight any wrong assumptions and conclusions I have made.


Sometimes, the "200 billion lines" number is accompanied by the added claim that this corresponded to 80% of all programming code in any language in active use. Other times, the 80% refers merely to so-called "business code" (or some other vague phrase hinting that the reader is not to count mainstream software, embedded systems or anything else where Cobol is practically non-existent). In the following I assume that the code does not include double-counting of multiple installations of the same software (since that would be cheating!).

In particular around the time of the y2k problem, it was often noted that a lot of Cobol code was already 20 to 30 years old. That would mean it was written in the late '60s and the '70s. At that time, the market leader was IBM with the IBM/370 mainframe. IBM has put up a historical announcement on its website quoting prices, configurations and availability. According to that sheet, prices were about one million dollars for machines with up to half a megabyte of memory.

Question 1: How many mainframes have actually been sold?

I have not found any numbers for those times; the latest numbers are for the year 2000, again by Gartner. :^(

I would guess that the actual number sold per year was in the hundreds or low thousands; if the market size was $50 billion in 2000 and the market grew exponentially like any other technology, it might have been merely a few billion back in 1970. Since the IBM/370 was sold for twenty years, twenty times a few thousand per year results in a few tens of thousands of machines (and that is pretty optimistic)!

Question 2: How large were the programs in lines of code?

I don't know how many bytes of machine code result from one line of source code on that architecture. But since the IBM/370 was a 32-bit machine, any memory access must have used 4 bytes for the address plus the instruction itself (2, maybe 3 bytes?). If you factor in the operating system and the program's data, how many lines of code would have fit into a main memory of half a megabyte?

Question 3: Was there no standard software?

Did every single machine sold run a unique hand-coded system without any standard software? Seriously, even if every machine was programmed from scratch without any reuse of legacy code (wait ... doesn't that contradict one of the claims we started from to begin with???) we might have O(50,000 l.o.c./machine) * O(20,000 machines) = O(1,000,000,000 l.o.c.).

That is still far, far, far away from 200 billion! Am I missing something obvious here?
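For what it's worth, here is my back-of-envelope arithmetic as a tiny Python snippet (every input is my own order-of-magnitude guess from Questions 1-3, not a verified figure):

```python
# All inputs are guesses from Questions 1-3 above, not verified figures.
machines = 20_000            # optimistic guess at IBM/370-class machines sold
loc_per_machine = 50_000     # hand-written code per machine, order of magnitude

total_loc = machines * loc_per_machine
print(f"estimated total: {total_loc:,} l.o.c.")                   # 1,000,000,000
print(f"shortfall vs. Gartner: {200_000_000_000 // total_loc}x")  # 200x
```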

Question 4: How many programmers did we need to write 200 billion lines of code?

I am really not sure about this one, but if we take an average of 10 l.o.c. per programmer per day, we would need 55 million man-years to achieve this! In a time-frame of 20 to 30 years, that would mean there must have been two to three million programmers constantly writing, testing, debugging and documenting code. That would be about as many programmers as there are in China today, wouldn't it?
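Spelled out as arithmetic (assuming 365 coding days per year, which is of course absurdly generous):

```python
# Reproducing the estimate above; 365 coding days/year is deliberately generous.
total_loc = 200_000_000_000
loc_per_day = 10

man_years = total_loc / loc_per_day / 365
print(f"{man_years / 1e6:.0f} million man-years")   # ~55 million

# spread over a 20- or 30-year time-frame:
for span in (20, 30):
    print(f"over {span} years: {man_years / span / 1e6:.1f} million programmers")
```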

EDIT: Several people have brought up automatic templating systems/code generators and the like. Could somebody elaborate on this? I have two issues with that: a) I need to tell the system what it is supposed to do for me; for that I need to communicate with the computer, and the computer will output the code. This is exactly what a compiler of a programming language does. So essentially I am using a different high-level programming language to generate my Cobol code. Shouldn't I work with that other high-level language instead of Cobol? Why the middle-man? b) In the 70s and 80s the most precious commodity was memory. So if I have a programming language output something, it had better be concise. Would I really use my hypothetical meta-language to generate verbose and repetitive Cobol code, rather than bytecode/p-code like other compilers of that time did? END OF EDIT

Question 5: What about the competition?

So far, I have come up with two things here:

1) IBM had their own programming language, PL/I. Above I have assumed that the majority of code was written exclusively in Cobol. However, all other things being equal, I wonder whether IBM marketing would really have pushed its own development off the market in favor of Cobol on its own machines. Was there really no relevant code base of PL/I?

2) Sometimes (also on this board, in the thread quoted above) I come across the claim that the "200 billion lines of code" are simply invisible to anybody outside of "governments, banks ..." (and whatnot). Actually, the DoD funded its own language in order to increase cost effectiveness and reduce the proliferation of programming languages. This led to Ada. Would they really have worried about having so many different programming languages if they had predominantly used Cobol? If there was any language running on "government and military" systems outside the perception of mainstream computing, wouldn't that language be Ada?

I hope someone can point out any flaws in my assumptions and/or conclusions and shed some light on whether the above claim has any truth to it or not.

A: 

With regards to #4: how much of that could have been machine-generated code? I don't know if template-based code was used a lot with Cobol, but I see a lot of it used now for all sorts of things. If my application has thousands of LOC that were machine generated, that doesn't mean much. The last code-generating script I wrote took 20 minutes to write, 10 minutes to format the input, 2 minutes to run, then an hour to execute a suite of automatic tests to verify it actually worked, but the code it generated would have taken several days to do by hand (or the time between the morning meeting and lunch, doin' it my way ;) ). Ok I admit it's not always that easy and there is often manual tweaking involved, but I still don't think the LOC metric has much meaning if code-generators are in heavy use.

Maybe that's how they generated so much code in so little time.

FrustratedWithFormsDesigner
Wasn't programming done with punch cards back then? I really doubt you would write a program that spits out punch cards which spit out more punch cards ... ;^) On the other hand, Lisp has had that capability and it was certainly around back then ...
user8472
Ok, so with punch-cards, maybe they weren't doing much code generation. But once they got past that, then they could - I just don't know if they *did*! Although one of the guys here said something about code-generating tools for Cobol in use in the early/mid-1980's, so maybe they did start then (if I understood him correctly)...
FrustratedWithFormsDesigner
That sounds a little vague. "Code-generating tools" could generically refer to anything, including compilers, cross-compilers to other languages and what have you. It would be helpful if there were any references to corroborate this. Substantial and systematic templating sounds quite innovative (and absolutely not conservative and reliability-oriented) and I doubt that Cobol programmers really went that road in the '80s. It has certainly not been done for Fortran! And Fortran has been (and still is) a very lively and active language in its niche!
user8472
Punch cards were an *input device* more than a storage medium. Modern code generation tools don't output "keystrokes to the editor", and code generation at the time would not have attempted to output punch cards. Files were stored on disk or on tape.
dmckee
@dmckee actually, punch cards were used as a storage medium. I knew one academic whose office was next to impossible to get into because of the mountains of punched cards containing his research data.
anon
LOC is a BS metric as there are lots of ways to count it. I'm not a COBOL dev at all, so I may be talking out of my butt here: if you count LOC as lines of machine code or even assembly instead of development-language code, you will always get lots more code as a result (orders of magnitude in some languages). LOC is a BS metric in the sense that you can use it to support any viewpoint you want, because there are so many ways to measure it. Just decompile a C# app to MSIL and see how much intermediate code you get for one line of C# code. And then that gets turned into machine code, inflating it again.
Jim Leonardo
Most likely 99 to 99.9% of the new code is generated. Most cobol programmers are doing maintenance. That should eliminate more lines of code than it generates :)
Stephan Eggermont
A: 

Well, you're asking in the wrong place here. This forum is dominated by .NET programmers, with a significant Java minority and such an age distribution that only a very small minority has any Cobol experience.

The CASE tool market consisted for a large part of Cobol code generators. Most tools were write-only, not two-way. That ensures there are a lot of lines of code. This market came somewhat later than the 70s, so the volume of Cobol code grew fast in the 80s and 90s.

A lot of Cobol development is done by people having no significant internet access and therefore no visibility. There is no need for it. Cobol programmers are used to having in-house programming courses and paper documentation (lots of it).

[edit] Generating cobol source made a lot of sense. Cobol is very verbose and low level. The various cobol implementations are all slightly different, so configuring the code generator eliminates a lot of potential errors.

Stephan Eggermont
Based on this and other answers, I would say SO is exactly the right place for this question.
BlueRaja - Danny Pflughoeft
The Wikipedia entry on CASE_tool talks about a graphical text editor that coined the term in 1982. This does not really sound like the core market that Cobol targeted. Are you still saying that most Cobol code is machine-generated? If I program my machine to generate code (which, essentially, is what a compiler does) would I not have a different language instead of Cobol that I program, instead? And why would I want to generate code in such a ridiculously verbose language as Cobol instead of bytecode at a time when storage space was so precious?
user8472
@BlueRaja: I'm not speaking from real Cobol experience.
Stephan Eggermont
+1  A: 

I would never defend those clowns at Gartner, but still:

Your ideas about IBM/370s are wrong. The 370 was an architecture, not a specific machine - it was implemented on everything from water-cooled monsters to small, mini-computer-sized machines (the same size as a VAX). The number sold was thus probably far larger, by orders of magnitude, than you suspect.

Also, COBOL was heavily used on DEC's VAX lineup, and before that on the DEC-10 and DEC-20 lines. In the UK it was used on all ICL mainframes. Lots of other platforms also supported it.

anon
Can you corroborate the statement on the number of units sold with a reference? I have pointed out that I did not find actual numbers, but based on the market size estimate from 2000 I have a hard time believing that the market for machines that cost more than $1 million was much larger in the 70s and 80s. "Heavily used" is also not quite specific - the program PLUS data will still have to fit in memory! I don't find it obvious how such a gigantic code base could have been written and deployed when the IT market was still in its infancy and technology just couldn't store that much!
user8472
@user You underestimate the amount of bloat caused by GUIs. The DEC-10 at the Poly I worked for can only have had a maximum of 1 MByte of memory (I can't remember if it actually did have that much) but routinely supported around 70 simultaneous users, all connected by 1200 baud terminals. Almost all COBOL code is non-GUI. And no, I have no actual figures on the number of units sold, but that same cheapskate Poly (no money in education) owned two 370 processors.
anon
+5  A: 

On the surface, the numbers Gartner produces are akin to answering the question: how many angels can dance on the head of a pin? Unless you obtain a full copy of their report (costing big bucks) you will never know how they came up with or justified the 200 billion lines of COBOL claim. Having said that, Gartner is a well-respected information technology research and advisory firm, so I would think they would not have made such a claim without justification or explanation.

It is amazing how this study has been quoted over the years. A Google search for "200 billion lines of COBOL" got me about 19,500 hits. I sampled a bunch of them and most attribute the number directly to the 1997 Gartner report. Clearly, this figure has captured the attention of many.

The method that you have taken to "debunk" the claim has a few problems:

1) How many mainframes have been sold? This is a big question in and of itself, probably just as difficult to answer as the 200 billion lines of code question. But more importantly, I don't see how determining the number of mainframes could be used to constrain the number of lines of code running on them.

2) How large were the programs in lines of code? COBOL programs tend to be large. A modest program can run to a few thousand lines, a big one into tens of thousands. One of the jokes COBOL programmers often make is that only one COBOL program has ever been written, the rest are just modified copies of it. As with many jokes there is a grain of truth in it. Most shops have a large program inventory and a lot of those programs were built by cutting and pasting from each other. This really "fluffs" up the size of your code base.

Your assumption that a program must fit into physical memory in order to run is wrong. The size problem was solved in several different ways (e.g. program overlays, virtual memory etc.). It was not unusual in the 60's and 70's to run large programs on machines with tiny physical memories.

3) Was there no standard software? A lot of COBOL is written from scratch or from templates. A number of financial packages were developed by software houses in the 70's and 80's. Most of these were distributed as source code libraries. The customer then copied and modified the source to fit their particular business requirements. More "fluffing" of the code base - particularly given that large segments of that code were logically unexecutable once the package had been "configured".

4) How many programmers did we need to write 200 billion lines of code? Not as many as you might think! Given that COBOL tends to be verbose and highly replicated, a programmer can have huge "productivity". Program-generating systems were in vogue during the 70's and early 80's. I once worked with a product (now defunct, fortunately) that let me write "business logic" and generated all of the "boiler plate" code around it - producing a fully functional COBOL program. The code it generated was, to be polite, verbose in the extreme. I could produce a 15K-line COBOL program from about 200 lines of input! We are talking serious fluff here!
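To make the mechanism concrete, here is a toy sketch (Python, with a completely made-up spec format - the real products were proprietary and varied) of how one-line field descriptions would blossom into boilerplate COBOL:

```python
# Hypothetical mini "program generator": the spec format and the emitted
# boilerplate are invented for illustration only.
spec = ["CUSTOMER-NAME", "ACCOUNT-BALANCE", "LAST-PAYMENT-DATE"]

def generate(fields):
    lines = ["IDENTIFICATION DIVISION.", "PROGRAM-ID. GENERATED."]
    for f in fields:
        # each one-line field description expands into several lines of
        # MOVE / validation / error-handling boilerplate
        lines += [f"* ---- field {f} ----",
                  f"    MOVE SPACES TO WS-{f}.",
                  f"    IF WS-{f} = SPACES",
                  f"        PERFORM 9000-ERROR-{f}",
                  f"    END-IF."]
    return lines

generated = generate(spec)
print(f"{len(spec)} spec lines -> {len(generated)} generated lines")
```

This toy only manages about a 6:1 expansion; the product I described managed 75:1 (200 lines in, 15K out), but the principle is the same.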

5) What about the competition? COBOL has never really had much serious competition in the financial and insurance sectors. PL/1 was a major IBM initiative to produce the one programming language that met every possible computing need. Like all such initiatives it was too ambitious and has pretty much collapsed inward. I believe IBM still uses and supports it today. During the 70's several other languages were predicted to replace COBOL - ALGOL, Ada and C come to mind - but none have. Today I hear the same said of Java and .NET. I think the major reason COBOL is still with us is that it does what it is supposed to do very well, and the huge multi-billion-line legacy makes moving to a "better" language both expensive and risky from a business point of view.

Do I believe the 200 billion lines of code story? I find the number high but not impossibly high given the number of ways COBOL code tends to "fluff" itself up over time.

I also think that getting too involved in analyzing these numbers quickly degrades into a "counting angels" exercise - something people can get really wound up over but has no significance in the real world.

EDIT

Let me address a few of the comments left below...

Never seen a COBOL program used by an investment bank. Quite possible. Depends which application areas you were working in. Investment banks tend to have large computing infrastructures and employ a wide range of technologies. The shop I have been working in for the past 20 years (although not an investment bank) is one of the largest in the country and it has a significant COBOL investment. It also has significant Java, C and C++ investments as well as pockets of just about every other technology known to man. I have also met some fairly senior applications developers here that were completely unaware that COBOL was still in use. I did a rough line count through our source control system and found around 70 million lines of production COBOL. Quite a few people that have worked here for years are completely oblivious to it!

I am also aware that COBOL is rapidly declining as a language of choice, but the fact is, there is still a lot of it around today. In 1997, the period to which this question relates, I believe COBOL would have been the dominant language in terms of LOC. The point of the question is: Could there have been 200 billion lines of it in 1997?

Counting mainframes. Even if one were able to obtain the number of mainframes it would be difficult to assess the "compute" power they represented. Mainframes, like most other computers, come in a wide range of configurations and processing capacity. If we could say there were exactly "X" mainframes in use in 1997, you still need to estimate the processing capacity they represented, then you need to figure out what percentage of the work load was due to running COBOL programs and a bunch of other confounding factors. I just don't see how this line of reasoning would ever yield an answer with an acceptable margin of error.

Multi-counting lines of code. That was exactly my point when referring to the COBOL "fluff" factor. Counting lines of COBOL can be a very misleading statistic simply because a significant amount of it was never written by programmers in the first place. Or if it was, quite a bit of it was done using the cut-paste-tinker "design pattern".

Your observation that memory was a valuable commodity in 1997 and prior is true. One would think that this would have led to using the most efficient coding techniques and languages available to maximize its use. However, there are other factors: the opportunity cost of having an application backlog was often perceived to outweigh the cost of bringing in more memory/cpu to deal with less-than-optimal code (which could be cranked out quite a bit faster). This thinking was further reinforced by the observation that Moore's Law leads to ever-declining hardware costs whereas software development costs remain constant. The "logical" conclusion was to crank out less-than-optimal code, suffer for a while, then reap the benefits in the years to come (IMO, a lesson in bad planning and greed, still common today).

The push to deliver applications during the 70's through 90's led to the rise of a host of code generators (today I see frameworks of various flavours fulfilling this role). Many of these code generators emitted tons of COBOL code. Why emit COBOL code? Why not emit assembler or p-code or something much more efficient? I believe the answer is one of risk mitigation. Most code generators are proprietary pieces of software owned by some third party who may or may not be in business or supporting their product 10 years from now. It is a very hard sell if you can't provide an iron-clad guarantee that the generated application can be supported into the distant future. The solution is to have the "generator" produce something familiar - COBOL for example! Even if the "generator" dies, the resulting application can be maintained by your existing staff of COBOL programmers. Problem solved ;) (today we see open source used as a risk mitigation argument).

Returning to the LOC question. Counting lines of COBOL code is, in my opinion, open to a wide margin of error or at least misinterpretation. Here are a few statistics from an application I work on (quoted approximately). This application was built with, and is maintained using, Basset Frame Technology (a frame-based code generator), so we don't actually write COBOL but generate it.

  • Lines of COBOL: 7,000,000
    • Non-Procedure Division: 3,000,000
    • Procedure Division: 3,500,000
    • Comment/blank : 500,000
    • Non-expanded COPY directives: 40,000
  • COBOL verbs: 2,000,000
  • Programmer written procedure Division: 900,000
  • Application frame generated: 270,000
  • Corporate infrastructure frame generated: 2,330,000

As you can see, almost half of our COBOL programs are non-procedure Division code (data declaration and the like). The ratio of LOC to Verbs (statement count) is about 7:2. Using our framework leverages code production by about a factor of 7:1. So what should you make of all this? I really don't know - except that there is a lot of room to fluff up the COBOL line counts.
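The ratios quoted above can be recomputed directly from the (approximate) figures in the list:

```python
# Numbers are the approximate statistics listed above.
total_loc       = 7_000_000
verbs           = 2_000_000
hand_written    = 900_000                 # programmer-written Procedure Division
frame_generated = 270_000 + 2_330_000     # application + corporate frames

print(f"LOC : verbs = {total_loc / verbs:.1f} : 1")               # 3.5:1, i.e. 7:2
print(f"total : hand-coded = {total_loc / hand_written:.1f} : 1") # roughly 7:1
print(f"generated share of Procedure Division: "
      f"{frame_generated / (frame_generated + hand_written):.0%}")
```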

I have worked with other COBOL program generators in the past. Some of these had absolutely stupid inflation factors (e.g. the 200-line-to-15K-line fluffing mentioned earlier). Given all these inflationary factors and the counting methodology used by Gartner, it may very well have been possible to "fluff" up to 200 billion lines of COBOL in 1997 - but the question remains: of what real use is this number? What could it really mean? I have no idea. Now let's get back to the counting angels problem!

NealB
Everyone says COBOL is widely used in finance. Yet despite contracting and consulting for quite a few investment banks over the years, I've never come across a single COBOL program.
anon
@NeilButterworth: At least in Germany one comes across such programs once in a while. However, such code is usually accompanied by lots of boilerplate/glue/support code in "common" languages whose l.o.c. count far outnumbers the size of the core code. I have found a similar situation with Fortran in its respective niche - a few thousand (at most tens of thousands of) lines of core code, supported by hundreds of thousands of lines of C/C++/Python/etc. code.
user8472
@NealB: Re 1) I think it is far easier to count the number of units sold than estimate the code running on them. Since IBM was market leader, they should at least have an order-of-magnitude idea of the market size. However, they don't make it easy to find such a number. Based on your arguments I agree, however, that this number alone might not get me as far as I would have liked.
user8472
@NealB: Re 3) This sounds like cheating to me. Like double/multi-counting the code of dynamic libraries on a modern system for every third-party software that uses it. Or like forking open-source projects in-house, customizing a couple of minor parts and then counting all lines of code as written from scratch. Or counting different versions/iterations/checkpoints of a code base as independent. Well, other people have pointed out that l.o.c. is a bad metric, so presumably one can really cheat this way. Re 4) see my edit.
user8472
@NealB: Re 5) I don't question the point that "never touching a running system" is a good idea and I don't advocate migrating code just for the sake of migrating. I am not even arguing whether Cobol is a "worse" language than others or which one might be a better choice. I am just wondering where the gigantic amount of code comes from and if that number can be realistic at all!
user8472
@NeilButterworth: I do know a few large banks and insurance companies that still have mainframe systems running Cobol. I think all of these companies are now in some stage of replacing these Cobol systems, but it's a long slow process, and the systems have to be kept up-to-date until the day they're shut off, so they still require active maintenance development.
FrustratedWithFormsDesigner
@NealB: Since the thread is closed I have decided to accept your answer since it addresses all points, has good arguments and gives a couple of specific facts that are very relevant to the original question.
user8472
+2  A: 

[Usual disclaimer - I work for a COBOL vendor]

It's an interesting question and it's always difficult to get a provable number. On the number of COBOL programmers estimates - the 2 - 3 million number may not be orders of magnitude in error. I think there have been estimates of 1 or 2 million in the past. And each one of those can generate a lot of code in a 20 year career. In India tens of thousands of new COBOL programmers are added to the pool every year (perhaps every month!).

I think the amount of automatically generated code is probably bigger than might be thought. For example, PACBASE is a very popular application in the banking industry. A very large global bank I know of uses it extensively; they generate all their code into COBOL and estimate this generated code is 95% of their total code base, with the other 5% being hand-coded/maintained. I don't think this is uncommon. The maintenance and development of those systems is typically done at the model level, not in the generated code, as you say.

There is a group of applications that were missing from the original question - COBOL isn't only a mainframe language. The early years of Micro Focus were almost entirely spent in the OEM marketplace - we used to have something like 200 different OEMs (lots of long-gone names like DEC, Stratus, Bull, ...). Every OEM had to have a COBOL compiler on their box alongside C and Assembler. A lot of big applications were built at that time and are still going strong - think about the largest HR ERP systems, the largest mobile phone billing systems etc. My point is that there is a lot of COBOL code that was never on an IBM mainframe and is often overlooked.

And finally, the size of the code base may be larger in COBOL shops than the "average". That's not just because COBOL is verbose (or was - that's not been the case for a long time) but because the systems are just bigger - they're in large organizations, doing a large number of disparate tasks. It's very common for sites to have tens of millions of LoC.

Mark
+2  A: 

I don't have figures, but my first "real" job was on IBM 370s.

First: Number sold. In 1974, a large railway ran on three 370s. These were big 370s, though, and you could get a small one for a whole lot less. (For perspective, at that time whether to get another megabyte was a decision on the VP level.)

Second: COBOL code is what you might call "fluffy", since a typical sentence (COBOLese for line) might be "ADD 1 TO MAIN-ACCOUNT OF CUSTOMER." There would be relatively few machine instructions per line. (This is especially true on the IBM 360 and onwards, which had machine instructions designed around executing COBOL.) BTW, addresses were 16 bits, four to designate a register (using the bottom 24 bits as a base address) and 12 as an offset. This meant that something under 64K could be addressed at a time, since not all of the 16 registers could be used as base registers for various reasons. Don't underestimate the ability of people to fit programs into small memory.
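To make the base+displacement scheme concrete, a small sketch (Python, purely illustrative - the register contents are hypothetical):

```python
# System/360-370 style addressing as described above: an address field in an
# instruction is a 4-bit base register number plus a 12-bit displacement.
BASE_BITS, DISP_BITS = 4, 12

def effective_address(base_registers, base_no, displacement):
    assert 0 <= base_no < 2 ** BASE_BITS        # one of 16 registers
    assert 0 <= displacement < 2 ** DISP_BITS   # at most 4095
    return (base_registers[base_no] + displacement) & 0xFFFFFF  # 24-bit addresses

regs = [0] * 16
regs[12] = 0x8000                           # hypothetical base loaded by the program
print(hex(effective_address(regs, 12, 0x123)))              # 0x8123
print(f"window per base register: {2 ** DISP_BITS} bytes")  # 4096
```

Each base register thus opens only a 4 KiB window at a time, which is part of why fitting programs into small memory was such a routine skill.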

Also, don't underestimate the number of programs. The program library would be on disk and tape, and was essentially only limited by cost and volume. (Earlier on, they'd be on punch cards, which had serious problems as data and program storage.)

Third: Yes, most software was hand-written for the business at that time. Machines were far more expensive then, and people were cheaper. Programs were written to automate the existing business processes, and the idea that you could get outside software and change your business practices was almost heresy. This changed over time, of course.

Fourth: Programmers could go much faster than today, in lines of code per person-year, since these were largely big dumb programs for big dumb problems. In my experience, the DATA DIVISION was a large part of each COBOL program, and that would frequently take large descriptions of file layouts and repeat them in each program in the series.

I have limited experience with program generators, but it was very common at the time to use one to generate an application and then modify the output. This was partly just bad practice, and partly because a program generator couldn't do everything needed.

Fifth: PL/I was not heavily used, despite IBM's efforts. It ran into early bad press, although as far as I know the only real major problem that couldn't be fixed was figuring out the precision system. The Defense Department used Ada and COBOL for entirely different things. You are omitting assembly language as a competitor, and lots of small shops used BAL (also called ASM) instead of COBOL. After all, programmers were cheap, compilers were expensive, and there were a whole lot of things COBOL couldn't do. It was actually a very nice assembly language, and I liked it a lot.

David Thornley