technicalities: specs (Lisp, ML, ALGOL, Ada, JS2, ES4, RnRS, Redex, POPLmark, Rust, K, Comp/JSCert)

Feb. 6th, 2014 | 09:17 am

According to the History of Programming Languages website (currently offline) there have been somewhere around 8500 programming languages developed since the Autocodes of the early 50s. The Internet Archive has scraped HOPL's taxonomy, as well as a few references and at least the primary year indexes (the year of my birth, for example).

It's an open secret among language designers that many, perhaps most languages in that set are not very well specified. To explain what this means and why it's so is ... a bit complicated. I will try to explain a little bit about it here, as well as some of my history with it and some recent innovations in the field that have been getting my attention.

To anyone uninterested in language design, this is super boring stuff, and this entry turned out to be much longer and involve much more wandering backstory than the previous one, so as usual I'll cut it for brevity.


When we build a programming language, we're creating a human-computer interface: a system of rules and relations with two very different "sides". On one side the rules define the set of all the programs a human is allowed to legally write in the language, in a way that is meaningful to a human programmer and captures their intentions; on the other side the rules define some mechanistic relationship between those legal programs and the sorts of computational actions and mathematical values -- input, output, reading and writing numbers in memory and so forth -- that have meaning inside a computer.

Artifacts


Language engineering, as a job, involves producing a set of artifacts that "specify" a coherent set of such rules and relationships. Unambiguously, completely, and in some sense practically: the artifacts need to be adequate for the human side -- people -- to write programs that work, and for the machine side -- interpreters and compilers -- to run those programs. Often we produce several such artifacts simultaneously, each of which captures some aspect of the job. Some examples:

  1. A spec document in a natural language like english, sometimes in bullet-point form or with all-caps words like MUST and SHALL at the very important parts.

  2. A formal grammar, written in some grammar-specifying formalism like ABNF.

  3. A formal semantics, typically of the type system and evaluation model, written in the logic rules of some semantic framework like operational or denotational semantics.

  4. A reference implementation (RI) -- an interpreter or compiler -- written in some executable language (see the sketch after this list).

  5. A conformance testsuite: a set of programs expected to work, accompanied by their expected results.
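
To make artifact 4 (and a sliver of artifact 5) concrete, here's a minimal sketch for a hypothetical toy expression language -- my illustration, not drawn from any real spec -- written in OCaml. A real RI is of course enormously larger, but the shape is the same: a definition of the legal programs, plus an executable assignment of meaning to them.

```ocaml
(* A hypothetical toy language: integer literals, addition and
   multiplication.  The AST fixes the set of legal programs. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* The reference interpreter pins down the evaluation model by
   executable example: every legal program gets exactly one value. *)
let rec eval (e : expr) : int =
  match e with
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b

(* A one-line conformance test (artifact 5): a program paired with
   its expected result.  1 + 2 * 3 = 7. *)
let () = assert (eval (Add (Num 1, Mul (Num 2, Num 3))) = 7)
```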


There are three main risks for any specification artifact:

  1. Ill-specifying: describing a language that either doesn't make sense -- contains internal contradictions -- or that causes programs written in it to themselves be systematically flawed (unsafe, erroneous).

  2. Underspecifying: not providing enough detail, so people can't determine whether the language is ill-specified, and/or can't tell which programs are valid, or what they mean (see the sketch after this list).

  3. Overspecifying: providing too much detail, so people have too many details to check for ill-specificity / bugs, and/or are too constrained in how they implement the language (needless idiosyncrasies).
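
To make risk 2 concrete with a real (if deliberate) case: the OCaml manual leaves the evaluation order of function arguments unspecified, so the following legal program has two defensible behaviors, and a reader of the spec alone can't tell which one they'll get.

```ocaml
(* The OCaml spec does not say in which order function arguments
   are evaluated.  A conforming implementation may print "ab" or
   "ba"; the mainstream compilers happen to evaluate right-to-left
   and print "ba". *)
let tag name = print_string name; name
let pair x y = (x, y)
let () = ignore (pair (tag "a") (tag "b"))
```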


Depending on the artifacts you provide, and the techniques you use to develop them, you will face a different combination of such risks. Oh, and if you produce more than one artifact, they'd better agree with each other!

Formalism and mechanization


When one of the artifacts in this list is present in a sufficiently formal form -- subject to an internal structure, logic and set of rules of its own -- it may itself be subject to mechanization by some program that implements the formalism: a meta-implementation. For example, when someone ships a formal grammar in the (A|E)BNF formalism, one of the advantages is that you can machine process it: analyze it, prove properties about it (say, complexity of parsing), and synthesize parsers for it using tools like YACC ("yet another compiler-compiler") or Bison (so named as a pun on YACC). This is super cool when it works, and is one of the strong motives for using formalisms in the first place: computers are good at helping us work with them.
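
For a taste of what such an artifact looks like, here is a tiny illustrative grammar in ABNF (RFC 5234 notation) giving a concrete syntax for the toy expression language above -- my sketch, not drawn from any real spec -- precise enough that a tool could analyze it or synthesize a parser from it:

```abnf
; Toy expression grammar in ABNF (RFC 5234) -- illustrative only.
expr   = term *( "+" term )        ; addition, lowest precedence
term   = factor *( "*" factor )    ; multiplication binds tighter
factor = num / "(" expr ")"
num    = 1*DIGIT                   ; DIGIT is an ABNF core rule
```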

Curiously, we often get close to using a formalism, but then stray. We may ship a "nearly" formal grammar, but not really tighten it up to the point of usability in production; we may choose to ship a "production" parser written by hand, rather than feeding a formal grammar through YACC, for reasons of comprehensibility, performance, error correction, diagnostic generation and so forth. As a case in point, it seems likely to me (despite my best efforts) that Rust may ship without a formal grammar. This is much more common than one might imagine.

Similar things happen with formal semantics. Often they're omitted entirely, but when they're written they're still often written in LaTeX! I.e. they're written for typesetting, for human readers; they're not mechanizable. This is partly because the mechanization tools lag behind the theory; it's also partly because the language-design process often proceeds informally and the formalization plays catch-up long after the production implementation has shipped (removing some of the advantage of mechanization -- you don't need to synthesize a slow interpreter from the spec if you already have a production compiler). Partly it's also due to simple cognitive load: there's only so much mental budget available and learning formalisms and the mechanizations thereof is quite a chore in itself. Often it's just not considered to pay for itself, especially in industry. Academia is often more patient, and they tend to develop tidier, better-studied languages. Industrial languages are more often "specified" with english-plus-an-RI, and seldom with any formal semantics.
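
For instance, here is a pair of small-step rules for addition, as they might appear in such a LaTeX-typeset semantics -- perfectly readable to a human, completely opaque to a machine:

```latex
% Congruence: step the left operand; and the actual reduction,
% once both operands are finished numbers.
\[
\frac{e_1 \longrightarrow e_1'}
     {e_1 + e_2 \longrightarrow e_1' + e_2}
\qquad
\frac{n = n_1 + n_2}
     {n_1 + n_2 \longrightarrow n}
\]
```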

Languages like, say, PHP or Perl are at an extreme end of this spectrum. Nobody even pretends they have a formal semantics, much less a mechanizable one; Perl5 mostly even lacks a formal grammar. The RI is the strongest component of the spec. On the other end there are things like the Lisps and MLs, which actually grew out of the implementation layers for formal reasoning tools in the late 50s and early 70s respectively, and often have very well-studied, well-formalized semantics, and well-studied reference implementations. The descendants of these formal reasoning tools live on in more powerful and general modern incarnations like Coq, HOL, Isabelle, Twelf, ACL2 and (as I mentioned earlier) Maude.

Ancient history


This tension has been with us as long as people have been specifying languages. Consider ALGOL (initially IAL, the "International Algebraic Language", SO CUTE), the first language to ever seek the status of an international standard. ALGOL is old enough that BNF was actually invented to specify it: the notation was introduced to formalize the grammar in ALGOL60. Consider the infamous later revision, ALGOL68, which attempted to specify more of the language using a more-powerful, undecidable formalism (W-grammars, a sort of two-level grammar then in vogue): this variant was mostly considered "too complicated to understand", and tools to work with the spec were beyond the familiarity of most implementers. Simplifications that unwound the cognitive load of the spec back to BNF-and-english followed.

By contrast, consider that in 1975 the US DoD design process that resulted in Ada did not attempt any fancier-than-BNF formalisms. The requirement documents (STRAWMAN, WOODENMAN, TINMAN, IRONMAN and STEELMAN); the candidate language specs (Red, Green, Blue, Yellow and Tartan); and the final winner, Green, which became Ada, were all BNF-and-english. Academic tools -- and many were being developed! -- didn't really figure into the process. The only STEELMAN requirement that reads on this at all is
To the extent that a formal definition assists in achieving the above goals (i.e., all of section 1), the language shall be formally defined.
In practice no additional formalisms came into play, and the informal english bits just kept growing (indeed, the Tartan proposal was an academic response to what was perceived as runaway complexity in the informal process). It's worth studying the resulting documents. They're quite brute-force, but also highly comprehensive and understandable. One can probably program in Ada as confidently as in Standard ML, despite the fact that the informal definition of the former is as much as 10 times larger than the formal definition of the latter, and there's no mechanization in sight.

More recent and personal history


Anyway, that's all backstory from before I was born! Fast forward through a few decades of people struggling with this tension, specifying languages and maturing tools, while I learned to tie my shoes and ride bikes. By 2006 I was working in industry and starting to get involved in language design, having spent a while working on other people's languages. I joined Mozilla in order to work on a language-spec-in-progress called ES4 ("EcmaScript 4", a.k.a. the trademarked "JavaScript"). ES4 never shipped, for reasons I'm not going to get much into here, but along the way we struggled a fair bit with the tension between formalisms and informality. Curiously, ES4 was actually a sort of "reboot" of an earlier Netscape language-design process called JavaScript 2, which was highly formalized! It was an executable formal semantics written in Common Lisp, by Waldemar Horwat. (Extra-curiously, I interviewed to work with Waldemar on this language back in 1999, but that's another story).

Part of the language-design context in 2006-8 was an academic competition called the POPLmark challenge, a set of successively harder-to-mechanize tasks in formal language specification that had several viable entries from the communities of researchers around Coq, HOL, Twelf and the like. Since I saw this going on and was very green and optimistic, and knew that the JS2 effort had been highly formalism-friendly, I tried to push for doing ES4 in a mechanized logical framework. The committee -- sympathetic but much more experienced and dubious -- pushed back and we settled on a (terrible) compromise of doing our RI in ML, with some modules normative and some informative. At least that would be higher-level and easier to work with than the "english pseudocode" that made up the ES3 spec, and might rest on better formal foundations. This was a mistake, but it's all over now. ES4 died, ES5 succeeded and ES6 is on the way.

Around the same time as ES4, another language called Scheme (a very well-studied Lisp) was undergoing its 6th revision: R6RS. Some of the folks working on ES4 were following R6RS closely, and I vaguely remember we entertained a guest expert from that community. The Scheme community pushed hard on formalization in that round, using a tool from the PLT group called Redex. Redex is a very domain-specific tool, not a general logical framework; it's aimed squarely at the needs of language designers, and in practice this seems to make the barrier to entry much lower. Perhaps as evidence: the resulting R6RS spec did actually ship with an executable Redex model.
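
Redex models are Racket programs, so I won't reproduce one here; but the essential move -- turning typeset reduction rules like the ones above into something you can run and test -- can be sketched in any functional language. A hedged OCaml rendering (my illustration, not Redex itself):

```ocaml
type expr = Num of int | Add of expr * expr

(* One step of reduction, if any applies: the executable counterpart
   of the typeset rules.  Reduce a finished addition, or else step
   inside the leftmost unfinished subterm. *)
let rec step (e : expr) : expr option =
  match e with
  | Num _ -> None                             (* values don't step *)
  | Add (Num a, Num b) -> Some (Num (a + b))
  | Add (Num a, e2) ->
      (match step e2 with
       | Some e2' -> Some (Add (Num a, e2'))
       | None -> None)
  | Add (e1, e2) ->
      (match step e1 with
       | Some e1' -> Some (Add (e1', e2))
       | None -> None)

(* Iterate to a normal form; an executable model can be tested
   directly against example programs, which is most of its charm. *)
let rec run e = match step e with Some e' -> run e' | None -> e

let () = assert (run (Add (Num 1, Add (Num 2, Num 3))) = Num 6)
```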

The present day


Subsequently, as I mentioned above, I've spent quite some time working on Rust. We stayed on the less-formal side while developing it, favoring english and a solid RI (perhaps I was a bit burned by ES4). I worry of course that it should have more formal parts, but having stepped away from it I'm kinda resigned to an observer status; this post is not about trying to influence its direction, it's more a retrospective and context post, pointing out some tools and projects I think are currently interesting. Redex is definitely one of them, and I was excited to see both Lindsey working on an initial Redex model, and later Niko developing one of his own. I don't know what will actually ship, but it's an encouraging sign that in 2014 industrial systems languages at least consider this a plausible step.

Parallel to my interest in Redex, I've of course been fiddling around with Maude (mentioned in an earlier post), and have noticed a tool similar in spirit to Redex emerge out of the Maude community. This is called the K Framework, and seems quite focused on providing domain-specific tools to language engineers. It has, in its short life, chalked up some pretty impressive mechanized (executable) models: reasonably complete semantics for Scheme, Python and C, as well as some work-in-progress semantics of other production systems (JVM, LLVM, Haskell).

Finally, mechanized specification in Coq has not stood still! The community received a bit of a jolt in 2008 when Xavier Leroy, Sandrine Blazy and their team released CompCert: a Coq formalization of both C and PowerPC machine code, plus a reference-implementation compiler that translates between the two, with mechanized proofs of correctness. Following on the heels of this, a project called JSCert just released a Coq formalization of ES5 (the version of EcmaScript that came after the ES4 process collapsed). Like CompCert, this includes both spec and RI, as well as a proof of correctness relative to the spec.

So it's exciting times! The gulf between formal (and mechanized-formal) and informal appears to be, if not exactly dissolving, at least beginning to thaw. Industrial languages in production are beginning to see realistic, end-to-end mechanized specification for the first time, if not during development then at least "shortly thereafter". I don't know exactly when the tide will turn permanently, but it seems likely within the next few generations of languages.

This entry was originally posted at http://graydon2.dreamwidth.org/1839.html. Please comment there using OpenID.

Comments {6}


mechanization of standard ml

from: Robert Harper
date: Feb. 7th, 2014 04:18 am (UTC)

Karl Crary and I completed a mechanization of the semantics of Standard ML more than five years ago. It's about 30,000 lines of Twelf code, which we wrote by pair-programming together over the course of a semester, meeting one day per week for about two hours each time. A big part of this work was reported in our POPL paper with Daniel K. Lee, and we subsequently published the tarball of the entire proof of type safety for the full language. It was the first such mechanization effort at scale, to my knowledge, and has not been matched since by any widely-used language.





Re: mechanization of standard ml

from: graydon
date: Feb. 7th, 2014 09:13 pm (UTC)

Cool! That's an impressive bit of work, congratulations! I think I did not see it when it was published.



POPLmark challenge

from: Robert Harper
date: Feb. 7th, 2014 04:20 am (UTC)

I would also mention that Karl Crary, Michael Ashley-Rollman, and I "won" the POPLmark challenge (it was, after all, a competition) by providing a complete solution within two weeks of the announcement of the challenge. The first partial solution in Coq only arrived 6 months later; I am not sure whether any full solution was achieved subsequently.




Re: POPLmark challenge

from: Robert Harper
date: Feb. 7th, 2014 05:45 pm (UTC)

I should have said, but maybe it goes without saying, that we used Twelf, and built on the Harper-Stone methodology for defining Standard ML.

Incidentally, the stated purpose of the POPLmark challenge was to determine which system was best for doing mechanized metatheory. Obviously Twelf won hands-down, but it seems that that was not the intended answer, and the result was accepted grudgingly and was subsequently ignored.



Re: POPLmark challenge

from: graydon
date: Feb. 7th, 2014 09:08 pm (UTC)

I don't quite know how to respond to this in a way that's not offensive; I want to assure you that I have no horse in this race and it's entirely possible that Twelf Is Better Than Coq (which seems to be what you're getting at). I apologize for giving short shrift to Twelf. I don't actually know how to evaluate them; I don't think POPLmark can tell us that.

For what it's worth: I never got the impression that the challenge was a "pick-a-winner" competition of systems. If you look at the initial posting it seems clear to me that the organizers just wanted to get a very general "sense of the field", stimulate discussion, see some approaches people would take on benchmark problems (which are still "toy sized" relative to industrial PL problems, but whatever).

As a (now-ex) paid PL designer (not a logical-framework expert, much less the author/designer of one) I still don't know how to evaluate the POPLmark results. I don't understand the submissions especially well; I can't give a particularly compelling description of the difference between a HOAS solution and a locally nameless one; I can barely decipher the simplest proofs in each framework, and certainly couldn't make any of my own at this point, about any of the languages I've worked on.

As I pointed out earlier in this post, my work as a PL designer has been about making human/computer interfaces, and most of that work honestly falls on the "human" side. The human is the more-fallible, harder-to-predict part with the much, much more limited (cognitive) budget. I can design languages with cool features that I can figure out a machine implementation for much more readily than I can figure out how to make them palatable to a human. To my eyes, the biggest problem with all proof assistants is similar: not that they're not powerful enough as technical artifacts, but that too often, humans can't quite figure out how to use them to their full power.

That is: either I am incompetent (you're welcome to reach that conclusion, it doesn't offend me; in some sense we all are) or else the cognitive load of all the formalisms and their logical-framework mechanizations is probably still too high for the target audience. This is why I find Redex and K a little more appealing: for reasons I can't even articulate I find their models more approachable. Maybe it's just that they give up on the "proof" side and aim for "high-confidence model checking". I don't know.

Personally I think it's telling that most of the submissions to POPLmark still seemed closely associated with the particular academic lineage surrounding the tool's designers. That is: few people outside the very top experts on the planet could operate these tools to the effect required. That suggests to me that none of them are quite approachable enough yet to be used regularly in language design. Whether approachability is the goal or not is a different question, possibly of no interest to the tool designers. I don't know.



Re: POPLmark challenge

from: Robert Harper
date: Feb. 8th, 2014 05:12 am (UTC)

In my opinion it does not make sense to say that disparate systems such as Twelf and Coq are better than one another. So I am not saying that. What I am saying is that a benchmark was formulated, a challenge was laid, and we won that challenge hands-down using Twelf. If the challenge had been for something other than mechanization of programming language metatheory, then we wouldn't even have entered, because Twelf would not have been suitable to the task. This is perfectly clear, and not in contention. But the idea at the time was to see which system was better for doing language metatheory, and this was the benchmark that was established to test that premise. We didn't choose the benchmark; in fact, it is quite clear from the formulation that the benchmark was chosen to defeat us. Although we showed that Twelf was superior for the given purpose, the proposers just carried on promoting Coq for mechanized metatheory. In fact it is such a poor choice that the currently recommended practice is to use an erroneous notion of substitution in a non-erroneous way in order to make progress on this sort of problem using Coq. As I mentioned we went on to use Twelf to solve this sort of problem at scale without any significant difficulties. Comparable results have not, to my knowledge, been achieved using any other system.
