Programming Languages: Syntax

COS 441 - Syntax - Feb 6, 1996

A programming language is a formal language used to communicate algorithms both from programmer to programmer and from programmer to machine. A formal language consists of:

a set of symbols;
rules for forming terms;
rules for transforming terms to terms.

Some general-purpose programming languages are C, C++, Pascal, and Ada. Some special-purpose languages are TeX, PostScript, Java bytecode, TCP/IP, and perhaps the Win32s API.

To use a programming language effectively we must study and understand it from three perspectives:

Syntax - the set of symbols and rules for forming terms.
Semantics - the rules for transforming terms to terms.
Pragmatics - using the particular constructs of the language.

Here are three ways of expressing "increment the i-th element of array x" in different programming languages.

(a) x[i] = x[i] + 1; [C]
(b) (vector-set! x i (+ (vector-ref x i) 1)) [Scheme]
(c) x[i] = x[i] + 1; [Java]

These expressions have approximately the following semantics.

(a) if i in bounds of x then x[i] <- (x[i] + 1) mod 2^32
    else who knows?
(b) if x is not a vector then ERROR
    else if i is not an integer then ERROR
    else if i is not in bounds of x then ERROR
    else if x[i] is not an integer then ERROR
    else x[i] <- x[i] + 1
(c) if i is not in bounds of x then ERROR
    else x[i] <- (x[i] + 1) mod 2^32

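Despite the apparent similarity of the C and Java expressions, the semantics of the Java expression is closer to that of the Scheme expression than to that of the C expression: both Java and Scheme check that i is in bounds, while C does not.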
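The difference between semantics (a) and (c) can be sketched in C itself. This is only an illustration under stated assumptions: the function names and the status codes are hypothetical, the array length is passed explicitly because a C array does not carry its own bounds, and uint32_t is used so that the arithmetic wraps modulo 2^32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch. */
enum { INC_OK = 0, INC_OUT_OF_BOUNDS = 1 };

/* (a) C-style: no bounds check. If i is out of bounds, "who knows?" */
void unchecked_increment(uint32_t *x, size_t i)
{
    assert(x != NULL);
    x[i] = x[i] + 1;            /* uint32_t arithmetic wraps modulo 2^32 */
}

/* (c) Java-style: check the index first and report an ERROR instead of
   touching memory outside the array. */
int checked_increment(uint32_t *x, size_t len, size_t i)
{
    assert(x != NULL);
    if (i >= len)
        return INC_OUT_OF_BOUNDS;
    x[i] = x[i] + 1;            /* same wraparound as in (a) */
    return INC_OK;
}
```

The checked version pays for one comparison on every access; the unchecked version trades that cost for undefined behavior on a bad index.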

Now consider expressing "increment x[0] through x[N]". In C we write:

for( i = N; i >= 0; i-- )
    x[i] = x[i] + 1;

In Scheme we write something rather different:

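; Apply f to n, n-1, ..., 0 (does nothing when n is negative).
(define natural-foreach
   (lambda (f n)
      (cond ((>= n 0) (begin
                         (f n)
                         (natural-foreach f (- n 1)))))))
; Increment the i-th element of the vector x.
(define inc-x (lambda (i)
   (vector-set! x i (+ (vector-ref x i) 1))))
(natural-foreach inc-x N)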

Finally, in Java we would probably write something that looks the same (has the same syntax) as in C. C/Java pragmatics suggest the use of iteration, while in Scheme we use recursion.

Most programming language courses survey a variety of programming languages, covering mostly syntax, with only a short time left for semantics. Instead, we will use only Scheme, which will allow us to move on quickly to semantic issues. We will use definitional interpreters and spend a little time looking at pragmatic issues.

This course will NOT teach you:

any practical programming languages; nor
how to implement high performance programming languages.

But it will teach you:

how to learn a new programming language quickly;
how to choose a programming language for a particular task;
how to design and build interpreters;
more about programming languages than the designers of most popular languages will ever know.

Syntax

To simplify understanding and analyzing a language's syntax, we separate syntax into three levels: lexical elements, context-free syntax, and context-sensitive syntax. In English, letters form words, which form sentences. In programming languages, characters form tokens, which form terms. Tokens are lexical elements.

Lexical Analysis

Following are some of Scheme's tokens:

strings of digits
"characters ..."
' ` #f #t '() #\a
strings of letters, digits, and special characters such as + - * @ $

Tokens of the last kind are called "identifiers". Scheme's syntax for identifiers is more liberal than that of many other languages; for example, +, a+b, and -a*2 are all identifiers.

Aside: Comments are usually discarded by a language processor during lexical analysis; that is, while the language processor is converting the stream of input characters into a stream of tokens. Scheme's comments begin with a semicolon and extend to the end of the line.
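To make the character-to-token conversion concrete, here is a minimal sketch of a lexical analyzer for a small subset of Scheme's tokens: parentheses, strings of digits, and identifiers. It discards whitespace and semicolon comments as described above. The names (next_token, token_kind, and so on) are hypothetical, and strings, quote marks, and # tokens such as #t are deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum token_kind { TOK_EOF, TOK_LPAREN, TOK_RPAREN, TOK_NUMBER, TOK_IDENT };

/* A delimiter ends a number or identifier token. */
static int is_delimiter(char c)
{
    return c == '\0' || c == '(' || c == ')' || c == ';' ||
           isspace((unsigned char)c);
}

/* Reads one token starting at *src, copies its text into buf (at most
   cap - 1 characters), advances *src past the token, returns its kind. */
enum token_kind next_token(const char **src, char *buf, size_t cap)
{
    assert(src != NULL && *src != NULL && buf != NULL);
    const char *p = *src;
    size_t n = 0;
    enum token_kind kind;

    /* Discard whitespace and comments (semicolon to end of line). */
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (*p != ';')
            break;
        while (*p != '\0' && *p != '\n')
            p++;
    }

    if (*p == '\0') {
        kind = TOK_EOF;
    } else if (*p == '(' || *p == ')') {
        kind = (*p == '(') ? TOK_LPAREN : TOK_RPAREN;
        if (cap > 1)
            buf[n++] = *p;
        p++;
    } else {
        /* Anything else runs up to the next delimiter. It is a number
           if it consists only of digits, otherwise an identifier. */
        int all_digits = 1;
        while (!is_delimiter(*p)) {
            if (!isdigit((unsigned char)*p))
                all_digits = 0;
            if (n + 1 < cap)
                buf[n++] = *p;
            p++;
        }
        kind = all_digits ? TOK_NUMBER : TOK_IDENT;
    }
    if (cap > 0)
        buf[n] = '\0';
    *src = p;
    return kind;
}
```

Calling next_token repeatedly on the text (+ a+b 42) yields a left parenthesis, the identifiers + and a+b, the number 42, and a right parenthesis. Note that a+b is a single token here, unlike in C.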

Context Free Syntax

Consider a simple query language. In English, we define a query to be a list of words, NOT query, or (query AND query). To be more precise, let's define queries using mathematics (specifically set theory):

Query = { w1 ... wN | w1,...,wN in Word }
      U { NOT q | q in Query }
      U { (q1 AND q2) | q1, q2 in Query }

For defining the context-free syntax of programming languages, we often use a special language that is more concise. It is called BNF (Backus-Naur Form):

Query ::= Word * | NOT Query | (Query AND Query)

In BNF, terminals (or tokens) are symbols that do not appear on the left of the ::= operator. In the example above, AND, (, ), and NOT are the terminals. Query is the only non-terminal. Well, almost. Word is really a non-terminal that we haven't bothered to define.
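A BNF definition translates almost directly into a recursive recognizer, with one case per alternative. The following is a sketch only, under two assumptions that the notes do not state: the input has already been split into tokens (an array of strings), and a Word is any token other than NOT, AND, ( and ). The function names are hypothetical.

```c
#include <string.h>

/* Assumption: a Word is any token that is not one of the terminals. */
static int is_word(const char *t)
{
    return strcmp(t, "NOT") != 0 && strcmp(t, "AND") != 0 &&
           strcmp(t, "(") != 0 && strcmp(t, ")") != 0;
}

/* Tries to read one Query starting at toks[pos]. Returns the index just
   past the Query, or -1 if no Query can be read there. */
static int parse_query(const char **toks, int n, int pos)
{
    /* Query ::= NOT Query */
    if (pos < n && strcmp(toks[pos], "NOT") == 0)
        return parse_query(toks, n, pos + 1);

    /* Query ::= (Query AND Query) */
    if (pos < n && strcmp(toks[pos], "(") == 0) {
        int p = parse_query(toks, n, pos + 1);
        if (p < 0 || p >= n || strcmp(toks[p], "AND") != 0)
            return -1;
        p = parse_query(toks, n, p + 1);
        if (p < 0 || p >= n || strcmp(toks[p], ")") != 0)
            return -1;
        return p + 1;
    }

    /* Query ::= Word *   (zero or more words) */
    while (pos < n && is_word(toks[pos]))
        pos++;
    return pos;
}

/* Returns 1 if the whole token sequence is a Query, 0 otherwise. */
int is_query(const char **toks, int n)
{
    return parse_query(toks, n, 0) == n;
}
```

For example, the tokens ( cat AND NOT dog ) form a Query, while ( cat AND dog without the closing parenthesis does not. Because Word * allows zero words, the empty sequence is also a Query under this reading of the grammar.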

BNF can only describe context-free languages. The following set of terms is impossible to describe in BNF.

Kwery = { w1 ... wN | w1,...,wN in Word }
      U { NOT q | q in Kwery }
      U { (q1 AND q1) | q1 in Kwery }

An AND-Kwery requires that both of its arms be the same. This set of terms is context-sensitive: selecting a Kwery to place in the hole of the term (q1 AND []) (where [] denotes a hole) cannot be done without knowing the context surrounding the hole. Specifically, we must know what q1 is, because to get a valid Kwery we can only place q1 in the hole.
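Although BNF cannot express the Kwery constraint, a program can still check it: recognize the context-free shape first, then compare the two arms. The sketch below does this in C under the same assumptions as before (pre-split tokens, a Word is any token other than NOT, AND, ( and ), hypothetical function names). Comparing the arms token by token is a simplification that works here because each token sequence corresponds to at most one term.

```c
#include <string.h>

/* Assumption: a Word is any token that is not one of the terminals. */
static int is_word(const char *t)
{
    return strcmp(t, "NOT") != 0 && strcmp(t, "AND") != 0 &&
           strcmp(t, "(") != 0 && strcmp(t, ")") != 0;
}

/* Returns 1 if the len tokens starting at a equal those starting at b. */
static int same_tokens(const char **toks, int a, int b, int len)
{
    for (int k = 0; k < len; k++)
        if (strcmp(toks[a + k], toks[b + k]) != 0)
            return 0;
    return 1;
}

/* Tries to read one Kwery starting at toks[pos]. Returns the index just
   past the Kwery, or -1 if no Kwery can be read there. */
static int parse_kwery(const char **toks, int n, int pos)
{
    if (pos < n && strcmp(toks[pos], "NOT") == 0)
        return parse_kwery(toks, n, pos + 1);

    if (pos < n && strcmp(toks[pos], "(") == 0) {
        int left = pos + 1;                       /* start of first arm  */
        int mid = parse_kwery(toks, n, left);     /* should be at AND    */
        if (mid < 0 || mid >= n || strcmp(toks[mid], "AND") != 0)
            return -1;
        int right = mid + 1;                      /* start of second arm */
        int end = parse_kwery(toks, n, right);    /* should be at ")"    */
        if (end < 0 || end >= n || strcmp(toks[end], ")") != 0)
            return -1;
        /* The context-sensitive part: the hole must be filled with q1. */
        if (mid - left != end - right ||
            !same_tokens(toks, left, right, mid - left))
            return -1;
        return end + 1;
    }

    while (pos < n && is_word(toks[pos]))
        pos++;
    return pos;
}

/* Returns 1 if the whole token sequence is a Kwery, 0 otherwise. */
int is_kwery(const char **toks, int n)
{
    return parse_kwery(toks, n, 0) == n;
}
```

With this check, ( a AND a ) is accepted and ( a AND b ) is rejected, even though both are valid Querys. The arm comparison is exactly the step that has no counterpart in the BNF.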

Reading

The Little Schemer (whole book)
EOPL (Essentials of Programming Languages) Chapter 1

Exercise

Write a C program that declares an array x, initializes it, and increments each element.
Write a Scheme program that does the same.
Write a C program that does it using recursion.