Programming Languages/Syntax Specification
There are two ways to describe a language: by its syntax or by its semantics. The syntax of a language is a set of rules that defines what strings of characters (sentence or statements) belong to this language; the semantics of a language describe the meaning of a given statement.
Specifying the Syntax of Programming Languages
[edit | edit source]Lexemes and Tokens
[edit | edit source]For a language, one can decompose a statement into a string of lexemes, the indecomposable building blocks of a language. For example, the following w:Perl statement:
do { sleep 1; } while (++$i < 10);
contains lexemes: do
, {
, sleep
,
1
, ;
, }
, while
, (
, ++
, $i
, <
, 10
, )
and ;
.
Lexemes can be grouped into categories. Such a category is called a token.
Context-Free Grammars
[edit | edit source]- Backus-Naur Form(BNF)
- Arithmetic grammar
- Unary negation (~)
Parse Trees
[edit | edit source]- Parse tree of complex formula using exponents, et cetera
Extended Backus-Naur Form
[edit | edit source]Extended Backus-Naur Form is intended to be a precise method for specifying the syntax of a language. It is a metalanguage, in that it is a language used to describe languages. It is also commonly seen employed in command usage or help documents.
The complete documentation on EBNF is to be found here: [1].
The basic elements of a language would be described using EBNF terminal symbols, which are then grouped together into logical higher-level expressions of the overall language using nonterminal symbols. Terminal symbols specify actual valid character patterns used to represent things like the proper syntax for variable names in the language, or for representing numbers. Nonterminal symbols group the different possible representations of a given thing (say, numerical constants in decimal, hexadecimal or scientific notation format) into a single logical thing ([Numerical_Constant]). Ultimately, by building up the various syntactical elements and expressing the ways in which they may be combined, one ends up with a symbol representing a program in the language in question.
A couple of examples are to be found in the man pages for the awk utility.
(excerpts are from [2])
First, its usage:
- awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
The items in square brackets are an example of EBNF and denote an optional parameter to use on the awk command line.
Secondly, the usage detail section is partially specified in EBNF.
In the example below, words such as 'pattern', 'action', 'expression' and 'statement' are EBNF nonterminal symbols.
A pattern-action statement has the form
pattern { action }
A missing { action } means print the line; a missing pattern always matches. Pattern-action statements are separated by newlines or semicolons.
An action is a sequence of statements. A statement can be one of the following:
if( expression ) statement [ else statement ] while( expression ) statement for( expression ; expression ; expression ) statement for( var in array ) statement do statement while( expression ) break continue { [ statement ... ] } expression # commonly var = expression print [ expression-list ] [ > expression ] printf format [ , expression-list ] [ > expression ] return [ expression ] next # skip remaining patterns on this input line nextfile # skip rest of this file, open next, start at top delete array[ expression ]# delete an array element delete array # delete all elements of array exit [ expression ] # exit immediately; status is expression
Statements are terminated by semicolons, newlines or right braces.
Entire languages can also be documented in EBNF. The link below documents the ANSI definition of C++, and many parts of this document are written in EBNF.