Some problems don't lend themselves to solutions described only in terms of objects. This is one of them.
Where to begin.....
At first glance this project may seem like an impossible task but once it is properly broken down it's not so bad. So let's begin with some terminology:
A Lexical analyzer takes the "raw" words in a statement and breaks the statement up into it's various components (tokens). Then the Parser takes over who's job is to determine whether or not these tokens constitute a properly formed statement according to the rules of the language (rules of syntax). The final stage then is to determine the meaning (semantics) of the raw words given their context. This is done by the Interpreter.
In natural languages (especially English) the rules of syntax are fairly well known but the interpretation of the meanings of the words is often ambiguous and the same statement can be broken up various ways.
For example:
"British Left Waffles on Falkland Islands" can be interpreted at least two ways:
[British Left][Waffles] [on Falkland Islands]
as [subject][verb] [prepositional phrase] --- result 1 from the lexical analyzer; the parser says this is syntactically correct; the interpreter may proceed
-or-
[British] [Left] [Waffles] [on Falkland Islands]
as [subject] [verb] [object] [prepositional phrase] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is now different.
"Squad helps dog bite victim" can also be interpreted two ways:
[Squad][helps] [dog bite victim]
as [subject][verb] [object] --- result 1 from the lexical analyzer; parser says this is syntactically correct; the interpreter may proceed
-or-
[Squad][helps] [dog][bite] [victim]
as [subject][verb, part 1] [object 1] [verb, part 2] [object 2]--- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is again different.
"Enraged Cow Injures Farmer With Ax"can be interpreted two ways too:
[Enraged Cow] [Injures] [Farmer With Ax]
as [subject] [verb] [object] --- result 1 from the lexical analyzer; parser says this is syntactically correct; the interpreter may proceed
-or-
[Enraged Cow] [Injures] [Farmer] [With Ax]
as [subject] [verb] [object] [prepositional phrase] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is different.
Well, you get the picture. What we do when we analyze these statements is to apply known rules of syntax to decompose the statement into it's syntactic components (parts of speech) so we can apply the appropriate meaning (semantics) to each part. Before we even get to the semantics part we must reduce each word (called a token) and classify it (as "subject", "verb", etc.). The resultant syntactic components are called descriptors. This allows us to check that the statement is syntactically correct (to parse it). We can conclude that the statement is syntactically correct when the descriptors match one of the known rules for syntax. In other words, if the statements can be reduced to the descriptors [subject] [verb] [object] then it is syntactically correct and we can proceed to the next step, namely to interpret the meaning. Humans are wonderfully adapted to do all of these jobs and once we "know" a language, we can go through these steps without even being conscious of it. The process is more obvious when we are forced to deal with a language with which we are not that familiar.
Computers can interpret languages too, but in order to be able to do this by computer we must find a way of eliminating ALL ambiguities. In human languages, we can make adjustments to our "lexical analyzer" on the fly to resolve ambiguities as they come up. We are very good at using context for this ("It can't mean this because that makes no sense so it must mean that"). Computers are not. As a result the rules of syntax for a computer language do not contain ambiguities. Each language has a precisely outlined set of syntactical rules and an equally precise set of semantic rules to go along with them. With this done, we can now write a program that can do the steps outlined above automatically.
1. What is an Interpreter? (From The Computer Desktop Encyclopedia Copyright © 1981-2000 The Computer Language Company Inc. All rights reserved. Ver. 13.1, 1st Quarter 2000)
A high-level programming language translator that translates and runs the program at the same time. It translates one program statement into machine language, executes it, then proceeds to the next statement. This differs from regular executable programs that are presented to the computer as binary-coded instructions. Interpreted programs remain in the same source language format the programmer wrote in: as text files, not machine language files.
Interpreted programs run slower than their compiler counterparts. Whereas the compiler translates the entire program before it is run, interpreters translate a line at a time while the program is run. However, it is very convenient to write an interpreted program, since a single line of code can be tested interactively.
Interpreted programs must always be run with the interpreter. For example, in order to run a BASIC or dBASE program, the BASIC or dBASE interpreter must be in the target computer.
If a language can be both interpreted and compiled, a program may be developed with the interpreter for ease of testing and debugging and later compiled for production use.
Interpreters and Compilers
Unlike compiled languages that have been translated into machine language (right), interpreted languages must be translated at runtime. With languages such as dBASE and BASIC (middle), the interpreter translates the original source code. Languages such as Java and Visual Basic (left) use an intermediate language. The source code is compiled into "byte code," and the bytecode is interpreted at runtime.
2. What is parsing (a somewhat more formal explanation)
An algorithm or program to determine the syntactic structure of a sentence or string of symbols in some language. A parser normally takes as input a sequence of tokens output by a lexical analyzer. It may produce some kind of abstract syntax tree as output.
3. BNF "Language" definition:
It's Syntax Rules, or "How to Read the Notation & What it Means"
The Notation......(called BNF, or Backus-Naur-Form)
::= means 'is defined as'
<something> means an entity that is further defined (usually later); this is a token
{<something>} means that the thing enclosed in braces is optional and can be omitted
<X> | <Y> means that either '<X>' or '<Y>' can be used here
all other marks must appear literally
Rules of Syntax:
Each statement has the following form:
<statement> ::= <expression>;
Which means: a <statement> consists of an <expression> followed by a semi-colon
More on "Backus-Naur Form" (the notation used above)
Backus-Naur Form (from FOLDOC (on-line computer dictionary)
<language, grammar> (BNF, originally "Backus Normal Form") A formal metasyntax used to express context-free grammars. Backus Normal Form was renamed Backus-Naur Form at the suggestion of Donald Knuth.
BNF is one of the most commonly used metasyntactic notations for specifying the syntax of programming languages, command sets, and the like. It is widely used for language descriptions but seldom documented anywhere (how do you document a metasyntax?), so that it must usually be learned by osmosis (but see RFC 2234).
This translates into English as: "A postal-address consists of a name-part, followed by a street-address part, followed by a zip-code part. A personal-part consists of either a first name or an initial followed by a dot. A name-part consists of either: a personal-part followed by a last name followed by an optional "jr-part" (Jr., Sr., or dynastic number) and end-of-line, or a personal part followed by a name part (this rule illustrates the use of recursion in BNFs, covering the case of people who use multiple first and middle names and/or initials). A street address consists of an optional apartment specifier, followed by a street number, followed by a street name. A zip-part consists of a town-name, followed by a comma, followed by a state code, followed by a ZIP-code followed by an end-of-line."
Note that many things (such as the format of a personal-part, apartment specifier, or ZIP-code) are left unspecified. These lexical details are presumed to be obvious from context or specified somewhere nearby.
There are many variants and extensions of BNF, possibly containing some or all of the regexp wild cards such as "*" or "+". EBNF is a common one. In fact the example above isn't the pure form invented for the ALGOL 60 report. "[]" was introduced a few years later in IBM's PL/I definition but is now universally recognised. ABNF is another extension.