Making nuBASIC
Making the interpreter
The experience of writing a BASIC interpreter using modern C++ was truly enjoyable. BASIC itself is a language that evokes nostalgic memories from the early era of personal computing. Back in those days, iconic computers like the Commodore 64 featured an embedded BASIC interpreter, which added to the charm and allure of the computing experience.
Main interpreter components
In developing the nuBASIC interpreter, I followed an approach that primarily involves parsing the BASIC source code and compiling it into an Abstract Syntax Tree (AST). This AST is then used to execute the code. As a result, I created the following key components, which may not always correspond directly to a single class or other C++ element:
Tokenizer: Breaks down the language string into individual tokens.
Expression Parser: Converts mathematical, logical, and string expressions into an internal evaluable tree object.
Expression Evaluator: Evaluates the expression tree object.
Statement Parser: Constructs an executable syntax tree.
Statement Executor: Executes a BASIC statement, which may include one or more BASIC expressions.
Static Program Context: Collects and manages program meta-data required for handling and executing multi-line BASIC constructs.
Run-time Program Context: Manages objects such as variable states, procedure stacks, function return values, etc., generated dynamically during program execution.
Command Line Interface (CLI): Provides a console-oriented interface, serving as the primary user interface for the interpreter. It includes a line-oriented editor for on-the-fly program insertion or modification.
Interpreter: Executes CLI commands, runs programs, handles program data and meta-data, and includes debugging capabilities.
Built-in Function Library: Implements predefined functions used within expressions and statements.
Language Statement Objects: Implements language statements such as control structures (e.g., If-Then-Else, For-Next, Do-Loop-While) and I/O built-in procedures (Print, Input, Open), among others.
Syntax and Run-time Error Handlers: Helper classes used to handle C++ exceptions, handling syntax and run-time language errors.
Built-in Help: An interactive guide offering information about interpreter commands and language statements, accessible through specific CLI commands.
These components collectively form the foundation of the nuBASIC interpreter, enabling efficient parsing, execution, and management of BASIC programs while providing a comprehensive user experience.
A line-oriented language interpreter
The nuBASIC interpreter is designed to cater to the line-oriented nature of the BASIC language. In BASIC, the positioning of hard line breaks in the source code holds significance, distinguishing it from other programming languages.
In the nuBASIC interpreter, each code line in a BASIC program is treated as a self-contained unit. To reflect this, the interpreter itself operates in a line-oriented manner. The program's source text is divided into lines, and these lines are owned by the Interpreter class. The lines are stored in a map structure, where each line is associated with a line number and its corresponding text.
During the parsing process, each code line is transformed into an independent execution unit. The interpreter constructs a static program context, serving as the connecting code between program lines and statements. This context allows the Statement Parser to recognize complex language constructs that may span multiple lines and generates metadata to reference them. It's worth noting that a single line containing a control structure can encompass more than one statement.
By adopting a line-oriented approach, the nuBASIC interpreter ensures that each line in the program is treated as a distinct entity, facilitating the parsing and execution of BASIC programs while preserving the unique characteristics of the language.
Handling the tokens
Before parsing each line, the Tokenizer performs a separate lexical analysis, which involves creating tokens from the source text. Each token is implemented as a class and contains various attributes, including the original source text, token position within the text, length, type (such as 'identifier', 'operator', 'literal string', 'integer', etc.), and additional data.
In designing the token representation, I opted not to use a pure object-oriented programming (OOP) approach, which typically involves building a class hierarchy for token types. Instead, I chose a flat representation where the token type is simply an attribute of the token object, implemented as an "enum-class." This decision was made to facilitate the collection and handling of homogeneous token objects.
The Parser leverages the token type attribute to recognize and validate the syntax of the language. It utilizes this information to construct statement objects and generate the necessary meta-data that the interpreter relies on during program execution.
By utilizing tokens with flat token types, the nuBASIC interpreter streamlines the process of recognizing and validating the language syntax while efficiently building the required statement objects and runtime meta-data.
Token list container
Token list container
In order to simplify the parser and reduce complexity, a Token List container class has been implemented. This class wraps around a standard Deque (double-ended queue) and provides additional functionality to facilitate the handling of token lists.
The Token List class offers convenient features designed to simplify the manipulation of token lists and contribute to a more streamlined parser implementation. These features aim to enhance the ease of working with token lists and alleviate the burden of managing complex data structures within the parser.
Parsing the code
When it comes to parsing the code in nuBASIC, there are multiple parsers in place to handle different aspects:
Expression Parser: This parser focuses on analyzing individual expressions, such as "2+2*3/Sin(PI()/2)", and generates an executable object instance that represents the expression's behavior.
Statement Parser: Responsible for analyzing each line of source code, the Statement Parser generates executable statement objects. During this process, it invokes the Expression Parser whenever it encounters an expression in the source text, enabling it to obtain the executable equivalent representation of that expression.
CLI Parser: As part of the Interpreter, the CLI Parser handles the validation and execution of commands specific to the interpreter environment, such as List, Run, Load, Save, and more. These commands are distinct from language statements but serve essential functions within the interpreter.
One of the primary challenges in parsing nuBASIC arises from handling expressions. For instance, when parsing an expression like "2+4*17", determining the correct syntax tree relies on analyzing the entire expression.
The Expression Parser must account for operator precedence and restructure the previous expression into something like "(2 + (4 * 17))" to form a well-formed syntax tree.
This approach ensures that expressions are parsed correctly, considering operator precedence and producing syntax trees that accurately represent the intended calculations:
'+'
/ \
'2' '*'
/ \
'4' '17'
Expression parsing involves several distinct steps to ensure accurate evaluation:
Tokenizer: The expression, represented as a string, is broken down into tokens using the Tokenizer. These tokens are stored in a token list object instance.
Token List Rearrangement: The token list is rearranged to establish the operator precedence order. Special markers, such as 'begin' and 'end' sub-expression markers, are inserted into the list. The order of precedence is as follows:
Unary identity and negation (+, -)
Exponentiation (^)
Multiplication and floating-point division (*, /)
Integer division (, Div)
Modulus arithmetic (Mod)
Addition and subtraction (+, -)
Comparison operators (=, <>, <, <=, >, >=)
Logical Conjunction (And), Inclusive disjunction (Or), Exclusive disjunction (Xor), and bit-wise operators
Expression Parser: The Expression Parser constructs an evaluable syntax tree. Each node in the tree represents a different element of the expression. These elements include empty expressions, binary expressions, functions, variables, and literal constants. The nodes belong to a class hierarchy derived from the expr_any_t class.
By following these steps, the Expression Parser effectively analyzes expressions, ensuring correct evaluation by considering operator precedence and constructing a syntax tree that accurately represents the expression's structure.
Let's examine the mathematical expression "2 + 4 * 17". The parsing process converts it into an object tree structured as follows:
binary_expression
(+, sum)
/ \
literal \
(integer) \
2 binary_expression
(*, multiplication)
/ \
/ \
literal literal
(integer) (integer)
4 17
In this tree representation, the "+" operator is the root node, with "2" as its left child and the "*" operator as its right child. The "*" operator, in turn, has "4" as its left child and "17" as its right child.
This tree structure captures the hierarchical relationship between the operators and operands in the original expression, allowing for correct evaluation and calculation.
The abstract class expr_any_t serves as the base class for various subclasses and defines the virtual method eval(). Each subclass implements this method according to its specific functionality.
The eval() method has the following prototype:
virtual variant_t eval(rt_prog_ctx_t & ctx) const = 0;
This method takes a reference to the run-time program (execution) context, represented by the rt_prog_ctx_t object. It then evaluates the expression based on the given context and returns a variant_t object.
The variant_t object represents a flexible variant type that can hold different data types. The details of the variant_t object and its usage will be discussed in the following section.
By implementing the eval() method in the derived subclasses, the expression evaluation process is tailored to the specific behavior and requirements of each expression type.
Variant
The Variant class is designed to handle various types of data in the nuBASIC language in a unified and simplified manner. It significantly reduces the complexity of the evaluator.
To achieve this, I opted not to use C++ union because it only supports plain old data types and is not suitable for handling non-trivial complex types. Instead, the Variant class provides a flexible solution that can handle a wide range of BASIC language types seamlessly.
By utilizing the Variant class, the evaluator can manipulate different data types in a uniform way, making the code more concise and easier to maintain. This approach streamlines the handling of data within the interpreter and contributes to a more efficient and organized evaluation process.
Tracing execution of a simple program
Let us consider the following simple BASIC program, containing just a unique line with a unique statement:
10 PRINT 2+4*17
Suppose you have already inserted the program so you have just type “RUN” to execute it.
First nuBASIC_console() function gets the command string “RUN” from standard input, then invokes the function exec_command() which is a helper function that invokes the related interpreter exec_command() method, catching any exceptions.
This method parses a command in order to recognize it and perform the action required.
In this case it calls the rebuild() interpreter method which clears static and run-time context objects (removing both dynamic data and meta-data), then for each source line calls the update_program() method. This method creates a Tokenizer object and calls the compile_line() method of the Statement Parser object, held as attribute of the Interpreter object.
compile_line() method uses the Tokenizer to break down the source line into a language token list like the following (for simplicity no all token fields are reported below):
{ (“PRINT”, identifier), (“ “, blank), (“2”, integer), (“+”, operator), (“4”, integer), (“17”, integer) }
Each line of code is treated as a “block” which is a container of statements. Thus the method parse_block() is first called. This method iterates while the token list is not empty calling for each iteration the method parse_stmt(), which is able to recognize the statement in order to select the specific parse_xxx() method.
In our example, it recognizes the token “PRINT” (which is the first token of the token list), and calls the specific parse_print() method and this builds a stmt_print_t object which holds an expression object instance. parse_print() method calls the template function parse_arg_list() which builds an expression list for the PRINT statement. Each item of the list is an expression object built via the Expression Parser, ready to be evaluate by its eval() method against a run-time program (execution) context.
Finally, parse_print() method returns to the calling parse_block(). Which in turn returns an handle to statement block object by means of a (smart standard) shared pointer to the class object.
The statement handle is stored in a map (prog_line_t) where the key element is just the processing line number.
After building the program, interpreter calls the run() method which creates a program_t object instance passing to it program line and program context objects coming from previous building phase.
The program object executes each code line (represented by block statement object as discussed before) by calling the related run() virtual method. In our example the unique program line (a block statement object) contains the print statement object. Calling its run() method the related print statement run() method is finally invoked.
stmt_print_t run() method evaluates each argument of its argument list. The argument list is just a collection of expression objects which export the eval() method. The eval() method returns a variant object that can be printed out to the standard output.