UNB Faculty of Computer Science, CS4905 Introduction to Compiler Construction
Lab 1 January 15, 2007
Purpose: To become familiar with tools for lexical analysis.
1. Log in to a Linux workstation in ITD415. By virtue of being enrolled in CS4905, you should receive a
Computer Science Linux-lab login ID and password via your UNB E-mail.
2. Create a subdirectory in your Unix directory space to contain the source code for this lab. For purposes
of illustration, I will assume that you call this subdirectory “L1”.
Part 1. Experiments with JavaCC
3. Download the “[Link]” file into your L1 subdirectory from the CS4905 web site
[Link]
(use e.g. Mozilla). File names ending in .jj are intended to contain JavaCC (Java compiler compiler)
input.
4. Type
javacc [Link]
on the command line to compile the JavaCC input into a Java program. The overall process is shown in Figure 1.
[Figure 1 diagram: JavaCC program (e.g. [Link]) -> javacc -> Java source program (e.g. [Link]) -> javac ->
Java executable program (e.g. [Link]); the executable reads source code and produces tokens.]
Figure 1. Using JavaCC to generate executable programs for lexical analysis.
5. Type
javac [Link]
on the command line to compile the Java program into an executable (.class) program. This will
automatically compile any dependent classes.
6. Type
java Simple1
at the command line to execute the Java program. When running, the Simple1 program checks for
matching curly braces. Type a series of matching curly braces e.g.
{{}}
at the command line, followed by a “newline” character (press the Enter key to obtain a “\n” character)
and an “end of file” <EOF> character (press “Ctrl-d” to obtain an <EOF> character).
Try entering a set of unmatched curly braces to see what the lexer program does.
7. Repeat the above steps 3, 4, 5 and 6 for the [Link] program. Test this program with input
containing curly braces and some white space characters. What is the difference between [Link]
and [Link]?
8. Repeat the above steps 3, 4, 5 and 6 for the [Link] program. Test this program with input
containing curly braces and some white space characters. Note how the program counts the nesting level
of the curly braces and prints the nesting level after <EOF> is encountered. Change the message printed
by the Simple3 program to “Curly brace nesting level is”. Recompile [Link] and run it
again to see the changed output.
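For comparison with the generated parser, the kind of check described in steps 6 to 8 can be sketched by hand in plain Java. This is only an illustration under stated assumptions: the class name BraceDepth and the method maxNesting are hypothetical, this is not the code that JavaCC generates, and the sketch accepts any balanced sequence of braces and reports the deepest level reached, which may not be exactly what the .jj files accept or report.

```java
class BraceDepth {
    /**
     * Returns the deepest nesting level reached in a balanced brace string,
     * or -1 if the braces do not match or an unexpected character appears.
     * Spaces, tabs and line breaks are skipped (an assumption for this sketch).
     */
    static int maxNesting(String input) {
        int depth = 0; // braces currently open
        int max = 0;   // deepest level seen so far
        for (char c : input.toCharArray()) {
            if (c == '{') {
                depth++;
                max = Math.max(max, depth);
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    return -1; // closing brace with no matching opener
                }
            } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return -1; // any other character is rejected
            }
        }
        // Unclosed braces at end of input also count as a mismatch.
        return depth == 0 ? max : -1;
    }

    public static void main(String[] args) {
        System.out.println("Curly brace nesting level is " + maxNesting("{{}}"));
    }
}
```

Comparing this hand-written loop with the few lines of grammar in the .jj file gives a feel for what the generator saves you from writing.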
9. Repeat the above steps 3, 4, 5 and 6 for the [Link] program. Test this program with input
containing valid and invalid identifiers according to the regular expressions for the TOKEN <Id> in
[Link].
10. Copy the [Link] program to a file called [Link]. We will now modify the
[Link] program to make a parser object called Lexer1. This requires changing all instances of
IdList to Lexer1 in the .jj program. Add an output statement something like the following:
{ [Link]("I recognize ID " ); }
after an ID token is recognized.
Add a [Link] program that constructs a Lexer1 object from a Main program. Do this by
entering a file called [Link] similar to the following:
import [Link].*;

public class Main {
    public static void main(String[] args) throws Exception {
        try {
            new Lexer1([Link]).Input();
            [Link]("Lexical analysis successful");
        }
        catch (ParseException e) {
            [Link]("Lexer Error : \n"+ [Link]());
        }
    }
}
Prepare a test file [Link] containing the following three lines:
if8
Test
7.29
Now, run your Lexer1 parser against the input file using the following steps:
javacc [Link]
javac [Link]
java Main < [Link]
where the input redirection operator “<” redirects the file [Link] to the standard input object
[Link]. Your lexer should print out that it recognizes two ID tokens, and then print an error
message. Change the input data file so that all lines contain valid identifiers, and run your lexer again.
11. Modify your Lexer1 parser to add the recognition of keyword IF tokens and print a message "I
recognize IF" when an IF token is recognized. The JavaCC specification for this is as follows:
< IF: "if" >
Note that this token will need to be specified as the first alternative in a disjunctive list (i.e. separated by
the | operator) inside the TOKEN specifications. The Input() method also needs to be modified to add
the output statement for IF tokens. Modify your [Link] file to include valid if keywords and
ID tokens, and run your lexer program again.
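The reason the IF token must come first can be seen in a small hand-written sketch. JavaCC prefers the longest match, and when two token definitions match the same number of characters it picks the one listed earlier. The Java below imitates that rule with java.util.regex. It is a sketch under assumptions, not generated code: the class TinyLexer, its tokenize method, and the identifier pattern (a letter followed by letters or digits) are chosen for illustration and may differ from the <Id> definition in your .jj file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyLexer {
    // Rules in priority order; IF is deliberately listed before ID.
    private static final String[] NAMES = { "IF", "ID" };
    private static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-zA-Z][a-zA-Z0-9]*") // assumed identifier pattern
    };

    /** Splits the input into tokens, skipping white space between them. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            int bestLen = 0;
            String bestName = null;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input);
                m.region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = NAMES[i];
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("Lexical error at offset " + pos);
            }
            tokens.add(bestName + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if if8 Test"));
    }
}
```

With this ordering the input "if" becomes an IF token while "if8" is still an ID, because the identifier rule matches more characters there. If the two rules were swapped, "if" would be reported as an ID and the IF rule would never win.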
Part 2. Experiments with lex
12. Download the “lex1.l” file into your L1 subdirectory from the CS4905 web site
[Link]
(use e.g. Mozilla). File names ending in .l are intended to contain lex input.
13. Compile the lex1.l program using the instructions posted at the above site; i.e. type
lex name.l
on the command line, where name.l is replaced with the name of your lex input file. Then invoke the C
compiler using
cc [Link].c -o name -ll
on the command line. By default, lex always produces output to the [Link].c file. The output
executable file is called name, so to run the program, type
./name < [Link]
where [Link] is the name of an input file.
14. Test your lexer by using the following input file:
if(x < 12.56E-4)
y = x + 7;
else if(x < 0.5 && x >= 0.0)
y = x * x + 4;
else
{
y = x / 2.5;
z = a*(b-3) + 4 / 7.3;
}
The lexer should print out all the numbers it finds in this file.
15. Modify the lex1.l program to add recognition of real numbers (in addition to integers) according
to the regular expression in Figure 2.2 on p.20 of the text. The regular expression for a REAL token in
Figure 2.2 is as follows:
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)
You will need to add an optional part something like
([eE][+-]?[0-9]+)?
to recognize the exponent as part of the real number.
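Before editing the .l file, it can help to try the combined expression on a few lexemes. The Java sketch below uses java.util.regex for this; the class name RealPattern and the method isReal are hypothetical. Note that lex writes the literal dot as "." in quotes, whereas the Java pattern escapes it as \\. and the sketch tests whole strings rather than scanning a file.

```java
import java.util.regex.Pattern;

class RealPattern {
    // Mantissa from Figure 2.2 followed by the optional exponent part.
    static final Pattern REAL = Pattern.compile(
        "(?:[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+)(?:[eE][+-]?[0-9]+)?");

    /** True if the whole string is a REAL lexeme under the pattern above. */
    static boolean isReal(String lexeme) {
        return REAL.matcher(lexeme).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "12.56E-4", "0.5", "7.3", "7", "12" };
        for (String s : samples) {
            System.out.println(s + " -> " + isReal(s));
        }
    }
}
```

The integers in the test file ("7", "12") are not matched by this expression, so the existing integer rule is still needed alongside the new REAL rule.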
16. Repeat the above step 13 for the lex2.l program. This lexer recognizes html tokens. Test this
lexer using the HTML source of the CS4905 home page as input; i.e.
[Link]
17. Repeat the above step 13 for the lex3.l program. This lexer program counts the number of lines,
words and characters in a file. Test this lexer using as input the test program given in step 14 above.
Note that the appearance of the circumflex ‘^’ character as the first character in a character class
specification (i.e. characters between square brackets ‘[‘ and ‘]’) changes the meaning to match any
character except those within the brackets.
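The negated character class can be tried outside lex as well. The Java sketch below counts lines, words and characters, treating a word as a maximal run of characters other than space, tab and newline, i.e. the class [^ \t\n]. This is an assumption about how a word is defined; lex3.l may define it differently, and the class name WordCount and the method count are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCount {
    // A word is one or more characters that are NOT space, tab or newline.
    private static final Pattern WORD = Pattern.compile("[^ \\t\\n]+");

    /** Returns { lines, words, characters } for the given text. */
    static int[] count(String text) {
        int lines = 0;
        for (char c : text.toCharArray()) {
            if (c == '\n') {
                lines++; // one line per newline character
            }
        }
        int words = 0;
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words++;
        }
        return new int[] { lines, words, text.length() };
    }

    public static void main(String[] args) {
        int[] r = count("y = x + 7;\nelse\n");
        System.out.println(r[0] + " lines, " + r[1] + " words, " + r[2] + " characters");
    }
}
```

Under this definition "7;" counts as a single word, since the semicolon is not one of the excluded characters.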
18. Repeat the above step 13 for the lex5.l program. This lexer prints only words followed by
punctuation. If the following sentence was the input from standard input:
"I was here", they said.
But were they? I cannot tell.
it will recognize and print the words “here”, “said”, “they”, and “tell”. Test that your lexer program
works correctly with the above example. Note that the forward slash character ‘/’ specifies trailing
context: the regular expression before the slash is matched only if it is immediately followed by text
matching the regular expression after the slash; thus the pattern ‘0/1’ matches “0” in the string “01”.
The characters matched by the pattern following the forward slash are not “consumed” and remain to be
turned into subsequent tokens. Only one forward slash is permitted per pattern.
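A rough analogue of trailing context in Java regular expressions is the lookahead (?=...), which also checks the following text without consuming it. The sketch below applies it to the sentence from step 18. It rests on assumptions: the set of punctuation characters is a guess and may differ from the one in lex5.l, the names TrailingContext and wordsBeforePunctuation are hypothetical, and a Java lookahead is not identical to lex trailing context in every case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingContext {
    // A run of letters, accepted only if a punctuation character follows.
    // The lookahead (?=...) inspects that character but does not consume it.
    private static final Pattern WORD_BEFORE_PUNCT =
        Pattern.compile("[A-Za-z]+(?=[.,;:?!\"])");

    /** Returns every word that is immediately followed by punctuation. */
    static List<String> wordsBeforePunctuation(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_BEFORE_PUNCT.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "\"I was here\", they said.\nBut were they? I cannot tell.";
        System.out.println(wordsBeforePunctuation(text));

        // The 0/1 example: only "0" is matched, so the match ends at index 1
        // and the "1" is still available for the next token.
        Matcher m = Pattern.compile("0(?=1)").matcher("01");
        if (m.find()) {
            System.out.println("matched \"" + m.group() + "\", next token starts at " + m.end());
        }
    }
}
```

For the sample sentence the sketch finds the same four words listed in step 18: here, said, they and tell.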