I have been trying to write a simple lexical analyzer in java .
The File Token.java looks as follows :
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public enum Token {
TK_MINUS ("-"),
TK_PLUS ("\\+"),
TK_MUL ("\\*"),
TK_DIV ("/"),
TK_NOT ("~"),
TK_AND ("&"),
TK_OR ("\\|"),
TK_LESS ("<"),
TK_LEG ("<="),
TK_GT (">"),
TK_GEQ (">="),
TK_EQ ("=="),
TK_ASSIGN ("="),
TK_OPEN ("\\("),
TK_CLOSE ("\\)"),
TK_SEMI (";"),
TK_COMMA (","),
TK_KEY_DEFINE ("define"),
TK_KEY_AS ("as"),
TK_KEY_IS ("is"),
TK_KEY_IF ("if"),
TK_KEY_THEN ("then"),
TK_KEY_ELSE ("else"),
TK_KEY_ENDIF ("endif"),
OPEN_BRACKET ("\\{"),
CLOSE_BRACKET ("\\}"),
DIFFERENT ("<>"),
STRING ("\"[^\"]+\""),
INTEGER ("\\d"),
IDENTIFIER ("\\w+");
private final Pattern pattern;
Token(String regex) {
pattern = Pattern.compile("^" + regex);
}
int endOfMatch(String s) {
Matcher m = pattern.matcher(s);
if (m.find()) {
return m.end();
}
return -1;
}
}
The Lexer is as follows : Lexer.java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class Lexer {
private StringBuilder input = new StringBuilder();
private Token token;
private String lexema;
private boolean exausthed = false;
private String errorMessage = "";
private Set<Character> blankChars = new HashSet<Character>();
public Lexer(String filePath) {
try (Stream<String> st = Files.lines(Paths.get(filePath))) {
st.forEach(input::append);
} catch (IOException ex) {
exausthed = true;
errorMessage = "Could not read file: " + filePath;
return;
}
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
moveAhead();
}
public void moveAhead() {
if (exausthed) {
return;
}
if (input.length() == 0) {
exausthed = true;
return;
}
ignoreWhiteSpaces();
if (findNextToken()) {
return;
}
exausthed = true;
if (input.length() > 0) {
errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
}
}
private void ignoreWhiteSpaces() {
int charsToDelete = 0;
while (blankChars.contains(input.charAt(charsToDelete))) {
charsToDelete++;
}
if (charsToDelete > 0) {
input.delete(0, charsToDelete);
}
}
private boolean findNextToken() {
for (Token t : Token.values()) {
int end = t.endOfMatch(input.toString());
if (end != -1) {
token = t;
lexema = input.substring(0, end);
input.delete(0, end);
return true;
}
}
return false;
}
public Token currentToken() {
return token;
}
public String currentLexema() {
return lexema;
}
public boolean isSuccessful() {
return errorMessage.isEmpty();
}
public String errorMessage() {
return errorMessage;
}
public boolean isExausthed() {
return exausthed;
}
}
And can be tested with a Try.java as follows :
public class Try {
public static void main(String[] args) {
Lexer lexer = new Lexer("C:/Users/Input.txt");
System.out.println("Lexical Analysis");
System.out.println("-----------------");
while (!lexer.isExausthed()) {
System.out.printf("%-18s : %s \n",lexer.currentLexema() , lexer.currentToken());
lexer.moveAhead();
}
if (lexer.isSuccessful()) {
System.out.println("Ok! :D");
} else {
System.out.println(lexer.errorMessage());
}
}
}
Say the Input.txt has
define mine
a=1000;
b=23.5;
The output I expect is
define : TK_KEYWORD
mine : IDENTIFIER
a : IDENTIFIER
= : TK_ASSIGN
1000 : INTEGER
; : TK_SEMI
b : IDENTIFIER
= : TK_ASSIGN
23.5 : REAL
But The issue I am facing is : It treats each digit like
1 INTEGER
0 INTEGER
0 INTEGER
0 INTEGER
also it doesn't recognize Real numbers . I get:
Unexpected symbol: '.'
What are the changes needed to get the expected results?
Your pattern to match integer is:
INTEGER ("\\d"),
That matches exactly one digit.
If you want more than one, go for
INTEGER ("\\d+"),
for example.
And, just for completion, the missing other pattern for floating point numbers could look like
REAL ("(\\d+)\\.\\d+")
as the comments pointed out. Or
REAL ("(\\d*)\\.\\d+")
to allow for
.23
too - if that is what you are looking for!