STRING TOKENIZING (AND FILE HANDLING)


CONTENTS

1. Introduction
2. The string tokenizer
3. Using the string tokenizer to process a sequence of numbers
4. Using more than one delimiter
5. String tokenizing and file handling
5.1 File interleave example
5.2 Output first "N" lines
5.3 File merge example



1. INTRODUCTION

Usually we input values one at a time, following each value with a carriage return. In some cases it is convenient to input values as a sequence, which we can do by entering the sequence as a single string. However, once Java has the string, to do anything useful with it we must be able to isolate the different values within it. We could do this by painstakingly analysing the string character by character (it is, after all, a sequence of characters) and finding the delimiters for each "word" (double quotes are used here because the word in question may in fact be a number or some sequence of special characters). Such words are referred to as tokens. Some example code to achieve this is given in Table 1, together with some output in Table 2. Note that use is made of methods contained in the String class, as illustrated previously.

// STRING PROCESSING
// Frans Coenen, Monday 15 January 1999
// Department of Computer Science, The University of Liverpool, UK

import java.io.*; 
import java.util.*;

class StringProcessing
    {
    // Create BufferedReader class instance

    static InputStreamReader input = new InputStreamReader(System.in);
    static BufferedReader    keyboardInput = new BufferedReader(input);

    /* Main method  */

    public static void main(String[] args) throws IOException
        {
	// Get a string
	
        System.out.println("Input a string ");
	String data = keyboardInput.readLine();

	// Output number of characters in the line
	
	int numberCharacters = data.length();
	System.out.println("Number of characters = " +
				numberCharacters + "\n");
	
	// Output tokens
	
	for (int counter=0; counter < numberCharacters; counter++) {
	    char character = data.charAt(counter);
	    if (character == ' ') System.out.println();	 
	    else System.out.print(character);
	    }	
	System.out.println("\n");    
	}
   }     

Table 1: Processing a string as a character array

$ java StringProcessing
Input a string
Some sample output resulting from running the above code
Number of characters = 56

Some
sample
output
resulting
from
running
the
above
code                  

Table 2: Sample output generated from the string processing code presented in Table 1




2. THE STRING TOKENIZER

A better way of doing this is to use the StringTokenizer class, which is found in the Java package java.util. This contains a number of useful instance methods which can be used to isolate tokens (Figure 1).

Figure 1: Class diagram showing details of the StringTokenizer class

To use these methods we must of course first create an instance of the class StringTokenizer:

StringTokenizer data = new StringTokenizer(string);

where the argument is a string of text. The StringTokenizer class provides two methods which are immediately useful for processing strings:
  1. countTokens, which returns the number of tokens remaining in the string (by default tokens are delimited by white space). Called before any token has been read it tells us how many tokens there are, so we can use this number as a loop parameter with which to process the string.
  2. nextToken, which returns the next token in the string after the current one. When used for the first time the next token is the first token in the string. Thus this method can be used to pass along the string token by token.

In the following piece of code (Table 3) we use these methods to identify and output the tokens in a string provided by the user.

// TOKENIZER EXAMPLE
// Frans Coenen, Monday 15 January 1999
// Department of Computer Science, The University of Liverpool, UK

import java.io.*; 
import java.util.*;

class TokenizerExample
    {
    // Create BufferedReader class instance

    static InputStreamReader input = new InputStreamReader(System.in);
    static BufferedReader    keyboardInput = new BufferedReader(input);

    /* Main method  */

    public static void main(String[] args) throws IOException
        {
	int numberOfTokens=0;
	
	// Get a string
	
        System.out.println("Input a string ");
	StringTokenizer data = new StringTokenizer(keyboardInput.readLine());

	// Output number of tokens in the line
	
	numberOfTokens = data.countTokens();
	System.out.println("Number of tokens = " + numberOfTokens + "\n");
	
	// Output tokens
	
	for (int counter=0; counter < numberOfTokens; counter++) { 
	    System.out.println(data.nextToken());
	    }	
	}
   }    

Table 3: Tokenizing example

Some sample output resulting from running the above code is given in Table 4.

$ java TokenizerExample
Input a string
Some sample output resulting from running the above code
Number of tokens = 9

Some
sample
output
resulting
from
running
the
above
code   

Table 4: Sample output generated from the tokenizing code presented in Table 3
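
The same traversal can be written without counting the tokens first, using the hasMoreTokens method of StringTokenizer. The following is a minimal sketch only; the class name TokenLoopSketch and the helper method tokens are invented for illustration. It also shows that countTokens reports the number of tokens still to be read, so its value falls as nextToken is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenLoopSketch {

    /* Collect the tokens of a line without counting them first */

    static List<String> tokens(String line) {
        StringTokenizer data = new StringTokenizer(line);
        List<String> result = new ArrayList<String>();
        // hasMoreTokens is true while at least one token remains
        while (data.hasMoreTokens()) {
            result.add(data.nextToken());
            }
        return result;
        }

    /* Main method  */

    public static void main(String[] args) {
        StringTokenizer data = new StringTokenizer("Some sample output");
        System.out.println(data.countTokens());   // 3 tokens remain
        data.nextToken();
        System.out.println(data.countTokens());   // 2 tokens remain
        System.out.println(tokens("Some sample output"));
        }
    }
```

A loop controlled by hasMoreTokens is convenient when the number of tokens is not needed for anything else, for example to size an array.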




3. USING THE STRING TOKENIZER TO PROCESS A SEQUENCE OF NUMBERS

The Java code presented in Table 5 processes a sequence of comma-separated integers using the methods found in the StringTokenizer class. By default the delimiter is white space; in this case we have specified the delimiter, a comma, as the second argument of the constructor. Note that, whatever the delimiter is defined as, it is not considered to be a token in its own right. Note also that we use the Integer wrapper class methods (as illustrated previously) to convert the individual tokens from strings to integers. In Table 6 some sample output is presented.

// TOKENIZER EXAMPLE 2
// Frans Coenen, Tuesday 18 January 2000
// Department of Computer Science, The University of Liverpool, UK

import java.io.*; 
import java.util.*;

class TokenizerExample2 {
    // Create BufferedReader class instance

    static InputStreamReader input = new InputStreamReader(System.in);
    static BufferedReader    keyboardInput = new BufferedReader(input);

    /* Main method  */

    public static void main(String[] args) throws IOException
        {
	int numberOfTokens=0;
	int numberArray[];
	int total=0;
	
	// Get a string
	
        System.out.print("Input a sequence of integers separated by " +
        		"commas (`,'): ");
	StringTokenizer data = new
			StringTokenizer(keyboardInput.readLine(),",");

	// Get number of tokens in line and initialise array
	
	numberOfTokens = data.countTokens();
	System.out.println("Number of tokens = " + numberOfTokens + "\n");
	numberArray = new int[numberOfTokens];
	
	// Isolate tokens and maintain total
	
	for (int counter=0; counter < numberOfTokens; counter++) { 
	    numberArray[counter] = new Integer(data.nextToken()).intValue();
	    System.out.println(numberArray[counter]);
	    total = total+numberArray[counter];
	    }
	    
	// Output total
	
	System.out.println("------------\n" + total + " (average = " +
			total/numberOfTokens + ")");   	
	}
   }    

Table 5: Tokenizing example

$ java TokenizerExample2
Input a sequence of integers separated by commas (`,'): 1,2,3,4,5,6,7
Number of tokens = 7

1
2
3
4
5
6
7
------------
28 (average = 4)            

Table 6: Sample output generated from the tokenizing code presented in Table 5
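
Because a delimiter is never returned as a token, two commas in a row do not produce an empty token; the tokenizer simply skips over them. The sketch below illustrates this under stated assumptions: the class name CommaListSketch and the method sum are invented for illustration, Integer.parseInt is used for the conversion, and each token is trimmed so that a space typed after a comma does not cause a NumberFormatException.

```java
import java.util.StringTokenizer;

class CommaListSketch {

    /* Sum a sequence of integers separated by commas */

    static int sum(String line) {
        StringTokenizer data = new StringTokenizer(line, ",");
        int total = 0;
        while (data.hasMoreTokens()) {
            // trim removes any white space left around the number
            total = total + Integer.parseInt(data.nextToken().trim());
            }
        return total;
        }

    /* Main method  */

    public static void main(String[] args) {
        System.out.println(sum("1,2,3,4,5,6,7"));   // 28
        System.out.println(sum("1, 2,,3"));         // 6: the empty field is skipped
        }
    }
```

If empty fields need to be detected rather than skipped, StringTokenizer is not a good fit and the input has to be examined in some other way.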





4. USING MORE THAN ONE DELIMITER

Sometimes we wish to tokenize using several delimiters. We might, for instance, wish to process a text file which contains markup tags enclosed between [% and %], for example:

Welcome to [%name%] page. Today is [%date%], and it is [%time%]. Today's 
picture is [%picture%] Today's quote is [%quote%]

We could define a tokenizer as follows:

StringTokenizer openTagTokenizer = new 
		StringTokenizer(keyboardInput.readLine(),"[%");

This might allow us to identify all the sub-strings separated by [%, and then we could process these sub-strings using a second tokenizer. The two-tokenizer idea is a good one, but the tokenizer as defined above will not quite work as expected. The delimiter string [% is not treated as a single two-character delimiter; each of its characters, [ and %, is a delimiter in its own right. Thus the string:

Welcome to [%name%] page.

would be considered to comprise three sub-strings: "Welcome to ", "name" and "] page." (and not two, "Welcome to " and "name%] page.", as might have been anticipated). Thus to identify our tags we should first tokenize on [ and then test the first character of each resulting sub-string for %; if it is found we have the start of a tag, which can then be processed by the second tokenizer. The code presented in Table 7 can be used to identify tags in text in this manner.
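
The way a multi-character delimiter string is treated can be checked with a short sketch. The class name DelimiterSetSketch and the method split are invented for illustration; each token is wrapped in angle brackets so that leading and trailing spaces are visible.

```java
import java.util.StringTokenizer;

class DelimiterSetSketch {

    /* Return the tokens of text, one per line, each wrapped in < and > */

    static String split(String text, String delimiters) {
        StringTokenizer data = new StringTokenizer(text, delimiters);
        StringBuilder result = new StringBuilder();
        while (data.hasMoreTokens()) {
            result.append('<').append(data.nextToken()).append(">\n");
            }
        return result.toString();
        }

    /* Main method  */

    public static void main(String[] args) {
        // '[' and '%' each act as a delimiter; "[%" is not matched as a pair
        System.out.print(split("Welcome to [%name%] page.", "[%"));
        }
    }
```

Run on the example string this prints three lines: `<Welcome to >`, `<name>` and `<] page.>`.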

// TOKENIZER EXAMPLE 8
// Frans Coenen, Friday 28 June 2002
// Department of Computer Science, The University of Liverpool, UK

import java.io.*; 
import java.util.*;

class TokenizerExample8 {
    // Create BufferedReader class instance

    static InputStreamReader input         = new InputStreamReader(System.in);
    static BufferedReader    keyboardInput = new BufferedReader(input);

    /* Main method  */

    public static void main(String[] args) throws IOException {
	
	// Define "startTag" tokenizer and get a string
	
        System.out.print("Input some text containing markup tags delimited " +
		"by \"[%\" and \"%]\", for example \"The [%DATE%] is\" = ");
	StringTokenizer openTagTokenizer = new 
				StringTokenizer(keyboardInput.readLine(),"[");

	// Get number of tokens in line 
	
	int numberOfTokens = openTagTokenizer.countTokens();
	System.out.println("Number of tokens = " + numberOfTokens);
	
	// Define variable to hold substring 
	
	String subString;
	for (int counter=0; counter < numberOfTokens; counter++) { 
	    subString = openTagTokenizer.nextToken(); 
	    // If substring starts with '%' we have a tag, otherwise ignore
	    if (subString.charAt(0) == '%') findTag(subString);	    
	    }
	}
   
   /* FIND TAG */
   
   private static void findTag(String str) {
   
        // Define string tokenizer with "%" delimiter
	
	StringTokenizer closeTagTokenizer = new StringTokenizer(str,"%");
	
        // Output
	
	System.out.println("TAG = " + closeTagTokenizer.nextToken());
	}
   }   

Table 7: String tokenizing with several delimiters

Some sample output is presented in Table 8. Note that the input string should be all on one line with a "carriage return" at the end.

$ java TokenizerExample8
Input some text containing markup tags delimited by "[%" and "%]", for example 
"The [%DATE%] is" = Welcome to [%name%] page. Today is [%date%], and it is 
[%time%]. Today's picture is [%picture%] Today's quote is [%quote%]
Number of tokens = 6
TAG = name
TAG = date
TAG = time
TAG = picture
TAG = quote

Table 8: Sample output produced by code presented in Table 7





5. STRING TOKENIZING AND FILE HANDLING

We can also use the string tokenizer to process input from a file line by line, as shown in Table 9. Here we read a file called HelloWorld2.java in the same manner as illustrated previously and then use the tokenizer to count the tokens it contains.

// STRING TOKENIZER EXAMPLE TO READ A FILE
// Frans Coenen, Saturday 22 January 1999
// Department of Computer Science, The University of Liverpool, UK

import java.io.*; 
import java.util.*;

class TokenizerExample3 {
    
    /* Main method  */

    public static void main(String[] args) throws IOException {
	FileReader file = new FileReader("HelloWorld2.java");
	BufferedReader fileInput = new BufferedReader(file);
	final int NUMBER_OF_LINES_IN_FILE = 31;
	int numberOfTokens = 0;
	
	// Read file
	
	for(int counter=0;counter < NUMBER_OF_LINES_IN_FILE;counter++) {
	    StringTokenizer dataLine =
	    		new StringTokenizer(fileInput.readLine());
	    numberOfTokens = numberOfTokens + dataLine.countTokens();
	    }
	    
	// Output result and close file
	    
	System.out.println("Number of tokens = " + numberOfTokens);
	fileInput.close();
	}
   }   

Table 9: Using the string tokenizer to process a text file (example 1)

From the code presented above we can see that we use knowledge of the size of the input file (NUMBER_OF_LINES_IN_FILE) to control the "input" loop, in the same way that we did in the previous work on file handling. If we know that the input file contains no blank lines other than a final one, we can avoid the need to know the number of lines in advance: we process the file until a line with no tokens is found and assume that this marks the end of the file. Note that the file must then actually end with such a line; once the real end of the file is reached readLine returns null, and passing null to the StringTokenizer constructor causes a NullPointerException. Some appropriate code to achieve this is given in Table 10.

public static void main(String[] args) throws IOException {
    FileReader file = new FileReader("HelloWorld2.java");
    BufferedReader fileInput = new BufferedReader(file);
    int totalNumOfTokens = 0, numberOfTokens = 0;

    // Read file

    StringTokenizer dataLine = new StringTokenizer(fileInput.readLine());
    numberOfTokens = dataLine.countTokens();
    while (numberOfTokens != 0) {
        totalNumOfTokens = totalNumOfTokens + numberOfTokens;
        dataLine = new StringTokenizer(fileInput.readLine());
        numberOfTokens = dataLine.countTokens();
        }

    // Output result and close file

    System.out.println("Number of tokens = " + totalNumOfTokens);
    fileInput.close();
    }

Table 10: Using the string tokenizer to process a text file (example 2)
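
An alternative that does not rely on a terminating blank line is to test the value returned by readLine, which is null at the end of the input. The following is a minimal sketch of this idea rather than the approach used in the rest of these notes; the class name CountTokensSketch and the use of a StringReader in place of a FileReader are assumptions made so that the example runs without a data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;

class CountTokensSketch {

    /* Count the tokens in everything the reader supplies */

    static int countTokens(BufferedReader fileInput) throws IOException {
        int totalNumOfTokens = 0;
        String line = fileInput.readLine();
        // readLine returns null once the end of the input is reached
        while (line != null) {
            totalNumOfTokens = totalNumOfTokens +
                        new StringTokenizer(line).countTokens();
            line = fileInput.readLine();
            }
        return totalNumOfTokens;
        }

    /* Main method  */

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a FileReader so no file is needed
        BufferedReader fileInput = new BufferedReader(
                        new StringReader("one two\n\nthree four five\n"));
        System.out.println("Number of tokens = " + countTokens(fileInput));
        fileInput.close();
        }
    }
```

With this form of loop a blank line in the middle of the input simply contributes no tokens; the sample text above gives "Number of tokens = 5".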

This form of code is useful if we wish to read a file line by line and process each line in turn. Table 11 shows the contents of a file containing numeric data. If we wish to read this file line by line and identify the individual integers in each line we can use the string tokenizer approach described above. Note that the data file ends with a blank line, which we can use to signal the end of file (EOF). The code is presented in Table 12.

24 53 46 78 92 3 56 8 90 12 56 12 5 78
4 2 61 87 82 97 91 41 88 91 59 92 83 94 12 91 65 87 68 68 82 2 8 38
94 34 80 91 85 18 17 13 69 90 47 66 32 47 9 65 54 59 18 21 97 63 29
43 12 26 6 49 84 89 29 25 71 16 17 77 55 24 19 98 8 13 92 1 96 73 75
27 87 53 52 39 80 81 25 81 13 72 0 73 53 33 85 94 55 98 3 89 12 53 8
51 2 3 48 58 65 67 52 53 93 89 52 89 50            

Table 11: Sample numeric data

// STRING TOKENIZER EXAMPLE --- Reading a sequence of numbers
// Frans Coenen, Saturday 22 January 1999
// Department of Computer Science, The University of Liverpool, UK

import java.io.*;
import java.util.*;

class TokenizerExample7 {

    /* Main method  */

    public static void main(String[] args) throws IOException {
        FileReader file = new FileReader("numeric.data");
        BufferedReader fileInput = new BufferedReader(file);
        int counter, numberOfTokens = 0, number;

        // Read file

        StringTokenizer dataLine =
        		new StringTokenizer(fileInput.readLine());
        numberOfTokens = dataLine.countTokens();
        while (numberOfTokens != 0) {
            for (counter=0; counter < numberOfTokens; counter++) {
	        number = new Integer(dataLine.nextToken()).intValue();
	        number++;    
                System.out.print(number + " ");
		}
	    System.out.println();
            dataLine = new StringTokenizer(fileInput.readLine());      
	    numberOfTokens = dataLine.countTokens();                   
	    }
	}
   }

Table 12: Reading a file using the string tokenizer.

The final output will be as shown in Table 13.

$ java TokenizerExample7
25 54 47 79 93 4 57 9 91 13 57 13 6 79
5 3 62 88 83 98 92 42 89 92 60 93 84 95 13 92 66 88 69 69 83 3 9 39
95 35 81 92 86 19 18 14 70 91 48 67 33 48 10 66 55 60 19 22 98 64 30
44 13 27 7 50 85 90 30 26 72 17 18 78 56 25 20 99 9 14 93 2 97 74 76
28 88 54 53 40 81 82 26 82 14 73 1 74 54 34 86 95 56 99 4 90 13 54 9
52 3 4 49 59 66 68 53 54 94 90 53 90 51      

Table 13: Output from code presented in Table 12.


5.1 File Interleave Example

The code in Table 14 processes two files of equal length (number of lines) and "zips" them together by interleaving their lines into a third output file. The output file is consequently twice the length of either input file.

// FILE INTERLEAVE EXAMPLE
// Frans Coenen, Monday 18 December 2000
// Department of Computer Science, The University of Liverpool, UK

import java.io.*;
import java.util.*;

class FileInterleave {
    // Create FileWriter and PrintWriter class instances


    /* Main method  */

    public static void main(String[] args) throws IOException 
        {
	BufferedReader inputFile1 = new
				BufferedReader(new FileReader("inputFile1"));
	BufferedReader inputFile2 = new
				BufferedReader(new FileReader("inputFile2"));
	PrintWriter outputFile = new
				PrintWriter(new FileWriter("outputFile"));
	String lineFromFile1;
	int numberOfTokens = 0;

        // Read file

	lineFromFile1 = inputFile1.readLine();
        StringTokenizer dataLine = new StringTokenizer(lineFromFile1);
        numberOfTokens = dataLine.countTokens();
        while (numberOfTokens != 0) {
	    // Output line from file 1
	    outputFile.println(lineFromFile1);
	    // Output line from file 2	
	    outputFile.println(inputFile2.readLine());
	    // Read next line from file 1
            lineFromFile1 = inputFile1.readLine();
            dataLine = new StringTokenizer(lineFromFile1);
            numberOfTokens = dataLine.countTokens();
            }

	// Close file
	
	inputFile1.close();
	inputFile2.close();
	outputFile.close();
        }
    }

Table 14: File interleave utility


5.2 Output First "N" Lines

The code in Table 15 outputs the first N lines of a file to another file. Note that the file name and the value of N are supplied, in that order, as command line arguments. Note also that the code includes the checkFile method introduced previously.

// FILE OUTPUT EXAMPLE (OUTPUT FIRST N LINES)
// Frans Coenen
// Tuesday 30 January 2001
// Department of Computer Science, The University of Liverpool, UK

import java.io.*;
import java.util.*;

class FileOutputExample {

    /* Main method  */

    public static void main(String[] args) throws IOException {
    
        // Get file name and number of lines to output (counter)
    
        int counter = Integer.parseInt(args[1]);
        String fileName = new String(args[0]);
    
        // CheckFile
    
        if (!checkFile(fileName)) System.exit(1);
    
        // Prepare input and output files
    
        BufferedReader inputFile = new
        			BufferedReader(new FileReader(fileName));
        PrintWriter outputFile = new
        			PrintWriter(new FileWriter("outputFile"));
        String lineFromFile;
        int numberOfTokens = 0;

        // Read file

        lineFromFile = inputFile.readLine();
        StringTokenizer dataLine = new StringTokenizer(lineFromFile);
        numberOfTokens = dataLine.countTokens();
        for (int index = 0;index < counter;index++) {
            if (numberOfTokens == 0) break;
            // Output 
            outputFile.println(lineFromFile);
            // Read next line from file 
            lineFromFile = inputFile.readLine();
            dataLine = new StringTokenizer(lineFromFile);
            numberOfTokens = dataLine.countTokens();
            }

        // Close file
    
        inputFile.close();
        outputFile.close();
        }

    /* Check File */
    
    private static boolean checkFile(String fileName) {
        File src = new File(fileName);

        if (src.exists()) {
            if (src.canRead()) {
                if (src.isFile()) return(true);
                else System.out.println("ERROR 3: File is a directory");
                }
            else System.out.println("ERROR 2: Access denied");
            }
        else System.out.println("ERROR 1: No such file");

        return(false);
        }
    }

Table 15: Java application class to output first "N" lines of an input file to an output file


5.3 File Merge Example

The code in Table 16 is used to process two files of the form:

1 2 3 4
1 2 3
4 5 6
1 2 3 4 5
3
3
1 2 4 5

File 1

 
1 2 3 4
4 5 6
1 2 3
3
1 2 3 4 5
1 2 4 5
3

File 2

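i.e. of equal length (number of lines), and merge the contents by combining pairs of lines from the files into a third set of lines, which the code in Table 16 writes to the screen. Each line holds integers in ascending order, and a value that appears in both lines of a pair is output only once. Thus given the above two files this will produce: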

1 2 3 4
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5 

File 3

// FILE MERGE EXAMPLE
// Frans Coenen, Tuesday 30 January 2000
// Department of Computer Science, The University of Liverpool, UK

import java.io.*;
import java.util.*;

class FileMergeExample {
    
    /* Main method  */

    public static void main(String[] args) throws IOException {
	String lineFromFile1, lineFromFile2;
	int numberOfTokens1 = 0, numberOfTokens2;

        // Get file names 
    
        String fileName1 = new String(args[0]);
 	String fileName2 = new String(args[1]); 
	  
        // CheckFile
    
        if (!checkFile(fileName1)) System.exit(1);
	if (!checkFile(fileName2)) System.exit(1);
	
        // Create BufferedReader class instances
	
	BufferedReader inputFile1 = new
				BufferedReader(new FileReader(fileName1));
	BufferedReader inputFile2 = new
				BufferedReader(new FileReader(fileName2));
	
        // Read file

	lineFromFile1 = inputFile1.readLine();
	lineFromFile2 = inputFile2.readLine();
        StringTokenizer dataLine1 = new StringTokenizer(lineFromFile1);
	StringTokenizer dataLine2 = new StringTokenizer(lineFromFile2);
        numberOfTokens1 = dataLine1.countTokens();
        numberOfTokens2 = dataLine2.countTokens();
	while (numberOfTokens1 != 0) {
	    mergeLine(dataLine1,dataLine2,numberOfTokens1,numberOfTokens2);
	    // Read next lines
            lineFromFile1 = inputFile1.readLine();
	    lineFromFile2 = inputFile2.readLine();
            dataLine1 = new StringTokenizer(lineFromFile1);
	    dataLine2 = new StringTokenizer(lineFromFile2);
            numberOfTokens1 = dataLine1.countTokens();
	    numberOfTokens2 = dataLine2.countTokens();
            }
	System.out.println("\n");
			
	// Close file
	
	inputFile1.close();
	inputFile2.close();
	}
	
    /* MERGE LINE */
	
    public static void mergeLine(StringTokenizer dl1, StringTokenizer dl2,
    			int length1, int length2) {
        int n1 = new Integer(dl1.nextToken()).intValue();
        int n2 = new Integer(dl2.nextToken()).intValue();
	    
	    
	while(true) {
	    // Both same
	    if (n1 == n2) {
		System.out.print(n1 + " ");
		length1--;
	        length2--;
		if (!checkForNext(dl1,dl2,length1,length2,0,0)) break;
		n1 = new Integer(dl1.nextToken()).intValue();
		n2 = new Integer(dl2.nextToken()).intValue();
		}
	    else {
		// First less than second
		if (n1 < n2) {
		    System.out.print(n1 + " ");
		    length1--;
		    if (!checkForNext(dl1,dl2,length1,length2,2,n2)) break;
		    n1 = new Integer(dl1.nextToken()).intValue();
		    }
		// Second less than first (by default)	
		else {
		    System.out.print(n2 + " ");
		    length2--;
		    if (!checkForNext(dl1,dl2,length1,length2,1,n1)) break;
		    n2 = new Integer(dl2.nextToken()).intValue();
		    }
		}
	    }
        }

    /* CHECK FOR NEXT: Flag settings:
		0 = Nothing
		1 = Output number before rest of dataline1
		2 = Output number before rest of dataline2 */
	
    public static boolean checkForNext(StringTokenizer dl1,
    			StringTokenizer dl2, int length1, int length2,
    					int flag, int oldNumber) {
	int index, number;
     
	// Check first data line
	     
	if (length1 == 0) {
	    // Output second data line if not empty
	    if (length2 != 0) {
		if (flag == 2) {
		    System.out.print(oldNumber + " ");
		    length2--;
		    }
	        outputRest(dl2,length2);
	        }
	    // Both empty
	    else System.out.println();	
	    return(false);
	    }

	// Check second data line
	     
	if (length2 == 0 && flag != 2) {
	    // Output first data line (which we know not to be empty)
	    if (flag == 1) {
		System.out.print(oldNumber + " ");
		length1--;
		}
	    outputRest(dl1,length1);
	    return(false);
	    }

	// Otherwise both OK
	    
	return(true);	    
	}
    
    /* OUTPUT REST */
     
    private static void outputRest(StringTokenizer dataLine, int length) {
        int number;
	
	for(int index=0;index < length;index++) {
	    number = new Integer(dataLine.nextToken()).intValue();
	    System.out.print(number + " ");
	    }
	
	System.out.println();
	}
	     
    /* CHECK FILE */
    
    private static boolean checkFile(String fileName) {
        File src = new File(fileName);

        if (src.exists()) {
            if (src.canRead()) {
                if (src.isFile()) return(true);
                else System.out.println("ERROR 3: File is a directory");
                }
            else System.out.println("ERROR 2: Access denied");
            }
        else System.out.println("ERROR 1: No such file");

        return(false);
        }
    }

Table 16: File merge utility
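
The merging of a single pair of lines can also be expressed more compactly. The sketch below is an alternative formulation under stated assumptions, not the method used in Table 16: it relies on hasMoreTokens instead of explicit length counters, uses null to mark a line that has run out of tokens, and returns the merged line as a string instead of printing it. The class name MergeLineSketch and the helper method next are invented for illustration, and both input lines are assumed to hold integers in ascending order.

```java
import java.util.StringTokenizer;

class MergeLineSketch {

    /* Merge two lines of ascending integers, writing shared values once */

    static String mergeLine(String line1, String line2) {
        StringTokenizer dl1 = new StringTokenizer(line1);
        StringTokenizer dl2 = new StringTokenizer(line2);
        StringBuilder merged = new StringBuilder();
        // null marks a line whose tokens are exhausted
        Integer n1 = next(dl1);
        Integer n2 = next(dl2);

        while (n1 != null || n2 != null) {
            // Second line finished, or first number is the smaller
            if (n2 == null || (n1 != null && n1 < n2)) {
                merged.append(n1).append(' ');
                n1 = next(dl1);
                }
            // First line finished, or second number is the smaller
            else if (n1 == null || n2 < n1) {
                merged.append(n2).append(' ');
                n2 = next(dl2);
                }
            // Both same: output once and advance along both lines
            else {
                merged.append(n1).append(' ');
                n1 = next(dl1);
                n2 = next(dl2);
                }
            }
        return merged.toString().trim();
        }

    /* Return next number in line, or null if there are no more tokens */

    private static Integer next(StringTokenizer dl) {
        return dl.hasMoreTokens() ? Integer.valueOf(dl.nextToken()) : null;
        }

    /* Main method  */

    public static void main(String[] args) {
        System.out.println(mergeLine("1 2 3", "4 5 6"));     // 1 2 3 4 5 6
        System.out.println(mergeLine("1 2 4 5", "3"));       // 1 2 3 4 5
        }
    }
```

Returning the merged line as a string keeps the merging logic separate from the file handling, so the same method could be called from a read loop such as the one in Table 16.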




Created and maintained by Frans Coenen. Last updated 21 May 2003