THE LUCS-KDD TFPC TEXT MINING GUI




Frans Coenen and Yanbo Wang

Department of Computer Science

The University of Liverpool

27 February 2008

Last updated: 4 March 2008


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
 
3. Running the software.
3.1. File naming convention.
3.2. Document base specification files.
4. GUI Options.
5. The Phrase Identification Algorithm.
6. Conclusions.



1. INTRODUCTION

The LUCS-KDD text mining system is a sophisticated piece of Java code, fronted by a GUI. The software is designed to address two different objectives:

  1. Provide support for academic research into a variety of aspects of text mining.
  2. Produce text mining classifiers for a variety of applications.

Because of these dual objectives the GUI has many different options (most of which are not required if all you want to do is produce a text classifier).

The classification algorithm used is TFPC, which is incorporated into the GUI. For more information on the text preprocessing approach used see Coenen et al. (2007) and Wang et al. (2006).




2. DOWNLOADING THE SOFTWARE


The text mining software comprises 33 source files, provided from this WWW page, one of which is the application class. The class files are as follows:

  1. AprioriTFP_CRgen: Methods to produce classification rules using the APRIORI-TFP algorithm with either best first or K best first matching, CSA (Confidence, Support and Antecedent size) ordering, and T-tree X-checking. Alternative ordering and matching strategies are defined in sub-classes of this class.
  2. AprioriTFPclass: Methods to produce classification rules using a TFPC approach.
  3. AssocRuleMining: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  4. BatchModeParams: Window to input batch mode parameters. Batch mode parameters are used for running a sequence of test runs using a variety of input parameters; this is described in further detail below.
  5. KeyWordMining: Class containing methods to coordinate keyword identification from the training set of documents, recast the training and test sets in terms of the identified keywords and then produce a classifier from the training set, tested on the test set.
  6. PartialSupportTree: Methods to implement the TFP (Total From Partial) ARM algorithm using both the T-tree (Total support tree) and P-tree (Partial support tree) data structures.
  7. PhraseBinTree: Contains methods to generate a binary tree to describe all the phrases represented by a set of documents. Generated by processing the training document base using methods in TrainingSetDocumentBase commencing with the method generatePhraseBinTree. Also contains the methods to recast the input training set in terms of a set of attributes representing phrases and to output the training set.
  8. PhraseBinTreeNode: Class to describe an individual node in the phrase bin tree.
  9. PhraseMining: Class containing methods to coordinate phrase identification from the training set of documents, recast the training and test sets in terms of the identified phrases and then produce a classifier from the training set, tested on the test set.
  10. PhraseMining_DelSN_ContGO: Phrase based text classification. Class containing methods to coordinate phrase identification from the training set of documents, recast the training and test sets in terms of the identified phrases and then produce a classifier from the training set, tested on the test set. Phrases are defined as follows:
    1. Delimiters = stop marks (S) and noise words (N).
    2. Content = Distinguishing words (G) and adjacent ordinary words (O).
    3. Ignore = Ordinary words not adjacent to distinguishing (Significant) words.
  11. PhraseMining_DelSN_ContGW: Phrase based text classification. Class containing methods to coordinate phrase identification from the training set of documents, recast the training and test sets in terms of the identified phrases and then produce a classifier from the training set, tested on the test set. Phrases are defined as follows:
    1. Delimiters = stop marks (S) and noise words (N).
    2. Content = Distinguishing words (G) and adjacent ordinary words represented as wild card words (W).
    3. Ignore = Ordinary words (O) not adjacent to distinguishing (Significant) words.
  12. PhraseMining_DelSO_ContGN: Phrase based text classification. Class containing methods to coordinate phrase identification from the training set of documents, recast the training and test sets in terms of the identified phrases and then produce a classifier from the training set, tested on the test set. Phrases are defined as follows:
    1. Delimiters = stop marks (S) and ordinary words (O).
    2. Content = Distinguishing words (G) and adjacent noise words (N).
    3. Ignore = Noise words not adjacent to distinguishing (Significant) words.
  13. PhraseMining_DelSO_ContGW: Phrase based text classification. Class containing methods to coordinate phrase identification from the training set of documents, recast the training and test sets in terms of the identified phrases and then produce a classifier from the training set, tested on the test set. Phrases are defined as follows:
    1. Delimiters = stop marks (S) and ordinary words (O).
    2. Content = Distinguishing words (G) and wild card words (W).
    3. Ignore = Noise words (N) not adjacent to distinguishing (Significant) words.
  14. PotentialSigWord: Contains structure to store a potential significant word with respect to a particular class. Note a word can be significant with respect to more than one class. Also the final set of significant words will be some subset of the set of potential significant words.
  15. PtreeCanvas: Paint canvas in which to display the P-tree.
  16. PtreeNode: Tree structure to store Ptree nodes. Same as top level structure (see below) but with the addition of a sibling branch.
  17. PtreeNodeTop: Top level Ptree node structure. An array of such structures is created in which to store the top level of the Ptree.
  18. PtreeWindow: Window in which to place the P-tree paint canvas.
  19. RuleNode: Set of methods that allow the creation and manipulation (e.g. ordering, etc.) of a list of ARs.
  20. SetDocumentBase: Parent class for training and test set document base classes.
  21. TestSetDocumentBase: Class containing methods to describe a set of test documents in terms of identified phrases (those contained in the phrase bin tree).
  22. TextMining: Text preprocessing algorithms for use with TFPC.
  23. TextMiningControl: Main text mining control GUI.
  24. TextMiningGUI_App: Text mining application class (to "start the ball rolling").
  25. TextMiningModel: Experimental GUI model for text mining.
  26. Thresholds: Window to input threshold parameters.
  27. TotalSupportTree: Methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
  28. TrainingSetDocumentBase: Class containing methods to describe a set of training documents in terms of references to words in the bin tree and to generate the phrase bin tree. Does not produce the desired final training set in terms of attribute numbers; this is done by methods in the PhraseBinTree class. Similarly, methods to output the training set (in whatever format) are contained in the PhraseBinTree class.
  29. TtreeCanvas: Methods to create a canvas onto which the T-tree may be painted if required.
  30. TtreeNode: Methods concerned with the structure of T-tree nodes. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree. Note this is a separate class to the other T-tree classes, which are arranged in a class hierarchy as illustrated below.
  31. TtreeWindow: Methods to create a GUI window to contain a T-tree canvas (if requested).
  32. WordBinTree: Contains methods to generate a binary tree to store all the words represented by a set of documents. Each node in the tree is an instance of the class WordBinTreeNode. The methods in this class also process the word bin tree to identify significant words.
  33. WordBinTreeNode: Class to describe an individual node in the word bin tree.

The relationship between some of the principal classes is as shown in the hierarchies in Tables 1 and 2 below.

                AssocRuleMining
                        |
        +---------------+--------------+
        |                              |
  TotalSupportTree                TextMining
        |                              |
        |                   +----------+-----------+
        |                   |                      |
PartialSupportTree     PhraseMining          KeyWordMining
        |                   |
        |                   +-- PhraseMining_DelSO_ContGW
        |                   |
  AprioriTFPclass           +-- PhraseMining_DelSN_ContGO
        |                   |
        |                   +-- PhraseMining_DelSN_ContGW
        |                   |
  AprioriTFP_CRgen          +-- PhraseMining_DelSO_ContGN

Table 1: Main class hierarchy

                   SetDocumentBase
                          |
           +--------------+--------------+
           |                             |
TrainingSetDocumentBase         TestSetDocumentBase

Table 2: Set document base class hierarchy


There is also a "tar ball" textMining.tgz that can be downloaded; it includes all of the above source and application class files. It can be unpacked using tar -zxf textMining.tgz.


2.1. Compiling

The software has been implemented in Java and should therefore be highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Javadoc. First create a directory Documentation in which to place the resulting HTML pages and then run javadoc:

mkdir Documentation
javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Documentation directory.




3. RUNNING THE SOFTWARE


Once compiled, the software can be invoked in the normal manner using the Java interpreter:

java TextMiningGUI_App

The resulting GUI window is shown in Figure 1. If you are planning to process a very large data set it is a good idea to allocate some extra memory. For example:

java -Xms600m -Xmx600m TextMiningGUI_App

Figure 1: Text mining GUI Window

The input to the software is a collection of training text files, pre-labeled with an appropriate class name. Table 3 gives an example of such a file taken from the 20 Newsgroup data set (Lang 1995). Further examples taken from the 20 Newsgroup set can be found at:

http://www.csc.liv.ac.uk/~frans/KDD/Software/DataSets/newsGroupsTestData.html .

The thing to note about the example file is the @Class rec.motorcycles meta tag; this must appear on the first line and tells the text mining system which class the text file belongs to.

@Class rec.motorcycles

paint jobs in the uk

can anyone recommend a good place for reasonably priced
bike paint jobs, preferably but not essentially in the
london area.

thanks
lisa rowlands
--
alex technologies ltd cp house
97-107 uxbridge road

Table 3: Example text file with @Class meta tag
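
The following fragment is a minimal sketch (not part of the distribution; the class and method names are hypothetical) of how the @Class meta tag might be read from the first line of a document:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Minimal sketch (not the LUCS-KDD code): read the @Class meta tag
// from the first line of a training document.
public class ClassTagReader {

    public static String readClassTag(String fileName) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String firstLine = reader.readLine();
            if (firstLine != null && firstLine.startsWith("@Class ")) {
                return firstLine.substring(7).trim(); // e.g. "rec.motorcycles"
            }
            throw new IOException("No @Class meta tag on first line of " + fileName);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readClassTag("newsgroups1.txt"));
    }
}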

Once processed, the system will have identified a set (A) of key words or phrases. Consequently the document base will now be represented in terms of a set of records, one record per document, such that each record R is some subset of A. We say that the data set has M columns (attributes) and N rows (records). A small example data set might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning).
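
The following is a sketch (a hypothetical helper, not code from the distribution) of how such a data set of space-separated attribute numbers, one record per line, might be read into an array of arrays:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: read a data set of space-separated attribute numbers,
// one record (document) per line, into an int[][] array.
public class RecordReader {

    public static int[][] readRecords(String fileName) throws IOException {
        List<int[]> records = new ArrayList<int[]>();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue;        // skip blank lines
                String[] tokens = line.trim().split("\\s+");
                int[] record = new int[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    record[i] = Integer.parseInt(tokens[i]); // attributes start at 1
                }
                records.add(record);
            }
        }
        return records.toArray(new int[records.size()][]);
    }

    public static void main(String[] args) throws IOException {
        int[][] data = readRecords(args[0]);
        System.out.println(data.length + " records read");
    }
}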


3.1. File naming convention

A naming convention is used for the data sets. The name of each text file comprises: (i) a file stem, (ii) a sequential number (N) and (iii) a file end. The naming convention we used for the newsgroups is as follows:

newsgroupsN.txt

where newsgroups is the file stem and .txt is the file end, giving file names such as newsgroups1.txt, newsgroups2.txt and so on. The system assumes that all text files making up a document base are stored in a single directory.


3.2. Document base specification files

To load the text files the system needs to know: (i) the file stem, (ii) the file end, (iii) the number of files (documents) in the document base, and (iv) the directory in which the document base text files are stored. This can be supplied in the form of a document base specification (DB spec.) file. An example (for part of the 20 Newsgroups set) is given in Table 4. Note that the list of classes must all be on one line.

ClassList: comp.windows.x rec.motorcycles talk.religion.misc
FileStem: newsgroups
FileEnd: .txt
NumRecords: 3000

Table 4: Example DB specification file for 20 Newsgroups sample.
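
The following sketch (hypothetical code, assuming the "Key: value" layout shown in Table 4 and that document numbering starts at 1) parses such a spec file and lists the document file names it implies:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: parse a DB specification file of "Key: value" lines and
// list the document file names it implies (fileStem + N + fileEnd).
public class DbSpecReader {

    public static void main(String[] args) throws IOException {
        String fileStem = null, fileEnd = null;
        String[] classList = null;
        int numRecords = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon < 0) continue;
                String key = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                if (key.equals("ClassList")) classList = value.split("\\s+");
                else if (key.equals("FileStem")) fileStem = value;
                else if (key.equals("FileEnd")) fileEnd = value;
                else if (key.equals("NumRecords")) numRecords = Integer.parseInt(value);
            }
        }
        System.out.println(classList.length + " classes, " + numRecords + " documents:");
        for (int n = 1; n <= numRecords; n++) {
            System.out.println(fileStem + n + fileEnd); // e.g. newsgroups1.txt
        }
    }
}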




4. GUI OPTIONS


The interface comprises eight menus:

  1. File: Eight items.
  2. Thresholds: Two items.
  3. Sig. Word Strats.: (Significant Words Strategies) Four items (three with sub-menus). These can be combined in eight different ways.
  4. Algorithm: Five items. There are four phrase identification approaches, plus a keywords only approach.
  5. Evaluation: Three items. Used to obtain an accuracy figure for the generated classifier.
  6. Class. Strat. (Classification Strategy): Two items. The generated classifier comprises a list of Classification Rules (CRs). To classify a new example the list is processed from top to bottom to find a rule that fits the example (ending with a default rule).
  7. Output: Six items, each with substantial sub-menus. Mostly aimed at analysis of text mining operations for research purposes.
  8. Batch: Two items. Only used for advanced mode of operation. Batch mode produces a sequence of classifier generations using a range of input parameter threshold values.

Details for each menu are given in the following tables.

File Menu

About: Displays some instructions in the GUI text area.
Text Base Dir.: Allows the user to input the directory where the document base is stored.
Text Base Spec.: Allows the user to input the name (path) of the desired DB specification file.
First N recs: Option to specify that only the first N documents are to be processed (use if it is not desired to process all documents in a given set).
Start: Starts the classification (will only work if the appropriate inputs have been made).
Start Batch Mode: Runs in batch mode; this is a specialised mode of operation that should only be used where it is desired (for experimentation) to run a sequence of classification applications on the same data set but with different parameters.
Start Set of 8 Mode: Specialised mode of operation that undertakes a run of eight classification applications using all eight different phrase identification techniques incorporated into the text miner.
Exit: Shuts down the GUI.

Thresholds Menu

T-hold Input: Brings up a separate window to allow input of thresholds: (i) support, (ii) confidence, (iii) significance, (iv) upper noise and (v) lower noise.
T-hold Output: Outputs the current threshold values.

Sig. Word Strats. Menu

Set Max # Sig Wds.: Allows the user to specify the number N of significant words to be identified. The value is application dependent; no harm is done by selecting too many, other than a loss of computational efficiency. The value is usually in the hundreds.
Sig. Contrib. Gen.: Two ways of calculating the contribution of a given word when determining its significance: (i) "Word Frequency", the number of occurrences of the word in the document set; and (ii) "Word Support", the number of documents in which the word appears (count of 1 per document). A sketch contrasting the two measures is given after this table.
Pot. SW list Gen.: Two ways of selecting significant words from the list of potential significant words identified according to their contribution count: (i) "Unique Sig. Words", those that are significant in only one class; and (ii) "All Significant Words".
Sig. Word ID Strat.: Two ways of pruning the list of significant words according to the input N: (i) "First N (distributed)", the first N distributed equally across the classes; and (ii) "First N", the top N words in the list (not distributed).
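
The difference between the two contribution measures can be sketched as follows (hypothetical helper methods, not the distribution's code):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Contribution {

    // "Word Frequency": count every occurrence of each word.
    public static Map<String, Integer> wordFrequency(String[][] documents) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String[] doc : documents)
            for (String word : doc)
                counts.merge(word, 1, Integer::sum);
        return counts;
    }

    // "Word Support": count each word at most once per document.
    public static Map<String, Integer> wordSupport(String[][] documents) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String[] doc : documents) {
            Set<String> seen = new HashSet<String>(); // words seen in this document
            for (String word : doc)
                if (seen.add(word)) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] docs = { {"bike", "bike", "paint"}, {"bike", "london"} };
        System.out.println(wordFrequency(docs)); // bike=3, paint=1, london=1
        System.out.println(wordSupport(docs));   // bike=2, paint=1, london=1
    }
}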

Algorithm Menu

DelSN_contGO: Generate key phrases comprising at least one significant word plus ordinary words, using stop marks and noise words as the delimiters.
DelSN_contGW: Generate key phrases comprising at least one significant word plus "wild card" words, using stop marks and noise words as the delimiters.
DelSO_contGN: Generate key phrases comprising at least one significant word plus noise words, using stop marks and ordinary words as the delimiters.
DelSO_contGW: Generate key phrases comprising at least one significant word plus "wild card" words, using stop marks and ordinary words as the delimiters.
Keywords: Do not generate phrases; instead use the identified significant words as keywords.

Evaluation Menu

90:10: Evaluate the generated classifier using a 90:10 training/test set split.
50:50: Evaluate the generated classifier using a 50:50 training/test set split.
TCV: Evaluate the generated classifier using Ten Cross Validation (TCV).

Class. Strat. Menu

CSA: Order the classification rules according to highest confidence value; if two rules have the same confidence, then according to highest support value; if two rules have the same confidence and support, then according to the largest number of antecedent items (attributes). Then apply the first rule in the ordered list that fits a new example. (A sketch of this ordering is given after this table.)
CSA best K: Use CSA ordering but obtain the results of the first K rules that fit and select the majority class (K=5 in this case).
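
The CSA ordering can be sketched with a comparator as follows (the Rule class here is hypothetical and much simpler than the distribution's RuleNode):

import java.util.Arrays;
import java.util.Comparator;

// Sketch of CSA (Confidence, Support, Antecedent size) rule ordering.
public class CsaOrdering {

    static class Rule {
        double confidence;
        double support;
        int[] antecedent; // attribute numbers
        Rule(double c, double s, int[] a) { confidence = c; support = s; antecedent = a; }
    }

    // Highest confidence first, then highest support, then largest antecedent.
    static final Comparator<Rule> CSA = Comparator
            .comparingDouble((Rule r) -> r.confidence)
            .thenComparingDouble(r -> r.support)
            .thenComparingInt(r -> r.antecedent.length)
            .reversed();

    public static void main(String[] args) {
        Rule[] rules = {
            new Rule(0.80, 0.10, new int[]{1, 2}),
            new Rule(0.90, 0.05, new int[]{3}),
            new Rule(0.80, 0.10, new int[]{1, 2, 6}),
        };
        Arrays.sort(rules, CSA);
        for (Rule r : rules) // prints 0.9/0.05/1, then 0.8/0.1/3, then 0.8/0.1/2
            System.out.println(r.confidence + " " + r.support + " " + r.antecedent.length);
    }
}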

Output Menu

Training set: Eleven options: (i) "Training set doc IDs", a list of training set document numbers; (ii) "Training set doc sizes", a list of the number of words in each training set document; (iii) "Training set raw", the training set documents in full; (iv) "Training set raw 1st 10", the first ten training set documents in full; (v) "Training set marked up", the training set documents marked up according to whether each word is a significant, ordinary, noise or stop word; (vi) "Training set marked up 1st 10", the first ten training set documents marked up in the same way; (vii) "Train. set Phrases/K'words", the training set documents in terms of identified key phrases/words; (viii) "Train. set Phrases/K'words 1st 10", the first ten training set documents in terms of identified key phrases/words; (ix) "Training set attribute #", the list of training set attributes (identified by unique numbers); (x) "Training set Docs. per class", the number of training set documents per class; (xi) "Training Set Stats", statistics describing the training set.

Test set: Six options: (i) "Test set doc IDs", a list of test set document numbers; (ii) "Test set doc sizes", a list of the number of words in each test set document; (iii) "Test set raw", the test set documents in full; (iv) "Test set Phrases/Keywords", the test set documents in terms of identified key phrases/words; (v) "Test set attribute #", the list of test set attributes (identified by unique numbers); (vi) "Test Set Stats", statistics describing the test set.

Word bin tree: Seven options including a sub-sub-menu. The first six options are: (i) "Word Bin Tree", the entire list of words represented in the training set (actually stored in a binary tree); (ii) "Lower Noise Words", the lower noise words (words with a count below the lower noise threshold) represented in the training set; (iii) "Upper Noise Words", the upper noise words (words with a count above the upper noise threshold) represented in the training set; (iv) "Ordinary Words", the list of ordinary words represented in the training set (non-noise words that are not considered to be significant); (v) "Word Bin Tree Stats", statistics for the generated word binary tree; (vi) "Words with Count 1", words that appear only once in the entire document set.

The sub-sub-menu has five options, all pertaining to significant words: (i) "Significant Words", the list of significant words represented in the training set; (ii) "# pot sig words per class", the number of potential significant words per class (not the same as the actual number of significant words); (iii) "# sig words per class", the number of actual significant words per class; (iv) "Potential Sig. Words", the list of potential significant words represented in the training set; (v) "Top 10 Sig. Wds. per Class", the first ten significant words per class.

Phrase bin tree: Four options: (i) "Phrase Bin Tree", the entire list of phrases represented in the training set (actually stored in a binary tree); (ii) "Phrase Bin Tree Stats", statistics for the generated phrase binary tree; (iii) "Phrase list", the list of phrases; (iv) "Phrase list 1st 100", the first 100 phrases in the phrase list.

TFPC: Five options pertaining to the TFPC classifier software incorporated into the text miner: (i) "T-tree Stats", statistics for the T-tree (size, number of updates, etc.); the T-tree is a storage structure used during the classifier generation process; (ii) "P & T tree Storage", the storage requirement for the P- and T-trees (the P-tree is a data structure that holds a partially processed version of the input data); (iii) "Large Itemsets", the list of frequent item sets (some authors call these large item sets); (iv) "Ttree", the T-tree itself (in serialised form); (v) "Ptree", the P-tree itself (in serialised form).

Classifier: Four options: (i) "Classifier Stats.", some statistics pertaining to the classifier (number of rules, etc.); (ii) "Classifier", the classifier (as a set of classification rules); (iii) "Rules fired", a list of the classification rules actually used; (iv) "Classification", the results of the classifier being applied to the test set.

Batch Menu

Batch Input: Brings up a separate window to allow input of batch parameters: (i) support, (ii) confidence, (iii) significance, (iv) upper noise and (v) lower noise. In each case the user can specify the start value, the end value and the increment used when stepping from the start value to the end value (a negative increment when stepping backwards).
Batch Output: Outputs the current batch mode threshold parameters.



5. THE PHRASE IDENTIFICATION ALGORITHM


Each document is processed to identify the words in it. Words are continuous sequences of alphabetic characters delimited by non-alphabetic characters, e.g. punctuation marks, white space and numbers. Some non-alphabetic characters ('!', ',', '.', ':', ';' and '?'), referred to as stop marks, play a role in the identification of phrases and are indicated in the processed document by a "null". All other non-alphabetic characters are ignored. As a result some single words in the training set may be identified as two words; for example Tom's will become the word Tom and the "word" s.
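
A minimal sketch of this tokenisation (hypothetical code, assuming words are lower-cased) might look as follows:

import java.util.ArrayList;
import java.util.List;

// Sketch of the tokenisation described above: continuous alphabetic
// sequences become words, the six stop marks become null entries and
// all other non-alphabetic characters are ignored.
public class Tokeniser {

    private static final String STOP_MARKS = "!,.:;?";

    public static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch)) {
                word.append(Character.toLowerCase(ch));
            } else {
                if (word.length() > 0) { tokens.add(word.toString()); word.setLength(0); }
                if (STOP_MARKS.indexOf(ch) >= 0) tokens.add(null); // stop mark
            }
        }
        if (word.length() > 0) tokens.add(word.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("Tom's bike, preferably in London!"));
        // -> [tom, s, bike, null, preferably, in, london, null]
    }
}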

Words are stored in a bin tree.

Four types of word are identified in the document base.

  1. Noise words: Words whose support is above/below a user defined upper/lower noise threshold.
  2. Significant words: Words that serve to distinguish between classes.
  3. Ordinary words: Non-significant words that do not serve to distinguish between classes.
  4. Stop marks: Not actual words, but marked in the processed document by a "null" to assist in phrase identification.

A number of different schemes for identifying phrases can be defined:

  1. Phrases delimited by stop marks and noise words, and made up of sequences of one or more significant words and ordinary words. Sequences of ordinary words delimited by stop marks and noise words that do not include at least one significant word are ignored. (A sketch of this scheme is given after this list.)
  2. Phrases delimited by stop marks and ordinary words, and made up of sequences of one or more significant words and noise words. Sequences of noise words delimited by stop marks and ordinary words that do not include at least one significant word are ignored.
  3. As 1 but replacing ordinary words in phrases by wild card characters.
  4. As 2 but replacing noise words in phrases by wild card characters.
  5. Combination of 1 and 2.
  6. Combination of 3 and 4.
  7. Phrases made up of sequences of significant words only.
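
The following is a sketch of scheme 1 (corresponding to the DelSN_ContGO algorithm), assuming each word has already been categorised and that stop marks are represented as null tokens; words absent from the category map are treated as ordinary words. This is illustrative code, not the distribution's implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of phrase identification scheme 1 (DelSN_ContGO): phrases are
// delimited by stop marks (null tokens) and noise words, and are kept
// only if they contain at least one significant word.
public class PhraseExtractor {

    enum Category { SIGNIFICANT, ORDINARY, NOISE }

    public static List<List<String>> extract(List<String> tokens,
                                             Map<String, Category> categories) {
        List<List<String>> phrases = new ArrayList<List<String>>();
        List<String> current = new ArrayList<String>();
        boolean hasSignificant = false;
        for (String token : tokens) {
            Category cat = (token == null) ? null : categories.get(token);
            if (token == null || cat == Category.NOISE) {       // delimiter
                if (hasSignificant) phrases.add(new ArrayList<String>(current));
                current.clear();
                hasSignificant = false;
            } else {                                            // phrase content
                current.add(token);
                if (cat == Category.SIGNIFICANT) hasSignificant = true;
            }
        }
        if (hasSignificant) phrases.add(current);
        return phrases;
    }
}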

To determine whether a word is significant or not the process is as follows:

  1. Calculate the support for the word per class, in terms of the number of documents in the class containing the word with respect to the total number of documents in the class.
  2. Calculate the average support (A) across the classes.
  3. Calculate the significance threshold from A and the significance parameter G.
  4. Support per class values that do not exceed the threshold value indicate words that do not serve to differentiate between classes (note that given an appropriate value of G it is possible for a word to be significant with respect to more than one class).
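
A sketch of this test follows. Note that the exact threshold formula used by the software is not reproduced here; the sketch assumes threshold = G x A, where G is the significance parameter:

// Sketch of the significance test described above (assumed formula:
// threshold = G * A, with A the average per-class support).
public class Significance {

    public static boolean[] significantIn(double[] supportPerClass, double g) {
        double a = 0.0;
        for (double s : supportPerClass) a += s;
        a /= supportPerClass.length;                     // average support A
        double threshold = g * a;                        // assumed formula
        boolean[] result = new boolean[supportPerClass.length];
        for (int i = 0; i < supportPerClass.length; i++)
            result[i] = supportPerClass[i] > threshold;  // significant w.r.t. class i
        return result;
    }

    public static void main(String[] args) {
        // Word appears in 60% of class-0 documents, 5% and 10% elsewhere.
        boolean[] sig = significantIn(new double[]{0.60, 0.05, 0.10}, 1.5);
        // A = 0.25, threshold = 0.375: significant w.r.t. class 0 only.
        for (boolean b : sig) System.out.println(b);
    }
}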

The algorithm uses two bin trees, one for words and one for phrases, and two arrays of arrays for the training and test sets respectively. The algorithm works as follows:

PHASE 1: READING THE TRAINING SET

  1. Define: (i) a training document base array (size equal to the number of training set documents) of document arrays of references to bin tree nodes, and (ii) empty word and phrase bin trees.
  2. For each document:
    1. Read the document to determine its size, and dimension the appropriate document array in the training document base array of arrays.
    2. Read the document again and, for each word in the document:
      • if the word is a stop mark, include it in the document array as a null reference;
      • else either (i) create a new word bin tree node (support 1) and add it to the bin tree if the word is not already in the tree, or (ii) if it is in the tree and this is the first occurrence of the word in the current document, increment its support; then include a reference to the word bin tree node in the document array (a simplified sketch of this step is given after this phase description).
  3. Identify words in the word bin tree that are not significant (below/above the lower/upper noise thresholds) and words that do not serve to distinguish between classes.
  4. Process the training document base and place the identified phrases into the phrase bin tree.
  5. Throw away the word bin tree and create a data array of attribute (phrase) numbers ready for CARM (Classification Association Rule Mining).
  6. The phrase bin tree will now contain the phrases and the documents in which they appear. Generate the training data array of arrays of attribute numbers (each attribute represents a single phrase) from the phrase bin tree.
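
The simplified word bin tree sketch referred to in step 2.2 above (hypothetical names, much reduced from the distribution's WordBinTree and WordBinTreeNode classes):

// Simplified sketch of the word bin tree used in Phase 1, step 2.2:
// a binary search tree keyed on the word, whose support count is
// incremented at most once per document.
public class WordBinTreeSketch {

    String word;
    int support = 1;
    int lastDocId;                       // last document counted for this word
    WordBinTreeSketch left, right;
    static WordBinTreeSketch root;

    WordBinTreeSketch(String word, int docId) { this.word = word; this.lastDocId = docId; }

    // Insert the word, or increment its support on its first occurrence in
    // docId; returns the node, i.e. the reference stored in the document array.
    static WordBinTreeSketch add(String word, int docId) {
        if (root == null) return root = new WordBinTreeSketch(word, docId);
        WordBinTreeSketch node = root;
        while (true) {
            int cmp = word.compareTo(node.word);
            if (cmp == 0) {
                if (node.lastDocId != docId) { node.support++; node.lastDocId = docId; }
                return node;
            }
            WordBinTreeSketch next = (cmp < 0) ? node.left : node.right;
            if (next == null) {
                next = new WordBinTreeSketch(word, docId);
                if (cmp < 0) node.left = next; else node.right = next;
                return next;
            }
            node = next;
        }
    }
}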

PHASE 2: READING THE TEST SET

Starts with method findPhrasesInDB in class PhraseMining.

  1. Define a test document base array (size equal to the number of test set documents) of document arrays of strings.
  2. For each document:
    1. Read the document to determine its size, and dimension the appropriate document array in the test document base array of arrays.
    2. Read the document again and place each word in the document array.
  3. With reference to the phrase bin tree and the test document base, create a test data array.
  4. Throw away the test document base and the phrase bin tree.
  5. The test data is now ready for data mining.

PHASE 3: CLASSIFICATION

  1. Apply the TFPC algorithm.



6. CONCLUSIONS


The TFPC text mining software described here has been used successfully by the LUCS-KDD research group for the purpose of conducting research into text mining issues. The software is available free of charge; however, the authors would appreciate appropriate acknowledgement. The following reference format for referring to the TFPC implementation available here is suggested:

  1. Coenen, F. and Wang, Y.J. (2008), The LUCS-KDD TFPC Text Miner, http://www.csc.liv.ac.uk/~frans/KDD/Software/TextMininghDemo/textMining.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the authors.




REFERENCES


Coenen, F.P., Leng, P., Sanderson, R. and Wang, Y.J. (2007).
Statistical Identification of Key Phrases for Text Classification. Proc. MLDM'2007, LNAI 4571, Springer, pp. 838-853.
Lang, K. (1995).
NewsWeeder: Learning to filter netnews. Proceedings of the 12th International Conference on Machine Learning, pp. 331-339.
Wang, Y.J., Coenen, F.P., Leng, P. and Sanderson, R. (2006).
Text Classification using Language Independent Pre-processing. In Bramer, M., Coenen, F.P. and Tuson, A. (Eds.), Research and Development in Intelligent Systems XXIII, Springer, London, pp. 393-397.



Created and maintained by Frans Coenen. Last updated 4 March 2008