FirstLast-L and FirstLast-LW are two synthetic tree databases constructed for classification purposes. Each database contains four classes, each class composed of 100 rooted ordered trees. FirstLast-L contains node-labeled trees and the set of possible node labels is {1,2,...,26}. FirstLast-LW contains the same trees found into FirstLast-L with weighted arcs. The peculiarity of these databases is that they contain trees which are neither characterized by the exact values of the degrees/labels of the nodes, nor by the exact values of the weights of the arcs. Rather, they are characterized by specific and non-trivial properties verified by the degrees/labels of the nodes, as well as the weights of the arcs. These properties are presented in Table 1 below. Beside the properties specified in that table, we must also precise that every tree t in these two databases verifies (0 < depth(t) <= 4). The names of the classes in these two databases are related to the degree and the node labels of the first (leftmost) and last (rightmost) children of every non-leaf node. The weight of the arc linking each node y to its father in FirstLast-LW is obtained from a computation involving the label of y and the label of its father, the computation scheme of the weight depends on the considered class. |
Table 1: Specific properties verified by each tree in FirstLast-L and FirstLast-LW. The expressions L(y) and F(y) respectively refer to the label and to the father of a node y.
Class name | Topology in FirstLast-L and FirstLast-LW | Labels in FirstLast-L and FirstLast-LW | Weights in FirstLast-LW |
---|---|---|---|
First | Every non-leaf node has at least 3 children among which only the first child is not a leaf. | The parity of the label of each non-leaf node is identical to the parity of the label of its first child, but different from the parity of the labels of its other children. | The weight of the arc linking each node y to its father is ⌈0.5×(L(F(y))+L(y))⌉ |
Last | Every non-leaf node has at least 3 children among which only the last child is not a leaf. | The parity of the label of each non-leaf node is identical to the parity of the label of its last child, but different from the parity of the labels of its other children. | The weight of the arc linking each node y to its father is ⌈ √(L(F(y))×L(y)) ⌉ |
Both | Every non-leaf node has at least 3 children among which only the first and the last children are not leaves. | The parity of the label of each non-leaf node is identical to the parity of the labels of its first and last children, but different from the parity of the labels of its other children. | The weight of the arc linking each node y to its father is |L(F(y))-L(y)|+1 |
None | Every non-leaf node has at least 3 children among which only the first and the last children are leaves. | The parity of the label of each non-leaf node is different from the parity of the labels of its first and last children, but identical to the parity of the labels of its other children. | The weight of the arc linking each node y to its father is ⌈Max(L(F(y)),L(y))/Min(L(F(y)),L(y))⌉ |
Figures 1 to 4 depict examples of node-labeled and arc-weighted trees verifying the properties presented in Table 1 for FirstLast-LW. The corresponding trees verifying the properties presented in Table 1 for FirstLast-L are obtained by ignoring the arc weights.
Figure 1: A tree verifying the properties of class First in FirstLast-LW
Figure 2: A tree verifying the properties of class Last in FirstLast-LW
Figure 3: A tree verifying the properties of class Both in FirstLast-LW
Figure 4: A tree verifying the properties of class None in FirstLast-LW
FirstLast-L and FirstLast-LW can be downloaded from this page as text files gathered in '.rar' files. Each '.rar' file contains a main directory identified by the database name. This directory contains 4 sub-directories, each being identified by a class name. Each sub-directories contains 100 text files, each text file being associated to one tree t of the corresponding class. This text file is composed of one line which contains the output of the traversal of t by the Depth-First Search (DFS) algorithm. At each step of the traversal of t, DFS displays the following information (separated by one space character) related to the currently visited node y:
During classification experiments using fixed training and testing sets, we recommend any interested visitor to perform a 10 folds cross-validation of these databases using the training and testing sets available in the archives below. These training and testing sets were used in the first related paper [1] below where a Hidden Markov model (HMM) was associated with each tree of each database. For each database, the proposed training and testing sets of each fold were obtained after randomly shuffling the trees of each class.
The same HMMs associated with the trees in FirstLast-L and FirstLast-LW in [1] have later been used for deriving a descriptor vector for each tree. These descriptor vectors are gathered in the following archive: FirstLast_Arff.rar. The descriptor vectors contained in this file can be used for classification or clustering purposed in dedicated tools as the WEKA soft.