The tree databases FirstLast-L and FirstLast-LW

Short Description

FirstLast-L and FirstLast-LW are two synthetic tree databases constructed for classification purposes. Each database contains four classes, each class composed of 100 rooted ordered trees. FirstLast-L contains node-labeled trees and the set of possible node labels is {1,2,...,26}. FirstLast-LW contains the same trees found into FirstLast-L with weighted arcs. The peculiarity of these databases is that they contain trees which are neither characterized by the exact values of the degrees/labels of the nodes, nor by the exact values of the weights of the arcs. Rather, they are characterized by specific and non-trivial properties verified by the degrees/labels of the nodes, as well as the weights of the arcs. These properties are presented in Table 1 below. Beside the properties specified in that table, we must also precise that every tree t in these two databases verifies (0 < depth(t) <= 4). The names of the classes in these two databases are related to the degree and the node labels of the first (leftmost) and last (rightmost) children of every non-leaf node. The weight of the arc linking each node y to its father in FirstLast-LW is obtained from a computation involving the label of y and the label of its father, the computation scheme of the weight depends on the considered class.

Table 1: Specific properties verified by each tree in FirstLast-L and FirstLast-LW. The expressions L(y) and F(y) respectively refer to the label and to the father of a node y.

Class name Topology in FirstLast-L and FirstLast-LW Labels in FirstLast-L and FirstLast-LW Weights in FirstLast-LW
First Every non-leaf node has at least 3 children among which only the first child is not a leaf. The parity of the label of each non-leaf node is identical to the parity of the label of its first child, but different from the parity of the labels of its other children. The weight of the arc linking each node y to its father is ⌈0.5×(L(F(y))+L(y))⌉
Last Every non-leaf node has at least 3 children among which only the last child is not a leaf. The parity of the label of each non-leaf node is identical to the parity of the label of its last child, but different from the parity of the labels of its other children. The weight of the arc linking each node y to its father is ⌈ √(L(F(y))×L(y))
Both Every non-leaf node has at least 3 children among which only the first and the last children are not leaves. The parity of the label of each non-leaf node is identical to the parity of the labels of its first and last children, but different from the parity of the labels of its other children. The weight of the arc linking each node y to its father is |L(F(y))-L(y)|+1
None Every non-leaf node has at least 3 children among which only the first and the last children are leaves. The parity of the label of each non-leaf node is different from the parity of the labels of its first and last children, but identical to the parity of the labels of its other children. The weight of the arc linking each node y to its father is ⌈Max(L(F(y)),L(y))/Min(L(F(y)),L(y))⌉

Figures 1 to 4 depict examples of node-labeled and arc-weighted trees verifying the properties presented in Table 1 for FirstLast-LW. The corresponding trees verifying the properties presented in Table 1 for FirstLast-L are obtained by ignoring the arc weights.

Figure 1: A tree verifying the properties of class First in FirstLast-LW

Figure 2: A tree verifying the properties of class Last in FirstLast-LW

Figure 3: A tree verifying the properties of class Both in FirstLast-LW

Figure 4: A tree verifying the properties of class None in FirstLast-LW

Downloads

FirstLast-L and FirstLast-LW can be downloaded from this page as text files gathered in '.rar' files. Each '.rar' file contains a main directory identified by the database name. This directory contains 4 sub-directories, each being identified by a class name. Each sub-directories contains 100 text files, each text file being associated to one tree t of the corresponding class. This text file is composed of one line which contains the output of the traversal of t by the Depth-First Search (DFS) algorithm. At each step of the traversal of t, DFS displays the following information (separated by one space character) related to the currently visited node y:

  1. For the files contained in the archive FirstLastLW.rar, the degree of y, the label of y and the weight of the arc linking y to its father are displayed in this order by DFS. According to this principle, the tree verifying the properties of class Last presented in Figure 2 is represented by the following line:
    3 17 0 0 2 6 0 10 14 3 13 15 0 4 8 0 6 9 4 5 9 0 12 8 0 18 10 0 14 9 0 23 11


  2. For the files contained in the archive FirstLastL.rar, the degree of y and the label of y are displayed in this order by DFS. According to this principle, the unweighted version of the tree verifying the properties of class Last presented in Figure 2 is represented by the following line:
    3 17 0 2 0 10 3 13 0 4 0 6 4 5 0 12 0 18 0 14 0 23


  3. The classes of FirstLast-L are additionally available in the archive FirstLastL_P.rar where the bracket notation is used. With this notation, the labels {1,2,...,26} are respectively replaced by the characters {A,B,...,Z}. Following this notation, the unweighted version of the tree verifying the properties of class Last presented in Figure 2 is represented by the following line:
    {Q{B}{J}{M{D}{F}{E{L}{R}{N}{W}}}}

During classification experiments using fixed training and testing sets, we recommend any interested visitor to perform a 10 folds cross-validation of these databases using the training and testing sets available in the archives below. These training and testing sets were used in the first related paper [1] below where a Hidden Markov model (HMM) was associated with each tree of each database. For each database, the proposed training and testing sets of each fold were obtained after randomly shuffling the trees of each class.

The same HMMs associated with the trees in FirstLast-L and FirstLast-LW in [1] have later been used for deriving a descriptor vector for each tree. These descriptor vectors are gathered in the following archive: FirstLast_Arff.rar. The descriptor vectors contained in this file can be used for classification or clustering purposed in dedicated tools as the WEKA soft.

Related Papers

  1. S. Iloga "Customizable HMM-based measures to accurately compare tree sets" , Pattern Analysis and Applications, DOI: 10.1007/s10044-021-00971-3, pp. 1-23, Springer, 2021. (Download this paper)

  2. S. Iloga "Accurate comparison of tree sets using HMM-based descriptor vectors" , Conference de Recherche en Informatique (CRI'21), pp. 1-14,Yaounde, Cameroon, December 2021. (Download this paper)

  3. Coming soon ...

Author contacts

  • Name: Pierre Sylvain Iloga Biyik
  • Email: sylvain.iloga@gmail.com , sylvain.iloga@ensea.fr , pierre.iloga-biyik@cyu.fr
  • Professional adress: Higher Teacher's Training College, Department of Computer Sciences,University of Maroua, PO.Box 55, Maroua,Cameroon.