Visualization using CABRO
Over the last decade, a large number of visualization methods developed in different domains have been used in the data exploration and knowledge discovery process. Visualization methods are used for data selection (the pre-processing step) and for viewing mining results (the post-processing step). Some recent visual data mining methods try to involve the user more intensively in the data-mining step through visualization. We present only three visualization techniques that are widely used for data exploration: 2D scatter-plot matrices, parallel coordinates, and bar visualization. We believe that these methods are valuable.
2D scatter-plot matrices
In 2D scatter-plot matrices, the data points are displayed in all possible pairwise combinations of dimensions. For n-dimensional data, this method visualizes n(n-1)/2 scatter plots, one per unordered pair of dimensions.
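As a small illustration of the n(n-1)/2 count, the following sketch enumerates the pairs of dimensions that a scatter-plot matrix would draw. The attribute names are hypothetical and chosen only for the example.

```python
from itertools import combinations


def scatter_plot_pairs(dimensions):
    """Return every unordered pair of dimensions, one per 2D scatter plot."""
    return list(combinations(dimensions, 2))


# Hypothetical attribute names, used only for illustration.
dims = ["age", "income", "height", "weight"]
pairs = scatter_plot_pairs(dims)
n = len(dims)

for x_dim, y_dim in pairs:
    print(f"scatter plot: {x_dim} vs {y_dim}")
print(len(pairs) == n * (n - 1) // 2)
```

With four dimensions the list holds six pairs, which matches 4(4-1)/2.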
Parallel coordinates
The parallel axes represent the data dimensions. A data point corresponds to a poly-line intersecting the vertical axes at the position corresponding to the data value.
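The mapping from a data point to its poly-line can be sketched as follows. This is a minimal sketch under stated assumptions: the axes are evenly spaced, and each value is scaled linearly between the minimum and maximum of its dimension. Neither choice is specified in the text, and the function name `polyline` is hypothetical.

```python
def polyline(point, mins, maxs, axis_spacing=1.0):
    """Map one data point to the vertices of its poly-line.

    Axis i is a vertical line at x = i * axis_spacing. The value on that
    axis is scaled linearly to a height in [0, 1] (assumed scaling).
    """
    vertices = []
    for i, (value, lo, hi) in enumerate(zip(point, mins, maxs)):
        # A constant dimension has no range, so place it at mid-height.
        y = 0.5 if hi == lo else (value - lo) / (hi - lo)
        vertices.append((i * axis_spacing, y))
    return vertices


# Three 3-dimensional points, made up for the example.
data = [(1.0, 10.0, 5.0), (3.0, 20.0, 5.0), (2.0, 15.0, 5.0)]
mins = [min(column) for column in zip(*data)]
maxs = [max(column) for column in zip(*data)]
lines = [polyline(point, mins, maxs) for point in data]
```

Each entry of `lines` is the list of (x, y) vertices at which one poly-line crosses the axes.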
Bar visualization
This method divides the display into n equal-sized bars (regions) for an n-dimensional space, with each bar corresponding to a dimension. Within a bar, the sorted attribute values are mapped to pixels line by line according to their order.
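The line-by-line layout inside a bar can be sketched as below. The sketch only computes which sorted value lands in which pixel position. How a pixel is colored is not covered here, and the names `bar_layout` and `bar_visualization` are hypothetical.

```python
def bar_layout(values, bar_width):
    """Arrange one attribute's values, sorted ascending, row by row."""
    ordered = sorted(values)
    return [ordered[i:i + bar_width] for i in range(0, len(ordered), bar_width)]


def bar_visualization(records, bar_width):
    """Build one bar (a grid of sorted values) per dimension."""
    columns = zip(*records)  # one tuple of values per dimension
    return [bar_layout(column, bar_width) for column in columns]


# Five 2-dimensional records, made up for the example.
records = [(5, 0.3), (1, 0.9), (4, 0.1), (2, 0.7), (3, 0.5)]
bars = bar_visualization(records, bar_width=2)
```

Here `bars[0]` is the bar of the first dimension: its rows hold the values 1 to 5 in sorted order, two pixels per line.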
No single visualization tool is best for high-dimensional data exploration: some visualization methods are best at showing partitions of the data, while others can handle very large datasets. In all cases, we would like to combine different visualization techniques to overcome the limitations of any single one. The same information is displayed in different views with different visualization techniques, each providing useful information to the user.
Furthermore, interactive linking and brushing can also be applied to multiple views: the user can select points in one view, and these points are automatically selected (highlighted) in the other available views. Thus, linked multiple views provide more information than a single view.
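Linking and brushing amounts to sharing one selection among all views. The following is a minimal sketch assuming that records are identified by integer indices. The class `LinkedViews` and the view names are hypothetical.

```python
class LinkedViews:
    """Share one selection (a set of record indices) among several views."""

    def __init__(self, view_names):
        self.selection = set()
        self.highlighted = {name: set() for name in view_names}

    def brush(self, source_view, indices):
        """Select records in one view and propagate the selection to all views."""
        if source_view not in self.highlighted:
            raise KeyError(source_view)
        self.selection = set(indices)
        for name in self.highlighted:
            # Every view highlights the same records, including the source.
            self.highlighted[name] = set(self.selection)


views = LinkedViews(["scatter", "parallel", "bars"])
views.brush("scatter", [2, 5, 7])
```

After the call, the records brushed in the scatter-plot view are also highlighted in the parallel-coordinates and bar views.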
Interactive construction of decision trees
The cooperative method tries to involve the user in the construction of the decision tree model with multiple linked views and brushing. The starting point of the cooperation is the set of multiple views used to visualize the same dataset. The user can choose appropriate visualization methods to gain insight into the data. The interactive graphical methods provide utilities such as brushing, zooming, and linking that help the user select test attributes and split points, or oblique cuts, that yield the purest partitions. The top level, with the full dataset, corresponds to the root of the decision tree. Unlike automatic decision tree algorithms, this approach does not require a heuristic or statistical measure (e.g., information gain): the human eye is an excellent tool for spotting natural patterns. The user can choose test attributes and an arbitrary number of split points (with bar visualization or parallel coordinates) or an oblique cut in 2 dimensions (with 2D scatter-plot matrices). A pure partition can then be assigned to a leaf node whose class prediction is the single color (class) it contains. The remaining partitions have to be examined in a further step. On lower levels, the partitions of the data points inherited from upper levels are visualized in the multiple views, and the data points are partitioned recursively based on human pattern recognition capabilities. The user can be an expert in the data domain and can use this domain knowledge during model construction.
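The recursive procedure described above can be sketched as follows. This is a simplified sketch, not CABRO's implementation: the callback `choose_split` is a hypothetical stand-in for the user's interactive choice, and only a single axis-parallel split point per node is supported, whereas the text also allows several split points and oblique cuts.

```python
def is_pure(records):
    """A partition is pure when all its records share one class label."""
    return len({label for _, label in records}) <= 1


def build_tree(records, choose_split):
    """Recursively partition (values, label) records.

    choose_split(records) returns (attribute_index, threshold) and stands
    in for the user selecting a cut in one of the linked views.
    """
    if is_pure(records):
        label = records[0][1] if records else None
        return {"leaf": label, "size": len(records)}
    attribute, threshold = choose_split(records)
    left = [r for r in records if r[0][attribute] <= threshold]
    right = [r for r in records if r[0][attribute] > threshold]
    if not left or not right:
        # Guard (an assumption): a cut that separates nothing ends the
        # branch with a majority-class leaf instead of recursing forever.
        labels = [label for _, label in records]
        majority = max(sorted(set(labels)), key=labels.count)
        return {"leaf": majority, "size": len(records)}
    return {
        "attribute": attribute,
        "threshold": threshold,
        "left": build_tree(left, choose_split),
        "right": build_tree(right, choose_split),
    }


# Toy data: two attributes and a color as the class label.
records = [((1.0, 5.0), "red"), ((2.0, 6.0), "red"),
           ((3.0, 1.0), "blue"), ((4.0, 7.0), "blue")]
# Scripted "user" who always cuts attribute 0 at 2.5.
tree = build_tree(records, lambda part: (0, 2.5))
```

In this toy run a single cut already yields two pure partitions, so both children of the root become leaves.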
Furthermore, the user can also backtrack during the tree construction phase. Nothing changes from the usual case other than the direct modification of the tree node: the user can delete this node and then choose test attributes and split points (cuts) in another way. A tree view presents the result graphically, which is more intuitive than the columns of numbers or the rule sets output by automatic algorithms. The user can easily extract the inductive rules and prune the tree in the post-processing stage. The user has a better understanding of the obtained model because he or she was involved in the tree construction phase.
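Backtracking and rule extraction can both be expressed as small operations on the tree. The sketch below assumes a tree stored as nested dictionaries with binary threshold tests. This representation, the `pending` marker, and the function names are assumptions made for illustration.

```python
def extract_rules(node, conditions=()):
    """Return one (condition, class) rule per leaf of the tree."""
    if node.get("pending"):
        return []  # node waiting to be split again: no rule yet
    if "leaf" in node:
        return [(" and ".join(conditions) or "true", node["leaf"])]
    a, t = node["attribute"], node["threshold"]
    return (extract_rules(node["left"], conditions + (f"x[{a}] <= {t}",))
            + extract_rules(node["right"], conditions + (f"x[{a}] > {t}",)))


def undo_split(node):
    """Backtrack: drop the split and its subtrees so the node can be re-split."""
    for key in ("attribute", "threshold", "left", "right"):
        node.pop(key, None)
    node["pending"] = True
    return node


# A small hand-built tree, made up for the example.
tree = {
    "attribute": 0, "threshold": 2.5,
    "left": {"leaf": "red"},
    "right": {
        "attribute": 1, "threshold": 4.0,
        "left": {"leaf": "blue"},
        "right": {"leaf": "red"},
    },
}
rules = extract_rules(tree)
undo_split(tree["right"])  # the user deletes the second split
rules_after = extract_rules(tree)
```

Before backtracking the tree yields three rules. After the right-hand split is deleted, only the rule for the left leaf remains until the user splits that node again.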
CABRO is a system for mining decision trees that focuses on visualization and model selection techniques in decision tree learning. Though decision trees are a simple notion, it is not easy to understand and analyze large decision trees generated from huge data sets. For example, the widely used program C4.5 produces a decision tree of nearly 18,500 nodes with 2,624 leaf nodes from the census bureau database given recently to the KDD community, which consists of 199,523 instances described by 40 numeric and symbolic attributes (103 Mbytes). It is extremely difficult for the user to understand and use such a big tree in its text form. In such cases, a graphic visualization of discovered decision trees, with different ways of accessing and viewing trees, is of great support, and it has recently received much attention from KDD researchers and users. The MineSet system of Silicon Graphics provides a 3D visualization of decision trees. The CART system (Salford Systems) provides a tree map that can be used to navigate large decision trees. The interactive visualization system CABRO, together with a newly proposed technique called T2.5D (Tree 2.5 Dimensions), offers an efficient alternative that allows the user to manipulate large trees graphically and interactively in data mining.
In CABRO, the mining process is concerned with model selection, in which the user tries different settings of decision tree induction to obtain the most appropriate decision trees. To help the user understand the effects of the settings on the resulting trees, the tree visualizer can handle multiple views of different trees at any stage in the induction. There are several view modes: zoomed, tiny, tightly-coupled, fish-eyed, and T2.5D; in each mode the user can interactively change the layout of the structure to fit the current interests.
The tree visualizer helps the user understand decision trees by providing different views, each convenient in different situations. The available views are:
• Standard: The tree is drawn in proportion; the size of a node depends on the length of its label, a parent is vertically located at the middle of its children, and siblings are horizontally aligned.
• Tightly-coupled: The window is divided into two panels: one displays the tree in a tiny size, the other displays it in a normal size. The first panel is a map for navigating the tree; the second displays the corresponding area of the tree.
• Fish-eye: This view distorts the magnified image so that nodes around the center of interest are displayed at high magnification, and the rest of the tree is progressively compressed.
• T2.5D: In this view, the Z-order of tree nodes is used to create a 3D effect; nodes and links may overlap. The focused path of the tree is drawn in front and in highlighted colors.
The tree visualizer allows the user to customize the above views through several operations: zoom, collapse/expand, and view node content.
• Zoom: The user can zoom in or zoom out of the drawn tree.
• Collapse/expand: The user can choose to view only some parts of the tree by collapsing and expanding paths.
• View node content: The user can see the content of a node, such as attribute/value, class, and population.
In model selection, tree visualization plays an important role in helping the user understand and interpret the generated decision trees. Without its help, the user cannot decide which decision trees to favor over the others. Moreover, if the mining process is set to run interactively, the system allows the user to take part in every step of the induction: he or she can manually choose which attribute will be used to branch at the node under consideration. The tree visualizer can then display several hypothetical decision trees side by side to help the user decide which ones are worth developing further. We can say that an interactive tree visualizer integrated with the system allows the user to apply domain knowledge by actively taking part in the mining process.
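The collapse/expand operation listed above can be sketched as a flag on each node that hides its subtree when set. This is an illustrative sketch only: the class `TreeNode` and the function `visible_labels` are hypothetical names, not part of CABRO.

```python
class TreeNode:
    """A tree node that can hide or show its subtree."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.collapsed = False

    def toggle(self):
        """Switch between the collapsed and expanded states."""
        self.collapsed = not self.collapsed


def visible_labels(node):
    """List the labels of the nodes that would be drawn, in pre-order."""
    labels = [node.label]
    if not node.collapsed:
        for child in node.children:
            labels.extend(visible_labels(child))
    return labels


# A small tree, made up for the example.
a = TreeNode("a", [TreeNode("a1"), TreeNode("a2")])
root = TreeNode("root", [a, TreeNode("b")])

expanded = visible_labels(root)
a.toggle()  # collapse the path below "a"
collapsed = visible_labels(root)
```

A collapsed node is still drawn itself, but its descendants are skipped until it is expanded again.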
Very large hierarchical structures are still difficult to navigate and view even with tightly-coupled and fish-eye views. To address the problem, we have been developing a special technique called T2.5D (Tree 2.5 Dimensions). 3D browsers can usually display more nodes in a compact area of the screen, but they currently require expensive 3D animation support and the visualized structures are difficult to navigate, while 2D browsers are limited in how many nodes they can display in one view. The T2.5D technique combines the advantages of both 2D and 3D drawing techniques to provide the user, at a low processing cost, with a browser that can display more than 1000 nodes in one view; a large number of them may be partially overlapped, but they are all in full size. In T2.5D, a node can be highlighted or dim. The highlighted nodes are those the user is currently interested in, and they are displayed in 2D so that they can be viewed and navigated with ease. The dim nodes are displayed in 3D, and they give the user an idea of the overall structure of the hierarchy.
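The split into highlighted and dim nodes can be illustrated with a back-to-front drawing order. The text does not specify the actual T2.5D layout, so the sketch below only assumes that the highlighted nodes are those on the path from the root to the node of interest, and that they are drawn last so that they appear in front of the overlapping dim nodes. All names are hypothetical.

```python
def focused_path(parent, node):
    """Return the nodes from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path[::-1]


def draw_order(parent, focus):
    """Order nodes back to front: dim nodes first, highlighted path last."""
    nodes = set(parent) | {p for p in parent.values() if p is not None}
    highlighted = focused_path(parent, focus)
    dim = sorted(nodes - set(highlighted))
    return ([(name, "dim") for name in dim]
            + [(name, "highlighted") for name in highlighted])


# Child -> parent map of a small tree, made up for the example.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}
order = draw_order(parent, "a2")
```

Drawing the list in order paints the dim nodes first and the focused path on top, so overlapping dim nodes never hide a highlighted one.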