Certified Data Mining and Warehousing Professional Visualization using CABRO

Visualization using CABRO

Over the last decade, a large number of visualization methods developed in different domains have been used in the data exploration and knowledge discovery process. Visualization methods are used for data selection (the pre-processing step) and for viewing mining results (the post-processing step). Some recent visual data mining methods try to involve the user more intensively in the data-mining step through visualization. We present only three visualization techniques that are widely used for data exploration: 2D scatter-plot matrices, parallel coordinates, and bar visualization. We believe that these methods are valuable.

2D scatter-plot matrices
In 2D scatter-plot matrices, the data points are displayed in all possible pairwise combinations of dimensions. For n-dimensional data, this method visualizes n(n-1)/2 distinct scatter plots, one for each unordered pair of dimensions.
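The count n(n-1)/2 is simply the number of unordered pairs of dimensions. A minimal sketch of that enumeration, assuming the dimensions are given as a list of attribute names (the names and the function `scatter_plot_pairs` are illustrative, not part of any tool described here):

```python
from itertools import combinations

def scatter_plot_pairs(dimensions):
    """Return every unordered pair of dimensions, one pair per scatter plot."""
    return list(combinations(dimensions, 2))

# Hypothetical attribute names, used only for illustration.
dims = ["age", "income", "hours", "education"]
pairs = scatter_plot_pairs(dims)

n = len(dims)
# The number of plots matches the n(n-1)/2 formula.
assert len(pairs) == n * (n - 1) // 2
```

The number of plots grows quadratically with the number of dimensions, which is one reason this view is usually combined with others for high-dimensional data.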

Parallel coordinates
The parallel axes represent the data dimensions. A data point corresponds to a polyline that intersects each vertical axis at the position corresponding to the data value in that dimension.
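The mapping from a data point to its polyline can be sketched as follows. This is only an illustration under stated assumptions: each axis is rescaled to the range [0, 1] from per-dimension minima and maxima, and axes are spaced evenly; the function name and the `axis_gap` parameter are hypothetical.

```python
def polyline_vertices(point, minima, maxima, axis_gap=1.0):
    """Map one data point to (x, y) vertices, one vertex per parallel axis.

    x is the horizontal position of the axis; y is the value rescaled
    to [0, 1] along that axis (an assumed normalization).
    """
    vertices = []
    for i, (value, lo, hi) in enumerate(zip(point, minima, maxima)):
        # A constant dimension has no spread; place it at mid-axis.
        y = 0.5 if hi == lo else (value - lo) / (hi - lo)
        vertices.append((i * axis_gap, y))
    return vertices
```

Joining consecutive vertices with line segments gives the polyline for that data point.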

Bar visualization
This method divides the display into n equal-sized bars (regions) for an n-dimensional space, with each bar corresponding to a dimension. Within a bar, the sorted attribute values are mapped to pixels line by line according to their order.
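The line-by-line layout inside a bar can be sketched as below. This is a simplified illustration, assuming one value per pixel and a fixed bar width in pixels; the mapping from value to pixel color is omitted, and the function names are hypothetical.

```python
def bar_pixels(values, bar_width):
    """Lay out one attribute's sorted values row by row inside a bar.

    Returns a list of rows; each row holds up to bar_width values,
    one value per pixel, in ascending order.
    """
    if bar_width <= 0:
        raise ValueError("bar_width must be positive")
    ordered = sorted(values)
    return [ordered[i:i + bar_width] for i in range(0, len(ordered), bar_width)]

def bar_visualization(columns, bar_width):
    """Build one bar per dimension; columns maps attribute name to its values."""
    return {name: bar_pixels(vals, bar_width) for name, vals in columns.items()}
```

For example, `bar_pixels([5, 1, 4, 2, 3], 2)` fills the first row with the two smallest values, the second row with the next two, and leaves the largest value alone in the last row.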

No single visualization tool is best for high-dimensional data exploration: some visualization methods are best at showing partitions of data, while others can handle very large datasets. In all cases, we would like to combine different visualization techniques to overcome the limitations of any single one. The same information is displayed in different views with different visualization techniques, each providing useful information to the user.

Furthermore, interactive linking and brushing can also be applied to multiple views: the user can select points in one view, and these points are automatically selected (highlighted) in the other available views. Thus, linked multiple views provide more information than a single one.
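One common way to implement linking and brushing is to let all views share a single selection of row indices. The sketch below assumes that design; the class `LinkedViews`, the view names, and the rectangular brush are illustrative choices, not details given in the text.

```python
class LinkedViews:
    """Share one selection (a set of row indices) across several views."""

    def __init__(self, view_names):
        self.selection = set()
        self.highlighted = {name: set() for name in view_names}

    def brush(self, row_indices):
        """Select rows in any one view and propagate the selection to all views."""
        self.selection = set(row_indices)
        for name in self.highlighted:
            self.highlighted[name] = set(self.selection)

def brush_rectangle(points, x_range, y_range):
    """Return the indices of 2D points that fall inside a rectangular brush."""
    (x0, x1), (y0, y1) = x_range, y_range
    return {i for i, (x, y) in enumerate(points)
            if x0 <= x <= x1 and y0 <= y <= y1}

# Brushing a region in a scatter plot highlights the same rows everywhere.
points = [(1, 1), (2, 5), (3, 2), (9, 9)]
views = LinkedViews(["scatter", "parallel", "bars"])
views.brush(brush_rectangle(points, (0, 3), (0, 3)))
```

Because every view identifies a data point by the same row index, no view needs to know how the others draw that point.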

Interactive construction of decision trees
The cooperative method tries to involve the user in the construction of the decision tree model with multiple linked views and brushing. The starting point of the cooperation is the set of multiple views used to visualize the same dataset. The user can choose appropriate visualization methods to gain insight into the data. The interactive graphical methods provide utilities such as brushing, zooming, and linking that help the user select test attributes and split points or oblique cuts that yield the purest partitions. The top level, with the full dataset, corresponds to the root of the decision tree. Automatic decision tree algorithms rely on a heuristic or statistical measure (e.g., information gain); here no such measure is required, because the human eye is an excellent tool for spotting natural patterns. The user chooses test attributes and an arbitrary number of split points (with bar visualization or parallel coordinates) or an oblique cut in 2 dimensions (with 2D scatter-plot matrices). After that, a pure partition can be assigned to a leaf node holding the class prediction given by its only color. The visualization of the remaining partitions has to be examined in a further step. On lower levels, the partitions of the data points inherited from upper levels are visualized in the multiple views, and the data points are then partitioned recursively based on human pattern recognition capabilities. The user can be an expert in the data domain and can use this domain knowledge during model construction.
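The recursive procedure described above can be sketched in code. This is a hedged illustration, not the actual tool: the callback `choose_split` stands in for the user's visual choice of a test attribute and split points, rows are assumed to be dictionaries with a `"class"` key, and oblique cuts are not covered.

```python
from collections import Counter

def split_by_points(rows, attribute, split_points):
    """Partition rows into the intervals defined by the chosen split points."""
    cuts = sorted(split_points)
    partitions = [[] for _ in range(len(cuts) + 1)]
    for row in rows:
        # Count the cuts strictly below the value to find its interval.
        index = sum(1 for cut in cuts if row[attribute] > cut)
        partitions[index].append(row)
    return partitions

def is_pure(rows, label="class"):
    """A partition is pure when all of its rows share one class."""
    return len({row[label] for row in rows}) <= 1

def grow(rows, choose_split, label="class"):
    """Build a tree recursively; choose_split is a stand-in for the user."""
    if is_pure(rows, label):
        return {"leaf": rows[0][label] if rows else None}
    attribute, split_points = choose_split(rows)
    parts = split_by_points(rows, attribute, split_points)
    if any(len(part) == len(rows) for part in parts):
        # The chosen cut separates nothing; stop with a majority-class leaf.
        return {"leaf": Counter(r[label] for r in rows).most_common(1)[0][0]}
    return {"attribute": attribute,
            "split_points": sorted(split_points),
            "children": [grow(part, choose_split, label) for part in parts]}
```

In an interactive setting, `choose_split` would block until the user has picked an attribute and cut positions in one of the linked views; here any function returning an `(attribute, split_points)` pair will do.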

Furthermore, the user can also backtrack during the tree construction phase. No changes are required from the usual case other than the direct modification of the tree node: the user can delete the node and then choose test attributes and split points (cuts) in another way. A tree view presents the obtained result graphically, which is more intuitive than the columns of numbers or the rule sets output by automatic algorithms. The user can easily extract the inductive rules and prune the tree in the post-processing stage. The user has a better understanding of the obtained model because he or she was involved in the tree construction phase.
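Backtracking and rule extraction can be sketched on a simple tree representation. The representation is an assumption made for illustration: a node is a dictionary that is either a leaf (`"leaf"`), an internal node (`"attribute"`, `"split_points"`, `"children"`), or a partition still awaiting a decision (`"pending"`); values less than or equal to a cut are assumed to go to the left child.

```python
def extract_rules(node, conditions=()):
    """Return one (conditions, predicted_class) rule per leaf."""
    if node.get("pending"):
        return []  # nothing decided yet for this partition
    if "leaf" in node:
        return [(list(conditions), node["leaf"])]
    bounds = [None] + list(node["split_points"]) + [None]
    rules = []
    for i, child in enumerate(node["children"]):
        low, high = bounds[i], bounds[i + 1]
        parts = []
        if low is not None:
            parts.append(f"{node['attribute']} > {low}")
        if high is not None:
            parts.append(f"{node['attribute']} <= {high}")
        rules.extend(extract_rules(child, conditions + tuple(parts)))
    return rules

def backtrack(node):
    """Discard a node's split in place so it can be split again differently."""
    node.clear()
    node["pending"] = True  # marks a partition waiting for a new user choice
```

After `backtrack` is applied to a node, the rules that passed through it disappear until the user chooses a new split for that partition.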

CABRO is a system for mining decision trees that focuses on visualization and model selection techniques in decision tree learning. Though decision trees are a simple notion, it is not easy to understand and analyze large decision trees generated from huge data sets. For example, the widely used program C4.5 produces a decision tree of nearly 18,500 nodes with 2,624 leaf nodes from the census bureau database given recently to the KDD community, which consists of 199,523 instances described by 40 numeric and symbolic attributes (103 Mbytes). It is extremely difficult for the user to understand and use such a big tree in its text form. In such cases, a graphic visualization of discovered decision trees with different ways of accessing and viewing trees is of great support, and it has recently received much attention from KDD researchers and users. The MineSet system of Silicon Graphics provides a 3D visualization of decision trees. The CART system (Salford Systems) provides a tree map that can be used to navigate large decision trees. The interactive visualization system CABRO, associated with a newly proposed technique called T2.5D (which stands for Tree 2.5 Dimensions), offers an alternative, efficient way for the user to manipulate large trees graphically and interactively in data mining.

In CABRO, a mining process is concerned with model selection, in which the user tries different settings of decision tree induction to attain the most appropriate decision trees. To help the user understand the effects of settings on the resulting trees, the tree visualizer is capable of handling multiple views of different trees at any stage in the induction. There are several modes of view: zoomed, tiny, tightly-coupled, fish-eyed, and T2.5D; in each mode the user can interactively change the layout of the structure to fit the current interests.

The tree visualizer helps the user to understand decision trees by providing different views, each convenient to use in different situations. The available views are:

• Standard: The tree is drawn in proportion; the size of a node depends on the length of its label, a parent is vertically located at the middle of its children, and siblings are horizontally aligned.

• Tightly-coupled: The window is divided into two panels; one displays the tree in a tiny size, the other displays it in a normal size. The first panel is a map for navigating the tree, and the second displays the corresponding area of the tree.

• Fish-eye: This view distorts the magnified image so that nodes around the center of interest are displayed at high magnification, and the rest of the tree is progressively compressed.

• T2.5D: In this view, the Z-order of tree nodes is used to create a 3D effect; nodes and links may overlap. The focused path of the tree is drawn in front and in highlighted colors.

The tree visualizer allows the user to customize the above views through several operations: zoom, collapse/expand, and view node content.

• Zoom: The user can zoom in or out of the drawn tree.
• Collapse/expand: The user can choose to view some parts of the tree by collapsing and expanding paths.
• View node content: The user can see the content of a node, such as attribute/value, class, population, etc.

In model selection, tree visualization has an important role in assisting the user to understand and interpret generated decision trees. Without its help, the user cannot decide which decision trees to favor over others. Moreover, if the mining process is set to run interactively, the system allows the user to take part at every step of the induction. He or she can manually choose which attribute will be used to branch at the node under consideration. The tree visualizer can then display several hypothetical decision trees side by side to help the user decide which ones are worth developing further. We can say that an interactive tree visualizer integrated with the system allows the user to use domain knowledge by actively taking part in mining processes.
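The collapse/expand and view-node-content operations described above can be illustrated with a small text-based sketch. It is not CABRO's implementation: the class `TreeNode`, the `[+]` marker for collapsed nodes, and the example labels are all assumptions made for illustration.

```python
class TreeNode:
    """A tree node with a label, optional content, and a collapsed flag."""

    def __init__(self, label, children=None, content=None):
        self.label = label
        self.children = children or []
        # Content such as attribute/value, class, or population.
        self.content = content or {}
        self.collapsed = False

    def toggle(self):
        """Switch between collapsed and expanded."""
        self.collapsed = not self.collapsed

def render(node, depth=0):
    """Return the visible lines of the tree as indented text."""
    marker = " [+]" if node.collapsed and node.children else ""
    lines = ["  " * depth + node.label + marker]
    if not node.collapsed:
        for child in node.children:
            lines.extend(render(child, depth + 1))
    return lines

# Hypothetical tree used only to show the effect of collapsing a path.
root = TreeNode("outlook", [
    TreeNode("humidity", [TreeNode("yes"), TreeNode("no")],
             content={"population": 5}),
    TreeNode("yes"),
])
root.children[0].toggle()  # hide the subtree under "humidity"
```

Collapsing only changes what is drawn; the hidden subtree and each node's `content` remain available when the path is expanded again.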

Very large hierarchical structures are still difficult to navigate and view even with tightly-coupled and fish-eye views. To address the problem, we have been developing a special technique called T2.5D (Tree 2.5 Dimensions). 3D browsers can usually display more nodes in a compact area of the screen, but they currently require expensive 3D animation support and the visualized structures are difficult to navigate, while 2D browsers are limited in how many nodes they can display in one view. The T2.5D technique combines the advantages of both 2D and 3D drawing techniques to provide the user, at a low processing cost, with a browser that can display more than 1000 nodes in one view; a large number of them may be partially overlapped, but they are all in full size. In T2.5D, a node can be highlighted or dim. The highlighted nodes are those the user is currently interested in, and they are displayed in 2D to be viewed and navigated with ease. The dim nodes are displayed in 3D, and they allow the user to get an idea of the overall structure of the hierarchy.
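The text does not give the T2.5D layout algorithm, so the following is only a rough sketch of one aspect of it: deciding which nodes are highlighted and drawing them in front. It assumes the highlighted nodes are those on the path from the root to the node in focus and that overlap is resolved by drawing back to front; the function names and the parent-map representation are hypothetical.

```python
def focused_path(parent, focus):
    """Return the set of nodes on the path from the root to the focused node."""
    path = set()
    node = focus
    while node is not None:
        path.add(node)
        node = parent.get(node)  # the root maps to None
    return path

def draw_order(nodes, parent, focus):
    """Order nodes back to front: dim nodes first, highlighted nodes last."""
    path = focused_path(parent, focus)
    dim = [(n, "dim") for n in nodes if n not in path]
    highlighted = [(n, "highlighted") for n in nodes if n in path]
    return dim + highlighted

# Hypothetical four-node tree: root has children a and b; a has child a1.
parent = {"root": None, "a": "root", "b": "root", "a1": "a"}
order = draw_order(["root", "a", "b", "a1"], parent, focus="a1")
```

Drawing in this order means a highlighted node is never hidden behind a dim one, even when many nodes overlap.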
