Class BIRCHCluster

All Implemented Interfaces:
Serializable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler

public class BIRCHCluster extends ClusterGenerator implements TechnicalInformationHandler
Cluster data generator designed for the BIRCH System

Dataset is generated with instances in K clusters.
Instances are 2-d data points.
Each cluster is characterized by the number of data points in it, its radius and its center. The location of the cluster centers is determined by the pattern parameter. Three patterns are currently supported: grid, sine and random.
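To make the grid pattern concrete, the following is a minimal sketch of how K centers could be laid out on a square 2-d grid. It is not taken from the generator's source: the class `GridCentersSketch`, the method `gridCenters`, and the choice of spacing (a typical radius times the distance multiplier) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lays out K cluster centers on a square 2-d grid. */
final class GridCentersSketch {

    /**
     * Returns k centers as {x, y} pairs. The grid has ceil(sqrt(k)) cells per
     * side, and neighbouring centers are `spacing` apart (assumed here to be
     * a typical radius times the distance multiplier).
     */
    static List<double[]> gridCenters(int k, double spacing) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be positive");
        }
        int side = (int) Math.ceil(Math.sqrt(k));
        List<double[]> centers = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            int row = i / side;
            int col = i % side;
            // Offset by one cell so that no center sits on the origin.
            centers.add(new double[] {(col + 1) * spacing, (row + 1) * spacing});
        }
        return centers;
    }

    public static void main(String[] args) {
        double spacing = 0.75 * 4.0; // hypothetical mean radius times -M 4.0
        for (double[] c : gridCenters(4, spacing)) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

With four clusters the sketch fills a 2 x 2 grid; a fifth cluster would force a 3 x 3 grid with some cells left empty.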

For more information refer to:

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. In: ACM SIGMOD International Conference on Management of Data, 103-114, 1996.

BibTeX:

 @inproceedings{Zhang1996,
    author = {Tian Zhang and Raghu Ramakrishnan and Miron Livny},
    booktitle = {ACM SIGMOD International Conference on Management of Data},
    pages = {103-114},
    publisher = {ACM Press},
    title = {BIRCH: An Efficient Data Clustering Method for Very Large Databases},
    year = {1996}
 }
 

Valid options are:

 -h
  Prints this help.
 
 -o <file>
  The name of the output file, otherwise the generated data is
  printed to stdout.
 
 -r <name>
  The name of the relation.
 
 -d
  Whether to print debug information.
 
 -S
  The seed for random function (default 1)
 
 -a <num>
  The number of attributes (default 10).
 
 -c
  Class Flag, if set, the cluster is listed in extra attribute.
 
 -k <num>
  The number of clusters (default 4)
 
 -G
  Set pattern to grid (default is random).
  This flag cannot be used at the same time as flag I.
  The pattern is random, if neither flag G nor flag I is set.
 
 -I
  Set pattern to sine (default is random).
  This flag cannot be used at the same time as flag G.
  The pattern is random, if neither flag G nor flag I is set.
 
 -N <num>..<num>
  The range of number of instances per cluster (default 1..50).
  Lower number must be between 0 and 2500,
  upper number must be between 50 and 2500.
 
 -R <num>..<num>
  The range of radius per cluster (default 0.1..1.4142135623730951).
  Lower number must be between 0 and SQRT(2), 
  upper number must be between SQRT(2) and SQRT(32).
 
 -M <num>
  The distance multiplier (default 4.0).
 
 -C <num>
  The number of cycles (default 4).
 
 -O
  Sets the input order to ORDERED. If the flag is not set, the
  input order is RANDOMIZED. RANDOMIZED is currently not
  implemented, so the input order is always ORDERED.
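The `-N` and `-R` options take a value of the form `<num>..<num>` with the bounds stated above. The sketch below shows one way such a value could be parsed and checked. The class `RangeOptionSketch` and its methods are hypothetical, and treating the documented bounds as inclusive is an assumption; the generator's own parsing may differ.

```java
/** Illustrative only: parses and validates a "<num>..<num>" option value. */
final class RangeOptionSketch {

    /** Parses "lo..hi" into {lo, hi}; throws if the text is malformed. */
    static double[] parseRange(String text) {
        int sep = text.indexOf("..");
        if (sep < 0) {
            throw new IllegalArgumentException("Expected <num>..<num>: " + text);
        }
        double lo = Double.parseDouble(text.substring(0, sep));
        double hi = Double.parseDouble(text.substring(sep + 2));
        return new double[] {lo, hi};
    }

    /** Bounds documented for -N: lower in [0, 2500], upper in [50, 2500]. */
    static boolean validInstanceRange(double[] r) {
        return r[0] >= 0 && r[0] <= 2500
            && r[1] >= 50 && r[1] <= 2500
            && r[0] <= r[1];
    }

    /** Bounds documented for -R: lower in [0, sqrt(2)], upper in [sqrt(2), sqrt(32)]. */
    static boolean validRadiusRange(double[] r) {
        double s2 = Math.sqrt(2.0);
        double s32 = Math.sqrt(32.0);
        return r[0] >= 0 && r[0] <= s2
            && r[1] >= s2 && r[1] <= s32
            && r[0] <= r[1];
    }

    public static void main(String[] args) {
        System.out.println(validInstanceRange(parseRange("1..50")));
        System.out.println(validRadiusRange(parseRange("0.1..1.4142135623730951")));
    }
}
```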
 
Version:
$Revision: 15705 $
Author:
Gabi Schmidberger (gabi@cs.waikato.ac.nz), FracPete (fracpete at waikato dot ac dot nz)
  • Field Details

    • GRID

      public static final int GRID
      Constant set for choice of pattern. (option G)
    • SINE

      public static final int SINE
      Constant set for choice of pattern. (option I)
    • RANDOM

      public static final int RANDOM
      Constant set for choice of pattern. (default)
    • TAGS_PATTERN

      public static final Tag[] TAGS_PATTERN
      the pattern tags
    • ORDERED

      public static final int ORDERED
      Constant set for input order (option O)
    • RANDOMIZED

      public static final int RANDOMIZED
      Constant set for input order (default)
    • TAGS_INPUTORDER

      public static final Tag[] TAGS_INPUTORDER
      the input order tags
  • Constructor Details

    • BIRCHCluster

      public BIRCHCluster()
      initializes the generator with default values
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing this data generator.
      Returns:
      a description of the data generator suitable for displaying in the explorer/experimenter gui
    • getTechnicalInformation

      public TechnicalInformation getTechnicalInformation()
      Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
      Specified by:
      getTechnicalInformation in interface TechnicalInformationHandler
      Returns:
      the technical information about this class
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class ClusterGenerator
      Returns:
      an enumeration of all the available options
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a list of options for this object.

      Valid options are:

       -h
        Prints this help.
       
       -o <file>
        The name of the output file, otherwise the generated data is
        printed to stdout.
       
       -r <name>
        The name of the relation.
       
       -d
        Whether to print debug information.
       
       -S
        The seed for random function (default 1)
       
       -a <num>
        The number of attributes (default 10).
       
       -c
        Class Flag, if set, the cluster is listed in extra attribute.
       
       -k <num>
        The number of clusters (default 4)
       
       -G
        Set pattern to grid (default is random).
        This flag cannot be used at the same time as flag I.
        The pattern is random, if neither flag G nor flag I is set.
       
       -I
        Set pattern to sine (default is random).
        This flag cannot be used at the same time as flag G.
        The pattern is random, if neither flag G nor flag I is set.
       
       -N <num>..<num>
        The range of number of instances per cluster (default 1..50).
        Lower number must be between 0 and 2500,
        upper number must be between 50 and 2500.
       
       -R <num>..<num>
        The range of radius per cluster (default 0.1..1.4142135623730951).
        Lower number must be between 0 and SQRT(2), 
        upper number must be between SQRT(2) and SQRT(32).
       
       -M <num>
        The distance multiplier (default 4.0).
       
       -C <num>
        The number of cycles (default 4).
       
       -O
        Sets the input order to ORDERED. If the flag is not set, the
        input order is RANDOMIZED. RANDOMIZED is currently not
        implemented, so the input order is always ORDERED.
       
      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class ClusterGenerator
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
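The rule for the pattern flags (grid with -G, sine with -I, random when neither is given, and never both together) can be sketched as follows. The class `PatternFlagSketch`, its `Pattern` enum, and the choice of exception type are hypothetical and stand in for the class's own integer constants and error handling; this is not the generator's actual parsing code.

```java
/** Illustrative only: resolves the pattern from the -G and -I flags. */
final class PatternFlagSketch {

    /** Hypothetical stand-in for the GRID, SINE and RANDOM constants. */
    enum Pattern { GRID, SINE, RANDOM }

    static Pattern resolve(String[] options) {
        boolean grid = false;
        boolean sine = false;
        for (String option : options) {
            if (option.equals("-G")) {
                grid = true;
            } else if (option.equals("-I")) {
                sine = true;
            }
        }
        if (grid && sine) {
            throw new IllegalArgumentException("Flags -G and -I cannot be combined");
        }
        if (grid) {
            return Pattern.GRID;
        }
        if (sine) {
            return Pattern.SINE;
        }
        return Pattern.RANDOM; // neither flag set
    }

    public static void main(String[] args) {
        System.out.println(resolve(new String[] {"-k", "4", "-G"}));
        System.out.println(resolve(new String[] {"-k", "4"}));
    }
}
```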
    • getOptions

      public String[] getOptions()
      Gets the current settings of the data generator BIRCHCluster.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class ClusterGenerator
      Returns:
      an array of strings suitable for passing to setOptions
      See Also:
      • DataGenerator.removeBlacklist(String[])
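As an illustration of the array shape that setOptions accepts, the sketch below assembles a few of the documented options, with each flag and its value as separate elements and ranges written as `<num>..<num>`. The class `OptionArraySketch` and the method `toOptions` are hypothetical; the set and order of options that the real getOptions emits is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: builds an option array from a handful of settings. */
final class OptionArraySketch {

    static String[] toOptions(int numClusters, int minInst, int maxInst,
                              double minRadius, double maxRadius, boolean grid) {
        List<String> opts = new ArrayList<>();
        opts.add("-k");
        opts.add(Integer.toString(numClusters));
        opts.add("-N");
        opts.add(minInst + ".." + maxInst);       // e.g. 1..50
        opts.add("-R");
        opts.add(minRadius + ".." + maxRadius);   // e.g. 0.5..2.0
        if (grid) {
            opts.add("-G");                        // flags carry no value
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] opts = toOptions(4, 1, 50, 0.5, 2.0, true);
        System.out.println(String.join(" ", opts));
    }
}
```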
    • setNumClusters

      public void setNumClusters(int numClusters)
      Sets the number of clusters the dataset should have.
      Parameters:
      numClusters - the new number of clusters
    • getNumClusters

      public int getNumClusters()
      Gets the number of clusters the dataset should have.
      Returns:
      the number of clusters the dataset should have
    • numClustersTipText

      public String numClustersTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getMinInstNum

      public int getMinInstNum()
      Gets the lower boundary for instances per cluster.
      Returns:
      the lower boundary for instances per cluster
    • setMinInstNum

      public void setMinInstNum(int newMinInstNum)
      Sets the lower boundary for instances per cluster.
      Parameters:
      newMinInstNum - new lower boundary for instances per cluster
    • minInstNumTipText

      public String minInstNumTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getMaxInstNum

      public int getMaxInstNum()
      Gets the upper boundary for instances per cluster.
      Returns:
      the upper boundary for instances per cluster
    • setMaxInstNum

      public void setMaxInstNum(int newMaxInstNum)
      Sets the upper boundary for instances per cluster.
      Parameters:
      newMaxInstNum - new upper boundary for instances per cluster
    • maxInstNumTipText

      public String maxInstNumTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getMinRadius

      public double getMinRadius()
      Gets the lower boundary for the radiuses of the clusters.
      Returns:
      the lower boundary for the radiuses of the clusters
    • setMinRadius

      public void setMinRadius(double newMinRadius)
      Sets the lower boundary for the radiuses of the clusters.
      Parameters:
      newMinRadius - new lower boundary for the radiuses of the clusters
    • minRadiusTipText

      public String minRadiusTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getMaxRadius

      public double getMaxRadius()
      Gets the upper boundary for the radiuses of the clusters.
      Returns:
      the upper boundary for the radiuses of the clusters
    • setMaxRadius

      public void setMaxRadius(double newMaxRadius)
      Sets the upper boundary for the radiuses of the clusters.
      Parameters:
      newMaxRadius - new upper boundary for the radiuses of the clusters
    • maxRadiusTipText

      public String maxRadiusTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getPattern

      public SelectedTag getPattern()
      Gets the pattern type.
      Returns:
      the current pattern type
    • setPattern

      public void setPattern(SelectedTag value)
      Sets the pattern type.
      Parameters:
      value - new pattern type
    • patternTipText

      public String patternTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getDistMult

      public double getDistMult()
      Gets the distance multiplier.
      Returns:
      the distance multiplier
    • setDistMult

      public void setDistMult(double newDistMult)
      Sets the distance multiplier.
      Parameters:
      newDistMult - new distance multiplier
    • distMultTipText

      public String distMultTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNumCycles

      public int getNumCycles()
      Gets the number of cycles.
      Returns:
      the number of cycles
    • setNumCycles

      public void setNumCycles(int newNumCycles)
      Sets the number of cycles.
      Parameters:
      newNumCycles - new number of cycles
    • numCyclesTipText

      public String numCyclesTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getInputOrder

      public SelectedTag getInputOrder()
      Gets the input order.
      Returns:
      the current input order
    • setInputOrder

      public void setInputOrder(SelectedTag value)
      Sets the input order.
      Parameters:
      value - new input order
    • inputOrderTipText

      public String inputOrderTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getOrderedFlag

      public boolean getOrderedFlag()
      Gets the ordered flag (option O).
      Returns:
      true if ordered flag is set
    • getSingleModeFlag

      public boolean getSingleModeFlag()
      Gets the single mode flag.
      Specified by:
      getSingleModeFlag in class DataGenerator
      Returns:
      true if method generateExample can be used.
    • defineDataFormat

      public Instances defineDataFormat() throws Exception
      Initializes the format for the dataset produced.
      Overrides:
      defineDataFormat in class DataGenerator
      Returns:
      the output data format
      Throws:
      Exception - data format could not be defined
      See Also:
      • DataGenerator.defaultRelationName()
    • generateExample

      public Instance generateExample() throws Exception
      Generate an example of the dataset.
      Specified by:
      generateExample in class DataGenerator
      Returns:
      the instance generated
      Throws:
      Exception - if format not defined or generating
      examples one by one is not possible, because voting is chosen
    • generateExamples

      public Instances generateExamples() throws Exception
      Generate all examples of the dataset.
      Specified by:
      generateExamples in class DataGenerator
      Returns:
      the instances generated
      Throws:
      Exception - if format not defined
    • generateExamples

      public Instances generateExamples(Random random, Instances format) throws Exception
      Generate all examples of the dataset.
      Parameters:
      random - the random number generator to use
      format - the dataset format
      Returns:
      the instances generated
      Throws:
      Exception - if format not defined
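The role of the `random` parameter can be illustrated with a small sketch that draws 2-d points around one cluster center. It assumes Gaussian noise with a standard deviation of radius / sqrt(2) per coordinate; that choice, the class `ClusterPointsSketch`, and the method `samplePoints` are assumptions for illustration and are not confirmed by this page.

```java
import java.util.Random;

/** Illustrative only: samples 2-d points around a single cluster center. */
final class ClusterPointsSketch {

    /** Draws n points around (cx, cy) using the supplied generator. */
    static double[][] samplePoints(Random random, int n,
                                   double cx, double cy, double radius) {
        // Assumption: per-coordinate standard deviation derived from the radius.
        double stdDev = radius / Math.sqrt(2.0);
        double[][] points = new double[n][2];
        for (int i = 0; i < n; i++) {
            points[i][0] = cx + random.nextGaussian() * stdDev;
            points[i][1] = cy + random.nextGaussian() * stdDev;
        }
        return points;
    }

    public static void main(String[] args) {
        // The same seed reproduces the same points.
        double[][] points = samplePoints(new Random(1), 3, 4.0, 4.0, 1.0);
        for (double[] p : points) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

Passing a generator built from a fixed seed, as with option -S, makes the output repeatable across runs.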
    • generateFinished

      public String generateFinished() throws Exception
      Compiles documentation about the data generation after the generation process
      Specified by:
      generateFinished in class DataGenerator
      Returns:
      string with additional information about generated dataset
      Throws:
      Exception - no input structure has been defined
    • generateStart

      public String generateStart()
      Compiles documentation about the data generation before the generation process
      Specified by:
      generateStart in class DataGenerator
      Returns:
      string with additional information
    • getRevision

      public String getRevision()
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Returns:
      the revision
    • main

      public static void main(String[] args)
      Main method for testing this class.
      Parameters:
      args - should contain arguments for the data producer: