Class Canopy

All Implemented Interfaces:
Serializable, Cloneable, Clusterer, NumberOfClustersRequestable, UpdateableClusterer, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler

Cluster data using the capopy clustering algorithm, which requires just one pass over the data. Can run in eitherbatch or incremental mode. Results are generally not as good when running incrementally as the min/max for each numeric attribute is not known in advance. Has a heuristic (based on attribute std. deviations), that can be used in batch mode, for setting the T2 distance. The T2 distance determines how many canopies (clusters) are formed. When the user specifies a specific number (N) of clusters to generate, the algorithm will return the top N canopies (as determined by T2 density) when N < number of canopies (this applies to both batch and incremental learning); when N > number of canopies, the difference is made up by selecting training instances randomly (this can only be done when batch training). For more information see:

A. McCallum, K. Nigam, L.H. Ungar: Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms, 169-178, 2000.

BibTeX:

 @inproceedings{McCallum2000,
    author = {A. McCallum and K. Nigam and L.H. Ungar},
    booktitle = {Proceedings of the sixth ACM SIGKDD internation conference on knowledge discovery and data mining ACM-SIAM symposium on Discrete algorithms},
    pages = {169-178},
    title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching},
    year = {2000}
 }
 

Valid options are:

 -N <num>
  Number of clusters.
  (default 2).
 -max-candidates <num>
  Maximum number of candidate canopies to retain in memory
  at any one time. T2 distance plus, data characteristics,
  will determine how many candidate canopies are formed before
  periodic and final pruning are performed, which might result
  in exceess memory consumption. This setting avoids large numbers
  of candidate canopies consuming memory. (default = 100)
 -periodic-pruning <num>
  How often to prune low density canopies. 
  (default = every 10,000 training instances)
 -min-density
  Minimum canopy density, below which a canopy will be pruned
  during periodic pruning. (default = 2 instances)
 -t2
  The T2 distance to use. Values < 0 indicate that
  a heuristic based on attribute std. deviation should be used to set this.
  Note that this heuristic can only be used when batch training
  (default = -1.0)
 -t1
  The T1 distance to use. A value < 0 is taken as a
  positive multiplier for T2. (default = -1.5)
 -M
  Don't replace missing values with mean/mode when running in batch mode.
 
 -S <num>
  Random number seed.
  (default 1)
 -output-debug-info
  If set, clusterer is run in debug mode and
  may output additional info to the console
 -do-not-check-capabilities
  If set, clusterer capabilities are not checked before clusterer is built
  (use with caution).
Version:
$Revision: 11012 $
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com)
See Also:
  • Field Details

  • Constructor Details

    • Canopy

      public Canopy()
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing this clusterer.
      Returns:
      a description of the evaluator suitable for displaying in the explorer/experimenter gui
    • getTechnicalInformation

      public TechnicalInformation getTechnicalInformation()
      Description copied from interface: TechnicalInformationHandler
      Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
      Specified by:
      getTechnicalInformation in interface TechnicalInformationHandler
      Returns:
      the technical information about this class
    • getCapabilities

      public Capabilities getCapabilities()
      Returns default capabilities of the clusterer.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Specified by:
      getCapabilities in interface Clusterer
      Overrides:
      getCapabilities in class AbstractClusterer
      Returns:
      the capabilities of this clusterer
      See Also:
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class RandomizableClusterer
      Returns:
      an enumeration of all the available options.
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Valid options are:

       -N <num>
        Number of clusters.
        (default 2).
       -max-candidates <num>
        Maximum number of candidate canopies to retain in memory
        at any one time. T2 distance plus, data characteristics,
        will determine how many candidate canopies are formed before
        periodic and final pruning are performed, which might result
        in exceess memory consumption. This setting avoids large numbers
        of candidate canopies consuming memory. (default = 100)
       -periodic-pruning <num>
        How often to prune low density canopies. 
        (default = every 10,000 training instances)
       -min-density
        Minimum canopy density, below which a canopy will be pruned
        during periodic pruning. (default = 2 instances)
       -t2
        The T2 distance to use. Values < 0 indicate that
        a heuristic based on attribute std. deviation should be used to set this.
        Note that this heuristic can only be used when batch training
        (default = -1.0)
       -t1
        The T1 distance to use. A value < 0 is taken as a
        positive multiplier for T2. (default = -1.5)
       -M
        Don't replace missing values with mean/mode when running in batch mode.
       
       -S <num>
        Random number seed.
        (default 1)
       -output-debug-info
        If set, clusterer is run in debug mode and
        may output additional info to the console
       -do-not-check-capabilities
        If set, clusterer capabilities are not checked before clusterer is built
        (use with caution).
      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class RandomizableClusterer
      Parameters:
      options - the list of options as an array of strings throws Exception if an option is not supported
      Throws:
      Exception - if an option is not supported
    • getOptions

      public String[] getOptions()
      Gets the current settings of Canopy.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class RandomizableClusterer
      Returns:
      an array of strings suitable for passing to setOptions()
    • nonEmptyCanopySetIntersection

      public static boolean nonEmptyCanopySetIntersection(long[] first, long[] second) throws Exception
      Tests if two sets of canopies have a non-empty intersection
      Parameters:
      first - the first canopy set
      second - the second canopy set
      Returns:
      true if the intersection is non-empty
      Throws:
      Exception - if a problem occurs
    • assignCanopies

      public long[] assignCanopies(Instance inst) throws Exception
      Uses T1 distance to assign canopies to the supplied instance. If the instance does not fall within T1 distance of any canopies then the instance has the closest canopy assigned to it.
      Parameters:
      inst - the instance to find covering canopies for
      Returns:
      a set of canopies that contain this instance according to T1 distance
      Throws:
      Exception - if a problem occurs
    • updateClusterer

      public void updateClusterer(Instance newInstance) throws Exception
      Description copied from interface: UpdateableClusterer
      Adds an instance to the clusterer.
      Specified by:
      updateClusterer in interface UpdateableClusterer
      Parameters:
      newInstance - the instance to be added
      Throws:
      Exception - if something goes wrong
    • distributionForInstance

      public double[] distributionForInstance(Instance instance) throws Exception
      Description copied from class: AbstractClusterer
      Predicts the cluster memberships for a given instance. Either this or clusterInstance() needs to be implemented by subclasses.
      Specified by:
      distributionForInstance in interface Clusterer
      Overrides:
      distributionForInstance in class AbstractClusterer
      Parameters:
      instance - the instance to be assigned a cluster.
      Returns:
      an array containing the estimated membership probabilities of the test instance in each cluster (this should sum to at most 1)
      Throws:
      Exception - if distribution could not be computed successfully
    • updateFinished

      public void updateFinished()
      Description copied from interface: UpdateableClusterer
      Signals the end of the updating.
      Specified by:
      updateFinished in interface UpdateableClusterer
    • initializeDistanceFunction

      public void initializeDistanceFunction(Instances init) throws Exception
      Initialize the distance function (i.e set min/max values for numeric attributes) with the supplied instances.
      Parameters:
      init - the instances to initialize with
      Throws:
      Exception - if a problem occurs
    • buildClusterer

      public void buildClusterer(Instances data) throws Exception
      Description copied from class: AbstractClusterer
      Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
      Specified by:
      buildClusterer in interface Clusterer
      Specified by:
      buildClusterer in class AbstractClusterer
      Parameters:
      data - set of instances serving as training data
      Throws:
      Exception - if the clusterer has not been generated successfully
    • numberOfClusters

      public int numberOfClusters() throws Exception
      Description copied from class: AbstractClusterer
      Returns the number of clusters.
      Specified by:
      numberOfClusters in interface Clusterer
      Specified by:
      numberOfClusters in class AbstractClusterer
      Returns:
      the number of clusters generated for a training dataset.
      Throws:
      Exception - if number of clusters could not be returned successfully
    • setMissingValuesReplacer

      public void setMissingValuesReplacer(Filter missingReplacer)
      Set a ready-to-use missing values replacement filter
      Parameters:
      missingReplacer - the missing values replacement filter to use
    • getCanopies

      public Instances getCanopies()
      Get the canopies (cluster centers).
      Returns:
      the canopies
    • setCanopies

      public void setCanopies(Instances canopies)
      Set the canopies to use (replaces any learned by this clusterer already)
      Parameters:
      canopies - the canopies to use
    • getClusterCanopyAssignments

      public List<long[]> getClusterCanopyAssignments()
      Get the canopies that each canopy (cluster center) is within T1 distance of
      Returns:
      a list of canopies for each cluster center
    • setClusterCanopyAssignments

      public void setClusterCanopyAssignments(List<long[]> clusterCanopies)
      Set the canopies that each canopy (cluster center) is within T1 distance of
      Parameters:
      clusterCanopies - the list canopies for each cluster center
    • getActualT2

      public double getActualT2()
      Get the actual value of T2 (which may be different from the initial value if the heuristic is used)
      Returns:
      the actual value of T2
    • getActualT1

      public double getActualT1()
      Get the actual value of T1 (which may be different from the initial value if the heuristic is used)
      Returns:
      the actual value of T1
    • t1TipText

      public String t1TipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setT1

      public void setT1(double t1)
      Set the T1 distance. Values < 0 are taken as a positive multiplier for the T2 distance - e.g. T1_actual = Math.abs(t1) * t2;
      Parameters:
      t1 - the T1 distance to use
    • getT1

      public double getT1()
      Get the T1 distance. Values < 0 are taken as a positive multiplier for the T2 distance - e.g. T1_actual = Math.abs(t1) * t2;
      Returns:
      the T1 distance to use
    • t2TipText

      public String t2TipText()
      Tip text for this property
      Returns:
      the tip text for this property
    • setT2

      public void setT2(double t2)
      Set the T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this (note that the heuristic is only applicable when batch training).
      Parameters:
      t2 - the T2 distance to use
    • getT2

      public double getT2()
      Get the T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this (note that the heuristic is only applicable when batch training).
      Returns:
      the T2 distance to use
    • numClustersTipText

      public String numClustersTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNumClusters

      public void setNumClusters(int numClusters) throws Exception
      Description copied from interface: NumberOfClustersRequestable
      Set the number of clusters to generate
      Specified by:
      setNumClusters in interface NumberOfClustersRequestable
      Parameters:
      numClusters - the number of clusters to generate
      Throws:
      Exception - if the requested number of clusters in inapropriate
    • getNumClusters

      public int getNumClusters()
      Get the number of clusters to generate
      Returns:
      the number of clusters to generate
    • periodicPruningRateTipText

      public String periodicPruningRateTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setPeriodicPruningRate

      public void setPeriodicPruningRate(int p)
      Set the how often to prune low density canopies during training
      Parameters:
      p - how often (every p instances) to prune low density canopies
    • getPeriodicPruningRate

      public int getPeriodicPruningRate()
      Get the how often to prune low density canopies during training
      Returns:
      how often (every p instances) to prune low density canopies
    • minimumCanopyDensityTipText

      public String minimumCanopyDensityTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinimumCanopyDensity

      public void setMinimumCanopyDensity(double dens)
      Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
      Parameters:
      dens - the minimum canopy density
    • getMinimumCanopyDensity

      public double getMinimumCanopyDensity()
      Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
      Returns:
      the minimum canopy density
    • maxNumCandidateCanopiesToHoldInMemory

      public String maxNumCandidateCanopiesToHoldInMemory()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMaxNumCandidateCanopiesToHoldInMemory

      public void setMaxNumCandidateCanopiesToHoldInMemory(int max)
      Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
      Parameters:
      max - the maximum number of candidate canopies to retain in memory during training
    • getMaxNumCandidateCanopiesToHoldInMemory

      public int getMaxNumCandidateCanopiesToHoldInMemory()
      Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
      Returns:
      the maximum number of candidate canopies to retain in memory during training
    • dontReplaceMissingValuesTipText

      public String dontReplaceMissingValuesTipText()
      Returns the tip text for this property.
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setDontReplaceMissingValues

      public void setDontReplaceMissingValues(boolean r)
      Sets whether missing values are to be replaced.
      Parameters:
      r - true if missing values are to be replaced
    • getDontReplaceMissingValues

      public boolean getDontReplaceMissingValues()
      Gets whether missing values are to be replaced.
      Returns:
      true if missing values are to be replaced
    • printSingleAssignment

      public static String printSingleAssignment(long[] assignments)
    • printCanopyAssignments

      public static String printCanopyAssignments(Instances dataPoints, List<long[]> canopyAssignments)
      Print the supplied instances and their canopies
      Parameters:
      dataPoints - the instances to print
      canopyAssignments - the canopy assignments, one assignment array for each instance
      Returns:
      a string containing the printed assignments
    • toString

      public String toString(boolean header)
      Return a textual description of this clusterer
      Parameters:
      header - true if the header should be printed
      Returns:
      a string describing the result of the clustering
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • cleanUp

      public void cleanUp()
      Save memory
    • aggregateCanopies

      public static Canopy aggregateCanopies(List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)
      Aggregate the canopies from a list of Canopy clusterers together into one final model.
      Parameters:
      canopies - the list of Canopy clusterers to aggregate
      aggregationT1 - the T1 distance to use for the aggregated classifier
      aggregationT2 - the T2 distance to use when aggregating canopies
      finalDistanceFunction - the distance function to use with the final Canopy clusterer
      missingValuesReplacer - the missing value replacement filter to use with the final clusterer (can be null for no missing value replacement)
      finalNumCanopies - the final number of canopies
      Returns:
      a Canopy clusterer that aggregates all the canopies
    • main

      public static void main(String[] args)