Class EM

All Implemented Interfaces:
Serializable, Cloneable, Clusterer, DensityBasedClusterer, NumberOfClustersRequestable, CapabilitiesHandler, CapabilitiesIgnorer, CommandlineRunnable, OptionHandler, Randomizable, RevisionHandler, WeightedInstancesHandler

Simple EM (expectation maximisation) class.

EM assigns a probability distribution to each instance which indicates the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross validation, or you may specify a priori how many clusters to generate.

The cross validation performed to determine the number of clusters is done in the following steps:
1. The number of clusters is set to 1.
2. The training set is split randomly into 10 folds.
3. EM is performed 10 times using the 10 folds the usual CV way.
4. The log likelihood is averaged over all 10 results.
5. If the log likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2.

The number of folds is fixed to 10, as long as the number of instances in the training set is not smaller than 10. If it is smaller, the number of folds is set equal to the number of instances.
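The selection procedure above can be sketched in plain Java. This is an illustration only, not Weka's implementation: the class `ClusterCountSelectionSketch`, its method names, and the `cvLogLikelihood` callback (which stands in for "run EM over all folds and average the log likelihood") are hypothetical. The `minImprovement` and `maxClusters` parameters are assumed to play the roles of the `-ll-cv` and `-max` options listed below.

```java
import java.util.function.IntToDoubleFunction;

final class ClusterCountSelectionSketch {

    /** Folds actually used: the requested count, capped at the number of instances. */
    static int effectiveFolds(int requestedFolds, int numInstances) {
        return Math.min(requestedFolds, numInstances);
    }

    /**
     * Starts at one cluster and keeps adding clusters while the averaged
     * cross-validated log likelihood improves by more than minImprovement.
     * A negative maxClusters means "no upper limit".
     */
    static int selectNumClusters(IntToDoubleFunction cvLogLikelihood,
                                 double minImprovement,
                                 int maxClusters) {
        int k = 1;
        double best = cvLogLikelihood.applyAsDouble(k);
        while (maxClusters < 0 || k < maxClusters) {
            double next = cvLogLikelihood.applyAsDouble(k + 1);
            if (next - best <= minImprovement) {
                break; // no sufficient improvement: keep the current k
            }
            best = next;
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        // Hypothetical score curve that peaks at three clusters.
        int chosen = selectNumClusters(k -> -Math.abs(k - 3), 1e-6, -1);
        System.out.println("chosen clusters: " + chosen);
        System.out.println("folds for 4 instances: " + effectiveFolds(10, 4));
    }
}
```

The callback keeps the sketch independent of any particular EM implementation; in practice it would split the data with the configured random seed and run EM once per fold.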

Missing values are globally replaced with ReplaceMissingValues.

Valid options are:

 -N <num>
  number of clusters. If omitted or -1 specified, then 
  cross validation is used to select the number of clusters.
 -X <num>
  Number of folds to use when cross-validating to find the best number of clusters.
 -K <num>
  Number of runs of k-means to perform.
  (default 10)
 -max <num>
  Maximum number of clusters to consider during cross-validation. If omitted or -1 specified, then 
  there is no upper limit on the number of clusters.
 -ll-cv <num>
  Minimum improvement in cross-validated log likelihood required
  to consider increasing the number of clusters.
  (default 1e-6)
 -I <num>
  max iterations.
  (default 100)
 -ll-iter <num>
  Minimum improvement in log likelihood required
  to perform another iteration of the E and M steps.
  (default 1e-6)
 -V
  verbose.
 -M <num>
  minimum allowable standard deviation for normal density
  computation
  (default 1e-6)
 -O
  Display model in old format (good when there are many clusters)
 
 -num-slots <num>
  Number of execution slots.
  (default 1 - i.e. no parallelism)
 -S <num>
  Random number seed.
  (default 100)
 -output-debug-info
  If set, clusterer is run in debug mode and
  may output additional info to the console
 -do-not-check-capabilities
  If set, clusterer capabilities are not checked before clusterer is built
  (use with caution).
Version:
$Revision: 15519 $
Author:
Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
  • Constructor Details

    • EM

      public EM()
      Constructor.
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing this clusterer
      Returns:
      a description of the clusterer suitable for displaying in the explorer/experimenter gui
    • listOptions

      public Enumeration<Option> listOptions()
      Returns an enumeration describing the available options.
      Specified by:
      listOptions in interface OptionHandler
      Overrides:
      listOptions in class RandomizableDensityBasedClusterer
      Returns:
      an enumeration of all the available options.
    • setOptions

      public void setOptions(String[] options) throws Exception
      Parses a given list of options.

      Valid options are:

       -N <num>
        number of clusters. If omitted or -1 specified, then 
        cross validation is used to select the number of clusters.
       -X <num>
        Number of folds to use when cross-validating to find the best number of clusters.
       -K <num>
        Number of runs of k-means to perform.
        (default 10)
       -max <num>
        Maximum number of clusters to consider during cross-validation. If omitted or -1 specified, then 
        there is no upper limit on the number of clusters.
       -ll-cv <num>
        Minimum improvement in cross-validated log likelihood required
        to consider increasing the number of clusters.
        (default 1e-6)
       -I <num>
        max iterations.
        (default 100)
       -ll-iter <num>
        Minimum improvement in log likelihood required
        to perform another iteration of the E and M steps.
        (default 1e-6)
       -V
        verbose.
       -M <num>
        minimum allowable standard deviation for normal density
        computation
        (default 1e-6)
       -O
        Display model in old format (good when there are many clusters)
       
       -num-slots <num>
        Number of execution slots.
        (default 1 - i.e. no parallelism)
       -S <num>
        Random number seed.
        (default 100)
       -output-debug-info
        If set, clusterer is run in debug mode and
        may output additional info to the console
       -do-not-check-capabilities
        If set, clusterer capabilities are not checked before clusterer is built
        (use with caution).
      Specified by:
      setOptions in interface OptionHandler
      Overrides:
      setOptions in class RandomizableDensityBasedClusterer
      Parameters:
      options - the list of options as an array of strings
      Throws:
      Exception - if an option is not supported
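As a small illustration of the expected array shape, the sketch below assembles an options array in which every flag and every value is a separate element. The helper class `EmOptionsSketch` and its `buildOptions` method are hypothetical; only the flags `-N`, `-I`, `-S` and `-V` come from the list above. With Weka on the classpath, the resulting array would be handed to `setOptions(String[])`.

```java
import java.util.ArrayList;
import java.util.List;

final class EmOptionsSketch {

    /** Builds an options array using a few of the flags documented above. */
    static String[] buildOptions(int numClusters, int maxIterations, int seed, boolean verbose) {
        List<String> opts = new ArrayList<>();
        opts.add("-N");                       // number of clusters (-1: choose by cross validation)
        opts.add(Integer.toString(numClusters));
        opts.add("-I");                       // maximum number of iterations
        opts.add(Integer.toString(maxIterations));
        opts.add("-S");                       // random number seed
        opts.add(Integer.toString(seed));
        if (verbose) {
            opts.add("-V");                   // flag without a value
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] options = buildOptions(-1, 100, 100, false);
        System.out.println(String.join(" ", options));
        // With Weka available: new EM().setOptions(options);
    }
}
```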
    • numKMeansRunsTipText

      public String numKMeansRunsTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • getNumKMeansRuns

      public int getNumKMeansRuns()
      Returns the number of runs of k-means to perform.
      Returns:
      the number of runs
    • setNumKMeansRuns

      public void setNumKMeansRuns(int intValue)
      Set the number of runs of SimpleKMeans to perform.
      Parameters:
      intValue - the number of runs of k-means to perform
    • numFoldsTipText

      public String numFoldsTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNumFolds

      public void setNumFolds(int folds)
      Set the number of folds to use when cross-validating to find the best number of clusters.
      Parameters:
      folds - the number of folds to use
    • getNumFolds

      public int getNumFolds()
      Get the number of folds to use when cross-validating to find the best number of clusters.
      Returns:
      the number of folds to use
    • minLogLikelihoodImprovementCVTipText

      public String minLogLikelihoodImprovementCVTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinLogLikelihoodImprovementCV

      public void setMinLogLikelihoodImprovementCV(double min)
      Set the minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters when cross-validating to find the best number of clusters
      Parameters:
      min - the minimum improvement in log likelihood
    • getMinLogLikelihoodImprovementCV

      public double getMinLogLikelihoodImprovementCV()
      Get the minimum improvement in cross-validated log likelihood required to consider increasing the number of clusters when cross-validating to find the best number of clusters
      Returns:
      the minimum improvement in log likelihood
    • minLogLikelihoodImprovementIteratingTipText

      public String minLogLikelihoodImprovementIteratingTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinLogLikelihoodImprovementIterating

      public void setMinLogLikelihoodImprovementIterating(double min)
      Set the minimum improvement in log likelihood necessary to perform another iteration of the E and M steps.
      Parameters:
      min - the minimum improvement in log likelihood
    • getMinLogLikelihoodImprovementIterating

      public double getMinLogLikelihoodImprovementIterating()
      Get the minimum improvement in log likelihood necessary to perform another iteration of the E and M steps.
      Returns:
      the minimum improvement in log likelihood
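How the iteration limit and the minimum improvement might interact can be sketched as a simple loop. This is an assumption about the general shape of such a stopping rule, not Weka's code: the class `IterationStopSketch` is hypothetical, and `emStep` stands in for one combined E and M step that returns the resulting log likelihood.

```java
import java.util.function.DoubleSupplier;

final class IterationStopSketch {

    /**
     * Repeats emStep until the log likelihood improves by less than
     * minImprovement or maxIterations is reached.
     * Returns the number of steps that were performed.
     */
    static int iterate(DoubleSupplier emStep, int maxIterations, double minImprovement) {
        double previous = Double.NEGATIVE_INFINITY;
        int iterations = 0;
        while (iterations < maxIterations) {
            double current = emStep.getAsDouble();
            iterations++;
            if (current - previous < minImprovement) {
                break; // improvement too small to justify another iteration
            }
            previous = current;
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Hypothetical log likelihood trace that flattens out after three steps.
        double[] trace = {-10.0, -5.0, -4.0, -4.0, -4.0};
        int[] next = {0};
        int steps = iterate(() -> trace[next[0]++], 100, 1e-6);
        System.out.println("steps performed: " + steps);
    }
}
```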
    • numExecutionSlotsTipText

      public String numExecutionSlotsTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNumExecutionSlots

      public void setNumExecutionSlots(int slots)
      Set the degree of parallelism to use.
      Parameters:
      slots - the number of execution slots (tasks run in parallel) to use; 1 means no parallelism
    • getNumExecutionSlots

      public int getNumExecutionSlots()
      Get the degree of parallelism to use.
      Returns:
      the number of execution slots (tasks run in parallel) in use; 1 means no parallelism
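One generic way to spread per-instance work over a fixed number of execution slots is shown below. It is a sketch under stated assumptions, not a description of how EM divides its work internally: the class `ExecutionSlotsSketch` and the choice to sum per-instance log likelihoods in contiguous chunks are both hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ExecutionSlotsSketch {

    /** Sums the values using at most numSlots worker threads, one contiguous chunk per slot. */
    static double parallelSum(double[] values, int numSlots) {
        int slots = Math.max(1, Math.min(numSlots, values.length));
        int chunk = (values.length + slots - 1) / slots; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            List<Future<Double>> parts = new ArrayList<>();
            for (int s = 0; s < slots; s++) {
                final int from = Math.min(values.length, s * chunk);
                final int to = Math.min(values.length, from + chunk);
                Callable<Double> task = () -> {
                    double sum = 0.0;
                    for (int i = from; i < to; i++) {
                        sum += values[i];
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            double total = 0.0;
            for (Future<Double> part : parts) {
                total += part.get(); // wait for each slot's partial result
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("parallel sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] logLikelihoods = {-1.0, -2.0, -3.0, -4.0, -5.0};
        System.out.println(parallelSum(logLikelihoods, 2));
    }
}
```

With a single slot the loop runs one task over the whole array, which corresponds to the "no parallelism" default.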
    • displayModelInOldFormatTipText

      public String displayModelInOldFormatTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setDisplayModelInOldFormat

      public void setDisplayModelInOldFormat(boolean d)
      Set whether to display model output in the old, original format.
      Parameters:
      d - true if model output is to be shown in the old format
    • getDisplayModelInOldFormat

      public boolean getDisplayModelInOldFormat()
      Get whether to display model output in the old, original format.
      Returns:
      true if model output is to be shown in the old format
    • minStdDevTipText

      public String minStdDevTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMinStdDev

      public void setMinStdDev(double m)
      Set the minimum value for standard deviation when calculating normal density. Raising this value can help prevent arithmetic overflow resulting from multiplying large densities (arising from small standard deviations) when there are many singleton or near-singleton values.
      Parameters:
      m - minimum value for standard deviation
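The effect of a floor on the standard deviation can be illustrated with a log normal density. This is a minimal sketch, assuming the floor is applied by simply taking the larger of the estimated standard deviation and the minimum; the class `MinStdDevSketch` and its method are hypothetical and not part of Weka.

```java
final class MinStdDevSketch {

    /**
     * Log of the normal density at x, with the standard deviation
     * clamped from below by minStdDev.
     */
    static double logNormalDensity(double x, double mean, double stdDev, double minStdDev) {
        double s = Math.max(stdDev, minStdDev); // floor keeps the density finite
        double z = (x - mean) / s;
        return -0.5 * z * z - Math.log(s) - 0.5 * Math.log(2.0 * Math.PI);
    }

    public static void main(String[] args) {
        // A cluster whose values are all identical has an estimated standard deviation of 0.
        System.out.println(logNormalDensity(5.0, 5.0, 0.0, 1e-6));
        // An ordinary case for comparison.
        System.out.println(logNormalDensity(0.0, 0.0, 1.0, 1e-6));
    }
}
```

Without the floor, a standard deviation of zero would make the density at the mean infinite, which is the situation the minimum is meant to avoid.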
    • setMinStdDevPerAtt

      public void setMinStdDevPerAtt(double[] m)
    • getMinStdDev

      public double getMinStdDev()
      Get the minimum allowable standard deviation.
      Returns:
      the minimum allowable standard deviation
    • numClustersTipText

      public String numClustersTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setNumClusters

      public void setNumClusters(int n) throws Exception
      Set the number of clusters (-1 to select by CV).
      Specified by:
      setNumClusters in interface NumberOfClustersRequestable
      Parameters:
      n - the number of clusters
      Throws:
      Exception - if n is 0
    • getNumClusters

      public int getNumClusters()
      Get the number of clusters
      Returns:
      the number of clusters.
    • setMaximumNumberOfClusters

      public void setMaximumNumberOfClusters(int n)
      Set the maximum number of clusters to consider when cross-validating
      Parameters:
      n - the maximum number of clusters to consider
    • getMaximumNumberOfClusters

      public int getMaximumNumberOfClusters()
      Get the maximum number of clusters to consider when cross-validating
      Returns:
      the maximum number of clusters to consider
    • maximumNumberOfClustersTipText

      public String maximumNumberOfClustersTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • maxIterationsTipText

      public String maxIterationsTipText()
      Returns the tip text for this property
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setMaxIterations

      public void setMaxIterations(int i) throws Exception
      Set the maximum number of iterations to perform
      Parameters:
      i - the number of iterations
      Throws:
      Exception - if i is less than 1
    • getMaxIterations

      public int getMaxIterations()
      Get the maximum number of iterations
      Returns:
      the number of iterations
    • debugTipText

      public String debugTipText()
      Returns the tip text for this property
      Overrides:
      debugTipText in class AbstractClusterer
      Returns:
      tip text for this property suitable for displaying in the explorer/experimenter gui
    • setDebug

      public void setDebug(boolean v)
      Set debug mode - verbose output
      Overrides:
      setDebug in class AbstractClusterer
      Parameters:
      v - true for verbose output
    • getDebug

      public boolean getDebug()
      Get debug mode
      Overrides:
      getDebug in class AbstractClusterer
      Returns:
      true if debug mode is set
    • getOptions

      public String[] getOptions()
      Gets the current settings of EM.
      Specified by:
      getOptions in interface OptionHandler
      Overrides:
      getOptions in class RandomizableDensityBasedClusterer
      Returns:
      an array of strings suitable for passing to setOptions()
    • getClusterModelsNumericAtts

      public double[][][] getClusterModelsNumericAtts()
      Return the normal distributions for the cluster models
      Returns:
      a double[][][] value
    • getClusterPriors

      public double[] getClusterPriors()
      Return the priors for the clusters
      Returns:
      a double[] value
    • toString

      public String toString()
      Outputs the generated clusters into a string.
      Overrides:
      toString in class Object
      Returns:
      the clusterer in string representation
    • numberOfClusters

      public int numberOfClusters() throws Exception
      Returns the number of clusters.
      Specified by:
      numberOfClusters in interface Clusterer
      Specified by:
      numberOfClusters in class AbstractClusterer
      Returns:
      the number of clusters generated for a training dataset.
      Throws:
      Exception - if number of clusters could not be returned successfully
    • getCapabilities

      public Capabilities getCapabilities()
      Returns default capabilities of the clusterer (i.e., the ones of SimpleKMeans).
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Specified by:
      getCapabilities in interface Clusterer
      Overrides:
      getCapabilities in class AbstractClusterer
      Returns:
      the capabilities of this clusterer
    • buildClusterer

      public void buildClusterer(Instances data) throws Exception
      Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
      Specified by:
      buildClusterer in interface Clusterer
      Specified by:
      buildClusterer in class AbstractClusterer
      Parameters:
      data - set of instances serving as training data
      Throws:
      Exception - if the clusterer has not been generated successfully
    • clusterPriors

      public double[] clusterPriors()
      Returns the cluster priors.
      Specified by:
      clusterPriors in interface DensityBasedClusterer
      Specified by:
      clusterPriors in class AbstractDensityBasedClusterer
      Returns:
      the cluster priors
    • logDensityPerClusterForInstance

      public double[] logDensityPerClusterForInstance(Instance inst) throws Exception
      Computes the log of the conditional density (per cluster) for a given instance.
      Specified by:
      logDensityPerClusterForInstance in interface DensityBasedClusterer
      Specified by:
      logDensityPerClusterForInstance in class AbstractDensityBasedClusterer
      Parameters:
      inst - the instance to compute the density for
      Returns:
      an array containing the estimated log densities, one per cluster
      Throws:
      Exception - if the density could not be computed successfully
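Per-cluster log densities and cluster priors can be combined into the membership probabilities mentioned in the class description. The sketch below applies Bayes' rule in log space and subtracts the maximum before exponentiating to avoid underflow. The class `ClusterMembershipSketch` is hypothetical; with Weka, the two input arrays would presumably come from `logDensityPerClusterForInstance` and `clusterPriors`.

```java
final class ClusterMembershipSketch {

    /** Converts log densities and priors into probabilities that sum to 1. */
    static double[] membership(double[] logDensities, double[] priors) {
        int n = logDensities.length;
        double[] logJoint = new double[n];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            logJoint[i] = logDensities[i] + Math.log(priors[i]);
            max = Math.max(max, logJoint[i]);
        }
        double[] result = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            result[i] = Math.exp(logJoint[i] - max); // shift by max for numerical stability
            sum += result[i];
        }
        for (int i = 0; i < n; i++) {
            result[i] /= sum;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] p = membership(new double[] {-1000.0, -1001.0}, new double[] {0.5, 0.5});
        System.out.println(p[0] + " " + p[1]);
    }
}
```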
    • getRevision

      public String getRevision()
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Overrides:
      getRevision in class AbstractClusterer
      Returns:
      the revision
    • main

      public static void main(String[] argv)
      Main method for testing this class.
      Parameters:
      argv - should contain the following arguments:

      -t training file [-T test file] [-N number of clusters] [-S random seed]