Class DictionarySaver

All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, BatchConverter, FileSourcedConverter, IncrementalConverter, Saver, EnvironmentHandler, OptionHandler, RevisionHandler

public class DictionarySaver extends AbstractFileSaver implements BatchConverter, IncrementalConverter
Writes a dictionary constructed from string attributes in incoming instances to a destination.

Valid options are:

 -binary-dict
  Save as a binary serialized dictionary
 
 -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.
 
 -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed
 
 -L
  Convert all tokens to lowercase when matching against dictionary entries.
 
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 
 -stopwords-handler <spec>
  The stopwords handler to use (default = Null)
 
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 
 -P <integer>
  Prune the dictionary every x instances
  (default = 0 - i.e. no periodic pruning)
 
 -W <integer>
  The number of words (per class if there is a class attribute assigned) to attempt to keep.
 
 -M <integer>
  The minimum term frequency to use when pruning the dictionary
  (default = 1).
 
 -O
  If this is set, the maximum number of words and the
  minimum term frequency are not enforced on a per-class
  basis but based on the documents in all the classes
  (even if a class attribute is set).
 
 -sort
  Sort the dictionary alphabetically
 
 -i <the input file>
  The input file
 
 -o <the output file>
  The output file
 
Version:
$Revision: 12690 $
Author:
Mark Hall (mhall{[at]}pentaho{[dot]}com)
  • Constructor Details

    • DictionarySaver

      public DictionarySaver()
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing this Saver.
      Returns:
      a description of the Saver suitable for displaying in the Explorer/Experimenter GUI
    • setSaveBinaryDictionary

      @OptionMetadata(displayName="Save dictionary in binary form", description="Save as a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2) public void setSaveBinaryDictionary(boolean binary)
      Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one
      Parameters:
      binary - true if the dictionary is to be saved as binary rather than plain text
    • getSaveBinaryDictionary

      public boolean getSaveBinaryDictionary()
      Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one
      Returns:
      true if the dictionary is to be saved as binary rather than plain text
    • getAttributeIndices

      public String getAttributeIndices()
      Gets the current range selection.
      Returns:
      a string containing a comma separated list of ranges
    • setAttributeIndices

      @OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4) public void setAttributeIndices(String rangeList)
      Sets which attributes are to be worked on.
      Parameters:
      rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1,
      e.g. first-3,5,6-last
      Throws:
      IllegalArgumentException - if an invalid range list is supplied
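The range-list format described for setAttributeIndices (1-based indices, with "first" and "last" as valid values) can be illustrated with a small standalone sketch. This is not Weka's own parser: the class RangeListSketch, its methods, and the numAttributes parameter are hypothetical names introduced only for illustration, and the exact set of strings Weka rejects may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not Weka's own range handling.
final class RangeListSketch {

    // Converts one 1-based token ("first", "last", or a number) to a 0-based index.
    static int toIndex(String token, int numAttributes) {
        String t = token.trim().toLowerCase();
        int oneBased;
        if (t.equals("first")) {
            oneBased = 1;
        } else if (t.equals("last")) {
            oneBased = numAttributes;
        } else {
            try {
                oneBased = Integer.parseInt(t);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Invalid index: " + token);
            }
        }
        if (oneBased < 1 || oneBased > numAttributes) {
            throw new IllegalArgumentException("Index out of bounds: " + token);
        }
        return oneBased - 1;
    }

    // Expands a list such as "first-3,5,6-last" into 0-based attribute indices.
    static List<Integer> parse(String rangeList, int numAttributes) {
        List<Integer> selected = new ArrayList<>();
        for (String part : rangeList.split(",")) {
            String[] ends = part.split("-");
            if (ends.length < 1 || ends.length > 2) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            int from = toIndex(ends[0], numAttributes);
            int to = ends.length == 2 ? toIndex(ends[1], numAttributes) : from;
            if (from > to) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int i = from; i <= to; i++) {
                if (!selected.contains(i)) {
                    selected.add(i);
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // With 8 attributes, prints [0, 1, 2, 4, 5, 6, 7]
        System.out.println(parse("first-3,5,6-last", 8));
    }
}
```

The sketch returns 0-based indices because that is how Java collections are addressed; the user-facing string stays 1-based, as the parameter description states.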
    • getInvertSelection

      public boolean getInvertSelection()
      Gets whether the selection is inverted, i.e. whether the supplied columns are to be processed or skipped.
      Returns:
      true if the selection is inverted, so that only attributes outside the supplied range are processed
    • setInvertSelection

      @OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5) public void setInvertSelection(boolean invert)
      Sets whether selected columns should be processed or skipped.
      Parameters:
      invert - the new invert setting
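The effect of the invert setting on a range selection can be sketched as follows, based on the -V option description (false: only selected attributes are worked on; true: only non-selected attributes are processed). The class InvertSelectionSketch and its effective method are hypothetical names, and the code does not reproduce Weka's internals.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical.
final class InvertSelectionSketch {

    // Returns the 0-based attribute indices that would actually be processed.
    static List<Integer> effective(List<Integer> selected, int numAttributes, boolean invert) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            boolean inSelection = selected.contains(i);
            // invert == false keeps selected attributes; invert == true keeps the rest.
            if (inSelection != invert) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> selected = List.of(0, 2);
        System.out.println(effective(selected, 4, false));
        System.out.println(effective(selected, 4, true));
    }
}
```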
    • getLowerCaseTokens

      public boolean getLowerCaseTokens()
      Gets whether the tokens are to be converted to lower case.
      Returns:
      true if the tokens are to be converted to lower case
    • setLowerCaseTokens

      @OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10) public void setLowerCaseTokens(boolean downCaseTokens)
      Sets whether the tokens are to be converted to lower case. (Doesn't affect non-alphabetic characters in tokens.)
      Parameters:
      downCaseTokens - should be true if only lower case tokens are to be formed.
    • setStemmer

      @OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11) public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      Parameters:
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      Returns:
      the current stemming algorithm, null if none set
    • setStopwordsHandler

      @OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12) public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      Parameters:
      value - the stopwords handler; if null, the Null handler is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      Returns:
      the stopwords handler
    • setTokenizer

      @OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13) public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      Parameters:
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      Returns:
      the current tokenizer algorithm
    • getPeriodicPruning

      public long getPeriodicPruning()
      Gets the rate at which the dictionary is periodically pruned, expressed as a number of instances (0 means no periodic pruning).
      Returns:
      the rate at which the dictionary is periodically pruned
    • setPeriodicPruning

      @OptionMetadata(displayName="Periodic pruning rate", description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)", commandLineParamName="P", commandLineParamSynopsis="-P <integer>", displayOrder=14) public void setPeriodicPruning(long newPeriodicPruning)
      Sets the rate at which the dictionary is periodically pruned, expressed as a number of instances (0 means no periodic pruning).
      Parameters:
      newPeriodicPruning - the rate at which the dictionary is periodically pruned
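The -P option is described as "Prune the dictionary every x instances", with 0 disabling periodic pruning. A minimal sketch of that schedule is shown below. The class PeriodicPruningSketch, the minCount threshold, and the rule for what gets removed at each pruning step are assumptions made for illustration; the source does not say which entries Weka drops during a periodic prune.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; the pruning rule (drop counts below minCount) is an assumption.
final class PeriodicPruningSketch {
    private final Map<String, Integer> counts = new HashMap<>();
    private final long period;   // prune every 'period' instances; 0 disables pruning
    private final int minCount;  // hypothetical threshold applied at each prune
    private long seen = 0;
    private int pruneRuns = 0;

    PeriodicPruningSketch(long period, int minCount) {
        this.period = period;
        this.minCount = minCount;
    }

    // Adds the tokens of one instance and prunes when the period is reached.
    void addInstance(String... tokens) {
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        seen++;
        if (period > 0 && seen % period == 0) {
            counts.values().removeIf(c -> c < minCount);
            pruneRuns++;
        }
    }

    int pruneRuns() {
        return pruneRuns;
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        PeriodicPruningSketch sketch = new PeriodicPruningSketch(2, 2);
        sketch.addInstance("a", "b");
        sketch.addInstance("a", "c");
        System.out.println(sketch.counts() + " after " + sketch.pruneRuns() + " prune run(s)");
    }
}
```

Pruning during the pass keeps the in-memory dictionary small on large inputs, at the cost of possibly discarding a term that would have become frequent later.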
    • getWordsToKeep

      public int getWordsToKeep()
      Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      Returns:
      the target number of words in the output vector (per class if assigned).
    • setWordsToKeep

      @OptionMetadata(displayName="Number of words to attempt to keep", description="The number of words (per class if there is a class attribute assigned) to attempt to keep.", commandLineParamName="W", commandLineParamSynopsis="-W <integer>", displayOrder=15) public void setWordsToKeep(int newWordsToKeep)
      Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      Parameters:
      newWordsToKeep - the target number of words in the output vector (per class if assigned).
    • getMinTermFreq

      public int getMinTermFreq()
      Gets the minimum term frequency used when pruning the dictionary (default 1).
      Returns:
      the minimum term frequency
    • setMinTermFreq

      @OptionMetadata(displayName="Minimum term frequency", description="The minimum term frequency to use when pruning the dictionary\n(default = 1).", commandLineParamName="M", commandLineParamSynopsis="-M <integer>", displayOrder=16) public void setMinTermFreq(int newMinTermFreq)
      Sets the minimum term frequency used when pruning the dictionary (default 1).
      Parameters:
      newMinTermFreq - the new minimum term frequency
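As a rough illustration of a minimum term frequency, the sketch below drops every dictionary entry whose count falls below the threshold. It is a simplified stand-in, not Weka's pruning logic: the class MinTermFreqSketch and its prune method are hypothetical, and the interaction with -W (words to keep) and per-class counting is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; ignores -W and per-class handling.
final class MinTermFreqSketch {

    // Keeps only the terms whose count is at least minTermFreq.
    static Map<String, Integer> prune(Map<String, Integer> counts, int minTermFreq) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minTermFreq) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("the", 5);
        counts.put("weka", 2);
        counts.put("typo", 1);
        System.out.println(prune(counts, 2));
    }
}
```

With the default of 1, every term that occurs at all passes the threshold, so nothing is removed on frequency grounds.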
    • getDoNotOperateOnPerClassBasis

      public boolean getDoNotOperateOnPerClassBasis()
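      Gets whether the maximum number of words and the minimum term frequency are enforced over the documents in all classes rather than on a per-class basis.
      Returns:
      true if the limits are not enforced on a per-class basis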
    • setDoNotOperateOnPerClassBasis

      @OptionMetadata(displayName="Do not operate on a per-class basis", description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).", commandLineParamName="O", commandLineParamSynopsis="-O", commandLineParamIsFlag=true, displayOrder=17) public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
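      Sets whether the maximum number of words and the minimum term frequency are enforced over the documents in all classes rather than on a per-class basis (even if a class attribute is set).
      Parameters:
      newDoNotOperateOnPerClassBasis - true if the limits should not be enforced on a per-class basis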
    • setKeepDictionarySorted

      @OptionMetadata(displayName="Sort dictionary", description="Sort the dictionary alphabetically", commandLineParamName="sort", commandLineParamSynopsis="-sort", commandLineParamIsFlag=true, displayOrder=18) public void setKeepDictionarySorted(boolean sorted)
      Set whether to keep the dictionary sorted alphabetically or not
      Parameters:
      sorted - true to keep the dictionary sorted
    • getKeepDictionarySorted

      public boolean getKeepDictionarySorted()
      Get whether to keep the dictionary sorted alphabetically or not
      Returns:
      true to keep the dictionary sorted
    • getCapabilities

      public Capabilities getCapabilities()
      Returns the Capabilities of this saver.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      Overrides:
      getCapabilities in class AbstractSaver
      Returns:
      the capabilities of this object
    • getFileDescription

      public String getFileDescription()
      Description copied from class: AbstractFileSaver
      To be overridden.
      Specified by:
      getFileDescription in interface FileSourcedConverter
      Specified by:
      getFileDescription in class AbstractFileSaver
      Returns:
      the file type description.
    • writeIncremental

      public void writeIncremental(Instance inst) throws IOException
      Description copied from class: AbstractSaver
      Method for incremental saving. Standard behaviour: no incremental saving is possible, therefore throw an IOException. An incremental saving process is stopped by calling this method with null.
      Specified by:
      writeIncremental in interface Saver
      Overrides:
      writeIncremental in class AbstractSaver
      Parameters:
      inst - the instance to be saved
      Throws:
      IOException - if the instance cannot be written to the specified destination
    • writeBatch

      public void writeBatch() throws IOException
      Description copied from class: AbstractSaver
      Writes to a file in batch mode. To be overridden.
      Specified by:
      writeBatch in interface Saver
      Specified by:
      writeBatch in class AbstractSaver
      Throws:
      IOException - if writing is not possible
    • resetOptions

      public void resetOptions()
      Description copied from class: AbstractFileSaver
      Resets the options.
      Overrides:
      resetOptions in class AbstractFileSaver
    • resetWriter

      public void resetWriter()
      Description copied from class: AbstractFileSaver
      Sets the writer to null.
      Overrides:
      resetWriter in class AbstractFileSaver
    • setDestination

      public void setDestination(OutputStream output) throws IOException
      Description copied from class: AbstractFileSaver
      Sets the destination output stream.
      Specified by:
      setDestination in interface Saver
      Overrides:
      setDestination in class AbstractFileSaver
      Parameters:
      output - the output stream.
      Throws:
      IOException - if the destination cannot be set
    • getRevision

      public String getRevision()
      Description copied from interface: RevisionHandler
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      Returns:
      the revision
    • main

      public static void main(String[] args)