Class DictionarySaver

All Implemented Interfaces:
Serializable, CapabilitiesHandler, CapabilitiesIgnorer, BatchConverter, FileSourcedConverter, IncrementalConverter, Saver, EnvironmentHandler, OptionHandler, RevisionHandler

public class DictionarySaver extends AbstractFileSaver implements BatchConverter, IncrementalConverter
Writes a dictionary constructed from string attributes in incoming instances to a destination.

Valid options are:

 -binary-dict
  Save as a binary serialized dictionary
 -R <range>
  Specify range of attributes to act on. This is a comma separated list of attribute
  indices, with "first" and "last" valid values.
 -V
  Set attributes selection mode. If false, only selected attributes in the range will
  be worked on. If true, only non-selected attributes will be processed
 -L
  Convert all tokens to lowercase when matching against dictionary entries.
 -stemmer <spec>
  The stemming algorithm (classname plus parameters) to use.
 -stopwords-handler <spec>
  The stopwords handler to use (default = Null)
 -tokenizer <spec>
  The tokenizing algorithm (classname plus parameters) to use.
  (default: weka.core.tokenizers.WordTokenizer)
 -P <integer>
  Prune the dictionary every x instances
  (default = 0 - i.e. no periodic pruning)
 -W <integer>
  The number of words (per class if there is a class attribute assigned) to attempt to keep.
 -M <integer>
  The minimum term frequency to use when pruning the dictionary
  (default = 1).
 -O
  If this is set, the maximum number of words and the
  minimum term frequency are not enforced on a per-class
  basis but based on the documents in all the classes
  (even if a class attribute is set).
 -sort
  Sort the dictionary alphabetically
 -i <the input file>
  The input file
 -o <the output file>
  The output file
Version: $Revision: 12690 $
Author: Mark Hall (mhall{[at]}pentaho{[dot]}com)
  • Constructor Details

    • DictionarySaver

      public DictionarySaver()
  • Method Details

    • globalInfo

      public String globalInfo()
      Returns a string describing this Saver.
      a description of the Saver suitable for displaying in the explorer/experimenter gui
    • setSaveBinaryDictionary

      @OptionMetadata(displayName="Save dictionary in binary form", description="Save as a binary serialized dictionary", commandLineParamName="binary-dict", commandLineParamSynopsis="-binary-dict", commandLineParamIsFlag=true, displayOrder=2) public void setSaveBinaryDictionary(boolean binary)
      Set whether to save the dictionary as a binary serialized dictionary, rather than a plain text one
      binary - true if the dictionary is to be saved as binary rather than plain text
    • getSaveBinaryDictionary

      public boolean getSaveBinaryDictionary()
      Get whether to save the dictionary as a binary serialized dictionary, rather than a plain text one
      true if the dictionary is to be saved as binary rather than plain text
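As a rough illustration of the difference between the two output forms, the sketch below writes the same dictionary once as plain text and once with standard Java serialization. It uses only the JDK and does not reproduce Weka's actual file layout: the class `DictionaryFormatSketch`, its method names, and the `word,count` line format are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Weka.
final class DictionaryFormatSketch {

  /** Plain-text form: one "word,count" line per entry (this layout is an assumption). */
  static String toPlainText(Map<String, Integer> dict) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Integer> e : dict.entrySet()) {
      sb.append(e.getKey()).append(',').append(e.getValue()).append('\n');
    }
    return sb.toString();
  }

  /** Binary form: a copy of the map written with standard Java serialization. */
  static byte[] toBinary(Map<String, Integer> dict) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(new LinkedHashMap<String, Integer>(dict));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads back a map that was written by toBinary. */
  @SuppressWarnings("unchecked")
  static Map<String, Integer> fromBinary(byte[] data) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return (Map<String, Integer>) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The text form can be inspected and edited by hand, while the binary form round-trips the Java object directly but is only readable by Java code.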
    • getAttributeIndices

      public String getAttributeIndices()
      Gets the current range selection.
      a string containing a comma separated list of ranges
    • setAttributeIndices

      @OptionMetadata(displayName="Range of attributes to operate on", description="Specify range of attributes to act on. This is a comma separated list of attribute\nindices, with \"first\" and \"last\" valid values.", commandLineParamName="R", commandLineParamSynopsis="-R <range>", displayOrder=4) public void setAttributeIndices(String rangeList)
      Sets which attributes are to be worked on.
      rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
      eg: first-3,5,6-last
      IllegalArgumentException - if an invalid range list is supplied
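To make the range-list syntax concrete, here is a minimal sketch of how a string such as `first-3,5,6-last` could be expanded into 0-based attribute positions. It is not Weka's own range parser: `RangeListSketch`, `parse`, and `parseIndex` are hypothetical names, and the error handling is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka.
final class RangeListSketch {

  /** Converts one 1-based token ("first", "last", or a number) to a 0-based index. */
  static int parseIndex(String token, int numAttributes) {
    String t = token.trim().toLowerCase();
    int oneBased;
    if (t.equals("first")) {
      oneBased = 1;
    } else if (t.equals("last")) {
      oneBased = numAttributes;
    } else {
      try {
        oneBased = Integer.parseInt(t);
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Invalid index: " + token);
      }
    }
    if (oneBased < 1 || oneBased > numAttributes) {
      throw new IllegalArgumentException("Index out of range: " + token);
    }
    return oneBased - 1;
  }

  /** Expands a comma separated list of indices and "lo-hi" ranges, in order, without duplicates. */
  static List<Integer> parse(String rangeList, int numAttributes) {
    List<Integer> selected = new ArrayList<>();
    for (String part : rangeList.split(",")) {
      String[] bounds = part.split("-");
      if (bounds.length < 1 || bounds.length > 2) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      int lo = parseIndex(bounds[0], numAttributes);
      int hi = bounds.length == 2 ? parseIndex(bounds[1], numAttributes) : lo;
      if (lo > hi) {
        throw new IllegalArgumentException("Invalid range: " + part);
      }
      for (int i = lo; i <= hi; i++) {
        if (!selected.contains(i)) {
          selected.add(i);
        }
      }
    }
    return selected;
  }
}
```

With eight attributes, `RangeListSketch.parse("first-3,5,6-last", 8)` yields the 0-based positions 0, 1, 2, 4, 5, 6 and 7, so only the fourth attribute is left out.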
    • getInvertSelection

      public boolean getInvertSelection()
      Gets whether the supplied columns are to be processed or skipped.
      true if the selection is inverted, i.e. only attributes outside the supplied range are processed
    • setInvertSelection

      @OptionMetadata(displayName="Invert selection", description="Set attributes selection mode. If false, only selected attributes in the range will\nbe worked on. If true, only non-selected attributes will be processed", commandLineParamName="V", commandLineParamSynopsis="-V", commandLineParamIsFlag=true, displayOrder=5) public void setInvertSelection(boolean invert)
      Sets whether selected columns should be processed or skipped.
      invert - the new invert setting
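The effect of the invert flag on a selection can be sketched as a simple complement over the attribute positions. This is an illustration only, using a hypothetical `InvertSelectionSketch` class and 0-based positions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka.
final class InvertSelectionSketch {

  /**
   * Returns the positions that would be processed: the selected ones when
   * invert is false, all other positions when invert is true.
   */
  static List<Integer> effective(List<Integer> selected, int numAttributes, boolean invert) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < numAttributes; i++) {
      boolean inSelection = selected.contains(i);
      if (inSelection != invert) {
        result.add(i);
      }
    }
    return result;
  }
}
```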
    • getLowerCaseTokens

      public boolean getLowerCaseTokens()
      Gets whether the tokens are to be downcased.
      true if the tokens are to be downcased.
    • setLowerCaseTokens

      @OptionMetadata(displayName="Lower case tokens", description="Convert all tokens to lowercase when matching against dictionary entries.", commandLineParamName="L", commandLineParamSynopsis="-L", commandLineParamIsFlag=true, displayOrder=10) public void setLowerCaseTokens(boolean downCaseTokens)
      Sets whether the tokens are to be downcased. (Doesn't affect non-alphabetic characters in tokens.)
      downCaseTokens - should be true if only lower case tokens are to be formed.
    • setStemmer

      @OptionMetadata(displayName="Stemmer to use", description="The stemming algorithm (classname plus parameters) to use.", commandLineParamName="stemmer", commandLineParamSynopsis="-stemmer <spec>", displayOrder=11) public void setStemmer(Stemmer value)
      Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
      value - the configured stemming algorithm, or null
    • getStemmer

      public Stemmer getStemmer()
      Returns the current stemming algorithm, null if none is used.
      the current stemming algorithm, null if none set
    • setStopwordsHandler

      @OptionMetadata(displayName="Stop words handler", description="The stopwords handler to use (default = Null)", commandLineParamName="stopwords-handler", commandLineParamSynopsis="-stopwords-handler <spec>", displayOrder=12) public void setStopwordsHandler(StopwordsHandler value)
      Sets the stopwords handler to use.
      value - the stopwords handler; if null, the Null handler is used
    • getStopwordsHandler

      public StopwordsHandler getStopwordsHandler()
      Gets the stopwords handler.
      the stopwords handler
    • setTokenizer

      @OptionMetadata(displayName="Tokenizer", description="The tokenizing algorithm (classname plus parameters) to use.\n(default: weka.core.tokenizers.WordTokenizer)", commandLineParamName="tokenizer", commandLineParamSynopsis="-tokenizer <spec>", displayOrder=13) public void setTokenizer(Tokenizer value)
      Sets the tokenizer algorithm to use.
      value - the configured tokenizing algorithm
    • getTokenizer

      public Tokenizer getTokenizer()
      Returns the current tokenizer algorithm.
      the current tokenizer algorithm
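The lowercasing, stemming, stopword, and tokenizer settings described above act together on each string value. The sketch below shows one plausible way to chain them using only the JDK. The order of the steps, the delimiter set, and the names `TokenPipelineSketch` and `countTerms` are assumptions for illustration; the stemmer is modelled as a plain function rather than a Weka `Stemmer`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Weka.
final class TokenPipelineSketch {

  /** Tokenizes text, optionally lowercases, stems, drops stopwords, and counts the rest. */
  static Map<String, Integer> countTerms(String text, boolean lowerCase,
      UnaryOperator<String> stemmer, Set<String> stopwords) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    // Split on whitespace and a few punctuation characters (assumed delimiter set).
    for (String token : text.split("[\\s.,;:'\"()?!]+")) {
      if (token.isEmpty()) {
        continue; // a leading delimiter produces one empty token
      }
      String word = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
      word = stemmer.apply(word);
      if (stopwords.contains(word)) {
        continue;
      }
      counts.merge(word, 1, Integer::sum);
    }
    return counts;
  }
}
```

For the input `The cat saw the Cat.` with lowercasing enabled, an identity stemmer, and `the` as the only stopword, the sketch counts `cat` twice and `saw` once. With lowercasing disabled, `The`, `cat`, and `Cat` are counted as three separate terms.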
    • getPeriodicPruning

      public long getPeriodicPruning()
      Gets the rate at which the dictionary is periodically pruned, expressed as a number of instances (0 means no periodic pruning).
      the rate at which the dictionary is periodically pruned
    • setPeriodicPruning

      @OptionMetadata(displayName="Periodic pruning rate", description="Prune the dictionary every x instances\n(default = 0 - i.e. no periodic pruning)", commandLineParamName="P", commandLineParamSynopsis="-P <integer>", displayOrder=14) public void setPeriodicPruning(long newPeriodicPruning)
      Sets the rate at which the dictionary is periodically pruned, expressed as a number of instances (0 means no periodic pruning).
      newPeriodicPruning - the rate at which the dictionary is periodically pruned
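Periodic pruning keeps the dictionary from growing without bound while instances stream in. A minimal sketch of the idea follows, assuming that a prune pass simply drops terms whose count is below the minimum term frequency; what Weka actually removes during a periodic prune may differ, and `PeriodicPruningSketch` is a hypothetical class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Weka.
final class PeriodicPruningSketch {
  private final Map<String, Integer> counts = new HashMap<>();
  private final long period;  // prune every `period` instances; 0 disables pruning
  private final int minFreq;  // terms seen fewer times than this are dropped (assumed rule)
  private long seen = 0;

  PeriodicPruningSketch(long period, int minFreq) {
    this.period = period;
    this.minFreq = minFreq;
  }

  /** Adds the tokens of one instance and prunes when the period is reached. */
  void addInstance(String... tokens) {
    for (String t : tokens) {
      counts.merge(t, 1, Integer::sum);
    }
    seen++;
    if (period > 0 && seen % period == 0) {
      counts.values().removeIf(c -> c < minFreq);
    }
  }

  Map<String, Integer> counts() {
    return counts;
  }
}
```

A short period bounds memory more tightly but can discard terms that would have reached the threshold later in the stream.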
    • getWordsToKeep

      public int getWordsToKeep()
      Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      the target number of words in the output vector (per class if assigned).
    • setWordsToKeep

      @OptionMetadata(displayName="Number of words to attempt to keep", description="The number of words (per class if there is a class attribute assigned) to attempt to keep.", commandLineParamName="W", commandLineParamSynopsis="-W <integer>", displayOrder=15) public void setWordsToKeep(int newWordsToKeep)
      Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
      newWordsToKeep - the target number of words in the output vector (per class if assigned).
    • getMinTermFreq

      public int getMinTermFreq()
      Gets the minimum term frequency used when pruning the dictionary.
      the minimum term frequency.
    • setMinTermFreq

      @OptionMetadata(displayName="Minimum term frequency", description="The minimum term frequency to use when pruning the dictionary\n(default = 1).", commandLineParamName="M", commandLineParamSynopsis="-M <integer>", displayOrder=16) public void setMinTermFreq(int newMinTermFreq)
      Sets the minimum term frequency to use when pruning the dictionary (default = 1).
      newMinTermFreq - the new minimum term frequency.
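The words-to-keep and minimum-term-frequency settings both limit the final dictionary. The sketch below applies them in sequence: filter by frequency, then keep the most frequent terms. It uses a hard cut-off with alphabetical tie-breaking, which is an assumption; the option text only promises to "attempt to keep" the requested number, so the library's handling of ties may differ. `FinalPruneSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Weka.
final class FinalPruneSketch {

  /** Keeps at most wordsToKeep terms whose count is at least minTermFreq. */
  static Map<String, Integer> prune(Map<String, Integer> counts, int wordsToKeep, int minTermFreq) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minTermFreq) {
        entries.add(e);
      }
    }
    // Highest count first; ties broken alphabetically so the result is deterministic.
    entries.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    Map<String, Integer> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : entries) {
      if (kept.size() >= wordsToKeep) {
        break;
      }
      kept.put(e.getKey(), e.getValue());
    }
    return kept;
  }
}
```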
    • getDoNotOperateOnPerClassBasis

      public boolean getDoNotOperateOnPerClassBasis()
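      Gets whether the maximum number of words and the minimum term frequency are enforced over the documents in all classes rather than per class.
      true if the limits are not enforced on a per-class basis.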
    • setDoNotOperateOnPerClassBasis

      @OptionMetadata(displayName="Do not operate on a per-class basis", description="If this is set, the maximum number of words and the\nminimum term frequency is not enforced on a per-class\nbasis but based on the documents in all the classes\n(even if a class attribute is set).", commandLineParamName="O", commandLineParamSynopsis="-O", commandLineParamIsFlag=true, displayOrder=17) public void setDoNotOperateOnPerClassBasis(boolean newDoNotOperateOnPerClassBasis)
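      Sets whether the maximum number of words and the minimum term frequency are enforced over the documents in all classes rather than per class (even if a class attribute is set).
      newDoNotOperateOnPerClassBasis - true to not enforce the limits on a per-class basis.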
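One way to picture the per-class setting is as a choice of how term counts are bucketed before the limits are applied. In the sketch below, counts are kept in one bucket per class label unless the flag is set, in which case a single shared bucket is used. The class `PerClassCountsSketch` and the `*` bucket name are hypothetical, and Weka's internal bookkeeping may be organised differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Weka.
final class PerClassCountsSketch {
  static final String ALL_CLASSES = "*"; // assumed name for the shared bucket

  private final boolean doNotOperateOnPerClassBasis;
  private final Map<String, Map<String, Integer>> buckets = new LinkedHashMap<>();

  PerClassCountsSketch(boolean doNotOperateOnPerClassBasis) {
    this.doNotOperateOnPerClassBasis = doNotOperateOnPerClassBasis;
  }

  /** Counts the tokens of one document in its class bucket, or in the shared bucket. */
  void add(String classLabel, String... tokens) {
    String bucket = doNotOperateOnPerClassBasis ? ALL_CLASSES : classLabel;
    Map<String, Integer> counts = buckets.computeIfAbsent(bucket, k -> new LinkedHashMap<>());
    for (String t : tokens) {
      counts.merge(t, 1, Integer::sum);
    }
  }

  Map<String, Map<String, Integer>> buckets() {
    return buckets;
  }
}
```

Limits such as the number of words to keep would then be applied to each bucket separately, so the per-class mode can retain more distinct words overall than the shared mode.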
    • setKeepDictionarySorted

      @OptionMetadata(displayName="Sort dictionary", description="Sort the dictionary alphabetically", commandLineParamName="sort", commandLineParamSynopsis="-sort", commandLineParamIsFlag=true, displayOrder=18) public void setKeepDictionarySorted(boolean sorted)
      Set whether to keep the dictionary sorted alphabetically or not
      sorted - true to keep the dictionary sorted
    • getKeepDictionarySorted

      public boolean getKeepDictionarySorted()
      Get whether to keep the dictionary sorted alphabetically or not
      true if the dictionary is kept sorted
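Keeping a dictionary sorted is typically a matter of choosing an ordered map. The sketch below contrasts a `TreeMap`, which iterates in key order, with a `LinkedHashMap`, which iterates in insertion order. Whether Weka uses these particular classes is not stated here; `SortedDictionarySketch` is a hypothetical illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Weka.
final class SortedDictionarySketch {

  /** Returns a key-ordered map when sorted is true, an insertion-ordered map otherwise. */
  static Map<String, Integer> newDictionary(boolean sorted) {
    if (sorted) {
      // Natural String order compares UTF-16 code units, so "Apple" sorts before "fig".
      return new TreeMap<String, Integer>();
    }
    return new LinkedHashMap<String, Integer>();
  }
}
```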
    • getCapabilities

      public Capabilities getCapabilities()
      Returns the Capabilities of this saver.
      Specified by:
      getCapabilities in interface CapabilitiesHandler
      getCapabilities in class AbstractSaver
      the capabilities of this object
    • getFileDescription

      public String getFileDescription()
      Description copied from class: AbstractFileSaver
      To be overridden.
      Specified by:
      getFileDescription in interface FileSourcedConverter
      Specified by:
      getFileDescription in class AbstractFileSaver
      the file type description.
    • writeIncremental

      public void writeIncremental(Instance inst) throws IOException
      Description copied from class: AbstractSaver
      Method for incremental saving. Standard behaviour: no incremental saving is possible, therefore throw an IOException. An incremental saving process is stopped by calling this method with null.
      Specified by:
      writeIncremental in interface Saver
      writeIncremental in class AbstractSaver
      inst - the instance to be saved
      IOException - if the instance cannot be written to the specified destination
    • writeBatch

      public void writeBatch() throws IOException
      Description copied from class: AbstractSaver
      Writes to a file in batch mode. To be overridden.
      Specified by:
      writeBatch in interface Saver
      Specified by:
      writeBatch in class AbstractSaver
      IOException - if writing is not possible
    • resetOptions

      public void resetOptions()
      Description copied from class: AbstractFileSaver
      resets the options
      resetOptions in class AbstractFileSaver
    • resetWriter

      public void resetWriter()
      Description copied from class: AbstractFileSaver
      Sets the writer to null.
      resetWriter in class AbstractFileSaver
    • setDestination

      public void setDestination(OutputStream output) throws IOException
      Description copied from class: AbstractFileSaver
      Sets the destination output stream.
      Specified by:
      setDestination in interface Saver
      setDestination in class AbstractFileSaver
      output - the output stream.
      IOException - if the destination cannot be set
    • getRevision

      public String getRevision()
      Description copied from interface: RevisionHandler
      Returns the revision string.
      Specified by:
      getRevision in interface RevisionHandler
      the revision
    • main

      public static void main(String[] args)