Class ImmutableStringTrie

  • All Implemented Interfaces:
    java.util.function.ToIntFunction<java.lang.String>

    final class ImmutableStringTrie
    extends java.lang.Object
    implements java.util.function.ToIntFunction<java.lang.String>
    Immutable space-efficient trie that provides a quick lookup index for a sorted set of non empty strings. It assumes only those strings will be queried and therefore may produce false-positive results for strings not in the array.

    Each node of the tree is represented as a series of chars using this layout:

     +---------------------------------+
     | number of branches              |
     +---------------------------------+---------------------------------+----
     | char for branch 0               | char for branch 1               | ...
     +---------------------------------+---------------------------------+----
     | key-delta/leaf/bud for branch 0 | key-delta/leaf/bud for branch 1 | ...
     +---------------------------------+---------------------------------+----
     | offset to jump to branch 1      | offset to jump to branch 2      | ...
     +---------------------------------+---------------------------------+----
     
    Each node is immediately followed by its child nodes according to branch order.

    The key-delta is used to skip over a section of the input key when we know it should always match given the recently matched char (assumes only strings from the original list are queried).

    Leaves mark a definite end of the match, while buds mark a potential end which could continue down the trie if there are more characters to match. The key-delta for buds is implicitly 1.

    The jump for branch 0 is assumed to be 0 and is always ommitted, that is any continuation of the trie for branch 0 immediately follows the current node. The entire jump section is omitted when all the branches from a node are leaves.

    Simple example trie with 2 strings "getValue" and "setValue":

     +---+---+---+--------+--------+
     | 2 | g | s | 0x8000 | 0x8001 |
     +---+---+---+--------+--------+
     
    In this case the first character is enough to determine the index result.

    Example of a trie with a 'bud' that contains 2 strings "getName" and "getNameAndValue":

     +---+---+---+---+---+--------+---+---+--------+
     | 1 | g | 6 | 1 | e | 0x4000 | 1 | A | 0x8001 |
     +---+---+---+---+---+--------+---+---+--------+
     
    After matching 'g' we skip to the end of 'getName' before checking if there are any more characters to match.

    More complex example with 3 strings "getName", "getValue", "getVersion":

     +---+---+---+---+---+---+--------+---+---+---+---+---+--------+--------+
     | 1 | g | 3 | 2 | N | V | 0x8000 | 1 | 0 | 2 | a | e | 0x8001 | 0x8002 |
     +---+---+---+---+---+---+--------+---+---+---+---+---+--------+--------+
     
    After matching 'g' we skip past the 'get'. If the next character is 'N' we know this is 'getName' otherwise we skip over the 'V' and jump to the last check between '...alue' and '...ersion'.
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      private static class  ImmutableStringTrie.Overflow
      Immutable trie that delegates searches that lie outside its range to an overflow trie.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static char BUD_MARKER
      Marks a 'bud' in the tree; the same as a leaf except the trie continues beneath it.
      private static char LEAF_MARKER
      Marks a leaf in the trie, where the rest of the bits are the index to be returned.
      private static int MAX_ROWS_PER_TRIE
      Maximum number of rows that can be indexed by a single trie.
      private char[] trie
      The compressed trie.
    • Constructor Summary

      Constructors 
      Constructor Description
      ImmutableStringTrie​(char[] data)  
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      int applyAsInt​(java.lang.String key)
      Returns the index assigned in the trie to the given string.
      private static void buildSubTrie​(java.lang.StringBuilder buf, java.lang.String[] table, int column, int row, int rowLimit)
      Recursively builds a trie for a slice of rows at a particular column.
      private static java.util.function.ToIntFunction<java.lang.String> buildTrie​(java.lang.StringBuilder buf, java.lang.String[] table, int row, int rowLimit)
      Builds a trie, overflowing to additional tries if there are too many rows
      static java.util.function.ToIntFunction<java.lang.String> buildTrie​(java.util.Collection<java.lang.String> table)
      Builds an immutable trie that indexes the given table of strings.
      private static int nextPivotColumn​(java.lang.String[] table, int column, int row, int rowLimit)
      Finds the next column in the current row whose character differs in at least one other row.
      private static int nextPivotRow​(java.lang.String[] table, char pivot, int column, int row, int rowLimit)
      Finds the next row that has a different character in the selected column to the given one, or is too short to include the column.
      private static int singletonTrie​(java.lang.String key)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LEAF_MARKER

        private static final char LEAF_MARKER
        Marks a leaf in the trie, where the rest of the bits are the index to be returned.
        See Also:
        Constant Field Values
      • BUD_MARKER

        private static final char BUD_MARKER
        Marks a 'bud' in the tree; the same as a leaf except the trie continues beneath it.
        See Also:
        Constant Field Values
      • MAX_ROWS_PER_TRIE

        private static final int MAX_ROWS_PER_TRIE
        Maximum number of rows that can be indexed by a single trie.
        See Also:
        Constant Field Values
      • trie

        private final char[] trie
        The compressed trie.
    • Constructor Detail

      • ImmutableStringTrie

        ImmutableStringTrie​(char[] data)
    • Method Detail

      • singletonTrie

        private static int singletonTrie​(java.lang.String key)
      • applyAsInt

        public int applyAsInt​(java.lang.String key)
        Returns the index assigned in the trie to the given string.

        Note: a return value of -1 means the string is definitely not in the trie, but a non-negative index may be returned for strings that closely match those in the trie. This is acceptable because we will only call this method with strings that we know exist in the trie.

        Specified by:
        applyAsInt in interface java.util.function.ToIntFunction<java.lang.String>
      • buildTrie

        public static java.util.function.ToIntFunction<java.lang.String> buildTrie​(java.util.Collection<java.lang.String> table)
        Builds an immutable trie that indexes the given table of strings.

        The table of strings must be sorted in lexical order.

      • buildTrie

        private static java.util.function.ToIntFunction<java.lang.String> buildTrie​(java.lang.StringBuilder buf,
                                                                                    java.lang.String[] table,
                                                                                    int row,
                                                                                    int rowLimit)
        Builds a trie, overflowing to additional tries if there are too many rows
      • buildSubTrie

        private static void buildSubTrie​(java.lang.StringBuilder buf,
                                         java.lang.String[] table,
                                         int column,
                                         int row,
                                         int rowLimit)
        Recursively builds a trie for a slice of rows at a particular column.
      • nextPivotRow

        private static int nextPivotRow​(java.lang.String[] table,
                                        char pivot,
                                        int column,
                                        int row,
                                        int rowLimit)
        Finds the next row that has a different character in the selected column to the given one, or is too short to include the column. This determines the span of rows that fall under the given character in the trie.

        Returns the row just after the end of the range if all rows have the same character.

      • nextPivotColumn

        private static int nextPivotColumn​(java.lang.String[] table,
                                           int column,
                                           int row,
                                           int rowLimit)
        Finds the next column in the current row whose character differs in at least one other row. This helps identify the longest common prefix from the current pivot point to the next one.

        Returns the column just after the end of the current row if all rows are identical.