Index

A library for generating Finite State Transducers based on Levenshtein Automata.

Levenshtein transducers accept a query term and return all terms in a dictionary that are within n spelling errors of it. They constitute a highly efficient (in both space and time) class of spelling correctors that work very well when you do not require context while making suggestions. Forget about performing a linear scan over your dictionary, computing a quadratic Levenshtein or Damerau-Levenshtein distance against every term, just to find those that are sufficiently close to the user's query: these transducers find all matching terms from your dictionary in time linear in the length of the query term (not in the size of the dictionary).
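For contrast, the quadratic dynamic-programming computation referred to above can be sketched as follows. This is a reference implementation of the baseline, not part of liblevenshtein's API; a naive spelling corrector would have to run it once per dictionary term.

```java
/**
 * The classic O(|v| * |w|) dynamic-programming Levenshtein distance,
 * kept to a single rolling row for O(|w|) space.
 */
public final class QuadraticLevenshtein {

  public static int distance(final String v, final String w) {
    final int[] row = new int[w.length() + 1];
    // Distance from the empty prefix of v to each prefix of w.
    for (int j = 0; j <= w.length(); j += 1) {
      row[j] = j;
    }
    for (int i = 1; i <= v.length(); i += 1) {
      int diagonal = row[0]; // d(i-1, j-1)
      row[0] = i;            // d(i, 0): delete the first i characters of v
      for (int j = 1; j <= w.length(); j += 1) {
        final int above = row[j]; // d(i-1, j)
        final int cost = v.charAt(i - 1) == w.charAt(j - 1) ? 0 : 1;
        row[j] = Math.min(
          Math.min(row[j - 1] + 1,  // insertion
                   above + 1),      // deletion
          diagonal + cost);         // match or substitution
        diagonal = above;
      }
    }
    return row[w.length()];
  }

  public static void main(final String... args) {
    System.out.println(distance("foo", "for")); // 1
    System.out.println(distance("bar", "be"));  // 2
  }
}
```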

If you need context, then take the candidates generated by the transducer as a starting place, and plug them into whatever model you're using for context (such as by selecting the sequence of terms that have the greatest probability of appearing together).
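As a concrete, deliberately simplistic illustration of that idea, the sketch below reranks candidates against a unigram-frequency table. The Candidate class, the table, and the scoring rule are all hypothetical stand-ins for whatever context model you actually use; only the overall shape (transduce, then rerank) comes from the text above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal reranking sketch: prefer the closest candidate, breaking ties
 * with corpus frequency.  A real model would score sequences of terms.
 */
public final class CandidateReranker {

  /** A (term, distance) pair, mirroring what the transducer emits. */
  static final class Candidate {
    final String term;
    final int distance;
    Candidate(final String term, final int distance) {
      this.term = term;
      this.distance = distance;
    }
  }

  /** Hypothetical unigram counts; a real model is estimated from a corpus. */
  private static final Map<String, Integer> UNIGRAM_COUNTS = new HashMap<>();
  static {
    UNIGRAM_COUNTS.put("for", 5_000);
    UNIGRAM_COUNTS.put("not", 4_000);
    UNIGRAM_COUNTS.put("you", 3_000);
  }

  /** Smaller edit distance wins; ties go to the more frequent term. */
  public static String best(final List<Candidate> candidates) {
    Candidate best = null;
    for (final Candidate c : candidates) {
      if (best == null
          || c.distance < best.distance
          || (c.distance == best.distance
              && UNIGRAM_COUNTS.getOrDefault(c.term, 0)
                 > UNIGRAM_COUNTS.getOrDefault(best.term, 0))) {
        best = c;
      }
    }
    return best == null ? null : best.term;
  }

  public static void main(final String... args) {
    final List<Candidate> candidates = new ArrayList<>();
    candidates.add(new Candidate("not", 2));
    candidates.add(new Candidate("for", 1));
    candidates.add(new Candidate("you", 2));
    System.out.println(best(candidates)); // for
  }
}
```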

For a quick demonstration, please visit the GitHub Page. There's also a command-line interface, liblevenshtein-java-cli; see its README.md for acquisition and usage information.

Below, you will find instructions for how to clone universal-automata/liblevenshtein-java and check out its submodules, which include the source code for all supported languages:

Installation

Latest, Development Release

Add a Maven dependency on Artifactory. For example, in a Gradle project, you would modify your repositories as follows:

repositories {
  maven {
    url 'https://oss.jfrog.org/artifactory/oss-release-local'
  }
}
Latest, Stable Release

Add a Maven dependency on one of the following:

Maven
<dependency>
  <groupId>com.github.universal-automata</groupId>
  <artifactId>liblevenshtein</artifactId>
  <version>3.0.0</version>
</dependency>
Apache Buildr
'com.github.universal-automata:liblevenshtein:jar:3.0.0'
Apache Ivy
<dependency org="com.github.universal-automata" name="liblevenshtein" rev="3.0.0" />
Groovy Grape
@Grapes(
  @Grab(group='com.github.universal-automata', module='liblevenshtein', version='3.0.0')
)
Gradle / Grails
compile 'com.github.universal-automata:liblevenshtein:3.0.0'
Scala SBT
libraryDependencies += "com.github.universal-automata" % "liblevenshtein" % "3.0.0"
Leiningen
[com.github.universal-automata/liblevenshtein "3.0.0"]
Git
% git clone --progress git@github.com:universal-automata/liblevenshtein-java.git
Cloning into 'liblevenshtein-java'...
remote: Counting objects: 8120, done.        
remote: Compressing objects: 100% (475/475), done.        
remote: Total 8120 (delta 354), reused 0 (delta 0), pack-reused 7619        
Receiving objects: 100% (8120/8120), 5.52 MiB | 213.00 KiB/s, done.
Resolving deltas: 100% (5368/5368), done.
Checking connectivity... done.

% cd liblevenshtein-java
% git pull --progress
Already up-to-date.

% git fetch --progress --tags
% git checkout --progress 3.0.0
Note: checking out '3.0.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at f6e0a77... Automatically-generated documentation for version 3.0.0

% git submodule init
% git submodule update

Please see Installation for more details.

% gradle jar
[buildinfo] Not using buildInfo properties file for this build.
:extractIncludeProto
:extractProto
:generateProto
:compileJava
warning: No processor claimed any of these annotations: lombok.Setter,lombok.Getter,lombok.experimental.ExtensionMethod,lombok.NonNull,lombok.RequiredArgsConstructor,lombok.EqualsAndHashCode,lombok.Value,lombok.extern.slf4j.Slf4j,lombok.Builder,lombok.Data,lombok.ToString,lombok.AllArgsConstructor,lombok.NoArgsConstructor
1 warning

:processResources
:classes
:jar

BUILD SUCCESSFUL

Total time: 6.453 secs

This build could be faster, please consider using the Gradle Daemon: https://docs.gradle.org/2.13/userguide/gradle_daemon.html


Please see Building for more details.

% gradle clean check
[buildinfo] Not using buildInfo properties file for this build.
:clean
:extractIncludeProto
:extractProto UP-TO-DATE
:generateProto
:compileJava
warning: No processor claimed any of these annotations: lombok.Setter,lombok.Getter,lombok.experimental.ExtensionMethod,lombok.NonNull,lombok.RequiredArgsConstructor,lombok.EqualsAndHashCode,lombok.Value,lombok.extern.slf4j.Slf4j,lombok.Builder,lombok.Data,lombok.ToString,lombok.AllArgsConstructor,lombok.NoArgsConstructor
1 warning

:processResources
:classes
:extractIncludeTestProto
:extractTestProto UP-TO-DATE
:generateTestProto UP-TO-DATE
:compileTestJava
warning: No processor claimed any of these annotations: lombok.extern.slf4j.Slf4j,org.testng.annotations.BeforeTest,org.testng.annotations.DataProvider,lombok.Getter,org.testng.annotations.BeforeClass,org.testng.annotations.BeforeMethod,lombok.RequiredArgsConstructor,org.testng.annotations.Test
1 warning

:processTestResources
:testClasses
:extractIncludeIntegProto
:extractIntegProto
:generateIntegProto UP-TO-DATE
:compileIntegJava
warning: No processor claimed any of these annotations: lombok.extern.slf4j.Slf4j,org.testng.annotations.Test,org.testng.annotations.DataProvider
1 warning

:processIntegResources
:integClasses
:checkstyleInteg
:checkstyleMain[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:20:3: Cyclomatic Complexity is 20 (max allowed is 10). [CyclomaticComplexity]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:20:3: Executable statement count is 53 (max allowed is 30). [ExecutableStatementCount]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:20:3: NCSS for this method is 62 (max allowed is 50). [JavaNCSS]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:20:3: NPath Complexity is 38,400 (max allowed is 200). [NPathComplexity]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:23:9: Variable 'key' should be declared final. [FinalLocalVariable]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:42:5: Only one variable definition per line allowed. [MultipleVariableDeclarations]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:42:52: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:43:5: Only one variable definition per line allowed. [MultipleVariableDeclarations]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:43:52: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:47:26: Assignment of parameter 'v' is not allowed. [ParameterAssignment]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:47:29: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:47:49: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:48:26: Assignment of parameter 'w' is not allowed. [ParameterAssignment]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:48:29: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedMergeAndSplit.java:48:49: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:21:3: Cyclomatic Complexity is 14 (max allowed is 10). [CyclomaticComplexity]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:21:3: Executable statement count is 42 (max allowed is 30). [ExecutableStatementCount]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:21:3: NPath Complexity is 1,536 (max allowed is 200). [NPathComplexity]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:24:9: Variable 'key' should be declared final. [FinalLocalVariable]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:43:5: Only one variable definition per line allowed. [MultipleVariableDeclarations]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:43:52: Only one statement per line allowed. [OneStatementPerLine]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:44:5: Only one variable definition per line allowed. [MultipleVariableDeclarations]
[ant:checkstyle] [WARN] /private/var/folders/x6/m_093hm90hx0stv1jl83643w0000gn/T/GenerateWikidoc-7688351961692878878/liblevenshtein-java/src/main/java/com/github/liblevenshtein/distance/MemoizedStandard.java:44:52: Only one statement per line allowed. [OneStatementPerLine]
# ... TRUNCATED ...


Please see Testing for more details.

Let's say you have the following content in a plain-text file called top-20-most-common-english-words.txt (note that the file has one term per line):

the
be
to
of
and
a
in
that
have
I
it
for
not
on
with
he
as
you
do
at

The following shows one way to query its contents:

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.github.liblevenshtein.collection.dictionary.SortedDawg;
import com.github.liblevenshtein.serialization.PlainTextSerializer;
import com.github.liblevenshtein.serialization.ProtobufSerializer;
import com.github.liblevenshtein.serialization.Serializer;
import com.github.liblevenshtein.transducer.Algorithm;
import com.github.liblevenshtein.transducer.Candidate;
import com.github.liblevenshtein.transducer.ITransducer;
import com.github.liblevenshtein.transducer.factory.TransducerBuilder;

// ...

final SortedDawg dictionary;
final Path dictionaryPath =
  Paths.get("/path/to/top-20-most-common-english-words.txt");
try (final InputStream stream = Files.newInputStream(dictionaryPath)) {
  // The PlainTextSerializer constructor accepts an optional boolean specifying
  // whether the dictionary is already sorted lexicographically, in ascending
  // order.  If it is sorted, then passing true will optimize the construction
  // of the dictionary; you may pass false whether the dictionary is sorted or
  // not (this is the default and safest behavior if you don't know whether the
  // dictionary is sorted).
  final Serializer serializer = new PlainTextSerializer(false);
  dictionary = serializer.deserialize(SortedDawg.class, stream);
}

final ITransducer<Candidate> transducer = new TransducerBuilder()
  .dictionary(dictionary)
  .algorithm(Algorithm.TRANSPOSITION)
  .defaultMaxDistance(2)
  .includeDistance(true)
  .build();

for (final String queryTerm : new String[] {"foo", "bar"}) {
  System.out.println(
    "+-------------------------------------------------------------------------------");
  System.out.printf("| Spelling Candidates for Query Term: \"%s\"%n", queryTerm);
  System.out.println(
    "+-------------------------------------------------------------------------------");
  for (final Candidate candidate : transducer.transduce(queryTerm)) {
    System.out.printf("| d(\"%s\", \"%s\") = [%d]%n",
      queryTerm,
      candidate.term(),
      candidate.distance());
  }
}

// +-------------------------------------------------------------------------------
// | Spelling Candidates for Query Term: "foo"
// +-------------------------------------------------------------------------------
// | d("foo", "do") = [2]
// | d("foo", "of") = [2]
// | d("foo", "on") = [2]
// | d("foo", "to") = [2]
// | d("foo", "for") = [1]
// | d("foo", "not") = [2]
// | d("foo", "you") = [2]
// +-------------------------------------------------------------------------------
// | Spelling Candidates for Query Term: "bar"
// +-------------------------------------------------------------------------------
// | d("bar", "a") = [2]
// | d("bar", "as") = [2]
// | d("bar", "at") = [2]
// | d("bar", "be") = [2]
// | d("bar", "for") = [2]

// ...

If you want to serialize your dictionary to a format that's easy to read later, do the following:

final Path serializedDictionaryPath =
  Paths.get("/path/to/top-20-most-common-english-words.protobuf.bytes");
try (final OutputStream stream = Files.newOutputStream(serializedDictionaryPath)) {
  final Serializer serializer = new ProtobufSerializer();
  serializer.serialize(dictionary, stream);
}

Then, you can read the dictionary later, in much the same way you read the plain text version:

final SortedDawg deserializedDictionary;
try (final InputStream stream = Files.newInputStream(serializedDictionaryPath)) {
  final Serializer serializer = new ProtobufSerializer();
  deserializedDictionary = serializer.deserialize(SortedDawg.class, stream);
}

Serialization is not restricted to dictionaries; you may also (de)serialize transducers.

Please see Usage for more details.

History

Based largely on the work of Stoyan Mihov, Klaus Schulz, and Petar Nikolaev Mitankin, this library generates Levenshtein transducers using nothing more than an input list of dictionary terms. The referenced literature includes: "Fast String Correction with Levenshtein-Automata", which defines the algorithm used to generate the Levenshtein automata; "Universal Levenshtein Automata. Building and Properties", which provides many mathematical proofs that helped me understand the algorithm and supplies the recursive definitions upon which I based my distance functions; and "Incremental Construction of Minimal Acyclic Finite-State Automata", which defines a linear-time algorithm for constructing Minimal Acyclic Finite-State Automata (MA-FSAs, also known as DAWGs: Directed Acyclic Word Graphs), which I use to store the dictionary of terms.

Upon construction, the list of dictionary terms is indexed as an MA-FSA and a transducer is initialized according to the maximum edit distance and algorithm provided. When queried against, the states of the Levenshtein automaton corresponding to the query term, maximum edit distance, and algorithm specified are constructed on-demand (lazily) as the automaton is evaluated, as described by the paper, "Fast String Correction with Levenshtein-Automata". The Levenshtein automaton is intersected with the dictionary MA-FSA, and every string accepted by both automata is emitted in a list of correction candidates for the query term.

In contrast to many other Levenshtein automata-based algorithms, the entire Levenshtein automaton needn't be constructed prior to evaluation, and only those states of the automaton which are actually required are derived, as needed, thereby greatly improving the efficiency of the transducer in terms of both space and time complexity.
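To make the intersection concrete, the self-contained sketch below walks a trie of the dictionary while deriving one row of the Levenshtein DP table per visited node, pruning any branch whose best-case distance already exceeds the maximum. It follows Steve Hanov's trie technique (referenced below) rather than the papers' characteristic-vector construction, so it illustrates the lazy-intersection idea, not the algorithm this library actually implements:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Illustration only: intersects a dictionary trie with the Levenshtein
 * distance computation by deriving one DP row per trie node, lazily, and
 * pruning sub-tries that cannot yield a match within maxDistance.
 */
public final class TrieLevenshtein {

  /** One trie node; TreeMap keeps traversal (and output) lexicographic. */
  static final class Node {
    final TreeMap<Character, Node> children = new TreeMap<>();
    boolean isTerm;
  }

  private final Node root = new Node();

  public void add(final String term) {
    Node node = root;
    for (int i = 0; i < term.length(); i += 1) {
      node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
    }
    node.isTerm = true;
  }

  public List<String> transduce(final String query, final int maxDistance) {
    final List<String> matches = new ArrayList<>();
    // Row 0 of the DP table: distance from the empty prefix to each query prefix.
    final int[] firstRow = new int[query.length() + 1];
    for (int j = 0; j <= query.length(); j += 1) {
      firstRow[j] = j;
    }
    if (root.isTerm && query.length() <= maxDistance) {
      matches.add("");
    }
    for (final Map.Entry<Character, Node> child : root.children.entrySet()) {
      search(child.getValue(), child.getKey(), String.valueOf(child.getKey()),
             firstRow, query, maxDistance, matches);
    }
    return matches;
  }

  private void search(final Node node, final char edge, final String prefix,
                      final int[] previousRow, final String query,
                      final int maxDistance, final List<String> matches) {
    final int columns = query.length() + 1;
    final int[] currentRow = new int[columns];
    currentRow[0] = previousRow[0] + 1; // distance from prefix to empty query
    int rowMin = currentRow[0];
    for (int j = 1; j < columns; j += 1) {
      final int cost = query.charAt(j - 1) == edge ? 0 : 1;
      currentRow[j] = Math.min(
        Math.min(currentRow[j - 1] + 1,   // insertion
                 previousRow[j] + 1),     // deletion
        previousRow[j - 1] + cost);       // match or substitution
      rowMin = Math.min(rowMin, currentRow[j]);
    }
    if (node.isTerm && currentRow[columns - 1] <= maxDistance) {
      matches.add(prefix);
    }
    // The row minimum never decreases down a branch, so hopeless ones may
    // be pruned without ever materializing their states.
    if (rowMin <= maxDistance) {
      for (final Map.Entry<Character, Node> child : node.children.entrySet()) {
        search(child.getValue(), child.getKey(), prefix + child.getKey(),
               currentRow, query, maxDistance, matches);
      }
    }
  }

  public static void main(final String... args) {
    final TrieLevenshtein dictionary = new TrieLevenshtein();
    for (final String term
         : new String[] {"do", "for", "not", "of", "on", "to", "you"}) {
      dictionary.add(term);
    }
    System.out.println(dictionary.transduce("foo", 2));
    // prints: [do, for, not, of, on, to, you]
  }
}
```

The real transducer improves on this sketch by operating over a minimized DAWG instead of a trie and by using the parametric states from the paper, but the pruning behavior is analogous.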

The well-known blog post, "Damn Cool Algorithms: Levenshtein Automata", provided me with a good starting point for this transducer, but the algorithm proposed therein was too inefficient for my needs. It did, however, reference the paper "Fast String Correction with Levenshtein-Automata", which I ultimately used as the basis of the Levenshtein automata. The same paper also serves as the basis of the Levenshtein automata used by the Apache Lucene and Solr projects (Lucene's FuzzyQuery is 100 times faster in 4.0), which are in turn based on Moman, a project by Jean-Philippe Barrette-LaPierre.

Steve Hanov pointed me to the paper "Incremental Construction of Minimal Acyclic Finite-State Automata" in his blog post "Compressing dictionaries with a DAWG". Another of Steve's posts also made an impact on me, namely "Fast and Easy Levenshtein distance using a Trie".

I've become heavily involved with the online movement around MOOCs (Massive Open Online Courses), and the following courses taught me much of the material within this library:

  1. Automata
  2. Compilers
  3. Natural Language Processing

Finally, I must credit the course which first introduced me to the field of Information Retrieval, "Mathematical Applications in Search Engine Design", taught by Dr. Rao Li at the University of South Carolina Aiken. I highly recommend that course if you are in the area.

liblevenshtein-java is maintained by @dylon ([email protected])