Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arabic support #160

Comments

Copy link

Hey, I'm trying to use fst with arabic, with levenshtein, and it seems to not work at all... The code I'm working with:

use fst::{automaton::Levenshtein, IntoStreamer, Map};

fn main() {
    let mut data = vec![("مصر", 1)];
    data.sort();

    // A convenient way to create sets in memory.
    let map = Map::from_iter(data)
        .unwrap();
    // Build our fuzzy query.
    let lev = Levenshtein::new("مصر", 2).unwrap();

    // Apply our fuzzy query to the set we built.
    let stream = map.search(lev).into_stream();
    let keys = stream.into_stream().into_str_keys().unwrap();
    println!("keys: {keys:#?}");
}

I've enabled the utf8-ranges and levenshtein features on the crate.

Copy link
Author

Looking at #38 it seems the library has issues with utf8 searching as a whole ☹️

@BurntSushi
Copy link
Owner

#38 is a tough bug to follow, and I'm not sure there is even a solid reproduction there yet? In any case, generalizing from "the levenshtein implementation has some bugs in places" to the entire library seems pretty egregious to me. The vast majority of this library doesn't care about UTF-8 at all. It's just bytes. Levenshtein tries to do the somewhat-correct thing of measuring edit distance in terms of the number of codepoints. There may indeed be bugs there. I personally do not have any time in the short term to look into those bugs. The Levenshtein code was something I whipped together several years ago, and it has not seen much love since then.

But for this one, it looks like it's a nice small reproduction. Thank you for that. I can indeed reproduce it.

Folks are welcome to take a dive into the Levenshtein automata code and see if they can fix it. PRs are welcome. #38 appears to have some useful suggestions.

@BurntSushi BurntSushi added the bug label Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant