I think this is a bug in the Lucene ASCII folding filter when preserve_original is enabled: if a token contains a non-ASCII char (>= 0x80), the token is emitted twice, even if folding leaves it unchanged.
Problems are:
- frequencies for such terms are doubled
- we store an extra position in the postings for each of them
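
To make the doubling concrete, here is a rough Python model of the behavior as I understand it. This is not the Lucene code: `fold` is a stand-in that only strips combining marks, the function names are made up for illustration, and tokens are modelled as (term, position increment) pairs.

```python
import unicodedata
from collections import Counter


def fold(term):
    """Stand-in for ASCII folding (assumption): NFKD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_preserving_buggy(terms):
    """Model of the reported behavior: the original is re-emitted whenever the
    term contains a char >= 0x80, without checking whether folding changed it."""
    out = []
    for term in terms:
        out.append((fold(term), 1))  # folded term at a new position
        if any(ord(ch) >= 0x80 for ch in term):
            out.append((term, 0))    # "original" stacked at the same position
    return out


def term_frequencies(tokens):
    """Count how many times each term was emitted."""
    return Counter(term for term, _ in tokens)


terms = ["слово", "caf\u00e9", "word"]
buggy = fold_preserving_buggy(terms)
freqs = term_frequencies(buggy)
```

In this model the Cyrillic term comes out twice although folding did not change it, while the accented Latin term correctly yields two distinct terms and the plain ASCII term yields one.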
It's really hard to evaluate the impact on scoring and index size, but this problem mostly affects non-Latin wikis, where nearly all words will be duplicated.
I'll try to get this fixed upstream, but I believe we should probably also work around it on our side: instead of using the preserve_original option on asciifolding, we could use the generic preserve_original filter that was added for ICU folding in the extra plugin.
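
The idea behind a generic preserve-original step can be sketched as below. This is only an illustration in Python, not the extra plugin's implementation: `with_preserved_original` and `strip_marks` are hypothetical names, and `strip_marks` is a simplified stand-in for whatever folding filter is wrapped.

```python
import unicodedata


def strip_marks(term):
    """Simplified stand-in for a folding filter (assumption, not ICU or ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_preserved_original(terms, fold_fn):
    """Apply any folding function and keep the original term only when it differs.

    Yields (term, position_increment) pairs; a preserved original gets an
    increment of 0 so it shares the position of its folded form.
    """
    for term in terms:
        folded = fold_fn(term)
        yield folded, 1
        if folded != term:
            yield term, 0


tokens = list(with_preserved_original(["слово", "caf\u00e9"], strip_marks))
```

Because the comparison happens outside the folding function, the same wrapper works for any folding step, and an unchanged term is emitted only once.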