
When I compare two strings, Python reports that they are not equal even though they look identical.

I am extracting text from two PDFs. The extracted text looks the same, but I can see a slight font difference in one of the strings. Why do they compare as unequal?

str1 = 'Confirmations'

str2 = 'Confirmations'

str1 == str2

False

3 Answers


You need to compare the normalized forms of the strings to ignore irrelevant typographical differences.

For example:

In [59]: import unicodedata

In [60]: str1 = 'Confirmations'

In [61]: str2 = 'Confirmations'

In [62]: str1 == str2
Out[62]: False

In [63]: unicodedata.normalize('NFKD', str1) == unicodedata.normalize('NFKD', str2)
Out[63]: True
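
If you need this comparison in more than one place, it could be wrapped in a small helper. This is only a sketch: the name `same_text` and the default of NFKC are my own choices, not something from the question. Any of the compatibility forms (NFKC or NFKD) expands the ligature, so pick whichever suits the rest of your pipeline.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFKC") -> bool:
    """Compare two strings after Unicode compatibility normalization.

    `same_text` is a hypothetical helper name; `form` is passed straight
    to unicodedata.normalize.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The ligature is written as an escape so it cannot be lost in copy/paste.
with_ligature = "Con\ufb01rmations"  # contains U+FB01 LATIN SMALL LIGATURE FI
plain = "Confirmations"              # contains the two letters "f" and "i"

print(with_ligature == plain)           # False
print(same_text(with_ligature, plain))  # True
```

Note that compatibility normalization also folds other presentation variants (full-width letters, superscript digits, and so on), which may or may not be what you want.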

The problem is that the "fi" in the first string is a single ligature character (U+FB01, see https://en.wikipedia.org/wiki/Typographic_ligature), while in the second string it is the two separate characters "f" and "i".

You can use a function that checks whether the ligature is present and replaces it with plain text:

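def ligature(string):
    # '\ufb01' is the single-character "fi" ligature (U+FB01)
    if '\ufb01' in string:
        # str.replace returns a new string, so the result must be assigned
        string = string.replace('\ufb01', 'fi')
    return string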

You can also add more if statements for other ligatures if you find more in your text.
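
If several ligatures turn up, a lookup table may scale better than a chain of if statements. The sketch below is one possible way to do it, assuming only the common Latin ligatures from the Unicode "Alphabetic Presentation Forms" block matter; the names `LIGATURES` and `expand_ligatures` are hypothetical.

```python
# Map each single-character ligature to its plain-letter expansion.
# Escapes are used so the ligature characters survive copy/paste.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values may be strings of any length.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return `text` with the ligatures listed above expanded."""
    return text.translate(_LIGATURE_TABLE)


print(expand_ligatures("Con\ufb01rmations") == "Confirmations")  # True
```

Unlike normalization, this touches only the characters listed in the table, so everything else in the extracted text is left as it is.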


Using the difflib library, you can see that there is a visible difference between the strings you want to compare. To check it yourself, try the following:

>>> import difflib
>>> str2 = 'Confirmations'
>>> str1 = 'Confirmations'
>>> print('\n'.join(difflib.ndiff([str1], [str2])))

which yields

- Confirmations
?    ^

+ Confirmations
?    ^^

>>>
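
The diff shows where the strings differ but not which characters are involved. To see that, you could print the code point and Unicode name of each character. This is a rough sketch; `describe` and `first_difference` are hypothetical helper names.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))


str1 = "Con\ufb01rmations"  # ligature written as an escape
str2 = "Confirmations"

i = first_difference(str1, str2)
print(i)                   # 3
print(describe(str1[i]))   # [('U+FB01', 'LATIN SMALL LIGATURE FI')]
print(describe(str2[i]))   # [('U+0066', 'LATIN SMALL LETTER F')]
```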
