I have two XML files. In the first one ("aix_xml_raw") the names of persons are already tagged with persName. The second one ("wien_xml_raw") contains the same text, but the spelling differs and there are also some new text passages. I want to find the values of all persName elements from the first document in the second one with a fuzzy search (e.g. "mr. l Conte de Sle" from the first document should also match "mr. le C. de Sli." in the second document) and tag them there with persName. I have two solutions, and both of them run if I remove the if-condition that does the fuzzy comparison. Why doesn't it work with the fuzzy match?


aix_xml_raw = """
    <doc><p>Louërent Dieu, mais <persName>mr. l C<ex>onte</ex> de Sle.</persName> ne voulut pas
        appliquer mon voyage a mon avantage il crut que cela ressembloit fort a l’avanture,
        et que la peur de me confesser au <persName>RP. Br.</persName> m’avoit fait aller a 
        <placeName>Hilzing</placeName> 
        Je ne m’excuse point, laissant au jugement de ceux qui liront ces lignes </p>
        <gap/> 
        <p>Votre Saint Nom en soit béni, loué et glorifié. Amen.</p></doc>
        """
wien_xml_raw = """
    <doc>
    <line>louerent Dieu, mais mr. le C. de Sli. ne voulut  pas appliquer mon voyage a mon avantage 
    il crut que cela rasembloit fort a l’avanture, et que la peur de me confesser au RP. Br.
    m’avoit fait aller a Hitzing Je ne m’excuse pas sur ce point, 
    laissant au jugement de ceux qui liront ces ligne votre st nom en soitloué, et glorifié, amen.</line>
    </doc>
"""

Solution 1:

from bs4 import BeautifulSoup, Tag
from fuzzywuzzy import fuzz

# Parse the first document
soup1 = BeautifulSoup(aix_xml_raw, 'xml')

# Find all persName tags and extract their values
pers_names = [tag.text for tag in soup1.find_all('persName')]

# Parse the second document
soup2 = BeautifulSoup(wien_xml_raw, 'xml')

# Find all text nodes in the second document
text_nodes = soup2.find_all(text=True)

# Loop over each text node and replace fuzzy matches with tagged values
for node in text_nodes:
    for name in pers_names:
        if fuzz.token_sort_ratio(name, node.strip()) > 90:
            # Create a new persName tag and insert it before the found value
            new_tag = Tag(name='persName')
            new_tag.string = name
            node.replace_with(node.replace(name, str(new_tag)))

# Print the modified second document
print(soup2.prettify())

Solution 2:

import difflib
import xml.etree.ElementTree as ET

# define a function to get the person names from the first xml document
def get_person_names(xml_str):
    person_names = []
    root = ET.fromstring(xml_str)
    for pers_name in root.iter('persName'):
        person_names.append(pers_name.text.strip())
    return person_names

# define a function to find and tag person names in the second xml document
def tag_person_names(xml_str, person_names):
    root = ET.fromstring(xml_str)
    for line in root.iter('line'):
        tagged_line = line.text
        for name in person_names:
            # perform fuzzy string matching and tag the person names if found
            if difflib.SequenceMatcher(None, name.lower(), line.text.lower()).ratio() >= 0.8:
                tagged_line = tagged_line.replace(name, '<persName>{}</persName>'.format(name))
        line.text = tagged_line
    return ET.tostring(root, encoding='unicode')

person_names = get_person_names(aix_xml_raw)
tagged_wien_xml_raw = tag_person_names(wien_xml_raw, person_names)
print(tagged_wien_xml_raw)

1 Answer

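In solution 1, the problem is that you compare the person's name with the whole text node. The name is only a small part of that text, so the fuzzy matching score is very low and never reaches your threshold. Solution 2 has the same problem, because it compares the name with the whole of line.text. Note also that replace(name, ...) searches for the spelling from the first document, which does not occur literally in the second one, so you have to replace the matched substring instead. What you could try is to cut the node into substrings with the same number of words as the person's name being compared, then fuzzy-match each substring.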

Something like this:

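# Continues from Solution 1: fuzz, pers_names and text_nodes are defined there.
for name in pers_names:
    # number of words in the person name
    wordCount = len(name.split())

    for node in text_nodes:
        # split the node into words once, not on every loop iteration
        nodeToList = node.split()
        nodeWordCount = len(nodeToList)

        if nodeWordCount == 0:
            continue

        bestMatch = ''
        highestRatio = 0

        # slide a window of wordCount words over the node;
        # the + 1 makes sure the last window is tested as well
        for x in range(nodeWordCount - wordCount + 1):
            substringToTest = ' '.join(nodeToList[x:x + wordCount])
            ratio = fuzz.token_sort_ratio(name, substringToTest)
            if ratio > 70 and ratio > highestRatio:
                bestMatch = substringToTest
                highestRatio = ratio

        # bestMatch now holds the closest substring of this node,
        # or '' if no window scored above 70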
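The loop above only finds the best-matching substring; it does not tag it yet. Here is a minimal sketch of how the tagging step could look. To keep it runnable without extra packages I swapped in difflib.SequenceMatcher and xml.etree.ElementTree from the standard library instead of fuzzywuzzy and BeautifulSoup. The helper names best_window and tag_names, the threshold of 0.7 and the shortened sample strings are my own assumptions, and the sketch assumes that each line element contains only text and no child elements.

```python
import difflib
import re
import xml.etree.ElementTree as ET

# Shortened samples based on the question (assumption: <line> holds plain text only).
aix = ("<doc><p>mais <persName>mr. l C<ex>onte</ex> de Sle.</persName> ne voulut pas, "
       "me confesser au <persName>RP. Br.</persName> m'avoit</p></doc>")
wien = ("<doc><line>louerent Dieu, mais mr. le C. de Sli. ne voulut pas appliquer "
        "mon voyage, et que la peur de me confesser au RP. Br. m'avoit fait aller "
        "a Hitzing</line></doc>")


def best_window(name, text, threshold=0.7):
    """Return (start, end) offsets of the best fuzzy match in text, or None."""
    # Keep every word together with its character offsets in the original text.
    words = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    size = len(name.split())
    best, best_ratio = None, threshold
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        candidate = " ".join(word for word, _, _ in window)
        ratio = difflib.SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = (window[0][1], window[-1][2]), ratio
    return best


def tag_names(xml_str, names, threshold=0.7):
    """Wrap the best match for each name in a persName element."""
    root = ET.fromstring(xml_str)
    for line in root.iter("line"):
        text = line.text or ""
        # Find all matches on the untouched text first, then tag from left to right.
        spans = sorted(s for s in (best_window(n, text, threshold) for n in names) if s)
        pos, previous = 0, None
        for start, end in spans:
            if start < pos:
                continue  # skip a match that overlaps the previous one
            chunk = text[pos:start]
            if previous is None:
                line.text = chunk
            else:
                previous.tail = chunk
            previous = ET.SubElement(line, "persName")
            previous.text = text[start:end]  # keep the spelling of the second document
            pos = end
        if previous is not None:
            previous.tail = text[pos:]
    return ET.tostring(root, encoding="unicode")


# itertext() also collects text inside nested elements such as <ex>.
names = ["".join(p.itertext()).strip() for p in ET.fromstring(aix).iter("persName")]
tagged = tag_names(wien, names)
print(tagged)
```

With these samples both "mr. le C. de Sli." and "RP. Br." should end up wrapped in persName, each in the spelling of the second document. The names are read with itertext() because pers_name.text, as used in solution 2, stops at the first child element and would return only "mr. l C" for the first name. The threshold probably needs tuning on the real data, since very short names can produce false matches at a low cutoff.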
