I have two XML files. In the first one ("aix_xml_raw"), the names of persons are already tagged with persName. The second file ("wien_xml_raw") contains the same text, but the spelling differs and there are also some new text passages. I want to find all the values of the persName elements from the first document in the second one with a fuzzy search (e.g. "mr. l Conte de Sle" from the first document should also match "mr. le C. de Sli." in the second document) and tag the matches there with persName. I have two attempts, and both of them work if I run the code without the if-condition for the fuzzy comparison. Why doesn't it work with the fuzzy match?
aix_xml_raw = """
<doc><p>Louërent Dieu, mais <persName>mr. l C<ex>onte</ex> de Sle.</persName> ne voulut pas
appliquer mon voyage a mon avantage il crut que cela ressembloit fort a l’avanture,
et que la peur de me confesser au <persName>RP. Br.</persName> m’avoit fait aller a
<placeName>Hilzing</placeName>
Je ne m’excuse point, laissant au jugement de ceux qui liront ces lignes </p>
<gap/>
<p>Votre Saint Nom en soit béni, loué et glorifié. Amen.</p></doc>
"""
wien_xml_raw = """
<doc>
<line>louerent Dieu, mais mr. le C. de Sli. ne voulut pas appliquer mon voyage a mon avantage
il crut que cela rasembloit fort a l’avanture, et que la peur de me confesser au RP. Br.
m’avoit fait aller a Hitzing Je ne m’excuse pas sur ce point,
laissant au jugement de ceux qui liront ces ligne votre st nom en soitloué, et glorifié, amen.</line>
</doc>
"""
Solution 1:
from bs4 import BeautifulSoup, Tag
from fuzzywuzzy import fuzz
# Parse the first document
soup1 = BeautifulSoup(aix_xml_raw, 'xml')
# Find all persName tags and extract their values
pers_names = [tag.text for tag in soup1.find_all('persName')]
# Parse the second document
soup2 = BeautifulSoup(wien_xml_raw, 'xml')
# Find all text nodes in the second document
text_nodes = soup2.find_all(text=True)
# Loop over each text node and replace fuzzy matches with tagged values
for node in text_nodes:
    for name in pers_names:
        if fuzz.token_sort_ratio(name, node.strip()) > 90:
            # Create a new persName tag and insert it before the found value
            new_tag = Tag(name='persName')
            new_tag.string = name
            node.replace_with(node.replace(name, str(new_tag)))
# Print the modified second document
print(soup2.prettify())
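To narrow down what the if-condition actually compares, here is a minimal diagnostic sketch. It uses difflib from the standard library as a stand-in for fuzzywuzzy (an assumption made only so the snippet has no dependencies; fuzzywuzzy's scorers will give different numbers). The variables name, node and window are hand-copied excerpts from the data above, not output of the code. It scores the name against a whole text node and against just the matching phrase, and checks whether the exact spelling from the first document occurs in the second at all.

```python
import difflib

# Hand-copied excerpts from the two documents (illustrative, not computed).
name = "mr. l Conte de Sle."
node = ("louerent Dieu, mais mr. le C. de Sli. ne voulut pas appliquer mon voyage "
        "a mon avantage il crut que cela rasembloit fort a l'avanture")
window = "mr. le C. de Sli."


def ratio(a, b):
    """Case-insensitive similarity in [0, 1] using difflib (stand-in scorer)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Short name against the whole text node: the length difference dominates.
whole_score = ratio(name, node)
# Short name against only the phrase that should match.
window_score = ratio(name, window)
# str.replace needs the exact spelling, which the second document does not have.
exact_hit = name in node

print("name vs whole node:", round(whole_score, 3))
print("name vs phrase only:", round(window_score, 3))
print("exact spelling present:", exact_hit)
```

If this reading is right, the condition compares a short name with an entire text node, so the score stays low however similar the phrase inside the node is, and even when the condition is satisfied, node.replace(name, ...) looks for the spelling from the first document, which does not occur in the second.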
Solution 2:
import difflib
import xml.etree.ElementTree as ET
# define a function to get the person names from the first xml document
def get_person_names(xml_str):
    person_names = []
    root = ET.fromstring(xml_str)
    for pers_name in root.iter('persName'):
        person_names.append(pers_name.text.strip())
    return person_names
# define a function to find and tag person names in the second xml document
def tag_person_names(xml_str, person_names):
    root = ET.fromstring(xml_str)
    for line in root.iter('line'):
        tagged_line = line.text
        for name in person_names:
            # perform fuzzy string matching and tag the person names if found
            if difflib.SequenceMatcher(None, name.lower(), line.text.lower()).ratio() >= 0.8:
                tagged_line = tagged_line.replace(name, '<persName>{}</persName>'.format(name))
        line.text = tagged_line
    return ET.tostring(root, encoding='unicode')
person_names = get_person_names(aix_xml_raw)
tagged_wien_xml_raw = tag_person_names(wien_xml_raw, person_names)
print(tagged_wien_xml_raw)
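For comparison, this is a rough sketch of the behaviour I am after, not a working solution to the full problem. It makes several assumptions: the helper names full_names, best_window and tag_names are made up for illustration, the two sample strings are shortened excerpts of the data above, and the threshold of 0.7 is a guess (with difflib, "mr. l Conte de Sle." against "mr. le C. de Sli." scores below the 0.8 used above). It reads each persName with itertext(), because Element.text only returns the text before the first child element ("mr. l C" for the first name), then slides a window of words over the text and wraps the best-scoring span, keeping the spelling of the second document.

```python
import difflib
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def full_names(xml_str):
    """Collect the complete text of each persName, including nested elements."""
    root = ET.fromstring(xml_str)
    return ["".join(el.itertext()).strip() for el in root.iter("persName")]


def best_window(name, text):
    """Return (score, start, end) for the word window in text most similar to name."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    n = len(name.split())
    best = (0.0, 0, 0)
    # Try windows one word shorter, equal to, and one word longer than the name.
    for size in range(max(1, n - 1), n + 2):
        for i in range(len(words) - size + 1):
            start, end = words[i][0], words[i + size - 1][1]
            score = difflib.SequenceMatcher(
                None, name.lower(), text[start:end].lower()
            ).ratio()
            if score > best[0]:
                best = (score, start, end)
    return best


def tag_names(text, names, threshold=0.7):  # threshold is a guess to be tuned
    """Wrap the best match of each name in persName; returns an XML string fragment."""
    spans = []
    for name in names:
        score, start, end = best_window(name, text)
        if score >= threshold:
            spans.append((start, end))
    out, pos = [], 0
    for start, end in sorted(set(spans)):
        if start < pos:  # skip a span that overlaps one already tagged
            continue
        out.append(escape(text[pos:start]))
        out.append("<persName>" + escape(text[start:end]) + "</persName>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


# Shortened excerpts of the two documents, for illustration only.
aix_sample = ("<doc><p>mais <persName>mr. l C<ex>onte</ex> de Sle.</persName> ne voulut pas "
              "me confesser au <persName>RP. Br.</persName></p></doc>")
wien_sample = ("louerent Dieu, mais mr. le C. de Sli. ne voulut pas appliquer mon voyage, "
               "et que la peur de me confesser au RP. Br. m'avoit fait aller a Hitzing")

names = full_names(aix_sample)
tagged = tag_names(wien_sample, names)
print(names)
print(tagged)
```

This sketch only tags the single best window per name and returns a string instead of modifying the tree, so repeated occurrences of a name and text spread over several elements are not handled.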