
I am trying to do a string match and bring back the matched ID using fuzzywuzzy in Python. My dataset is huge: dataset1 has 1.8 million records and dataset2 has 1.6 million records.

What I have tried so far:

First I tried the recordlinkage package in Python. Unfortunately it ran out of memory when it built the multi-index, so I moved to AWS with more machine power and successfully built it. However, when I tried to run the comparison on it, it ran forever; I agree that this is due to the sheer number of comparisons.

Then I tried to do the string match with fuzzywuzzy and parallelise the process using the dask package, and executed it on sample data. It works fine, but I know the process will still take time because the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.

import pandas as pd
import dask.dataframe as dd
from fuzzywuzzy import fuzz

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place', '345 Pizza Drive', '456 Pretzel Junction'], 'city': ['X', 'U', 'X', 'U']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'], 'city': ['X', 'U', 'Z', 'Y'], 'ID': ['1', '3', '4', '8']})

Here, I am trying to look up test.Address1 in test2.Address1 and bring back its ID.

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score (.loc replaces the deprecated .ix)
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
dmaster.compute(scheduler='processes')  # scheduler= replaces the old get=dask.multiprocessing.get

This works fine; however, I am not sure how I can apply indexing on it to limit the search space to the same city.

Let's say I create an index on the city field, subset based on the city of the original string, and pass that city to the helper function:

# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False, inplace=True)
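
To make the intent concrete, this is roughly the city-restricted lookup I have in mind (a hypothetical helper_by_city sketch in plain pandas; the name and the fallback behaviour are my own illustration):

def helper_by_city(orig_string, city, slave_df):
    # restrict the search space to rows whose city matches; fall back to the full frame
    subset = slave_df.loc[[city]].copy() if city in slave_df.index else slave_df.copy()
    subset['score'] = subset.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # positional argmax, because the index is now the (non-unique) city
    return subset['ID'].iloc[subset['score'].values.argmax()]

test['ID_there'] = test.apply(lambda row: helper_by_city(row.Address1, row.city, test2), axis=1)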

What I don't know is how to plug this into the helper function and the dask apply above. Please advise. Thanks in advance.

  • Have you resolved this? Commented Jun 25, 2017 at 17:02
  • Not yet. Still working on it.
    – ds_user
    Commented Jun 25, 2017 at 19:06
  • Hey @ds_user. I've been googling this problem for weeks and have the exact same issue as you. I have about 100k addresses in df1, and 2 million in df2 that also have an ID that I need to extract. This ID is actually a zipcode that I will append to addresses in df1. Did you find a solution for this? I'm also using fuzzy matching, but it is taking an extremely long time. Any help would be appreciated.
    – SCool
    Commented Aug 21, 2019 at 15:34

2 Answers


I prefer using fuzzywuzzy.process.extractOne. It compares one string to an iterable of strings.

from fuzzywuzzy import process

def extract_one(col, other):
    # materialize `other` first if it's a dask object; needed for the dask version later
    other = other.compute() if hasattr(other, 'compute') else other
    # on a Series, extractOne returns (match, score, index-label) 3-tuples
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)

extract_one(test.Address1, test2.Address1)

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

The idx is the index of the other passed to extract_one that matches most closely. I would recommend having a meaningful index, to make joining the results easier later on.
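
For example, to attach the results back to test (a sketch assuming the frames above, where idx holds test2's index labels):

result = extract_one(test.Address1, test2.Address1)
# look up the matching ID in test2 by the idx labels, then join on test's index
result['ID_there'] = test2.loc[result['idx'], 'ID'].values
test.join(result[['score', 'ID_there']])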

For your second question, about filtering to cities, I would use a groupby and apply:

gr1 = test.groupby('city')
gr2 = test2.groupby("city")

gr1.apply(lambda x: extract_one(x.Address1,
                                gr2.get_group(x.name).Address1))

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

The only difference with dask is the need to specify a meta to the apply:

ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)

dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')

meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])

dgr1.apply(lambda x: extract_one(x.Address1,
                                 dgr2.get_group(x.name).Address1),
           meta=meta).compute()

             Address1  score  idx
city                             
U    0  234 kookie Pl     83    1
     1  234 kookie Pl     28    1
X    0   123 chese wy     92    0
     1   123 chese wy     28    0

Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad

I'm curious to hear how the performance is. I'm assuming the actual string comparison done in fuzzywuzzy will take the bulk of the time, but I'd love to hear back on how much overhead is spent in pandas and dask. Make sure you have the C extensions for computing the Levenshtein distance.
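
If fuzzywuzzy warns about falling back to the slow pure-Python SequenceMatcher, the C speedup comes from the python-Levenshtein package (assuming a pip-based setup):

pip install python-Levenshtein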

  • Thanks for your response. I will try it today. How do I parallelise the process? Both methods?
    – ds_user
    Commented Jun 21, 2017 at 18:44
  • Everything should be parallelized. You may want to check dask.pydata.org/en/latest/dataframe-groupby.html; you shouldn't have to spill to disk, so it may be OK. You'll also want to make sure that the Levenshtein distance extension releases the GIL for its computation (I'm not sure if it does). Commented Jun 21, 2017 at 19:10
  • Another error is AssertionError: 3 columns passed, passed data had 2 columns, on the line columns=['Address1', 'score', 'idx']. I think fuzzywuzzy.process.extractOne does not return idx as part of the tuple it returns. Can you please help with this? @TomAugspurger
    – ds_user
    Commented Jun 21, 2017 at 22:38
  • Also, by doing this, I will lose the index of test.Address1, so it will make things even more complicated.
    – ds_user
    Commented Jun 22, 2017 at 5:39
  • > I think the fuzzywuzzy.process.extractOne does not return idx as part of the tuple it returns. It depends on what's passed. When other is dict-like (which includes a Series), extractOne returns a 3-tuple. When it's just array-like, it's a 2-tuple. What's the type(other) being passed to extractOne for you? Commented Jun 22, 2017 at 14:15

I ran into the same problem once. The whole process takes forever, and even if you use multiprocessing it is not really going to be very fast. The main cause of the slowness is the fuzzy matching itself, because the comparison is very tedious and requires a lot of time.

Alternatively, and more efficiently in my opinion, you could use embeddings, e.g. a bag-of-words representation, and apply an ML method on top. The fact that you work with numerical vectors makes the whole process way faster!
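
One common version of this idea (a sketch of the suggestion, not a tested pipeline: character n-gram tf-idf vectors plus a scikit-learn nearest-neighbour search on cosine distance, reusing test and test2 from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# vectorize both address columns with character n-grams, fitting on the lookup set
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
X2 = vec.fit_transform(test2.Address1)
X1 = vec.transform(test.Address1)

# 1-nearest-neighbour search on cosine distance; brute force handles sparse input
nn = NearestNeighbors(n_neighbors=1, metric='cosine').fit(X2)
dist, idx = nn.kneighbors(X1)

test['ID_there'] = test2['ID'].values[idx.ravel()]
test['similarity'] = 1 - dist.ravel()

Because everything reduces to sparse linear algebra, the city blocking from the other answer can still be layered on top, e.g. by fitting one vectorizer per city group.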

  • Thats a great idea. Any example snippet for that?
    – ds_user
    Commented Jul 9, 2018 at 11:24
  • I am preparing an article about the process. I will give you the link when it's ready :) Commented Jul 9, 2018 at 14:35
  • Hi @GeorgeSotiropoulos. Did you write that article? I'm also interested in this solution.
    – SCool
    Commented Aug 21, 2019 at 15:31
  • @GeorgeSotiropoulos are you still writing the article? If it's done, could you share it? Commented Mar 11, 2021 at 20:43
