I am trying to do string matching and bring back the match ID using fuzzywuzzy in Python. My datasets are huge: dataset1 has 1.8 million records and dataset2 has 1.6 million records.
What I have tried so far:
First, I tried the recordlinkage package in Python. Unfortunately, it ran out of memory while building the multi-index, so I moved to AWS with more machine power and built the index successfully. However, when I ran the comparison on it, it ran forever; I accept that this is due to the sheer number of comparisons.
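(For context, I believe what would have kept recordlinkage manageable is a blocked index instead of the full one. The sketch below is written from memory of the recordlinkage API and assumes my real frames are called dataset1 and dataset2 with Address1 and city columns like the toy data further down, so treat it as illustrative only; this blocked pairing is essentially what I want to reproduce with fuzzywuzzy.)

import recordlinkage

# block on city: only candidate pairs that share the same city are generated,
# instead of the full 1.8M x 1.6M cartesian multi-index
indexer = recordlinkage.Index()
indexer.block('city')
candidate_links = indexer.index(dataset1, dataset2)

# compare addresses only within each block
compare = recordlinkage.Compare()
compare.string('Address1', 'Address1', method='jarowinkler', label='address')
features = compare.compute(candidate_links, dataset1, dataset2)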
Then I tried to do the string matching with fuzzywuzzy and parallelise the process using the dask package. I executed it on sample data and it works fine, but I know the process will still take time because the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.
import pandas as pd

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place', '345 Pizza Drive', '456 Pretzel Junction'],
                     'city': ['X', 'U', 'X', 'U']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'],
                      'city': ['X', 'U', 'Z', 'Y'],
                      'ID': ['1', '3', '4', '8']})
Here, I am trying to look up test.Address1 in test2.Address1 and bring back its ID.
from fuzzywuzzy import fuzz
import dask.dataframe as dd
import dask.multiprocessing

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # score every candidate address against the original string
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2)))
dmaster.compute(get=dask.multiprocessing.get)
This works fine; however, I am not sure how I can apply indexing on it to limit the search space to the same city.
Let's say I create an index on the city field, subset test2 based on the city of the original string, and pass that city to the helper function (a rough sketch of what I mean follows the snippet below):
# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False,inplace=True)
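Something like the following is what I have in mind: a hypothetical helper_with_blocking that filters test2 down to the matching city before scoring, falling back to the whole table when a city has no block, wired into dask with a row-wise apply. I have not been able to verify it at scale, so it may well be wrong or slow:

def helper_with_blocking(orig_string, orig_city, slave_df):
    # restrict candidates to the same city (the blocking step)
    candidates = slave_df[slave_df.city == orig_city]
    if candidates.empty:
        # no block for this city: fall back to the full search space
        candidates = slave_df
    candidates = candidates.copy()
    candidates['score'] = candidates.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    return candidates.loc[candidates.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster['ID_there'] = dmaster.apply(
    lambda row: helper_with_blocking(row['Address1'], row['city'], test2),
    axis=1, meta=('ID_there', 'object'))
dmaster.compute(get=dask.multiprocessing.get)

I filter on the city column (which is still there because of drop=False) rather than using .loc on the index, mainly so that cities missing from test2 do not raise a KeyError; using the sorted index should work just as well.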
I am not sure whether this is the right way to do it, or how to make it efficient at this scale. Please advise. Thanks in advance.
(Comment from another user) … df1, and 2 million in df2 that also have an ID that I need to extract. This ID is actually a zipcode that I will append to the addresses in df1. Did you find a solution for this? I'm also using fuzzy matching, but it is taking an extremely long time. Any help would be appreciated.