
I have two lists to match against one another: I need to match each str1 word against each of the str2 word lists. str2 contains 40k word lists, so I want to try multiprocessing to make it run faster.

For example:

str1 = ['how', 'are', 'you']
str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

The code I tried:

from multiprocessing import Process, Pool
from fuzzywuzzy import process 


def f(str2, str1):
    for u in str1:
        res = []
        for i in str2:
            Ratios = process.extract(u,i)
            res.append(str(Ratios))      
    print(res)
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]
    for i in str2:
        p = Process(target=f, args=(i, str1))
        p.start()
        p.join()

This does not return what I expect. I was expecting output that looks like a data frame:

words                     how  are  you
['this', 'how', 'done']   100    0    0
['they', 'were', 'here']    0   90    0
['can', 'you', 'leave']     0   80  100
['how', 'sad']            100    0    0
Comments:

  • p.start() followed immediately by p.join() in your loop isn't going to make your code any faster.
  • @Jean-FrançoisFabre okay, is there another way around it?

1 Answer


You're not really using parallel multiprocessing because of this loop:

for i in str2:
    p = Process(target=f, args=(i, str1))
    p.start()
    p.join()

p.join() waits for each process to complete before the next one is started, so the processes run sequentially and that construct gives no speedup. (Spawning a fresh process per case can still be useful in some situations, for instance when each case loads native code from DLLs.)

You have to store the process objects and wait for them in a separate loop instead.

# create & store process objects
processes = [Process(target=f, args=(i, str1)) for i in str2]
# start processes
for p in processes:
    p.start()
# wait for processes to complete
for p in processes:
    p.join()

Note that this approach has several major issues:

  • it may create far too many processes running at the same time (one per sub-list)
  • how do you simply get hold of the return values from f?

With your current method, the return value is lost unless you store it in a shared object such as a Manager list. The Pool.map method, shown further below, lets you get hold of the results directly.
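For illustration, here is a minimal sketch of the Manager route; the worker name collect and the squared-number payload are assumptions for the example, not from the question's code:

from multiprocessing import Process, Manager

def collect(i, x, results):
    # write this worker's result into its own slot of the shared list
    results[i] = x ** 2

if __name__ == "__main__":
    with Manager() as manager:
        results = manager.list([None] * 4)  # shared, pre-sized result slots
        processes = [Process(target=collect, args=(i, i, results)) for i in range(4)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(list(results))  # [0, 1, 4, 9]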

That's why objects like process pools exist. Small example of use:

from multiprocessing import Pool

def sq(x):
    return x**2

if __name__ == "__main__":
    p = Pool(2)
    n = p.map(sq, range(10))
    print(n)

Here only 2 processes are active at the same time.
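As a side note (a stylistic suggestion, not part of the answer above), Pool also works as a context manager, which closes and joins the pool for you:

from multiprocessing import Pool

def sq(x):
    return x**2

if __name__ == "__main__":
    # the with-block closes the pool automatically when it exits
    with Pool(2) as p:
        print(p.map(sq, range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]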

Your code, adapted to pools (untested):

from multiprocessing import Pool
from fuzzywuzzy import process


def f(str2, str1):
    res = []  # collect scores for every word of str1 (don't reset inside the loop)
    for u in str1:
        # process.extract scores u against the whole word list at once
        Ratios = process.extract(u, str2)
        res.append(str(Ratios))
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

    nb_processes = 4
    p = Pool(nb_processes)
    # starmap unpacks each (i, str1) tuple into f's two arguments
    results = p.starmap(f, [(i, str1) for i in str2])

results is a list of the return values (each itself a list) from the calls to f, in the order given by str2.
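Since the expected output in the question looks like a data frame, here is one possible way to tabulate the scores with pandas; the use of fuzz.ratio and the max-per-sub-list layout are my assumptions, not part of the answer above:

import pandas as pd
from fuzzywuzzy import fuzz

str1 = ['how', 'are', 'you']
str2 = [['this', 'how', 'done'], ['they', 'were', 'here'],
        ['can', 'you', 'leave'], ['how', 'sad']]

# best ratio of each str1 word against any word of each sub-list
rows = [{u: max(fuzz.ratio(u, w) for w in sub) for u in str1} for sub in str2]
df = pd.DataFrame(rows, index=[str(sub) for sub in str2])
print(df)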

Comments:

  • It does run faster than the previous one. Does map help to separate the scores based on the words in str1?
  • map just applies f to each (i, str1) argument. If you want to separate the scores, I suggest passing combinations of elements from str1 and str2 instead of looping inside f; a sketch follows below.
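A minimal sketch of that last suggestion, pairing every str1 word with every str2 sub-list via itertools.product (the function name score_pair is hypothetical):

from itertools import product
from multiprocessing import Pool
from fuzzywuzzy import process

def score_pair(word, sub):
    # score one str1 word against one str2 sub-list
    return word, sub, process.extract(word, sub)

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'],
            ['can', 'you', 'leave'], ['how', 'sad']]
    with Pool(4) as p:
        # one task per (word, sub-list) combination
        results = p.starmap(score_pair, product(str1, str2))
    for word, sub, ratios in results:
        print(word, sub, ratios)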
