
I have two lists to match against one another: I need to match each str1 word against each of the str2 word lists. str2 contains 40k word lists, so I want to try multiprocessing to make it run faster.

For example:

str1 = ['how', 'are', 'you']
str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

The code I tried:

from multiprocessing import Process, Pool
from fuzzywuzzy import process 


def f(str2, str1):
    for u in str1:
        res = []
        for i in str2:
            Ratios = process.extract(u,i)
            res.append(str(Ratios))      
    print(res)
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]
    for i in str2:
        p = Process(target=f, args=(i, str1))
        p.start()
        p.join()

This does not return what I expect. I was expecting output that looks like a data frame:

words                     how  are  you
['this', 'how', 'done']   100    0    0
['they', 'were', 'here']    0   90    0
['can', 'you', 'leave']     0   80  100
['how', 'sad']            100    0    0
Comments:

  • p.start() followed immediately by p.join() in your loop isn't going to make your code any faster.
  • @Jean-FrançoisFabre okay, is there another way around it?

1 Answer


You're not really using parallel multiprocessing because of this loop:

for i in str2:
    p = Process(target=f, args=(i, str1))
    p.start()
    p.join()

p.join() waits for each process to complete before the next one is started, so the processes run sequentially and that construct gives no speedup. (Spawning a fresh process per case can still be useful in some situations, for instance when each case loads native code from DLLs.)

You have to store the process objects and wait for them in a separate loop instead.

# create & store process objects
processes = [Process(target=f, args=(i, str1)) for i in str2]
# start processes
for p in processes:
    p.start()
# wait for processes to complete
for p in processes:
    p.join()

Note that this approach has several major issues:

  • it may create far too many processes running at the same time (one per sub-list)
  • how do you simply get hold of the return values from f?

With your current method, the return value is lost unless you store it in a shared object such as a Manager list. The Pool.map method, shown further below, lets you get hold of the results directly.
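For illustration, here is a minimal sketch of the Manager route; the worker name collect and the squared-number payload are assumptions for the example, not from the question's code:

from multiprocessing import Process, Manager

def collect(i, x, results):
    # write this worker's result into its own slot of the shared list
    results[i] = x ** 2

if __name__ == "__main__":
    with Manager() as manager:
        results = manager.list([None] * 4)  # shared, pre-sized result slots
        processes = [Process(target=collect, args=(i, i, results)) for i in range(4)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(list(results))  # [0, 1, 4, 9]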

That's why objects like process pools exist. Small example of use:

from multiprocessing import Pool

def sq(x):
    return x**2

if __name__ == "__main__":
    p = Pool(2)
    n = p.map(sq, range(10))
    print(n)

Here only 2 processes are active at the same time.
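As a side note (a stylistic suggestion, not part of the answer above), Pool also works as a context manager, which closes and joins the pool for you:

from multiprocessing import Pool

def sq(x):
    return x**2

if __name__ == "__main__":
    # the with-block closes the pool automatically when it exits
    with Pool(2) as p:
        print(p.map(sq, range(10)))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]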

Your code, adapted to pools (untested):

from multiprocessing import Pool
from fuzzywuzzy import process


def f(str2, str1):
    res = []  # collect scores for every word of str1 (don't reset inside the loop)
    for u in str1:
        # process.extract scores u against the whole word list at once
        Ratios = process.extract(u, str2)
        res.append(str(Ratios))
    return res

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'], ['can', 'you', 'leave'], ['how', 'sad']]

    nb_processes = 4
    p = Pool(nb_processes)
    # starmap unpacks each (i, str1) tuple into f's two arguments
    results = p.starmap(f, [(i, str1) for i in str2])

results is a list of the return values (each itself a list) from the calls to f, in the order given by str2.
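Since the expected output in the question looks like a data frame, here is one possible way to tabulate the scores with pandas; the use of fuzz.ratio and the max-per-sub-list layout are my assumptions, not part of the answer above:

import pandas as pd
from fuzzywuzzy import fuzz

str1 = ['how', 'are', 'you']
str2 = [['this', 'how', 'done'], ['they', 'were', 'here'],
        ['can', 'you', 'leave'], ['how', 'sad']]

# best ratio of each str1 word against any word of each sub-list
rows = [{u: max(fuzz.ratio(u, w) for w in sub) for u in str1} for sub in str2]
df = pd.DataFrame(rows, index=[str(sub) for sub in str2])
print(df)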

Comments:

  • It does run faster than the previous one. Does map help to separate the scores based on the words in str1?
  • map just applies f to each (i, str1) argument. If you want to separate the scores, I suggest passing combinations of elements from str1 and str2 instead of looping inside f; a sketch follows below.
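A minimal sketch of that last suggestion, pairing every str1 word with every str2 sub-list via itertools.product (the function name score_pair is hypothetical):

from itertools import product
from multiprocessing import Pool
from fuzzywuzzy import process

def score_pair(word, sub):
    # score one str1 word against one str2 sub-list
    return word, sub, process.extract(word, sub)

if __name__ == '__main__':
    str1 = ['how', 'are', 'you']
    str2 = [['this', 'how', 'done'], ['they', 'were', 'here'],
            ['can', 'you', 'leave'], ['how', 'sad']]
    with Pool(4) as p:
        # one task per (word, sub-list) combination
        results = p.starmap(score_pair, product(str1, str2))
    for word, sub, ratios in results:
        print(word, sub, ratios)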
