The nCore option does not speed up enrichment compute #1
I will try other methods for spot parallelization. |
It seems to be related to overhead and RAM usage.
Try to observe your CPU and RAM usage when doing enrichments. You will see that RAM usage skyrockets very early. The CPUs stop their multi-tasking very early too, leaving all the deserialization to one core. Trying with more than 6 cores on a 16 GB RAM computer will hit the swap, and that's when things become even slower.
What's strange is that we didn't see this issue before. I remember doing lots of benchmarks to test the parallelization and evaluate its performance. Any clues on what could have changed since last time? |
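A minimal way to see the copying overhead described above; the `catalog` object here is a synthetic stand-in for the ReMap catalog, not the package's actual data structure:

```r
library(parallel)

# Synthetic stand-in for the ~200 MB in-memory ReMap catalog.
catalog <- data.frame(chr   = sample(paste0("chr", 1:22), 5e6, replace = TRUE),
                      start = sample.int(1e8, 5e6, replace = TRUE))
catalog$end <- catalog$start + 200L

print(object.size(catalog), units = "Mb")

cl <- makeCluster(4)                       # PSOCK workers are separate R processes
system.time(clusterExport(cl, "catalog"))  # the catalog is serialized and copied to each worker
# RAM usage grows roughly linearly with the number of workers, and the
# serialization itself is done by the single master process.
stopCluster(cl)
```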
Hi Zach, nice to read you, hope everything is fine for you.
Since you wrote the code, the number of peaks in ReMap has increased by a factor of 10, I think.
We may need to redesign the code, for example by loading and analysing the peak collections separately (one collection at a time), and by parallelizing the comparisons.
Cheers
Jacques
|
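A rough sketch of the "one collection at a time" idea above, assuming each peak collection is stored in its own file; the directory name, `query`, and `compare_with_query()` are hypothetical placeholders, not the package's API:

```r
# Hypothetical layout: one serialized peak set per ReMap collection.
collection_files <- list.files("remap_collections", pattern = "\\.rds$", full.names = TRUE)

results <- lapply(collection_files, function(f) {
  peaks <- readRDS(f)                      # only this collection is held in memory
  res <- compare_with_query(query, peaks)  # hypothetical comparison / enrichment step
  rm(peaks)
  res
})
# The comparison step inside the loop is what could then be parallelized.
```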
Here are the latest tests with another query set.
This was done on an iMac i7 with 32 GB RAM. |
After more investigation, I suspect the catalog that is passed around to each worker is the main reason parallel computation is slower. If you test the functions in detail, you will see that the computation of shuffles is actually faster with more cores; it is the theoretical means that are slower. The thing is that to compute the theoretical means we need the catalog to do the overlaps, and serializing it takes a lot of time.
So Jacques, I think you were right about the growth of ReMap being the reason behind slower parallel computations. For now I can't find a nice solution that would prevent such massive (but needed) data from being passed around.
Jacques, could you please elaborate on your idea for redesigning the code? I'm not sure I get what you mean.
Edit: After trying with the 2015 catalog, the same issue happens again. I think the explanation of the catalog (still a huge variable, ~200 MB) being passed around is still valid. Maybe the benchmarking was not done seriously enough at the time (my bad). |
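A quick way to gauge the serialization cost mentioned above; `catalog` stands for the in-memory ReMap catalog object:

```r
# Time and size of serializing the catalog: this is roughly what has to be
# shipped to every worker before it can compute the theoretical means.
system.time(raw <- serialize(catalog, NULL))
length(raw) / 1024^2   # payload size in MB, copied once per worker
```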
I think that it needs to be discussed in front of the code, but it seems to me that there is no reason to load the whole catalogue, since the analysis is done for each peakset separately.
I would test an approach where
- each peakset comes as a separate dataset,
- Are the ReMap peaks loaded as bed files? If so, I would recommend storing them as RData sets, which should greatly accelerate their loading for the shuffling tests (see the sketch below).
Other options might be tested; these are just two possible starting points.
Cheers
Jacques
|
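A sketch of the BED-vs-RData comparison Jacques suggests; rtracklayer/GenomicRanges and the file names are assumptions, not necessarily what the package uses:

```r
library(rtracklayer)

# Parsing a BED file re-reads and re-parses text on every run...
system.time(peaks <- import("TAL1.bed", format = "BED"))

# ...whereas a serialized R object loads directly into the same GRanges structure.
saveRDS(peaks, "TAL1.rds")
system.time(peaks2 <- readRDS("TAL1.rds"))
```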
I think I need an update to remember the analysis in detail. Maybe we could plan a video call soon?
Retrieving the intersections is what takes the most computational time; let's call this time T. Currently:
- We do intersections between the query and the catalog (1T).
- We then create n shuffled versions of the query.
- We do intersections between the shuffles and the catalog (nT).
I think I now understand what Jacques is saying about computing each peakset separately; it would mean doing the following:
- We create n shuffled versions of the query.
- For each category, we do intersections between the query and the reduced version of the catalog, and between the shuffles and the reduced version of the catalog.
- We then merge the results.
In my opinion this may improve performance for a sequential version of the code, as the intersections would be faster, but it would not necessarily make parallel computing faster.
Parallel computing works well when there is a small number of big tasks, and when those big tasks have a small memory footprint of input/output. The reason is that each worker (a separate R process) needs its own copy of the data to work with it. This has nothing to do with peaks being loaded as bed files or RData; rather, variables (pure in-RAM data) must be passed to each worker to become a variable of that worker. That's why RAM usage increases so much: the whole catalog is copied into RAM for each worker.
For now we have a small number of big tasks (e.g. doing intersections for 6 shuffles), but the input is massive because it needs the whole catalog. If we try to separate each category of the catalog, we will end up with a more lightweight input but with a lot more small tasks, so still not ideal for parallel computing (see https://stackoverflow.com/questions/26592326/parallel-computing-taking-same-or-more-time).
What could still be possible is to chunk the categories (one chunk per core), do parallel computing for each chunk, and then merge the results. However, chunking the categories would still mean passing around some part of the catalog (say, for 6 cores we would need to copy a sixth of the catalog, so a ~33 MB variable).
Another solution would be to use fast forking and shared memory with the mclapply function (see https://stackoverflow.com/questions/13942202/r-and-shared-memory-for-parallelmclapply). This would allow workers to access shared memory without having to copy the inputs. This solution would greatly improve performance in my opinion, but sadly it is only possible on Mac and Linux. |
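A sketch of the mclapply idea above, assuming a GenomicRanges representation; `query`, `catalog`, `shuffle_query()`, `nShuffles`, and `nCores` are placeholders rather than the package's actual objects:

```r
library(parallel)
library(GenomicRanges)

# mclapply forks the master process (Linux/macOS only): workers read `catalog`
# through copy-on-write shared memory, so it is never serialized or re-copied.
shuffle_counts <- mclapply(seq_len(nShuffles), function(i) {
  shuffled <- shuffle_query(query)   # hypothetical shuffling helper
  countOverlaps(catalog, shuffled)   # read-only access to the shared catalog
}, mc.cores = nCores)

# A PSOCK cluster (parLapply) would instead need clusterExport(cl, "catalog"),
# i.e. one full serialized copy of the catalog per worker.
```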
Good points Zacharie.
The catalogue is already RData (I think).
Splitting the RData (catalogue) by chromosome may be a way forward.
This reduces the RAM footprint and doesn't increase the number of jobs too much (e.g. 6 shuffles x 24 chromosomes).
Could this be a way forward?
--
Benoît Ballester
|
I don't think we can split by chromosome, because the intersection statistics are meaningful only if done at the genome level. My proposal is to split by transcription factor + cell type (this is what I meant by peak set).
|
I am not sure I get it.
1) You create a shuffled query,
2) which you then intersect against the catalogue chromosome by chromosome,
3) and then merge the intersection counts.
You then have the intersections at the genome level to do the stats.
Ben
|
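A sketch of the per-chromosome split Ben describes, with the counts merged back before computing statistics at the genome level; a GenomicRanges catalog is assumed and `catalog`, `shuffled_query`, and `nCores` are placeholders:

```r
library(parallel)
library(GenomicRanges)

# Split the catalogue once by chromosome; each task then only carries one
# chromosome's worth of peaks instead of the whole catalogue.
catalog_by_chr <- split(catalog, seqnames(catalog))

per_chr <- mclapply(as.list(catalog_by_chr), function(cat_chr) {
  sum(countOverlaps(cat_chr, shuffled_query))   # overlap count on this chromosome
}, mc.cores = nCores)

genome_count <- sum(unlist(per_chr))   # merge: genome-level count for the statistics
```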
From what I understand, we could not split the catalog by chromosome (in fact there is an option for the shuffles to be done by chromosome or on the whole genome), but rather by factors (we call them categories in the code, e.g. TAL1, FOXP1, etc.), which would result in a 485-way split. That's why I talked about a lot of smaller tasks. Are we talking about the same thing? |
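A sketch of the category-chunking compromise mentioned earlier in the thread (one chunk of categories per core); `catalog_by_category`, `query`, and `compute_enrichment()` are hypothetical names, not the package's API:

```r
library(parallel)

# Split the ~485 categories (TAL1, FOXP1, ...) into one chunk per core, so each
# worker only receives its slice of the catalog (~1/nCores of ~200 MB).
chunk_id <- cut(seq_along(catalog_by_category), nCores, labels = FALSE)
chunks   <- split(catalog_by_category, chunk_id)

cl <- makeCluster(nCores)
clusterExport(cl, c("query", "compute_enrichment"))
chunk_results <- parLapply(cl, chunks, function(chunk) {
  lapply(chunk, function(cat_peaks) compute_enrichment(query, cat_peaks))
})
stopCluster(cl)

results <- unlist(chunk_results, recursive = FALSE)   # one result per category again
```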
I am not entirely sure we are talking about the same thing here.
From what I understand, the original parallelisation issue comes from intersecting the shuffled queries against a large catalogue (which takes RAM, etc.).
The query shuffling (e.g. n = 6) can still be done the same way (using the same chromosome repartition), on the entire genome, etc.
Then, once the 6 shuffled queries are ready, we can parallelise the intersection against the catalogue, but chromosome by chromosome.
--
Benoît Ballester
|
We noticed that using the nCore option (2 or more) in the enrichment takes longer to compute than using 1 core. The more cores used, the longer it runs (strange). There is an R script to test this in misc/example1.R:
Enrichment shuffle 6
Enrichment shuffle 6 nCores 3
Enrichment shuffle 6 nCores 6
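For reference, the benchmark can presumably be reproduced along these lines; the argument names follow the wording above and are an assumption, so check misc/example1.R for the exact call the package uses:

```r
# `query` and `catalog` are assumed to be already loaded as in misc/example1.R.
system.time(enrichment(query, catalog, shuffles = 6))              # 1 core
system.time(enrichment(query, catalog, shuffles = 6, nCores = 3))  # 3 cores
system.time(enrichment(query, catalog, shuffles = 6, nCores = 6))  # 6 cores
```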