[onert] Share memory for Reshape, ExpandDims and Squeeze #14057
base: master
Conversation
This commit extends the current tensor memory management infrastructure to allow tensor memory sharing where possible. ONE-DCO-1.0-Signed-off-by: Mateusz Bencer [email protected]
c5ddcc5 to c8d8a75
@glistening @hseok-oh Thank you for reviewing the previous version. In the current version I've completely changed the approach: memory sharing is now processed during tensor allocation.
@mbencer
This seems to be my last review on this work because I will be away from the office for a long time. I'm sorry, I won't be able to give feedback anymore.
    }
  }
  reassign_indexes_to_single_sources(data.shared_memory_operand_map);
}
In-place is dependent on specific operations, and the kernel implementations may vary for each backend. Also, the kernel implementations for the specific operations only exist in the cpu backend now. So, it would be better to move this map creation into the cpu backend.
I think a better place to create and append the map is in KernelGenerator. However, currently in the cpu backend, registering tensors is called before KernelGenerator, making it difficult to simply move the map into KernelGenerator. You may need to unify BackendContext::genTensors() and BackendContext::genKernels(), as the train backend does.
I see your point. My intention was to make this mechanism more global, but I see that it may not be applicable to other backends.
As you noticed, it's very problematic to move it to KernelGenerator: we would need to pass this information to the TensorBuilder ctor (a setter seems like a poor approach) and to initConsts (here the only possibility seems to be the local backend context).
My proposal (already implemented) is to call it in runtime/onert/backend/cpu/Backend.h.
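To illustrate the idea, here is a minimal, self-contained sketch of that wiring. All type and function names below (BackendSketch, TensorBuilderSketch, findSharedMemoryOperands) are hypothetical stand-ins rather than the actual onert API: the backend computes the shared-memory operand map once when creating its context and injects it through the TensorBuilder constructor, so no setter is needed.

```cpp
// Hypothetical sketch, not the real onert cpu Backend.h.
#include <memory>
#include <unordered_map>

using OperandIndexSketch = int;
// Maps an operand that may reuse memory to the operand owning the buffer.
using SharedMemoryOperandMap = std::unordered_map<OperandIndexSketch, OperandIndexSketch>;

struct GraphSketch
{
};

// Stand-in for the analysis pass that finds Reshape/ExpandDims/Squeeze chains.
SharedMemoryOperandMap findSharedMemoryOperands(const GraphSketch &) { return {}; }

class TensorBuilderSketch
{
public:
  // The map arrives through the constructor and stays immutable afterwards.
  explicit TensorBuilderSketch(SharedMemoryOperandMap shared_map)
    : _shared_map(std::move(shared_map))
  {
  }

private:
  SharedMemoryOperandMap _shared_map;
};

class BackendSketch
{
public:
  // The map is created here, local to the cpu backend, and injected via ctor.
  std::unique_ptr<TensorBuilderSketch> newContext(const GraphSketch &graph) const
  {
    auto shared_map = findSharedMemoryOperands(graph);
    return std::make_unique<TensorBuilderSketch>(std::move(shared_map));
  }
};

int main()
{
  BackendSketch backend;
  auto builder = backend.newContext(GraphSketch{});
  return builder ? 0 : 1;
}
```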
#include "GenModelTest.h" | ||
#include "CircleGen.h" |
#include "GenModelTest.h" | |
#include "CircleGen.h" | |
#include "CircleGen.h" | |
#include "GenModelTest.h" |
#include "GenModelTest.h" | ||
#include "CircleGen.h" | ||
|
||
TEST_F(GenModelTest, optimized_reshape_inference) |
I think this is a reshape test, not a reshape optimization (probably in-place) test. It would be better to rename this test and add tests that verify the in-place implementation. If it's difficult to add such tests to the nnfw_api tests, please add gtests in the implementation directory instead.
I see, I've changed the names of those tests and introduced in-place tests in https://github.com/Samsung/ONE/pull/14057/files#diff-4071f79a2a4bc3fd841a69a64ada0e01827c2e6202bf7c446dec530aa9f0bc6b
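For readers unfamiliar with what such an in-place test checks, here is a hedged sketch written against a toy reshape helper rather than the nnfw_api/GenModelTest fixtures (ToyTensor and reshape_in_place are hypothetical names): the essential assertion is that the output tensor aliases the input buffer instead of allocating a new one.

```cpp
// Plain gtest sketch of the in-place property being verified; the types and
// helper below are illustrative stand-ins, not onert code.
#include <gtest/gtest.h>
#include <vector>

struct ToyTensor
{
  std::vector<int> shape;
  float *buffer = nullptr;
};

// A reshape that shares memory only rewrites the shape; the buffer is reused.
ToyTensor reshape_in_place(ToyTensor &input, std::vector<int> new_shape)
{
  return ToyTensor{std::move(new_shape), input.buffer};
}

TEST(InPlaceReshapeSketch, OutputAliasesInputBuffer)
{
  std::vector<float> storage(6, 1.0f);
  ToyTensor input{{2, 3}, storage.data()};
  ToyTensor output = reshape_in_place(input, {3, 2});
  // The essence of the optimization: no new buffer is allocated for the output.
  EXPECT_EQ(output.buffer, input.buffer);
}
```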
  SUCCEED();
}

TEST_F(GenModelTest, optimized_expand_dims_inference)
ditto
  SUCCEED();
}

TEST_F(GenModelTest, optimized_squeeze_inference)
ditto
  SUCCEED();
}

TEST_F(GenModelTest, optimized_reshape_reshape_reshape_chain_inference)
ditto
  SUCCEED();
}

TEST_F(GenModelTest, reshape_input_model_input_inference)
ditto
  SUCCEED();
}

TEST_F(GenModelTest, reshape_input_model_output_inference)
ditto
  SUCCEED();
}

TEST_F(GenModelTest, reshape_output_model_output_inference)
ditto
@@ -217,6 +239,33 @@ createBackendContexts(compiler::ILoweredGraph &lgraph, bool linear_executor,

  // Create contexts
  auto whole_op_order = lgraph.graph().topolSortOperations();
  const std::unordered_set<std::string> memory_sharing_supported_backends = {"cpu", "builtin"};
Please remove the builtin backend from this set. The kernels in the builtin backend deal with data transmission between other backends, so there is no need to apply in-place for that task; the required in-place optimization there has already been applied in another way.
I see - I've rewritten the implementation to be local to the cpu backend.
@glistening Thank you for the response. Sure, I'll try to split the PR, but let me address the review request from @ragmani first.
I see. Anyway, thank you for the very useful feedback! ;)
I've split part of the implementation into smaller PRs:
The rest of the changes are deeply interdependent, so I'll push them later.
Some timing results (for 50 repeats) from my dev machine. Note: do NOT treat these as official results.
From current branch
From master
Conclusion: Preparation time increases by about 1% for mobilenet and almost 3% for mnist. Execution time (mean) decreases by about 0.54% for mobilenet and 5.6% for mnist.
@hseok-oh @ragmani @zetwhite If you find a moment, please take a look at the PRs to review ;)
Thanks for the notice. I'll take a look :)
I read the draft and understood the overall direction. I could review your PR.
std::vector<ir::OperandIndex> registered_source_ind; | ||
for (const auto &[_, source_ind] : tensor_builder->getSharedMemoryOperandIndexes()) | ||
{ | ||
if (ctx.external_operands().contains(source_ind)) | ||
continue; | ||
if (tensor_builder->isRegistered(source_ind)) // some tensors can have the same source | ||
continue; | ||
tensor_builder->registerTensorInfo(source_ind, graph.operands().at(source_ind).info()); | ||
registered_source_ind.emplace_back(source_ind); | ||
} | ||
|
||
graph.operands().iterate([&](const ir::OperandIndex &ind, const ir::Operand &obj) { | ||
if (ctx.external_operands().contains(ind)) | ||
return; | ||
if (std::find(std::begin(registered_source_ind), std::end(registered_source_ind), ind) != | ||
std::end(registered_source_ind)) // skip tensors already registered | ||
return; |
While reviewing #14228, I re-read this PR.
I'm a bit confused about this part.
In genTensors(), is it sufficient just to register the source_ind first?
I thought this draft tried to allocate only the source_operand and avoid allocating the shared_operand.
(source_operand - the operand matched with source_ind, shared_operand - the operand matched with share_ind)
But I failed to understand how this code removes the allocation of the shared_operand.
@mbencer I guess there is something I missed. Could you help me understand?
Note that registerTensorInfo is responsible just for buildTensor (both non-const and const).
The second step is calling allocateNonconsts, where we call tensor->setBuffer. Only at this point do we pass the same buffer to both the source tensor and the shared tensor. Of course, this requires special handling of the memory lifetime. The lifetime is controlled by StaticTensorManager::claimPlan, called during the first use of the memory buffer, and StaticTensorManager::releasePlan, called during the last use of the memory buffer (the graph has to be topologically sorted).
Conclusion: we create tensors for both source_operand and shared_operand - they just share the memory buffer passed via the setBuffer method.
Registering the source tensors at the beginning here is needed to properly handle cases where a source tensor is constant - in such a case the shared tensor also has to be constant (of ExternalTensor type).
Without this additional code, we have no guarantee that the source operands will be processed first.
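As a minimal, self-contained sketch of that lifetime idea (the class below is a toy stand-in, not the real StaticTensorManager): the buffer is created on the first claimPlan, handed to the sharing tensor through setBuffer so both tensors alias the same memory, and freed on the last releasePlan.

```cpp
// Toy illustration of buffer sharing with an explicit claim/release lifetime;
// names mimic the discussion above but this is not the onert implementation.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tensor
{
  std::size_t size = 0;
  uint8_t *buffer = nullptr;
  void setBuffer(uint8_t *buf) { buffer = buf; }
};

class StaticTensorManagerSketch
{
public:
  // First use of the memory buffer: allocate it and hand it to the tensor.
  void claimPlan(Tensor &tensor)
  {
    _storage.emplace(&tensor, std::vector<uint8_t>(tensor.size));
    tensor.setBuffer(_storage.at(&tensor).data());
  }
  // A tensor that shares memory receives the source tensor's buffer instead
  // of getting an allocation of its own.
  void shareBuffer(Tensor &shared, const Tensor &source) { shared.setBuffer(source.buffer); }
  // Last use of the memory buffer: release it.
  void releasePlan(Tensor &tensor) { _storage.erase(&tensor); }

private:
  std::unordered_map<Tensor *, std::vector<uint8_t>> _storage;
};

int main()
{
  Tensor source{16}, reshaped{16};
  StaticTensorManagerSketch mgr;
  mgr.claimPlan(source);             // first use of the source buffer
  mgr.shareBuffer(reshaped, source); // the Reshape output aliases the same memory
  assert(reshaped.buffer == source.buffer);
  mgr.releasePlan(source);           // last use: the memory is returned
  return 0;
}
```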
Aha, thank you a lot for your kind explanation 👍
I missed the changes in StaticTensorManager.cc. Now I clearly understand it :)
if (graph.operands().at(op.getInputs().at(0)).info().isDynamic())
{
  return false;
}
if (graph.operands().at(op.getOutputs().at(0)).info().isDynamic())
{
  return false;
}
@mbencer Is there any reason not to allow this for dynamic shapes?
@hseok-oh In general I believe it's possible to handle, but my plan was to implement it separately to limit the scope of this feature. Note that dynamic tensors have a separate build path - DynamicTensorManager::buildTensor. Dynamic shape handling requires an additional branch here to handle the case where the source memory tensor is a constant (of ExternalTensor type). Another case to research is when the source memory tensor has a static shape - in that case DynamicMemoryManager shouldn't be passed to the tensor ctor. The rest should be even simpler because dynamic tensors don't re-use common memory (controlled by [static]MemoryManager).
To sum up: I can handle it as part of this feature as well, or create a separate issue ;)
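For clarity, here is a standalone sketch of the eligibility rule currently enforced by the check shown above (the helper and types are hypothetical, not the onert API): only Reshape/ExpandDims/Squeeze whose input and output shapes are both static are considered for memory sharing, while dynamic operands keep their separate build path for now.

```cpp
// Hypothetical eligibility check mirroring the static-shape guard discussed above.
#include <string>

struct OperandSketch
{
  bool dynamic = false;
  bool isDynamic() const { return dynamic; }
};

struct OpSketch
{
  std::string type;
  OperandSketch input;
  OperandSketch output;
};

bool is_memory_sharing_allowed(const OpSketch &op)
{
  const bool supported_type =
    op.type == "Reshape" || op.type == "ExpandDims" || op.type == "Squeeze";
  // Dynamic operands are excluded; they are built by a different tensor manager.
  return supported_type && !op.input.isDynamic() && !op.output.isDynamic();
}

int main()
{
  OpSketch dynamic_reshape{"Reshape", {true}, {false}};
  return is_memory_sharing_allowed(dynamic_reshape) ? 1 : 0; // rejected: returns 0
}
```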
This commit extends the current tensor memory management infrastructure to allow tensor memory sharing where possible.
ONE-DCO-1.0-Signed-off-by: Mateusz Bencer [email protected]
Issue: #12836