I am looking for a way to batch inference requests instead of running inference one request at a time in C++ (LibTorch).
For example, assume each request in requests is a single prediction containing n features, each feature converted to its own tensor:
for (const auto& request : requests) {
    // One tensor per feature for this single request
    std::vector<torch::Tensor> feature_tensors = convert_request_to_tensor(request);
    std::vector<torch::jit::IValue> inputs;
    for (const auto& feature_tensor : feature_tensors) {
        inputs.push_back(feature_tensor);
    }
    // Forward pass for one request at a time
    torch::Tensor output = model.forward(inputs).toTensor();
}
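
To make the setup concrete, this is how I would check the per-request shapes; the dimensions mentioned in the comment are just placeholders for my real features:

#include <iostream>

// In my setup each request yields n feature tensors with no batch dimension,
// e.g. shapes {16}, {8}, {4} for n = 3 (placeholder numbers).
std::vector<torch::Tensor> feature_tensors = convert_request_to_tensor(requests.front());
for (const auto& t : feature_tensors) {
    for (int64_t d : t.sizes()) {
        std::cout << d << ' ';
    }
    std::cout << '\n';
}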
Now I want to batch the inference. How would I do that? Can I simply stack the inputs along a new batch dimension, like this:
std::vector<torch::Tensor> inputs = {input_1, input_2, ..., input_N};
torch::Tensor batch = torch::stack(inputs, 0);
But if you have multiple features per request, i.e. input_1 is a std::vector<torch::Tensor> (one tensor per feature), how does that work?
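
The only approach I can think of is to stack per feature position rather than per request, so the model still gets one IValue per feature, but each one is batched. Here is a rough sketch of what I mean, reusing the names from my example above; it assumes every request produces the same number of feature tensors with matching shapes and no existing batch dimension, and I am not sure it is the right way:

std::vector<std::vector<torch::Tensor>> per_feature;  // per_feature[i][j] = feature i of request j
for (const auto& request : requests) {
    std::vector<torch::Tensor> feature_tensors = convert_request_to_tensor(request);
    per_feature.resize(feature_tensors.size());
    for (size_t i = 0; i < feature_tensors.size(); ++i) {
        per_feature[i].push_back(feature_tensors[i]);
    }
}

std::vector<torch::jit::IValue> inputs;
for (const auto& tensors : per_feature) {
    // Stack feature i of every request along a new batch dimension (dim 0).
    // If the per-request tensors already had a leading batch dimension of 1,
    // I suppose torch::cat(tensors, 0) would be the right call instead.
    inputs.push_back(torch::stack(tensors, 0));
}

// One forward call for the whole batch; I expect dim 0 of the output to be the batch size.
torch::Tensor batch_output = model.forward(inputs).toTensor();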
And how would you know what the optimal batch size is? Is there a formula you can derive from the number of CPU cores?
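
So far the only idea I have is to measure it empirically, something like the sketch below that times a few candidate batch sizes and reports throughput (Request and run_batched_forward are placeholders for my own request type and the batching code above):

#include <chrono>
#include <iostream>
#include <vector>

void find_batch_size(torch::jit::script::Module& model,
                     const std::vector<Request>& requests) {
    for (size_t batch_size : {1, 4, 8, 16, 32, 64}) {
        auto start = std::chrono::steady_clock::now();
        size_t processed = 0;
        for (size_t i = 0; i + batch_size <= requests.size(); i += batch_size) {
            std::vector<Request> chunk(requests.begin() + i,
                                       requests.begin() + i + batch_size);
            run_batched_forward(model, chunk);  // one forward pass per chunk
            processed += batch_size;
        }
        double seconds = std::chrono::duration<double>(
                             std::chrono::steady_clock::now() - start).count();
        std::cout << "batch size " << batch_size << ": "
                  << processed / seconds << " requests/s" << std::endl;
    }
}

But that feels ad hoc, which is why I am asking whether there is a rule of thumb based on the core count.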