I am looking for a way to run batched inference instead of running inference one request at a time in C++.

For example, assume each request in requests is a single prediction containing n features, each of which is converted to a tensor:

for (const auto& request : requests) {
    // Convert one request into its n feature tensors.
    std::vector<torch::Tensor> feature_tensors = convert_request_to_tensor(request);
    std::vector<torch::jit::IValue> inputs;
    for (const auto& feature_tensor : feature_tensors) {
        inputs.push_back(feature_tensor);
    }

    // One forward call per request.
    torch::Tensor output = model.forward(inputs).toTensor();
}

Now I want to run batched inference. How would I do that? Can I simply stack the inputs, like this?

std::vector<torch::Tensor> inputs = {input_1, input_2, ..., input_N};
torch::Tensor batch = torch::stack(inputs, 0);

But if you have multiple features per request, i.e. input_1 is a std::vector<torch::Tensor>, how does that work?
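What I have in mind, assuming every request yields the same n feature tensors and that feature j has the same shape in every request, is to stack per feature slot rather than per request, roughly like this (reusing model, requests and convert_request_to_tensor from above):

std::vector<std::vector<torch::Tensor>> per_feature;  // per_feature[j] holds feature j from every request

for (const auto& request : requests) {
    std::vector<torch::Tensor> feature_tensors = convert_request_to_tensor(request);
    if (per_feature.empty()) {
        per_feature.resize(feature_tensors.size());  // n feature slots
    }
    for (size_t j = 0; j < feature_tensors.size(); ++j) {
        per_feature[j].push_back(feature_tensors[j]);
    }
}

std::vector<torch::jit::IValue> inputs;
for (const auto& tensors : per_feature) {
    // Stack along a new dim 0, giving [batch_size, ...original feature shape].
    inputs.push_back(torch::stack(tensors, 0));
}

// A single forward call for the whole batch; row i of the output should
// correspond to request i if the model treats dim 0 as the batch dimension.
torch::Tensor batched_output = model.forward(inputs).toTensor();

So the model still receives n inputs, but each one now carries a leading batch dimension. Is that the right way to think about it?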

And how would you know what the optimal batch size is? Is there a formula you can derive from the number of CPU cores?
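I haven't found such a formula, so the only thing I can think of is measuring throughput for a few candidate batch sizes and picking one empirically, roughly like this (make_dummy_batch is a hypothetical helper that builds batched inputs of the right shapes for the model; timing uses <chrono>, printing uses <iostream>):

for (int batch_size : {1, 2, 4, 8, 16, 32, 64}) {
    // make_dummy_batch is a placeholder for building a batched input
    // (e.g. via the torch::stack approach above) of the given size.
    std::vector<torch::jit::IValue> inputs = make_dummy_batch(batch_size);

    model.forward(inputs);  // warm-up so one-time initialization doesn't skew timing

    const int kIters = 50;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        model.forward(inputs);
    }
    double seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::cout << "batch " << batch_size << ": "
              << (kIters * batch_size) / seconds << " requests/s, "
              << seconds / kIters * 1000.0 << " ms/batch\n";
}

Is benchmarking like this the usual approach, or is there something smarter?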
