Avoiding repeated C++ virtual table lookup

Question

I have C++ program that reads a config file when the binary is executed, creates a number of child class instances based on the config file, and then periodically iterates over these instances and calls their respective virtual functions.

Gprof is telling me that these function calls are taking up a lot of time (the aforementioned iteration happens very frequently), so I want to try to avoid the repeated virtual function calls somehow.

The code is similar to the following. Once the program populates vector v at the start of the program, this vector won't change anymore for the rest of the program, so it seems inefficient to repeatedly have to do a virtual table lookup every time I want to call f(). I would think there must be a way to cache or save the function pointers somehow, but I'm not sure how.

Would love any suggestions you have on speeding things up. Thank you!

Edit: Sorry, I forgot to mention that the function calls f() for the vector of Child instances has to be in order from 0 to v.size() - 1, so I can't group together the elements of v that have the same derived type.

Also, this was built with -O3 -std=c++14

class Parent {
public:
    virtual void f() { } 
};

class Child1 : public Parent {
public:
    void f() { /* do stuff for child1 */ }
};

//...

class Child9 : public Parent {
public:
    void f() { /* do stuff for child9 */ }
};

int main() {
    vector<Parent*> v;

    // read config file and add Child instances to v based on the file contents

    while (true) {
        // do other stuff
        for (size_t i = 0; i != v.size(); ++i) {
             v[i]->f(); // expensive to do the same virtual table lookups every loop!
        }
    }
};

Mathieu Ropert has a nice blog post titled Polymorphic Ducks on how to do static polymorphism in C++. — Eljay, Commented Mar 21, 2020 at 0:26
One important thing that should be added to the question: how was this compiled and linked? (Was the optimizer enabled? Was LTO or WPO enabled?) — Eljay, Commented Mar 21, 2020 at 0:31
The repeated calls to v.size() are likely to have more of a performance hit than a vtable lookup. — 1201ProgramAlarm, Commented Mar 21, 2020 at 0:40
Is it telling you that the lookup (Which has similar cost to calling through a pointer to a function pointer) is taking a long time, or calling the function itself? — Artyer, Commented Mar 21, 2020 at 1:05
For an experiment, mock up a non-virtual example to compare against. Maybe use a config file that requires only Child1, make f() non-virtual, and change v to a vector<Child1 *>? Then switch f() back to virtual to see what the impact of the vtable lookup is? — JaMiT, Commented Mar 21, 2020 at 1:28

DeducibleSteak · Accepted Answer · 2020-03-21 02:20:54Z

Based on some of the questions and your answers in the comments, here are a couple of considerations.

1) Your problem (if there is one, your solution might already be close to optimal, depending on details you have not mentioned) is most likely somewhere else, not in the overhead of a virtual function call.

If you really run this in a tight loop, and there's not much going on in the implementations of f() that touches a lot of memory, your vtables probably remain in the L1 cache, and the virtual function call overhead will be absolutely minimal, if any, on modern hardware.

2) You say "the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third address" - this might not be as innocent as you expect. For reference, going to L1 cache will cost you about 3 cycles, going to RAM may cost as much as 60-200, depedning on your hardware.

If you have enough of these objects (so that keeping all of the memory they reference in L1 cache is not possible), and the memory locations they reference are basically random (so that prefetching is ineffective), and/or you touch enough things in the rest of your program (so that all the relevant data gets vacated from cache between the loops over your vector), the cost of fetching and storing the values from and to memory/lower levels of cache will outweigh the cost of the virtual function calls by orders of magnitude in the worst case.

3) You iterate over a vector of pointers to objects - not the objects themselves.

Depending on how you allocate the objects and how big they are, this might not be an issue - prefetching will do wonders for you if you allocate them in a tight loop and your allocator packs them nicely. If, however, you allocate/free a lot of other things and mix in the allocations of these objects in between, they may end up located sparsely and in basically random locations in memory; then iterating over them in the order of creation will involve a lot random reads from memory, which will again be far slower than any virtual function overhead.

4) You say "calls to f() for the vector of children has to be in order" - do they?

If they do, then you are out of luck in some ways. If, however, you can re-architect your system so that they can be called ordered by type, then there is a lot of speed to be gained in various aspects - you could probably allocate an array of each type of object (nice, dense packing in memory), iterate over them in order (prefetcher friendly), and call your f()'s in groups for a single, well known type (inlining friendly, instruction cache friendly).

5) And finally - if none of the above applies and your problem is really in virtual function calls (unlikely), then, yes, you can try storing a pointer to the exact function you need to call for each object in some fashion - either manually or by using one of the type erasure / duck typing methods others have suggested.

My main point is this - there a lot of performance benefits to be had from changing the architecture of your system in some ways.

Remember: accessing things that are already in L1/L2 cache is good, having to go to L3/RAM for data is worse; accessing memory in a sequential order is good, jumping all over memory is bad; calling the same method in a tight loop, potentially inlining it, is good, calling a lot of different methods in a tight loop is worse.

If this is a part of your program the performance of which really matters, you should consider changing the architecture of your system to allow for some of the previously mentioned optimizations. I know this may seem daunting, but that is the game we are playing. Sometimes you need to sacrifice "clean" OOP and abstractions for performance, if the problem you are solving allows for it.

Thanks for the insight. I've done further testing and believe (2) is the culprit actually. I changed the multiplying of values at two memory addresses to simply the multiplying of two constants, and the runtime of the associated f() drops significantly in gprof — galpo, Commented Mar 21, 2020 at 4:09

eerorika · Accepted Answer · 2020-03-21 00:59:02Z

2

Edit: For vector of arbitrary child types mixed in together, I recommend going with the virtual call.

If, depending on config, there were a vector of only one child type - or if you can separate the different types into separate containers, then this could be a case where compile time polymorphism might be an option instead of runtime one. For example:

template<class Child, class Range>
void f_for(Range& r) {
    for (Parent* p : r) {
        Child* c = static_cast<Child*>(p);
        c->Child::f(); // use static dispatch to avoid virtual lookup
    }
}

...

if (config)
    f_for<Child1>(v);
else
    f_for<Child2>(v);

Alternative to explicit static dispatch would be to mark the child class or the member function final.

You might even expand the static portion of the program so that you get to use vector<Child1> or vector<Child2> directly, avoiding the extra indirection. At this point the inheritance is not even necessary.

edited Mar 21, 2020 at 0:59

answered Mar 21, 2020 at 0:34

eerorika

238k12 gold badges206 silver badges345 bronze badges

Thanks, but I apologize, I forgot to mention that the calls to f() for the vector of children has to be in order (please see my edit to my original post). Does that invalidate the solution you suggested?
– galpo
Commented Mar 21, 2020 at 0:47
1

@galpo So, what you're saying is that the vector has pointers to several different child types, in arbitrary order? In that case, the virtual lookup is probably the way to go.
– eerorika
Commented Mar 21, 2020 at 0:54
That's right. The order of those pointers in v remains the same throughout the program though. There's no way to somehow save or cache the function pointers retrieved from the virtual table lookups?
– galpo
Commented Mar 21, 2020 at 0:59
1

@galpo Let's assume you could: Why would looking up the cache be faster than looking up the virtual table?
– eerorika
Commented Mar 21, 2020 at 1:00
1

@galpo Feel free to try it out. If you can make it faster than a virtual call, then some people may be interested in a blog post about it.
– eerorika
Commented Mar 21, 2020 at 1:08

| Show 1 more comment

Collectives™ on Stack Overflow

Avoiding repeated C++ virtual table lookup

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged
c++
optimization
virtual-functions
vtable
virtual-table
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c++optimizationvirtual-functionsvtablevirtual-table or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
c++
optimization
virtual-functions
vtable
virtual-table
or ask your own question.