New ref_ptr #1702
Conversation
Feels like you have too much time on your hands and not enough useful work to keep you occupied. Extra complexity, and I'm not clear on what the actual benefit is. Possibly faster `observer_ptr<>` to `ref_ptr<>` conversion, but folks shouldn't be writing code with lots of `observer_ptr<>` that are being accessed in a tight loop. I'd rather figure out why folks are over-using `observer_ptr<>`/`weak_ptr<>` and understand that usage case, create a unit test for it, then look at the bottlenecks of the existing solution before looking at optimizing if required.

As a general note, vsggroups was an interesting performance test in the early days of the VSG project before I had any proper large scene graphs to test against; things have changed now I have access to big representative scene graphs. Creating such scene graphs procedurally would be really cool.

Another note about vsggroups: before I changed the default alignment in the vsg::Allocator to 8 bytes it was 4 on x86_64 machines, which resulted in slightly more objects fitting in the cache, and this helped vsggroups show a bigger performance gap between `shared_ptr<>` and `ref_ptr<>`. As the original `shared_ptr<>` test case didn't use `vsg::Allocator`, it wasn't affected by the change to defaulting to an alignment of 8. For real scene graphs you see a significant improvement in performance when using `vsg::Allocator` over `std::new`/`delete`, and this is where it's critical.

If we want to look at performance of `ref_ptr<>` then the optimization we should be looking at is using techniques like Tracy uses to store smaller-than-8-byte pointers. Potentially we could use 4 bytes and then shift the value in `ref_ptr<>` by the alignment to give the final place in memory and still have a large enough addressable space for most VSG applications. Making this an option via CMake would allow users to decide when they want to use this optimization and accept the smaller addressable space.
The core problem that I'm trying to solve is the same as it was months ago. Some scenegraph operations, like intersection, inherently require things to be allocated and deallocated, and in the specific case of intersection, once I'd altered the implementation to only try intersecting a ray with the relevant subset of triangles in a mesh, more time was spent allocating and deallocating things than everything else put together. Some of them didn't really need allocating and deallocating and could just be reused, so I made those changes, but the rest needed allocation to stop being so tremendously slow. As the VSG currently is, there's no way to opt an individual instance of a type out of the usual allocation mechanism.

The basic idea here didn't take long to come up with or to implement; it's just doing exactly what `std::shared_ptr` does.
Speeding up `observer_ptr<>` to `ref_ptr<>` conversion isn't the main aim here. It's not possible to overuse `observer_ptr<>` when it's the only tool for the job. For the STL type, "I don't want to manually deal with the lifetime" is a completely valid use case, as that's exactly what `std::weak_ptr` exists for.
I'll have to take your word for this in the general case. Obviously, it's faster in simple apps like `vsggroups`.
That wouldn't do anything about the situations that this PR aims to improve, where the cost is allocation and deallocation rather than pointer size. Whether or not that idea has merit, it's still too easy to get into situations where the VSG allocator is unavoidable and unacceptably slow. All the STL smart pointers and containers recognise that applications can have knowledge about allocation and usage patterns, and accept a more suitable allocator specified per instance. That's true of most C++ libraries where performance is a goal (unless they manage to avoid allocation and deallocation altogether). Obviously, temporaries have different allocation and usage patterns to any of the existing VSG allocator affinities, and there are plenty of types that are used both as long-lived objects and as temporaries, so a solution can't just rely on affinities per type. The problem I'm hitting is a common problem that high-performance C++ applications hit, so it's got well-known and widely used solutions, including in the standard library; it's just that they're not usable because of existing VSG-specific complexities like `vsg::Object::attemptDelete`.
FYI, the allocator groups by `AllocatorAffinity`; there isn't any reason why this can't be extended to specifically handle frequently allocated/deallocated object types and then handle these in a special way. This would require minimal changes to existing code. As a general rule, though, avoiding lots of allocations/deallocations is the best way to get good performance.
As I said, the specific things that are a problem (`ArrayState` among them) can't simply be allocated less often.
When I do another review of the intersection work I'll brainstorm approaches that can streamline the functionality currently handled by `ArrayState`.
I'm fairly sure there's nothing that would achieve the same job without being wildly more complicated or breaking thread safety. We can't know what Vulkan state user-defined draw operations care about (especially anything that comes from future extensions or that cares about user-defined uniforms etc.) that might be set up elsewhere in the scenegraph, so we need to allow a user-defined type to be able to take responsibility, and it needs to be a separate instance of that type so intersection can be parallelised without each thread's state tracking trampling over that of other threads. If there's something that streamlines things, it'll involve breaking existing functionality.
I've done another code review and it's a non-starter. If I had wanted to use a `std::shared_ptr<>` approach I would have used `std::shared_ptr<>`. Trying to mix `ref_ptr<>` and `shared_ptr<>` to avoid the 16-byte size of the `shared_ptr<>` introduces more complexity and referencing of pointers. There is no way this will be as fast as proper intrusive reference counting, which we already have, and it's more complicated, which will make it harder to maintain, and fewer folks will look at the code and actually understand what is being done for what reasons.

This really isn't good engineering; it's changing stuff, making it less fast and less maintainable, with a theoretical benefit in niche usage cases, ignoring the fact that there are other, better ways to solve those problems. A better solution for slow memory allocations and deletions is to refine the `vsg::Allocator`. It's not been written for performance of allocations and deletions; it's been written for performance of access of data. A better solution for slow `observer_ptr<>` performance can come from application-level design refinements to avoid overuse of `observer_ptr<>`, or if this can't be done, then tackling that specifically as a performance optimization task.

For both the memory allocation/deletion speed and `observer_ptr<>` issues, the right way to tackle them is to create a performance test case that is added to vsgExamples/tests, then look at the bottlenecks and how best to address them, then use these test cases when evaluating which different approaches might work best. I do have ideas how both these areas might be improved, neither of which has anything close to the approach taken in this PR. This PR isn't just a bit rough around the edges and needing refinement; it's just not a design that actually resolves any of the actual problems we have.
This PR implements a new kind of `ref_ptr` that attempts to combine the best of both worlds from `std::shared_ptr` and classic intrusive reference-counted smart pointers like the existing implementation of `vsg::ref_ptr`.

`vsg::ref_ptr` has the advantages that:

- It's half the size of `std::shared_ptr` as it only needs to hold a pointer to the object itself, not a separate reference count. This is good for cache locality as each cache line can fit twice as many pointers.
- A plain pointer to an already-managed object can be adopted by a new `ref_ptr`, whereas `std::shared_ptr` would make a second control block unless you explicitly used `std::enable_shared_from_this`, which makes it intrusive after all.

`std::shared_ptr` has the advantages that:

- Much of the wider C++ ecosystem is designed with `shared_ptr` in mind.
- `std::weak_ptr` can be locked using only atomic operations instead of requiring a mutex. On typical desktop hardware, this is far, far faster, e.g. on x86, correctly-aligned operations on less than eight bytes are always atomic, so with relaxed ordering, atomic operations can be free. This has bitten apps with lots of `vsg::observer_ptr`s because they'd been ported from `std::weak_ptr`, and no one expected such a huge performance difference, so no one checked whether it was causing problems.
- Custom allocation and deallocation are handled by a plain `shared_ptr` instead of a weird subclass. In my profiling, switching from the `vsg::Allocator` for the type being allocated to an appropriate `std::pmr::memory_resource` subclass gave significant improvements. I bodged that by making a `vsg::ref_ptr` subclass that would deallocate using the right memory resource, but that's not suitable for real-world usage as if any other `ref_ptr` to the objects had been the last one to be destroyed, it'd use the wrong deallocation function. With `std::shared_ptr` I could have just called `std::allocate_shared` with a `std::pmr::polymorphic_allocator` and everything would work properly, with the custom behaviour living in a plain `shared_ptr` instead of a weird subclass.
- Nothing depends on resurrecting objects the way `vsg::Object::attemptDelete` does (although I've not seen that actually be used for anything in a VSG app and it's unused in the examples, so can't be sure).
- The aliasing constructor means a function can take a `std::shared_ptr<int>` as a parameter and be passed a pointer to an `int` that's really a member of an instance of a class managed by a shared pointer, and keep the class instance alive.

The implementation in this PR:

- Keeps the `sizeof(vsg::ref_ptr<T>) == sizeof(T*)` property by not pointing to the control block in the pointer instance but instead doing it from the pointed-to instance. This doesn't make the pointed-to type bigger because there were already four bytes of padding after the reference count in `vsg::Object`, so holding a pointer instead of an atomic integer just fills in that padding.
- Still allows making a `ref_ptr` to something that's already reference counted.
- Makes `vsg::observer_ptr`s convertible to `vsg::ref_ptr` without locking a mutex, using atomic operations instead. This should be much faster in apps that use a lot of `observer_ptr`s (or allow apps that don't use them to start doing so).
- Means `attemptDelete` is gone (because I don't believe that feature is necessary any more).
- Supports custom deleters by calling `operator()` on a template argument.
- Breaks anything relying on the `attemptDelete` system, but again, there don't seem to be any public use cases for it, so I can't demonstrate how to port them. Obviously, we need to discuss this.
- Doesn't provide an aliasing constructor, as the pointed-to type still has to derive from `vsg::Object`, which a plain `int` can't, so the main use case can't work.

There are some limitations and unanswered questions, some of which are potentially easily addressed and/or already existed:
To adopt objects created with `new`, I had to make `ref_ptr` able to directly access the `_referenceCount` member, so the `ref`/`unref` methods of `vsg::Object` became apparently redundant, and I've commented them out for now. I'm not entirely sure adopting objects created with `new` is particularly important, though, as everything could just use `vsg::Whatever::create` or the new `vsg::make_referenced<Whatever>`. Forcing that would be a breaking change, so I didn't want to include it if I didn't have to, but it might be a good idea. It could also make it less error-prone to stack-allocate VSG classes, as a null reference count could be interpreted as meaning an instance isn't heap-allocated. That's something that could be checked at runtime, and if anything was misused, it'd be an easily-debugged immediate segfault rather than delayed memory corruption like the existing implementation would cause.

There's some heavily templated code that makes things work. By modern C++ standards, it's not that bad (and I've gone out of my way to make it more readable than some of the STL code it was based on), but it's more complicated than it was before.
I left out some empty-base-class optimisation/compressed-pair optimisations all STL implementations use, as they make things much less readable. If the VSG used C++20, they could be replaced by the `[[no_unique_address]]` attribute, which is much more readable. This shouldn't be a big deal for now, as it's only relevant when an empty type is used as a template parameter, e.g. passing in a plain function as a deleter or using a stateless allocator, and for now, the use case I'm addressing has a `std::pmr::polymorphic_allocator`, which isn't stateless.

I've tucked some of the templatey stuff away into a `vsg::detail` namespace so it doesn't pollute the main `vsg` namespace.

The code ended up triggering an intentionally-enabled warning with MSVC (but it does the thing the warning warns about on purpose because it's the right thing in this specific situation), and a buggy warning with GCC, so there are some `#pragma`s to suppress these, which makes some things look more complicated than they are.

Because many VSG classes have a non-public destructor so they can't be stack allocated, some workarounds were needed.
As I think the justification for the `vsg::Object` destructor being protected has been reduced with this PR, I'd be in favour of making it public, and then these could be tidied up.

`std::enable_shared_from_this` doesn't work when called from an object's constructor (it must have been fully constructed and adopted by a `std::shared_ptr` first). On the other hand, `vsg::Object` subclasses can and do pass `this` to `ref_ptr`/`observer_ptr` in their constructors. There's no entirely automatic way to preserve this ability (or at least not one that's reliable - there are lots of possible hacks that would only work sometimes).

The most sensible compromise was to make a custom type trait that controls whether the reference count is passed to the constructor, as that way, only a handful of classes needed to be changed and the rest would keep working exactly as they always have. I've gone through and adjusted all the affected classes I can find in the public VSG projects, which all turned out to be in the main VSG repo, so this shouldn't cause problems. Affected applications can copy the change from `DeviceMemory` etc. as it's not complicated.

The two biggest potential issues are:
- Finding every affected class: searching for `this` obviously gives lots of places it's not used in a constructor. I `grep`ped with PCRE subroutines to search for the `this` keyword only within constructors, so believe I've got everything, but I'm pretty sure some people's brains would melt if they tried to read the regular expressions I wrote. If a search like that isn't done, then problems will only present themselves at runtime, which isn't ideal.
- In the `vsg::Object` copy constructor, there was a branch where `this` is used and assigned to a `ref_ptr`. As that's the base class for everything, if I kept that, we'd need to take the ref count in the copy constructors for everything, which would be a huge pain. After investigating the branch, though, it looks like it's only reachable if a `CopyOp` is going to assign the same object to the same pointer once its constructor has finished anyway, so I don't think it's accomplishing anything productive. I've therefore commented it out for now, but it was originally added after the `CopyOp` stuff that I think makes it redundant. It would definitely be good to have a second pair of eyes on this.

`vsg::Auxiliary` is managed with a plain C pointer and explicit reference adjustment in `vsg::Object`, which I think is a bit odd as it looks like a `vsg::ref_ptr` would be fine. If that's right, then some things could be simplified a little by making that change.

As there's now a `std::shared_ptr`-style control block rather than just an atomic uint in each object, that obviously needs to live somewhere. The obvious solution is to allocate it as its own object, but that means an extra allocator call, and it was the allocator taking too long that motivated me to start this. `std::make_shared` and `std::allocate_shared` use an alternative approach - they define a templated struct that includes both the reference count and the managed object, which means a single allocator call can deal with both. Technically, that's not required by the standard, but it does recommend it and all extant implementations do so, so the organisations maintaining STL implementations seem to have decided that it's faster in the general case.
The VSG isn't necessarily the general case, though. Putting the control block in the same allocation as the managed object would, on the face of it, be beneficial if the reference count is typically accessed when the object is (e.g. it's usually assigned to a fresh `ref_ptr` when it's accessed), as they'll land in the same cache line, or when allocation itself is the most expensive operation to happen to objects, and be counterproductive if objects are typically accessed without touching the ref count (e.g. the record traversal), as the ref count being in the same cache line means something else isn't.

The `vsggroups` example exists and, on paper, would be an excellent tool for testing this kind of change. However, as my post the other day explains, its results are very sensitive to things that they ideally wouldn't be. Initially, testing on Windows with MSVC in the RelWithDebInfo configuration, I was seeing no performance degradation during traversal from collocating the control block, which was a pleasant surprise, but when I came to do final thorough testing, I noticed some odd results.
This prompted my investigation into the `vsggroups` example, and it turned out that I could induce impacts much bigger than this PR produces by doing things that shouldn't make any difference to anything, e.g. adding some functions to various source files and only calling them after the traversal benchmark has finished.

As far as I can tell, the factor causing the differences is really just whether the compiler/linker happen to put functions called during traversal into adjacent memory addresses that end up mapping to the same instruction cache line. This is really easy to influence in a small benchmark app like `vsggroups` (usually making things worse). However, in a real application with many more functions, it's much less likely that a meaningful number of functions that are called consecutively and are small enough to share a cache line actually end up in the same cache line, and especially that they also end up being the only functions executed on a hot path.

Depending on the machine I test on, after I've attempted to exclude the effect of binary layout, the impact on pure traversal time of collocating the control block seems to be around 3% on a machine with fast RAM and 5% on a machine with slower RAM, but these are fairly rough figures because the binary layout effects can be double-digit percentages, so it's hard to isolate what's what and whether one result is more or less directly comparable to the baseline than another is.
I've therefore opted not to collocate the control block by default, so the cache pressure from nodes themselves is unchanged by this PR, and for simplicity, it's allocated with a regular `new` call. That will have a (likely small) performance impact in some specific cases, e.g. allocating a lot of VSG objects (but it's already been decided that it's okay for doing that to not be super fast, hence it being fine that the VSG allocator is significantly slower than plain `malloc`), or when a lot of things are `ref`ed/`unref`ed in quick succession. Sometimes, the impact will be positive, though, e.g. if a lot of objects are `unref`ed on different threads, as even on architectures like x86 where small memory accesses are always atomic, they're only relaxed atomics rather than strict ones, so writes still need to stall the thread if any other core has the target address cached. Reducing the chances that atomics accessed without `std::memory_order_relaxed` end up in the same cache line can therefore be helpful.

If we knew that decrementing reference counts could afford to be slow, or could predict which objects were likely to be `unref`ed from one thread while another thread was using them, there'd be easy gains here from allocating the reference count with `std::pmr::synchronized_pool_resource`, so that might be worth pursuing. It might not make any difference one way or the other to real apps, though, so there's no point until we've confirmed it matters with a profiler.

There might be places in the VSG where using the new abilities this provides ends up giving a performance improvement, e.g. anywhere temporary `vsg::Object` subclass instances are made and then discarded before the next frame. I've only investigated using it with `vsg::Intersector`, and it won't give a statistically significant improvement there without the other changes.

Obviously, this is a fairly invasive change as it affects some of the core classes that everything uses, so it needs more careful consideration than a typical PR.
Possible alternatives
Before doing all this, I experimented with an alternative where `vsg::Auxiliary` held a `std::optional<std::function>`, which could optionally be used as a deleter instead of the `delete` operator.

That was much less invasive, but:

- A `vsg::Auxiliary` instance isn't necessarily unique to a particular `vsg::Object` instance, and that opens a can of worms with using the wrong deleter with the wrong memory. Potentially, this could be shored up a bit, though.
- Giving every object a `vsg::Auxiliary` becomes mandatory, and that undid the performance improvement from using a specialised allocator in the first place. Potentially, something could be implemented to allow constructing a `vsg::Auxiliary` in the same allocation as its `vsg::Object`, though.
- `vsg::Auxiliary` has a `std::map` member, and the C++ specification permits `std::map` to require a node to be allocated even for an empty map to represent the end node. libc++ and libstdc++ put the end node within the `std::map` instance, but the Microsoft STL allocates it on the heap, and Windows with MSVC is a target platform of the client app I was originally trying to optimise, so this third allocation ends up making everything slower than it started. Potentially, this could be mitigated by making `vsg::Auxiliary::userObjects` a `std::optional<ObjectMap>` instead of an `ObjectMap`, though, and only initialising it if it's used.
- It wouldn't make `vsg::observer_ptr`s stop being surprisingly much slower than `std::weak_ptr`, which we know has caught people out in the past.

It's worth having a brief mention of the idea of using `std::shared_ptr` instead of `vsg::ref_ptr`, but:

- Everything would have to be created with `std::make_shared`/`std::allocate_shared` to mitigate the putting-48-bytes-of-control-block-between-every-pair-of-nodes problem that undermines the `vsg::IntrusiveAllocator` gains.
- As plenty of classes pass `this` pointers to things during constructors, those are going to be a pain to unpick.

We therefore wouldn't want to switch to `std::shared_ptr` without a compelling reason, and this PR reduces the number of reasons.

Next steps
I'd hoped to make a quick proof-of-concept for this and have a discussion about it at an earlier stage, but between solving edge cases that needed to work before most of the examples would run and then investigating why the performance numbers were weird, getting it to a stage where it was worth discussing has taken longer than I'd have liked.
In terms of testing this, on paper, it should be reliable, but I've only got limited access to real VSG apps, so I don't want to make a blanket statement that everything will just work. The impact on framerates should be zero for apps that don't do a lot of allocations and deallocations within a frame, and if they do, there should be new optimisations possible using the new features here that will allow better performance than before. Operations like initial loading that do a lot of allocations are expected to be a little slower, but hopefully not too much. If it's a problem, then we can look at changing how control blocks are allocated. Any traversal performance changes reported by `vsggroups` are more likely to be caused by its unreliability than by this PR.

There's obviously some tidying to do once some of the points to discuss have been resolved, too.