Threads blocking on read to ScaleOut server after moving to .NET 5

#1
Hello,
We've upgraded our apps, which had been running successfully on .NET Framework 4.7.2 for a long time, to .NET 5.
Since then, our apps' functionality has degraded significantly, and while analyzing the dumps we noticed a lot of threads stuck on:

13032 506 Worker Thread Worker Thread Scaleout.Client.dll!Scaleout.Client.Net.GatewayMultiplexer.GetGatewayFor
[External Code]
Scaleout.Client.dll!Scaleout.Client.Net.GatewayMultiplexer.GetGatewayFor(ref Scaleout.Client.Net.ObjectKey key)
Scaleout.Client.dll!Scaleout.Client.Net.Store.DoRequest(Scaleout.Client.Net.ObjectKey key, System.Net.IPAddress preferredHost, System.Func<System.IO.Stream, Scaleout.Client.Net.Messaging.Response> requestFunc, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Net.Store.Read(Scaleout.Client.Net.ObjectKey key, Scaleout.Client.LockToken lockToken, Scaleout.Client.VersionToken version, Scaleout.Client.Net.ReadFlags flags, byte[] keyList, Scaleout.Client.Net.ObjectKey? mmfKey, System.Threading.CancellationToken cancellationToken)
Scaleout.Client.dll!Scaleout.Client.Cache<string, object>.ReadInternal(Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, Scaleout.Client.Net.ReadFlags flags, Scaleout.Client.LockToken lockToken, Scaleout.Client.Trace.AccessOperation traceOp, string key, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Cache<System.__Canon, System.__Canon>.Read(System.__Canon key, Scaleout.Client.ReadOptions options, System.Threading.CancellationToken cancellationToken)


Any idea what the issue might be?
We're currently using NuGet package version 3.1.4.

Thanks !
 

markw

Administrator
Staff member
#2
A few questions:
  1. How many threads do you see blocked on GatewayMultiplexer.GetGatewayFor()? Hundreds?
  2. What operating system are you running on?
  3. Is it possible for you to move to async calls and keep your app's thread count down?
For background, I looked through old support cases and pieced together some history:

We made significant optimizations on the .NET Framework 4.x build of the Scaleout.Client library based on problems that your company reported back in 2020 when you were still running on the .NET Framework. Back then, your app ran as a traditional ASP.NET worker process that regularly served 500+ simultaneous web requests, where each request was handled by a separate thread. The Scaleout.Client library maintains a connection pool that uses .NET locking primitives to queue up waiters for a connection, but the locking primitives in the .NET Framework performed poorly under heavy thread contention.

(I remember the problem: the Monitor class in .NET 4.x queues up waiting threads in a singly-linked list, which slows down as more threads traverse the growing list to wait for a connection, which makes the ASP.NET runtime create more request-handling threads, which creates more waiters, which grows the list more... so the problem snowballed and performance cratered.)
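
If you want to see that effect in isolation, it's easy to reproduce outside our library. The sketch below is ordinary Monitor contention -- nothing ScaleOut-specific, and none of it is our connection pool code -- but it lets you experiment with how elapsed time changes as you raise the thread count:
C#:
// Standalone repro sketch (not Scaleout.Client code): many threads contending
// on a single Monitor-protected resource, roughly the shape of the old
// connection pool under a thread-per-request load.
using System;
using System.Diagnostics;
using System.Threading;

class MonitorContentionDemo
{
    static readonly object PoolLock = new object();

    static void Main()
    {
        const int threadCount = 500;   // similar to the thread counts discussed above
        var sw = Stopwatch.StartNew();
        var threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < 100; j++)
                {
                    lock (PoolLock)            // every waiter queues on the same lock,
                    {                          // like threads waiting for a pooled connection
                        Thread.SpinWait(5000); // simulate briefly holding the connection
                    }
                }
            });
            threads[i].Start();
        }

        foreach (var t in threads) t.Join();
        Console.WriteLine($"{threadCount} threads finished in {sw.ElapsedMilliseconds} ms");
    }
}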

We fixed this on the .NET Framework by replacing all the old locking with much more efficient, native locking primitives that didn't degrade as the thread count increased. So performance on .NET 4.x became very good in this situation (even better than .NET Core).

At the time, we found that .NET Core's behavior under this kind of load was not as bad. There was a little bit of a slowdown if the thread count grew to the hundreds, but it didn't become pathologically slow like .NET Framework 4.x. So we left the original locking in place for .NET Core since it's cross-platform, and the fix we made for you on .NET 4.x leveraged facilities that were Windows-specific. But, evidently, .NET Core is still slow enough to cause problems for you now.

In any case, I'm surprised that you're having this problem on .NET Core/5/6... ASP.NET Core apps usually run with far fewer threads--concurrency on Core is typically achieved by doing lots of async operations on a smaller thread pool instead of using a thread per request. Does your app manually spin up a lot of background threads (or make a lot of calls to ThreadPool.QueueUserWorkItem())?

Thanks,
Mark Waterman
 
#3
Hello Mark,

Thanks for your response.
1. The thread count was about 500.
2. The clients run on Windows, and the ScaleOut servers are on Linux.
3. Back when you did the locking optimizations for the .NET Framework build, you advised, based on some investigation you had done, that we'd be better off using the non-async calls.
Is that no longer the case? Or do the async versions perhaps behave better with the latest version of the NuGet package rather than 3.1.4, which we currently have installed? If so, would you recommend upgrading the client's NuGet version along the way?

Thanks
 

markw

Administrator
Staff member
#4
OK, so you're still running with a whole lot of threads. We can research to see if there's a similar optimization available for the .NET Core build.

But, if possible, I would try moving to the async Cache methods and at least test to see if there's an improvement. I'm almost certain it'll help since using fewer threads will eliminate the heavy contention on the connection pool.
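
To make the change concrete, here's a rough sketch of converting a blocking Read into an awaited ReadAsync. This is illustrative only -- the wrapper class, the ReadOptions instance, and the response's Value property are assumptions on my part; adapt it to your actual cache setup.
C#:
using System.Threading;
using System.Threading.Tasks;
using Scaleout.Client;

// Rough sketch only: the wrapper class, the ReadOptions instance, and the
// response's Value property are placeholders -- adapt to your actual code.
public class SettingsReader
{
    private readonly Cache<string, object> _cache;   // built elsewhere via CacheBuilder
    private readonly ReadOptions _readOptions;       // however you configure reads today

    public SettingsReader(Cache<string, object> cache, ReadOptions readOptions)
    {
        _cache = cache;
        _readOptions = readOptions;
    }

    // Before: blocks a thread-pool thread for the entire round trip to the service.
    public object ReadBlocking(string key, CancellationToken ct)
    {
        var response = _cache.Read(key, _readOptions, ct);
        return response.Value;                        // .Value is assumed here
    }

    // After: awaiting ReadAsync releases the thread while the request is in
    // flight, so hundreds of concurrent requests no longer need hundreds of
    // blocked threads contending for pooled connections.
    public async Task<object> ReadNonBlockingAsync(string key, CancellationToken ct)
    {
        var response = await _cache.ReadAsync(key, _readOptions, ct);
        return response.Value;                        // .Value is assumed here
    }
}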

...I don't remember my exact statement about async calls being slower, but I think it was more nuanced. I recall alluding to the fact that synchronous calls will outperform async calls in situations like benchmarks where the only thing you're doing is hammering the ScaleOut service with requests. Since .NET 4.x already uses a thread-per-request model, using async calls there was of questionable value. But async is great in the .NET Core world, where everything else is also async and you have a good mix of I/O-bound and compute-bound work... ASP.NET Core really wants you to do a lot of async work on a smaller number of threads.

A bit more background:

When it comes to raw performance in a tight benchmark loop, you'll find that synchronous calls will outperform async calls in pretty much every library out there. A 30-50% hit is not unusual for libraries that do async operations. (A statement like this is probably what jumped out at you.)

Naturally, you'll wonder what the value of async is if individual calls are always slower. Well, async calls can do something wonderful: If you're running a web app, your request handling threads don't get blocked when you need to make an async call--instead of blocking, the thread moves on to doing other compute-intensive work while the OS waits for the async I/O to complete. So an individual request may be a little slower, but your app scales much better because you can now handle a lot more concurrent requests.

In other words, async isn't optimized for situations like benchmarks where you're flooding the system with I/O operations--rather, it's meant to help out when you have I/O work and CPU-bound work interleaved with each other. Microsoft's guidance in this matter is worth reviewing.
 
#5
Thanks Mark.
We'll try switching to async - it should be relatively straightforward.
If that doesn't work, we'd be happy if you could look into applying an optimization to the .NET Core build similar to the one you made for the .NET Framework build, which was very helpful.
We'll keep you posted.

Thanks !
 
#6
Hello Mark,
We want to try out the switch to async soon - do you think upgrading to the latest NuGet version of the client could be helpful as well (as mentioned, we currently use version 3.1.4)?

Thanks
 

markw

Administrator
Staff member
#7
I don't think you'll see a difference in performance of the core library functionality if you upgrade to the latest. Most of the changes since 3.1.4 have involved adding support for advanced features in the service (GeoServer, PMI, etc.).

There was one important fix in 4.2.7 involving client cache accuracy that could improve performance and stability for some users. It's actually in the library's Scaleout.Collections dependency, as outlined in this discussion. At the very least, I'd recommend updating the Scaleout.Collections package to 1.0.4 if you haven't already.
 
#8
Hi Mark,
Unfortunately that (Collections 1.0.4) and moving most of our flows to async didn't help either.
There are some locations that are more difficult to move to async, and it seems it's failing on them now:

Not Flagged 25944 2042 Worker Thread Worker Thread System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait
[External Code]
[Waiting on Async Operation, double-click or press enter to view Async Call Stacks]
[External Code]
Scaleout.Client.dll!Scaleout.Client.Net.Gateway.GetConnection()
Scaleout.Client.dll!Scaleout.Client.Net.Store.DoRequest(Scaleout.Client.Net.ObjectKey key, System.Net.IPAddress preferredHost, System.Func<System.IO.Stream, Scaleout.Client.Net.Messaging.Response> requestFunc, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Net.Store.Read(Scaleout.Client.Net.ObjectKey key, Scaleout.Client.LockToken lockToken, Scaleout.Client.VersionToken version, Scaleout.Client.Net.ReadFlags flags, byte[] keyList, Scaleout.Client.Net.ObjectKey? mmfKey, System.Threading.CancellationToken cancellationToken)
Scaleout.Client.dll!Scaleout.Client.Cache<string, object>.ReadInternal(Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, Scaleout.Client.Net.ReadFlags flags, Scaleout.Client.LockToken lockToken, Scaleout.Client.Trace.AccessOperation traceOp, string key, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Cache<System.__Canon, System.__Canon>.Read(System.__Canon key, Scaleout.Client.ReadOptions options, System.Threading.CancellationToken cancellationToken)
Toluna.Common.dll!Toluna.Common.Caching.SossCache<object>.Read(string key, Toluna.Common.Caching.IReadOptions readOptions)
Toluna.Common.dll!Toluna.Common.Caching.SossModularizedCache.ReadObject<Toluna.Common.Entities.HelperObjects.ConfigurationSettingCacheData>(string key, Toluna.Caching.Modularized.ReadOptions readOptions)

This happens even though we have .NET's MemoryCache in front of most of these calls (with a 15-minute expiration), so it seems these calls' performance may have degraded when we moved to .NET Core.

Is it possible to apply the optimization you made for .NET Framework to .NET Core as well?

Thanks !
 

markw

Administrator
Staff member
#10
I applied the locking change to our .NET Core build, but it did not improve performance in our tests--we saw that the performance on .NET Core 5/6 was already very good, even slightly better than the optimized .NET Framework behavior under the same load (500 threads vying for resources from a small pool of 32 connections).

Assuming your app is using the library the same way, you should see identical performance between .NET Core and .NET Framework, even with release 3.1.4 of the library. So we'll need to dig deeper into your app's behavior. A few questions to start with:
  1. In your post above (#8) you mentioned:
    There are some locations that are more difficult to move to async, and it seems it's failing on them now
    Can you characterize this failure? Is it an exception, or is it just that you continue to see a lot of threads blocked while waiting for a connection to free up?
  2. How does the request rate to the ScaleOut service compare between your old .NET Framework code and your new .NET Core app? If you look at the store-wide CRUD operations in the ScaleOut Management Console, is there a significant difference when your app is under the same load?
  3. Is the connection pool the same size in your new app as it was in your .NET Framework version? (This is configured through the maxPoolSize value in your connection string.)
Thanks,
Mark
 
#11
Hi Mark,
Maybe it's indeed related to our app's access patterns - a mixture of both sync and async calls.
The failure manifests as a lot of threads blocked while waiting for a connection to free up:
Not Flagged 25944 2042 Worker Thread Worker Thread System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait
[External Code]
[Waiting on Async Operation, double-click or press enter to view Async Call Stacks]
[External Code]
Scaleout.Client.dll!Scaleout.Client.Net.Gateway.GetConnection()
Scaleout.Client.dll!Scaleout.Client.Net.Store.DoRequest(Scaleout.Client.Net.ObjectKey key, System.Net.IPAddress preferredHost, System.Func<System.IO.Stream, Scaleout.Client.Net.Messaging.Response> requestFunc, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Net.Store.Read(Scaleout.Client.Net.ObjectKey key, Scaleout.Client.LockToken lockToken, Scaleout.Client.VersionToken version, Scaleout.Client.Net.ReadFlags flags, byte[] keyList, Scaleout.Client.Net.ObjectKey? mmfKey, System.Threading.CancellationToken cancellationToken)
Scaleout.Client.dll!Scaleout.Client.Cache<string, object>.ReadInternal(Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, Scaleout.Client.Net.ReadFlags flags, Scaleout.Client.LockToken lockToken, Scaleout.Client.Trace.AccessOperation traceOp, string key, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Cache<System.__Canon, System.__Canon>.Read(System.__Canon key, Scaleout.Client.ReadOptions options, System.Threading.CancellationToken cancellationToken)


Regarding the pool size - no change between the .NET Core and .NET Framework versions.

In terms of requests to the service - the access to our servers seems to be more or less the same under load, and that should correlate with the access to the cache servers, since the code otherwise didn't change.

Could you publish the version with the optimization so we can give it a shot? Maybe the usage pattern in our app somehow exposes some intricacies that the optimization solves?

Thanks!
 

markw

Administrator
Staff member
#12
OK, I've posted Scaleout.Client version 4.2.10 on ScaleOut's development NuGet feed. You'll need to add http://nuget.scaleoutsoftware.com/nuget as a package source and then make sure this source is selected in the NuGet package manager. (See Microsoft’s Package Sources topic for details on adding additional package sources to the Package Manager UI in Visual Studio.)

Also, there's a new connection string option in this release: forceManagedLocking. It defaults to false, but if set to true in your connection string then this build will revert to using the original, managed locking approach instead of the new, native Windows locks. This may be useful if you want to do quick performance comparisons without having to mess around with rebuilding against different NuGet versions.
C#:
string myConnString = "hosts=10.0.9.12:721;maxPoolSize=64;forceManagedLocking=true";
 
#13
Thanks a lot, Mark. We'll try that out. Just to be sure - the optimized version (corresponding to the one that's working for us on .NET Framework) is the one with forceManagedLocking = true, right?
 

markw

Administrator
Staff member
#14
No, it's the opposite: you want forceManagedLocking to be set to false to get the new, "optimized" locking behavior.
 
#15
Hi Mark, we tried deploying it one more time. This time we changed other frequent flows to use async reads from the cache, so the majority of our code now uses async calls to SOSS. It still behaves the same - one of our apps shows spikes of threads (from ~100 to ~1000), and each time it happens on a different server.
We took dumps, and most of the threads are stuck here:

Not Flagged 20340 4 Worker Thread Worker Thread Scaleout.Client.dll!Scaleout.Client.InProc.LruEvictionCache<object>.Get
[External Code]
Scaleout.Client.dll!Scaleout.Client.InProc.LruEvictionCache<object>.Get(ref Scaleout.Client.Net.ObjectKey key)
Scaleout.Client.dll!Scaleout.Client.Cache<string, object>.CheckInProcCaches(ref Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, bool lockObj)
Scaleout.Client.dll!Scaleout.Client.Cache<string, object>.ReadInternalAsync(Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, Scaleout.Client.Net.ReadFlags flags, Scaleout.Client.LockToken lockToken, Scaleout.Client.Trace.AccessOperation traceOp, string key, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Cache<string, object>.ReadAsync(string key, Scaleout.Client.ReadOptions options, System.Threading.CancellationToken cancellationToken)


It's a different stack trace than last time, but it causes the same issue - a lot of threads and high latency for our HTTP requests.
Moving to async didn't help, and the other NuGet fix also didn't help. What else could it be?

Thank you for the help!
 

markw

Administrator
Staff member
#16
Hi, thanks for the update.

It would appear that the bottleneck moved out of the connection pool and up into the client cache. This might actually be good progress!

Try increasing the number of client cache partitions to reduce the contention there. You can do that through the partitionCount parameter of the CacheBuilder.SetClientCache method, or, if you're using a config file to set up your caches, you can use the clientCachePartitions config parameter. I'd start with a value of 10 partitions and check to see if the bottleneck is relieved (or if it has moved elsewhere).
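
To see why more partitions help, here's a conceptual sketch of key-hash partitioning -- not the actual LruEvictionCache implementation, just the idea that each partition has its own lock, so threads touching different keys rarely block each other:
C#:
using System.Collections.Generic;

// Conceptual sketch of key-hash partitioning -- NOT the library's actual
// client cache. Each partition has its own lock, so readers of different
// keys rarely contend with one another.
public class PartitionedCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] _partitions;
    private readonly object[] _locks;

    public PartitionedCache(int partitionCount)
    {
        _partitions = new Dictionary<TKey, TValue>[partitionCount];
        _locks = new object[partitionCount];
        for (int i = 0; i < partitionCount; i++)
        {
            _partitions[i] = new Dictionary<TKey, TValue>();
            _locks[i] = new object();
        }
    }

    private int PartitionFor(TKey key) =>
        (key.GetHashCode() & 0x7FFFFFFF) % _partitions.Length;

    public bool TryGet(TKey key, out TValue value)
    {
        int p = PartitionFor(key);
        lock (_locks[p])                    // contention is confined to one partition
            return _partitions[p].TryGetValue(key, out value);
    }

    public void Set(TKey key, TValue value)
    {
        int p = PartitionFor(key);
        lock (_locks[p])
            _partitions[p][key] = value;
    }
}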
 
#17
Hi Mark,

As you suggested, we tried changing the partitionCount param. It used to be 10, so we changed it to 64. We had the same issue and the same bottleneck, so we changed it to 128. It seemed to behave better, but yesterday we had the same issue again - one of our apps shows spikes of threads (from ~100 to ~1000), and each time it happens on a different server. It also causes very high latency for the HTTP requests.
We took a dump again and can see that the bottleneck is the same:

Scaleout.Client.dll!Scaleout.Client.InProc.LruEvictionCache<Toluna.UPS.RouterCachedObjects.ProtoByteArray>.Get(ref Scaleout.Client.Net.ObjectKey key)
Scaleout.Client.dll!Scaleout.Client.Cache<string, Toluna.UPS.RouterCachedObjects.ProtoByteArray>.CheckInProcCaches(ref Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, bool lockObj)
Scaleout.Client.dll!Scaleout.Client.Cache<string, Toluna.UPS.RouterCachedObjects.ProtoByteArray>.ReadInternalAsync(Scaleout.Client.Net.ObjectKey objectKey, Scaleout.Client.ReadOptions options, Scaleout.Client.Net.ReadFlags flags, Scaleout.Client.LockToken lockToken, Scaleout.Client.Trace.AccessOperation traceOp, string key, System.Threading.CancellationToken ct)
Scaleout.Client.dll!Scaleout.Client.Cache<string, Toluna.UPS.RouterCachedObjects.ProtoByteArray>.ReadAsync(string key, Scaleout.Client.ReadOptions options, System.Threading.CancellationToken cancellationToken)

Most of the threads are stuck here.
What else can we try? Do you think that changing the partitionCount to 256 could help? Is there anything else that could help us?
We are not worried about the accuracy of LRU eviction because we don't reach the max number of objects.

Thanks!
 

markw

Administrator
Staff member
#18
That's a little surprising. Do you have a "hot object" that's accessed much more frequently than any other? (If so, whatever client cache partition it falls into will get hammered and eventually become a bottleneck... so scaling the number of client cache partitions may not help in that case.) Let us know and we'll come up with some ideas.

Also, looking back at earlier conversations, I see your company was using (or maybe just considering?) a custom client cache that's built on top of .NET's System.Runtime.Caching.MemoryCache (as described here). The MemoryCache has different performance characteristics that may help this scenario--is it something you've already looked into?
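
For reference, the basic shape of fronting hot reads with a short-lived MemoryCache entry looks roughly like this. The linked article describes a full plug-in client cache; this sketch only shows the general idea, and the wrapper class, ReadOptions instance, response Value property, and 15-minute window are placeholders:
C#:
using System;
using System.Runtime.Caching;   // available as a NuGet package on .NET 5
using System.Threading;
using System.Threading.Tasks;
using Scaleout.Client;

// Sketch only: the wrapper class, ReadOptions instance, response.Value, and
// the 15-minute window are placeholders -- not the plug-in client cache the
// linked article describes.
public class HotObjectReader
{
    private static readonly MemoryCache Local = MemoryCache.Default;
    private readonly Cache<string, object> _cache;
    private readonly ReadOptions _readOptions;

    public HotObjectReader(Cache<string, object> cache, ReadOptions readOptions)
    {
        _cache = cache;
        _readOptions = readOptions;
    }

    public async Task<object> GetAsync(string key, CancellationToken ct)
    {
        // Serve the hot object from the local MemoryCache when possible...
        var local = Local.Get(key);
        if (local != null)
            return local;

        // ...and only go out to the ScaleOut service on a local miss.
        var response = await _cache.ReadAsync(key, _readOptions, ct);
        if (response.Value != null)
            Local.Set(key, response.Value, DateTimeOffset.UtcNow.AddMinutes(15));

        return response.Value;
    }
}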
 