Skip to content

RATIS-2303. Migrate ratis-examples tests to Junit 5.#1267

Merged
szetszwo merged 1 commit intoapache:masterfrom
slfan1989:RATIS-2303
May 22, 2025
Merged

RATIS-2303. Migrate ratis-examples tests to Junit 5.#1267
szetszwo merged 1 commit intoapache:masterfrom
slfan1989:RATIS-2303

Conversation

@slfan1989
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Migrate ratis-examples tests to Junit 5.

What is the link to the Apache JIRA

JIRA: RATIS-2303. Migrate ratis-examples tests to Junit 5.

How was this patch tested?

mvn clean test.

@slfan1989
Copy link
Copy Markdown
Contributor Author

slfan1989 commented May 21, 2025

@szetszwo @adoroszlai While upgrading the ratis-examples module, I discovered an issue where the workerGroup might not be properly shut down, causing unit tests to fail with the following exception.

2025-05-21 05:09:14,262 WARN  util.LeakDetector (LeakDetector.java:assertNoLeaks(175)) - 29/30) numLeaks == 1 > 0, will wait and retry ...
2025-05-21 05:09:15,265 WARN  util.ReferenceCountedLeakDetector (ReferenceCountedLeakDetector.java:logLeakMessage(168)) - LEAK: (class org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoopGroup, count=3, value=org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoopGroup@756cf158)

java.lang.IllegalStateException: #leaks = 1 > 0, #leaks == set.size = 1

	at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
	at org.apache.ratis.util.LeakDetector$LeakTrackerSet.assertNoLeaks(LeakDetector.java:100)
	at org.apache.ratis.util.LeakDetector$LeakTrackerSet.getNumLeaks(LeakDetector.java:94)
	at org.apache.ratis.util.LeakDetector.assertNoLeaks(LeakDetector.java:178)
	at org.apache.ratis.server.impl.MiniRaftCluster.shutdown(MiniRaftCluster.java:892)
	at org.apache.ratis.grpc.MiniRaftClusterWithGrpc.shutdown(MiniRaftClusterWithGrpc.java:97)
	at org.apache.ratis.examples.filestore.FileStoreStreamingBaseTest.testFileStoreStreamSingleFile(FileStoreStreamingBaseTest.java:83)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)

I will walk through the entire process of analyzing the issue, in hopes that it can offer some troubleshooting guidance to others facing similar problems.

  • Step 1 I suspected the issue was due to the close method of NettyClientStreamRpc not properly releasing resources. To address this, I added logic to forcibly shut down the workerGroup. This approach worked as expected in early tests, but it later caused other unit tests to hang.

NettyClientStreamRpc.java#L234-L242

void close() {
final MemoizedSupplier<ChannelFuture> previous = ref.getAndSet(null);
if (previous != null && previous.isInitialized()) {
// wait channel closed, do shutdown workerGroup
previous.get().channel().close().addListener(future -> workerGroup.shutdownGracefully());
} else {
workerGroup.shutdownGracefully();
}
}

The modification logic is as follows

connection.close();
if (!connection.isTerminated()) {
  connection.workerGroup.shutdownGracefully();
}
  • Step 2 I began investigating the reference counting logic involved in the creation of NettyClientStreamRpc.

This logic increments the reference count at line 97 with new WorkerGroupGetter(current.join().retain()), and releases it at line 103 with return previous.join().release() ? null : previous;.

static WorkerGroupGetter newInstance(RaftProperties properties) {
final boolean shared = NettyConfigKeys.DataStream.Client.workerGroupShare(properties);
if (shared) {
final CompletableFuture<ReferenceCountedObject<EventLoopGroup>> created = new CompletableFuture<>();
final CompletableFuture<ReferenceCountedObject<EventLoopGroup>> current
= SHARED_WORKER_GROUP.updateAndGet(g -> g != null ? g : created);
if (current == created) {
created.complete(ReferenceCountedObject.wrap(newWorkerGroup(properties)));
}
return new WorkerGroupGetter(current.join().retain()) {
@Override
void shutdownGracefully() {
final CompletableFuture<ReferenceCountedObject<EventLoopGroup>> returned
= SHARED_WORKER_GROUP.updateAndGet(previous -> {
Preconditions.assertSame(current, previous, "SHARED_WORKER_GROUP");
return previous.join().release() ? null : previous;
});
if (returned == null) {
get().shutdownGracefully();
}
}
};
} else {
return new WorkerGroupGetter(newWorkerGroup(properties));
}
}

I added trace logs to the retain and release methods in ReferenceCountedLeakDetector to monitor the reference count of org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoopGroup.

From the logs, I found that 30 object references were created, but only 15 were released. This indicates that some logic is not properly releasing the NioEventLoopGroup objects.

There are two main call chains involved:

  • RaftClientImpl
RaftClientImpl  
  └── DataStreamClient  
      └── NettyDataStreamFactory  
          └── NettyClientStreamRpc  
              └── WorkerGroupGetter.newInstance  
  • NettyServerStreamRpc
NettyServerStreamRpc  
  └── DataStreamClient  
      └── NettyDataStreamFactory  
          └── NettyClientStreamRpc  
              └── WorkerGroupGetter.newInstance  

By examining the code, we can see that NettyServerStreamRpc#close() properly releases resources. This suggests that the issue lies with RaftClientImpl, which does not release its resources. Since RaftClientImpl is directly instantiated by the client in the test, we need to explicitly call close() to ensure proper resource cleanup.

  • testSingleFile

image

  • testMultipleFiles
    image

With the modification, the unit tests now run and pass successfully.

image

@slfan1989 slfan1989 marked this pull request as draft May 21, 2025 04:28
@Override
public CompletableFuture<DataStream> stream(RaftClientRequest request) {
final SingleDataStream s = new SingleDataStream(request);
LOG.info("XXX {} put {}, {}", this, ClientInvocationId.valueOf(request), s);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I optimized the logs by removing the XXX prefix. I understand it was likely added for debugging purposes, but such prefixes are not appropriate in a production environment.

@slfan1989 slfan1989 marked this pull request as ready for review May 21, 2025 10:44
@slfan1989
Copy link
Copy Markdown
Contributor Author

@szetszwo @adoroszlai RATIS-2303 has passed all checks. Although some unit tests are failing, the fixes are relatively straightforward. I'm considering including them directly in this PR. Could you please review this pr? Thank you very much!

Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

}
for (Future<FileStoreWriter> future : writerFutures) {
future.get();
future.get().close();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch! A new client is created for each write, we should close them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo Thank you very much for reviewing the code!

@szetszwo szetszwo merged commit 5528be0 into apache:master May 22, 2025
15 checks passed
szetszwo pushed a commit to szetszwo/ratis that referenced this pull request May 23, 2025
slfan1989 added a commit to slfan1989/ratis that referenced this pull request May 6, 2026
slfan1989 added a commit to slfan1989/ratis that referenced this pull request May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants