Issue
I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB. (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know.) Now I am seeing a lot of rejections from the search thread_pool.
[2020-10-26T12:50:32,730][DEBUG][o.e.a.s.TransportSearchAction] [IDC.node1.com] [indice_1][3], node[M4pYcHhzSTIj5NXA], [P], s[STARTED], a[id=IAQaCGTCTA-CkKGPw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[indice_1], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[_doc], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, lastShard [true]
org.elasticsearch.transport.RemoteTransportException: [IDC.node1.com][node1:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@21060fcc on QueueResizingEsThreadPoolExecutor[name = IDC.node1.com/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.2s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@3c46a309[Running, pool size = 49, active threads = 49, queued tasks = 1063, completed tasks = 1693218669]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-7.4.2.jar:7.4.2]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:84) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$8(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.shard.IndexShard.awaitShardSearchActive(IndexShard.java:3247) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.lambda$rewriteShardRequest$9(SearchService.java:1043) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:114) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:87) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.rewriteShardRequest(SearchService.java:1048) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$6(SearchTransportService.java:335) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:185) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:118) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
I can also see "Pause Young (Allocation Failure)" happening constantly in the GC log.
[2020-10-26T07:31:06.514+0000][61585][gc,start ] GC(7139651) Pause Young (Allocation Failure)
[2020-10-26T07:31:06.514+0000][61585][gc,task ] GC(7139651) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 1: 40802096 bytes, 40802096 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 2: 16714352 bytes, 57516448 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 3: 8509976 bytes, 66026424 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 4: 6906208 bytes, 72932632 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 5: 4186344 bytes, 77118976 total
[2020-10-26T07:31:06.574+0000][61585][gc,age ] GC(7139651) - age 6: 1960288 bytes, 79079264 total
[2020-10-26T07:31:06.574+0000][61585][gc,heap ] GC(7139651) ParNew: 1631431K->102498K(1763584K)
[2020-10-26T07:31:06.574+0000][61585][gc,heap ] GC(7139651) CMS: 35964431K->35966764K(50469312K)
[2020-10-26T07:31:06.574+0000][61585][gc,metaspace ] GC(7139651) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:06.574+0000][61585][gc ] GC(7139651) Pause Young (Allocation Failure) 36714M->35223M(51008M) 59.928ms
[2020-10-26T07:31:06.574+0000][61585][gc,cpu ] GC(7139651) User=1.02s Sys=0.00s Real=0.06s
[2020-10-26T07:31:06.574+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 675463951 ns, Reaching safepoint: 1382982 ns, At safepoint: 60208788 ns, Total: 61591770 ns
[2020-10-26T07:31:07.575+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000129984 ns, Reaching safepoint: 901989 ns, At safepoint: 39274 ns, Total: 941263 ns
[2020-10-26T07:31:08.529+0000][61585][gc,start ] GC(7139652) Pause Young (Allocation Failure)
[2020-10-26T07:31:08.529+0000][61585][gc,task ] GC(7139652) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:08.584+0000][61585][gc,age ] GC(7139652) - age 1: 17722072 bytes, 17722072 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 2: 15204416 bytes, 32926488 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 3: 16712976 bytes, 49639464 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 4: 8509664 bytes, 58149128 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 5: 6501424 bytes, 64650552 total
[2020-10-26T07:31:08.585+0000][61585][gc,age ] GC(7139652) - age 6: 3047280 bytes, 67697832 total
[2020-10-26T07:31:08.585+0000][61585][gc,heap ] GC(7139652) ParNew: 1670178K->89020K(1763584K)
[2020-10-26T07:31:08.585+0000][61585][gc,heap ] GC(7139652) CMS: 35966764K->35968680K(50469312K)
[2020-10-26T07:31:08.585+0000][61585][gc,metaspace ] GC(7139652) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:08.585+0000][61585][gc ] GC(7139652) Pause Young (Allocation Failure) 36754M->35212M(51008M) 56.177ms
[2020-10-26T07:31:08.585+0000][61585][gc,cpu ] GC(7139652) User=0.87s Sys=0.00s Real=0.05s
[2020-10-26T07:31:08.585+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 948848143 ns, Reaching safepoint: 4487353 ns, At safepoint: 56505105 ns, Total: 60992458 ns
[2020-10-26T07:31:09.586+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000188782 ns, Reaching safepoint: 807865 ns, At safepoint: 31141 ns, Total: 839006 ns
[2020-10-26T07:31:10.587+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000218955 ns, Reaching safepoint: 947598 ns, At safepoint: 40468 ns, Total: 988066 ns
[2020-10-26T07:31:11.588+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000252002 ns, Reaching safepoint: 688820 ns, At safepoint: 24744 ns, Total: 713564 ns
[2020-10-26T07:31:12.590+0000][61585][safepoint ] Safepoint "Cleanup", Time since last: 1000168389 ns, Reaching safepoint: 978766 ns, At safepoint: 30805 ns, Total: 1009571 ns
[2020-10-26T07:31:13.497+0000][61585][gc,start ] GC(7139653) Pause Young (Allocation Failure)
[2020-10-26T07:31:13.498+0000][61585][gc,task ] GC(7139653) Using 23 workers of 23 for evacuation
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) Desired survivor size 100302848 bytes, new threshold 6 (max threshold 6)
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) Age table with threshold 6 (max threshold 6)
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 1: 11416792 bytes, 11416792 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 2: 4119704 bytes, 15536496 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 3: 15256408 bytes, 30792904 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 4: 17073416 bytes, 47866320 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 5: 8509616 bytes, 56375936 total
[2020-10-26T07:31:13.560+0000][61585][gc,age ] GC(7139653) - age 6: 5802544 bytes, 62178480 total
[2020-10-26T07:31:13.560+0000][61585][gc,heap ] GC(7139653) ParNew: 1656700K->91797K(1763584K)
[2020-10-26T07:31:13.560+0000][61585][gc,heap ] GC(7139653) CMS: 35968680K->35971272K(50469312K)
[2020-10-26T07:31:13.560+0000][61585][gc,metaspace ] GC(7139653) Metaspace: 112067K->112067K(118784K)
[2020-10-26T07:31:13.561+0000][61585][gc ] GC(7139653) Pause Young (Allocation Failure) 36743M->35217M(51008M) 63.001ms
[2020-10-26T07:31:13.561+0000][61585][gc,cpu ] GC(7139653) User=0.99s Sys=0.00s Real=0.07s
[2020-10-26T07:31:13.561+0000][61585][safepoint ] Safepoint "GenCollectForAllocation", Time since last: 906329704 ns, Reaching safepoint: 1247593 ns, At safepoint: 63668517 ns, Total: 64916110 ns
Should this happen this frequently? Below are the heap pool settings:
"pools" : {
"young" : {
"used_in_bytes" : 731499608,
"max_in_bytes" : 1605304320,
"peak_used_in_bytes" : 1605304320,
"peak_max_in_bytes" : 1605304320
},
"survivor" : {
"used_in_bytes" : 187117000,
"max_in_bytes" : 200605696,
"peak_used_in_bytes" : 200605696,
"peak_max_in_bytes" : 200605696
},
"old" : {
"used_in_bytes" : 16461553480,
"max_in_bytes" : 51680575488,
"peak_used_in_bytes" : 50144794688,
"peak_max_in_bytes" : 51680575488
}
Are the rejections because of the heap size, or because the thread_pool is too small? And why is "Pause Young" happening this rapidly?
Update: this was later asked as a separate question: Search thread_pool for particular nodes always at maximum
Solution
The search thread_pool has an associated queue from which the search worker threads pick up tasks. Its default capacity is 1000 (note: this is per node), and that capacity is being exhausted in your case; please refer to the thread_pools documentation in ES for more info.
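A quick way to confirm which nodes are rejecting is the _cat thread_pool API (available in 7.x); the column selection here is just one convenient choice, not the only one:

# per-node view of the search pool; rejected is cumulative since node start
GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed

Since the rejected counter is cumulative, watch how fast it grows rather than its absolute value.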
Now, the queue capacity can be exhausted for one of the following reasons:
- A high search request rate on your ES node, which the node might not be able to keep up with.
- Slow queries that take much longer to process, causing other requests to wait in the search queue and ultimately be rejected.
- Nodes and shards that are not balanced properly across your cluster, which can cause a few nodes to receive a disproportionately high number of search queries (you can check this as shown below).
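To check the balance, assuming the _cat APIs are reachable on your cluster, you can compare shard counts per node and the placement of the index that appears in your error log (indice_1):

# shards and disk usage per node
GET _cat/allocation?v
# where the shards of the hot index actually live
GET _cat/shards/indice_1?v

If a handful of nodes hold most of the shards of your hot indices, those nodes will also absorb most of the query fan-out.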
You should look at the search slow logs to identify the costly queries that ran during the issue.
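If slow logging is not enabled yet, it can be switched on per index with dynamic settings; the thresholds below are only illustrative values (tune them to your latency profile), and indice_1 is taken from your error log:

# illustrative thresholds, not recommendations
PUT /indice_1/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

Queries crossing a threshold are then written to the index search slow log on the node that executed the shard request.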
And if it is a legitimate increase in search traffic, you should scale your cluster to handle more search requests.
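Besides adding nodes, one common lever for read-heavy workloads is raising the replica count so searches are spread over more shard copies; this assumes the cluster has spare disk and heap for the extra copies, and the value here is only an example:

# example only: one extra copy per primary shard
PUT /indice_1/_settings
{
  "index.number_of_replicas": 2
}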
Regarding the young-generation GC logs: if the pauses are not growing and stay in the 50-70 ms range, they are not that big a concern. Focus on fixing the search rejection issue first, which may also reduce or eliminate the heap pressure. In any case, you should keep the max heap size below 32 GB (so the JVM can use compressed object pointers) for better performance.
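The heap is configured in jvm.options (or via the ES_JAVA_OPTS environment variable); a minimal sketch of bringing it under the compressed-oops cutoff, with 30g as an assumed value that you would need to validate against your workload:

# jvm.options - 30g is an example value, not a recommendation
-Xms30g
-Xmx30g

Keep -Xms and -Xmx equal so the heap is never resized at runtime, and roll the change through the cluster one node at a time.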
Answered By - Amit (ES enthusiast)
Answer Checked By - Clifford M. (JavaFixing Volunteer)