Since upgrading to ES 8.18.3 and SG 3.1.0, all my requests are quite slow, whether they go to Kibana or to the ES REST API.
However, only requests from users stored in LDAP are slow; users in the internal database are still fast. I ran curl in a loop, and indeed the response time ranges from 1 minute down to less than 1 second for the same user within the loop. So I’m quite sure that it’s an LDAP cache problem. I wonder whether there is a metric somewhere that shows cache hits and would help me adjust the cache size. We have a huge number of AD groups, so the cache is very important.
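For anyone who wants to repeat this kind of measurement without a shell loop, here is a minimal sketch in Python. It is not part of Search Guard; the helper names (time_calls, summarize, basic_auth_request) and the 1-second "slow" threshold are assumptions chosen for illustration, and treating a slow call as a probable cache miss is only a heuristic.

```python
import base64
import statistics
import time
import urllib.request
from typing import Callable, List


def time_calls(request: Callable[[], object], repeats: int) -> List[float]:
    """Call `request` `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        request()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float], slow_threshold: float = 1.0) -> dict:
    """Summarize durations. Calls slower than `slow_threshold` seconds are
    counted as 'slow' (assumed, not proven, to be cache misses)."""
    slow = [d for d in durations if d > slow_threshold]
    return {
        "count": len(durations),
        "min": min(durations),
        "max": max(durations),
        "median": statistics.median(durations),
        "slow_calls": len(slow),
    }


def basic_auth_request(url: str, user: str, password: str) -> Callable[[], int]:
    """Build a callable that performs one authenticated GET (hypothetical helper).
    `url` is whatever authenticated endpoint you were hitting with curl."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

    def call() -> int:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return resp.status

    return call
```

Something like summarize(time_calls(basic_auth_request(url, user, password), 50)) would then show how many of the 50 calls crossed the threshold. With a self-signed certificate, urllib will refuse the connection unless you pass a suitable SSL context.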
Another question: the documentation for ldap.group_search.cache.max_size says it holds 1000 entries. What exactly is an entry? A group, a user? Given the number of users and groups, is there a way to calculate the optimal size?
Regarding ldap.group_search.cache.max_size: this option applies to cached groups. Each cached group carries a username attribute that connects the user and the group.
Cache entries are kept for 2 minutes by default. You can change this value with the ldap.group_search.cache.expire_after_write option.
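To make the "expire after write" behaviour and the idea of a hit-rate metric more concrete, here is a small illustrative model in Python. It is only a sketch and not Search Guard's actual implementation: the class name, the oldest-first eviction rule, and the hit/miss counters are assumptions. Only the default values (1000 entries, 2 minutes) come from the thread.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class ExpireAfterWriteCache:
    """Toy cache: an entry is valid for `ttl_seconds` after it was written,
    no matter how often it is read in between."""

    def __init__(self, max_size: int = 1000, ttl_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[object, float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable, loader: Callable[[Hashable], object]) -> object:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            self.hits += 1
            return entry[0]
        # Miss or expired: this is where the slow LDAP lookup would happen.
        self.misses += 1
        value = loader(key)
        self._entries.pop(key, None)
        if len(self._entries) >= self.max_size:
            # Assumed eviction rule: drop the entry written longest ago.
            oldest = next(iter(self._entries))
            del self._entries[oldest]
        self._entries[key] = (value, now)
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this model, the first request after an entry expires always pays the full lookup cost, which matches the pattern of one slow request followed by fast ones. A low hit rate with many evictions would point to max_size being too small; a high hit rate with periodic slow requests would point to the expiry time instead.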
I moved from ES 8.17.6 to 8.18.3, so SearchGuard went from roughly 3.0.3 to 3.1.0.
I updated the caching configuration. Although that improved the situation, the cache still expires every 30 minutes, and when it does I get a long request that takes more than one minute.
It reminds me of ticket 629, which I opened more than a year ago.
When doing recursion, I was hoping that it might also cache group resolution, not only the individual mapping from users to groups. If it only caches that mapping, increasing the cache size will not help, as I have only a few users.
After more investigation, it turned out to be a referral-following problem and a broken domain controller in our AD.
As a workaround, I reduced the connection timeout to a short value (2s), and that reduced the resolution time to 15s. But if I manually disable referral following, the resolution completes in about 2s, which I can live with.
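For others chasing a similar problem, a quick way to spot an unreachable domain controller is to try a plain TCP connection to each one with a short timeout. The sketch below uses only the Python standard library; the function names, the default port 389, and the 2s timeout are assumptions for illustration. It checks reachability only and does not perform an LDAP bind or follow referrals.

```python
import socket
import time
from typing import Iterable, List, Tuple


def probe(host: str, port: int = 389, timeout: float = 2.0) -> Tuple[bool, float]:
    """Try to open a TCP connection; return (reachable, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        ok = False
    return ok, time.perf_counter() - start


def probe_all(targets: Iterable[Tuple[str, int]],
              timeout: float = 2.0) -> List[Tuple[str, int, bool, float]]:
    """Probe each (host, port) pair and return (host, port, reachable, elapsed)."""
    results = []
    for host, port in targets:
        ok, elapsed = probe(host, port, timeout)
        results.append((host, port, ok, elapsed))
    return results
```

Running probe_all over the list of domain controllers that referrals can point to should show a dead one as unreachable, typically with an elapsed time close to the timeout.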