Very slow SearchGuard

Since upgrading to ES 8.18.3 and SG 3.1.0, all my accesses are quite slow, whether going through Kibana or the ES REST API.

But only access by users stored in LDAP is slow; users in the internal database are still fast. I ran curl in a loop and, indeed, the access delay ranges from 1m to less than 1s for the same user within the loop. So I'm quite sure that it's an LDAP cache problem. I wonder if there is a metric somewhere that shows cache hits and would help me adapt the cache size. We have a huge number of AD groups, so the cache is very important.
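For anyone who wants to reproduce this kind of measurement, here is a minimal Python sketch of a timing loop. It is only an illustration: the URL, user name and password in the commented example are placeholders, and the helper names (timed_call, measure, make_es_request) are made up for this sketch.

```python
import base64
import statistics
import time
import urllib.request


def timed_call(do_request):
    """Return the wall-clock duration of one call to do_request(), in seconds."""
    start = time.perf_counter()
    do_request()
    return time.perf_counter() - start


def measure(do_request, attempts=20, pause=0.0):
    """Call do_request repeatedly and summarize the observed latencies."""
    samples = []
    for _ in range(attempts):
        samples.append(timed_call(do_request))
        time.sleep(pause)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "samples": samples,
    }


def make_es_request(url, user, password, timeout=120):
    """Build a callable doing one GET with HTTP Basic auth (placeholder values)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

    def do_request():
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()

    return do_request


# Example use (hypothetical endpoint and LDAP-backed account):
# call = make_es_request("https://es.example.com:9200/", "ldap_user", "secret")
# print(measure(call, attempts=30, pause=1.0))
```

A large gap between the minimum and the maximum for the same user is what points toward a cache miss rather than a uniformly slow backend.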

Another question: the documentation for ldap.group_search.cache.max_size says it holds 1000 entries. What exactly is an entry? A group, a user? Given the number of users and groups, is there a way to calculate the optimum size?

@fbacchella What were your original versions of the ELK stack and SG plugin?

Regarding ldap.group_search.cache.max_size, this option applies to cached groups. The cached groups have a username attribute that connects the user and the group.
Cache entries are kept for 2 minutes by default. You can change this value with the ldap.group_search.cache.expire_after_write option.

Could you share your current ldap configuration?

I moved from ES 8.17.6 to 8.18.3, so SearchGuard went from about 3.0.3 to 3.1.0.

I updated the caching configuration. Although it improved the situation, the cache still expires every 30m, and when it does I get hit with a long request of more than one minute.
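To illustrate why a longer expire_after_write only spaces out the slow requests instead of removing them, here is a toy expire-after-write cache in Python. It is a sketch of the general technique, not Search Guard's actual implementation; the class, the simulate helper and the group DN are all hypothetical.

```python
import time


class ExpireAfterWriteCache:
    """Toy cache: an entry is dropped a fixed time after it was written."""

    def __init__(self, ttl_seconds, max_size, clock=time.monotonic):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.ttl = ttl_seconds
        self.max_size = max_size
        self.clock = clock
        self.entries = {}  # key -> (value, written_at)
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[1] < self.ttl:
            self.hits += 1
            return cached[0]
        self.misses += 1
        value = loader(key)  # slow path: stands in for the LDAP group search
        if key not in self.entries and len(self.entries) >= self.max_size:
            # Evict the entry written longest ago to respect max_size.
            oldest = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[oldest]
        self.entries[key] = (value, now)
        return value


def simulate(minutes, ttl_seconds):
    """One request per minute for the same key; return (hits, misses)."""
    now = [0.0]
    cache = ExpireAfterWriteCache(ttl_seconds, max_size=10, clock=lambda: now[0])
    for minute in range(minutes):
        now[0] = minute * 60.0
        cache.get("cn=some-group", lambda key: ["some-role"])
    return cache.hits, cache.misses
```

With one request per minute over 90 minutes and a 30m lifetime, simulate(90, 1800) reports 87 hits and 3 misses. Each of those misses corresponds to one slow request in a real setup.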

It reminds me of ticket 629, which I opened more than one year ago.

The ldap configuration is:

- type: "basic/ldap"
  ldap:
    idp:
      tls:
        trust_all: false
        enabled_protocols:
        - "TLSv1.2"
        - "TLSv1.3"
      hosts:
      - "ldaps://host1:636"
      - "ldaps://host1:636"
      - "ldaps://host1:636"
      bind_dn: "XXX"
      password: "XXXX"
    user_search:
      filter:
        by_attribute: "sAMAccountName"
      base_dn: "XXX"
      retrieve_attributes:
      - "memberOf"
      - "sAMAccountName"
      - "dn"
    group_search:
      cache:
        max_size: 10000
        expire_after_write: "30m"
      base_dn: "XXX"
      recursive:
        enabled: true
      retrieve_attributes:
      - "memberOf"
      role_name_attribute: "dn"
  user_mapping:
    user_name:
      from_backend: "$.ldap_user_entry.sAMAccountName"
    roles:
      from:
      - "$.ldap_user_entry[\"memberOf\"]"

When doing recursion, I was hoping that it might also cache the group resolution, not only the individual mapping from users to groups. If it does not, increasing the cache size will not help, as I have only a few users.
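As a sketch of what caching the group resolution itself could look like, the Python function below walks nested memberships breadth-first and keeps a per-group cache shared between users. It is a hypothetical illustration under the assumption of one directory lookup per group, not a description of how Search Guard resolves groups. The parents_of callable and the DNs are invented for the example.

```python
from collections import deque


def resolve_groups(direct_groups, parents_of, cache=None):
    """Return every group reachable from direct_groups via nested membership.

    parents_of(group_dn) stands in for one LDAP lookup of a group's memberOf
    values. cache maps a group DN to its parent DNs and can be shared
    between calls, so groups common to several users are looked up once.
    """
    if cache is None:
        cache = {}
    seen = set(direct_groups)
    queue = deque(direct_groups)
    while queue:
        group = queue.popleft()
        if group not in cache:
            cache[group] = list(parents_of(group))  # one directory round trip
        for parent in cache[group]:
            if parent not in seen:  # also guards against membership cycles
                seen.add(parent)
                queue.append(parent)
    return seen


# Example with a three-level nesting (hypothetical DNs):
directory = {
    "cn=team": ["cn=dept"],
    "cn=dept": ["cn=org"],
    "cn=org": [],
    "cn=other-team": ["cn=dept"],
}
lookups = []


def lookup(dn):
    lookups.append(dn)
    return directory.get(dn, [])


shared_cache = {}
first = resolve_groups(["cn=team"], lookup, shared_cache)
second = resolve_groups(["cn=other-team"], lookup, shared_cache)
```

In this model the number of slow lookups depends on the number of distinct groups, not on the number of users: the second call only needs one new lookup because cn=dept and cn=org are already cached.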

@fbacchella, could you tell me what the typical and the highest levels of recursion are in your nested AD groups?

There are only 3 levels of recursion.

After more investigation, it turned out to be a referral-following problem combined with a broken domain controller in our AD.

As a workaround, I reduced the connection timeout to a short value (2s), which brought the resolution time down to 15s. But if I manually disable referral following, the resolution is done in about 2s, which I can handle.
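A quick way to spot an unreachable domain controller from the Elasticsearch host is a plain TCP probe with the same short timeout. The Python sketch below uses only the standard library; the host names in the commented example are placeholders. It only checks that the port accepts connections, not that LDAP itself or referral handling behaves correctly.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Try one TCP connection; return (reachable, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:  # covers refused connections, timeouts and DNS failures
        reachable = False
    return reachable, time.perf_counter() - start


def probe_all(endpoints, timeout=2.0):
    """Probe each (host, port) pair and index the results by 'host:port'."""
    return {f"{host}:{port}": probe(host, port, timeout) for host, port in endpoints}


# Example use (hypothetical domain controllers):
# for name, (ok, elapsed) in probe_all([("dc1.example.com", 636), ("dc2.example.com", 636)]).items():
#     print(f"{name}: {'ok' if ok else 'UNREACHABLE'} after {elapsed:.2f}s")
```

A controller that consistently takes the full timeout before failing is a likely candidate for the kind of delay described here.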

I have requested an enhancement in ticket #778.
