Severe performance problems caused by SearchGuard

First, is this the proper place to be requesting support if we have been buying the Enterprise license every year for a few years?

A summary of the problems:

  1. SearchGuard seems to increase the CPU usage of Elasticsearch by 2x to 3x.
  2. SearchGuard introduces an indexing bottleneck that caps our cluster at roughly 25% of the throughput it achieves without SearchGuard.
  3. The newest SearchGuard versions have a significant performance regression in searches, to the point that Kibana becomes unusable.

Current versions:
Elasticsearch: 7.16.2
SG: 52.6.0
SG Kibana: 52.1.0

Recently upgraded from:
Elasticsearch: 7.8.2
SG: 42.0.0

Cluster specs (used for log collection):
Indexing: 60,000 logs per second (with spikes up to 200,000 per second)
Searches: 5,000 per second
113 nodes
22,000 shards
11,000 indices
Logstash is the source of all indexing
Kibana is the source of all searches
CentOS 7
All nodes are bare-metal Linux on physical hardware

After we upgraded, we encountered such severe performance degradation that the cluster was essentially unusable. During troubleshooting we eventually disabled SearchGuard as a test, at which point all performance problems disappeared. Details of the performance problems are below:

  1. We have always suffered from an invisible indexing bottleneck where it was impossible to push beyond a certain limit regardless of the Logstash or Elasticsearch hardware we threw at it. The upgrade did not remove this bottleneck, but CPU usage on Elasticsearch dropped to about 1/3 of its previous level (probably due to the faster bulk actions fix introduced in 49.0.0). Previously we would hit the bottleneck at around 60,000 new logs written per second. As soon as we removed SearchGuard we were hitting 150,000 to 250,000 new logs written per second at about half the Elasticsearch CPU usage that 60,000 per second required with SearchGuard enabled.

  2. After we upgraded, the cluster became so slow that Kibana was unusable. Simple commands such as GET /_cat/health or GET /_cluster/settings were taking 60+ seconds to return, where prior to the upgrade they took <2 seconds. Any more complex Kibana actions that hit data indices went from <10 seconds to timing out after 180 seconds. In Kibana monitoring, the average Kibana latency was <10 seconds prior to the upgrade, 90 seconds after the upgrade, and <0.5 seconds after removing SearchGuard.

Before, during, and after the upgrade there were zero errors in the Elasticsearch, Kibana, or SearchGuard logs. Everything runs cleanly without errors, just with significant performance bottlenecks.


Hi Brian,

First, is this the proper place to be requesting support if we have been buying the Enterprise license every year for a few years?

There are several partners who sell Search Guard licenses and provide support; thus, I cannot give you a definite answer as to whether and where you can get personal support. For this, please get in touch with the partner you licensed Search Guard from.

Still, regarding your problem:

This does not sound good and is a bit surprising, as the SG releases after 42 have brought a number of performance improvements.

A good starting point for analyzing such issues is a “hot threads” dump. See the docs here on how to retrieve it:

If you don’t want to share it publicly, you can send it to me by private message.
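(Concretely, that is the output of GET /_nodes/hot_threads, ideally captured a few times while the cluster is under the problematic load.)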

Further useful information would be:

  • Number of entries in sg_roles.yml and sg_roles_mapping.yml.
  • Number of roles per user.
  • How are the users authenticated? LDAP? SAML? OIDC?

We are running with SearchGuard disabled, so I am not able to get you a hot threads dump at this moment.

When we disabled indexing, performance went back to normal; searches/Kibana usability was fine.

109 roles, with (roughly) an equal number of role mappings.

Most Kibana users are members of an assortment of anywhere from 60 to all 109 roles. Internal users such as the kibana server or logstash have only a single role and a single role mapping.

For normal Kibana users:
authc is OpenID; it provides the username and a single role.
authz is LDAP, which enumerates all the other roles.
Each index is mapped to one role, and the role mapping maps that role to a backend LDAP role.

For internal service accounts (kibana server, logstash, etc.):
authc is the internal user database
authz is noop
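
Roughly, the authc/authz chain in our sg_config.yml looks like the sketch below (the internal-user domain for the service accounts is omitted; all hostnames, DNs, search filters, and claim names here are placeholders rather than our real values):

sg_config:
  dynamic:
    authc:
      oidc_auth_domain:
        http_enabled: true
        order: 1
        http_authenticator:
          type: openid
          challenge: false
          config:
            subject_key: preferred_username   # placeholder claim name
            roles_key: roles                  # claim carrying the single OIDC role
            openid_connect_url: "https://idp.example.com/.well-known/openid-configuration"
        authentication_backend:
          type: noop
    authz:
      ldap_roles:
        http_enabled: true
        authorization_backend:
          type: ldap
          config:
            hosts:
            - "ldap.example.com:636"
            userbase: "ou=people,dc=example,dc=com"
            usersearch: "(uid={0})"
            rolebase: "ou=groups,dc=example,dc=com"   # the 109 LDAP groups live here
            rolesearch: "(uniqueMember={0})"
            rolename: cn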

An example of the 1-index-to-1-LDAP-backend-group mapping is below. There are roughly 109 index name patterns, so there are 109 of these role/role-mapping pairs. Users are put into the LDAP groups and may be members of any combination of the total 109.

# Role definition (sg_roles.yml)
ldap_index_network-firewall:
  reserved: false
  hidden: false
  cluster_permissions: []
  index_permissions:
  - index_patterns:
    - "network-firewall-*"
    fls: []
    masked_fields: []
    allowed_actions:
    - "SGS_READ"
  tenant_permissions: []
  exclude_cluster_permissions: []
  exclude_index_permissions: []
  static: false
# Role mapping (sg_roles_mapping.yml)
ldap_index_network-firewall:
  reserved: false
  hidden: false
  backend_roles:
  - "kibana_index_network-firewall"
  hosts: []
  users: []
  and_backend_roles: []

That all sounds quite reasonable and should not cause issues.

However, this also means that we need more information to find the cause of the issue. I fear a hot threads dump would have the highest chance of finding the culprit.

OK, understood. It will take us 1-2 weeks to get two full-cluster cold restarts scheduled; I’ll respond back when we have hot threads info.


Attached are several hot_threads dumps. The details are documented below.

hot_threads.zip (455.6 KB)

*Pre- and post-testing with SG disabled ran at 53k/s steady-state indexing, with a 310k/s indexing rate while purging a buffer.
**With SG enabled, the indexing rate would not exceed 17k/s.

1_hotthreads_nosg_indexing_pre
SearchGuard disabled
Searching load: yes
Indexing load: yes
Indexing rate = 53,000 logs per second

2_hotthreads_sg_indexing_1
SearchGuard enabled
Searching load: minimal
Indexing load: yes
Indexing rate = 17,000 logs per second (it would not go faster than this even though there was a queue to purge)

3_hotthreads_sg_indexing_2
SearchGuard enabled
Searching load: minimal
Indexing load: yes
Indexing rate = 17,000 logs per second (it would not go faster than this even though there was a queue to purge)

4_hotthreads_sg_noindexing
SearchGuard enabled
Searching load: no
Indexing load: no
Indexing rate = 0 logs per second

5_hotthreads_nosg_noindexing_post_1
SearchGuard disabled
Searching load: no
Indexing load: no
Indexing rate = 0 logs per second
Has some shard recovery ongoing

6_hotthreads_nosg_noindexing_post_2
SearchGuard disabled
Searching load: no
Indexing load: no
Indexing rate = 0 logs per second

7_hotthreads_nosg_indexing_post_1
SearchGuard disabled
Searching load: yes
Indexing load: yes
Indexing rate = 310,000 logs per second (purging a queue)

8_hotthreads_nosg_indexing_post_2
SearchGuard disabled
Searching load: yes
Indexing load: yes
Indexing rate = 52,000 logs per second

Hi Brian!

Thank you for the update - I am looking into it right now.

I have just two more questions:

  • Did the number of indices significantly increase while moving from ES 7.8.2 to ES 7.16.2?
  • Did you run ES on the bundled Java versions or on different Java versions?

Thanks,

Nils

Thank you. Also as a note: the above 8 hot_threads dumps were collected over a period of about 4 hours, on the same cluster, same versions, same number of indexes, and same load. There is a significant performance difference between having SearchGuard enabled and disabled.

No. 7.8.2 to 7.16.2 was a rolling in-place upgrade. Same number of indexes in both.

Bundled version.

[INFO ][o.e.n.Node] [server01p-es-01] JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]

Just a quick update on this:

We analyzed the hot threads files and came up with a concept that should solve the issue. Implementing this will take two or three more weeks, though.

Nils

Out of curiosity, is this in any part due to the number of roles/mappings we have? We’ve noticed that the same performance issues do not exist on another cluster. The only real difference is that the other cluster has 2 roles related to OIDC, with everything else being local users with a single role, whereas the cluster with the performance issues has 100+ roles that OIDC uses in addition to the same set of local users and local roles.

It depends both on the number of role mappings and indices. Reducing either should reduce the number of string matching operations necessary per request linearly.
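As a rough illustration: with about 109 role mappings, each contributing one index pattern, and about 11,000 concrete indices, a single request can end up in the range of 109 × 11,000 ≈ 1.2 million pattern comparisons, assuming each mapped role’s patterns have to be checked against each index name.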

First question.
This is our logstash user.

# Role definition (sg_roles.yml)
local_logstash:
  reserved: false
  hidden: false
  cluster_permissions:
  - "SGS_CLUSTER_MONITOR"
  - "SGS_CLUSTER_COMPOSITE_OPS"
  - "SGS_CLUSTER_MANAGE_ILM"
  - "SGS_CLUSTER_MANAGE_INDEX_TEMPLATES"
  - "SGS_CLUSTER_MANAGE_PIPELINES"
  index_permissions:
  - index_patterns:
    - "a-*"
    - "b-*"
    - "c-*"
    - "d-*"
    - "e-*"
    - "f-*"
    - "g-*"
    - "h-*"
    - "i-*"
    - "j-*"
    - "k-*"
    - "l-*"
    - "m-*"
    - "n-*"
    - "o-*"
    - "p-*"
    - "q-*"
    - "r-*"
    - "s-*"
    - "t-*"
    - "u-*"
    - "v-*"
    - "w-*"
    - "x-*"
    - "y-*"
    - "z-*"
    fls: []
    masked_fields: []
    allowed_actions:
    - "SGS_CREATE_INDEX"
    - "SGS_CRUD"
    - "SGS_INDICES_MANAGE_ILM"
    - "SGS_MANAGE"
  - index_patterns:
    - "ilm-*"
    fls: []
    masked_fields: []
    allowed_actions:
    - "SGS_MANAGE_ALIASES"
  tenant_permissions: []
  exclude_cluster_permissions: []
  exclude_index_permissions: []
  static: false

# Role mapping (sg_roles_mapping.yml)
local_logstash:
  reserved: false
  hidden: false
  backend_roles: []
  hosts: []
  users:
  - "logstash"
  and_backend_roles: []

Is it more efficient to have a single index pattern?

  - index_patterns:
    - '/[a-z]-.*/'

Second question.
Is it more efficient to have 1 role with 20 index permissions and 1 role mapping for that role, or 20 roles with 1 index permission each and 20 role mappings for those 20 roles?

This is a bit difficult to answer, because there are two opposing forces here:

  • Having fewer index patterns will be certainly more efficient, as fewer string matching operations need to be performed.
  • Using an actual regexp instead of simple patterns (which just support *) is a bit less efficient.

Still, in total, I would say that the single regexp will be more efficient.

Regarding your second question: this won’t make any difference either way.
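
To illustrate the first point: the first index_permissions entry of the local_logstash role could be collapsed to a single regex pattern roughly like this (just a sketch; please double-check that the regex matches exactly the indices you intend):

  index_permissions:
  - index_patterns:
    - "/[a-z]-.*/"
    fls: []
    masked_fields: []
    allowed_actions:
    - "SGS_CREATE_INDEX"
    - "SGS_CRUD"
    - "SGS_INDICES_MANAGE_ILM"
    - "SGS_MANAGE"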

Thank you. We are seeing significant performance impacts on indexing, which goes through the above logstash role and role mapping. While we wait for a fix we will try to optimize the permissions somewhat.

Hi Brian!

Still working on this, unfortunately.

I have two more questions:

  • How many aliases do you have on your cluster?
  • Do the index_patterns in your role definitions refer to alias names or index names?

Roughly 150 aliases. I can get you a more accurate count if you need it.

Role definitions refer to wildcarded index names, e.g. ‘metrics-metricbeat-*’ for the permissions of one role covering one index type.

    # Role definition
    'index_metrics-metricbeat':
      content:
        cluster_permissions: []
        index_permissions:
          - index_patterns:
            - "metrics-metricbeat-*"
            allowed_actions:
            - "SGS_READ"
            fls: []
            masked_fields: []
        tenant_permissions: []

    # Role mapping
    'index_metrics-metricbeat':
      content:
        backend_roles:
          - "kibana_index_metrics-metricbeat"
        hosts: []
        users: []

Every index is a member of at least 2 aliases.

  1. All indices under ‘metrics-*’, for example, are members of the alias ‘metrics’. Kibana users access data through these large aliases; ‘metrics’ is mapped to a Kibana index pattern that users use.
  2. A second alias is used for ILM: ‘metrics-metricbeat-*’ is a member of ‘ilm-metrics-metricbeat’, and Logstash writes to this alias.
  3. Some indexes are members of 3+ aliases. Security logs from a web server, for example, might be a member of both the ‘application’ and ‘security’ aliases, plus the alias used by ILM to write logs.

Because of ILM, there is one ILM alias for every index type. If there are 109 index name patterns, there are 109 SearchGuard roles and role mappings, and 109 aliases used by ILM. In addition to that, there are the large overall aliases such as ‘metrics’.

Thank you! This is already helpful; more accurate figures are not necessary for now.

Nils

Hi Brian!

Unfortunately, developing a proper fix for this will take some more time. However, we believe we can offer a temporary solution:

Here you can find a snapshot of Search Guard that is identical to the current SG 53 release, with one additional change:

https://maven.search-guard.com:443/search-guard-suite-snapshot/com/floragunn/search-guard-suite-plugin/b-disable-alias-resolution-es-7.16.3-SNAPSHOT/search-guard-suite-plugin-b-disable-alias-resolution-es-7.16.3-20220223.070148-2.zip

The snapshot is for ES 7.16.3. If you need a different version, please let us know.

If you edit sg_config.yml and set support_aliases_in_index_privileges inside the dynamic object to false, index patterns defined inside sg_roles.yml role definitions will no longer support aliases or date math.

Example:

sg_config:
  dynamic:
    support_aliases_in_index_privileges: false

This will significantly speed up privilege evaluation, as only a fraction of the operations needs to be performed.

Note: This does NOT mean that you will no longer be able to access indices via aliases; that works as before. It will just no longer be possible to specify aliases in role definitions and have the indices the aliases point to “inherit” the privileges.
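For example, a role with the index pattern “ilm-metrics-metricbeat” (an alias) would then no longer grant access to the metrics-metricbeat-* indices behind it; the role would need to list “metrics-metricbeat-*” directly, which your roles already do.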

I hope that helps; I’m glad to answer further questions.

Thank you. We will test that and gather hot_threads info.

Separately, we will also test changing the logstash user from 26 individual index patterns to a single * index pattern and gather hot_threads info.
