First, is this the proper place to be requesting support if we have been buying the Enterprise license every year for a few years?
A summary of the problems:
SearchGuard seems to increase the CPU usage of Elasticsearch by 2x to 3x.
SearchGuard introduces an indexing bottleneck: for our cluster, throughput caps out at roughly 25% of what the same cluster achieves without SearchGuard.
SearchGuard in the newest versions has a significant performance regression in searches, to the point that it makes Kibana unusable.
Current version:
Elasticsearch: 7.16.2
SG: 52.6.0
SG Kibana: 52.1.0
Recently upgraded from:
Elasticsearch: 7.8.2
SG: 42.0.0
Cluster specs (used for log collection):
Indexing: 60,000 logs per second (with spikes up to 200,000 per second)
Searches: 5,000 per second
113 nodes
22,000 shards
11,000 indices
Logstash is the source of all indexing
Kibana is the source of all searches
CentOS 7
All bare metal Linux on physical hardware
After we upgraded, we encountered such severe performance degradation that the cluster was essentially unusable. While troubleshooting, we eventually disabled SearchGuard as a test, at which point all performance problems disappeared. Details of the performance problems are below:
We have always suffered from an invisible indexing bottleneck: it was impossible to push beyond a certain barrier regardless of the Logstash or Elasticsearch hardware we threw at it. The upgrade did not fix this bottleneck, but CPU usage on Elasticsearch dropped to about 1/3 of its previous level (probably due to the faster bulk-action handling introduced in 49.0.0). Previously we would hit the bottleneck at around 60,000 new logs written per second. As soon as we removed SearchGuard we were hitting 150,000 to 250,000 new logs written per second, at about half the Elasticsearch CPU usage that 60,000 per second required with SearchGuard enabled.
After we upgraded, the cluster became so slow that Kibana was unusable. Simple commands such as GET /_cat/health or GET /_cluster/settings were taking 60+ seconds to return, where prior to the upgrade they took <2 seconds. More complex Kibana actions that hit data indices went from <10 seconds to timing out after 180 seconds. In Kibana monitoring, the average Kibana latency was <10 seconds prior to the upgrade; after the upgrade it averaged 90 seconds, and after removing SearchGuard it averaged <0.5 seconds.
Before, during, and after the upgrade there were zero errors in the Elasticsearch, Kibana, or SearchGuard logs. Everything appears to run properly, just with severe performance bottlenecks.
First, is this the proper place to be requesting support if we have been buying the Enterprise license every year for a few years?
There are several partners who sell Search Guard licenses and provide support; thus, I cannot give you a definite answer as to whether and where you can get personal support. For this, please get in touch with the partner you licensed Search Guard from.
Still, regarding your problem:
This does not sound good and is a bit surprising, as the SG releases after 42 have brought a number of performance improvements.
A good starting point for analyzing such issues is a "hot threads" dump. See the docs here on how to retrieve it:
If you don't want to share it publicly, you can send it to me by private message.
Further useful information would be:
Number of entries in sg_roles.yml and sg_roles_mapping.yml.
Number of roles per user.
How are the users authenticated? LDAP? SAML? OIDC?
We are running with SearchGuard disabled, so we are not able to get you a hot threads dump at this moment.
When we disabled indexing, performance went back to normal; searches/Kibana usability was fine.
109 roles, with roughly an equal number of role mappings.
Most Kibana users will be members of an assortment of anywhere from 60+ up to all 109 roles. Internal users such as the Kibana server or Logstash have only a single role and a single role mapping.
For normal Kibana users:
authc is OpenID; it provides the username and a single role
authz is LDAP, which enumerates all the other roles
Each index is mapped to one role, and a role mapping maps that role to an LDAP backend role.
For internal service accounts (Kibana server, Logstash, etc.):
authc is internal user database
authz is noop
An example of the 1-index-to-1-LDAP-backend-group mapping is sketched below. There are roughly 109 indexes, so there are 109 of these mappings. Users are placed into the LDAP groups and may be members of any combination of the total 109.
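A minimal sketch of what one of these pairs looks like; the role name, index pattern, LDAP group, and action group are illustrative placeholders rather than our exact entries (Search Guard 7.x file formats assumed):

```yaml
# sg_roles.yml -- one role granting read access to one index pattern
# (role name, index pattern, and action group are placeholders)
sg_read_metrics_metricbeat:
  index_permissions:
    - index_patterns:
        - "metrics-metricbeat-*"
      allowed_actions:
        - SGS_READ
---
# sg_roles_mapping.yml -- maps that role to the corresponding LDAP backend group
sg_read_metrics_metricbeat:
  backend_roles:
    - "ldap-metrics-metricbeat-readers"
```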
That all sounds quite reasonable and should not cause issues.
However, this also means that we need more information to find out the cause of the issue. I fear a hot threads dump would have the highest chance of finding the culprit.
* Pre- and post-testing with SG disabled ran at 53k/s steady-state indexing, with a 310k/s indexing rate while purging a buffer.
** With SG enabled, the indexing rate would not exceed 17k/s.
1_hotthreads_nosg_indexing_pre
SearchGuard disabled
Searching load: yes
Indexing load: yes
Indexing rate = 53,000 logs per second
2_hotthreads_sg_indexing_1
SearchGuard enabled
Searching load: minimal
Indexing load: yes
Indexing rate = 17,000 logs per second (it would not go faster than this even though there was a queue to purge)
3_hotthreads_sg_indexing_2
SearchGuard enabled
Searching load: minimal
Indexing load: yes
Indexing rate = 17,000 logs per second (it would not go faster than this even though there was a queue to purge)
4_hotthreads_sg_noindexing
SearchGuard enabled
Searching load: no
Indexing load: no
Indexing rate = 0 logs per second
5_hotthreads_nosg_noindexing_post_1
SearchGuard disabled
Searching load: no
Indexing load: no
Indexing rate = 0 logs per second
Has some shard recovery ongoing
6_hotthreads_nosg_noindexing_post_2
SearchGuard disabled
Searching load: no
Indexing load: no
Indexing rate = 0 logs per second
7_hotthreads_nosg_indexing_post_1
SearchGuard disabled
Searching load: yes
Indexing load: yes
Indexing rate = 310,000 logs per second (purging a queue)
8_hotthreads_nosg_indexing_post_2
SearchGuard disabled
Searching load: yes
Indexing load: yes
Indexing rate = 52,000 logs per second
Thank you. Also, as a note, the above 8 hot_threads dumps were collected over a period of about 4 hours: same cluster, same versions, same number of indexes, and same load. There is a significant performance difference between having SearchGuard enabled and disabled.
No. 7.8.2 to 7.16.2 was a rolling in-place upgrade. Same number of indexes in both.
Bundled version.
[INFO ][o.e.n.Node] [server01p-es-01] JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]
Out of curiosity, is this in any part due to the number of roles/mappings we have? We've noticed that the same performance issues do not exist on another cluster. The only real difference is that the other cluster has 2 roles related to OIDC, with everything else being local users with a single role, whereas the cluster with the performance issues has 100+ roles used by OIDC in addition to the same set of local users and local roles.
It depends on both the number of role mappings and the number of indices. Reducing either should linearly reduce the number of string-matching operations necessary per request.
Is it more efficient to have a single index pattern, such as:
- index_patterns:
- '/[a-z]-\*/'
Second question: is it more efficient to have 1 role with 20 index permissions and 1 role mapping for that role, or 20 roles with 1 index permission each and 20 role mappings for those 20 roles? Both layouts are sketched below.
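To make the comparison concrete (all role and index names, and the action group, are made up):

```yaml
# Option A: 1 role with 20 index permission entries (1 role mapping overall)
sg_logs_reader:
  index_permissions:
    - index_patterns: ["logs-app01-*"]
      allowed_actions: ["SGS_READ"]
    - index_patterns: ["logs-app02-*"]
      allowed_actions: ["SGS_READ"]
    # ...18 more entries like the two above
---
# Option B: 20 roles with 1 index permission each (20 role mappings overall)
sg_logs_app01_reader:
  index_permissions:
    - index_patterns: ["logs-app01-*"]
      allowed_actions: ["SGS_READ"]
sg_logs_app02_reader:
  index_permissions:
    - index_patterns: ["logs-app02-*"]
      allowed_actions: ["SGS_READ"]
# ...18 more roles like the two above
```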
Thank you. We are having significant performance impacts with indexing, which goes through the Logstash role and role mapping mentioned above. While we wait for a fix we will try to optimize the permissions somewhat.
All indices under "metrics-*", for example, will be members of the alias "metrics". Kibana users access data through these large aliases; "metrics" is mapped to a Kibana index pattern that users query.
A second alias is used for ILM: "metrics-metricbeat-*" is a member of "ilm-metrics-metricbeat", and Logstash writes to this alias.
Some indexes are members of 3+ aliases. Security logs from a web server, for example, might be members of both the "application" and "security" aliases, plus the alias used by ILM to write logs.
Because of ILM, there is one ILM write alias for every index type. If there are 109 index names, there are 109 SearchGuard roles and role mappings, and 109 aliases used by ILM. In addition to that, there are the large overall aliases such as "metrics". The layout is illustrated below.
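Purely as an illustration (not an actual config file; the backing index name is hypothetical), a single ILM-managed index ends up belonging to aliases like this:

```yaml
# Illustration only: one hypothetical backing index and the aliases it belongs to
metrics-metricbeat-000042:
  aliases:
    - metrics                  # broad alias queried via Kibana
    - ilm-metrics-metricbeat   # per-index-type write alias used by Logstash/ILM
```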
The snapshot is for ES 7.16.3. If you need a different version, please let me know.
If you edit sg_config.yml and set support_aliases_in_index_privileges inside the dynamic object to false, index patterns defined inside sg_roles.yml role definitions will no longer support the definition of aliases or date math.
This will significantly speed up privilege evaluation, as only a fraction of the operations needs to be performed.
Note: This does NOT mean that you will no longer be able to access indices via aliases; that works as before. It will just no longer be possible to specify aliases in role definitions and have the indices the aliases point to "inherit" the privileges.
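For reference, the relevant part of sg_config.yml would then look roughly like this (keep your existing entries under dynamic, such as authc/authz, as they are):

```yaml
# sg_config.yml -- only the relevant keys shown
sg_config:
  dynamic:
    support_aliases_in_index_privileges: false
```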
Hope that helps; glad to answer further questions.
Thank you. We will test that and gather hot_threads info.
Separately, we will also test changing the logstash user from 26 individual index permissions to a single * index permission, and gather hot_threads info. A sketch of the consolidated role is below.
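A minimal sketch of the consolidated Logstash writer role we plan to test; the role name and action groups are placeholders, not our exact configuration:

```yaml
# sg_roles.yml -- single wildcard index permission instead of 26 per-index entries
sg_logstash_writer:
  cluster_permissions:
    - SGS_CLUSTER_MONITOR
    - SGS_CLUSTER_COMPOSITE_OPS
  index_permissions:
    - index_patterns:
        - "*"
      allowed_actions:
        - SGS_CRUD
        - SGS_CREATE_INDEX
```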