Hello! I am running an ES 6.2 cluster with SG.
Last week, one data node was accidentally spun up with version 6.8 of ES and SG. It was deallocated and stopped, and the cluster is now running again with all nodes on version 6.2. But it looks like something in the SG index was broken by the 6.8 node, and it is now incompatible with version 6.2. At least:
The new node with 6.2 is unable to authenticate users.
I am not able to retrieve the SG config from the old nodes with sgadmin.sh.
When I try to retrieve the config with /usr/share/elasticsearch/plugins/search-guard-6/tools/sgadmin.sh -cd /tmp/ -r -icl -nhnv -cacert /tmp/es_root_ca.pem -cert /tmp/es_admin.pem -key /tmp/es_admin.key -keypass xxx
I receive this error message:
Cannot retrieve cluster state due to: no permissions for [cluster:monitor/health] and User [name=CN=admin-eu,OU=x,O=y,DC=e,DC=s,DC=com, roles=[], requestedTenant=null]. This is not an error, will keep on trying ...
Root cause: ElasticsearchSecurityException[no permissions for [cluster:monitor/health] and User [name=CN=admin-eu,OU=x,O=y,DC=e,DC=s,DC=com, roles=[],
requestedTenant=null]] (org.elasticsearch.ElasticsearchSecurityException/org.elasticsearch.ElasticsearchSecurityException)
I get the same message when I try to run a diagnosis with the -dg option.
The empty list of roles looks strange to me.
When I try to retrieve the SG config from the node which was started with version 6.2 right after this incident, I get "FAIL: Get configuration for 'roles' because it does not exist".
First of all: Search Guard won't change its configuration just because it detects a new ES minor version.
As the first error message indicates that the user has no permissions, you should double check that the certificate you are using to connect to the cluster is indeed an admin certificate.
To do so, retrieve the subject DN of the certificate stored in /tmp/es_admin.pem. You can use, for example, openssl x509 -in /tmp/es_admin.pem -text for this.
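A minimal sketch that prints only the subject DN (same certificate path as in your command):
# Print just the subject DN of the admin certificate
openssl x509 -in /tmp/es_admin.pem -noout -subject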
Second, check the file config/elasticsearch.yml on the node you tried to connect to. Look for the config option searchguard.authcz.admin_dn. The subject DN of the certificate above should be listed there.
If the subject DN is not listed in config/elasticsearch.yml, you are either using the wrong certificate, in which case you should switch to the correct one, or something has changed your config/elasticsearch.yml file, in which case you need to restore the old configuration. Take care to restore the whole configuration.
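A quick way to check this (sketch only; the path /etc/elasticsearch/elasticsearch.yml is an assumption and depends on your installation):
# List the admin DNs configured on this node; the subject DN from the openssl output above must appear here
grep -A 5 'searchguard.authcz.admin_dn' /etc/elasticsearch/elasticsearch.yml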
If this does not help, we’d need you to look into the log files of Elasticsearch for error messages in order to do further diagnosis on the problem.
I’ve obfuscated the output so as not to expose the real admin name and company.
One of the nodes currently in the cluster (not the one from which I just executed the commands) contains a different list of admins due to user management tasks.
Then, for further diagnosis, I’d need to ask you to look into the log files of Elasticsearch for error messages. Error messages during startup and error messages while running the sgadmin command would be especially interesting.
By the way: I just saw that in your first post you mentioned that the user has no roles. That is expected. If a user is properly authenticated as an admin user (via an admin certificate), the user has superuser privileges without needing or having any roles.
Cannot retrieve cluster state due to: no permissions for [cluster:monitor/health] and User [name=CN=xxx-prod-e-search-admin,OU=x,O=y,DC=y,z=company,DC=com, roles=[], requestedTenant=null]. This is not an error, will keep on trying ...
Root cause: ElasticsearchSecurityException[no permissions for [cluster:monitor/health] and User [name=CN=xxx-prod-e-search-admin,OU=x,O=y,DC=z,DC=company,DC=com, roles=[],
requestedTenant=null]] (org.elasticsearch.ElasticsearchSecurityException/org.elasticsearch.ElasticsearchSecurityException)
* Try running sgadmin.sh with -icl (but no -cl) and -nhnv (If that works you need to check your clustername as well as hostnames in your TLS certificates)
* Make sure that your keystore or PEM certificate is a client certificate (not a node certificate) and configured properly in elasticsearch.yml
* If this is not working, try running sgadmin.sh with --diagnose and see diagnose trace log file
* Add --accept-red-cluster to allow sgadmin to operate on a red cluster.
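For reference, a possible invocation combining these hints with the certificate options already used above (a sketch, not a prescribed command; paths and the key password are the ones from the earlier post):
# Retrieve the SG config again, this time with --diagnose to write a trace file
/usr/share/elasticsearch/plugins/search-guard-6/tools/sgadmin.sh -cd /tmp/ -r -icl -nhnv --diagnose -cacert /tmp/es_root_ca.pem -cert /tmp/es_admin.pem -key /tmp/es_admin.key -keypass xxx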
After the node with 6.8 showed up in the cluster, it was partially populated with data and then deallocated. But during the deallocation process it got stuck with 167 shards or so. The cluster was green and there were no unassigned shards, but allocation/explain showed a strange message.
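For reference, that output comes from the cluster allocation explain API, which can be queried roughly like this (a sketch; host, credentials and the index/shard values are placeholders):
# Ask ES why a given shard is allocated where it is (or why it cannot move)
curl -sk -u admin:password -H 'Content-Type: application/json' 'https://localhost:9200/_cluster/allocation/explain?pretty' -d '{"index": "myindex", "shard": 0, "primary": true}'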
Of course, the sgadmin_diag_trace might be helpful.
I’d also ask you to look for log messages which occurred during the startup of ES; this would be especially interesting for the node which cannot authenticate new users.
@cstaley I’ve shared the file with you in the messages.
Regarding the logs and the start/stop of the cluster, I am going to recreate that situation on a preprod environment, but it will take some time. So, if possible, could you please take a look at the diagnosis file in the meantime?
Unfortunately, the ES and Search Guard versions you are running are quite old. There were at least four bugfix releases of Search Guard for ES 6.2.4 after 6.2.4-22.1. Even these, and ES 6.2.4 altogether, have reached end-of-life. Thus, we cannot really give support for these.
See here for an overview of all supported versions of ES and SG:
Anyway, one observation from the file you have provided:
Other actions like NodeInfoAction are successful, only ClusterHealthAction fails. Unlike most actions, ClusterHealthAction is always routed directly to the master node. According to the logs, the master node does NOT recognize the user as an admin user. So, you should check the searchguard.authcz.admin_dn setting in elasticsearch.yml there.
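A quick way to check this might be (sketch only; host and credentials are placeholders, and the elasticsearch.yml path depends on your installation):
# 1) Find out which node is currently the elected master
curl -sk -u admin:password 'https://localhost:9200/_cat/master?v'
# 2) On that node, check which subject DNs are accepted as admin certificates
grep -A 5 'searchguard.authcz.admin_dn' /etc/elasticsearch/elasticsearch.yml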
You said that the new node does not authenticate users. Is the new node by any chance the master node?
Regarding logs: The existing logs might still contain the errors from the startup. Did you check for these? Then, you would not need to recreate the situation. As I said, the logs from the node not authenticating users would be most interesting.
Finally: The diagnosis file you provided lists all nodes of the cluster. Is the new node listed there?
From the master node I am able to partially retrieve the SG config:
Will retrieve 'sg/config' into /tmp/sg_config_2020-Sep-30_10-30-36.yml
SUCC: Configuration for 'config' stored in /tmp/sg_config_2020-Sep-30_10-30-36.yml
Will retrieve 'sg/roles' into /tmp/sg_roles_2020-Sep-30_10-30-36.yml
FAIL: Get configuration for 'roles' because it does not exist
Will retrieve 'sg/rolesmapping' into /tmp/sg_roles_mapping_2020-Sep-30_10-30-36.yml
SUCC: Configuration for 'rolesmapping' stored in /tmp/sg_roles_mapping_2020-Sep-30_10-30-36.yml
Will retrieve 'sg/internalusers' into /tmp/sg_internal_users_2020-Sep-30_10-30-36.yml
FAIL: Get configuration for 'internalusers' because it does not exist
Will retrieve 'sg/actiongroups' into /tmp/sg_action_groups_2020-Sep-30_10-30-36.yml
SUCC: Configuration for 'actiongroups' stored in /tmp/sg_action_groups_2020-Sep-30_10-30-36.yml
From the new node there is no auth problem, but the retrieval of the SG config fails at each step:
INFO: searchguard index state is YELLOW, it seems you miss some replicas
Will retrieve 'sg/config' into /tmp/sg_config_2020-Sep-30_11-27-01.yml
FAIL: Get configuration for 'config' because it does not exist
Will retrieve 'sg/roles' into /tmp/sg_roles_2020-Sep-30_11-27-01.yml
FAIL: Get configuration for 'roles' because it does not exist
Will retrieve 'sg/rolesmapping' into /tmp/sg_roles_mapping_2020-Sep-30_11-27-01.yml
FAIL: Get configuration for 'rolesmapping' because it does not exist
Will retrieve 'sg/internalusers' into /tmp/sg_internal_users_2020-Sep-30_11-27-01.yml
FAIL: Get configuration for 'internalusers' because it does not exist
Will retrieve 'sg/actiongroups' into /tmp/sg_action_groups_2020-Sep-30_11-27-01.yml
FAIL: Get configuration for 'actiongroups' because it does not exist
From the old node I still get the auth problem. What I figured out: the old nodes contain an old list of admins, which currently might not be in the SG index. Could that be the problem?
Yep, it is the node with the last octet 200, and it is listed there.
But it looks like the SG index is totally corrupted on that node (see the previous message) and partially corrupted on the other nodes.
Update: after removing this node from the exclude IP list, it was able to sync the SG index, and now the config dump partially works.
After restarting the new node and removing it from the exclude list, it looks like it was able to sync the SG index with the other nodes:
searchguard index already exists, so we do not need to create one.
Will retrieve 'sg/config' into /tmp/sg_config_2020-Sep-30_11-53-43.yml
SUCC: Configuration for 'config' stored in /tmp/sg_config_2020-Sep-30_11-53-43.yml
Will retrieve 'sg/roles' into /tmp/sg_roles_2020-Sep-30_11-53-43.yml
FAIL: Get configuration for 'roles' because it does not exist
Will retrieve 'sg/rolesmapping' into /tmp/sg_roles_mapping_2020-Sep-30_11-53-43.yml
SUCC: Configuration for 'rolesmapping' stored in /tmp/sg_roles_mapping_2020-Sep-30_11-53-43.yml
Will retrieve 'sg/internalusers' into /tmp/sg_internal_users_2020-Sep-30_11-53-43.yml
FAIL: Get configuration for 'internalusers' because it does not exist
Will retrieve 'sg/actiongroups' into /tmp/sg_action_groups_2020-Sep-30_11-53-43.yml
SUCC: Configuration for 'actiongroups' stored in /tmp/sg_action_groups_2020-Sep-30_11-53-43.yml
My current plan is to schedule a maintenance window and try to re-initialize the SG index.
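For reference, a re-initialization with sgadmin might look roughly like this (a sketch; it assumes a directory with a complete, current set of SG 6 config files and reuses the certificate options from the earlier commands):
# Upload the full Search Guard configuration (sg_config.yml, sg_roles.yml, sg_roles_mapping.yml, sg_internal_users.yml, sg_action_groups.yml) from the given directory into the SG index
/usr/share/elasticsearch/plugins/search-guard-6/tools/sgadmin.sh -cd /path/to/sgconfig -icl -nhnv -cacert /tmp/es_root_ca.pem -cert /tmp/es_admin.pem -key /tmp/es_admin.key -keypass xxx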
Admin certificate users are not defined in the Search Guard configuration index. The configuration for admin certificates is self-contained and lives only in elasticsearch.yml. This is precisely so that logins remain possible even when there are problems with the index. This, however, means that it is essential that the Search Guard configuration in elasticsearch.yml is always kept in sync across the nodes.
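As a sketch, the relevant section of elasticsearch.yml looks like this (the DN below is the example from the error output earlier in this thread; every node must list the same admin DNs):
# Subject DNs listed here are accepted as admin certificates on this node
searchguard.authcz.admin_dn:
  - "CN=admin-eu,OU=x,O=y,DC=e,DC=s,DC=com"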
For me it looks like there is something fundamentally wrong with the cluster. New nodes should automatically pick up any index without any need for manual intervention.
Again, I’d strongly recommend updating ES and SG ASAP.