Cannot retrieve cluster state due to: no permissions for [cluster:monitor/health] and User

Hello! I am running an ES 6.2 cluster with SG.
Last week, one data node was accidentally spun up with version 6.8 of ES and SG. It was deallocated and stopped, and the cluster is now running again with all nodes on 6.2. But it looks like something in the SG index was broken by the 6.8 node, and it is now incompatible with version 6.2. At least:

  • The new node with 6.2 is unable to authenticate users
  • I am not able to retrieve the SG config from the old nodes with sgadmin.sh

When I try to retrieve the config with
/usr/share/elasticsearch/plugins/search-guard-6/tools/sgadmin.sh -cd /tmp/ -r -icl -nhnv -cacert /tmp/es_root_ca.pem -cert /tmp/es_admin.pem -key /tmp/es_admin.key -keypass xxx
I receive this error message:

Cannot retrieve cluster state due to: no permissions for [cluster:monitor/health] and User [name=CN=admin-eu,OU=x,O=y,DC=e,DC=s,DC=com, roles=[], requestedTenant=null]. This is not an error, will keep on trying ...
  Root cause: ElasticsearchSecurityException[no permissions for [cluster:monitor/health] and User [name=CN=admin-eu,OU=x,O=y,DC=e,DC=s,DC=com, roles=[], 
requestedTenant=null]] (org.elasticsearch.ElasticsearchSecurityException/org.elasticsearch.ElasticsearchSecurityException)

I get the same message when I try to run a diagnosis with the -dg flag.
The empty list of roles looks strange to me.

When I try to retrieve the SG config from the node which was started right after this incident with version 6.2, I get
FAIL: Get configuration for 'roles' because it does not exist

Looking forward to hearing any advice!
Thanks!

First of all: Search Guard will at least not change its configuration just because it detects a new ES minor version.

As the first error message indicates that the user has no permissions, you should double check that the certificate you are using to connect to the cluster is indeed an admin certificate.

  • To do so, retrieve the subject DN of the certificate stored in /tmp/es_admin.pem. For example, you can use openssl x509 -in /tmp/es_admin.pem -text for this task.

  • Second, check the file config/elasticsearch.yml on the node you tried to connect to. Look for the config option searchguard.authcz.admin_dn. The subject DN of the certificate from above should be listed there.

  • If the subject DN is not listed in config/elasticsearch.yml, you are either using the wrong certificate, in which case you should switch to the correct one; or something has changed your config/elasticsearch.yml file, in which case you need to restore the old configuration. Take care to restore the configuration in its entirety.
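As a minimal sketch of the check (using a throwaway self-signed certificate and a sample elasticsearch.yml in place of your real files — all paths and the DN here are made up):

```shell
# Sketch only: a throwaway self-signed certificate and a sample yml file
# stand in for the real admin certificate and elasticsearch.yml.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo_admin.key -out /tmp/demo_admin.pem \
  -subj "/DC=com/DC=company/O=y/OU=x/CN=demo-admin" 2>/dev/null

# Step 1: print only the subject DN of the certificate:
openssl x509 -in /tmp/demo_admin.pem -noout -subject

# Create a sample elasticsearch.yml with the matching admin_dn entry
# (note the DN is usually listed CN-first there):
cat > /tmp/demo_elasticsearch.yml <<'EOF'
searchguard.authcz.admin_dn:
  - CN=demo-admin,OU=x,O=y,DC=company,DC=com
EOF

# Step 2: check that the certificate's CN appears in the admin_dn list:
grep 'CN=demo-admin' /tmp/demo_elasticsearch.yml
```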

If this does not help, we’d need you to look into the Elasticsearch log files for error messages in order to diagnose the problem further.

Hello @cstaley! Thanks for your feedback.

I’ve double-checked the admin_dn list once again with the method you mentioned, and the admin is listed there.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
    Signature Algorithm: sha384WithRSAEncryption
        Issuer: 
        Validity
        Subject: DC=com, DC=company, DC=z, O=y, OU=x, CN=xxx-prod-e-search-admin
cat /etc/elasticsearch/elasticsearch.yml | grep "xxx-prod-e-search-admin"
- CN=xxx-prod-e-search-admin,OU=x,O=y,DC=z,DC=company,DC=com

I’ve obfuscated the output so as not to expose the real admin name and company.
One of the nodes currently in the cluster (not the one from which I just executed these commands) contains a different list of admins due to user-management tasks.

Just double-checking: the error message in your first post listed a different user name:

CN=admin-eu,OU=x,O=y,DC=e,DC=s,DC=com

Is this expected?

Yes, sorry, this is because I remove sensitive data by hand every time before posting. These are definitely the same keys and values.

Then, for further diagnosis, I’d need to ask you to look into the log files of Elasticsearch for error messages. Especially error messages during startup and while running the sgadmin command would be interesting.

By the way: I just saw that in your first post you mentioned that the user has no roles. That is expected. If a user is properly authenticated as an admin user (via an admin certificate), the user has superuser privileges without needing or having any roles.

@cstaley there are no updates in the log file during the retrieve request (-r):

tail /var/log/elasticsearch/xxx-prod-cluster.log  -n 0 -f
Cannot retrieve cluster state due to: no permissions for [cluster:monitor/health] and User [name=CN=xxx-prod-e-search-admin,OU=x,O=y,DC=z,DC=company,DC=com, roles=[], requestedTenant=null]. This is not an error, will keep on trying ...
  Root cause: ElasticsearchSecurityException[no permissions for [cluster:monitor/health] and User [name=CN=xxx-prod-e-search-admin,OU=x,O=y,DC=z,DC=company,DC=com, roles=[], 
requestedTenant=null]] (org.elasticsearch.ElasticsearchSecurityException/org.elasticsearch.ElasticsearchSecurityException)
   * Try running sgadmin.sh with -icl (but no -cl) and -nhnv (If that works you need to check your clustername as well as hostnames in your TLS certificates)
   * Make sure that your keystore or PEM certificate is a client certificate (not a node certificate) and configured properly in elasticsearch.yml
   * If this is not working, try running sgadmin.sh with --diagnose and see diagnose trace log file)
   * Add --accept-red-cluster to allow sgadmin to operate on a red cluster.

Two things that I should mention.

  • After the node with 6.8 showed up in the cluster, it was partially populated with data and then deallocated. But during the deallocation process it got stuck with 167 shards or so. The cluster was green and there were no unassigned shards, but allocation/explain showed a strange message:
  "error" : {
    "root_cause" : [
      {
        "type" : "remote_transport_exception",
        "reason" : "[xxx-prod-search-eu-master-10-x-x-x][10.x.x.x:9300][cluster:monitor/allocation/explain]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
  • I’ve noticed that sgadmin_diag_trace was created yesterday and is available if it helps

Of course, the sgadmin_diag_trace might be helpful.

I’d also ask you to look for log messages which occurred during the startup of ES; this would be especially interesting for the node which cannot authenticate new users.

If you can still connect with normal users, or if you are willing to restart a node, you might also want to increase the log level. See here how to do this: https://docs.search-guard.com/latest/troubleshooting-setting-log-level
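As a sketch only (please verify against the linked page before applying), raising the Search Guard log level typically comes down to a log4j2 logger entry for Search Guard’s package namespace, com.floragunn; the logger label `searchguard` below is an arbitrary name:

```properties
# Hypothetical fragment for config/log4j2.properties; check the linked
# documentation before applying. com.floragunn is Search Guard's package.
logger.searchguard.name = com.floragunn
logger.searchguard.level = debug
```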

@cstaley I’ve shared the file with you in the messages.

Regarding the logs and the start/stop of the cluster, I am going to recreate that situation on preprod, but it will take some time. So, if possible, could you meanwhile please take a look at the diagnosis file?

Regards

@nikolai-bulashev

Unfortunately, the ES and Search Guard versions you are running are quite old. There were at least four bugfix releases of Search Guard for ES 6.2.4 after 6.2.4-22.1. Unfortunately, even these, and ES 6.2.4 altogether, have reached end-of-life. Thus, we cannot really give support for them.

See here for an overview of all supported versions of ES and SG:

Anyway, one observation from the file you have provided:

Other actions like NodeInfoAction are successful, only ClusterHealthAction fails. Unlike most actions, ClusterHealthAction is always routed directly to the master node. According to the logs, the master node does NOT recognize the user as an admin user. So, you should check the searchguard.authcz.admin_dn setting in elasticsearch.yml there.

You said that the new node does not authenticate users. Is the new node by any chance the master node?

Regarding logs: The existing logs might still contain the errors from the startup. Did you check for these? Then, you would not need to recreate the situation. As I said, the logs from the node not authenticating users would be most interesting.

Finally: The diagnosis file you provided lists all nodes of the cluster. Is the new node listed there?

No. It was a data node.

I have provided the restart log in a personal message.

From the master node, I am able to partially retrieve the SG config:

Will retrieve 'sg/config' into /tmp/sg_config_2020-Sep-30_10-30-36.yml
   SUCC: Configuration for 'config' stored in /tmp/sg_config_2020-Sep-30_10-30-36.yml
Will retrieve 'sg/roles' into /tmp/sg_roles_2020-Sep-30_10-30-36.yml
   FAIL: Get configuration for 'roles' because it does not exist
Will retrieve 'sg/rolesmapping' into /tmp/sg_roles_mapping_2020-Sep-30_10-30-36.yml
   SUCC: Configuration for 'rolesmapping' stored in /tmp/sg_roles_mapping_2020-Sep-30_10-30-36.yml
Will retrieve 'sg/internalusers' into /tmp/sg_internal_users_2020-Sep-30_10-30-36.yml
   FAIL: Get configuration for 'internalusers' because it does not exist
Will retrieve 'sg/actiongroups' into /tmp/sg_action_groups_2020-Sep-30_10-30-36.yml
   SUCC: Configuration for 'actiongroups' stored in /tmp/sg_action_groups_2020-Sep-30_10-30-36.yml

From the new node there is no auth problem, but retrieving the SG config fails at each step:

 INFO: searchguard index state is YELLOW, it seems you miss some replicas
 Will retrieve 'sg/config' into /tmp/sg_config_2020-Sep-30_11-27-01.yml
    FAIL: Get configuration for 'config' because it does not exist
 Will retrieve 'sg/roles' into /tmp/sg_roles_2020-Sep-30_11-27-01.yml
    FAIL: Get configuration for 'roles' because it does not exist
 Will retrieve 'sg/rolesmapping' into /tmp/sg_roles_mapping_2020-Sep-30_11-27-01.yml
    FAIL: Get configuration for 'rolesmapping' because it does not exist
 Will retrieve 'sg/internalusers' into /tmp/sg_internal_users_2020-Sep-30_11-27-01.yml
    FAIL: Get configuration for 'internalusers' because it does not exist
 Will retrieve 'sg/actiongroups' into /tmp/sg_action_groups_2020-Sep-30_11-27-01.yml
    FAIL: Get configuration for 'actiongroups' because it does not exist

From the old node, still the auth problem. What I figured out: the old nodes contain an old list of admins which might currently not be in the SG index. Could that be the problem?

Yep, it is the node with the last octet 200; it is listed there.

But it looks like the SG index is totally corrupted on that node (check the previous message) and partially corrupted on the other nodes.

Update: after removing this node from the IP exclude list, it was able to sync the SG index, and now the dump partially works.

After restarting the new node and removing it from the exclude list, it looks like it was able to sync the SG index with the other nodes:

searchguard index already exists, so we do not need to create one.
Will retrieve 'sg/config' into /tmp/sg_config_2020-Sep-30_11-53-43.yml
   SUCC: Configuration for 'config' stored in /tmp/sg_config_2020-Sep-30_11-53-43.yml
Will retrieve 'sg/roles' into /tmp/sg_roles_2020-Sep-30_11-53-43.yml
   FAIL: Get configuration for 'roles' because it does not exist
Will retrieve 'sg/rolesmapping' into /tmp/sg_roles_mapping_2020-Sep-30_11-53-43.yml
   SUCC: Configuration for 'rolesmapping' stored in /tmp/sg_roles_mapping_2020-Sep-30_11-53-43.yml
Will retrieve 'sg/internalusers' into /tmp/sg_internal_users_2020-Sep-30_11-53-43.yml
   FAIL: Get configuration for 'internalusers' because it does not exist
Will retrieve 'sg/actiongroups' into /tmp/sg_action_groups_2020-Sep-30_11-53-43.yml
   SUCC: Configuration for 'actiongroups' stored in /tmp/sg_action_groups_2020-Sep-30_11-53-43.yml

My current plan is to schedule a maintenance window and try to re-initialize the SG index.

Admin certificate users are not defined in the Search Guard configuration index. The configuration for admin certificates is self-contained and lives only in elasticsearch.yml. This is precisely to allow logins even when there are problems with the index. However, this means it is essential that the Search Guard configuration in elasticsearch.yml is always kept in sync across the nodes.
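Since admin_dn lives only in elasticsearch.yml, drift between nodes produces exactly this kind of asymmetric behaviour. A minimal sketch of spotting such drift, using two sample files in place of configs copied from different nodes (file names and DNs here are made up):

```shell
# Sample files stand in for elasticsearch.yml copied from two nodes;
# the DNs are made up for illustration.
cat > /tmp/node_a.yml <<'EOF'
searchguard.authcz.admin_dn:
  - CN=admin-one,OU=x,O=y,DC=com
EOF

cat > /tmp/node_b.yml <<'EOF'
searchguard.authcz.admin_dn:
  - CN=admin-two,OU=x,O=y,DC=com
EOF

# diff exits non-zero when the files differ:
if diff -u /tmp/node_a.yml /tmp/node_b.yml > /dev/null; then
  echo "admin_dn in sync"
else
  echo "admin_dn differs between nodes"   # printed in this example
fi
```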

To me it looks like there is something fundamentally wrong with the cluster. New nodes should automatically pick up any index without any manual intervention.

Again, I’d strongly recommend updating ES and SG ASAP.

Got it. I really appreciate your help. I’ll update the topic for the record as soon as I get any further results.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.