SearchGuard is not initialized just after a node joins a cluster

Issue

Assume an ES cluster in which SearchGuard is already initialized, i.e. the searchguard index exists.
Between the moment a node joins the cluster and the moment it finishes loading the SearchGuard configuration, that node does not accept any indexing requests, because SearchGuard on it is considered not yet initialized.

A client app receives the following error:

org.elasticsearch.transport.RemoteTransportException: [(host)][(address)][indices:data/write/bulk]
Caused by: org.elasticsearch.ElasticsearchSecurityException: Cannot authenticate null

The client app uses sniffing.

That error happens because BackendRegistry is not initialized.
https://github.com/floragunncom/search-guard/blob/es-6.6.2/src/main/java/com/floragunn/searchguard/transport/SearchGuardRequestHandler.java#L228

The error is transient, so retrying requests should eventually work, but we would prefer to avoid the error altogether if possible.

Questions

  • Can we make sure the node is initialized before the node joins the cluster (when searchguard index exists)?
  • Any idea how to avoid this transient error?

Version

Elasticsearch: 6.6.1
SearchGuard: 24.1

@cstaley any ideas regarding this issue?

I guess the only option (besides retrying requests) is to use a load balancer that checks the health status of each node via our health check endpoint, which is documented in the Installation section of the Search Guard docs. If a load balancer is not an option, the client application could call the health check API on a regular basis.
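
For readers who want to try the client-side variant, here is a minimal sketch of such a probe. It assumes the health check is served at `/_searchguard/health` and answers with HTTP 200 once Search Guard is initialized and with a non-200 status otherwise; please verify both assumptions against the docs for your version. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class SearchGuardHealthCheck {

    // Assumed endpoint path; check the Search Guard docs for your version.
    static final String HEALTH_PATH = "/_searchguard/health";

    /** Only HTTP 200 is treated as "Search Guard is initialized on this node". */
    static boolean isHealthyStatus(int httpStatus) {
        return httpStatus == HttpURLConnection.HTTP_OK;
    }

    /**
     * Probes one node, e.g. baseUrl = "https://es-node-1:9200".
     * Any I/O problem or malformed URL is reported as "not initialized".
     */
    static boolean isInitialized(String baseUrl, int timeoutMillis) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(baseUrl + HEALTH_PATH).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            try {
                return isHealthyStatus(conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (IOException | IllegalArgumentException e) {
            // Unreachable node or bad URL: do not route requests to it.
            return false;
        }
    }
}
```

A scheduler in the client could call `isInitialized` for every known node every few seconds and only send bulk requests to nodes that returned `true`. TLS setup and authentication for the probe are left out here.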

I see. Good to know about the health check endpoint.
We might be able to implement a custom sniffer that sends requests only to initialized nodes. It would be nice if the SearchGuard library provided such a sniffer. :)
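
A rough sketch of the filtering step such a sniffer would need, independent of which client or sniffer API is used. `InitializedNodeFilter` is a hypothetical helper, not part of Search Guard or Elasticsearch; the predicate stands in for whatever health probe the client uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class InitializedNodeFilter {

    /**
     * Keeps only the sniffed hosts for which the health predicate returns true.
     * If no host passes, the original list is returned so that the client is
     * never left without any node to talk to (it can still retry).
     */
    static List<String> filter(List<String> sniffedHosts, Predicate<String> isInitialized) {
        List<String> ready = new ArrayList<>();
        for (String host : sniffedHosts) {
            if (isInitialized.test(host)) {
                ready.add(host);
            }
        }
        return ready.isEmpty() ? new ArrayList<>(sniffedHosts) : ready;
    }
}
```

Falling back to the full list when nothing is ready is a design choice; returning an empty list and failing fast would be the alternative.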

If we retry requests, can we check whether the failures can be transient for this case?

What exactly do you mean by “can we check whether the failures can be transient for this case”?

I thought it would be nice if a client could determine whether a failure can be resolved by retrying the request.
For example, an Exception class might have a flag or code that indicates whether retrying might help or not.
If a failure happens because SG is not initialized, clients might choose to retry requests until it’s initialized.
If a failure happens due to insufficient privileges, clients might choose to stop sending requests.
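
Without such a flag, the closest a client can get today is probably a heuristic like the sketch below. It treats a failure as transient when any exception in the cause chain carries the "Cannot authenticate null" message from the error shown earlier in this thread. That is an assumption on my side: the same message may also appear for requests that really are unauthenticated, so the number of attempts is bounded. `RetryClassifier` and its methods are made-up names.

```java
import java.util.concurrent.Callable;

final class RetryClassifier {

    // Message taken from the error in this thread; matching on it is a heuristic.
    static final String NOT_INITIALIZED_MARKER = "Cannot authenticate null";

    /** Walks the cause chain (bounded depth) looking for the marker message. */
    static boolean looksTransient(Throwable failure) {
        int depth = 0;
        for (Throwable t = failure; t != null && depth < 20; t = t.getCause(), depth++) {
            String message = t.getMessage();
            if (message != null && message.contains(NOT_INITIALIZED_MARKER)) {
                return true;
            }
        }
        return false;
    }

    /** Retries only failures that look transient; everything else is rethrown at once. */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!looksTransient(e) || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(delayMillis);
            }
        }
    }
}
```

A bulk request would then be wrapped as `RetryClassifier.callWithRetry(() -> sendBulk(request), 5, 500)`, where `sendBulk` is whatever the application already uses to send the request.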
