Search guard doing compliance check for each index causing huge delay in ES restart

Hi All,

Our cluster configuration: 1 Master Node, 1 Data Node, 1 Client Node with ELK version 7.4 and we have Search Guard plug-in enabled in all nodes with version 7.x.

We have also enabled debug log for SG in log4j.properties

logger.searchguard.name = com.floragunn
logger.searchguard.level = debug

When we try rebooting ES data Node, we are observing that for each of our index, SG is doing a compliance/audit check which is causing huge delay in getting back the ES instance up and running. Is there a way to disable the same as we have around 1000+ indexes and the delay is leading to huge down time (more than 2 hours) ?

Sample log entries below :-

[2019-11-01T03:17:50,479][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Compliance config is null because of dlsFlsAvailable: false and auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog

[2019-11-01T03:32:18,302][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Handle complianceConfig=null/dlsFlsAvailable: false/auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog for onIndexModule() of index k8_ms_logs-2019.10.31
[2019-11-01T03:32:37,654][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Handle complianceConfig=null/dlsFlsAvailable: false/auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog for onIndexModule() of index .monitoring-es-6-2019.10.31

Team,

Any suggestions for the original observation here would really help us moving forward

Thanks.

I had a look at the code in question, but I cannot explain why this should lead to a 2h downtime.

It is true that Elasticsearch calls this particular method for every index. This is inherent to Elasticsearch and cannot be changed.

However, since neither DLS/FLS nor Audit Logging is used (see e.g. “NullAuditLog”), there is not much going on in this method. In your case, two objects are instantiated and assigned, not much more.

What is the reason that you suspect this particular code / step causing the delay? “Jus” the logfile?

@hsaly any ideas?

I dont think that this 2h delay is subject to “compliance check” or Search Guard at all. Looking at your cluster topology I assume that 1000+ indices for a single data node is to much (would be interesting how many shards you have and what your hardware looks like). For a production setup you should have at least 3 dedicated master nodes and 2 data nodes.

Do you see this kind of delay with Search Guard disabled or uninstalled?

Apologize for the delay . Yes based on log file as we can see this going on for each and every index and only after everything s complete, the ES instance is getting back online

Hi @hsaly ,

Thanks and apologize for the delay in response from my side.

As I have stated in the original request, the observation is against our DEV instance.

In production, we do have the following setup

  1. 2 ES Master nodes (we will add a new master node master-3 very soon)
  2. 2 ES Data Nodes
  3. 2 ES client nodes

But in production, we still have ELK running with version 6.3.0 and very soon we will upgrade to 7.x . We want to confirm whether the upgrade to 7.x does not not break things for us so we started with DEV instance with ELK and SG version 7.x . The DEV instance config are

1 master node (3 GB heap),
1 data node (6 GB heap)
1 client node (4 GB heap)

And for your point “Do you see this kind of delay with Search Guard disabled or uninstalled?” and the answer is yes, we disabled SG in one of our environments and the restart is faster than the one with SG enabled.

+1 to my reply above. Is there a way to disable the compliance check though a property that we can define in elasticsearch.yml ? We can try that out and see whether the re-start time is coming down

So how much faster ist the cluster restart with Search Guard disabled?

Can you please also provide logs from all your dev nodes from from the point in time they start up undtil the cluster is green? (with Search Guard enabled of course)