Search Guard doing a compliance check for each index, causing a huge delay in ES restart

Hi All,

Our cluster configuration: 1 master node, 1 data node, 1 client node, running ELK version 7.4, with the Search Guard plug-in (version 7.x) enabled on all nodes.

We have also enabled debug logging for SG in log4j.properties:

logger.searchguard.name = com.floragunn
logger.searchguard.level = debug
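
As an aside, and not required for this issue: the same logger level can usually be changed at runtime via the Elasticsearch cluster settings API, which avoids a node restart just for a logging change. A minimal sketch, assuming a node is reachable on localhost:9200:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.com.floragunn": "debug"
  }
}'

Setting the value back to null restores the default level.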

When we reboot the ES data node, we observe that SG performs a compliance/audit check for each of our indices, which causes a huge delay in getting the ES instance back up and running. Is there a way to disable this? We have around 1000+ indexes, and the delay is leading to huge downtime (more than 2 hours).

Sample log entries below:

[2019-11-01T03:17:50,479][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Compliance config is null because of dlsFlsAvailable: false and auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog

[2019-11-01T03:32:18,302][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Handle complianceConfig=null/dlsFlsAvailable: false/auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog for onIndexModule() of index k8_ms_logs-2019.10.31
[2019-11-01T03:32:37,654][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Handle complianceConfig=null/dlsFlsAvailable: false/auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog for onIndexModule() of index .monitoring-es-6-2019.10.31

Team,

Any suggestions on the original observation here would really help us move forward.

Thanks.

I had a look at the code in question, but I cannot explain why this should lead to a 2h downtime.

It is true that Elasticsearch calls this particular method for every index. This is inherent to Elasticsearch and cannot be changed.

However, since neither DLS/FLS nor Audit Logging is used (see e.g. “NullAuditLog”), there is not much going on in this method. In your case, two objects are instantiated and assigned, not much more.

What is the reason you suspect this particular code/step is causing the delay? Just the logfile?

@hsaly any ideas?

I don't think that this 2h delay is caused by the “compliance check” or by Search Guard at all. Looking at your cluster topology, I assume that 1000+ indices for a single data node is too much (it would be interesting to know how many shards you have and what your hardware looks like). For a production setup you should have at least 3 dedicated master nodes and 2 data nodes.
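
If it helps with gathering that information, the _cat APIs give a quick overview. A minimal sketch, assuming the cluster is reachable on localhost:9200:

# one line per shard copy, so the line count is the total number of primary and replica shards
curl -s 'localhost:9200/_cat/shards' | wc -l

# heap and RAM per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,ram.max'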

Do you see this kind of delay with Search Guard disabled or uninstalled?

Apologies for the delay. Yes, based on the log file: we can see this going on for each and every index, and only after everything is complete does the ES instance come back online.

Hi @hsaly ,

Thanks, and apologies for the delay in response from my side.

As I have stated in the original request, the observation is against our DEV instance.

In production, we do have the following setup

  1. 2 ES Master nodes (we will add a new master node master-3 very soon)
  2. 2 ES Data Nodes
  3. 2 ES client nodes

But in production we still have ELK running with version 6.3.0, and very soon we will upgrade to 7.x. We want to confirm that the upgrade to 7.x does not break things for us, so we started with a DEV instance running ELK and SG version 7.x. The DEV instance configuration is:

1 master node (3 GB heap)
1 data node (6 GB heap)
1 client node (4 GB heap)

As for your question “Do you see this kind of delay with Search Guard disabled or uninstalled?”, the answer is yes: we disabled SG in one of our environments and the restart is faster than with SG enabled.

+1 to my reply above. Is there a way to disable the compliance check through a property that we can define in elasticsearch.yml? We can try that out and see whether the restart time comes down.
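
For what it is worth, and going from memory of the Search Guard documentation rather than a verified setup: audit logging is selected via searchguard.audit.type in elasticsearch.yml, and a noop type should disable it entirely, for example:

searchguard.audit.type: noop

However, the debug output above already reports NullAuditLog, which suggests audit logging is effectively disabled on this cluster anyway, so this is unlikely to change the restart time.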

So how much faster is the cluster restart with Search Guard disabled?

Can you please also provide logs from all your dev nodes, from the point in time they start up until the cluster is green? (with Search Guard enabled, of course)
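
One way to pin down the point in time when the cluster turns green, assuming a node is reachable on localhost:9200, is to poll the cluster health endpoint right after the restart; the call returns as soon as the requested status is reached or the timeout expires:

curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=30m&pretty'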

Sure, will share this weekend, as having another downtime this week is not allowed for us. Will shut down ELK by Friday IST, restart, capture everything, and share it with you.

Apologies for the delay. Still behind on this. Will share this week.

elk-cluster.zip (10.3 KB)

Hi All,
Attaching the logs (elk*.zip). Please note the time it takes to restart. Since we have removed a whole lot of indexes, the restart appears to be a little faster than before, but it still takes more than 5 minutes. You may refer to the log entries below. We now have a total of 294 indexes, and for that it takes close to 7+ minutes to restart. The more indexes we have, the more time it takes.

Start time: [2019-11-24T02:34:53,373][INFO ][o.e.e.NodeEnvironment ] [data-1] using [1] data paths, mounts [[/app/data (acs-brm-zfs-01:/export/oasp-dev-elk/indexes)]], net usable_space [1tb], net total_space [2.9tb], types [nfs4]

End time (where the data node is initialized and ready to serve requests):
[2019-11-24T02:42:15,897][INFO ][c.f.s.c.ConfigurationRepository] [data-1] Node ‘data-1’ initialized

-Thanks

Hi Team,

Need your help to review the attached logs (please refer to the previous reply) and guide us on whether we can disable the SG compliance check, or on any other way to speed up the ES restart with SG enabled.

Thanks.

Team,

It would be of great help if we could get some guidance on this.

Regards,
Shankar

Team,

We have already provided the logs that you asked for and are waiting for your suggestions.

So how much faster is the cluster restart with Search Guard disabled (when nothing else changed)? Please provide the same logs when you restart the node with SG disabled or removed.

Looking into the logs just makes me believe that the time the node needs to come up is due to normal index recovery, which also happens without Search Guard installed. Search Guard should not add any significant delay with regard to node startup time/recovery. Looking at your dev cluster setup, I think 6 GB RAM for a 7.x data node containing more than 200 indices (and probably a lot more shards) is not enough.
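
To confirm that the time is spent in shard recovery rather than in the plugin, the recovery progress can be watched while the node starts up. A minimal sketch, assuming the cluster is reachable on localhost:9200:

# only recoveries that are currently running, one line per shard
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'

# overall status plus the number of initializing/unassigned shards
curl -s 'localhost:9200/_cluster/health?pretty'

If _cat/recovery still shows active entries right up to the point where the node reports itself as initialized, the startup time is dominated by recovery rather than by Search Guard.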

Please also refer to