Search Guard doing a compliance check for each index, causing a huge delay in ES restart

Hi All,

Our cluster configuration: 1 master node, 1 data node, 1 client node, running ELK version 7.4, with the Search Guard plug-in (version 7.x) enabled on all nodes.

We have also enabled debug logging for SG in log4j.properties:

logger.searchguard.name = com.floragunn
logger.searchguard.level = debug
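
As an aside, and not required for this issue: the same logger level can usually be changed at runtime via the Elasticsearch cluster settings API, which avoids a node restart just for a logging change. A minimal sketch, assuming a node is reachable on localhost:9200:

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.com.floragunn": "debug"
  }
}'

Setting the value back to null restores the default level.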

When we reboot the ES data node, we observe that SG performs a compliance/audit check for each of our indices, which causes a huge delay in getting the ES instance back up and running. Is there a way to disable this? We have around 1000+ indexes, and the delay is leading to huge downtime (more than 2 hours).

Sample log entries below:

[2019-11-01T03:17:50,479][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Compliance config is null because of dlsFlsAvailable: false and auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog

[2019-11-01T03:32:18,302][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Handle complianceConfig=null/dlsFlsAvailable: false/auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog for onIndexModule() of index k8_ms_logs-2019.10.31
[2019-11-01T03:32:37,654][DEBUG][c.f.s.SearchGuardPlugin ] [data-1] Handle complianceConfig=null/dlsFlsAvailable: false/auditLog=class com.floragunn.searchguard.auditlog.NullAuditLog for onIndexModule() of index .monitoring-es-6-2019.10.31

Team,

Any suggestions on the original observation here would really help us move forward.

Thanks.

I had a look at the code in question, but I cannot explain why this should lead to a 2h downtime.

It is true that Elasticsearch calls this particular method for every index. This is inherent to Elasticsearch and cannot be changed.

However, since neither DLS/FLS nor Audit Logging is used (see e.g. “NullAuditLog”), there is not much going on in this method. In your case, two objects are instantiated and assigned, not much more.

What is the reason you suspect this particular code/step is causing the delay? Just the logfile?

@hsaly any ideas?

I don't think that this 2h delay is caused by the “compliance check” or by Search Guard at all. Looking at your cluster topology, I assume that 1000+ indices for a single data node is too much (it would be interesting to know how many shards you have and what your hardware looks like). For a production setup you should have at least 3 dedicated master nodes and 2 data nodes.
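
If it helps with gathering that information, the _cat APIs give a quick overview. A minimal sketch, assuming the cluster is reachable on localhost:9200:

# one line per shard copy, so the line count is the total number of primary and replica shards
curl -s 'localhost:9200/_cat/shards' | wc -l

# heap and RAM per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,ram.max'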

Do you see this kind of delay with Search Guard disabled or uninstalled?

Apologies for the delay. Yes, based on the log file: we can see this going on for each and every index, and only after everything is complete does the ES instance come back online.

Hi @hsaly ,

Thanks, and apologies for the delay in response from my side.

As I have stated in the original request, the observation is against our DEV instance.

In production, we do have the following setup

  1. 2 ES Master nodes (we will add a new master node master-3 very soon)
  2. 2 ES Data Nodes
  3. 2 ES client nodes

But in production we still have ELK running with version 6.3.0, and very soon we will upgrade to 7.x. We want to confirm that the upgrade to 7.x does not break things for us, so we started with a DEV instance running ELK and SG version 7.x. The DEV instance configuration is:

1 master node (3 GB heap)
1 data node (6 GB heap)
1 client node (4 GB heap)

As for your question “Do you see this kind of delay with Search Guard disabled or uninstalled?”, the answer is yes: we disabled SG in one of our environments and the restart is faster than with SG enabled.

+1 to my reply above. Is there a way to disable the compliance check through a property that we can define in elasticsearch.yml? We can try that out and see whether the restart time comes down.
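
For what it is worth, and going from memory of the Search Guard documentation rather than a verified setup: audit logging is selected via searchguard.audit.type in elasticsearch.yml, and a noop type should disable it entirely, for example:

searchguard.audit.type: noop

However, the debug output above already reports NullAuditLog, which suggests audit logging is effectively disabled on this cluster anyway, so this is unlikely to change the restart time.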

So how much faster is the cluster restart with Search Guard disabled?

Can you please also provide logs from all your dev nodes, from the point in time they start up until the cluster is green? (with Search Guard enabled, of course)
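
One way to pin down the point in time when the cluster turns green, assuming a node is reachable on localhost:9200, is to poll the cluster health endpoint right after the restart; the call returns as soon as the requested status is reached or the timeout expires:

curl -s 'localhost:9200/_cluster/health?wait_for_status=green&timeout=30m&pretty'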

Sure, will share this weekend, as having another downtime this week is not allowed for us. Will shut down ELK by Friday IST, restart, capture everything, and share it with you.

Apologies for the delay. Still behind on this. Will share this week.

elk-cluster.zip (10.3 KB)

Hi All,
Attaching the logs (elk*.zip). Please note the time it takes to restart. Since we have removed a whole lot of indexes, the restart appears to be a little faster than before, but it still takes more than 5 minutes. You may refer to the log entries below. We now have a total of 294 indexes, and for that it takes close to 7+ minutes to restart. The more indexes we have, the more time it takes.

Start time: [2019-11-24T02:34:53,373][INFO ][o.e.e.NodeEnvironment ] [data-1] using [1] data paths, mounts [[/app/data (acs-brm-zfs-01:/export/oasp-dev-elk/indexes)]], net usable_space [1tb], net total_space [2.9tb], types [nfs4]

End time (where the data node is initialized and ready to serve requests):
[2019-11-24T02:42:15,897][INFO ][c.f.s.c.ConfigurationRepository] [data-1] Node ‘data-1’ initialized

-Thanks

Hi Team,

Need your help to review the attached logs (please refer to the previous reply) and guide us on whether we can disable the SG compliance check, or on any other way to speed up the ES restart with SG enabled.

Thanks.

Team,

It would be of great help if we could get some guidance on this.

Regards,
Shankar

Team,

We have already provided the logs that you asked for and are waiting for your suggestions.

So how much faster is the cluster restart with Search Guard disabled (when nothing else changed)? Please provide the same logs when you restart the node with SG disabled or removed.

Looking into the logs just makes me believe that the time the node needs to come up is due to normal index recovery, which also happens without Search Guard installed. Search Guard should not add any significant delay with regard to node startup time/recovery. Looking at your dev cluster setup, I think 6 GB RAM for a 7.x data node containing more than 200 indices (and probably a lot more shards) is not enough.
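
To confirm that the time is spent in shard recovery rather than in the plugin, the recovery progress can be watched while the node starts up. A minimal sketch, assuming the cluster is reachable on localhost:9200:

# only recoveries that are currently running, one line per shard
curl -s 'localhost:9200/_cat/recovery?v&active_only=true'

# overall status plus the number of initializing/unassigned shards
curl -s 'localhost:9200/_cluster/health?pretty'

If _cat/recovery still shows active entries right up to the point where the node reports itself as initialized, the startup time is dominated by recovery rather than by Search Guard.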

Please also refer to