Elasticsearch 6.8.6, monitoring data goes missing?

We run Elasticsearch 6.8.6 on a mix of Debian 10.4 and Ubuntu Budgie 20.04 hosts. Originally, two hosts each ran three data VMs, with three shards and one replica. We have now added a third host with its own trio of data VMs, and the replica count is being raised from one to two. The goal is that any one host holds all the data if the others are lost.

Whenever we do a rolling restart, the cluster loses track of its license. That happened again while we were introducing the new host and VM trio. This time it presented differently: we had to log in with the admin cert and apply this solution to unlock the searchguard index itself.

But there is also a problem we have not experienced before: monitoring information is not available.

I looked at the indices, and they look like they should be working:

green open .monitoring-beats-6-2020.07.24
green open .monitoring-beats-6-2020.07.25
green open .monitoring-beats-6-2020.07.26
green open .monitoring-beats-6-2020.07.27
green open .monitoring-beats-6-2020.07.28
green open .monitoring-beats-6-2020.07.29
green open .monitoring-beats-6-2020.07.30
green open .monitoring-beats-6-2020.07.31
green open .monitoring-es-6-2020.07.24
green open .monitoring-es-6-2020.07.25
green open .monitoring-es-6-2020.07.26
green open .monitoring-es-6-2020.07.27
green open .monitoring-es-6-2020.07.28
green open .monitoring-es-6-2020.07.29
green open .monitoring-es-6-2020.07.30
green open .monitoring-es-6-2020.07.31
green open .monitoring-kibana-6-2020.07.24
green open .monitoring-kibana-6-2020.07.25
green open .monitoring-kibana-6-2020.07.26
green open .monitoring-kibana-6-2020.07.27
green open .monitoring-kibana-6-2020.07.28
green open .monitoring-kibana-6-2020.07.29
green open .monitoring-kibana-6-2020.07.30
green open .monitoring-kibana-6-2020.07.31

We have tenants in place. I checked my private tenant and the admin tenant, and I also logged in with the admin account itself, but I don’t get the usual monitoring interface in any of them.

We run Metricbeat on all hosts and VMs - the Infrastructure tab is fine.

Give me a hint, what should I check next?

Please send the following data to help better understand the problem:

  1. The Elasticsearch log if there are errors.
  2. The Kibana log if there are errors.
  3. The browser console log while you are on the Monitoring page, if there are errors.
  4. Kibana config - kibana.yml.
  5. A screenshot of the Monitoring page including the time filter.
  6. Is there any data in the monitoring indices? For example:

curl -k -u user:password -X GET https://localhost:9200/_cat/indices/.monitoring*
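A green index can still be empty, so it may help to look at document counts rather than just health. The following is only a sketch: the host and credentials are the same placeholders as in the command above, `h` and `s` are the standard `_cat` column and sort parameters, and the helper name `empty_indices` is made up for illustration.

```shell
#!/bin/sh
# Print the names of indices whose docs.count column is exactly 0.
# Expects whitespace-separated lines of "index docs.count store.size" on stdin.
empty_indices() {
    awk 'NF >= 2 && $2 == "0" { print $1 }'
}

# Hypothetical usage (placeholder host and credentials):
#   curl -sk -u user:password \
#     'https://localhost:9200/_cat/indices/.monitoring*?h=index,docs.count,store.size&s=index' \
#     | empty_indices
```

If the most recent `.monitoring-es-6-*` indices show up in that output, the indices exist but nothing is being written to them.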

Also, double-check the official troubleshooting guide to make sure you have everything in order: https://www.elastic.co/guide/en/kibana/6.8/monitor-troubleshooting.html

I wrote a script to collect the required information and pushed the results to GitHub. Two of our systems can run Kibana, so I collected from both of them. Only one is operational at any given time; we switch manually during maintenance. The repository also contains a document that describes our system’s layout.

The logs look fine to me. This is the image from the console.

I see many “Unable to index audit log” errors in https://github.com/NetwarSystem/elktrouble/blob/master/i0-elasticsearch.log

For example,

[2020-08-01T08:09:54,674][ERROR][c.f.s.a.s.InternalESSink ] [hp1] Unable to index audit log {"audit_cluster_name":"elasticsearch","audit_node_name":"hp1","audit_category":"FAILED_LOGIN","audit_request_origin":"REST","audit_node_id":"8R5fQgElSBedfpFJq89FVw","audit_request_layer":"REST","audit_rest_request_path":"/_searchguard/authinfo","@timestamp":"2020-08-01T15:09:12.114+00:00","audit_request_effective_user_is_admin":false,"audit_format_version":3,"audit_utc_timestamp":"2020-08-01T15:09:12.114+00:00","audit_request_remote_address":"","audit_node_host_address":"","audit_rest_request_headers":{"Connection":["keep-alive"],"Host":["hp1.netwarsystem.com:9200"],"Content-Length":["0"]},"audit_request_effective_user":"<NONE>","audit_node_host_name":""} due to ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping) within 30s]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:127) ~[elasticsearch-6.8.6.jar:6.8.6]
	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:126) ~[elasticsearch-6.8.6.jar:6.8.6]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-6.8.6.jar:6.8.6]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Basically, your audit log is not being indexed: the put-mapping request times out on the master after 30s. Go to Kibana Monitoring and try adjusting the time filter to fetch older monitoring data, for example the last month or the last year. Do you see any data?
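To see how often the audit sink fails and for what reason, the Elasticsearch log can be summarized with standard tools. This is a minimal sketch that assumes the log line format shown in the excerpt above; the function name `audit_failures_by_cause` is hypothetical, and the log path is whatever your installation uses.

```shell
#!/bin/sh
# Count "Unable to index audit log" errors in a log file, grouped by the
# exception class that follows "due to". Prints "<count> <exception>" lines.
audit_failures_by_cause() {
    grep 'Unable to index audit log' "$1" \
        | sed -n 's/.* due to \([A-Za-z]*\)\[.*/\1/p' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}

# Hypothetical usage:
#   audit_failures_by_cause i0-elasticsearch.log
```

If nearly every failure is a `ProcessClusterEventTimeoutException`, that points at the master being slow to process cluster state updates rather than at the audit data itself.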

Then I looked at the cluster health: https://github.com/NetwarSystem/elktrouble/blob/master/health.txt
You probably have too many shards.

A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
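The rule of thumb above is simple arithmetic, and it can be compared against the actual shard count per node. The sketch below is only an illustration: both function names are made up, the heap size is something you supply yourself, and the host and credentials in the usage comment are placeholders. `_cat/shards` with `h=node` prints one node name per shard copy.

```shell
#!/bin/sh
# Rule of thumb from the quoted blog post: at most 20 shards per GB of heap.
# $1 is the node's configured heap in whole GB.
max_shards_for_heap() {
    echo $(( $1 * 20 ))
}

# Count shards per node. Expects one node name per line on stdin;
# blank lines (unassigned shards) are skipped. Prints "<node> <count>".
shards_per_node() {
    sort | uniq -c | awk 'NF == 2 { print $2, $1 }'
}

# Hypothetical usage (placeholder host and credentials):
#   max_shards_for_heap 30
#   curl -sk -u user:password 'https://localhost:9200/_cat/shards?h=node' | shards_per_node
```

Any node whose count is well above its budget is a candidate for fewer or larger indices, for example by not keeping daily monitoring indices around for long.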