Elasticsearch 6.8.6: monitoring data goes missing?

We have a system running Elasticsearch 6.8.6 on a mix of Debian 10.4 and Ubuntu Budgie 20.04 hosts. Two hosts each ran three data VMs; indices had three shards and one replica. A third host with its own trio of data VMs has been added, and the replica count is being raised from one to two so that any single host holds all of the data in the event of some sort of loss.

Whenever the cluster goes through a rolling restart, it loses track of its license. That happened again during the introduction of the new host and VM trio. This time it presented differently, requiring a login using the admin cert and the application of this solution in order to unlock the searchguard index itself.

But now we have a problem we’ve not experienced in the past: monitoring information is unavailable.

I looked at the indices, and this looks like it should be working:

green open .monitoring-beats-6-2020.07.24
green open .monitoring-beats-6-2020.07.25
green open .monitoring-beats-6-2020.07.26
green open .monitoring-beats-6-2020.07.27
green open .monitoring-beats-6-2020.07.28
green open .monitoring-beats-6-2020.07.29
green open .monitoring-beats-6-2020.07.30
green open .monitoring-beats-6-2020.07.31
green open .monitoring-es-6-2020.07.24
green open .monitoring-es-6-2020.07.25
green open .monitoring-es-6-2020.07.26
green open .monitoring-es-6-2020.07.27
green open .monitoring-es-6-2020.07.28
green open .monitoring-es-6-2020.07.29
green open .monitoring-es-6-2020.07.30
green open .monitoring-es-6-2020.07.31
green open .monitoring-kibana-6-2020.07.24
green open .monitoring-kibana-6-2020.07.25
green open .monitoring-kibana-6-2020.07.26
green open .monitoring-kibana-6-2020.07.27
green open .monitoring-kibana-6-2020.07.28
green open .monitoring-kibana-6-2020.07.29
green open .monitoring-kibana-6-2020.07.30
green open .monitoring-kibana-6-2020.07.31

We have tenants in place. I checked my private one and the admin tenant, logged in with the admin account itself, and I don’t get the usual monitoring interface.

We run Metricbeat on all hosts and VMs, and the Infrastructure tab is fine.

Give me a hint: what should I check next?

Please send the following data to help us better understand the problem:

  1. The Elasticsearch log, if there are errors.
  2. The Kibana log, if there are errors.
  3. The browser console log while you are on the Monitoring page, if there are errors.
  4. The Kibana config (kibana.yml).
  5. A screenshot of the Monitoring page, including the time filter.
  6. Whether there is data in the monitoring indices:

curl -k -u user:password -X GET https://localhost:9200/_cat/indices/.monitoring*
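If you want to check the counts programmatically rather than by eye, here is a small sketch (hypothetical; the sample output below is illustrative, not from a real cluster) that parses the `index` and `docs.count` columns of that `_cat` call and flags empty indices:

```python
# Quick sanity check on the docs.count column of `_cat/indices` output.
# Real input would come from something like:
#   curl -k -u user:password \
#     'https://localhost:9200/_cat/indices/.monitoring*?h=index,docs.count'

def parse_cat_indices(cat_output):
    """Map index name -> document count from `?h=index,docs.count` output."""
    counts = {}
    for line in cat_output.strip().splitlines():
        name, docs = line.split()
        counts[name] = int(docs)
    return counts

# Illustrative sample output, not from a real cluster:
sample = """
.monitoring-es-6-2020.07.31 48213
.monitoring-kibana-6-2020.07.31 864
.monitoring-beats-6-2020.07.31 0
"""

counts = parse_cat_indices(sample)
empty = [name for name, n in counts.items() if n == 0]
print(empty)  # ['.monitoring-beats-6-2020.07.31']
```

A green index with zero documents would point at an ingestion problem rather than a display problem.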

Also, double-check the official troubleshooting guide to make sure you have everything in order: https://www.elastic.co/guide/en/kibana/6.8/monitor-troubleshooting.html

I wrote a script to collect the required information and pushed it to GitHub. We have two systems that might have Kibana running, so I collected from both of them. Only one is operational at any given time; we manually switch during maintenance. There is a document in there that describes our system’s layout, too.

The logs and such look fine; this is the image from the console.

I see many “Unable to index audit log” errors in https://github.com/NetwarSystem/elktrouble/blob/master/i0-elasticsearch.log

For example,

[2020-08-01T08:09:54,674][ERROR][c.f.s.a.s.InternalESSink ] [hp1] Unable to index audit log {"audit_cluster_name":"elasticsearch","audit_node_name":"hp1","audit_category":"FAILED_LOGIN","audit_request_origin":"REST","audit_node_id":"8R5fQgElSBedfpFJq89FVw","audit_request_layer":"REST","audit_rest_request_path":"/_searchguard/authinfo","@timestamp":"2020-08-01T15:09:12.114+00:00","audit_request_effective_user_is_admin":false,"audit_format_version":3,"audit_utc_timestamp":"2020-08-01T15:09:12.114+00:00","audit_request_remote_address":"","audit_node_host_address":"","audit_rest_request_headers":{"Connection":["keep-alive"],"Host":["hp1.netwarsystem.com:9200"],"Content-Length":["0"]},"audit_request_effective_user":"<NONE>","audit_node_host_name":""} due to ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping) within 30s]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:127) ~[elasticsearch-6.8.6.jar:6.8.6]
	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:126) ~[elasticsearch-6.8.6.jar:6.8.6]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-6.8.6.jar:6.8.6]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Basically, your audit log is not getting indexed. Go to Kibana Monitoring and try adjusting the time filter to fetch old monitoring data, for example, last month, last year, etc. Do you see any data?

Then I looked at the cluster health: https://github.com/NetwarSystem/elktrouble/blob/master/health.txt
You probably have too many shards.

A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
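Expressed as arithmetic, the rule of thumb above is simply (a trivial sketch, nothing cluster-specific):

```python
# Rule of thumb from the post above: keep the shard count per node
# below roughly 20 shards per GB of configured JVM heap.
def max_shards_for_heap(heap_gb, shards_per_gb=20):
    return int(heap_gb * shards_per_gb)

print(max_shards_for_heap(30))  # 600 shards for a 30 GB heap
print(max_shards_for_heap(8))   # 160 shards for an 8 GB heap
```

Compare that budget against the shard totals in your health.txt to see how far over you are.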

That is an excellent post - I’ve been lacking wisdom on large-scale operations like that.

We have three physical hosts that each run three data VMs. Two of them have 192 GB of RAM; the third has just 128 GB. When we started we had to cram things onto just a pair of machines, so the three-way sharding made sense, and I thought that 192 GB of RAM should be sliced three ways due to the compressed-oops pointer limit that kicks in at 32 GB of heap. We route based on host ID, so we can take any two machines down and should still have all of our data.
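For reference, the relevant heap lines in each data VM’s jvm.options look like this (values illustrative; the point is keeping the heap just under the ~32 GB threshold so the JVM can still use compressed oops):

```
# jvm.options heap settings (illustrative values).
# Keep Xms equal to Xmx, and stay below ~32 GB so the JVM can use
# compressed ordinary object pointers (compressed oops).
-Xms30g
-Xmx30g
```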

I am going to purge some unneeded indices and increase the RAM available to the JVMs. This will take the better part of the day; I will check back once it’s finished. TYVM.
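For the purge itself, here is a rough sketch of how I plan to pick date-suffixed daily indices older than a retention window (the index names and the seven-day cutoff below are just illustrative):

```python
# Select date-suffixed daily indices (name-YYYY.MM.DD) older than a
# retention cutoff. Illustrative sketch; names and retention are assumptions.
from datetime import date, timedelta

def indices_to_purge(names, today, keep_days=7):
    """Return indices whose YYYY.MM.DD suffix is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in names:
        prefix, _, suffix = name.rpartition("-")
        try:
            y, m, d = (int(p) for p in suffix.split("."))
        except ValueError:
            continue  # not a date-suffixed index; leave it alone
        if date(y, m, d) < cutoff:
            stale.append(name)
    return stale

names = [".monitoring-es-6-2020.07.24", ".monitoring-es-6-2020.07.31"]
print(indices_to_purge(names, today=date(2020, 8, 1), keep_days=7))
# ['.monitoring-es-6-2020.07.24']
```

Each stale name would then be fed to a DELETE against the cluster.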

Adjusting the JVM heap sizes turned out to be an utter disaster - we lost some large indices during the process. But let’s go ahead and close this; the upgrade is complete.

Sorry to hear that. Weren’t you able to restore the indices from a backup?
