We have a system running Elasticsearch 6.8.6 on a mix of Debian 10.4 and Ubuntu Budgie 20.04 hosts. Two physical hosts each ran three data VMs, with indices split into three shards and one replica. A third host with its own trio of data VMs has been added, and the replica count is being raised from one to two, so that any single host holds a complete copy of the data in the event of some sort of loss.
Whenever the cluster goes through a rolling restart it loses track of its license, and that happened again while the new host and VM trio were being introduced. This time it presented differently, requiring a login with the admin cert and the application of this solution in order to unlock the searchguard index itself.
But now we have a problem we've not experienced in the past: monitoring information is not available.
I looked at the indices, and this looks like it should be working:
green open .monitoring-beats-6-2020.07.24
green open .monitoring-beats-6-2020.07.25
green open .monitoring-beats-6-2020.07.26
green open .monitoring-beats-6-2020.07.27
green open .monitoring-beats-6-2020.07.28
green open .monitoring-beats-6-2020.07.29
green open .monitoring-beats-6-2020.07.30
green open .monitoring-beats-6-2020.07.31
green open .monitoring-es-6-2020.07.24
green open .monitoring-es-6-2020.07.25
green open .monitoring-es-6-2020.07.26
green open .monitoring-es-6-2020.07.27
green open .monitoring-es-6-2020.07.28
green open .monitoring-es-6-2020.07.29
green open .monitoring-es-6-2020.07.30
green open .monitoring-es-6-2020.07.31
green open .monitoring-kibana-6-2020.07.24
green open .monitoring-kibana-6-2020.07.25
green open .monitoring-kibana-6-2020.07.26
green open .monitoring-kibana-6-2020.07.27
green open .monitoring-kibana-6-2020.07.28
green open .monitoring-kibana-6-2020.07.29
green open .monitoring-kibana-6-2020.07.30
green open .monitoring-kibana-6-2020.07.31
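For reference, a listing like the one above can be pulled with the _cat API; the endpoint below is a placeholder for this cluster, and client-cert options may be needed depending on your Search Guard setup.

```shell
# Placeholder endpoint; adjust host and auth for your cluster.
ES=https://hp1.netwarsystem.com:9200
command -v curl >/dev/null && \
  curl -sk --connect-timeout 5 "$ES/_cat/indices/.monitoring-*?v&h=health,status,index" || true
```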
We have tenants in place. I checked my private tenant and the admin tenant, logged in with the admin account itself, and I don't get the usual monitoring interface.
We run Metricbeat on all hosts and VMs - the Infrastructure tab is fine.
I wrote a script to collect the required information and push it to GitHub. We have two systems that might have Kibana running; only one is operational at any given time, and we switch manually during maintenance, so I collected from both. There is a document in there that describes our system's layout, too.
[2020-08-01T08:09:54,674][ERROR][c.f.s.a.s.InternalESSink ] [hp1] Unable to index audit log {"audit_cluster_name":"elasticsearch","audit_node_name":"hp1","audit_category":"FAILED_LOGIN","audit_request_origin":"REST","audit_node_id":"8R5fQgElSBedfpFJq89FVw","audit_request_layer":"REST","audit_rest_request_path":"/_searchguard/authinfo","@timestamp":"2020-08-01T15:09:12.114+00:00","audit_request_effective_user_is_admin":false,"audit_format_version":3,"audit_utc_timestamp":"2020-08-01T15:09:12.114+00:00","audit_request_remote_address":"192.168.88.63","audit_node_host_address":"192.168.88.63","audit_rest_request_headers":{"Connection":["keep-alive"],"Host":["hp1.netwarsystem.com:9200"],"Content-Length":["0"]},"audit_request_effective_user":"<NONE>","audit_node_host_name":"192.168.88.63"} due to ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping) within 30s]
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-mapping) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:127) ~[elasticsearch-6.8.6.jar:6.8.6]
at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:126) ~[elasticsearch-6.8.6.jar:6.8.6]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-6.8.6.jar:6.8.6]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Basically, your audit log is not being indexed. Go to Kibana Monitoring and try adjusting the time filter to fetch older monitoring data - for example, last month or last year. Do you see any data?
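The put-mapping timeout in the log above means the master could not apply a mapping update within 30 seconds, which usually points at a backed-up cluster-state queue. One way to check is to look at pending cluster tasks directly; the endpoint is a placeholder, and cert options may be needed for this cluster.

```shell
# Placeholder endpoint; add --cert/--key options if your cluster requires them.
ES=https://hp1.netwarsystem.com:9200
command -v curl >/dev/null && \
  curl -sk --connect-timeout 5 "$ES/_cat/pending_tasks?v" || true
```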
A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. How many shards should I have in my Elasticsearch cluster? | Elastic Blog
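Spelling out the quoted rule of thumb as arithmetic (the heap size here is just an example, not a recommendation for this cluster):

```shell
# Rule of thumb from the Elastic blog post: at most 20 shards per GB of heap.
heap_gb=30
max_shards=$((heap_gb * 20))
echo "a node with a ${heap_gb}GB heap should stay below ${max_shards} shards"
```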
That is an excellent post - I’ve been lacking wisdom for large scale stuff like that.
We have three physical hosts that each run three data VMs. Two of them have 192GB of RAM; the third has just 128GB. When we started we had to cram things onto just a pair of machines, so the three-way sharding made sense, and I sliced the 192GB of RAM three ways because compressed object pointers stop working once a JVM heap exceeds about 32GB. We route shards based on host ID, so we can take any two machines down and should still have all of our data.
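For reference, the per-node heap is set in each VM's jvm.options; keeping it below roughly 31GB preserves compressed ordinary object pointers. The value here is illustrative, not our actual setting:

```
-Xms30g
-Xmx30g
```

Xms and Xmx should be set to the same value so the heap is allocated up front rather than resized under load.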
I am going to purge some unneeded indices and increase the RAM available to the JVMs. This will take the better part of the day; I'll check back once it's finished. TYVM.
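A sketch of the purge step, assuming daily .monitoring-* indices named like those listed earlier in the thread. The endpoint is a placeholder, the DELETE is left commented out so the candidate list can be reviewed first, and `date -d` is the GNU form found on our Debian/Ubuntu hosts.

```shell
# Keep the last 14 days of daily monitoring indices; print the rest.
ES=https://hp1.netwarsystem.com:9200
cutoff=$(date -d '14 days ago' +%Y.%m.%d)
for idx in .monitoring-es-6-2020.07.24 .monitoring-beats-6-2020.07.24; do
  day=${idx##*-6-}   # extract the YYYY.MM.DD suffix from the index name
  if [[ "$day" < "$cutoff" ]]; then
    echo "would delete $idx"
    # curl -sk -XDELETE "$ES/$idx"
  fi
done
```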
Adjusting the JVM heap sizes turned out to be an utter disaster - we lost some large indices in the process. But let's go ahead and close this; the upgrade is complete.