SG responds as down on Master only nodes

Martin · October 3, 2018, 1:32pm

I have 3 master-only nodes and 8 data-only nodes running in AWS.

When the cluster is first initialised I get the following behaviour.

curl -k https://localhost:9200/_searchguard/health?pretty

On one node I get:

{
  "message" : null,
  "mode" : "strict",
  "status" : "UP"
}

On the other two I get:

{
  "message" : "Not initialized",
  "mode" : "strict",
  "status" : "DOWN"
}

However when I run:

curl -k -E admin.pem --key admin.key https://localhost:9200/_cluster/health?pretty

I get the same valid response on each master node:

{
  "cluster_name" : "test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 11,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 1,
  "active_shards" : 8,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I run this command when the servers start:

./sgadmin.sh -cd …/sg_config/ -nhnv -icl -cacert ./root-ca.pem -cert ./admin.pem -key ./admin.key

And I can see from each of the server logs that they have all run it.

If I reboot the servers or rerun the sgadmin line they all show as status: UP

Is there a race condition on first initialisation with multiple masters?

Is there some config I need to add to tell SG that the cluster has multiple master nodes, some of which may not have connected yet?


ES + SG 6.4.0

Ubuntu 16.04

openjdk version "1.8.0_181"
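For anyone scripting the check above, here is a minimal sketch of a helper that reads a `_searchguard/health` response on stdin and succeeds only when the status is UP. The function name `sg_is_up` is made up for illustration, and the matching assumes the response looks like the JSON shown above.

```shell
# sg_is_up: hypothetical helper, not part of Search Guard.
# Reads a _searchguard/health JSON response on stdin and returns 0
# only if it contains "status" : "UP" (whitespace around ':' is optional).
sg_is_up() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"UP"'
}

# Example use on a node (same curl call as above, with -s to hide progress):
#   curl -sk 'https://localhost:9200/_searchguard/health?pretty' | sg_is_up \
#     && echo "SG is UP" || echo "SG is not initialised yet"
```

This is a plain text match rather than real JSON parsing, which keeps it dependency-free but means it would need adjusting if the response format changes.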

Martin · October 3, 2018, 5:02pm

Having made some config changes this is no longer happening. All the nodes are coming up correctly.

Probably best to ignore this.

Martin · October 4, 2018, 10:20am

So it happened again and I think this is just a timing issue on startup.

I am starting elasticsearch then searchguard in a bash script with:

sudo systemctl --now enable elasticsearch

sudo -u elasticsearch bash -c "cd /usr/share/elasticsearch/plugins/search-guard-6/tools && ./sgadmin.sh -cd …/searchguard_config/ -nhnv -icl -cacert /etc/elasticsearch/config/root-ca.pem -cert /etc/elasticsearch/config/admin.pem -key /etc/elasticsearch/config/admin.key"

From the AWS boot log I see this sequence:

[ 15.375997] cloud-init[1305]: Synchronizing state of elasticsearch.service with SysV init with /lib/systemd/systemd-sysv-install...
[ 15.384355] cloud-init[1305]: Executing /lib/systemd/systemd-sysv-install enable elasticsearch
[  OK  ] Started ACPI event daemon.
[  OK  ] Started ACPI event daemon.
[ 15.647253] cloud-init[1305]: Created symlink from /etc/systemd/system/multi-user.target.wants/elasticsearch.service to /usr/lib/systemd/system/elasticsearch.service.
[  OK  ] Started ACPI event daemon.
[  OK  ] Started Elasticsearch.
[ 15.784077] cloud-init[1305]: WARNING: JAVA_HOME not set, will use /usr/bin/java
[ 18.050670] cloud-init[1305]: Search Guard Admin v6
[ 18.085208] cloud-init[1305]: Will connect to localhost:9300
[ 18.089728] cloud-init[1305]: ERR: Seems there is no Elasticsearch running on localhost:9300 - Will exit

I am going to try a sleep 5 between the two commands. Can you suggest something better than this hack?

jkressin · October 4, 2018, 1:14pm

In your script, just wait until Elasticsearch is reachable on port 9200, e.g.

while ! nc -z "$HOSTNAME" 9200; do
  echo "Can't reach $HOSTNAME on 9200, waiting"
  sleep 5
done
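A variant of the loop above, as a sketch only: it adds a timeout so a node that never comes up does not block the boot script forever, and it takes the port as an argument. The boot log earlier in the thread shows sgadmin connecting to localhost:9300, so waiting on 9300 rather than 9200 may match what sgadmin needs more closely; that is an assumption, not something confirmed here. The function name `wait_for_port` and the 120 second default are made up for illustration.

```shell
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: polls with `nc -z` once per second until the port
# accepts a TCP connection. Returns 0 on success, 1 once roughly
# TIMEOUT_SECONDS (default 120) of sleeping have passed without success.
wait_for_port() {
  wfp_host=$1
  wfp_port=$2
  wfp_timeout=${3:-120}
  wfp_waited=0
  until nc -z "$wfp_host" "$wfp_port" 2>/dev/null; do
    if [ "$wfp_waited" -ge "$wfp_timeout" ]; then
      echo "Timed out after ${wfp_timeout}s waiting for ${wfp_host}:${wfp_port}" >&2
      return 1
    fi
    echo "Can't reach ${wfp_host} on ${wfp_port}, waiting"
    sleep 1
    wfp_waited=$((wfp_waited + 1))
  done
}
```

In the startup script this could go between the `systemctl` line and the `sgadmin.sh` call, for example `wait_for_port localhost 9300 120 || exit 1`. An open port only shows that Elasticsearch is listening, not that the cluster has formed, so a retry around sgadmin itself may still be worth having.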

