Inconsistent cluster state during 6-7 Upgrade

When upgrading from 6.8.6 to 7.10.2 I noticed that the cluster state was inconsistent once more than half of the nodes were upgraded.
The 7.x nodes report RED while the 6.x nodes report YELLOW/GREEN.
The affected indices are the following:

.signals_watches_state        
.searchguard_authtokens       
.signals_accounts             
.signals_watches_trigger_state
.signals_watches              
.signals_settings             
.searchguard_config_history   

Now that all nodes have been upgraded, the cluster is RED, with the following unassigned shards:

.signals_watches_trigger_state     3  p UNASSIGNED
.signals_watches_trigger_state     3  r UNASSIGNED
.signals_watches_trigger_state     0  p UNASSIGNED
.signals_watches_trigger_state     0  r UNASSIGNED
.signals_watches                   2  p UNASSIGNED
.signals_watches                   2  r UNASSIGNED
.searchguard_authtokens            4  r UNASSIGNED
.searchguard_authtokens            4  p UNASSIGNED
.searchguard_authtokens            2  r UNASSIGNED
.searchguard_authtokens            2  p UNASSIGNED
.signals_settings                  1  r UNASSIGNED
.signals_settings                  1  p UNASSIGNED
.signals_settings                  4  r UNASSIGNED
.signals_settings                  4  p UNASSIGNED
.signals_accounts                  1  r UNASSIGNED
.signals_accounts                  1  p UNASSIGNED
.signals_watches_state             1  r UNASSIGNED
.signals_watches_state             1  p UNASSIGNED
.signals_watches_state             3  r UNASSIGNED
.signals_watches_state             3  p UNASSIGNED
.searchguard_config_history        2  p UNASSIGNED
.searchguard_config_history        2  r UNASSIGNED
.searchguard_config_history        0  p UNASSIGNED
.searchguard_config_history        0  r UNASSIGNED

@faxmodem I would imagine you used the rolling upgrade method, since half the nodes were upgraded previously.

During the upgrade of the first node, did you disable shard allocation, upgrade, then enable allocation again and wait for the node to recover?

Did it recover?
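
For reference, a minimal sketch of the usual per-node sequence, assuming the standard rolling-upgrade settings (adjust values and checks to your setup):

    # before stopping the node: restrict allocation to primaries
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

    # ... stop the node, upgrade it, start it again ...

    # after the node has rejoined: re-enable allocation
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": null
      }
    }

    # wait for recoveries to finish
    GET _cat/health?v
    GET _cat/recovery?active_only=true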

Yes, I did that for every node. All nodes recovered okay, ingestion works, querying too. But the cluster is RED due to these “leftover”(?) indices.

I’m tempted to DELETE all of these. But is it safe?

BTW, I haven’t migrated the Search Guard config yet.

Have you run the API below to see what the reason could be?

GET _cluster/allocation/explain
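
For a single shard, the same API also accepts a request body; a minimal sketch (index and shard number taken from the unassigned list above, primary flag assumed):

    GET _cluster/allocation/explain
    {
      "index": ".searchguard_authtokens",
      "shard": 2,
      "primary": true
    }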

    "node_allocation_decisions": [
        {
            "deciders": [
                {
                    "decider": "replica_after_primary_active",
                    "decision": "NO",
                    "explanation": "primary shard for this replica is not yet active"
                },
                {
                    "decider": "throttling",
                    "decision": "NO",
                    "explanation": "primary shard for this replica is not yet active"
                }
            ],
            "node_attributes": {
                "manufacturer": "Dell Inc.",
                "processorcount": "16",
                "productname": "PowerEdge R640"
            },
            "node_decision": "no",
            "node_id": "ddddddd",
            "node_name": "dddddd",
            "transport_address": "dddddd",
            "weight_ranking": 1
        }

It seems the shards are simply missing.
So they must have been lost during the upgrade, for some reason.

.searchguard_authtokens            2  r UNASSIGNED
.searchguard_authtokens            2  p UNASSIGNED
.searchguard_authtokens            3  p STARTED                       999.999.108.88  node83.example.com
.searchguard_authtokens            3  r STARTED                       999.999.108.89  node84.example.com
.searchguard_authtokens            1  p STARTED                       999.999.233.5   node78.example.com
.searchguard_authtokens            1  r STARTED                       999.999.233.4   node77.example.com
.searchguard_authtokens            4  r UNASSIGNED
.searchguard_authtokens            4  p UNASSIGNED
.searchguard_authtokens            0  p STARTED                       999.999.238.221 node221.example.com
.searchguard_authtokens            0  r STARTED                       999.999.233.6   node79.example.com

So the question boils down to: is it safe to DELETE those?
I’m not particularly anxious about .signals_*. But what about .searchguard_config_history and .searchguard_authtokens?

The good thing is they all seem to be empty:
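
(A listing like the one below can be produced with something along these lines; the index pattern is an assumption:)

    GET _cat/indices/.signals_*,.searchguard_*?v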

health status index                          uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   .searchguard_config_history    rsFVFnlMT0OzpdsJYFzA8Q   5   1          0            0      1.4kb           717b
red    open   .signals_settings              TNWmQ7zeSKOBjbwOKBiYlw   5   1          0            0      1.4kb           717b
red    open   .signals_watches               1oXl_COeT92OrIp9kNiRVQ   5   1          0            0      1.8kb           956b
red    open   .signals_accounts              i0YKW43VQVqe7C58BTM-6Q   5   1          0            0      1.8kb           956b
red    open   .signals_watches_state         _WYARV0HQ5CkXXvQ9jQv7A   5   1          0            0      1.4kb           717b
red    open   .signals_watches_trigger_state Fz4R0VmcQtm0EuvldZGPAg   5   1          0            0      1.4kb           717b
red    open   .searchguard_authtokens        f3sWqRDRS_24T-VDCShiBw   5   1          0            0      1.4kb           717b

I’m guessing they were added by search-guard-7 during the rolling upgrade.

The indices’ creation_date confirms this.
I think I’m gonna delete them

or simply force-reassign a primary
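
For the record, force-assigning an empty primary goes through the cluster reroute API; a minimal sketch for one of the shards above (node name taken from the earlier shard listing; accept_data_loss is required and acknowledges that the shard will start out empty):

    POST _cluster/reroute
    {
      "commands": [
        {
          "allocate_empty_primary": {
            "index": ".searchguard_authtokens",
            "shard": 2,
            "node": "node83.example.com",
            "accept_data_loss": true
          }
        }
      ]
    }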

If you don’t use these features, it should be safe to just delete the indices. It is, however, possible that Search Guard will try to create the indices again whenever a node starts up and discovers it has been elected master.
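
If you go the delete route, it would just be, for example (index names taken from the list above):

    DELETE .searchguard_config_history
    DELETE .searchguard_authtokens
    DELETE .signals_watches
    DELETE .signals_watches_state
    DELETE .signals_watches_trigger_state
    DELETE .signals_settings
    DELETE .signals_accounts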

Just to be sure: You have upgraded to SG 52.6.0?

To be on the safe side I just used the cluster reroute API to allocate_empty_primary for all the affected indices, and now my cluster is GREEN. Thanks for your help.

And no, I used 7.10.2-51.0.0 because that’s what I used on preproduction.

@faxmodem did you manage to fully upgrade the cluster? When you used the cluster reroute API, did you assign the shards to nodes that were already upgraded?

Yes, the upgrade was already finished when I assigned the primaries.

For us it has proven much easier and less error-prone to spin up a new cluster with the new version, restore the cluster state and a snapshot there, and then delete the old cluster.

A rolling upgrade of a running production cluster, in my experience, ALWAYS leads to trouble, stress and problems.
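
For that approach, the restore on the new cluster looks roughly like this (repository and snapshot names are placeholders; include_global_state brings the cluster state along):

    POST _snapshot/my_repo/my_snapshot/_restore
    {
      "indices": "*",
      "include_global_state": true
    }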

I’m sure it is, but in our case we didn’t have the luxury :slight_smile:
