Inconsistent cluster state during 6-7 Upgrade

When upgrading from 6.8.6 to 7.10.2 I noticed that the cluster state was inconsistent once more than half of the nodes were upgraded.
The 7.x nodes report RED while the 6.x nodes report YELLOW/GREEN.
The affected indices are the following:

.signals_watches_state        
.searchguard_authtokens       
.signals_accounts             
.signals_watches_trigger_state
.signals_watches              
.signals_settings             
.searchguard_config_history   

Now that all nodes have been upgraded, the cluster is RED, with the following unassigned shards:

.signals_watches_trigger_state     3  p UNASSIGNED
.signals_watches_trigger_state     3  r UNASSIGNED
.signals_watches_trigger_state     0  p UNASSIGNED
.signals_watches_trigger_state     0  r UNASSIGNED
.signals_watches                   2  p UNASSIGNED
.signals_watches                   2  r UNASSIGNED
.searchguard_authtokens            4  r UNASSIGNED
.searchguard_authtokens            4  p UNASSIGNED
.searchguard_authtokens            2  r UNASSIGNED
.searchguard_authtokens            2  p UNASSIGNED
.signals_settings                  1  r UNASSIGNED
.signals_settings                  1  p UNASSIGNED
.signals_settings                  4  r UNASSIGNED
.signals_settings                  4  p UNASSIGNED
.signals_accounts                  1  r UNASSIGNED
.signals_accounts                  1  p UNASSIGNED
.signals_watches_state             1  r UNASSIGNED
.signals_watches_state             1  p UNASSIGNED
.signals_watches_state             3  r UNASSIGNED
.signals_watches_state             3  p UNASSIGNED
.searchguard_config_history        2  p UNASSIGNED
.searchguard_config_history        2  r UNASSIGNED
.searchguard_config_history        0  p UNASSIGNED
.searchguard_config_history        0  r UNASSIGNED

@faxmodem I would imagine you used the rolling upgrade method, since half the nodes were upgraded previously.

During the upgrade of the first node, did you disable shard allocation, upgrade, then enable allocation again and wait for the node to recover?

Did it recover?
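
For reference, a minimal sketch of the usual per-node sequence, assuming the standard rolling-upgrade settings (adjust values and checks to your setup):

    # before stopping the node: restrict allocation to primaries
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

    # ... stop the node, upgrade it, start it again ...

    # after the node has rejoined: re-enable allocation
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": null
      }
    }

    # wait for recoveries to finish
    GET _cat/health?v
    GET _cat/recovery?active_only=true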

Yes, I did that for every node. All nodes recovered okay, ingestion works, querying too. But the cluster is RED due to these “leftover”(?) indices.

I’m tempted to DELETE all of these. But is it safe?

BTW, I haven’t migrated the Search Guard config yet.

Have you run the API below to see what the reason could be?

GET _cluster/allocation/explain
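
For a single shard, the same API also accepts a request body; a minimal sketch (index and shard number taken from the unassigned list above, primary flag assumed):

    GET _cluster/allocation/explain
    {
      "index": ".searchguard_authtokens",
      "shard": 2,
      "primary": true
    }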

    "node_allocation_decisions": [
        {
            "deciders": [
                {
                    "decider": "replica_after_primary_active",
                    "decision": "NO",
                    "explanation": "primary shard for this replica is not yet active"
                },
                {
                    "decider": "throttling",
                    "decision": "NO",
                    "explanation": "primary shard for this replica is not yet active"
                }
            ],
            "node_attributes": {
                "manufacturer": "Dell Inc.",
                "processorcount": "16",
                "productname": "PowerEdge R640"
            },
            "node_decision": "no",
            "node_id": "ddddddd",
            "node_name": "dddddd",
            "transport_address": "dddddd",
            "weight_ranking": 1
        }

It seems the shards are simply missing.
So they must have been lost during the upgrade, for some reason.

.searchguard_authtokens            2  r UNASSIGNED
.searchguard_authtokens            2  p UNASSIGNED
.searchguard_authtokens            3  p STARTED                       999.999.108.88  node83.example.com
.searchguard_authtokens            3  r STARTED                       999.999.108.89  node84.example.com
.searchguard_authtokens            1  p STARTED                       999.999.233.5   node78.example.com
.searchguard_authtokens            1  r STARTED                       999.999.233.4   node77.example.com
.searchguard_authtokens            4  r UNASSIGNED
.searchguard_authtokens            4  p UNASSIGNED
.searchguard_authtokens            0  p STARTED                       999.999.238.221 node221.example.com
.searchguard_authtokens            0  r STARTED                       999.999.233.6   node79.example.com

So the question boils down to: is it safe to DELETE those?
I’m not particularly anxious about .signals_*. But what about .searchguard_config_history and .searchguard_authtokens?

The good thing is they all seem to be empty:
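
(A listing like the one below can be produced with something along these lines; the index pattern is an assumption:)

    GET _cat/indices/.signals_*,.searchguard_*?v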

health status index                          uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   .searchguard_config_history    rsFVFnlMT0OzpdsJYFzA8Q   5   1          0            0      1.4kb           717b
red    open   .signals_settings              TNWmQ7zeSKOBjbwOKBiYlw   5   1          0            0      1.4kb           717b
red    open   .signals_watches               1oXl_COeT92OrIp9kNiRVQ   5   1          0            0      1.8kb           956b
red    open   .signals_accounts              i0YKW43VQVqe7C58BTM-6Q   5   1          0            0      1.8kb           956b
red    open   .signals_watches_state         _WYARV0HQ5CkXXvQ9jQv7A   5   1          0            0      1.4kb           717b
red    open   .signals_watches_trigger_state Fz4R0VmcQtm0EuvldZGPAg   5   1          0            0      1.4kb           717b
red    open   .searchguard_authtokens        f3sWqRDRS_24T-VDCShiBw   5   1          0            0      1.4kb           717b

I’m guessing they were added by search-guard-7 during the rolling upgrade.

The indices’ creation_date confirms this.
I think I’m gonna delete them

or simply force-reassign a primary
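
For the record, force-assigning an empty primary goes through the cluster reroute API; a minimal sketch for one of the shards above (node name taken from the earlier shard listing; accept_data_loss is required and acknowledges that the shard will start out empty):

    POST _cluster/reroute
    {
      "commands": [
        {
          "allocate_empty_primary": {
            "index": ".searchguard_authtokens",
            "shard": 2,
            "node": "node83.example.com",
            "accept_data_loss": true
          }
        }
      ]
    }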

If you don’t use these features, it should be safe to just delete the indices. It is, however, possible that Search Guard will try to create the indices again whenever a node starts up and discovers it has been elected master.
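
If you go the delete route, it would just be, for example (index names taken from the list above):

    DELETE .searchguard_config_history
    DELETE .searchguard_authtokens
    DELETE .signals_watches
    DELETE .signals_watches_state
    DELETE .signals_watches_trigger_state
    DELETE .signals_settings
    DELETE .signals_accounts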

Just to be sure: You have upgraded to SG 52.6.0?

To be on the safe side I just used the cluster reroute API to allocate_empty_primary for all the affected indices, and now my cluster is GREEN. Thanks for your help.

And no, I used 7.10.2-51.0.0 because that’s what I used on preproduction.

@faxmodem did you manage to fully upgrade the cluster? When you used the cluster reroute API, did you assign the shards to nodes that were already upgraded?

Yes, the upgrade was already finished when I assigned the primaries.

For us it has proven much easier and less error-prone to spin up a new cluster with the new version, restore the cluster state and a snapshot there, and then delete the old cluster.

A rolling upgrade of a running production cluster, in my experience, ALWAYS leads to trouble, stress and problems.
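
For that approach, the restore on the new cluster looks roughly like this (repository and snapshot names are placeholders; include_global_state brings the cluster state along):

    POST _snapshot/my_repo/my_snapshot/_restore
    {
      "indices": "*",
      "include_global_state": true
    }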

I’m sure it is, but in our case we didn’t have the luxury :slight_smile:
