Unable to save watches

kibana.yml (2.2 KB) sg_config.yml (4.9 KB)

Restart Elasticsearch, record all the logs you get while it is starting, and send them to me.

I’ll DM you the log.

I see the following error in the log

handshake failed for node … failed: remote cluster name … does not match local cluster name …

Do you have a remote cluster? Is the remote cluster protected by Search Guard? Do you see any errors in the remote cluster log?

Or maybe you just forgot to update cluster.name on some node?
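
You can verify the configured name on each node by querying its root endpoint and comparing the cluster_name field, for example (the node host is a placeholder):

curl -k -u admin:admin -X GET https://<node-host>:9200/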

Let’s check the cluster health

curl -k -u admin:admin -X GET https://localhost:9200/_cluster/settings
curl -k -u admin:admin -X GET https://localhost:9200/_cluster/health?pretty
curl -k -u admin:admin -X GET https://localhost:9200/_cluster/allocation/explain?pretty

Cluster is healthy:
  • all nodes have reconnected
  • all shards are allocated

There is a remote cluster, and yes, it is protected by Search Guard.

Our clusters are monitored very heavily, all clusters are healthy at this time.

_cluster/settings:

{
  "persistent": {
    "action": {
      "search": {
        "shard_count": {
          "limit": "3500"
        }
      }
    },
    "cluster": {
      "routing": {
        "use_adaptive_replica_selection": "true",
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "node_concurrent_incoming_recoveries": "5",
          "awareness": {
            "attributes": "rack_id"
          },
          "balance": {
            "index": "0.55",
            "threshold": "1.0",
            "shard": "0.45"
          },
          "enable": "all",
          "same_shard": {
            "host": "true"
          },
          "node_concurrent_outgoing_recoveries": "5",
          "allow_rebalance": "indices_primaries_active",
          "node_concurrent_recoveries": "5"
        }
      },
      "service": {
        "slow_task_logging_threshold": "10s"
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "1%"
        },
        "request": {
          "limit": "1%",
          "overhead": "1.1"
        },
        "total": {
          "limit": "95%"
        }
      },
      "recovery": {
        "max_bytes_per_sec": "1200mb"
      }
    },
    "search": {
      "default_search_timeout": "61s",
      "max_buckets": "10000",
      "low_level_cancellation": "true"
    },
    "xpack": {
      "monitoring": {
        "collection": {
          "enabled": "true",
          "interval": "20s"
        }
      }
    },
    "network": {
      "breaker": {
        "inflight_requests": {
          "limit": "40%",
          "overhead": "2"
        }
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_incoming_recoveries": "4",
          "node_initial_primaries_recoveries": "8",
          "balance": {
            "index": "0.65",
            "threshold": "1.5",
            "shard": "0.35"
          },
          "node_concurrent_outgoing_recoveries": "2",
          "cluster_concurrent_rebalance": "1",
          "node_concurrent_recoveries": "2",
          "exclude": {
            "_name": "*es-data6*"
          }
        }
      }
    },
    "indices": {
      "breaker": {
        "request": {
          "overhead": "1.1"
        }
      },
      "recovery": {
        "max_bytes_per_sec": "40mb"
      }
    },
    "logger": {
      "com": {
        "floragunn": "ERROR"
      }
    }
  }
}

_cluster/allocation/explain:


{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status": 400
}

I removed the remote cluster for the time being, and am still getting the same 500 internal server error when trying to create a watch.

Hello Lance,

I have dug a bit deeper into the problem.

First, some background: Signals signs all its watches when they are saved, because watches inherit the privileges of the user who created them. The signature ensures that only the exact watch the user defined can inherit those privileges.

The signature key is normally generated during startup of the master node and then distributed to all other nodes in the cluster.

The error you are seeing indicates that the key was not created or did not properly reach the node which is responsible for creating the watch.

Signals provides a REST API to check the keys. If you still have the admin user with SGS_SIGNALS_ALL privileges, can you please use curl or your browser to look at the content of these URIs:

  • https://localhost:9200/_signals/settings/internal_auth.token_signing_key
  • https://localhost:9200/_signals/settings/internal_auth.token_encryption_key
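
For example, using curl with the admin credentials from the earlier commands (adjust host and credentials to your setup):

curl -k -u admin:admin -X GET https://localhost:9200/_signals/settings/internal_auth.token_signing_key
curl -k -u admin:admin -X GET https://localhost:9200/_signals/settings/internal_auth.token_encryption_key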

You should see something like this as a result:

"kYHM7RMU3Oes+KxXBw8AKunDjOaNLJ0flCuA6H4xKdUcBg5ahekYRya9hwOSP0m6Y7DRsuoK1A/g0UyE3ObIIQ=="

You don’t need to post the result here if it looks like this; it only matters if you get an empty or null response.

One more question: Have you ever been able to create watches and did this break at some later time? Or is this the first time you are trying to create a watch?

If you don’t have any watches so far, a possible workaround might be to re-generate the keys yourself. If you already have watches, the existing watches will stop working.

If you want to do this, please perform the following steps:

  • Create a new key:
openssl rand -base64 64
  • Update Signals with the new key. Replace the long base64-encoded string in the command below with the string you got from the command above. Be sure to wrap the string in double quotes and then in single quotes: '"..."'. This is necessary because the API expects valid JSON:
curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"ESXKt9nOyNrdFzrS3iEQVmbD+nv0kwjV4ZOOFXRO7UEYGpZWpGU0orbde3A7YjptFRAnTca/dtaRb04p5OdcVw=="' 'https://your-es-host:9200/_signals/settings/internal_auth.token_signing_key'
  • The same procedure is necessary for internal_auth.token_encryption_key; this key, however, only needs to be 32 bytes long:
openssl rand -base64 32
curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"HWvQ4Wh8rfJdtMS2bttXBi+wkGrScGEFsn/1dlyguk0="' 'https://localhost:9200/_signals/settings/internal_auth.token_encryption_key'
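
If you want to double-check, you can GET the same endpoints again (same host placeholder as above); they should return the values you just set:

curl -k -u admin -X GET 'https://your-es-host:9200/_signals/settings/internal_auth.token_signing_key'
curl -k -u admin -X GET 'https://your-es-host:9200/_signals/settings/internal_auth.token_encryption_key'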

Afterwards, please try to save a watch.

If this does not work, it would be helpful if you could look for logs generated at the time you updated the keys.

Still, we don’t know the root cause of why this did not work in the first place. I had a look at the logs you sent to my colleague. If I see it correctly, these logs are from a data node. For the root cause, we would need to look at the logs of the master node from the time it was last restarted. Is there a chance that you still have these logs?

That worked perfectly and solved the issue; I am now able to create/save/edit watches.

Now when I try to acknowledge a watch I get:

{
  "statusCode": 500,
  "error": "Internal Server Error",
  "message": "Response Error",
  "attributes": {
    "body": {
      "status": 500,
      "error": "com.floragunn.signals.SignalsUnavailableException: Signals is still initializing. Please try again later."
    }
  }
}

image001.png

This might be related to the problem with the keys. I need to ask you again for logs from ES. Can you please check the logs from the master node during its startup and at the time you tried acknowledging?

I restarted this master node, then had to restart another master node to make sure this was the active master.

I tried acknowledging several times, then waited about a minute before copying the log file.

Prior to performing this I also enabled:

{
  "transient": {
    "logger.com.floragunn.signals": "DEBUG"
  }
}
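
For reference, a transient logger setting like this can be applied via the cluster settings API, for example (host and credentials are placeholders):

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '{"transient":{"logger.com.floragunn.signals":"DEBUG"}}' 'https://your-es-host:9200/_cluster/settings'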

image001.png

Greenville.log (232 KB)

I have some watches that are working, but others are not…

Error I am getting on the watches that are not running:
image

We seem to have a problem where Signals was properly initialized only on some nodes, but not on others. The reason, however, is still not clear.

For now, I have only a couple of further questions:

  • Are you running ES 7.10.2/SG 49 on all nodes or are there nodes with older versions?
  • Have you done a fresh install of SG 49 or did you upgrade from an earlier version? If so, from which version? Also: Did you do a rolling upgrade?

Yes, all nodes in our DR environment are running ES 7.10.2/SG 49.
The upgrade was a rolling upgrade from ES 7.9.1/SG 45.1.0.

@nils @srgbnd Any ideas on how we can fix this issue?

It seems that for some reason Signals was not properly initialized on all nodes. It might be worth trying a rolling restart and checking whether Signals works properly afterwards.

How many nodes do you have in your cluster?

A rolling restart is not a trivial matter… Our DR cluster consists of 150 nodes in total, 128 of which are data nodes.

OK, I see. The remaining options are also quite complicated, though. We have two separate problems to look at here.

First problem:

If I understood you correctly, you expect all watches to be executing, right? I am asking because “Unknown State” can also be displayed for disabled watches and for watches that trigger very sparsely and have not triggered yet (for example, watches that trigger once a month).

Normally, Signals distributes the execution of watches over all nodes of a cluster. If watches get scheduled onto nodes which are not correctly initialized, this can be why they don’t execute.

It is, however, possible to configure the nodes on which watches are executed. For example, you can configure Signals to execute watches only on master nodes. If you have HTTP access to ES via basic auth, you can change the configuration this way:

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"master:true"' 'https://your-es-host:9200/_signals/settings/tenant._main.node_filter'

This setting is tenant-specific. If you use multi-tenancy with several tenants, you have to do this for every tenant; replace _main with the tenant name.
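
For example, for a hypothetical tenant named finance, the call would look like this:

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"master:true"' 'https://your-es-host:9200/_signals/settings/tenant.finance.node_filter'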

It is also possible to choose other nodes; master:true is a node filter, and other node filter expressions can be used as well. See the Signals documentation for details.

Regarding the second error:

Do I understand it correctly that you get the error when trying to open the watches in Kibana (i.e., by clicking on their id in the overview)?

This is a bit more tricky. How is Kibana configured to access the Elasticsearch backend? I assume Kibana accesses a load balancer which distributes the calls across Elasticsearch nodes? Or are you hitting one coordinator node directly?

If you are using a load balancer, you get the error whenever the load balancer hits a node where initialisation has failed. If the load balancer is configured to hit only a subset of Elasticsearch nodes (say, coordinator nodes), you could check whether it is feasible to restart only the coordinator nodes.

I performed a rolling restart of the entire cluster yesterday.

The watch I am most concerned about right now is pdm_kibana_config_changes_watch.
This watch is scheduled to run every minute, querying the sg-auditlog indices.
There is no execution history for this watch. The 500 Internal Server Error occurs when I try to acknowledge that watch to see why it is not running; that is when I see that the watch is not initialized.

I am unable to release 7.10.2 to users until watches are fully functional and working correctly.

Kibana is configured with a list of coordinating-only nodes in kibana.yml to connect to the cluster; the load balancer sits in front of Kibana, not between Kibana and ES.
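
A minimal sketch of that kind of configuration in kibana.yml (hostnames are placeholders):

elasticsearch.hosts:
  - "https://coord-node-1:9200"
  - "https://coord-node-2:9200"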

So, this means that the rolling restart did not fix the problem?

Did you try applying this setting?

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"master:true"' 'https://your-es-host:9200/_signals/settings/tenant._main.node_filter'

Yes, I applied the above setting, targeting the coordinating_only nodes…

_signals/settings/tenant._main.node_filter:

{
  "coordinating_only": true,
  "master": false,
  "data": false,
  "ingest": false
}

Some watches are still not running