Unable to save watches

kibana.yml (2.2 KB) sg_config.yml (4.9 KB)

Restart Elasticsearch, record all the logs you get while it is starting, and send them to me.

I’ll DM you the log.

I see the following error in the log

handshake failed for node … failed: remote cluster name … does not match local cluster name …

Do you have a remote cluster? Is the remote cluster protected by Search Guard? Do you see any errors in the remote cluster log?

Or maybe you just forgot to update cluster.name on some node?
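
You can verify the configured name on each node by querying its root endpoint and comparing the cluster_name field, for example (the node host is a placeholder):

curl -k -u admin:admin -X GET https://<node-host>:9200/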

Let’s check the cluster health

curl -k -u admin:admin -X GET https://localhost:9200/_cluster/settings
curl -k -u admin:admin -X GET https://localhost:9200/_cluster/health?pretty
curl -k -u admin:admin -X GET https://localhost:9200/_cluster/allocation/explain?pretty

Cluster is healthy:
  • all nodes have reconnected
  • all shards are allocated

There is a remote cluster, and yes, it is protected by Search Guard.

Our clusters are monitored very heavily, all clusters are healthy at this time.

_cluster/settings:

{
  "persistent": {
    "action": {
      "search": {
        "shard_count": {
          "limit": "3500"
        }
      }
    },
    "cluster": {
      "routing": {
        "use_adaptive_replica_selection": "true",
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "node_concurrent_incoming_recoveries": "5",
          "awareness": {
            "attributes": "rack_id"
          },
          "balance": {
            "index": "0.55",
            "threshold": "1.0",
            "shard": "0.45"
          },
          "enable": "all",
          "same_shard": {
            "host": "true"
          },
          "node_concurrent_outgoing_recoveries": "5",
          "allow_rebalance": "indices_primaries_active",
          "node_concurrent_recoveries": "5"
        }
      },
      "service": {
        "slow_task_logging_threshold": "10s"
      }
    },
    "indices": {
      "breaker": {
        "fielddata": {
          "limit": "1%"
        },
        "request": {
          "limit": "1%",
          "overhead": "1.1"
        },
        "total": {
          "limit": "95%"
        }
      },
      "recovery": {
        "max_bytes_per_sec": "1200mb"
      }
    },
    "search": {
      "default_search_timeout": "61s",
      "max_buckets": "10000",
      "low_level_cancellation": "true"
    },
    "xpack": {
      "monitoring": {
        "collection": {
          "enabled": "true",
          "interval": "20s"
        }
      }
    },
    "network": {
      "breaker": {
        "inflight_requests": {
          "limit": "40%",
          "overhead": "2"
        }
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_incoming_recoveries": "4",
          "node_initial_primaries_recoveries": "8",
          "balance": {
            "index": "0.65",
            "threshold": "1.5",
            "shard": "0.35"
          },
          "node_concurrent_outgoing_recoveries": "2",
          "cluster_concurrent_rebalance": "1",
          "node_concurrent_recoveries": "2",
          "exclude": {
            "_name": "*es-data6*"
          }
        }
      }
    },
    "indices": {
      "breaker": {
        "request": {
          "overhead": "1.1"
        }
      },
      "recovery": {
        "max_bytes_per_sec": "40mb"
      }
    },
    "logger": {
      "com": {
        "floragunn": "ERROR"
      }
    }
  }
}

_cluster/allocation/explain:


{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status": 400
}

I removed the remote cluster for the time being, and am still getting the same 500 internal server error when trying to create a watch.

Hello Lance,

I have dug a bit deeper into the problem.

First, some background: Signals signs all its watches when they are saved, because watches inherit the privileges of the user who created them. The signature ensures that only the exact watch the user defined can inherit those privileges.

The signature key is normally generated during startup of the master node and then distributed to all other nodes in the cluster.

The error you are seeing indicates that the key was not created or did not properly reach the node which is responsible for creating the watch.

Signals provides a REST API to check the keys. If you still have the admin user with SGS_SIGNALS_ALL privileges, can you please use curl or your browser to look at the content of these URIs:

  • https://localhost:9200/_signals/settings/internal_auth.token_signing_key
  • https://localhost:9200/_signals/settings/internal_auth.token_encryption_key
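
For example, using curl with the admin credentials from the earlier commands (adjust host and credentials to your setup):

curl -k -u admin:admin -X GET https://localhost:9200/_signals/settings/internal_auth.token_signing_key
curl -k -u admin:admin -X GET https://localhost:9200/_signals/settings/internal_auth.token_encryption_key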

You should see something like this as a result:

"kYHM7RMU3Oes+KxXBw8AKunDjOaNLJ0flCuA6H4xKdUcBg5ahekYRya9hwOSP0m6Y7DRsuoK1A/g0UyE3ObIIQ=="

You don’t need to post the result here if it looks like this; it only matters if you get an empty or null response.

One more question: Have you ever been able to create watches and did this break at some later time? Or is this the first time you are trying to create a watch?

If you don’t have any watches so far, a possible workaround might be to re-generate the keys yourself. If you already have watches, the existing watches will stop working.

If you want to do this, please perform the following steps:

  • Create a new key:
openssl rand -base64 64
  • Update Signals with the new key. Replace the long base64-encoded string in the command below with the string you got from the command above. Be sure to wrap the string in double quotes and then in single quotes: '"..."'. This is necessary because the API expects valid JSON:
curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"ESXKt9nOyNrdFzrS3iEQVmbD+nv0kwjV4ZOOFXRO7UEYGpZWpGU0orbde3A7YjptFRAnTca/dtaRb04p5OdcVw=="' 'https://your-es-host:9200/_signals/settings/internal_auth.token_signing_key'
  • The same procedure is necessary for internal_auth.token_encryption_key; this key, however, only needs to be 32 bytes long:
openssl rand -base64 32
curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"HWvQ4Wh8rfJdtMS2bttXBi+wkGrScGEFsn/1dlyguk0="' 'https://localhost:9200/_signals/settings/internal_auth.token_encryption_key'
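
If you want to double-check, you can GET the same endpoints again (same host placeholder as above); they should return the values you just set:

curl -k -u admin -X GET 'https://your-es-host:9200/_signals/settings/internal_auth.token_signing_key'
curl -k -u admin -X GET 'https://your-es-host:9200/_signals/settings/internal_auth.token_encryption_key'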

Afterwards, please try to save a watch.

If this does not work, it would be helpful if you could look for logs generated at the time you updated the keys.

Still, we don’t know the root cause of why this did not work in the first place. I had a look at the logs you sent to my colleague. If I see it correctly, these logs are from a data node. For the root cause, we would need to look at the logs of the master node from the time it was last restarted. Is there a chance that you still have these logs?

That worked perfectly and solved the issue; I am now able to create/save/edit watches.

Now when I try to acknowledge a watch I get:

{
  "statusCode": 500,
  "error": "Internal Server Error",
  "message": "Response Error",
  "attributes": {
    "body": {
      "status": 500,
      "error": "com.floragunn.signals.SignalsUnavailableException: Signals is still initializing. Please try again later."
    }
  }
}

image001.png

This might be related to the problem with the keys. I need to ask you again for logs from ES. Can you please check the logs from the master node during its startup and at the time you tried acknowledging?

I restarted this master node, then had to restart another master node to make sure this was the active master.

I tried acknowledging several times, then waited about a minute before copying the log file.

Prior to performing this I also enabled:

{
  "transient": {
    "logger.com.floragunn.signals": "DEBUG"
  }
}
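
For reference, a transient logger setting like this can be applied via the cluster settings API, for example (host and credentials are placeholders):

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '{"transient":{"logger.com.floragunn.signals":"DEBUG"}}' 'https://your-es-host:9200/_cluster/settings'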

image001.png

Greenville.log (232 KB)

I have some watches that are working, but others are not…

Error I am getting on the watches that are not running:
image

We seem to have a problem where Signals was properly initialized only on some nodes, but not on others. The reason, however, is still not clear.

For now, I have only a couple of further questions:

  • Are you running ES 7.10.2/SG 49 on all nodes or are there nodes with older versions?
  • Have you done a fresh install of SG 49 or did you upgrade from an earlier version? If so, from which version? Also: Did you do a rolling upgrade?

Yes, all nodes in our DR environment are running ES 7.10.2/SG 49.
The upgrade was a rolling upgrade from ES 7.9.1/SG 45.1.0.

@nils @srgbnd Any ideas on how we can fix this issue?

It seems that for some reason Signals was not properly initialized on all nodes. It might be worth trying a rolling restart and checking whether Signals works properly afterwards.

How many nodes do you have in your cluster?

A rolling restart is not a trivial matter… Our DR cluster consists of 150 nodes in total, 128 of which are data nodes.

OK, I see. The remaining options are also quite complicated, though. We have two separate problems to look at here.

First problem:

If I understood you correctly, you expect all watches to be executing, right? I am asking because “Unknown State” can also be displayed for disabled watches and for watches that trigger very sparsely and have not triggered yet (for example, watches that trigger once a month).

Normally, Signals distributes the execution of watches over all nodes of a cluster. If watches get scheduled onto nodes which are not correctly initialized, this can be why they don’t execute.

It is, however, possible to configure the nodes on which watches are executed. For example, you can configure Signals to execute watches only on master nodes. If you have HTTP access to ES via basic auth, you can change the configuration this way:

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"master:true"' 'https://your-es-host:9200/_signals/settings/tenant._main.node_filter'

This setting is tenant-specific. If you use multi-tenancy with several tenants, you have to do this for every tenant; replace _main with the tenant name.
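
For example, for a hypothetical tenant named finance, the call would look like this:

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"master:true"' 'https://your-es-host:9200/_signals/settings/tenant.finance.node_filter'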

It is also possible to choose other nodes; master:true is a node filter, and other node filter expressions can be used as well. See the Signals documentation for details.

Regarding the second error:

Do I understand it correctly that you get the error when trying to open the watches in Kibana (i.e., by clicking on their id in the overview)?

This is a bit more tricky. How is Kibana configured to access the Elasticsearch backend? I assume Kibana accesses a load balancer which distributes the calls across Elasticsearch nodes? Or are you hitting one coordinator node directly?

If you are using a load balancer, you get the error whenever the load balancer hits a node where initialisation has failed. If the load balancer is configured to hit only a subset of Elasticsearch nodes (say, coordinator nodes), you could check whether it is feasible to restart only the coordinator nodes.

I performed a rolling restart of the entire cluster yesterday.

The watch I am most concerned about right now is pdm_kibana_config_changes_watch.
This watch is scheduled to run every minute, querying the sg-auditlog indices.
There is no execution history for this watch. The 500 Internal Server Error occurs when I try to acknowledge that watch to see why it is not running; that is when I see that the watch is not initialized.

I am unable to release 7.10.2 to users until watches are fully functional and working correctly.

Kibana is configured with a list of coordinating-only nodes in kibana.yml to connect to the cluster; the load balancer sits in front of Kibana, not between Kibana and ES.
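
A minimal sketch of that kind of configuration in kibana.yml (hostnames are placeholders):

elasticsearch.hosts:
  - "https://coord-node-1:9200"
  - "https://coord-node-2:9200"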

So, this means that the rolling restart did not fix the problem?

Did you try applying this setting?

curl -k -u admin -X PUT -H "Content-Type: application/json" -d '"master:true"' 'https://your-es-host:9200/_signals/settings/tenant._main.node_filter'

Yes, I applied the above setting, targeting the coordinating_only nodes…

_signals/settings/tenant._main.node_filter:

{
  "coordinating_only": true,
  "master": false,
  "data": false,
  "ingest": false
}

Some watches are still not running