I have a similar problem on a heavy-logging PROD cluster (80+ nodes) with hundreds of signals of various types. In our case, though, some of them sometimes never execute again after a restart, rather than just skipping for a few minutes. It has gone unnoticed for days before someone spotted it. It was less noticeable on version 7.17.7, but it is a lot worse on our current 8.6.2.
Anyways…
@Raj.Jikadra I tried to reproduce this using your method, but unfortunately even with 110 signals (30x1min, 30x3min, 30x5min, 10x10min, 10x60min) on a 3-node cluster, all signal executions worked as expected after the restart. My restart procedure is one node at a time, so there are always 2 masters up and the cluster never goes RED. My output went to an Elasticsearch index, though, so I could aggregate and count each type per hour accurately.
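For reference, here is a rough sketch of how the expected number of executions per hour can be worked out for this mix, so the counts in the output index have something to be compared against. It assumes every watch fires exactly on schedule and that the condition passes on every run (otherwise no document is indexed); the names `WATCH_MIX` and `expected_per_hour` are made up for illustration.

```python
# Watch mix from the test: (number of watches, interval in minutes).
WATCH_MIX = [(30, 1), (30, 3), (30, 5), (10, 10), (10, 60)]


def expected_per_hour(mix):
    """Return {interval_minutes: expected executions per hour} for each group.

    Assumes each watch runs exactly 60 / interval times per hour.
    """
    return {interval: count * (60 // interval) for count, interval in mix}


per_interval = expected_per_hour(WATCH_MIX)
total_watches = sum(count for count, _ in WATCH_MIX)
total_per_hour = sum(per_interval.values())

print(per_interval)    # {1: 1800, 3: 600, 5: 360, 10: 60, 60: 10}
print(total_watches)   # 110
print(total_per_hour)  # 2830
```

If the hourly document count per group stays close to these numbers across a restart, no executions are being lost.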
I tested on version 8.7.1 with FLX 1.6.0. My sample watch:
[{"id": "test_1min-1", "body": {"active": true, "trigger": {"schedule": {"interval": ["1m"], "timezone": "Europe/Berlin"}}, "checks": [{"type": "search", "name": "mysearch", "target": "mysearch", "request": {"indices": [".monitoring-es*"], "body": {"size": 0, "aggregations": {}, "query": {"bool": {"filter": {"range": {"timestamp": {"gte": "now-1h", "lte": "now"}}}}}}}}, {"type": "condition", "name": "mycondition", "source": "data.mysearch.hits.total.value > 1000"}], "_meta": {"last_edit": {"user": "admin", "date": "2024-04-10T08:20:35.069Z"}}, "_tenant": "DEFAULT", "_ui": {"isSeverity": false, "watchType": "graph", "index": [{"label": ".monitoring-es*"}], "timeField": "timestamp", "aggregationType": "count", "fieldName": [], "topHitsAgg": {"field": [], "size": 3, "order": "asc"}, "overDocuments": "all documents", "bucketValue": 1, "bucketUnitOfTime": "h", "thresholdValue": 1000, "thresholdEnum": "ABOVE", "isResolveActions": false, "severity": {"value": [], "valueString": "", "order": "ascending", "thresholds": {"info": 100, "warning": 200, "error": 300, "critical": 400}}}, "actions": [{"type": "index", "name": "myelasticsearch", "index": "testwatcher", "checks": [{"type": "transform", "source": "['total_hits': data.mysearch.hits.total.value, '@timestamp': execution_time, 'watch_id': watch.id, 'triggered_time': trigger.triggered_time ]"}], "throttle_period": "1s"}]}}
]
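To count the executions afterwards, an aggregation along these lines might work against the `testwatcher` index. This is only a sketch: it assumes `@timestamp` ends up mapped as a date and that `watch_id` gets the default dynamic mapping with a `.keyword` subfield, which may not match your mapping. `build_count_query` is a hypothetical helper that only builds the search body; sending it to `testwatcher/_search` is left to whatever client you use.

```python
import json


def build_count_query(hours_back=24, max_watches=200):
    """Build a search body that counts indexed executions per hour and per watch.

    Field names are assumptions taken from the watch's index action:
    '@timestamp' (date) and 'watch_id.keyword' (default dynamic mapping).
    """
    return {
        "size": 0,
        "query": {
            "range": {"@timestamp": {"gte": f"now-{hours_back}h", "lte": "now"}}
        },
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
                "aggs": {
                    "per_watch": {
                        "terms": {"field": "watch_id.keyword", "size": max_watches}
                    }
                },
            }
        },
    }


body = build_count_query()
print(json.dumps(body, indent=2))
```

A watch that stopped executing would show up as a `watch_id` bucket that is missing from the hours after the restart.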
Would you mind trying the Elasticsearch output yourself? I'm not sure what I am doing differently.
Thx Peter.