Signals watches are skipping

Elasticsearch version: 8.7.1

Server OS version: Ubuntu 22.04 LTS

Describe the issue: Sometimes Signals watches skip an execution, and there is no error log or any log entry for the skipped run. Normally a watch logs when it starts its run and when it finishes, but for these skipped runs there are no logs at all. This is a continuous issue.

Steps to reproduce:

  1. Create multiple watches in Signals with intervals of 1m, 3m and 5m (a minimal watch-creation sketch follows this list).
  2. Have at least 50 watches.
  3. Restart the cluster once all the watches have been created, so that they all start at the same time.
  4. Observe for at least 1 hour.
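
For step 1, a minimal sketch of creating one such test watch through the Signals REST API (assuming the PUT /_signals/watch/{tenant}/{watch_id} endpoint with the default _main tenant; the index names here are placeholders, and the body only mirrors the structure of the full watch shared later in this thread):

PUT /_signals/watch/_main/skip_test_1m-1
{
  "active": true,
  "trigger": {
    "schedule": { "interval": ["1m"], "timezone": "Europe/Berlin" }
  },
  "checks": [
    {
      "type": "search",
      "name": "mysearch",
      "target": "mysearch",
      "request": {
        "indices": ["your-test-index"],
        "body": { "size": 0, "query": { "match_all": {} } }
      }
    }
  ],
  "actions": [
    { "type": "index", "name": "log_to_index", "index": "watch_skip_test" }
  ]
}

Repeat with different watch IDs and with ["3m"] / ["5m"] intervals until you reach the 50-watch mark.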

Additional data:

  1. Watch A, frequency 1 min

    Here the watch was skipped at 12:01, 12:05 and 12:11.

Sometimes it skips continuously, and some watches might not get executed for a few hours.
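
One way to make those gaps visible (a sketch; the index pattern .signals_log_* and the watch_id / execution_start / node fields are taken from the execution record shown further down, and the mappings are assumed to be the Signals defaults) is to list the recorded executions of a single watch and look for missing minutes:

GET .signals_log_*/_search
{
  "size": 200,
  "query": { "match": { "watch_id": "Watch_throttle_test70" } },
  "sort": [ { "execution_start": "asc" } ],
  "_source": [ "watch_id", "execution_start", "execution_end", "status.code", "node" ]
}

Any minute without a corresponding document is a run that never started, which matches the "no logs at all" behaviour described above.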

@Raj.Jikadra What information do you get when you click on one of the failed executions in that list?

{
  "tenant": "_main",
  "watch_id": "Watch_throttle_test70",
  "watch_version": 7,
  "status": {
    "code": "ACTION_FAILED",
    "detail": "All actions failed"
  },
  "execution_start": "2023-12-19T15:23:16.013Z",
  "execution_end": "2023-12-19T15:23:16.123Z",
  "actions": [
    {
      "name": "mywebhook",
      "status": {
        "code": "ACTION_FAILED",
        "detail": "com.floragunn.signals.execution.ActionExecutionException"
      },
      "error": {
        "message": "com.floragunn.signals.execution.ActionExecutionException:\norg.apache.http.client.ClientProtocolException:\nTarget host is not specified"
      },
      "execution_start": "2023-12-19T15:23:16.118Z",
      "execution_end": "2023-12-19T15:23:16.123Z"
    }
  ],
  "node": "elk-master",
  "_id": "ZoesgowBjkc-je2xjyMs",
  "_index": ".signals_log_2023.12.19"
}

This is the case for other watches too, which don't have any failures, like:

[screenshot of the watch's execution history]

This watch is set to run every minute.

@Raj.Jikadra Have you noticed any errors in the Elasticsearch logs around the watch’s execution time?
Could you share a JSON version of your watch?

There are no logs at all in the Elasticsearch logs when it skips an execution.
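
As an aside, if nothing appears at the default log level, it might help to temporarily raise the log level of the Signals package (the package name com.floragunn.signals is taken from the exception class in the execution record above; the logger setting itself is the standard Elasticsearch dynamic cluster setting):

PUT _cluster/settings
{
  "persistent": {
    "logger.com.floragunn.signals": "DEBUG"
  }
}

Set the value back to null afterwards to restore the default level.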

Watch JSON:

{
  "active": true,
  "trigger": {
    "schedule": {
      "interval": [
        "1m"
      ],
      "timezone": "Europe/Berlin"
    }
  },
  "checks": [
    {
      "type": "static",
      "name": "metadata",
      "target": "metadata",
      "value": {
        "action_message": "Random Msg"
      }
    },
    {
      "type": "search",
      "name": "payload",
      "target": "payload",
      "request": {
        "body": {
          "size": 5000,
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "CREATED_AT_UTC": {
                      "gte": "now-70m/m",
                      "lte": "now"
                    }
                  }
                }
              ]
            }
          },
          "sort": [
            {
              "CREATED_AT_UTC": {
                "order": "asc"
              }
            }
          ]
        },
        "indices": [
          "index-pattern"
        ]
      }
    },
    {
      "type": "transform",
      "name": "transform",
      "source": "String errorAlertMessage = \"\";\nMap map = new HashMap();\nList a = new ArrayList();\nboolean isAlertTrue = false;\nfor(int i=0;i<data.payload.hits.hits.size();i++) {\n    String id = data.payload.hits.hits[i]._source.CHANNEL + \"_\" + data.payload.hits.hits[i]._source.CLIENT_ID + \"_\" + data.payload.hits.hits[i]._source.WH_ID;\n    map.put(id,data.payload.hits.hits[i]);\n}\nfor(String k : map.keySet()) {\n    Map m = new HashMap();\n    int hourOrders = (map.get(k)._source.HOUR_ORDERS);\n    int sevenDaysAvgOrders = (map.get(k)._source.SEVEN_DAYS_AVG);\n    if(sevenDaysAvgOrders > 0) {\n        int expectedOrders = (int)(sevenDaysAvgOrders * 0.8);\n        if(hourOrders <= expectedOrders || hourOrders==0) {\n            isAlertTrue=true;\n            m.put(\"CLIENT_ID\",map.get(k)._source.CLIENT_ID);\n            m.put(\"CLIENT\",map.get(k)._source.CLIENT);\n            m.put(\"CHANNEL\",map.get(k)._source.CHANNEL);\n            m.put(\"SEVEN_DAYS_AVG\",sevenDaysAvgOrders);\n            m.put(\"HOUR_ORDERS\", hourOrders);\n            m.put(\"WAREHOUSE\",map.get(k)._source.WAREHOUSE);\n            m.put(\"WH_ID\",map.get(k)._source.WH_ID);\n            a.add(m);\n        }\n    }\n}\nInstant now = Instant.ofEpochMilli(System.currentTimeMillis());\nZoneId istZone = ZoneId.of(\"Asia/Kolkata\");\nZonedDateTime istTime = ZonedDateTime.ofInstant(now, istZone);\nZonedDateTime previousHourIST = istTime.minusHours(1);\nint currentHour = istTime.getHour();\nint previousHour = previousHourIST.getHour();\n\nMap result = new HashMap();\nresult.put(\"is_alert\", isAlertTrue);\nresult.put(\"size\",a.size());\nresult.put(\"payload\",a);\nresult.put(\"currentHour\",currentHour);\nresult.put(\"previousHour\",previousHour);\nreturn result;\n",
      "lang": "painless",
      "target": "transform"
    },
    {
      "type": "condition",
      "name": "condition",
      "source": "return data.transform.is_alert == true",
      "lang": "painless"
    }
  ],
  "actions": [
    {
      "type": "email",
      "name": "myemail",
      "from": "noreply@increff.com",
      "to": [
        "raj.non.jikadra@increff.com"
      ],
      "cc": [],
      "bcc": [],
      "subject": "Signals message",
      "text_body": "Total: {{data.mysearch.hits.total.value}}",
      "html_body": "<p>\n  <span style=\"color:blue;\">Total:</span>\n  <span style=\"color:red;\">{{data.avg_ticket_price.aggregations.metricAgg.value}}</span>\n</p>\n",
      "account": "MailAccount",
      "checks": [],
      "throttle_period": "3m"
    },
    {
      "type": "webhook",
      "name": "mywebhook",
      "request": {
        "method": "POST",
        "url": "www.google.com",
        "body": "Total: {{data.mysearch.hits.total.value}}",
        "headers": {
          "Content-type": "application/json"
        }
      },
      "checks": [],
      "throttle_period": "1m"
    }
  ],
  "_tenant": "_main",
  "_meta": {
    "last_edit": {
      "user": "elastic",
      "date": "2023-12-19T11:42:59.569Z"
    }
  },
  "_id": "Watch_throttle_test26"
}
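
As a side note on the ACTION_FAILED record earlier in the thread: "Target host is not specified" is what Apache HttpClient reports when it is given a URL without a scheme/host, and the webhook URL above (www.google.com) has no scheme. A sketch of the same action with an absolute URL (the host is only an illustration; use your real endpoint):

    {
      "type": "webhook",
      "name": "mywebhook",
      "request": {
        "method": "POST",
        "url": "https://www.google.com",
        "body": "Total: {{data.mysearch.hits.total.value}}",
        "headers": {
          "Content-type": "application/json"
        }
      },
      "checks": [],
      "throttle_period": "1m"
    }

This only explains the action failure, though, not the skipped executions.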

@Raj.Jikadra Thank you for sharing the watch definition. I have some questions regarding your cluster.

  1. What is the version of the Search Guard plugin?
  2. How many ES nodes do you have in the cluster?
  3. What node roles did you assign (GET _cat/nodes)?
  4. How many Watches do you have in your cluster?
  5. What CPU and RAM resources did you assign to your ES nodes?

Answers:

  1. Search Guard version: Search Guard FLX 1.4.0. I noticed this issue on FLX v1.3.0 and upgraded to v1.4.0 thinking it might solve the issue, but no luck.
  2. There is only one master ES node in the cluster, which is deployed on a VM.
  3. GET _cat/nodes: 10.10.0.58 22 94 14 0.42 0.40 0.43 hilmrst * elk-internal-master
  4. Around 50, but I have tried various numbers of watches (100, 70, 40 and 25), which all had the issue. One pattern: decreasing the number of watches reduces the skipping.
  5. The VM has 2 vCPU and 7.5 GB RAM, and the heap value is the default: 7.5/2 = 3.75 GB.

I have also tried multiple nodes, and different hardware configs, but still the same issue.

Also I have another cluster, which has the same issue:
Node Config:

  1. Master node: 16 vCPU and 64 GB RAM
  2. Node with only the ‘data_hot’ role: 4 vCPU and 16 GB RAM.

Node roles:
10.170.42.22 29 97 26 4.90 4.53 4.74 hilmrst * master
10.170.42.32 47 98 3 0.03 0.28 0.72 w - data-warm

SearchGuard version: FLX 1.3.0
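
For readers following along: the unlabelled _cat/nodes outputs above use the default columns (ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master, name). Adding ?v prints the header row, and heap/RAM limits can be requested explicitly, for example:

GET _cat/nodes?v&h=name,node.role,master,heap.max,ram.max,cpu,load_1m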

@Raj.Jikadra Thank you for sharing the details. I’ll try to test it.
Regarding the watches, do you use different watch definitions or do you replicate the shared example?

For a few watches, the definition is similar to the one I shared, and a few watches are different.

Also some additional info:

  1. Watches don't start skipping when we create them; the skipping starts after we restart the Elasticsearch node once the watches have been created.
  2. This issue is less frequent on Elasticsearch nodes running on a single core (processor).
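
Since the skipped runs leave no trace in the logs and the issue seems sensitive to the number of cores, it may be worth capturing what the node is busy with around a trigger time that gets skipped. A generic check (standard Elasticsearch API, nothing Signals-specific):

GET _nodes/hot_threads?threads=5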

@pablo thanks for the support so far. Is there any update on testing?

Just an update on this. Even with SG FLX version 1.6.0 and ES version 8.7.1, watches are still skipping with my setup.

Have you noticed this message in the Elasticsearch nodes?
Did you try running your Alert without a webhook?

Does the other cluster get the same error message as the one above?

Have you tried your email and webhook actions with a simple (non-scripted) alert?

@pablo thanks for the reply.

Have you noticed this message in the Elasticsearch nodes?

Yes.

Did you try running your Alert without a webhook?

Yes, same issue there.

Does the other cluster get the same error message as the one above?

The other cluster has a proper webhook URL, hence it executes without any errors. But watches are still skipping.

Have you tried your email and webhook actions with a simple (non-scripted) alert?

Yes, they are working as expected, if valid values are provided.

I have a similar problem on a heavy-logging PROD cluster with 80+ nodes and hundreds of signals of various types: sometimes after a restart, some of them won't execute again at all, not just skip for a few minutes. It went on for days until someone noticed it. It was less noticeable on version 7.17.7, but is a lot worse on our current 8.6.2.

Anyways…

@Raj.Jikadra I was trying to simulate this using your method, but unfortunately, even with 110 signals (30x 1 min, 30x 3 min, 30x 5 min, 10x 10 min, 10x 60 min) on a 3-node cluster, all signal executions were working as expected after the restart. The restart procedure is one node at a time, so there are always 2 masters up and the cluster never goes to RED state. My output went to an Elasticsearch index though, so I could aggregate and count each type per hour accurately (a sketch of that count follows the sample watch below).

Tested with version 8.7.1 and FLX 1.6.0; my sample watch:

[{"id": "test_1min-1", "body": {"active": true, "trigger": {"schedule": {"interval": ["1m"], "timezone": "Europe/Berlin"}}, "checks": [{"type": "search", "name": "mysearch", "target": "mysearch", "request": {"indices": [".monitoring-es*"], "body": {"size": 0, "aggregations": {}, "query": {"bool": {"filter": {"range": {"timestamp": {"gte": "now-1h", "lte": "now"}}}}}}}}, {"type": "condition", "name": "mycondition", "source": "data.mysearch.hits.total.value > 1000"}], "_meta": {"last_edit": {"user": "admin", "date": "2024-04-10T08:20:35.069Z"}}, "_tenant": "DEFAULT", "_ui": {"isSeverity": false, "watchType": "graph", "index": [{"label": ".monitoring-es*"}], "timeField": "timestamp", "aggregationType": "count", "fieldName": [], "topHitsAgg": {"field": [], "size": 3, "order": "asc"}, "overDocuments": "all documents", "bucketValue": 1, "bucketUnitOfTime": "h", "thresholdValue": 1000, "thresholdEnum": "ABOVE", "isResolveActions": false, "severity": {"value": [], "valueString": "", "order": "ascending", "thresholds": {"info": 100, "warning": 200, "error": 300, "critical": 400}}}, "actions": [{"type": "index", "name": "myelasticsearch", "index": "testwatcher", "checks": [{"type": "transform", "source": "['total_hits': data.mysearch.hits.total.value, '@timestamp': execution_time, 'watch_id': watch.id, 'triggered_time': trigger.triggered_time ]"}], "throttle_period": "1s"}]}}
]
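
For reference, a sketch of the per-hour counting mentioned above, aggregating the documents that the index action writes into testwatcher (field names are taken from the transform in the sample watch; the .keyword subfield assumes default dynamic mappings):

GET testwatcher/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-6h" } } },
  "aggs": {
    "per_watch": {
      "terms": { "field": "watch_id.keyword", "size": 200 },
      "aggs": {
        "per_hour": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
        }
      }
    }
  }
}

For a 1-minute watch, each hourly bucket should have a doc_count close to 60; anything noticeably lower points at skipped runs.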

Do you mind trying out the ES output yourself? Not sure what I am doing differently.
Thx, Peter.

@peter82 thanks for testing it out. I just wanted some info regarding the test cluster before trying it out myself:

  1. Was the cluster VM-based or container-based?
  2. What was the resource configuration for each node?
  3. What was the data size of the query index/indices?

My lab setup:

  • 1 server with RHEL 7 on bare metal, 380 GB RAM, 72 cores (hosting all 3 instances)
  • 2x master/data nodes with heap = 2 GB; 1x master/voting_only node with heap = 1 GB; no cgroups set
  • the query runs against the self-monitoring indices .monitoring-es* that this cluster produced over 7 days, so let's say 1 GB of primary data.

@peter82 thanks for the response.

I tried out your sample watch with a setup similar to yours and got the same result, i.e. no signals were skipping.

I will try a few other scenarios (large dataset, no replicas, etc.) and post an update.

@peter82 I retested the watch with a slightly different configuration.

[{"id": "test_1min-1", "body": {"active": true, "trigger": {"schedule": {"interval": ["1m"], "timezone": "Europe/Berlin"}}, "checks": [{"type": "search", "name": "mysearch", "target": "mysearch", "request": {"indices": [".monitoring-*"], "body": {"size": 100, "aggregations": {}, "query": {"bool": {"filter": {"range": {"timestamp": {"gte": "now-1h", "lte": "now"}}}}}}}}, {"type": "condition", "name": "mycondition", "source": "data.mysearch.hits.total.value > 1000"}], "_meta": {"last_edit": {"user": "admin", "date": "2024-04-10T08:20:35.069Z"}}, "_tenant": "DEFAULT", "_ui": {"isSeverity": false, "watchType": "graph", "index": [{"label": ".monitoring-es*"}], "timeField": "timestamp", "aggregationType": "count", "fieldName": [], "topHitsAgg": {"field": [], "size": 3, "order": "asc"}, "overDocuments": "all documents", "bucketValue": 1, "bucketUnitOfTime": "h", "thresholdValue": 1000, "thresholdEnum": "ABOVE", "isResolveActions": false, "severity": {"value": [], "valueString": "", "order": "ascending", "thresholds": {"info": 100, "warning": 200, "error": 300, "critical": 400}}}, "actions": [{"type": "index", "name": "myelasticsearch", "index": "testwatcher", "checks": [{"type": "transform", "source": "['total_hits': data.mysearch.hits.total.value, '@timestamp': execution_time, 'watch_id': watch.id, 'triggered_time': trigger.triggered_time ]"}], "throttle_period": "1s"}]}}
]

I updated the size of the search check to 100 and saw a few Signals skipping their executions. Increasing the size further will probably result in more skipping, which is what I am observing in my main cluster.