Signals watches are skipping

Elasticsearch version: 8.7.1

Server OS version: Ubuntu 22.04 LTS

Describe the issue: Sometimes Signals watches skip scheduled executions, and there is no error log or any log entry for the skipped run. Normally a watch logs when it starts and finishes a run, but for these skips there are no logs at all. This happens continuously.

Steps to reproduce:

  1. Create multiple watches in Signals with intervals of 1m, 3m and 5m (a minimal example is sketched below).
  2. Have at least 50 watches.
  3. Restart the cluster once all the watches have been created, so that they all start at the same time.
  4. Observe for at least 1 hour.
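A minimal sketch of one such watch, assuming watches are created via the Signals REST API endpoint PUT /_signals/watch/{tenant}/{watch_id}; the index pattern and webhook URL below are placeholders, not the actual values from this cluster:

PUT /_signals/watch/_main/skip_test_1
{
  "active": true,
  "trigger": {
    "schedule": {
      "interval": ["1m"]
    }
  },
  "checks": [
    {
      "type": "search",
      "name": "payload",
      "target": "payload",
      "request": {
        "indices": ["index-pattern"],
        "body": {
          "size": 1,
          "query": { "match_all": {} }
        }
      }
    }
  ],
  "actions": [
    {
      "type": "webhook",
      "name": "mywebhook",
      "request": {
        "method": "POST",
        "url": "https://example.com/hook",
        "body": "hits: {{data.payload.hits.total.value}}",
        "headers": { "Content-type": "application/json" }
      },
      "throttle_period": "1m"
    }
  ]
}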

Additional data:

  1. Watch A, frequency 1 min

    Here the watch was skipped at 12:01, 12:05 and 12:11.

Sometimes it skips continuously, and a watch might not get executed for a few hours.
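One way the gaps can be spotted is by listing the execution timestamps of a single watch from the Signals log indices (the .signals_log_* naming and the field names are taken from the execution record later in this thread; the watch_id value is just an example from this thread, and a keyword mapping for watch_id is assumed):

GET .signals_log_*/_search
{
  "size": 100,
  "_source": ["watch_id", "status.code", "execution_start", "execution_end", "node"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "watch_id": "Watch_throttle_test26" } },
        { "range": { "execution_start": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "execution_start": "asc" }]
}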

@Raj.Jikadra What information do you get when you click on one of the failed executions in that list?

{
  "tenant": "_main",
  "watch_id": "Watch_throttle_test70",
  "watch_version": 7,
  "status": {
    "code": "ACTION_FAILED",
    "detail": "All actions failed"
  },
  "execution_start": "2023-12-19T15:23:16.013Z",
  "execution_end": "2023-12-19T15:23:16.123Z",
  "actions": [
    {
      "name": "mywebhook",
      "status": {
        "code": "ACTION_FAILED",
        "detail": "com.floragunn.signals.execution.ActionExecutionException"
      },
      "error": {
        "message": "com.floragunn.signals.execution.ActionExecutionException:\norg.apache.http.client.ClientProtocolException:\nTarget host is not specified"
      },
      "execution_start": "2023-12-19T15:23:16.118Z",
      "execution_end": "2023-12-19T15:23:16.123Z"
    }
  ],
  "node": "elk-master",
  "_id": "ZoesgowBjkc-je2xjyMs",
  "_index": ".signals_log_2023.12.19"
}
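For what it's worth, "Target host is not specified" is the message Apache HttpClient raises when the request URL has no scheme, so no host can be derived from it; the webhook URL "www.google.com" in the watch shared below fits that pattern. A sketch of the action with a scheme added (this addresses the failed action, not the skipping itself):

{
  "type": "webhook",
  "name": "mywebhook",
  "request": {
    "method": "POST",
    "url": "https://www.google.com",
    "body": "Total: {{data.mysearch.hits.total.value}}",
    "headers": {
      "Content-type": "application/json"
    }
  },
  "throttle_period": "1m"
}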

This is also the case for other watches that don't have any failures, e.g. a watch that is set to run every minute.

@Raj.Jikadra Have you noticed any errors in the Elasticsearch logs around the watch’s execution time?
Could you share a JSON version of your watch?

There are no logs at all in the Elasticsearch logs when it skips an execution.
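A sketch of how more detail might be captured for the next skipped run: raising the log level for the Signals package via the standard Elasticsearch logger settings (the package name com.floragunn.signals is taken from the exception above; whether DEBUG output is verbose enough here is an assumption). The setting can be reset afterwards by setting it to null:

PUT _cluster/settings
{
  "persistent": {
    "logger.com.floragunn.signals": "DEBUG"
  }
}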

Watch JSON:

{
  "active": true,
  "trigger": {
    "schedule": {
      "interval": [
        "1m"
      ],
      "timezone": "Europe/Berlin"
    }
  },
  "checks": [
    {
      "type": "static",
      "name": "metadata",
      "target": "metadata",
      "value": {
        "action_message": "Random Msg"
      }
    },
    {
      "type": "search",
      "name": "payload",
      "target": "payload",
      "request": {
        "body": {
          "size": 5000,
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "CREATED_AT_UTC": {
                      "gte": "now-70m/m",
                      "lte": "now"
                    }
                  }
                }
              ]
            }
          },
          "sort": [
            {
              "CREATED_AT_UTC": {
                "order": "asc"
              }
            }
          ]
        },
        "indices": [
          "index-pattern"
        ]
      }
    },
    {
      "type": "transform",
      "name": "transform",
      "source": "String errorAlertMessage = \"\";\nMap map = new HashMap();\nList a = new ArrayList();\nboolean isAlertTrue = false;\nfor(int i=0;i<data.payload.hits.hits.size();i++) {\n    String id = data.payload.hits.hits[i]._source.CHANNEL + \"_\" + data.payload.hits.hits[i]._source.CLIENT_ID + \"_\" + data.payload.hits.hits[i]._source.WH_ID;\n    map.put(id,data.payload.hits.hits[i]);\n}\nfor(String k : map.keySet()) {\n    Map m = new HashMap();\n    int hourOrders = (map.get(k)._source.HOUR_ORDERS);\n    int sevenDaysAvgOrders = (map.get(k)._source.SEVEN_DAYS_AVG);\n    if(sevenDaysAvgOrders > 0) {\n        int expectedOrders = (int)(sevenDaysAvgOrders * 0.8);\n        if(hourOrders <= expectedOrders || hourOrders==0) {\n            isAlertTrue=true;\n            m.put(\"CLIENT_ID\",map.get(k)._source.CLIENT_ID);\n            m.put(\"CLIENT\",map.get(k)._source.CLIENT);\n            m.put(\"CHANNEL\",map.get(k)._source.CHANNEL);\n            m.put(\"SEVEN_DAYS_AVG\",sevenDaysAvgOrders);\n            m.put(\"HOUR_ORDERS\", hourOrders);\n            m.put(\"WAREHOUSE\",map.get(k)._source.WAREHOUSE);\n            m.put(\"WH_ID\",map.get(k)._source.WH_ID);\n            a.add(m);\n        }\n    }\n}\nInstant now = Instant.ofEpochMilli(System.currentTimeMillis());\nZoneId istZone = ZoneId.of(\"Asia/Kolkata\");\nZonedDateTime istTime = ZonedDateTime.ofInstant(now, istZone);\nZonedDateTime previousHourIST = istTime.minusHours(1);\nint currentHour = istTime.getHour();\nint previousHour = previousHourIST.getHour();\n\nMap result = new HashMap();\nresult.put(\"is_alert\", isAlertTrue);\nresult.put(\"size\",a.size());\nresult.put(\"payload\",a);\nresult.put(\"currentHour\",currentHour);\nresult.put(\"previousHour\",previousHour);\nreturn result;\n",
      "lang": "painless",
      "target": "transform"
    },
    {
      "type": "condition",
      "name": "condition",
      "source": "return data.transform.is_alert == true",
      "lang": "painless"
    }
  ],
  "actions": [
    {
      "type": "email",
      "name": "myemail",
      "from": "noreply@increff.com",
      "to": [
        "raj.non.jikadra@increff.com"
      ],
      "cc": [],
      "bcc": [],
      "subject": "Signals message",
      "text_body": "Total: {{data.mysearch.hits.total.value}}",
      "html_body": "<p>\n  <span style=\"color:blue;\">Total:</span>\n  <span style=\"color:red;\">{{data.avg_ticket_price.aggregations.metricAgg.value}}</span>\n</p>\n",
      "account": "MailAccount",
      "checks": [],
      "throttle_period": "3m"
    },
    {
      "type": "webhook",
      "name": "mywebhook",
      "request": {
        "method": "POST",
        "url": "www.google.com",
        "body": "Total: {{data.mysearch.hits.total.value}}",
        "headers": {
          "Content-type": "application/json"
        }
      },
      "checks": [],
      "throttle_period": "1m"
    }
  ],
  "_tenant": "_main",
  "_meta": {
    "last_edit": {
      "user": "elastic",
      "date": "2023-12-19T11:42:59.569Z"
    }
  },
  "_id": "Watch_throttle_test26"
}
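If your Signals version exposes an execute-watch endpoint, the watch can also be run on demand to verify the checks and actions independently of the scheduler; the path below is an assumption based on the Signals REST API and should be checked against your Search Guard documentation:

POST /_signals/watch/_main/Watch_throttle_test26/_execute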

@Raj.Jikadra Thank you for sharing the watch definition. I have some questions regarding your cluster.

  1. What is the version of the Search Guard plugin?
  2. How many ES nodes do you have in the cluster?
  3. What node roles did you assign (GET _cat/nodes)?
  4. How many Watches do you have in your cluster?
  5. What CPU and RAM resources did you assign to your ES nodes?

  1. Search Guard version: Search Guard FLX 1.4.0. I noticed this issue on FLX 1.3.0 and upgraded to 1.4.0 thinking it might solve the issue, but no luck.
  2. There is only one master ES node in the cluster, which is deployed on a VM.
  3. GET _cat/nodes: 10.10.0.58 22 94 14 0.42 0.40 0.43 hilmrst * elk-internal-master
  4. Around 50, but I have tried various numbers of watches (100, 70, 40 and 25), which all had the issue. One pattern: decreasing the number of watches reduces the skipping.
  5. The VM has 2 vCPUs and 7.5 GB RAM. The heap is the default: 7.5/2 ≈ 3.75 GB.

I have also tried multiple nodes and different hardware configurations, but the issue is still the same.
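Since the skipping seems to correlate with watch count and CPU cores, it may be worth checking whether the node is starved for threads or CPU around a skipped run. The calls below are standard Elasticsearch APIs; the idea that scheduler-thread starvation causes the skips is an assumption, not a confirmed diagnosis:

GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected
GET _nodes/hot_threads?threads=5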

I also have another cluster which has the same issue:
Node config:

  1. Master node: 16 vCPUs and 64 GB RAM.
  2. Node with only ‘data_hot’: 4 vCPUs and 16 GB RAM.

Node roles:
10.170.42.22 29 97 26 4.90 4.53 4.74 hilmrst * master
10.170.42.32 47 98 3 0.03 0.28 0.72 w - data-warm

Search Guard version: FLX 1.3.0

@Raj.Jikadra Thank you for sharing the details. I’ll try to test it.
Regarding the watches, do you use different watch definitions or do you replicate the shared example?

A few watches are similar to the one I shared, and a few are different.

Also some additional info:

  1. Watches don't start skipping when we create them; the skipping starts after we restart the Elasticsearch node once the watches have been created (see the query sketched below).
  2. This issue is less frequent on Elasticsearch nodes running on a single core (processor).
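One way to quantify the skipping after a restart is to count executions per watch in the Signals log and compare the count with what the interval implies (a 1m watch should log roughly 60 runs per hour). This sketch uses the .signals_log_* indices and field names from the execution record earlier in the thread; a keyword mapping for watch_id is assumed:

GET .signals_log_*/_search
{
  "size": 0,
  "query": { "range": { "execution_start": { "gte": "now-1h" } } },
  "aggs": {
    "per_watch": {
      "terms": { "field": "watch_id", "size": 100 }
    }
  }
}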

@pablo thanks for the support so far. Is there any update on testing?