Receiving signals_unavailable_exception

TL;DR

Receiving signals_unavailable_exception when opening Search Guard Signals in 7.10.1; this did not happen in 7.9.1.

Full Issue

I’ve recently updated our environment to 7.10.1 from 7.9.1. I’m using the official Elastic container images, with the appropriate version of SG installed, deploying to Kubernetes via the Elastic Helm charts.
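
For context, the rollout itself is nothing exotic: the chart values just point at our custom ES-plus-SG image, and the release is rolled out with the official chart, roughly like this (release name, chart version, and values file are placeholders for our setup):

helm repo add elastic https://helm.elastic.co
# values.yaml points image/imageTag at our custom image that bundles the SG plugin
helm upgrade --install elasticsearch elastic/elasticsearch --version 7.10.1 -f values.yaml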

I’m (finally) starting to configure alerts for some of our teams and am seeing the following error when going to Signals:

{
  "error": {
    "root_cause": [
      {
        "type": "signals_unavailable_exception",
        "reason": "Signals is still initializing. Please try again later."
      }
    ],
    "type": "signals_unavailable_exception",
    "reason": "Signals is still initializing. Please try again later."
  },
  "status": 500
}
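
For what it’s worth, the same response should also be retrievable outside Kibana, straight from the Signals REST API, along these lines (the exact endpoint path is my reading of the Signals docs, so double-check it for your SG version; host and credentials are placeholders):

# Hypothetical check against the Signals watch search endpoint for the default (_main) tenant
curl -sk -u admin:admin -X POST "https://elasticsearch:9200/_signals/watch/_main/_search" \
  -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'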

I did not receive this error when exploring Signals under previous versions of the SG plugin.

Please have a look at the ES logs. These should contain more details on the problem.

ES logs at INFO level show the following for Signals right after startup. After this, I can’t locate any additional Signals-related output:

{"type": "server", "timestamp": "2021-02-05T02:26:12,381Z", "level": "INFO", "component": "c.f.s.j.c.IndexJobStateStore", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler signals/_main is initialized. Jobs: 0 Active Triggers: 0", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:12,382Z", "level": "INFO", "component": "o.q.c.QuartzScheduler", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler meta-data: Quartz Scheduler (v2.3.2) 'signals/_main' with instanceId 'signals/_main'\n  Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.\n  NOT STARTED.\n  Currently in standby mode.\n  Number of jobs executed: 0\n  Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 3 threads.\n  Using job-store 'com.floragunn.searchsupport.jobs.core.IndexJobStateStore' - which supports persistence. and is clustered.\n", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:12,383Z", "level": "INFO", "component": "o.q.c.QuartzScheduler", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler signals/_main_$_signals/_main started.", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:13,340Z", "level": "INFO", "component": "c.f.s.j.c.IndexJobStateStore", "cluster.name": "es", "node.name": "es-master-1", "message": "Reinitializing jobs for IndexJobStateStore [schedulerName=signals/admin_tenant, statusIndexName=.signals_watches_trigger_state, jobConfigSource=IndexJobConfigSource [indexName=.signals_watches, jobFactory=com.floragunn.signals.watch.Watch$JobConfigFactory@70c3847d, jobDistributor=JobDistributor signals/admin_tenant], jobFactory=com.floragunn.signals.watch.Watch$JobConfigFactory@70c3847d]", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:18,366Z", "level": "INFO", "component": "c.f.s.j.c.IndexJobStateStore", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler signals/admin_tenant is initialized. Jobs: 0 Active Triggers: 0", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:18,366Z", "level": "INFO", "component": "o.q.c.QuartzScheduler", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler meta-data: Quartz Scheduler (v2.3.2) 'signals/admin_tenant' with instanceId 'signals/admin_tenant'\n  Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.\n  NOT STARTED.\n  Currently in standby mode.\n  Number of jobs executed: 0\n  Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 3 threads.\n  Using job-store 'com.floragunn.searchsupport.jobs.core.IndexJobStateStore' - which supports persistence. and is clustered.\n", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:18,366Z", "level": "INFO", "component": "o.q.c.QuartzScheduler", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler signals/admin_tenant_$_signals/admin_tenant started.", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }
{"type": "server", "timestamp": "2021-02-05T02:26:18,367Z", "level": "INFO", "component": "c.f.s.j.c.IndexJobStateStore", "cluster.name": "es", "node.name": "es-master-1", "message": "Scheduler signals/admin_tenant is initialized. Jobs: 0 Active Triggers: 0", "cluster.uuid": "816ooLAjQmWfG71KY0DSSA", "node.id": "Sog6C_VJSii5bnPXrc5oYw"  }

Interesting. If Signals is already starting the tenant schedulers, it should be close to finishing the initialization.

The logs mention the tenant selected by default (_main) and the admin_tenant. Do you have any more tenants configured?

Some more questions:

  • What exact version of Search Guard are you trying? If it is 48, would it be possible to try 49?
  • How many nodes do you have in your cluster? What roles (client, data, master) do these nodes have?
  • How are you accessing Signals? Are you using the UI or the REST API?
    • If you are using the UI, do you get the error immediately on the overview page?
    • If you are using the REST API, what is the endpoint you are calling?

No, just one tenant.

We’re currently on SG 48. I’ll need to build the images for SG 49, but will do so, probably on Monday.
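
Our image build is basically the official Elastic image plus a plugin install, so moving to 49 is just a version bump. Roughly, the build runs something like the line below; the zip name is a placeholder, and the real artifact for the matching ES version comes from the Search Guard download page:

# Install the Search Guard Suite plugin into the official ES image during the image build
bin/elasticsearch-plugin install --batch file:///tmp/search-guard-suite-plugin-<es-version>-<sg-version>.zip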

We currently have 34 Kubernetes Pods as nodes in the cluster. Two are coordinator nodes, which only handle requests from Kibana. Three are configured as master nodes. The remainder are data nodes, split across the lifecycle tiers: hot (12), warm (5), and cold (12); we chose to run more nodes rather than give each node more resources.
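
If it’s useful, this is how the role layout the cluster reports can be checked via the _cat API (host and credentials are placeholders):

# List node names with their role letters and which node is the elected master
curl -sk -u admin:admin "https://elasticsearch:9200/_cat/nodes?v&h=name,node.role,master"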

I’m getting this error while using the UI, and yes, I get it as soon as I go to the Signals overview page.

I’ve upgraded my test region to ELK 7.10.2 with SG 49.0.0, and I’m no longer getting this error. I’ve got to wait until tonight to redeploy my production instance and will let you know if it works.

I’ve deployed ELK 7.10.2 with SG 49.0.0 to my production region. Everything appears to be working as expected after this upgrade.

Any idea what the issue was?

SG 48 had a bug where Signals was not correctly initialized on some nodes in certain cluster topologies. The details are a bit complicated to explain in a brief forum post, though 🙂

Has the “Solution” checkbox been removed from the forums? I’d like to accept your solution for the next person who comes along with the same or similar issue.

The Signals category had this feature disabled; this was probably an oversight. I have enabled it now.
