Error while initializing IndexJobStateStore "this.nodeId" is null

Hi,
I have a problem with Signals. Everything was working fine until I got this error: "Error while initializing IndexJobStateStore". Since it appeared, watchers are no longer executed. I also started getting watcher errors: "Cannot invoke "java.util.Map.get(Object)" because "this.originalToCloneMap" is null".

Elasticsearch: 7.9.3
SG 7.9.3-47.0.0

Elasticsearch log:

[2021-03-12T10:05:31,580][INFO ][c.f.s.j.c.IndexJobStateStore] [pias-sig01] Error while initializing IndexJobStateStore [schedulerName=signals/signals-prod, statusIndexName=.signals_watches_trigger_state, jobConfigSource=IndexJobConfigSource [indexName=.signals_watches, jobFactory=com.floragunn.signals.watch.Watch$JobConfigFactory@5aa1c72a, jobDistributor=JobDistributor signals/signals-prod], jobFactory=com.floragunn.signals.watch.Watch$JobConfigFactory@5aa1c72a]
Will try again during the next cluster change
java.lang.NullPointerException: Cannot invoke "String.equals(Object)" because "this.nodeId" is null
at com.floragunn.searchsupport.jobs.core.IndexJobStateStore.checkTriggerStateAfterRecovery(IndexJobStateStore.java:1640) ~[search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.core.IndexJobStateStore.createInternalJobDetailFromJobConfig(IndexJobStateStore.java:1500) ~[search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.core.IndexJobStateStore.loadJobs(IndexJobStateStore.java:1251) ~[search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.core.IndexJobStateStore.initJobs(IndexJobStateStore.java:1178) ~[search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.core.IndexJobStateStore.initialize(IndexJobStateStore.java:158) [search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.SchedulerBuilder.buildImpl(SchedulerBuilder.java:259) [search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.SchedulerBuilder.build(SchedulerBuilder.java:232) [search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.SignalsTenant.init(SignalsTenant.java:137) [search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.SignalsTenant.restart(SignalsTenant.java:178) [search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.SignalsTenant$1.run(SignalsTenant.java:185) [search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]

Other log:

[2021-03-12T10:40:49,534][INFO ][c.f.s.e.WatchRunner ] [pias-essignals001] Error while executing test_1_mo1_NodeStatus_kubernetes_nonprod
com.floragunn.signals.execution.WatchExecutionException: Error while executing Transform data_normalization
at com.floragunn.signals.execution.WatchRunner.executeChecks(WatchRunner.java:238) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:152) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:126) [search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.execution.AuthorizingJobDecorator.execute(AuthorizingJobDecorator.java:37) [search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-2.3.2.jar:?]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [quartz-2.3.2.jar:?]
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.Map.get(Object)" because "this.originalToCloneMap" is null
at com.floragunn.signals.support.NestedValueMap.deepCloneObject(NestedValueMap.java:222) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:151) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.putAllFromAnyMap(NestedValueMap.java:140) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:122) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:148) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.putAllFromAnyMap(NestedValueMap.java:140) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:122) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:148) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.putAll(NestedValueMap.java:131) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.clone(NestedValueMap.java:57) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchExecutionContextData.clone(WatchExecutionContextData.java:127) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchExecutionContext.clone(WatchExecutionContext.java:105) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.watch.checks.Transform.execute(Transform.java:113) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.executeChecks(WatchRunner.java:220) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
… 5 more
[2021-03-12T10:40:49,536][INFO ][o.q.c.JobRunShell ] [pias-essignals001] Job lrt.test_1_mo1_NodeStatus_kubernetes_nonprod threw a JobExecutionException:
org.quartz.JobExecutionException: com.floragunn.signals.execution.WatchExecutionException: Error while executing Transform data_normalization
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:129) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.searchsupport.jobs.execution.AuthorizingJobDecorator.execute(AuthorizingJobDecorator.java:37) ~[search-guard-suite-scheduler-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-2.3.2.jar:?]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [quartz-2.3.2.jar:?]
Caused by: com.floragunn.signals.execution.WatchExecutionException: Error while executing Transform data_normalization
at com.floragunn.signals.execution.WatchRunner.executeChecks(WatchRunner.java:238) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:152) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:126) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
… 3 more
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.Map.get(Object)" because "this.originalToCloneMap" is null
at com.floragunn.signals.support.NestedValueMap.deepCloneObject(NestedValueMap.java:222) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:151) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.putAllFromAnyMap(NestedValueMap.java:140) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:122) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:148) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.putAllFromAnyMap(NestedValueMap.java:140) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:122) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.put(NestedValueMap.java:148) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.putAll(NestedValueMap.java:131) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.support.NestedValueMap.clone(NestedValueMap.java:57) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchExecutionContextData.clone(WatchExecutionContextData.java:127) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchExecutionContext.clone(WatchExecutionContext.java:105) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.watch.checks.Transform.execute(Transform.java:113) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.executeChecks(WatchRunner.java:220) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:152) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
at com.floragunn.signals.execution.WatchRunner.execute(WatchRunner.java:126) ~[search-guard-suite-signals-7.9.3-47.0.0.jar:7.9.3-47.0.0]
… 3 more

br,
MW

Hi. Did you configure anything in Elasticsearch or Search Guard that might cause the problem?

What is your cluster health?

curl -k -u admin:admin -X GET https://localhost:9200/_cluster/settings
curl -k -u admin:admin -X GET https://localhost:9200/_cluster/health?pretty
curl -k -u admin:admin -X GET https://localhost:9200/_cluster/allocation/explain?pretty

Please share the configurations:

  1. kibana.yml
  2. elasticsearch.yml
  3. sg_config.yml

Hello Marwojt!

It seems you have hit a difficult-to-trigger bug. I have filed an issue; you can track it here:

If I understand correctly, Signals is not executing any watches right now. Is that correct? To get it running again before a bugfix is ready, you need to modify the internal indexes used by Signals.

To do this, you need to authenticate to Elasticsearch using the admin certificate that you also use for sgadmin.

You can use curl in a shell for this.

First, you need to find the watches with the problematic state. Use this command, inserting the paths to your admin certificate and key:

$ curl -k --cert "/path/to/cert.pem" --key "/path/to/key.pem" https://your-es-cluster:9200/.signals_watches_trigger_state/_search?pretty=true\&q=state%3AEXECUTING

This should give you a list of one or more documents looking like this:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : ".signals_watches_trigger_state",
        "_type" : "_doc",
        "_id" : "_main/.main.avg_ticket_price___9859011cc78b3687689e719ce0d111a5",
        "_score" : 0.2876821,
        "_source" : {
          "state" : "EXECUTING",
          "nextFireTime" : 1615380175096,
          "prevFireTime" : 1615380115096,
          "info" : null,
          "node" : null,
          "timesTriggered" : 10338
        }
      }
    ]
  }
}
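If the response contains many hits, you can pull out just the document IDs with standard shell tools. This is only a rough sketch assuming a POSIX shell with grep and sed; the file name executing.json is our own choice, not something Signals produces:

```shell
# First save the search response to a file, e.g. by appending
#   -o executing.json
# to the _search command above. Then list just the document IDs, one per line:
grep '"_id"' executing.json \
  | sed -E 's/.*"_id"[[:space:]]*:[[:space:]]*"([^"]*)".*/\1/'
```

Each output line is one `_id` value that needs to be deleted as described below.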

Then, you need to delete all of these documents. To do this, note the content of the _id attribute (in the example, _main/.main.avg_ticket_price___9859011cc78b3687689e719ce0d111a5) and append it to the end of the following command. Note that you have to replace the slash in the ID with %2F:

$ curl -k --cert "/path/to/cert.pem" --key "/path/to/key.pem" -X DELETE https://your-es-cluster:9200/.signals_watches_trigger_state/_doc/your-doc-id

So, in the example, the call would look like this:

$ curl -k --cert "/path/to/cert.pem" --key "/path/to/key.pem" -X DELETE https://your-es-cluster:9200/.signals_watches_trigger_state/_doc/_main%2F.main.avg_ticket_price___9859011cc78b3687689e719ce0d111a5
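The slash replacement can also be done in the shell rather than by hand. A minimal sketch (the doc_id variable holds the example ID; substitute your own):

```shell
# URL-encode the document ID for use in the DELETE path: / must become %2F
doc_id='_main/.main.avg_ticket_price___9859011cc78b3687689e719ce0d111a5'
encoded=$(printf '%s' "$doc_id" | sed 's|/|%2F|g')
echo "$encoded"
# Prints: _main%2F.main.avg_ticket_price___9859011cc78b3687689e719ce0d111a5
# The encoded ID can then be used in the DELETE call:
#   curl -k --cert "/path/to/cert.pem" --key "/path/to/key.pem" -X DELETE \
#     "https://your-es-cluster:9200/.signals_watches_trigger_state/_doc/$encoded"
```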

Afterwards, you have to restart every node that exhibited the error.

Sorry for the complications!

We looked a bit closer at the exceptions; the second one is probably unrelated to the first one.

To learn a bit more about the second one, which seems to be caused by a watch with the ID test_1_mo1_NodeStatus_kubernetes_nonprod: would it be possible for you to send us the definition of that watch? That would be very helpful for finding the problem.