Search Guard Watch doesn’t appear to be executing, receiving “failed to fetch” in Kibana UI.
Application Version
Elastic Stack 7.10.2 with SG 49.0.0, running in EKS on stock Elastic images, with Search Guard installed.
The Issue
I’ve configured a Watch in Signals through the Kibana Plugin UI with the following check. The objective of the Watch is simply to get the number of logged documents from the previous 15 minutes and to send an alert if none have been received, which would indicate that the application is most likely down.
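For illustration, the check is roughly of this shape; the index pattern, timestamp field, and names below are placeholders rather than my exact configuration:
"checks": [
  {
    "type": "search",
    "name": "doc_count",
    "target": "doc_count",
    "request": {
      "indices": ["my-app-logs-*"],
      "body": {
        "size": 0,
        "query": {
          "range": {
            "@timestamp": { "gte": "now-15m" }
          }
        }
      }
    }
  },
  {
    "type": "condition",
    "name": "no_docs_received",
    "source": "data.doc_count.hits.total.value == 0"
  }
]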
I do not currently have an action configured for this watch, as I’m just trying to test the watch, for now.
I’m not seeing anything in either the Elasticsearch or Kibana logs, which might indicate what the issue is.
Update
I don’t know whether this has anything to do with it, but I decided to try to create an Elasticsearch action, so that I could see what was being output by this alert. I ended up getting a 401/Unauthorized error in the UI while SG was trying to retrieve the list of indices, and then I got this in the Elasticsearch log:
Cannot retrieve roles for User [name=me@example.com, backend_roles=[SG_ADMIN, SG_USER], requestedTenant=null] from ldap due to ElasticsearchSecurityException[ElasticsearchSecurityException[No user me@example.com found]]; nested: ElasticsearchSecurityException[No user me@example.com found];
I’m uncertain why, as I should have full admin options, and I appear to be able to save watches.
When exactly do you get the “failed to fetch” error message? When trying to manually execute the watch from the Kibana plugin?
For both problems, we would need more information from the Elasticsearch logs. In the case of manually executing a watch, you can also check the browser console for additional details.
Regarding the LDAP error: Can you also please post your complete sg_config.yml configuration? Remember to redact passwords. While redacting, please double check that LDAP passwords are correct, both in the authc section and the authz section.
Also, how are you assigning roles to users? In the log message, the user has the roles SG_ADMIN and SG_USER. There are two ways: You can use the role mapping feature to directly map user names to roles. Alternatively, you can assign roles coming from LDAP to Search Guard roles.
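For example, a sketch of an sg_roles_mapping.yml entry that maps both an LDAP backend role and a user name directly to a Search Guard role (role and user names are taken from your log message and may need adjusting):
SGS_ALL_ACCESS:
  backend_roles:
  - "SG_ADMIN"
  users:
  - "me@example.com"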
No. I get it after I’ve configured the watch, while I’m still connected during that session. I have the watch set to execute every 10m, and I’m getting it about every 10m, so my assumption is that they’re connected, especially since it goes away when I disable the watch.
How do I manually execute a watch? Will this allow me to see what’s being returned by Elasticsearch, for troubleshooting purposes?
What log settings do you need me to enable to get the correct information?
We use ADFS as our primary authentication. With the sole exception of this error, we’re not receiving any regular errors. LDAP is configured as a secondary authentication, so that users are able to authenticate directly to Elasticsearch to execute queries, etc. with the same permissions as they have when using the Kibana UI. Since we’re using EKS, we follow best practice and mount all secrets as environment variables in the pods. All authentication (ADFS & LDAP) is working as expected, with the exception of this issue, so passwords should be correct.
I’ve created a few additional roles/mappings since then (I probably need to pull a new copy of everything for backup, in fact), but again, all these are working.
What I do not see defined, when I look at defined roles in the UI (including system roles) are the SG_USER or SG_ADMIN roles. Is this something I need to create?
Update
I just checked the 49.0.0 initial config files and don’t see the SG_USER or SG_ADMIN roles defined in them, either.
So…figured out how to manually execute a watch (I can’t believe I missed the button), and with the help of that, I’ve debugged the check. It turns out that it was failing, and the filter was causing the issue. The following watch:
However, based on the trigger section, I assume that my trigger is failing. I get the same results if I set data.constants.threshold: 10000. Any ideas about that? And since this has forked into two separate questions, please let me know if it makes more sense to split them into two separate posts.
The properties here are null because the watch was run via manual execution; thus, it has no schedule in this case.
If you want to learn about the status of the automatically triggered watch executions, go to the overview page of Signals and click on the “Execution History” icon on the left side of the watch:
However, looking at the watch definition, I just notice that the watch is missing a trigger definition. Triggers define a schedule when a watch is automatically executed:
It is okay for watches to have no trigger. But then, you have to trigger them externally via the API. This is for cases where you want to “push” events from external systems into Signals:
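For example, assuming a watch named my_watch in the default (_main) tenant, an external system could trigger it with a request along these lines (watch name and tenant are placeholders):
POST /_signals/watch/_main/my_watch/_execute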
Yes, this is a good idea. I’ll start a new thread for the LDAP-related question:
I’m defining the trigger definition through the Kibana UI. This is the trigger, as defined in the UI:
However, even when enabled, I’m not seeing anything in the execution history. Note that right now I don’t have an action defined. Does the execution history only show up if there is an action defined?
However, as of this morning, this particular watch has been scheduled since Thursday or Friday, and I’m seeing the following, when I look at the execution history:
We probably need to have a look at the raw watch definition. For this, can you please open the following URI directly on the Elasticsearch endpoint (i.e., bypassing Kibana)?
Replace localhost with the hostname of your ES endpoint. This is assuming that you are using the default tenant. If you are using another tenant, replace _main with the name of the particular tenant.
This should give you a JSON version of the whole watch definition.
Additionally, it would be helpful to read out the internal watch state. This can be done by appending the suffix /_state to the URI mentioned above, i.e.:
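Assuming the watch is named my_watch and lives in the default tenant, the two URLs would look roughly like this:
https://localhost:9200/_signals/watch/_main/my_watch
https://localhost:9200/_signals/watch/_main/my_watch/_state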
Note that this gives you access to all log entries, including those produced by other tenants. If you want to restrict this, you would need to use document level security.
This is not well documented; I’ll add an issue to improve the documentation in this regard.
This is very strange indeed. You are right, as a user with the role SGS_ALL_ACCESS you should have access to the index anyway.
To get the whole picture, we should check whether we can access the log in any way directly from ES. For this, can you open directly in ES this URL and post the results?
https://localhost:9200/*signals*/_search?size=100
Also, it would be useful to see some logs at the time a watch has been executed.
For this, it would be good if you could increase the Signals log level. On a running Elasticsearch cluster, this can be done with an HTTP PUT request:
PUT /_cluster/settings
{
  "transient": {
    "logger.com.floragunn.signals": "DEBUG"
  }
}
If you have curl installed, you can send this request by using this command:
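A sketch (certificate and port options may need adjusting for your setup):
curl -k -u admin -X PUT "https://your-es-host:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{"transient": {"logger.com.floragunn.signals": "DEBUG"}}'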
Please replace admin with your user name and your-es-host with an endpoint of your ES cluster.
Then, the logs generated around the time the watch was executed would be interesting. You can find out the exact timestamp of the last execution using the _state REST API I mentioned in an earlier post:
Thank you for the update. Unfortunately, the problem remains mysterious for now. Thus, we need to do some more investigations.
Would it be possible to check whether there are more log messages directly after the one you cited? They might be helpful, even if they seem unrelated at first sight.
Please also check your config/elasticsearch.yml. Does it contain a property named signals.index_names.log? By default, it should not.
There is one more thing you can try to rule out problems with the cluster topology. By default, Signals distributes the watch execution over all nodes for load-balancing purposes. However, it is possible to restrict this to certain nodes. With this command, you restrict the watch execution for the _main (default) tenant to master nodes:
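A sketch of what I mean, assuming the tenant._main.node_filter key of the Signals settings endpoint and standard node filter syntax (please double-check both against the Signals administration documentation):
PUT /_signals/settings/tenant._main.node_filter
"master:true"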
The setting takes effect without any further action. Just wait for the next execution of the watch and then check whether the .signals_log_* index is created.
If this does not help either, it might be worth trying to change the index name of the log index. To do so, execute this command:
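For example, using the watch_log.index key of the Signals settings endpoint (the index name here is only an example):
PUT /_signals/settings/watch_log.index
"my_signals_log"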
Note that this setting does not get picked up by the Kibana plugin right now, so Kibana won’t be able to display log entries. This is just for testing.
Here are all log entries on that node, starting from the most recent watch execution through the current end of the logfile. At the point where I grabbed this, there had not been any events written to the log for about 5 min.
Note that there appears to be an error in there from c.f.s.w.r.WatchLogIndexWriter, which (of course) notes that there is no .signals_log_<now/d> index.
It is not. Since we’re running in EKS, I also confirmed that I didn’t set this elsewhere in the Helm values.
Did this. Got the identical error on the master node.
This worked. I had to create it as sg7-signals-log, however, since I have index auto-creation disabled. My exceptions to this are indices matching the following patterns (you can see that .signals_log_* should’ve autocreated without any issues, and other indices matching that pattern are created as expected):
However, when I go to look at indices or the settings for this specific index in the Kibana UI, I now get a “forbidden” message (please see the screenshot, below). I don’t get that with any other index, including the .signals_* indices.
I tried to create an index under ILM with a rollover alias to see whether that would work and populate in a way that lets me view the data. As part of this, I attempted to delete the sg7-signals-log index. However, SG appears to have created an index that even I, as the admin with SGS_ALL_ACCESS, can’t delete. Here’s the error I get:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "no permissions for [indices:admin/delete] and User [name=me@example.com, backend_roles=[SG_ADMIN, SG_USER], requestedTenant=null]"
      }
    ],
    "type" : "security_exception",
    "reason" : "no permissions for [indices:admin/delete] and User [name=me@example.com, backend_roles=[SG_ADMIN, SG_USER], requestedTenant=null]"
  },
  "status" : 403
}
Since I manually created an index with ILM rollover and told it to populate to that rollover alias, I can now see the events populating into that index. Here’s a sample, as executed from the Kibana console (note: sg-signals-log is the log alias, which is also what I set watch_log.index to):
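The query itself was nothing more than a plain search against the alias, roughly like this (hits omitted here):
GET sg-signals-log/_search?size=10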
So my question is this: Is it possible that watch_log.index was previously undefined, and that this was the reason the index wasn’t being created? If so (or even just to test what happens), can you give me the command that will allow me to set watch_log.index to what the default value should be (not just reset it to the default, in case that default was blank for me)?
If this can’t be done, is there any way I can set SG to use this value for watch_log.index for everything, including Kibana, as a workaround?
…and how do I delete the sg7-signals-log index, since I don’t seem to have any permissions to it?
watch_log.index: The name of the watch log index. Defaults to <.signals_log_{now/d}> which starts a new index every day. If you expect a higher throughput which makes it necessary to start an index more often, you can use another date math expression.
However, the next time it ran, I got the following error in the log when it attempted to create the index:
"stacktrace": [
"org.elasticsearch.transport.RemoteTransportException: [elk-es-master-2][10.229.24.58:9300][indices:admin/auto_create]",
"Caused by: org.elasticsearch.indices.InvalidIndexNameException: Invalid index name [.signals_log_{now/d}], must not contain the following characters [ , \", *, \\, <, |, ,, >, /, ?]",
...
So, reading more closely and following the example given on the date math web page, I executed it again as the following:
"stacktrace": [
"org.elasticsearch.index.IndexNotFoundException: no such index [%3C.signals_log_%7Bnow%2Fd%7D%3E] and [action.auto_create_index] ([.*,sg7-*,searchguard,searchguard*,ilm-history-*]) doesn't match",
Based on this, how do I give this a setting using date math, if that’s the issue?
I was able to do this by logging into one of the nodes and executing the following, where $SG_ADMIN_USER is Search Guard’s internal admin user (evidently SGS_ALL_ACCESS doesn’t have any perms on the sg7-* indices; I couldn’t delete the index using my regular admin user, even from the command line, and I get the forbidden message when trying to look at any of the sg7-* indices):
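Roughly (auth and certificate options trimmed for this post):
curl -k -u "$SG_ADMIN_USER" -X DELETE "https://localhost:9200/sg7-signals-log"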
By default, Signals writes logs to indexes specified by the index expression <.signals_log_{now/d}>. This means that every day, a new index is created with the current date appended as a suffix to the index name.
As such indexes are created on demand via auto creation, they are not created when auto creation is disabled.
If you want to use that index name, just add .signals_log_* to the action.auto_create_index setting.
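For example, merging the pattern into the existing list from your earlier error message (keep your own patterns as they are; note the leading dot, since the default index name starts with one):
PUT /_cluster/settings
{
  "persistent": {
    "action.auto_create_index": ".*,sg7-*,searchguard,searchguard*,ilm-history-*,.signals_log_*"
  }
}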
To restore the default of the index name setting, just do this:
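Assuming the same watch_log.index settings key as above, a sketch that sets it explicitly back to the documented default (which also covers setting it to the default value rather than just clearing it):
PUT /_signals/settings/watch_log.index
"<.signals_log_{now/d}>"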
If you don’t want to use date math for the log index, it should be possible to use any index name starting with .signals_log_ instead. As the Kibana plugin reads the index using a wildcard, it should also be able to access such an index.