Recently I’ve been experimenting with EFK (Elasticsearch, Fluentd and Kibana) to see how we can extract value from our machine logs. We also use Nagios to monitor various services and processes within our infrastructure. The text logs produced by Nagios are not very useful in their raw form, as you can see…

[1405413255] Auto-save of retention data completed successfully.
[1405413285] SERVICE ALERT: servername;t 3306;OK;SOFT;2;QUERY OK: 'SELECT COUNT(*) FROM t' returned 32063.000000
[1405413745] SERVICE ALERT: servername;Memory;OK;HARD;3;OK Memory 9% used. Largest process: nscd (537) = 715.14MB (18%)
[1405414075] SERVICE NOTIFICATION: nagiosadmin;servername;MySQL Uptime 3306;WARNING;notify-service-by-email;WARNING: MySQL uptime, 1105 is below threshold: 4320.
[1405414315] SERVICE ALERT: servername;PING;WARNING;SOFT;1;PING WARNING - Packet loss = 28%, RTA = 34.29 ms
[1405414325] SERVICE ALERT: servername;PING;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 33.32 ms
[1405414345] SERVICE ALERT: servername;Memory;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[1405414365] SERVICE NOTIFICATION: dash;servername;Service last results loaded;WARNING;notify-service-by-email;QUERY WARNING: SELECT COUNT(*) FROM t) AS t returned 0.000000
[1405414465] SERVICE ALERT: servername;Memory;CRITICAL;SOFT;2;CHECK_NRPE: Socket timeout after 10 seconds.
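
The bracketed number at the start of each line is a Unix epoch timestamp, which is part of what makes the raw log hard to read. As a rough illustration only (it is not part of the setup), the Python sketch below splits a line into a readable UTC time and the message. The name split_nagios_line is made up for this example.

```python
import re
from datetime import datetime, timezone

# Matches "[<epoch seconds>] <rest of line>" as seen in the log excerpt above.
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")


def split_nagios_line(line):
    """Return (UTC datetime, message) for one nagios.log line, or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return when, match.group(2)


when, message = split_nagios_line(
    "[1405413255] Auto-save of retention data completed successfully."
)
print(when.isoformat(), message)
```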

I wanted to get the service alerts in the log files into EFK. Here’s how I did it. First install the fluent-plugin-grok-parser plugin. If you are using td-agent…

/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-grok-parser

Or if you are using the plain Ruby gem version of Fluentd…

gem install fluent-plugin-grok-parser

Next we need to create a file containing the patterns we want to match. I used the one that can be found here. There’s also a useful grok debugger here if you want to test your own patterns. Click the “Nagios” link and copy and paste the text into a file, e.g. /usr/bin/scripts/nagios_grok_patterns.txt

Make sure td-agent can read the file…

chown td-agent:td-agent /usr/bin/scripts/nagios_grok_patterns.txt

The example here will parse a Nagios Service alert. The following log message…

[1405363825] SERVICE ALERT: servername;Memory;OK;SOFT;2;OK Memory 9% used. Largest process: nscd (537) = 715.14MB (18%)

will be parsed by the following grok expression…

(?<nagios_type>SERVICE ALERT): (?<nagios_hostname>.*?);(?<nagios_service>.*?);(?<nagios_state>.*?);(?<nagios_statelevel>.*?);(?<nagios_attempt>(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))));(?<nagios_message>.*)
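
To try the expression outside of Fluentd, here is a minimal Python sketch of the same match. It is a port, not the plugin's own code: Python spells named groups as (?P<name>...), and the number pattern for nagios_attempt is simplified here (no lookbehind or atomic group), so treat it as an approximation. The name parse_service_alert is hypothetical.

```python
import re

# Python port of the grok expression above (assumptions of this sketch:
# (?P<name>...) group syntax and a simplified number pattern).
SERVICE_ALERT_RE = re.compile(
    r"(?P<nagios_type>SERVICE ALERT): "
    r"(?P<nagios_hostname>.*?);"
    r"(?P<nagios_service>.*?);"
    r"(?P<nagios_state>.*?);"
    r"(?P<nagios_statelevel>.*?);"
    r"(?P<nagios_attempt>[+-]?(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+));"
    r"(?P<nagios_message>.*)"
)


def parse_service_alert(line):
    """Return a dict of the named captures, or None if the line is not a service alert."""
    match = SERVICE_ALERT_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "[1405363825] SERVICE ALERT: servername;Memory;OK;SOFT;2;"
    "OK Memory 9% used. Largest process: nscd (537) = 715.14MB (18%)"
)
record = parse_service_alert(sample)
print(record)
```

Lines that are not service alerts, such as the "Auto-save" or "SERVICE NOTIFICATION" entries, simply return None.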

and converted into the following JSON…

{
  "nagios_type": [
    [
      "SERVICE ALERT"
    ]
  ],
  "nagios_hostname": [
    [
      "servername"
    ]
  ],
  "nagios_service": [
    [
      "Memory"
    ]
  ],
  "nagios_state": [
    [
      "OK"
    ]
  ],
  "nagios_statelevel": [
    [
      "SOFT"
    ]
  ],
  "nagios_attempt": [
    [
      "2"
    ]
  ],
  "nagios_message": [
    [
      "OK Memory 9% used. Largest process: nscd (537) = 715.14MB (18%)"
    ]
  ]
}
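
The nested arrays above look like the way a grok debugger prints its captures. Assuming each named capture ends up as a single string value in the record (an assumption of this sketch, not something verified against the plugin), the equivalent flat record can be sketched in Python like this. flatten_captures is a hypothetical helper.

```python
import json


def flatten_captures(nested):
    """Collapse {"key": [["value"]]} into {"key": "value"}."""
    flat = {}
    for key, groups in nested.items():
        # Take the first value of the first group; None if nothing was captured.
        flat[key] = groups[0][0] if groups and groups[0] else None
    return flat


debugger_output = json.loads("""
{
  "nagios_type": [["SERVICE ALERT"]],
  "nagios_hostname": [["servername"]],
  "nagios_service": [["Memory"]],
  "nagios_state": [["OK"]],
  "nagios_statelevel": [["SOFT"]],
  "nagios_attempt": [["2"]],
  "nagios_message": [["OK Memory 9% used. Largest process: nscd (537) = 715.14MB (18%)"]]
}
""")

flat_record = flatten_captures(debugger_output)
print(json.dumps(flat_record, indent=2))
```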

The following configuration should be placed into /etc/td-agent/td-agent.conf to send Nagios service alerts to your main server. Note that the grok_pattern parameter refers to the name of the expression defined in the file pointed at by custom_pattern_path.

<source>
  type tail
  format grok
  grok_pattern %{NAGIOS_SERVICE_ALERT}
  custom_pattern_path /usr/bin/scripts/nagios_grok_patterns.txt
  path /usr/local/nagios/var/nagios.log
  pos_file /var/log/td-agent/nagios_log.pos
  tag nagios
</source>

<match nagios>
  type record_reformer
  tag nagios.source
  source nagios
</match>

<match nagios.source>
  type forward
  <server>
    host XXX.XXX.XXX.XXX
    port 42186
  </server>
</match>

Restart td-agent…

/etc/init.d/td-agent restart
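
If nothing arrives at the main server after the restart, one quick thing to rule out is whether the forward port is reachable at all. The Python sketch below is only a hedged helper, not part of Fluentd: can_connect is a made-up name, and a successful TCP connection only shows that something is listening on that port, not that it is a Fluentd forward input.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (replace the placeholder with the real address of your main server):
# can_connect("XXX.XXX.XXX.XXX", 42186)
```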

The td-agent log file, probably /var/log/td-agent/td-agent.log, should contain the following message if the previous steps have been set up correctly.

2014-07-14 19:50:08 +0100 [info]: Expanded the pattern (?<nagios_type>SERVICE ALERT): (?<nagios_hostname>.*?);(?<nagios_service>.*?);(?<nagios_state>.*?);(?<nagios_statelevel>.*?);(?<nagios_attempt>(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))));(?<nagios_message>.*) into (?<nagios_type>SERVICE ALERT): (?<nagios_hostname>.*?);(?<nagios_service>.*?);(?<nagios_state>.*?);(?<nagios_statelevel>.*?);(?<nagios_attempt>(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))));(?<nagios_message>.*)
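
Rather than eyeballing the log, the check can be scripted. This Python sketch looks for the "[info]: Expanded the pattern" marker from the message above. find_expanded_patterns is a hypothetical helper, and the sample lines are shortened placeholders, not real td-agent output.

```python
def find_expanded_patterns(lines):
    """Return the log lines that report a grok pattern expansion."""
    marker = "[info]: Expanded the pattern"
    return [line.rstrip("\n") for line in lines if marker in line]


# Shortened stand-ins for real log lines (the second one is a dummy).
sample_log = [
    "2014-07-14 19:50:08 +0100 [info]: Expanded the pattern (?<nagios_type>SERVICE ALERT): ... into ...\n",
    "some unrelated line\n",
]
hits = find_expanded_patterns(sample_log)
print(len(hits), "expansion message(s) found")

# Against the real file (path as assumed in the text above):
# with open("/var/log/td-agent/td-agent.log") as fh:
#     print(find_expanded_patterns(fh))
```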