Thursday, December 04, 2008

Log Analytics: Splunk

Sure, I use logs to troubleshoot problems, but I have never given them much thought as an analytics device since the rise of JavaScript-based traffic analysis tools like Omniture and Google Analytics. I must say that I have sorely underestimated the power of logging.

I say this because of a tool that I have begun using called Splunk. If I had to characterize Splunk, I would say that it is a data analysis tool, with the ability to generate reports and alerts based on events.

The data could be an Apache log file, the output of the ps command, or a configuration file that you need to watch for changes. Input can come from a file, from a TCP/UDP port, or through other mechanisms.

The analysis of the data is done via a search, though the search language is quite sophisticated. A search allows you to target a specific data source or a class of sources (like all of your access logs), extract fields from the log entries using a regex (or just split key/val pairs), transform the data, and then chart it as a line, pie, bar, or other type of chart.

Here are some specific examples of what I have been doing with it.

Response Times - With a small AOP aspect, I was able to wrap all of my controller methods (and some repository methods too) and have them dump response time data to a file. The data looks like this:

app=my-app;class=org.example.Demo;method=doSearch;response=348
app=my-app;class=org.example.Demo;method=doSearch;response=654
app=my-app;class=org.example.Demo;method=doSearch;response=439

All of this data is logged using commons-logging, and put into a separate performance.log file to separate it out from other logging. The result is that I can create a search for any class/method, and generate a chart of min/avg/max response times for that method. This allows me to spot any degradation that might occur over time (as the DB gets larger), or during certain parts of the day (when the network is congested). I can then set up an alert so that if the response times exceed a threshold, Splunk will send me an email (or execute a script).
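To make the min/avg/max idea concrete, here is a rough Python sketch of the same aggregation done by hand on the sample lines above. This is not how Splunk does it internally and it is not Splunk's search syntax; the function names parse_line and summarize are made up for illustration, and the response values are treated as plain integers in whatever unit the logger writes.

```python
from collections import defaultdict


def parse_line(line):
    """Split a 'key=value;key=value' log line into a dict of fields."""
    fields = {}
    for pair in line.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key] = value
    return fields


def summarize(lines):
    """Return {(class, method): (min, avg, max)} for the response field."""
    samples = defaultdict(list)
    for line in lines:
        fields = parse_line(line)
        try:
            key = (fields["class"], fields["method"])
            samples[key].append(int(fields["response"]))
        except (KeyError, ValueError):
            # Skip lines that lack the expected fields or a numeric response.
            continue
    return {
        key: (min(values), sum(values) / len(values), max(values))
        for key, values in samples.items()
    }


# The three sample lines from the performance log shown earlier.
sample = [
    "app=my-app;class=org.example.Demo;method=doSearch;response=348",
    "app=my-app;class=org.example.Demo;method=doSearch;response=654",
    "app=my-app;class=org.example.Demo;method=doSearch;response=439",
]

stats = summarize(sample)
for (cls, method), (low, avg, high) in stats.items():
    print(f"{cls}.{method}: min={low} avg={avg:.1f} max={high}")
```

The point of the key/val format is that this kind of grouping needs no custom regex: every field is already named, so any tool that can split on ";" and "=" can chart it.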

Change Control - Simply put, I can have Splunk monitor a configuration file, and it will log any changes to that file. I can then use Splunk's diff command to see the actual change. Again, I can have it email me when a change occurs.
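As a rough illustration of what a change report boils down to, the Python sketch below compares two versions of a configuration file with the standard difflib module. It only shows the idea of diffing two snapshots and is not Splunk's own diff command; the file name app.conf and its contents are invented for the example.

```python
import difflib


def describe_change(old_text, new_text, name="app.conf"):
    """Return unified-diff lines between two snapshots of a config file."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=name + " (before)",
            tofile=name + " (after)",
        )
    )


# Hypothetical snapshots of a config file taken at two points in time.
before = "timeout=30\npool=10\n"
after = "timeout=60\npool=10\n"

changes = describe_change(before, after)
print("".join(changes))
```

An empty result means the two snapshots are identical, which is the natural condition for staying quiet instead of sending an email.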

Service Unavailability - Because all of my Java application servers sit behind an Apache proxy, I can monitor the Apache logs and look for proxy errors (502, 503, 504). Splunk is able to alert me when these events occur because it can parse the logs and look specifically for these error codes. With Apache specifically this is super easy because Splunk already knows how to parse an Apache log that uses a standard format. This can be more effective than monitoring that relies on polling, because polling will only tell you the service is down if it happens to be down at the moment of the poll, whereas Splunk can alert me even if only a single hit against the site caused a proxy error.
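The log-driven approach can be sketched in a few lines of Python. This assumes access log lines in Apache's common log format, where the quoted request is followed by the status code; the sample lines are invented, find_proxy_errors is a made-up name, and the default set of codes (502, 503, 504) is my assumption about which statuses count as proxy errors, so pass a different set to match your own setup.

```python
import re

# Matches the quoted request followed by the three-digit status code,
# as found in Apache's common and combined log formats.
LOG_PATTERN = re.compile(r'"(?P<request>[^"]*)"\s+(?P<status>\d{3})\s')


def find_proxy_errors(lines, codes=frozenset({502, 503, 504})):
    """Return (status, request) for every line whose status is in codes."""
    errors = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and int(match.group("status")) in codes:
            errors.append((int(match.group("status")), match.group("request")))
    return errors


# Invented sample lines for illustration only.
access_log = [
    '10.0.0.5 - - [04/Dec/2008:10:15:30 -0500] "GET /index HTTP/1.1" 200 5120',
    '10.0.0.5 - - [04/Dec/2008:10:15:32 -0500] "GET /search HTTP/1.1" 503 312',
    '10.0.0.9 - - [04/Dec/2008:10:16:01 -0500] "GET /report HTTP/1.1" 504 298',
]

errors = find_proxy_errors(access_log)
for status, request in errors:
    print(f"proxy error {status}: {request}")
```

Because every request leaves a line in the log, even one failed hit between two polling intervals shows up here.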

There is much more to Splunk, and I still have much to learn, but I hope this helps provide some insight as to its capabilities.
