Re: Parsing haproxy log files (python)

From: Holger Just <haproxy#meine-er.de>
Date: Sat, 19 Mar 2011 17:32:08 +0100


Hi Roy,

On 2011-03-18 22:21, Roy Smith wrote:
> Before I reinvent the wheel, has anybody already written code to parse
> haproxy log messages with Python?

I have, although it's not _that_ fast. My approach takes about 1 minute per 100 MB of gzipped logs (at roughly 10:1 compression).

If your use case is covered by the features of halog, you should definitely try that instead. It's written by Willy himself and can easily max out your streaming file I/O (i.e. it is orders of magnitude faster than anything you could do in Python).

That said, the gist of my parsing implementation follows. It targets HAProxy's verbose HTTP log format and Python 2.4. The terminology is the one used in the HAProxy configuration manual; refer to it for a description of the various fields.

--Holger


#!/usr/bin/env python
# encoding: utf-8

import os
import re
import subprocess as sub

# Does the syslog server escape quotes?
template_escape = True

haproxy_re = (r'haproxy\[(?P<pid>\d+)\]: '
              r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):(?P<client_port>\d{1,5}) '
              r'\[(?P<date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})\] '
              r'(?P<listener_name>\S+) (?P<server_name>\S+) '
              r'(?P<Tq>(-1|\d+))/(?P<Tw>(-1|\d+))/(?P<Tc>(-1|\d+))/(?P<Tr>(-1|\d+))/'
              r'(?P<Tt>\+?\d+) '
              r'(?P<HTTP_return_code>\d{3}) (?P<bytes_read>\d+) '
              r'(?P<captured_request_cookie>\S+) (?P<captured_response_cookie>\S+) '
              r'(?P<termination_state>[\w-]{4}) (?P<actconn>\d+)/(?P<feconn>\d+)/'
              r'(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+) '
              r'(?P<server_queue>\d+)/(?P<listener_queue>\d+) '
              r'(\{(?P<captured_request_headers>.*?)\} )?'
              r'(\{(?P<captured_response_headers>.*?)\} )?')

if template_escape:
  haproxy_re += r'\\"(?P<HTTP_request>.+)\\"'
else:
  haproxy_re += r'"(?P<HTTP_request>.+)"'

haproxy_re = re.compile(haproxy_re)

def scan(logfile_path):
  (root, ext) = os.path.splitext(logfile_path)
  process = None
  if ext == ".gz":
    # Use a shellout for unzipping. This is about 2-5 times faster
    # than doing it in python.
    process = sub.Popen(["/bin/gunzip", "--stdout", logfile_path],
                        stdout=sub.PIPE, bufsize=1)
    fd = process.stdout
  else:
    fd = open(logfile_path, "r")

  line_no = 0
  for line in fd:
    line_no += 1
    try:
      match = haproxy_re.search(line)
      if not match:
        # A non-request, e.g. an error or an info message of HAProxy
        # We just ignore it and continue with the next line
        continue

      fields = match.groupdict()
      if fields["captured_request_headers"]:
        fields["captured_request_headers"] = \
        fields["captured_request_headers"].split("|")
      if fields["captured_response_headers"]:
        fields["captured_response_headers"] = \
        fields["captured_response_headers"].split("|")

      # Now you have the matched parts in the fields dict
      # And you can do whatever you like with it :)

    except:
      print "An error occurred in line %s. Last line was:" % line_no
      print line
      raise

  # finalize the file reading
  if process:
    process.communicate()
  else:
    fd.close()
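
For a quick sanity check of the compiled regex, you could feed it a single sample line and print the resulting groupdict. The line below is completely made up (hostname, addresses and the captured {1wt.eu} header are just placeholders) and assumes template_escape is left at True, i.e. your syslog template escapes the quotes:

# A made-up sample line; the quotes are escaped the way a syslog template might do it.
sample = ('Mar 19 17:32:08 lb1 haproxy[14389]: 10.0.1.2:33317 '
          '[19/Mar/2011:17:32:08.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '\\"GET /index.html HTTP/1.1\\"')

match = haproxy_re.search(sample)
if match:
  for key, value in sorted(match.groupdict().items()):
    print "%s: %s" % (key, value)

Once that looks right, point scan() at a plain or gzipped log file and fill in the body of the loop with whatever aggregation you need.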
