CloverDX Blog on Data Integration

Parsing of an Apache access log

Written by Vaclav Matous | May 07, 2009

The UniversalDataReader component reads files in a variety of formats, and we use it for many purposes. One of them is parsing an Apache access log. Such a file normally contains records in the widely used combined log format, for example:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

 

Fields in a record are separated by spaces. However, some fields contain spaces themselves, such as the quoted request "GET /apache_pb.gif HTTP/1.0" or the bracketed timestamp, so a single space cannot serve as the only delimiter. Fortunately, CloverDX allows you to define variable delimiters in metadata, that is, a different delimiter after each field. Parsing the log therefore depends only on setting up the metadata correctly on the output edge of the reader. In our case we defined the following delimiters: space, space, space + left square bracket, right square bracket + space + quotation mark, quotation mark + space, and so on.
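To make the idea of per-field delimiters concrete, here is a minimal sketch in Python rather than in CloverDX metadata. It is not how the UniversalDataReader is implemented; the field names, the `FIELDS` list and the `parse_line` function are assumptions made up for this illustration. Each field is simply read up to its own delimiter, in the order listed above.

```python
# Hypothetical field names, each paired with the delimiter that ends it.
# The delimiter sequence follows the one described in the post.
FIELDS = [
    ("host", " "),
    ("ident", " "),
    ("user", " ["),
    ("timestamp", '] "'),
    ("request", '" '),
    ("status", " "),
    ("bytes", ' "'),
    ("referer", '" "'),
    ("user_agent", '"'),  # assumption: the closing quote ends the record
]


def parse_line(line):
    """Split one combined-format log line using a different delimiter per field."""
    record = {}
    pos = 0
    for name, delim in FIELDS:
        end = line.index(delim, pos)  # raises ValueError on a malformed line
        record[name] = line[pos:end]
        pos = end + len(delim)
    return record


SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

if __name__ == "__main__":
    for key, value in parse_line(SAMPLE).items():
        print(f"{key}: {value}")
```

Because every field has its own terminator, the spaces inside the timestamp, the request and the user agent string are not treated as separators. The sketch does not handle escaped quotation marks inside quoted fields.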

 

The complete example, which also computes the most visited pages and the most frequently visiting IP addresses, can be found in the Advanced Examples (AccessLogParsing.grf) included in CloverDX Designer. Alternatively, you can download all the examples from SourceForge.
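For readers who only want a rough idea of that aggregation step, the following Python sketch counts visits per IP address and per page. It is not taken from AccessLogParsing.grf; the sample records and the `top_counts` function are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical parsed records: (client IP, request line) pairs.
RECORDS = [
    ("127.0.0.1", "GET /apache_pb.gif HTTP/1.0"),
    ("127.0.0.1", "GET /index.html HTTP/1.0"),
    ("192.0.2.7", "GET /index.html HTTP/1.0"),
    ("192.0.2.7", "GET /index.html HTTP/1.1"),
    ("192.0.2.7", "GET /apache_pb.gif HTTP/1.0"),
]


def top_counts(records, n=1):
    """Return the n most frequent IP addresses and the n most requested pages."""
    ips = Counter(ip for ip, _ in records)
    # The page is the second token of the request line: METHOD PATH PROTOCOL.
    pages = Counter(request.split(" ")[1] for _, request in records)
    return ips.most_common(n), pages.most_common(n)


if __name__ == "__main__":
    top_ips, top_pages = top_counts(RECORDS)
    print("Top IP addresses:", top_ips)
    print("Top pages:", top_pages)
```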