Here are two quick Unix commands that clean up a Squid log file for import into R. This is very useful when you need to fine-tune your cache strategy.
    # Read in some.log, squeeze repeated spaces, and keep only fields 1, 4 and 7
    # (timestamp, result code and URL in Squid's native log format)
    cat some.log | tr -s ' ' | cut -d' ' -f1,4,7 > log.clean

    # Strip the millisecond values from the timestamp and shorten the requested URL
    # (-r and the I flag are GNU sed extensions)
    sed -r -n -e 's/^([[:digit:]]*)[^ ]* *[^ ]* *http:\/\/somedomainname(.*)$/\1 \2/Ip' log.clean > log.import
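This creates a two-column file. The first column is a timestamp at one-second granularity. The second is the part of the requested URL after the domain, that is, the path plus any query string. Lines whose URL does not start with http://somedomainname are dropped. If you have a very large log file, tweak the sed expression so that this last column is as small as possible.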
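To check the result before running it on a full log, here is a small sketch that applies the same two steps to a single made-up entry. The sample line, the file names sample.log, sample.clean, sample.import and sample.small, and the final query-string trim are all assumptions for illustration; dropping the query string is only one possible way to shrink the last column, and it only makes sense if the parameters do not matter for your analysis.

```shell
# Hypothetical entry in Squid's native access.log layout (invented for this example)
printf '%s\n' '1286536309.450    180 192.168.0.1 TCP_MISS/200 4120 GET http://somedomainname/images/logo.png?v=2 - DIRECT/10.0.0.5 image/png' > sample.log

# Step 1: squeeze repeated spaces, keep fields 1, 4 and 7
tr -s ' ' < sample.log | cut -d' ' -f1,4,7 > sample.clean

# Step 2: drop the milliseconds and the domain (GNU sed needed for -r and the I flag)
sed -r -n -e 's/^([[:digit:]]*)[^ ]* *[^ ]* *http:\/\/somedomainname(.*)$/\1 \2/Ip' sample.clean > sample.import

# Optional, assumed extra step: cut everything from the first "?" to shrink the URL column
sed -e 's/?.*$//' sample.import > sample.small

cat sample.import   # 1286536309 /images/logo.png?v=2
cat sample.small    # 1286536309 /images/logo.png
```

Reading directly from the file with `tr -s ' ' < sample.log` does the same job as `cat sample.log | tr -s ' '` and saves one process, which can matter a little on very large logs.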