Split access.log file by dates using command line tools


I have an Apache access.log file, which is around 35 GB in size. Grepping through it is no longer an option without a long wait.



I wanted to split it in many small files, by using date as splitting criteria.



The date is in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do it using only bash scripting, standard text manipulation programs (grep, awk, sed, and the like), piping and redirection?





The input file name is access.log. I'd like the output files to have a format such as access.apache.15_Oct_2011.log (that would do the trick, although it's not nice for sorting).
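For concreteness, here is a minimal sketch of the mapping from a log line to the desired file name (assuming the combined log format, where the timestamp is the fourth whitespace-separated field; the sample line is hypothetical):

```shell
# Hypothetical sample line in combined log format
echo '127.0.0.1 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 123' > access.log

# Field 4 is "[15/Oct/2011:12:02:02"; splitting on "[", "/" and ":"
# yields day, month and year, which name the output file
awk '{
    split($4, d, "[[/:]")
    print > ("access.apache." d[2] "_" d[3] "_" d[4] ".log")
}' access.log
```

After this, each line of access.log ends up in the file for its day, e.g. access.apache.15_Oct_2011.log for the sample line.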






7 Answers



One way using awk:




awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])

    print > (FILENAME "-" year "_" month ".txt")
}' incendiary.ws-2009



This will output files like:


incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt



Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.



Original inspiration: "How to split existing apache logfile by month?"





You are right, Sir. I've just tested the perl solution as well, and the awk solution was faster by 3x. I suspect it has to do with the fact that the awk example doesn't use regular expressions but simple string splitting, which might be more efficient. Marking as the accepted answer.
– mr.b
Jul 30 '12 at 10:56





I've just updated it to have a better filename output.
– Theodore R. Smith
Jul 30 '12 at 12:20





Oh, and I'm definitely using this on production against 20 GB files with no problems now. Takes about 2 GB/minute on my system.
– Theodore R. Smith
Jul 30 '12 at 12:22





Similar performance here as well: ~1 minute / ~2.5 GB. Thanks!
– mr.b
Jul 30 '12 at 14:20





A bit silly bragging about performance when it doesn't fulfill the question ...
– erjiang
Oct 9 '13 at 14:31



Pure bash, making one pass through the access log:


while read; do
    [[ $REPLY =~ \[(..)/(...)/(....): ]]

    d=${BASH_REMATCH[1]}
    m=${BASH_REMATCH[2]}
    y=${BASH_REMATCH[3]}

    #printf -v fname "access.apache.%s_%s_%s.log" "${BASH_REMATCH[@]:1:3}"
    printf -v fname "access.apache.%s_%s_%s.log" $y $m $d

    echo "$REPLY" >> $fname
done < access.log





The method in my answer is dramatically faster: Against a 150 MB log file, this answer took 70 seconds on a 3.4 GHz 8 Core Xeon E31270, while the method in mine took 5 seconds.
– Theodore R. Smith
Jul 30 '12 at 1:18





Not surprising :)
– chepner
Jul 30 '12 at 14:10





However, this answer creates log files on a daily basis, not a monthly one. This does less, no wonder it is faster.
– i.am.michiel
Dec 4 '15 at 8:39





@i.am.michiel The reason this is slower is that iterating through the input is much faster in awk than in bash; the number of output files is not really relevant.
– chepner
Dec 4 '15 at 12:32
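The gap is easy to reproduce (a rough sketch; the file name and line count are made up, and timings will vary by machine):

```shell
# Generate a 1,000,000-line dummy log
yes '127.0.0.1 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 2' \
    | head -n 1000000 > big.log

# awk consumes all lines in one tight internal loop...
time awk 'END { print NR }' big.log

# ...while bash runs its loop body once per line, far more slowly
time bash -c 'n=0; while read -r line; do n=$((n+1)); done; echo $n' < big.log
```

Both commands print 1000000; only the elapsed times differ.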





Perl came to the rescue:


cat access.log | perl -n -e 'm@\[(\d{1,2})/(\w{3})/(\d{4}):@; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;'



Well, it's not exactly a "standard" manipulation program, but it's made for text manipulation nevertheless.



I've also changed the order of the arguments in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.





This worked for me where the marked answer didn't, and was acceptably performant.
– cmenning
Jan 20 '14 at 19:56



Here is an awk version that outputs lexically sortable log files.





Some efficiency enhancements: everything is done in one pass; fname is only regenerated when it differs from the previous line's; fname is closed when switching to a new file (otherwise you might run out of file descriptors).




awk -F"[[/:]" '
BEGIN {
    m2n["Jan"] = 1; m2n["Feb"] = 2; m2n["Mar"] = 3; m2n["Apr"] = 4;
    m2n["May"] = 5; m2n["Jun"] = 6; m2n["Jul"] = 7; m2n["Aug"] = 8;
    m2n["Sep"] = 9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
    if($4 != pyear || $3 != pmonth || $2 != pday) {
        pyear  = $4
        pmonth = $3
        pday   = $2

        if(fname != "")
            close(fname)

        fname = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
    }
    print > fname
}' access-log



Kind of ugly, that's bash for you:


for year in 2010 2011 2012; do
    for month in jan feb mar apr may jun jul aug sep oct nov dec; do
        for day in $(seq -w 1 31); do
            grep -i "$day/$month/$year" access.log > "$day-$month-$year.log"
        done
    done
done





very clever, thanks ;) this would work great for a small file (filesize less than the amount of RAM), as it loops through the entire file about 1,116 times :)
– mr.b
Jul 27 '12 at 12:55






very true, it's not an efficient script. it would be good for occasional use. Thanks!
– ncultra
Jul 27 '12 at 13:16





it would be faster to unroll the outer loop and process the file in two passes: on the first pass, split the file into entries by year; the second pass would then process each year file and split the entries by date. It may even be faster to unroll the second loop and process the file in three passes.
– ncultra
Jul 27 '12 at 13:31
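The two-pass idea could be sketched like this (file names are hypothetical; the first pass cuts the big file into per-year files, the second splits each much smaller year file by day):

```shell
# Hypothetical two-line access.log for demonstration
printf '%s\n' \
    '1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1' \
    '1.2.3.4 - - [03/Jan/2010:01:00:00 +0000] "GET / HTTP/1.1" 200 1' > access.log

# Pass 1: one grep per year over the big file
for year in 2010 2011 2012; do
    grep "/$year:" access.log > "access-$year.log"
done

# Pass 2: 372 greps per year file, but each over a much smaller input
for year in 2010 2011 2012; do
    for month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do
        for day in $(seq -w 1 31); do
            grep "\[$day/$month/$year:" "access-$year.log" > "$day-$month-$year.log"
        done
    done
done
```

Each daily file then only ever sees its own year's data; empty files are created for days with no traffic.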



I combined Theodore's and Thor's solutions to use Thor's efficiency improvement and daily files, but retain the original support for IPv6 addresses in combined-format files.


awk '
BEGIN {
    m2n["Jan"] = 1; m2n["Feb"] = 2; m2n["Mar"] = 3; m2n["Apr"] = 4;
    m2n["May"] = 5; m2n["Jun"] = 6; m2n["Jul"] = 7; m2n["Aug"] = 8;
    m2n["Sep"] = 9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
    split($4, a, "[[/:]")
    if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
        pyear  = a[4]
        pmonth = a[3]
        pday   = a[2]

        if(fname != "")
            close(fname)

        fname = sprintf("access_%04d_%02d_%02d.log", a[4], m2n[a[3]], a[2])
    }
    print > fname
}' access.log





This is really impressive! Thank you
– Theodore R. Smith
Jul 27 '17 at 8:14



I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.


#!/usr/bin/awk -f

BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}

{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])

    current = year "-" month
    if (last != current)
        print current
    last = current

    print >> (FILENAME "-" year "-" month ".txt")
}


Also I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.


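A self-contained way to try it (the script name split-by-month.awk is made up; on macOS, substitute gawk for awk):

```shell
# Write the script above to a file (hypothetical name)
cat > split-by-month.awk <<'EOF'
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])

    current = year "-" month
    if (last != current)
        print current
    last = current

    print >> (FILENAME "-" year "-" month ".txt")
}
EOF

# Hypothetical one-line log to demonstrate; the script prints each
# new year-month combination ("2011-10" here) as progress output
echo '127.0.0.1 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 2' > access.log
awk -f split-by-month.awk access.log
```

The split lines land in access.log-2011-10.txt; with gawk installed via Homebrew, `gawk -f split-by-month.awk access.log` does the same.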





