A little while back I needed to split a large Apache log file (3.5GB) into smaller pieces, with one file for each day. The log covered three months, so that meant 92 files. I initially tried using "grep" on the command line, but this proved to be too slow: each run extracts just one day, so I would have needed 92 separate passes over the file. I ended up writing a quick little PHP script which was able to split the 3.5GB file into 92 files, one for each day of each month, in under 4 minutes.
The script is as follows:
<?php
$months = array(
    'Jul' => '07',
    'Aug' => '08',
    'Sep' => '09'
);

$open_file = false;
$fw = false;

$fp = fopen('filename.log', 'r');
while($line = fgets($fp)) {

    preg_match(':([0-9]{2})/(Jul|Aug|Sep):', $line, $matches);
    // $matches[1] = day of month
    // $matches[2] = month name

    $filename = "filename.log.2007{$months[$matches[2]]}{$matches[1]}";

    if($filename != $open_file) {
        if($fw) {
            fclose($fw);
        }
        $fw = fopen($filename, 'a');
        $open_file = $filename;
        echo "$filename\n";
    }

    fputs($fw, $line);

}

fclose($fw);
fclose($fp);
?>
A breakdown of the script follows below.
This first part initialises an array of month names. The script will search the Apache log file for lines containing these names. Apache abbreviates month names in its log files to three letters, with an uppercase first letter and lowercase for the others. This is an associative array with the month name as the key and the month number, with a leading zero, as the value. Each output file will be named with the year, month number and day number appended to the end.
$months = array( 'Jul' => '07', 'Aug' => '08', 'Sep' => '09' );
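If the log covers other months, the same map can be extended to all twelve. The following is only a sketch of that idea; the `$alternation` variable is my own addition and shows how the list of month names for the regular expression could be derived from the array keys rather than typed out twice.

```php
<?php
// Full month map (assumption: the original script only needed Jul to Sep).
$months = array(
    'Jan' => '01', 'Feb' => '02', 'Mar' => '03', 'Apr' => '04',
    'May' => '05', 'Jun' => '06', 'Jul' => '07', 'Aug' => '08',
    'Sep' => '09', 'Oct' => '10', 'Nov' => '11', 'Dec' => '12',
);

// Build the "Jan|Feb|...|Dec" part of the regular expression from the keys.
$alternation = implode('|', array_keys($months));

echo $alternation, "\n";
```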
This next bit of code initialises two variables to false. $open_file holds the name of the file currently open for writing (ie one of the day-of-month files). $fw is the handle to that file (f = file, w = write, hence fw).
$open_file = false; $fw = false;
Now the script opens the file for reading, assigning the file pointer to $fp. Here "filename.log" stands for the name of the file being processed. The script is executed from the directory the file is in, and all the output files are written to that same directory. This could be changed by prepending absolute paths to each filename.
$fp = fopen('filename.log', 'r');
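fopen() returns false when the file cannot be opened, and every later read would then fail. A minimal sketch of a guard for that case follows; the helper name `open_log` is hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: open a log file for reading, or throw with a clear message.
function open_log($path)
{
    // fopen() returns false on failure; @ suppresses the PHP warning so the
    // problem can be reported through the exception instead.
    $fp = @fopen($path, 'r');
    if ($fp === false) {
        throw new RuntimeException("Could not open $path for reading");
    }
    return $fp;
}
```

It would be called as `$fp = open_log('filename.log');`, or with an absolute path in front of the filename if the log lives in another directory.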
The script now loops through the source file, one line at a time.
while($line = fgets($fp)) { ... }
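One subtlety with this loop condition: fgets() returns false at the end of the file, but a final line consisting only of "0" with no trailing newline is also treated as false by PHP, so the loop would stop before processing it. That is unlikely in an Apache log, but comparing strictly against false avoids it. A small sketch, using an in-memory stream (php://memory) as a stand-in for the real log file:

```php
<?php
// An in-memory stream stands in for the log file in this sketch.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "first line\nsecond line\n0");
rewind($fp);

$count = 0;
// Strict comparison: stop only at end of file, not on a falsy line such as "0".
while (($line = fgets($fp)) !== false) {
    $count++;
}
fclose($fp);

echo "$count\n"; // prints 3
```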
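A regular expression is applied to the line to search for a two-digit day followed by a slash and a month name, eg 25/Jul or 05/Aug. The colons at the start and end of the pattern are the regular expression delimiters and are not part of what is matched. The resulting matches are put into the $matches array, where index [1] is the day of the month and [2] is the month name, as described in the comments directly below the regular expression.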
preg_match(':([0-9]{2})/(Jul|Aug|Sep):', $line, $matches);
// $matches[1] = day of month
// $matches[2] = month name
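To see what ends up in $matches, here is the same expression applied to a single line. The log line is made-up sample data in Apache's common log format, not taken from the real file, and the check on the return value is my own addition.

```php
<?php
// A made-up line in Apache's common log format (sample data only).
$line = '127.0.0.1 - - [25/Jul/2007:10:15:32 +0100] "GET /index.html HTTP/1.0" 200 2326';

// The leading and trailing colons are the pattern delimiters, so the
// expression itself is ([0-9]{2})/(Jul|Aug|Sep).
$found = preg_match(':([0-9]{2})/(Jul|Aug|Sep):', $line, $matches);

if ($found === 1) {
    echo "day={$matches[1]} month={$matches[2]}\n"; // prints day=25 month=Jul
} else {
    echo "no date found\n";
}
```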
The filename is then created from the regular expression matches. A couple of examples of filenames (based on the regular expression examples previously) would be filename.log.20070725 and filename.log.20070805.
$filename = "filename.log.2007{$months[$matches[2]]}{$matches[1]}";
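The year is hard-coded as 2007 here, which was fine for my log. If the file spanned more than one year, the year could be captured from the log line as well. The following is a sketch of that variation rather than what the script does; the function name `daily_filename`, its parameters and the wider regular expression are all assumptions of mine.

```php
<?php
// Hypothetical helper: build the per-day filename for one log line,
// or return null when the line has no recognisable date.
function daily_filename($line, array $months, $base = 'filename.log')
{
    // Capture day, month name and four-digit year from e.g. [05/Aug/2007:
    if (preg_match('#\[([0-9]{2})/([A-Za-z]{3})/([0-9]{4}):#', $line, $m) !== 1) {
        return null;
    }
    if (!isset($months[$m[2]])) {
        return null; // month name is not in the map
    }
    return $base . '.' . $m[3] . $months[$m[2]] . $m[1];
}

$months = array('Jul' => '07', 'Aug' => '08', 'Sep' => '09');
// Sample line, made up for this example.
$sample = '10.0.0.1 - - [05/Aug/2007:00:00:01 +0100] "GET / HTTP/1.0" 200 512';
echo daily_filename($sample, $months), "\n"; // prints filename.log.20070805
```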
The next bit of code checks whether the filename calculated above is the same as the value assigned to $open_file, which at first will be false. If they are not the same, as will be the case on the first loop through, a check is first done to see if $fw is already open, and if so it is closed. Then the file with the new filename is opened in append mode, assigning the file handle to $fw. The filename is then written to standard output to show the person running the script how far it has got.
if($filename != $open_file) {
    if($fw) {
        fclose($fw);
    }
    $fw = fopen($filename, 'a');
    $open_file = $filename;
    echo "$filename\n";
}
The final part of the loop is to write the line out to $fw.
fputs($fw, $line);
After the loop has completed, the open file handles are closed.
fclose($fw); fclose($fp);
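One thing the script does not handle is a line with no matching date: $matches is then empty, so the filename becomes filename.log.2007 and the line is written there. The sketch below wraps the same steps in a function and skips such lines instead. The function name `split_log_by_day`, the $outDir and $year parameters and the skip behaviour are all my assumptions, not part of the original script.

```php
<?php
// Hypothetical wrapper around the steps described above.
// Returns the number of lines skipped because no date was found.
function split_log_by_day($source, $outDir, array $months, $year = '2007')
{
    $fp = fopen($source, 'r');
    if ($fp === false) {
        throw new RuntimeException("Could not open $source");
    }

    // Same expression as before, with the month names taken from the map.
    $pattern = ':([0-9]{2})/(' . implode('|', array_keys($months)) . '):';
    $openFile = false; // name of the file currently open for writing
    $fw = false;       // handle to that file
    $skipped = 0;      // lines with no recognisable date

    while (($line = fgets($fp)) !== false) {
        if (preg_match($pattern, $line, $matches) !== 1) {
            $skipped++;
            continue;
        }
        $filename = $outDir . '/' . basename($source)
            . ".{$year}{$months[$matches[2]]}{$matches[1]}";

        // Only switch output files when the day changes.
        if ($filename !== $openFile) {
            if ($fw) {
                fclose($fw);
            }
            $fw = fopen($filename, 'a');
            if ($fw === false) {
                throw new RuntimeException("Could not open $filename");
            }
            $openFile = $filename;
        }
        fputs($fw, $line);
    }

    if ($fw) {
        fclose($fw);
    }
    fclose($fp);
    return $skipped;
}
```

With the $months array from earlier, it could be called as `split_log_by_day('filename.log', '.', $months);` to write the daily files into the current directory.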
So in just a few lines of PHP code we have a fairly efficient way of breaking up an Apache log file into separate files, one for each day, and it runs pretty quickly. There are bound to be even faster ways of doing this (eg Perl may well do it faster), but this worked well for me.