Analyzing data in Unix using PHP CLI.

Posted: Wednesday, February 4th, 2009 at 4:38 pm
Updated: Friday, April 3rd, 2009 at 5:16 pm

There are times when I need to pull some quick numbers out of a simple tab- or space-delimited file, most often a log file. The files I work on are often huge, by my standards at least: about 1GB. What I need is usually a simple statistic about the file, such as how many unique URLs it contains, a list of the unique IPs, and so on.

One way to do this is to transfer the file to my local workstation and open it in Excel. Well, this may work for smaller files. Excel 2003 and earlier are limited to 65,536 rows per sheet, and anything past that is truncated. Excel 2007 raised the limit to 1,048,576 rows, which a big log file can still exceed. Moreover, transferring a 1GB file over the network is not a good option, I think, as the transfer takes some time.

Fortunately, I normally work on a Unix / Linux box. One of the super nice features of the Unix shell is the pipe (|). Basically, it connects the output of one command to the input of the next command. And since I mostly code in PHP, naturally I wanted to use it to process data.
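As a minimal sketch of the idea, the pipeline below feeds the output of printf to sort, and the output of sort to head. The three input words are made up for illustration and do not come from a real log file.

```shell
# printf writes three lines; sort orders them; head keeps only the first one.
first=$(printf 'pear\napple\nmango\n' | sort | head -n 1)
echo "$first"   # prints: apple
```

Each command only reads its standard input and writes its standard output, so any of the stages can be swapped for another tool, including PHP.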

PHP Command Line

It goes without saying that if you want to use PHP on the command line, you have to have PHP CLI (Command Line Interface) installed on your box. My Mac OS X 10.5.5 has it already, and on Ubuntu you just do apt-get install php5-cli. I’m not going to elaborate on installing PHP CLI, as it is beyond the scope of this article.

For command line processing, there are 3 important options in PHP that you need to know about:

  • -B : This option specifies that the segment of PHP code following it is to be executed once, before any input is processed.
  • -R : This option specifies that the segment of PHP code following it is to be executed once per input line. Two special variables are defined for this code: $argn holds the current line and $argi holds the line number.
  • -E : This option specifies that the segment of PHP code following it is to be executed once, after all the input has been processed.

Combining the 3 options above, we can write a single PHP command line that defines initialization, per-line, and post-processing code.
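The shape of such a command may be easier to see as a plain shell loop first. The sketch below is only an analogy, not how PHP implements these options. The variable name count and the three input lines are made up for illustration.

```shell
# Plain-shell analogue of the three phases (illustrative only).
count=0                          # like -B: runs once, before any input
while IFS= read -r line; do      # like -R: runs once per input line
    count=$((count + 1))         # $line plays the role of $argn here
done <<'EOF'
first
second
third
EOF
echo "lines seen: $count"        # like -E: runs once, after all input
```

With php, the three pieces of code go after -B, -R and -E respectively, and PHP supplies the loop itself.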

Need to know small Unix tools

In addition to the PHP command line options above, I think it’s also good to know these basic Unix tools, which come with most Unix flavors (including Linux, FreeBSD and Mac OS X):

  • sort : This tool takes unsorted input lines and outputs them sorted.
  • uniq : This tool takes the input and drops repeated lines. Note: uniq only compares each line with the one immediately before it. For example, if your input is a a a b c c c a c a a (one item per line), the output will be a b c a c a. Therefore, to get truly unique output, the input must be sorted first.
  • wc : This tool’s name stands for word count. You can use it to count words, or if you specify -l (that is lower case L) it will count lines.
  • grep : This tool takes a regular expression and outputs the lines that match it. Very useful for searching files for a particular occurrence of a string.
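To make the uniq note above concrete, here is a small sketch that runs the same sample input through uniq with and without sorting first. The variable names are my own, and paste is used only to print each result on one line.

```shell
# The sample input from the uniq note above, one letter per line.
input=$(printf '%s\n' a a a b c c c a c a a)

# uniq alone only collapses repeats that are next to each other.
adjacent=$(printf '%s\n' "$input" | uniq | paste -s -d ' ' -)
echo "uniq only:   $adjacent"

# Sorting first brings equal lines together, so each value survives once.
unique=$(printf '%s\n' "$input" | sort | uniq | paste -s -d ' ' -)
echo "sort | uniq: $unique"

# wc -l counts the remaining lines (tr strips the padding some wc versions add).
distinct=$(printf '%s\n' "$input" | sort | uniq | wc -l | tr -d ' ')
echo "distinct values: $distinct"
```

The first result is a b c a c a, the second is a b c, and the count is 3.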

There are other tools like cut, awk, sed etc. for processing files. However, since we’ll be using PHP, I think we can get by without them for now.

Analyzing text data

In this example, I’m going to use piping (|) in Unix to process the output of the ls command as my data. Here’s what I have for my sample data.

MacOSX:~ user$ ls -l
total 18540144
drwx------+  4 user  staff        1190 Feb  4 13:57 Desktop
drwx------+  5 user  staff         408 Dec 15 12:59 Documents
drwx------+  7 user  staff        4182 Feb  4 11:09 Downloads
drwx------+ 37 user  staff        1326 Jan 30 09:15 Library
drwx------+  2 user  staff         170 Jan 30 09:44 Movies
drwx------+  4 user  staff         714 Oct  9 16:18 Music
drwx------+  2 user  staff         204 Oct  6 09:46 Pictures
drwxr-xr-x+  4 user  staff         238 Oct 21 17:25 Public
drwxr-xr-x+  3 user  staff         306 Nov 21 18:34 Sites
-rwxr-xr-x@  1 user  staff  6767902720 Oct 21 23:42 VistaIE7.vmdk
-rw-r--r--@  1 root  staff      100864 Dec  4 21:43 Whitelist result.xls
-rwxr-xr-x@  1 user  staff  1302790144 Oct 21 12:54 XPSP3IE6.vmdk
-rwxr-xr-x@  1 user  staff  1421672448 Oct 21 16:16 XPSP3IE8.vmdk
drwxr-xr-x   1 user  staff        4096 Feb  4 10:52 addev
-rw-r--r--@  1 user  staff        1367 Jan 30 10:25 addev - ad_reports.odb
drwxr-xr-x   4 user  staff         238 Dec  2 12:46 bin
drwxrwxrwx   4 user  staff        2652 Feb  4 14:57 jajal
drwxr-xr-x   3 user  staff         272 Dec 31 12:18 musicAttributionAd
-rw-r--r--   1 user  staff       71515 Jan 21 14:41
MacOSX:~ user$

Let’s say that from the data above, you want to count how many lines start with the letter d. Of course you can do this using options to the ls command. But let’s pretend that the output of ls is the content of a file we want to analyze. The simplest way to do this is to use the regular Unix command grep to get the lines that start with the letter d, then pipe the output to wc with -l to count the lines. So here’s how it looks:

MacOSX:~ user$ ls -l | grep "^d" | wc -l
MacOSX:~ user$
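As an aside that the command above does not use, grep can also do the counting itself: its -c option prints the number of matching lines instead of the lines. The sketch below runs both forms on three made-up lines shaped like ls -l output, so the result does not depend on any real directory.

```shell
# Three made-up lines in ls -l format: two directories and one regular file.
sample=$(printf '%s\n' \
  'drwxr-xr-x  4 user  staff  238 Dec  2 12:46 bin' \
  '-rw-r--r--  1 user  staff  100 Jan 21 14:41 notes.txt' \
  'drwxr-xr-x  3 user  staff  306 Nov 21 18:34 Sites')

# grep piped into wc -l, as in the command above (tr strips wc's padding).
with_wc=$(printf '%s\n' "$sample" | grep '^d' | wc -l | tr -d ' ')

# grep -c counts the matching lines directly.
with_c=$(printf '%s\n' "$sample" | grep -c '^d')

echo "grep | wc -l: $with_wc"
echo "grep -c:      $with_c"
```

Both forms report 2 for this sample.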

We can also use PHP CLI to do that. I acknowledge that it’s overkill to use PHP CLI here, but I think it’s a good introduction to more complex commands. So here’s the PHP CLI version.

MacOSX:~ user$ ls -l | php -B '$count = 0;' \
> -R 'if ($argn[0] == "d") $count++;' \
> -E 'echo "\t$count\n";'
MacOSX:~ user$

Let’s analyze what we just did. First, we take the output of ls -l and pipe it to PHP CLI. On the PHP side, we need to initialize the variable $count; we use the -B option for that. Then for each line, we check whether it starts with the letter d, and if so, we increment $count. Finally, when all lines have been processed, we output the count.

Using a similar technique, we can sum the total bytes used. Here’s how to do it.

MacOSX:~ user$ ls -l | php -B '$count = 0;' \
> -R '$arr = preg_split("/ +/", $argn); if (isset($arr[4])) $count += $arr[4];' \
> -E 'echo "$count\n";'
MacOSX:~ user$

For my last example, let’s say I want to list the owners of the files. I should get user and root as the result. There are 2 ways to do this. One way is to collect all the owners in PHP and output them at the end, like below:

MacOSX:~ user$ ls -l | php -B '$result = array();' \
> -R '$arr = preg_split("/ +/", $argn);
> if (isset($arr[2])) $result[$arr[2]] = 1;' \
> -E 'foreach ($result as $key => $val) echo "$key\n";'
MacOSX:~ user$

Or we can simply process each line, have PHP output just the owner name, and pipe that to other Unix tools to sort it and keep the unique entries, like below:

MacOSX:~ user$ ls -l | php -R '$arr = preg_split("/ +/", $argn);
> if (isset($arr[2])) echo $arr[2] . "\n";' | sort | uniq
MacOSX:~ user$

In the pure PHP CLI solution, we first need to initialize the array using the -B option. Then we split each line into fields and, since we just want the unique values, use the owner name as a key of the array. The value we assign doesn’t matter. At the end, we loop over the keys of the $result array and print them.

In the hybrid of PHP CLI and other Unix tools, all we have to do is print the owner names and let sort and uniq take care of the rest. I like the 2nd option, as it means less typing of PHP code.

Other possible solutions

Just like everything else in Unix, there are many ways to do the same thing. It’s up to you to use whichever suits you best. There’s also a program called awk that’s pretty straightforward and needs less typing to achieve what we did above. Here are the awk versions. Special thanks to my friend Jamsa for helping me out with this section of the article.

MacOSX:~ user$ ls -l | awk '{if ($1 ~/^d/) print $1;}' | wc -l
MacOSX:~ user$ ls -l | awk '{sum += $5} END {print sum}'
MacOSX:~ user$ ls -l | awk '{print $3;}' | sort | uniq
MacOSX:~ user$
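The pure PHP approach from earlier, where owner names become array keys, has a direct awk counterpart, since awk arrays are also keyed by string. Below is a rough sketch on made-up ls -l style lines. The array name seen is my own choice, and the NF >= 3 guard is an addition of mine that skips short lines such as the total header.

```shell
# Made-up sample in ls -l format, including the "total" header line.
sample=$(printf '%s\n' \
  'total 16' \
  'drwxr-xr-x  4 user  staff  238 Dec  2 12:46 bin' \
  '-rw-r--r--  1 root  staff  100 Dec  4 21:43 report.xls' \
  'drwxr-xr-x  3 user  staff  306 Nov 21 18:34 Sites')

# Record each owner (field 3) as an array key, then print the keys at the end.
# awk does not promise any order for "for (o in seen)", hence the final sort.
owners=$(printf '%s\n' "$sample" |
  awk 'NF >= 3 { seen[$3] = 1 } END { for (o in seen) print o }' | sort)
echo "$owners"
```

For this sample the result is root and user, one per line.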

As you can probably see, awk is much better than PHP CLI for this purpose (less typing, and thus less chance of bugs / typos). However, for people like myself who are much more proficient with PHP than awk, this article can perhaps help expand the use of PHP beyond web programming. And as always, I welcome comments / questions / criticism that will help me and other readers understand better.
