I first tried serialize(), then json_encode(), then pack() when I realized I'd only be storing integers. All of this was to try and minimize the bytes needed to store an array of integers, so I devised these two functions:
<?
function base128encode($ints){
$r='';
foreach($ints as $int){
$int=(int)$int;
$w=chr($int&127);
while($int>127) {
$int>>=7;
$w=chr(($int&127)|128).$w;
}
$r.=$w;
}
return $r;
}
function base128decode($data){
$ints=Array();$x=0;
while($data[$x]){
$int=0;
do {
$byte=(int)ord($data[$x]);
if($x) $int<<=7;
$int|=($byte&127);
$x++;
} while($byte&128);
$ints[]=$int;
}
return $ints;
}
?>
Here's some stats:
<?
$data=Array(1,20,300,4000,50000,600000,7000000,80000000,900000000);
echo serialize($data)."<br />";
echo json_encode($data)."<br />";
//pack takes a few more lines
$pack='';foreach($data as $v) $pack.=pack('N',$v);
echo $pack."<br />";
echo base128encode($data);
?>
Output from the above (yes you'll see strange characters):
Serialize: (114 bytes)
a:9:{i:0;i:1;i:1;i:20;i:2;i:300;i:3;i:4000;i:4;i:50000;i:5;i:600000;i:6;i:7000000;i:7;i:80000000;i:8;i:900000000;}
json_encode: (55 bytes)
[1,20,300,4000,50000,600000,7000000,80000000,900000000]
pack: (36 bytes)
�����,�� ��ÃP� 'À�jÏÀÄ´�5¤é�
base128encode: (25 bytes)
‚,Ÿ ƒ†P¤Ï@ƒ«Ÿ@¦’è�ƒ“Ò�
Serialize is horrible, it doesn't try and compress integers at all. It actually stores them as you see them in the array, so for every place in the number, that's an extra byte, plus 3 bytes for the "i: ;" to seperate the thing, plus the a:(length):{ } around the whole thing. For the sample data, serialize returned a string 115 bytes long.
json_encode really improves on Serialize, it uses JavaScript Object Notation to encode whatever you throw at it. Although it was able to outdo serialize with smarter splitting of arguments and less bloat surrounding it, it still created one byte per place value and didn't make use of the binary representation of numbers.
Serialize and json_encode are both multi-functional. They store MUCH more than just integers, and I knew that I could exploit the binary representation of the integers to knock off even more bytes, so I tried pack(). Pack has a few ways of representing numbers, 16 bit and 32 bit are built in. I knew 16 bit wouldn't be enough for my needs, so I went with 32 bit unsigned integers. This method uses exactly 4 bytes per integer. This also proved to be sub-optimal, as most of the numbers I needed to store are in the low-end range, with numbers never getting close to 2^32. So I was stuck. I could store the integers in 16 bit format, but I knew I'd have numbers over 2^16 in the future.
I did some research, and it turns out, google has a nice article about this, so I decided to give that a go and wrote the functions.
base128 encoding allows numbers less than 128 to be stored as 1 byte, numbers less than 16384 to be stored as 2 bytes, 2097152 in 3 bytes, 268435456 to be stored in 4 bytes. This is a huge improvement over pack(), which requires 4 bytes for every integer no matter what.
So if you have an array with numbers that are all less than 128, base128encode will be 4x shorter than pack(). The only reason there wasn't such a drastic difference here is because I used larger numbers, it all depends on your data. If you have more numbers closer to 2^32, pack is the way to go, but if you have more numbers less than 2^28, base128encode will save you space.
I don't know how many of you will find this interesting or will be able to use it, but I thought it was interesting enough to make a post on.
Edited by: Sean, Nov 19th, 2010 @ 11:10 am


