Username:
Password: Forgot?
Register
-/+
Shoutbox - History
Sean : only works in webkit browsers I believe
Bibby : Why doesn't Harlem work for me?
Bibby : Wow, a post. Amazing.
sivart0 : merp
Bibby : Mine still is, although it's a bit less active than it was in 2011. http://bibbyteam.jaxboards.com
JJstorm : Wats up checkin in. Forums still alive?
Bibby : I bet woshizifengRW is inorganic, too.
Bibby : (But if she actually has flesh and blood instead of circuits and AI, my apologies.)
Bibby : Amanda has a name that's too great to be anything but a spambot.
1 2
 
Base 128 encode/decode (PHP)
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
There have been a few cases where I wanted to store an array of integers as a serialized string.

I first tried serialize(), then json_encode(), then pack() when I realized I'd only be storing integers. All of this was to try and minimize the bytes needed to store an array of integers, so I devised these two functions:


<?
function base128encode($ints){
 
$r='';
 foreach(
$ints as $int){
  
$int=(int)$int;
  
$w=chr($int&127);
  while(
$int>127) {
   
$int>>=7;
   
$w=chr(($int&127)|128).$w;
  }
  
$r.=$w;
 }
 return 
$r;
}

function 
base128decode($data){
 
$ints=Array();$x=0;
 while(
$data[$x]){
  
$int=0;
  do {
   
$byte=(int)ord($data[$x]);
   if(
$x$int<<=7;
   
$int|=($byte&127);
   
$x++;
  } while(
$byte&128);
  
$ints[]=$int;
 }
 return 
$ints;
}
?>



Here's some stats:

<?
$data
=Array(1,20,300,4000,50000,600000,7000000,80000000,900000000);

echo 
serialize($data)."<br />";
echo 
json_encode($data)."<br />";
//pack takes a few more lines
$pack='';foreach($data as $v$pack.=pack('N',$v);
echo 
$pack."<br />";
echo 
base128encode($data);
?>



Output from the above (yes you'll see strange characters):

Serialize: (114 bytes)
a:9:{i:0;i:1;i:1;i:20;i:2;i:300;i:3;i:4000;i:4;i:50000;i:5;i:600000;i:6;i:7000000;i:7;i:80000000;i:8;i:900000000;}

json_encode: (55 bytes)
[1,20,300,4000,50000,600000,7000000,80000000,900000000]

pack: (36 bytes)
�����,�� ��ÃP� 'À�jÏÀÄ´�5¤é�

base128encode: (25 bytes)
‚,Ÿ ƒ†P¤Ï@ƒ«Ÿ@¦’è�ƒ­“Ò�


Serialize is horrible, it doesn't try and compress integers at all. It actually stores them as you see them in the array, so for every place in the number, that's an extra byte, plus 3 bytes for the "i: ;" to seperate the thing, plus the a:(length):{ } around the whole thing. For the sample data, serialize returned a string 115 bytes long.

json_encode really improves on Serialize, it uses JavaScript Object Notation to encode whatever you throw at it. Although it was able to outdo serialize with smarter splitting of arguments and less bloat surrounding it, it still created one byte per place value and didn't make use of the binary representation of numbers.

Serialize and json_encode are both multi-functional. They store MUCH more than just integers, and I knew that I could exploit the binary representation of the integers to knock off even more bytes, so I tried pack(). Pack has a few ways of representing numbers, 16 bit and 32 bit are built in. I knew 16 bit wouldn't be enough for my needs, so I went with 32 bit unsigned integers. This method uses exactly 4 bytes per integer. This also proved to be sub-optimal, as most of the numbers I needed to store are in the low-end range, with numbers never getting close to 2^32. So I was stuck. I could store the integers in 16 bit format, but I knew I'd have numbers over 2^16 in the future.

I did some research, and it turns out, google has a nice article about this, so I decided to give that a go and wrote the functions.

base128 encoding allows numbers less than 128 to be stored as 1 byte, numbers less than 16384 to be stored as 2 bytes, 2097152 in 3 bytes, 268435456 to be stored in 4 bytes. This is a huge improvement over pack(), which requires 4 bytes for every integer no matter what.

So if you have an array with numbers that are all less than 128, base128encode will be 4x shorter than pack(). The only reason there wasn't such a drastic difference here is because I used larger numbers, it all depends on your data. If you have more numbers closer to 2^32, pack is the way to go, but if you have more numbers less than 2^28, base128encode will save you space.

I don't know how many of you will find this interesting or will be able to use it, but I thought it was interesting enough to make a post on.

Edited by: Sean, Nov 19th, 2010 @ 11:10 am
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
Just noticed that google's method specifies that the serialized string is little endian, while my implentation is big endian.

doesn't change anything, just thought I should mention I didn't follow their method exactly.
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
Posts: 63
Status: Offline
Group: Member
Member: #11
Quote
Read the whole post. Nice research.

I was just wondering though, where would "compressing" integers so to speak come in handy? I mean for example I was thinking it's quite common to store a series of integers in a database row for use in an IN() - but then it has to be unpacked. I realize there's space savings involved but drive space v. dealing with unencoding it... eh.
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
bitwise operations are quite a bit faster than you realize.
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
It all depends on circumstance though, in your case, for utilizing a comma seperated list of integers in an IN(), it's probably not worth it to have to encode and then decode when a comma seperated string would suffice.
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
Posts: 63
Status: Offline
Group: Member
Member: #11
Quote
Wasn't saying they were slow - but there's still added time to deal with it for a questionable amount of space saving in the case that I presented.

And okay, that's what I thought.

So again, I'm not thinking big enough, what is this type of compression used for?
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
You also have to figure in the time required for mysql to read the data from the database, transfer it over to php from the server, and have php interpret the stream. I'm not sure on the specifics, but this would also be really useful in a distributed environment where bandwidth from the mysql server to the php is more costly than the CPU on each machine.

Right now, the only thing I'm using this for is for read markers. Storing a date integer for each user for each topic/forum they view adds up, and the session table really benefits from being as small as possible (since it's accessed and modified so frequently). I was able to cut down the average session table by 75% just by compressing the integers using my base128 functions.
Edited by: Sean, Nov 20th, 2010 @ 11:21 pm
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
also: knocked off a whole millisecond B)
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
Posts: 63
Status: Offline
Group: Member
Member: #11
Quote
ahh I see. Okay, that makes sense then.
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
Avatar
boss.
Posts: 1180
Status: Offline
Group: Admin
Member: #1
Quote
serialize/unserialize and json_encode/decode are also much slower than my base128 functions.
Rate: Awesome!SadUmm....Surprising!Useful! (List)
^ Top
-/+
Users Viewing This Topic
1 2