The power of column stores

  • using column stores instead of row-based stores can reduce access logs from 10 GB to 130 MB of disk space
  • reading compressed log files is 4 times faster than reading uncompressed files from a hard disk
  • column stores can speed up analytical queries by a factor of 18 to 58

Normally, log files from a web server are stored in a single file. For archiving, log files get compressed with gzip. A typical line in a log file represents one request and looks like this:


173.15.3.XXX - - [30/May/2012:00:37:35 +0200] "GET /cms/ext/files/Sgs01Thumbs/sgs_pmwiki2.jpg HTTP/1.1" 200 14241 "http://www.simple-groupware.de/cms/ManualPrint" "Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0"
Compression speeds up reading the log file:

$start = microtime(true);
$fp = gzopen("httpd.log.gz", "r");
while (!gzeof($fp)) gzread($fp, 8192);
gzclose($fp);
echo (microtime(true)-$start)."s\n"; // 26s

$start = microtime(true);
$fp = fopen("httpd.log", "r");
while (!feof($fp)) fread($fp, 8192);
fclose($fp);
echo (microtime(true)-$start)."s\n"; // 105s
(PHP 5.4.5, 2.5 GHz CPU, hard disk with 7200 rpm)

A 10 GB log file compresses to 600 MB using gzip. This is already quite good, but can we do better?

In the example line, we have different attributes (= columns) separated by spaces and enclosed in quotes or brackets. For example:


ip=173.15.3.XXX
date=30/May/2012:00:37:35 +0200
status=200
length=14241
url=http://www.simple-groupware.de/cms/ManualPrint
agent=Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
etc.

When we save each column in a separate file, we get smaller files after compression. The file for the HTTP status code column, for example, contains many identical values (mostly 200), which allows much better compression.

Coming from 600 MB, we can reduce the total size to 280 MB by saving each column to a different file.

Analyzing files containing only one column is also much easier. For example, we can do a group by over all status codes on the shell with:


time zcat status.log.gz | sort | uniq -c
36689940 200
11880 206
124560 301
1142820 302
968040 304
3600 401
37080 403
784260 404
180 405
6480 408

real 0m31.600s
user 0m27.038s
sys 0m1.436s
Note: instead of sorting, it is faster to fill a small hash table with the counts, as in the sketch below.
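
A minimal PHP sketch of that hash-table approach, assuming the status.log.gz file from the examples above (the output format mirrors uniq -c; timings will differ by machine):

$counts = [];
$fp = gzopen("status.log.gz", "r");
while (!gzeof($fp)) {
    $code = trim(gzgets($fp, 64));
    if ($code === "") continue;      // skip empty reads at the end of the file
    if (!isset($counts[$code])) $counts[$code] = 0;
    $counts[$code]++;                // one hash table entry per distinct status code
}
gzclose($fp);
ksort($counts);
foreach ($counts as $code => $count) echo $count." ".$code."\n";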

Counting is also very fast:


# get number of requests with status code equal to 200
time zcat status.log.gz | grep -E "^200$" | wc -l
36689940

real 0m3.078s
user 0m2.944s
sys 0m0.128s

# get number of requests with status code not equal to 200
time zcat status.log.gz | grep -Ev "^200$" | wc -l
3078900

real 0m1.799s
user 0m1.736s
sys 0m0.060s

Summing:


# sum up all transferred bytes (5.5951e+11 ~ 521 GB)
time zcat length.log.gz | awk '{s+=$1}END{print s}'
5.5951e+11

real 0m5.708s
user 0m5.556s
sys 0m0.144s

The biggest column is normally the one containing URLs (118 MB in our example). We can reduce the size by assigning a unique ID to each URL and saving the list of distinct URLs in a separate file.

Coming from 280 MB, we can reduce the total size to 130 MB by replacing URLs, referers and user agents with IDs and storing the distinct values in two additional list files (one shared by URLs and referers, one for user agents).

Here is the code for splitting a log file into column-based files:

$urls = [];   // URL/referer => ID (written to urls_list.log.gz)
$agents = []; // user agent => ID (written to agent_list.log.gz)
$split_urls = true;   // replace URLs and referers with IDs
$split_agents = true; // replace user agents with IDs

$fp_ip = gzopen("ip.log.gz", "w");
$fp_user = gzopen("user.log.gz", "w");
$fp_client = gzopen("client.log.gz", "w");
$fp_date = gzopen("date.log.gz", "w");
$fp_url = gzopen("urls.log.gz", "w");
$fp_url_list = gzopen("urls_list.log.gz", "w");
$fp_status = gzopen("status.log.gz", "w");
$fp_length = gzopen("length.log.gz", "w");
$fp_referer = gzopen("referer.log.gz", "w");
$fp_agent = gzopen("agent.log.gz", "w");
$fp_agent_list = gzopen("agent_list.log.gz", "w");

$fp = gzopen("httpd.log.gz", "r");
while (!gzeof($fp)) {
    $line = gzgets($fp, 8192);
    preg_match("!^([0-9\.]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] \"(?:GET )?([^\"]*)".
        " HTTP/1\.[01]\" ([^ ]+) ([^ ]+) \"([^\"]*)\" \"([^\"]*)\"\$!", $line, $m);
    if (empty($m)) {
        echo $line." ###"; // output broken lines
        continue;
    }
    gzwrite($fp_ip, $m[1]."\n");
    gzwrite($fp_user, ($m[2]=="-" ? "" : $m[2])."\n");
    gzwrite($fp_client, ($m[3]=="-" ? "" : $m[3])."\n");
    gzwrite($fp_date, $m[4]."\n");

    if ($split_urls) {
        // assign the next free ID (0, 1, 2, ...) on first occurrence, then store only the ID
        if (!isset($urls[$m[5]])) $urls[$m[5]] = count($urls);
        gzwrite($fp_url, $urls[$m[5]]."\n");
    } else {
        gzwrite($fp_url, $m[5]."\n");
    }
    gzwrite($fp_status, $m[6]."\n");
    gzwrite($fp_length, ($m[7]=="-" ? "0" : $m[7])."\n");

    if ($m[8]=="-") {
        gzwrite($fp_referer, "\n");
    } else if ($split_urls) {
        if (!isset($urls[$m[8]])) $urls[$m[8]] = count($urls);
        gzwrite($fp_referer, $urls[$m[8]]."\n");
    } else {
        gzwrite($fp_referer, $m[8]."\n");
    }
    if ($m[9]=="-") {
        gzwrite($fp_agent, "\n");
    } else if ($split_agents) {
        if (!isset($agents[$m[9]])) $agents[$m[9]] = count($agents);
        gzwrite($fp_agent, $agents[$m[9]]."\n");
    } else {
        gzwrite($fp_agent, $m[9]."\n");
    }
}
// write the dictionaries: the URL/agent with ID n ends up on line n+1
gzwrite($fp_url_list, implode("\n", array_keys($urls)));
gzwrite($fp_agent_list, implode("\n", array_keys($agents)));

gzclose($fp);
gzclose($fp_ip);
gzclose($fp_user);
gzclose($fp_client);
gzclose($fp_date);
gzclose($fp_url);
gzclose($fp_url_list);
gzclose($fp_status);
gzclose($fp_length);
gzclose($fp_referer);
gzclose($fp_agent);
gzclose($fp_agent_list);
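
To query the split files later, the dictionaries have to be loaded back into memory so the stored IDs can be resolved to their original values. Here is a minimal sketch that uses the file names written by the script above; the "top 10 URLs" query at the end is just an illustrative example:

// load the URL dictionary: the URL with ID n is at index n
$url_list = array_map("rtrim", gzfile("urls_list.log.gz"));

// example query: top 10 requested URLs, resolved from the IDs in urls.log.gz
$hits = [];
$fp = gzopen("urls.log.gz", "r");
while (!gzeof($fp)) {
    $id = trim(gzgets($fp, 32));
    if ($id === "") continue;
    $url = $url_list[(int)$id];
    if (!isset($hits[$url])) $hits[$url] = 0;
    $hits[$url]++;
}
gzclose($fp);
arsort($hits);
foreach (array_slice($hits, 0, 10, true) as $url => $count) echo $count." ".$url."\n";
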
Note: reading data from disk is always slower than analyzing the data in real time while it is still in memory.
