Sunday, 16 October 2016

Multiple memcached servers question


Hypothetically, if I have multiple memcached servers like this:
//PHP 
$MEMCACHE_SERVERS = array(
    "10.1.1.1", //web1
    "10.1.1.2", //web2
    "10.1.1.3", //web3 
); 
$memcache = new Memcache();
foreach($MEMCACHE_SERVERS as $server){
    $memcache->addServer ( $server ); 
}
And then I set data like this:
$huge_data_for_frong_page = 'some data blah blah blah';
$memcache->set("huge_data_for_frong_page", $huge_data_for_frong_page);
And then I retrieve data like this:
$huge_data_for_frong_page = $memcache->get("huge_data_for_frong_page");
When I want to retrieve this data from the memcached servers, how does the PHP memcached client know which server to query for it? Or is the memcached client going to query all memcached servers?

Solution:

Well, you could write books about that, but the basic principle is that there are a few different approaches.
The most common and sensible approach for caching is sharding. This means each piece of data is stored on only one server, and some method is used to determine which server that is. The data can then be fetched from that very server, so only one server is involved.
This obviously works well in key/value environments such as memcached.
A common practice is to take a hash of the key (it does not need to be cryptographic), calculate this hash MOD the number of servers, and use the result as the index of the server on which you store and fetch the data.
This procedure produces a more or less equal balance across the servers.
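To make the hash-MOD idea concrete, here is a minimal sketch. It assumes CRC32 as the hash and a helper name, pickServer, that is purely hypothetical; it is not the code the Memcache extension actually runs, only an illustration of the principle.

```php
<?php
// Hypothetical helper, not part of the Memcache extension:
// picks a server by hashing the key and taking it modulo the pool size.
function pickServer(string $key, array $servers): string
{
    // crc32() is non-negative on 64-bit PHP builds; the mask keeps the
    // value non-negative on 32-bit builds as well.
    $hash = crc32($key) & 0x7fffffff;
    return $servers[$hash % count($servers)];
}

$servers = array("10.1.1.1", "10.1.1.2", "10.1.1.3");

// The same key always maps to the same server, so a set() and a later
// get() for this key would both go to that one server.
echo pickServer("huge_data_for_frong_page", $servers), "\n";
```

Because the mapping depends only on the key and the server list, every web server that uses the same list in the same order picks the same memcached server for a given key.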
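I don't know exactly which hash the PHP client uses, but it is certainly some sort of hash of the key. The server is chosen on the client side (the memcached servers do not talk to each other), so the client does not query all servers.
But beware that this technique is not highly available: if one server fails, the entries stored on it are gone. So you can obviously only use it for caching purposes.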
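In practice, "caching purposes only" means a miss must always be recoverable from the real data source. The following sketch shows that pattern under some assumptions: ArrayCache is a hypothetical in-memory stand-in so the example runs without a memcached server, and fetchOrRebuild is a made-up helper name, not an API of the Memcache extension.

```php
<?php
// Hypothetical stand-in for a cache client so the sketch runs without a
// memcached server; get() returns false on a miss, like Memcache::get().
class ArrayCache
{
    private $data = array();

    public function get(string $key)
    {
        return array_key_exists($key, $this->data) ? $this->data[$key] : false;
    }

    public function set(string $key, $value): bool
    {
        $this->data[$key] = $value;
        return true;
    }
}

// Hypothetical helper: return the cached value, or rebuild it from the
// authoritative source and store it again when the entry is missing.
function fetchOrRebuild($cache, string $key, callable $rebuild)
{
    $value = $cache->get($key);
    if ($value === false) {
        $value = $rebuild();        // e.g. a slow database query
        $cache->set($key, $value);
    }
    return $value;
}

$cache = new ArrayCache();
$calls = 0;
$build = function () use (&$calls) {
    $calls++;
    return 'some data blah blah blah';
};

$first  = fetchOrRebuild($cache, "huge_data_for_frong_page", $build); // miss: rebuilds
$second = fetchOrRebuild($cache, "huge_data_for_frong_page", $build); // hit: cached
```

If the server holding the key disappears, the next request simply looks like a miss and the value is rebuilt, at the cost of one slow request.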
Other techniques involve replication, for example where high availability is necessary for resources that take a long time to calculate and are automatically warmed up in the background.
The most common form in caching environments is master-master replication with latest-timestamp conflict resolution. This basically means every server gets from every other server the data that is not yet on the local server (this is done using replication logs and byte offsets). If there is a conflict, the latest version is used (the slight time offset between servers is ignored).
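The latest-timestamp rule can be sketched in a few lines. This is only an illustration: the mergeLatest name, the array layout and the timestamps are assumptions, and a real replication system would work from replication logs rather than whole arrays.

```php
<?php
// Hypothetical sketch of latest-timestamp conflict resolution.
// Each store maps key => array('value' => ..., 'ts' => unix timestamp).
function mergeLatest(array $local, array $remote): array
{
    foreach ($remote as $key => $entry) {
        // Take the remote entry if it is missing locally or strictly newer.
        if (!isset($local[$key]) || $entry['ts'] > $local[$key]['ts']) {
            $local[$key] = $entry;
        }
    }
    return $local;
}

$serverA = array(
    'front_page' => array('value' => 'old', 'ts' => 100),
    'only_on_a'  => array('value' => 'a',   'ts' => 105),
);
$serverB = array(
    'front_page' => array('value' => 'new', 'ts' => 120),
);

// Server A pulls from server B: the newer 'front_page' entry wins and
// 'only_on_a' is kept unchanged.
$merged = mergeLatest($serverA, $serverB);
```

Running the merge in the other direction (B pulling from A) ends with the same entries, which is what lets both masters converge.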
In other environments, where for example very little is written but a lot is read, there is often a cascade in which only one or a few master servers are involved and the rest are pure read replicas.
But these setups are very rare, because sharding as described above gives the best performance and data loss is mostly tolerable in caching environments. So sharding is also the default for memcached.
