DNS Failover / High Availability
The native Linux resolver with default settings cannot be relied upon for HA / failover. By default there are two attempts each with a 5 second timeout. Therefore every query has to wait 10 seconds before the secondary DNS server is queried. Linux does not natively mark a DNS node is 'dead' and send all queries to a secondary node, it try the primary upon every single query.
- Use an HA facility that will move the VIP from the primary to the secondary if the primary node fails, such as [www.linux-ha.org/wiki/Heartbeat|Heartbeat]. All resolv.conf nameserver entries point to this VIP.
- Use a Load-balancer with service-detection - not just layer 3.
- Use Name Server Cache Daemon to cache the DNS entries and mitigate the risk of a failure. Not recommended as adding an extra of complexity (caching on the individual servers) may have unpredictable results.
- Use Domain Name Relay Daemon that will allow the DNS client to mark a node as down and not retry it until it is back up again. Only the first queries that were sent before NSCD marked it as down would be affected by the 10 second timeout mentioned above.
- Tweak the resolv.conf settings to reduce the timeout.
Firstly get some stats on how long querys take on your network. BIND does this natively through stats. Once an appropriate maximim query time is defined. The timeout that the Linux resolver waits for before trying to secondary node can be tweaked, this is done through the options timeout:n directive where n isthe number of seconds.
Another directive options rotate can be used to provide pseudo-load-balancing. It is pseudo-load-balancing as upon testing the primary node still receives double the queries than the secondary. This is better than 99% of queries as it is without this directive being applied. Testing with ping will not proove that it is working as a new process will restart the round-robin.
search dc2.uk.eu.company.loca. uk.eu.company.local eu.company.local nameserver 18.104.22.168 nameserver 22.214.171.124 options timeout:2 rotate
This will give two attempts (default) each with a timeout of 2 seconds = 4 seconds until the secondary is called. A more aggressive config would be options timeout:1 attempts:1 rotate which can be used so long as there is no recursion in your environment, otherwise queries which may otherwise have returned may fail.