Implementing resilient applications on top of redundant layer-3-only fabrics (without resorting to tricks like loopback interfaces on servers) is a hard problem that has been successfully solved only in niche domains like SS7 signaling or iSCSI fabrics.
Most application developers are unaware of the complexities involved in designing resilient applications¹. The result is applications that expect to connect to a single IP address per service², which pushes the problem down to the network layer, where it is usually solved in one of these ways:
- Single IP subnet stretched across multiple top-of-rack switches to support server IP addresses floating across multiple redundant uplinks;
- Multi-chassis link aggregation (MLAG), connecting a server to multiple top-of-rack switches while pretending to bundle the redundant server uplinks into a single Ethernet link;
- Running routing protocols on servers and advertising a loopback IP address as the service endpoint;
- Using load balancers in front of servers to make the application service available on a single virtual IP address.
While it's easy to blame application developers for the sad state of resilient application architectures, we should keep in mind that:
- The TCP/IP protocol stack design is broken because it lacks a session layer³;
- The socket API is broken⁴ because it requires the application to specify the transport protocol (example: TCP). SCTP adoption would be much higher if it didn't require application rewrites;
- Solutions like Happy Eyeballs⁵ are kludges that solve the "what IP address should I connect to" challenge, but not the "how do I recover from failures" one;
- Virtualization vendors like VMware never really understood networking challenges, as amply demonstrated by their recommended designs that rely heavily on stretched VLANs or MLAG⁶;
- There is no off-the-shelf library (apart from some Happy Eyeballs implementations⁷) that application developers could use to develop resilient applications.
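To make the points about the socket API and connection setup more concrete, here is a minimal Python sketch of what an application has to do on its own if it wants to try more than one address per service. The function name `connect_first_reachable` and the idea of passing an ordered list of endpoints are assumptions made for this illustration, not an existing library API. The sketch tries the endpoints one after another, which is simpler than the parallel connection racing Happy Eyeballs uses.

```python
import socket


def connect_first_reachable(endpoints, timeout=1.0):
    """Try each (host, port) pair in order and return (socket, endpoint)
    for the first one that accepts a TCP connection.

    Hypothetical helper written for this example only.
    """
    last_error = None
    for host, port in endpoints:
        try:
            # A single name may resolve to several addresses (IPv4 and IPv6).
            # Note that the application has to name the transport here:
            # SOCK_STREAM effectively hard-codes TCP into the application.
            candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror as exc:
            last_error = exc
            continue

        for family, socktype, proto, _canonname, sockaddr in candidates:
            sock = None
            try:
                sock = socket.socket(family, socktype, proto)
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock, (host, port)
            except OSError as exc:
                # Refused, timed out, unreachable...: move on to the next address.
                last_error = exc
                if sock is not None:
                    sock.close()

    raise ConnectionError(f"no endpoint reachable: {last_error}")
```

An application would call it with something like `connect_first_reachable([("192.0.2.10", 443), ("198.51.100.10", 443)])` (documentation addresses used as placeholders). Python's built-in `socket.create_connection` already does a similar walk through the addresses of a single host name. Either way, this only answers the "what IP address should I connect to" question: once the session is established, a failure of the chosen path still kills the TCP connection, and recovering from that is left entirely to the application.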