In our previous blog on Kubernetes (K8s) adoption, we explained how we migrated our workloads to k8s, and the decisions we took along the way. As we migrated more and more applications to k8s, we realized that it would be very useful to have a service mesh, for the following reasons:
1. Monitoring and Traceability: The more microservices you have, the stronger the need for a centralized and robust logging and monitoring system. Even for a single transaction, logs and metrics originate at multiple points, spanning multiple microservices. In case of failures, the service identified as causing issues may not actually be at fault: a failure in any upstream service can produce errors that percolate downstream. Without proper monitoring, it becomes hard to trace the root cause, perform rollbacks, work on performance optimizations, etc.
2. Service Discovery: For one microservice to communicate with another, it must know the whereabouts of the other service. In a VM-based setup this problem is usually solved by putting services behind their own load balancers, whose addresses broadly serve as the services' locations. In a k8s cluster, however, this leads to a hairpin networking design (check fig 5) where multiple services share the same load balancer (LB) and ingress controller, with inter-service communication happening via the load balancer. Though avoidable (using coredns rewrites), it's an easy way of doing a lift-and-shift migration from VMs to k8s, and is something we stuck with ourselves. The hairpin design also hurts observability: metrics like throughput and error rates are aggregated at the load balancer, making it impossible to attribute them to a particular service-to-service communication.
3. Traffic control: When a monolith is disintegrated into microservices, function calls become network calls, and network calls are more prone to failures. A need arises to efficiently route traffic to upstream services, secure communication, provide fault tolerance via circuit breaking, etc.
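As an aside, the coredns rewrite mentioned under Service Discovery above could look roughly like this. This is a hedged sketch; the hostname and service name are hypothetical:

```yaml
# Sketch of a CoreDNS ConfigMap that rewrites an external-style hostname
# to the in-cluster service address, keeping traffic inside the cluster
# instead of hairpinning through the load balancer.
# "orders.example.com" and the "orders" service are illustrative names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        rewrite name orders.example.com orders.default.svc.cluster.local
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf
        cache 30
    }
```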
Other problems associated with adopting a microservice architecture such as API dependencies and service versioning aren’t alleviated with the use of a service mesh and hence are beyond the scope of this blog.
This blog is but a journal on our experience with service meshes, particularly Istio, and should not be taken as the holy grail on ‘How tos’ of service mesh. Prior knowledge of k8s and Istio will help better understand this blog.
In the upcoming sections, we’ll be looking at what a service mesh is and how it has helped us gain better operational control over our 100+ microservices.
What is a service mesh anyway? Why is it so important?
The term service mesh refers to the mesh-like networking structure created by the interconnection of services, usually with the help of a proxy such as Envoy, which runs either at the host level or alongside server instances (as sidecars in k8s). A service mesh sits between DevOps and the developer, making it more of an SRE tool, as it tends to blend infrastructure into application development. Refer to RedHat’s blog on ‘What’s a service mesh?’ for more details.
Most importantly, a service mesh helps isolate operational complexity from business logic thereby providing a pluggable solution.
Picking the right service mesh
Since we’re 100% on AWS, and since AWS services integrate seamlessly within their own ecosystem, choosing App Mesh for prototyping was a no-brainer. However, during our 2-week trial of App Mesh we hit a major blocker: the service-to-service communication timeout was hardcoded to 15 seconds. A few of our services processed large amounts of data and hence required a larger timeout, e.g. planning delivery routes for eat.fit’s food orders. So we had to drop the idea of App Mesh. This blocker was, in fact, rectified later (a tad too late for us).
So our next best bet was between Istio, Linkerd and Consul. Check this for a detailed comparison. While Istio and Linkerd used a sidecar model for the proxy, Consul ran as a daemonset. This creates a single point of failure: if the daemonset fails on a node, every service on that node sees errors. Not that people aren’t used to living with such designs (the kubelet is a single point of failure too), but we’d rather not add another. Moreover, agents running as a daemonset on large nodes such as c5.12xlarge, which hold 100+ pods, tend to get overwhelmed by the sheer volume of data they need to process. This can lead to inconsistent performance, which is unacceptable for a networking component. All these reasons led us to drop Consul.
Between Istio and Linkerd: while Istio had features like circuit breaking and rate limiting, Linkerd was far simpler. The biggest reason for choosing Istio (apart from its features), however, was that it was natively built for k8s, which is where we run 100% of our compute workloads.
Though Istio’s documentation has instructions on the installation process, installation before v1.7 was buggy and required workarounds: certain configs wouldn’t show up in the final generated manifest and had to be added manually.
For all versions, we preferred a config file to CLI parameters due to the sheer number of parameters required. Prior to v1.7, the config file was used to generate the deployment manifest via istioctl, which we manually verified before applying to the cluster. Configs like the prestop hook and multiple ingress gateways were being ignored and wouldn’t show up in the generated manifest because of the bugs stated earlier.
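For illustration, a trimmed-down config file of the kind described above might look like the sketch below. The field names follow the IstioOperator API; the second gateway's name and labels are hypothetical:

```yaml
# Sketch of an IstioOperator-style config file: multiple ingress
# gateways, one of the settings that previously had to be verified by
# hand in the generated manifest. Names and labels are illustrative.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  profile: default
  components:
    ingressGateways:
      - name: istio-ingressgateway      # default public gateway
        enabled: true
      - name: istio-internal-gateway    # second gateway (hypothetical)
        enabled: true
        label:
          istio: internal-gateway
```

Prior to v1.7 a file like this fed istioctl's manifest generation, with the output reviewed manually; with the operator, the same file can be applied to the cluster directly.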
Despite the installation process for v1.5 having the same bugs, we planned an upgrade from v1.4 to v1.5 for the performance improvements (only the diff in the Envoy config was now pushed to the proxies instead of the entire config). Plus, with the introduction of istiod, istio had adopted a monolithic architecture, which simplified management.
With istio 1.7 came a stable release of the istio operator which helped in two ways:
- It no longer had any of the bugs that led to improper application of the configs to the cluster.
- It simplified the installation process where the config could now directly be applied to the cluster as opposed to generating, verifying and then applying the manifest.
After installation, the cluster looked as follows:
While istio does not allow you to skip versions when upgrading, we could do so (from 1.5 to 1.7) because at cure.fit we perform an A/B cluster upgrade. This means that we deploy a new cluster with a higher version of k8s and istio, perform all the compatibility checks, and then migrate the workloads to the new cluster. There are many complications that we dealt with in this process, all of which will be covered in another blog.
Our requirements vs Istio’s features
We believe that the adoption of tech should be driven by requirements and not the other way round. Istio has a wide variety of features that one can make use of, but the decision to put it in place should not be taken in order to make use of those features. Rather, the call to use a service mesh should depend on what problems it solves for you, and whether it’s worth the overhead, because there’s a lot of it.
We can break this into three sections:
1. Our core requirements:
1. Observability: Transparency in service-to-service communication was our biggest requirement. We had almost 100 microservices, and the ability to pinpoint issues in service-to-service communication was critical. It’s also impossible to set alerts unless metrics are isolated for each service.
2. Traffic control: We wanted better routing capabilities such as canary releases, which allow one to slowly roll out code changes. We could have avoided certain production issues we faced earlier if we’d had this capability.
2. Features currently in production
1. Observability:
- Between services: After putting Istio into production, we had much better visibility of our microservice communication. Previously, with the hairpin design, all ingress traffic came from the gateway itself. After having made changes to contain the traffic within the cluster, communication between services was now segregated and we were able to create graphs to show us the exact data we wanted. We also had service graphs which showed interactions between all the different services we had. This reduced the number of hops and improved latency by as much as 20% between some services.
- Between services and databases: We were able to attribute data transfers not only in between services but also between services and databases. This allowed us to identify services dealing in heavy data exchanges.
2. Traffic control: We not only made use of percentage based routing for canary deployments, but also for path based routing which separates the website traffic from the app traffic.
3. Traffic compression: While exploring Istio, this feature really stood out and was too good to avoid. It has a very high outcome-to-effort ratio. We use Envoy filters to enable gzip compression/decompression at the proxy level, thereby reducing network cost, without changing a single line of application code. This is currently being done only at the gateways. We’re working on enabling gzip compression for all communications within the service mesh as well.
4. Circuit breaking: This has been enabled in production very recently and currently is part of only 2-3 service configs. The outlier detection setting also enables locality based load balancing, which helps reduce inter-zone network communication.
3. Features scoped for future
There were certain features that weren’t of immediate help to us, either because the problem wasn’t prominent enough at our scale, or we already had alternative solutions in place. We plan to make full use of our service mesh by incorporating one or more of these features in future:
1. Rate limiting using istio
2. NetworkPolicy: We’re yet to make use of a traffic flow network policy which allows traffic to flow only via an approved path, as opposed to k8s’ flat networking design, where traffic is free to flow between any two pods.
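The percentage-based canary routing described under Traffic control above can be sketched as a VirtualService like the one below. The host and subset names are hypothetical; the subsets would be defined in a matching DestinationRule:

```yaml
# Sketch of weighted routing for a canary rollout: 90% of traffic to
# the stable subset, 10% to the canary. Names are illustrative.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: orders.default.svc.cluster.local
            subset: stable
          weight: 90
        - destination:
            host: orders.default.svc.cluster.local
            subset: canary
          weight: 10
```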
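The gateway-level gzip compression described earlier is done with an EnvoyFilter along these lines. Treat this as a rough sketch rather than a drop-in config: the filter names and the typed_config @type vary across Istio/Envoy versions:

```yaml
# Rough sketch of an EnvoyFilter enabling gzip at the ingress gateway,
# without touching application code. Exact filter names and @type URLs
# are version-dependent; verify against your Istio/Envoy release.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: gateway-gzip
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
              subFilter:
                name: envoy.filters.http.router
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.gzip
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.gzip.v3.Gzip
```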
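The circuit breaking described above is configured via outlier detection in a DestinationRule, roughly as sketched below (host and thresholds are illustrative). As noted, setting outlierDetection is also what enables locality-based load balancing:

```yaml
# Sketch of circuit breaking via outlier detection: hosts returning
# consecutive 5xx errors are temporarily ejected from the load-balancing
# pool. All values below are illustrative, not tuned recommendations.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```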
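The kind of traffic-flow restriction mentioned in the NetworkPolicy item could be sketched as follows, assuming a CNI plugin that enforces NetworkPolicy. The app labels and port are hypothetical:

```yaml
# Sketch of a k8s NetworkPolicy that allows ingress to the "orders"
# pods only from "checkout" pods, instead of k8s' default flat network
# where any pod can reach any other. Labels and port are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-ingress
spec:
  podSelector:
    matchLabels:
      app: orders
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout
      ports:
        - protocol: TCP
          port: 8080
```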
The production conundrum with Istio
Istio isn’t easy. Period. Talk about running it in production! We’ve been doing so for 10 months now, and it hasn’t come without a price. As Elon Musk said, “Designing a rocket isn’t hard, rather it’s trivial. There are tonnes of resources out there to do that. But to get just one of those into production is super hard.” Istio…ditto. There’s a plethora of resources available on how to run istio in production, but the actual journey is far from the simplicity they state.
Let’s just say...this tweet quite aptly sums it up!
Understanding istio’s functioning is critical, and is covered in this blog to some extent.
Here are glimpses of our experience from running Istio in production:
1. Not free: Istio is free to install and use, but costs to run. The default config reserves 100m CPU for the Envoy container, which means every 10 application pods running Envoy reserve one full CPU core for Envoy alone. Extrapolated to 1000 pods, Envoy alone would reserve 100 CPU cores: almost an entire bare-metal machine, or roughly $3500 worth of compute.
2. Steep learning curve:
- It took us a significant amount of time to understand the how-tos of safe deployments and uptime. For example, our focus with istio was better observability, which we only achieved in September 2019, 9 months after we started investing in service meshes. Before that, the goal largely remained running without failures. Once we added mesh as a gateway in the virtual service, traffic was contained within the cluster, which unlocked all the observability metrics.
- Initially, when we were toying around with Istio on stage, we sometimes lost track of the changes applied to the cluster, which made it hard to identify or reproduce an issue. So we would remove everything using kubectl delete -f <generated manifest.yaml> and start from scratch. This removed Istio completely, along with all the CRDs it installs, resulting in the loss of virtual services for all applications that had one. We’d then have to redeploy all services so that the helm template would restore each service’s virtual service and gateways.
3. Performance: With Envoy intercepting all traffic, every packet passes through all the iptables rules only to be intercepted by Envoy, after which Envoy makes the routing decisions. This slightly impacts latency. Check out the benchmarks on Istio’s site, the talk on Liberating Kubernetes From Kube-proxy and Iptables, and the blog on Understanding How Envoy Sidecar Intercept and Route Traffic in Istio Service Mesh to understand this better. There’s a solution to this as well: Cilium. However, at our scale, the increase in latency isn’t significant enough to impact production.
For us, the hairpin design had worse impacts on latency, and resolving that using Istio actually helped bring down latencies.
4. Issues faced while running Istio:
- Issues with traffic interception: The annotation traffic.sidecar.istio.io/includeOutboundIPRanges controls which traffic Envoy intercepts. When set to *, Envoy intercepts all outgoing traffic from a pod. In istio 1.4, one of our pods showed high CPU usage with this annotation enabled, so we turned it off. We asked the community as well, but could not conclude why this happened. Disabling the flag, however, meant Envoy wouldn’t intercept any traffic, which meant no observability. So we set it to the cluster’s internal CIDR, 10.0.0.0/16, letting Envoy track all cluster-internal traffic while leaving traffic that left the cluster, like DB calls and calls to external services, unintercepted. Once we upgraded to istio 1.5, the abrupt CPU spikes on those pods stopped. However, the annotation set to * was also causing issues with connections between Envoy and MongoDB, while MySQL and Redis worked fine. So in production we currently keep the annotation at 10.0.0.0/16 to omit DB traffic interception. We have since upgraded to istio 1.7 and tested the MongoDB connection on stage, where it works as expected; we are yet to roll it out to prod. Intercepting DB traffic matters because it lets us attribute DB costs to individual services and spot patterns such as high throughput and large data exchanges between services and databases.
- Misconfiguration leading to failures in production: By February 2019, we had services using istio in production, and we continued to make minor changes to the config. We made the above annotation change, 10.0.0.0/16 → *, for testing on stage. Because the base helm template was shared across environments, a bunch of services deployed in production picked this change up. Envoy was now intercepting all outgoing traffic from these pods, including traffic targeted at services outside the mesh.
Because mesh was missing as a gateway in the virtual service, the virtual service config didn’t apply to Envoy, and services were treated as external to the mesh. Consequently, inter-service communication was routed via the load balancer and the ingress gateway instead of hitting the service directly (as shown in fig 5). And since the virtual service config didn’t apply, Envoy’s default timeout of 15 seconds took effect instead of the one specified in the virtual service, something we weren’t expecting. This resulted in continuous timeouts on the affected service for almost an hour before we fixed the issue. The impact? More than 1000 customer orders across 6 kitchens had to be cancelled to reduce the batch size being processed by the service so that it could recover. We reverted the annotation to stop the bleeding for the time being, and later added mesh as a gateway to the shared template itself to fix the issue permanently.
The communication sequence in the above diagram is as follows:
- Application-A makes a request to Application-B which is intercepted by Envoy.
- Envoy sends the call to the load balancer.
- Load balancer forwards the call to the ingress gateway controller.
- The ingress gateway controller forwards the call to Application-B’s pod, which is again intercepted by Envoy.
- Envoy forwards the call to Application-B’s container.
- Issues with application startup and termination: K8s is yet to introduce container ordering. Before istio 1.7, Envoy was always injected as the second container, and k8s starts a pod’s containers sequentially but terminates them in parallel. This meant the application container could, and sometimes did, become ready before Envoy was up. If the application ran background workers, these wouldn’t wait for readiness checks to pass and would start performing operations such as calling other services or interacting with the databases, causing call failures during container startup. Istio 1.7 added an option to inject Envoy as the first container in the pod. There were similar issues with pod termination: the application container would still have requests queued up while Envoy shut down gracefully. Hence, we added a prestop hook on Envoy to delay its termination by 120 seconds, while the applications had a prestop hook of 100 seconds. This ensured that application containers shut down well before Envoy did.
- Issue with istiod: Another issue we faced was that Envoy proxies would connect to a newly spawned istiod pod even before istiod’s discovery container had synced its cache, resulting in call failures in the affected pods. Refer to this github issue. After 2-5 minutes, when the Envoy config refreshed, the affected pods would start functioning normally. We mitigated the problem by scaling istiod (15 istiod pods for 500 application pods) and moving it to an on-demand instance to protect it from spot terminations. This was one of the hardest issues to debug, and we haven’t faced it since.
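The interception annotation discussed above is applied on the pod template; a sketch of what that looks like on a deployment follows. The deployment name and image are hypothetical; the CIDR is the one from our setup:

```yaml
# Sketch: limiting Envoy's outbound interception to the cluster CIDR,
# so DB calls and calls to external services bypass the sidecar.
# Deployment name and image are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
      annotations:
        traffic.sidecar.istio.io/includeOutboundIPRanges: "10.0.0.0/16"
    spec:
      containers:
        - name: orders
          image: orders:latest   # hypothetical image
```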
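The permanent fix for the misconfiguration incident, listing mesh among the virtual service's gateways so the routing config (including its timeout) also applies to sidecar-to-sidecar traffic, looks roughly like this. Host names, the gateway name, and the timeout value are illustrative:

```yaml
# Sketch: with both "mesh" and the ingress gateway listed, the same
# VirtualService governs in-mesh (sidecar) traffic as well as traffic
# entering via the gateway, so Envoy's 15s default timeout no longer
# silently applies to service-to-service calls.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders.example.com                  # external host (hypothetical)
    - orders.default.svc.cluster.local    # in-cluster host
  gateways:
    - mesh                                # sidecar-to-sidecar traffic
    - istio-system/public-gateway         # ingress traffic (hypothetical name)
  http:
    - timeout: 120s                       # overrides Envoy's 15s default
      route:
        - destination:
            host: orders.default.svc.cluster.local
```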
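The termination ordering described above can be sketched as prestop hooks with staggered delays; the application drains and exits while Envoy can still proxy its traffic. In practice the istio-proxy hook would be set via the sidecar injection template rather than written by hand; the fragment below is illustrative:

```yaml
# Sketch of staggered preStop delays inside a pod spec: the app
# container receives SIGTERM after 100s, the Envoy sidecar after 120s,
# so the app shuts down while Envoy is still running.
containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "100"]   # app shuts down first
  - name: istio-proxy
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "120"]   # Envoy outlives the app
```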
5. Difficulty in capturing logs: We use fluentd to capture logs from the containers and push them to cloudwatch. An AWS load balancer’s access logs, however, are pushed to S3, because there’s an abundance of them and pushing them to cloudwatch would incur unnecessary cost. We haven’t yet settled on how to split Envoy’s access logs from an application’s stdout logs so that the access logs can go to S3. We’ll probably use handlers, as this blog suggests.
6. Traffic shifting via Spinnaker is complex: Though istio provides percentage-based weighted routing, it’s not very easy to integrate with our CD system, Spinnaker. We have written a couple of pipelines for canary analysis using weighted routing, but they get really messy. The figure below includes only one level of traffic shifting, i.e., from 100-0-0 to 90-5-5. For a multilevel traffic shift, the pipeline gets even more complex.
Istio was a bet that started paying off 7-8 months after we deployed it into production. It doesn’t usually take this long to reap the benefits, but the steep learning curve made it extremely difficult to do so early on. This long, arduous journey was full of moments of realisation on how Istio’s bits and pieces come together. We incorporated those learnings and worked towards a more transparent and efficient infrastructure.
The end result: on the back of the metrics provided by Istio alone, devs not only get a clearer picture of the behavior of the microservices they own, but also understand the interactions with other microservices, and are easily able to identify erratic behaviors in the system, if any. Such granularity in metrics has also unlocked the ability to set precise alerts, both centrally and at the service level.
Initially we were skeptical of adopting a service mesh and taking up the challenge of maintaining it. In hindsight, we’re glad we did because it has made our lives easier.
Beyond the challenges of a service mesh lies the simplicity in understanding and operating a host of microservices. While we continue to harness Istio’s capabilities targeted towards improving the resilience of our infrastructure, we hope this blog proves to be useful to someone who’s embarked on the same journey.