Distributed Tracing#

Observability#

  • Observability is a concept in software engineering that focuses on gaining insights into complex systems by collecting and analyzing relevant data.
  • Tracing, metrics, and logs are the three cornerstones of observability:

    • Tracing provides a way to track and visualize the flow of requests as they traverse through a distributed system. It helps identify the path taken by a request, measure the duration of each step, and understand the dependencies and interactions between different components.
    • Metrics, on the other hand, provide quantitative measurements and statistics about various aspects of the system's behavior, such as response times, error rates, and resource utilization. They offer aggregated data that can be used to monitor performance, identify anomalies, and make data-driven decisions.
    • Logs capture important events, messages, and contextual information generated by the system. They can be used for debugging, auditing, and analyzing system behavior. Logs provide a chronological record of events and can help in understanding the sequence of operations and detecting issues.
  • By combining tracing, metrics, and logs, observability enables system administrators and developers to gain a comprehensive understanding of system behavior, diagnose problems, and optimize performance. It provides insights into the inner workings of complex distributed systems, helping to improve reliability, troubleshoot issues, and enhance overall system performance.

What Is Distributed Tracing?#

  • Distributed tracing is a method of observing requests as they advance through a distributed system. Its primary use is to profile and monitor modern applications built with microservices and/or cloud-native architectures, enabling developers to find performance issues.

  • With distributed tracing, developers can track a single request traversing through an entire system that is distributed across multiple applications, services, and databases. 

  • By using a distributed tracing tool, you can collect data on each request that helps you present, analyze, and visualize the request in detail. These visual representations let you see each step (also known as a span) a request makes and how long each step takes. Developers can review this information to see where the system is experiencing blockages and latency, and so determine the root cause. For example, a request may pass back and forth through multiple microservices before it is fulfilled. Without a way of tracking the entire journey, there is no way to know exactly where the issues occur.

Why Distributed Tracing?#

  • When working with microservice systems, we usually face pain points such as:

    • Identifying the root cause when something goes wrong in the system.
    • Monitoring the execution time of each service to find performance bottlenecks.
  • Distributed tracing addresses both by providing observability for microservices. It lets us track a request from start to finish, which makes troubleshooting faster and easier, and it shows how the system is performing.

  • Common distributed tracing solutions attach small pieces of metadata to the headers of each request, which are then propagated downstream to any subsequent services. Each component is configured to send this metadata to a centralised tracing tool (such as Jaeger or Zipkin), which correlates the data and lets you visualize the request as it passes through the system.
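
The correlation step can be illustrated with a toy, in-memory collector. This is only a sketch: the Collector class, its methods, and the sample service names are hypothetical and do not reflect the Jaeger or Zipkin APIs. It shows the one property that matters here: spans reported independently by different services are grouped by the trace ID they share.

```python
from collections import defaultdict


class Collector:
    """Stand-in for a central tracing backend (hypothetical, in-memory only)."""

    def __init__(self):
        # trace ID -> list of (service, span name, duration in ms)
        self.traces = defaultdict(list)

    def report(self, service, trace_id, span_name, duration_ms):
        # Each service reports on its own; the shared trace ID ties spans together.
        self.traces[trace_id].append((service, span_name, duration_ms))

    def services_in(self, trace_id):
        # Services that took part in one request, in reporting order.
        return [service for service, _name, _ms in self.traces[trace_id]]


collector = Collector()
collector.report("gateway", "t1", "GET /checkout", 120)
collector.report("orders", "t1", "load cart", 30)
collector.report("billing", "t1", "charge card", 85)
collector.report("gateway", "t2", "GET /health", 2)

print(collector.services_in("t1"))
```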

Terminology#

Span#

  • A span represents a single unit of work within the system. Spans can be nested within one another to model the decomposition of the work. A detailed explanation can be found on the OpenTracing site. For example, one span could represent a call to a REST endpoint, and a child span could represent that endpoint calling another endpoint in a different service, and so on.
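
As a rough illustration, a span can be modelled as a small record with an ID, an optional parent, and start and end times. The Span class and its field names below are assumptions made for this sketch; real tracing libraries define their own types.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work. All names here are illustrative, not a library API."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def child(self, name: str) -> "Span":
        # A child keeps the trace ID and points back at its parent span.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.monotonic()

    @property
    def duration(self) -> float:
        if self.end is None:
            raise ValueError("span is not finished")
        return self.end - self.start


root = Span(name="GET /orders", trace_id=secrets.token_hex(16))
call = root.child("GET /inventory")  # nested call made while serving the request
call.finish()
root.finish()
```

Because the child starts after and finishes before its parent, the parent's duration always covers the child's.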

Trace#

  • A trace is a collection of spans that all share the same root span or, more simply, all the spans created as a direct result of the original request. The hierarchy of spans (each with its own parent span, up to the root span) forms a directed acyclic graph showing the path the request took through the various components.
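
The hierarchy can be rebuilt from a flat list of spans by following the parent links. The sketch below uses made-up span IDs and service names, and assumes each span is reported as a (span_id, parent_id, name) tuple.

```python
from collections import defaultdict

# Hypothetical finished spans of one trace: (span_id, parent_id, name).
spans = [
    ("a", None, "gateway"),
    ("b", "a", "orders-service"),
    ("c", "b", "orders-db"),
    ("d", "a", "billing-service"),
]


def build_tree(spans):
    """Return (root_id, children); children maps a span ID to its child IDs."""
    children = defaultdict(list)
    root_id = None
    for span_id, parent_id, _name in spans:
        if parent_id is None:
            root_id = span_id  # the span without a parent is the root
        else:
            children[parent_id].append(span_id)
    return root_id, children


def render(spans, root_id, children):
    """Return one indented line per span, walking the tree depth-first."""
    names = {span_id: name for span_id, _parent, name in spans}
    lines = []

    def walk(span_id, depth):
        lines.append("  " * depth + names[span_id])
        for child_id in children[span_id]:
            walk(child_id, depth + 1)

    walk(root_id, 0)
    return lines


root_id, children = build_tree(spans)
print("\n".join(render(spans, root_id, children)))
```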

Trace Context#

  • Trace context is the bundle of metadata passed from one service to the next, which makes it possible to assemble the final hierarchical trace. Depending on the propagation format used, it can take several forms, but it usually includes at least the root and parent span IDs plus any extra “baggage”.
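
One way to picture this bundle is as an immutable record that each service copies forward, replacing only the span IDs. The TraceContext class and the next_hop function are hypothetical names chosen for this sketch.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class TraceContext:
    """Metadata handed from one service to the next (illustrative fields)."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    baggage: Dict[str, str] = field(default_factory=dict)


def next_hop(ctx: TraceContext) -> TraceContext:
    """Context for work started by the span that ctx describes."""
    return TraceContext(
        trace_id=ctx.trace_id,         # unchanged for the whole trace
        span_id=secrets.token_hex(8),  # fresh ID for the new span
        parent_span_id=ctx.span_id,    # link back to the caller's span
        baggage=dict(ctx.baggage),     # baggage travels along unchanged
    )


incoming = TraceContext(
    trace_id=secrets.token_hex(16),
    span_id=secrets.token_hex(8),
    baggage={"tenant": "acme"},
)
outgoing = next_hop(incoming)
```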

Context Propagation#

  • Context propagation is the process of transferring trace information from one service to another. It is done by injecting the trace context into the message being sent. For an HTTP call, this usually means adding specific HTTP headers defined by a standard. Several standards exist, which is where the complexity arises: Zipkin uses the B3 format, whereas the W3C has defined the newer Trace Context standard, which may be preferable. The libraries you use should support multiple formats and convert between them.

  • The B3 standard defines the following main headers:

    • X-B3-TraceId
    • X-B3-ParentSpanId
    • X-B3-SpanId
    • X-B3-Sampled
  • The W3C standard carries the context in a single traceparent header made up of the following fields:

    • version
    • trace-id
    • parent-id
    • trace-flags
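
The two formats can be sketched as plain header manipulation. In B3, each field travels in its own header; in W3C Trace Context, the four fields listed above are joined with hyphens into one traceparent header value. The function names below are hypothetical helpers written for this sketch, not part of any tracing library.

```python
import re


def inject_b3(headers, trace_id, span_id, parent_span_id=None, sampled=True):
    """Add the B3 multi-header fields to an outgoing request's headers."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    if parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = parent_span_id
    headers["X-B3-Sampled"] = "1" if sampled else "0"
    return headers


def format_traceparent(trace_id, parent_id, sampled=True):
    """Build a traceparent value: version-traceid-parentid-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


# version (2 hex), trace-id (32 hex), parent-id (16 hex), trace-flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Split a traceparent value into its fields, or raise ValueError."""
    match = _TRACEPARENT.match(value)
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace-id": trace_id,
        "parent-id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


value = format_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
fields = parse_traceparent(value)
b3 = inject_b3({}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
```

A receiving service would parse the incoming header, start its own span with the received parent-id as the parent, and inject a new header value into any calls it makes.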

Sampling#

  • In larger systems, or for those which process a high number of requests, we may not want to record every trace. It could be unnecessarily expensive to do so or could put pressure on the collectors. Sampling aims to limit the total number of traces recorded whilst still preserving the underlying trends. For example, you might employ a simple rate limiting sampler or use more complex probabilistic or adaptive approaches.
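
Two of the strategies mentioned above can be sketched in a few lines. Both classes are illustrative assumptions, not the samplers shipped by any particular tracer. The probabilistic sampler decides from the trace ID itself, so every service that sees the same trace reaches the same decision; the rate-limiting sampler uses a token bucket.

```python
import time


class ProbabilisticSampler:
    """Keep roughly `rate` of all traces, decided from the trace ID."""

    def __init__(self, rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.threshold = int(rate * 2**64)

    def should_sample(self, trace_id_hex):
        # Compare the low 64 bits of the trace ID against the threshold.
        return int(trace_id_hex[-16:], 16) < self.threshold


class RateLimitingSampler:
    """Keep at most about `max_per_second` traces using a token bucket."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.rate = float(max_per_second)
        self.capacity = max(1.0, self.rate)  # always allow at least one token
        self.clock = clock
        self.tokens = self.capacity
        self.last = clock()

    def should_sample(self):
        now = self.clock()
        # Refill in proportion to elapsed time, up to the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


half = ProbabilisticSampler(0.5)
limiter = RateLimitingSampler(max_per_second=2)
```

The clock parameter exists only so the limiter can be driven by a fake clock in tests.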

Instrumentation#

  • Instrumentation means adding code to a service to gather tracing information. It can be done manually or automatically. Because manual instrumentation requires some boilerplate code, the preferred approach is to use the auto-instrumentation libraries offered by tracing providers.
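
The boilerplate involved in manual instrumentation can be seen in a small standard-library sketch. The span context manager, the traced decorator, and the RECORDED list are hypothetical stand-ins; provider libraries apply similar wrapping to frameworks and clients automatically, but their actual APIs differ.

```python
import functools
import time
from contextlib import contextmanager

RECORDED = []  # stand-in for an exporter; a real tracer would ship spans to a backend


@contextmanager
def span(name):
    """Manual instrumentation: time a block of code and record it as a span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        RECORDED.append((name, time.perf_counter() - start))


def traced(func):
    """Decorator form: one line per function instead of a with-block in each body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with span(func.__name__):
            return func(*args, **kwargs)
    return wrapper


@traced
def load_order(order_id):
    with span("db.query"):  # manually instrumented inner step
        return {"id": order_id}


order = load_order(42)
print([name for name, _seconds in RECORDED])
```

The inner span is recorded first because it finishes before the function that contains it.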

Baggage#

  • Distributed tracing works by propagating fields inside and across services that connect the trace together, notably traceId and spanId. The context that holds these fields can optionally carry other fields that need to stay consistent regardless of how many services are touched. The simple name for these extra fields is "Baggage".
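
A minimal sketch of both halves, propagation inside a service and across services, is shown below. It assumes baggage is held in a context variable in-process and serialized as comma-separated key=value pairs in a header named baggage, with percent-encoded values, which is the convention of the W3C Baggage format. The helper names are hypothetical.

```python
import contextvars
from urllib.parse import quote, unquote

# Holds the baggage for the current request; the default dict is never mutated.
_baggage = contextvars.ContextVar("baggage", default={})


def set_baggage(key, value):
    # Copy on write so other contexts keep their own view.
    _baggage.set({**_baggage.get(), key: value})


def get_baggage(key, default=None):
    return _baggage.get().get(key, default)


def inject(headers):
    """Serialize the current baggage into outgoing request headers."""
    items = _baggage.get()
    if items:
        headers["baggage"] = ",".join(
            f"{quote(k)}={quote(v)}" for k, v in items.items()
        )
    return headers


def extract(headers):
    """Read baggage back out of incoming request headers."""
    result = {}
    for member in headers.get("baggage", "").split(","):
        if "=" in member:
            key, value = member.split("=", 1)
            result[unquote(key.strip())] = unquote(value.strip())
    return result


set_baggage("tenant", "acme")
set_baggage("region", "eu west")
outgoing = inject({})
received = extract(outgoing)
```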

References#