Spring Cloud Sleuth
What Is Spring Cloud Sleuth?
- Spring Cloud Sleuth provides Spring Boot auto-configuration for distributed tracing. Underneath, Spring Cloud Sleuth is a layer over a tracer library named Brave.
- Sleuth configures everything you need to get started. This includes where trace data (spans) is reported to, how many traces to keep (sampling), whether remote fields (baggage) are sent, and which libraries are traced.
- Spring Cloud Sleuth can trace your requests and messages so that you can correlate that communication with the corresponding log entries. You can also export the tracing information to an external system to visualize latency. Spring Cloud Sleuth supports OpenZipkin-compatible systems directly.
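To make the sampling idea concrete, here is a minimal plain-Java sketch. It does not use Brave or Sleuth: the class TraceSampler and its decide method are hypothetical names, and the sketch assumes a simple probability-based policy. What it illustrates is the rule B3 relies on: the first service makes the decision once, and every later service reuses the decision it receives instead of deciding again.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sampler for illustration; this is NOT Brave's Sampler API.
final class TraceSampler {
    private final double probability; // fraction of new traces to keep, from 0.0 to 1.0

    TraceSampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0.0 and 1.0");
        }
        this.probability = probability;
    }

    // upstreamDecision is the flag received from the caller (for example X-B3-Sampled),
    // or null when this service starts the trace.
    boolean decide(Boolean upstreamDecision) {
        if (upstreamDecision != null) {
            return upstreamDecision; // honor the decision already made upstream
        }
        // Root of the trace: decide once; the result is then propagated downstream.
        return ThreadLocalRandom.current().nextDouble() < probability;
    }
}
```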
Why Spring Cloud Sleuth?
Spring Cloud Sleuth provides Spring Boot auto-configuration for distributed tracing. Specifically, Spring Cloud Sleuth:
- Adds trace and span IDs to the SLF4J MDC, so we can extract all the logs from a given trace or span in a log aggregator.
- Instruments common ingress and egress points of Spring applications (servlet filter, RestTemplate, scheduled actions, message channels, Feign client).
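The MDC is a per-thread map that logging frameworks read when formatting a log line. The sketch below imitates that with a plain ThreadLocal so it runs without SLF4J; FakeMdc and LogCorrelationDemo are hypothetical names, and the MDC keys traceId and spanId are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the SLF4J MDC: a per-thread map of diagnostic values.
final class FakeMdc {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static String get(String key) { return CONTEXT.get().get(key); }
    static void clear() { CONTEXT.get().clear(); }
}

// Hypothetical demo of how IDs in the MDC end up in every log line of a request.
final class LogCorrelationDemo {
    // Formats a line the way a pattern such as "[%X{traceId},%X{spanId}] %m" might.
    static String format(String message) {
        return "[" + FakeMdc.get("traceId") + "," + FakeMdc.get("spanId") + "] " + message;
    }

    // Simulates what a tracing filter does around the handling of one request.
    static String handleRequest(String traceId, String spanId) {
        FakeMdc.put("traceId", traceId);
        FakeMdc.put("spanId", spanId);
        try {
            return format("processing order");
        } finally {
            FakeMdc.clear(); // do not leak IDs to the next request served by this thread
        }
    }
}
```

With Logback, the %X{key} conversion word reads a value from the real MDC in the same way, which is what lets a log aggregator filter all lines of one trace.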
If spring-cloud-sleuth-zipkin is available, the app generates and reports Zipkin-compatible traces via HTTP. By default, it sends them to a Zipkin collector service on localhost (port 9411). Configure the location of the service using spring.zipkin.baseUrl.
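As a rough sketch of what this default means, assuming the collector exposes the standard Zipkin v2 path /api/v2/spans: the hypothetical helper below (not Sleuth's own code, whose logic may differ) resolves the URL that spans would be posted to from spring.zipkin.baseUrl, falling back to localhost on port 9411.

```java
import java.util.Map;

// Hypothetical helper, not part of Spring Cloud Sleuth: resolves where spans would be POSTed.
final class ZipkinEndpoint {
    // Collector on localhost, port 9411, as described above.
    static final String DEFAULT_BASE_URL = "http://localhost:9411/";

    static String spansUrl(Map<String, String> properties) {
        String base = properties.getOrDefault("spring.zipkin.baseUrl", DEFAULT_BASE_URL);
        if (!base.endsWith("/")) {
            base = base + "/"; // tolerate a base URL configured without a trailing slash
        }
        return base + "api/v2/spans"; // assumed Zipkin v2 JSON collection path
    }
}
```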
What Is Zipkin?
- Zipkin is a distributed tracing system. It helps gather the timing data needed to troubleshoot latency problems in service architectures. Features include both the collection and lookup of this data.
- Applications need to be “instrumented” to report trace data to Zipkin. This usually means configuring a tracer or instrumentation library. The most popular ways to report data to Zipkin are via HTTP or Kafka, though many other options exist, such as Apache ActiveMQ, gRPC and RabbitMQ. The data served to the UI is stored in memory, or persistently with a supported backend such as Apache Cassandra or Elasticsearch.
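Over HTTP, Zipkin's v2 API accepts spans as JSON. The hypothetical encoder below builds one span by hand to show the main fields: traceId, id, parentId, name, kind, timestamp and duration (both in microseconds), and localEndpoint. It assumes the inputs need no JSON escaping; a real reporter would use a JSON library and send spans in batches as a JSON array.

```java
// Hypothetical encoder for one span in Zipkin's v2 JSON model (no escaping: inputs assumed safe).
final class ZipkinJson {
    static String span(String traceId, String id, String parentId, String name,
                       String kind, String serviceName, long timestampMicros, long durationMicros) {
        StringBuilder json = new StringBuilder("{");
        json.append("\"traceId\":\"").append(traceId).append("\",");
        json.append("\"id\":\"").append(id).append("\",");
        if (parentId != null) { // a root span has no parentId
            json.append("\"parentId\":\"").append(parentId).append("\",");
        }
        json.append("\"name\":\"").append(name).append("\",");
        json.append("\"kind\":\"").append(kind).append("\","); // e.g. CLIENT or SERVER
        json.append("\"timestamp\":").append(timestampMicros).append(','); // epoch microseconds
        json.append("\"duration\":").append(durationMicros).append(',');   // microseconds
        json.append("\"localEndpoint\":{\"serviceName\":\"").append(serviceName).append("\"}");
        return json.append('}').toString();
    }
}
```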
What Is The Brave Library?
Brave is a distributed tracing instrumentation library. Brave typically intercepts production requests to gather timing data, and to correlate and propagate trace contexts. While trace data is typically sent to a Zipkin server, third-party plugins are available to send it to alternate services such as Amazon X-Ray.
How Does Spring Cloud Sleuth Work?
Context Generation
- First, take a look at the picture below. When service A receives a request from an external system, a TraceContext is generated and then propagated to the other services in the microservice system.
- At this step, Spring Cloud Sleuth first checks whether the incoming headers contain the specific header X-B3-TraceId. If they don't, Sleuth knows that this is the first request to the microservice system, so it generates the traceId and spanId and creates the TraceContext.
- The TraceContext holds three identifiers, traceId, spanId and parentSpanId, plus a sampling state, sampled. These correspond to the headers of B3 Propagation, which is built into Brave and has implementations in many languages and frameworks. See the table below for details.
Field | Header | Type | Definition | Example Value |
---|---|---|---|---|
TraceId | X-B3-TraceId | Identifier | The TraceId is 64 or 128 bits in length and indicates the overall ID of the trace. Every span in a trace shares this ID. | 80f198ee56343ba864fe8b2a57d3eff7 (128 bits), 05e3ac9a4f6e3b90 (64 bits) |
SpanId | X-B3-SpanId | Identifier | The SpanId is 64 bits in length and indicates the position of the current operation in the trace tree. The value should not be interpreted: it may or may not be derived from the value of the TraceId. | 05e3ac9a4f6e3b90 |
ParentSpanId | X-B3-ParentSpanId | Identifier | The ParentSpanId is 64 bits in length and indicates the position of the parent operation in the trace tree. When the span is the root of the trace tree, there is no ParentSpanId. | 05e3ac9a4f6e3b90 |
Sampled | X-B3-Sampled | Sampling state | Sampling is a mechanism to reduce the volume of data that ends up in the tracing system. In B3, sampling applies consistently per trace: once the sampling decision is made, the same value should be consistently sent downstream. This means you will see either all spans sharing a trace ID or none. | 1 (accept) or 0 (deny) |
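The generation step and the fields in the table can be sketched in plain Java. SimpleTraceContext and ContextGeneration are hypothetical names, not Brave's TraceContext. The sketch assumes every new trace is sampled, and it draws the root trace ID and span ID independently (a real tracer may derive one from the other, which the SpanId row allows).

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the four B3 fields; NOT Brave's brave.propagation.TraceContext.
record SimpleTraceContext(String traceId, String spanId, String parentSpanId, Boolean sampled) {}

final class ContextGeneration {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Random 64-bit ID as 16 lower-hex characters; zero is skipped, as tracers treat it as "no ID".
    static String newId64() {
        long id;
        do {
            id = RANDOM.nextLong();
        } while (id == 0L);
        return String.format("%016x", id);
    }

    // A 128-bit trace ID is two 64-bit halves: 32 lower-hex characters.
    static String newId128() {
        return newId64() + newId64();
    }

    // "1" accepts and "0" denies; "true"/"false" are accepted leniently; absent means undecided.
    static Boolean parseSampled(String value) {
        if (value == null) return null;
        if (value.equals("1") || value.equalsIgnoreCase("true")) return Boolean.TRUE;
        if (value.equals("0") || value.equalsIgnoreCase("false")) return Boolean.FALSE;
        return null;
    }

    static SimpleTraceContext extractOrCreate(Map<String, String> incoming) {
        // HTTP header names are case-insensitive, so copy them into a case-insensitive map.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.putAll(incoming);

        String traceId = headers.get("X-B3-TraceId");
        if (traceId == null) {
            // No X-B3-TraceId: first request entering the system, so start a root span.
            // Assumption for this sketch: every new trace is sampled.
            return new SimpleTraceContext(newId64(), newId64(), null, Boolean.TRUE);
        }
        // Otherwise reuse the context the caller sent.
        return new SimpleTraceContext(traceId, headers.get("X-B3-SpanId"),
                headers.get("X-B3-ParentSpanId"), parseSampled(headers.get("X-B3-Sampled")));
    }
}
```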
Context Propagation
B3 Propagation is a specification for the header "b3" and those that start with "x-b3-". These headers are used to propagate the traceContext across service boundaries.
- Based on B3, Sleuth propagates the traceContext across service boundaries. When a request enters a service, Sleuth extracts the traceContext from the incoming request headers. It then adds the trace and span information to outgoing requests, ensuring that the traceId is propagated across multiple services. This allows for end-to-end tracing of requests.
- For example, when a downstream HTTP call is made, its traceContext is encoded as request headers and sent along with it, as shown in the following images.
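A minimal sketch of the injection side, assuming the outgoing call gets its own child span. B3Propagation and Ctx are hypothetical names, not the Brave API. The sketch shows both encodings from the B3 specification: the X-B3-* multi-header form, and the single b3 header whose value is {TraceId}-{SpanId}-{SamplingState}-{ParentSpanId}, with the parent omitted for a root span.

```java
import java.security.SecureRandom;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical B3 helpers for illustration; this is NOT the Brave or Sleuth API.
final class B3Propagation {
    record Ctx(String traceId, String spanId, String parentSpanId, boolean sampled) {}

    private static final SecureRandom RANDOM = new SecureRandom();

    static String newId64() {
        long id;
        do {
            id = RANDOM.nextLong();
        } while (id == 0L); // skip zero, which tracers treat as "no ID"
        return String.format("%016x", id);
    }

    // The outgoing call gets its own span: same trace, new span ID, current span as parent.
    static Ctx childOf(Ctx current) {
        return new Ctx(current.traceId(), newId64(), current.spanId(), current.sampled());
    }

    // Multi-header encoding: one X-B3-* header per field.
    static Map<String, String> injectMulti(Ctx ctx) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", ctx.traceId());
        headers.put("X-B3-SpanId", ctx.spanId());
        if (ctx.parentSpanId() != null) { // a root span has no parent
            headers.put("X-B3-ParentSpanId", ctx.parentSpanId());
        }
        headers.put("X-B3-Sampled", ctx.sampled() ? "1" : "0");
        return headers;
    }

    // Single-header encoding: b3: {TraceId}-{SpanId}-{SamplingState}-{ParentSpanId}
    static String injectSingle(Ctx ctx) {
        String value = ctx.traceId() + "-" + ctx.spanId() + "-" + (ctx.sampled() ? "1" : "0");
        return ctx.parentSpanId() == null ? value : value + "-" + ctx.parentSpanId();
    }
}
```

The receiving service reads these headers back, which is the extraction step described under Context Generation.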
Next, we need to understand when spans are created and recorded, and how the traceContext is propagated. First, there is the concept of an Annotation/Event, which is used to record the existence of an event in time. These events highlight what kind of action took place (this doesn't mean that such an event is physically set on a span).
Name | Full Name | Description |
---|---|---|
cs | Client Send | The client has made a request. This annotation indicates the start of the span. |
sr | Server Received | The server side got the request and started processing it. Subtracting the cs timestamp from this timestamp reveals the network latency. |
ss | Server Sent | Annotated upon completion of request processing (when the response got sent back to the client). Subtracting the sr timestamp from this timestamp reveals the time needed by the server side to process the request. |
cr | Client Received | Signifies the end of the span. The client has successfully received the response from the server side. Subtracting the cs timestamp from this timestamp reveals the whole time needed by the client to receive the response from the server. |
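The arithmetic in the table is easy to express in code. RpcAnnotations is a hypothetical record holding the four timestamps, assumed here to be in microseconds. Note that sr - cs only measures network latency if the client and server clocks agree.

```java
// Hypothetical holder for the four annotation timestamps of one RPC, in microseconds.
record RpcAnnotations(long cs, long sr, long ss, long cr) {
    // sr - cs: network latency from client to server (only meaningful if clocks are in sync)
    long networkLatency() {
        return sr - cs;
    }

    // ss - sr: time the server spent processing the request
    long serverProcessing() {
        return ss - sr;
    }

    // cr - cs: total time the client needed to receive the response
    long clientTotal() {
        return cr - cs;
    }
}
```

For example, with cs = 1000, sr = 1015, ss = 1115 and cr = 1140, the network latency is 15, the server processing time is 100 and the client's total time is 140.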
- Now, let's look at how spanIds are generated and sent between the client and the server, as in the image below.
- As you can see in the image above, a new span is created each time a server received event (at the first service), a client send event (for the outgoing call) or a server received event (at the downstream service) occurs. These spans are sent to a distributed tracing system like Zipkin when the matching server sent, client received and server sent events occur, respectively.
- Besides that, the traceContext is mapped and propagated as in the image below, without shared spanIds. This means that at the server received, client send and server received steps, new spanIds are always generated.
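The lifecycle described above can be simulated with a toy tracer. SpanLifecycleDemo is hypothetical: spans get sequential IDs for readability (real IDs are random 64-bit values), and a list stands in for the reporter that would send finished spans to Zipkin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical toy tracer: a span starts on "server received" or "client send"
// and is reported on the matching "server sent" or "client received".
final class SpanLifecycleDemo {
    record Span(String name, String kind, long spanId, Long parentId) {}

    private final AtomicLong ids = new AtomicLong();   // sequential IDs keep the demo readable
    final List<Span> reported = new ArrayList<>();      // stands in for a Zipkin reporter

    Span start(String name, String kind, Long parentId) {
        return new Span(name, kind, ids.incrementAndGet(), parentId);
    }

    void finish(Span span) {
        reported.add(span); // a real tracer would send the finished span asynchronously
    }

    // Service A handles an external request and calls service B, without span ID sharing.
    List<Span> run() {
        Span aServer = start("A: handle request", "SERVER", null);             // sr at A
        Span aClient = start("A: call B", "CLIENT", aServer.spanId());          // cs at A
        Span bServer = start("B: handle request", "SERVER", aClient.spanId());  // sr at B
        finish(bServer); // ss at B
        finish(aClient); // cr at A
        finish(aServer); // ss at A
        return reported;
    }
}
```

The innermost span is reported first, because it is the first one to finish; the resulting trace has three spans.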
Sharing span IDs between Client and Server
- This part is a bit more complicated, so we will go through it step by step.
- TraceContext propagation without shared spanIds looks like the image below. When the server receives the request, the traceContext is extracted from the incoming request headers and then mapped to a new TraceContext with a new spanId.
- So on Zipkin, we see a trace with 3 spans, as in the image below.
- Now, following B3, Spring Cloud Sleuth enables sharing the spanId between the client and the server by default. In this case, when the server receives the request, the traceContext is extracted from the incoming request headers and joined, without generating a new spanId. In other words, the traceContext of the server received event reuses all the information from the extracted traceContext of the incoming request.
- Now, if we look at the image below, only two spans are displayed in the Zipkin server, because the client send span has been joined with the server received span. They use the same traceId, spanId and parentId.
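The difference between the two strategies comes down to whether the server creates a new span ID. In this hypothetical sketch (JoinVsChild is not Brave's API, and the IDs are readable placeholders rather than real hex IDs), counting distinct span IDs gives 3 without sharing and 2 with sharing, matching the 3-span and 2-span traces described above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical comparison of two server-side strategies; NOT Brave's API.
final class JoinVsChild {
    record Ctx(String traceId, String spanId, String parentId) {}

    // Without sharing: the server starts a child span with a fresh span ID.
    static Ctx child(Ctx extracted, String newSpanId) {
        return new Ctx(extracted.traceId(), newSpanId, extracted.spanId());
    }

    // With sharing (join): the server reuses every ID from the incoming headers.
    static Ctx join(Ctx extracted) {
        return new Ctx(extracted.traceId(), extracted.spanId(), extracted.parentId());
    }

    // Client and server halves that share a span ID show up as a single span.
    static int distinctSpans(List<Ctx> contexts) {
        Set<String> ids = new HashSet<>();
        for (Ctx c : contexts) {
            ids.add(c.spanId());
        }
        return ids.size();
    }
}
```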