FastWrite/FastRead Panic Error

Common errors and solutions related to FastCodec (including FastThrift, FastPB)

FastWrite Panic

The implementation of FastWrite consists of three steps: (1) scanning the object to calculate the required memory, (2) allocating memory, and (3) completing serialization. This approach eliminates the need for slice appending, reducing memory allocation and copying, resulting in significantly improved performance.

However, if there is a concurrency bug in the user code between step 1 (calculating the required memory) and step 3 (serialization), where another goroutine concurrently writes to the object being serialized (request parameters for the client or response values for the server), it can lead to Kitex reading an inconsistent object, resulting in errors or even panics.

Note:

  1. Although the error occurs in the generated FastWriteField method of Kitex, the root cause of the error lies in the user code, which is a typical concurrency issue.
  2. These issues are concurrency bugs in the user code, not bugs in Kitex. FastWrite simply exposes these issues more easily.
  3. A typical scenario is when a field referenced in the request or response object is reused, as explained in the “Case Analysis” section below.
  4. Importantly, the absence of errors or panics does not guarantee the absence of concurrency issues. In reality, inconsistent data may have been constructed and passed to upstream or downstream components.

Typical Symptoms

The following error messages may occur:

  • runtime error: index out of range [3] with length 3
  • runtime error: slice bounds out of range [86:59]
  • runtime error: invalid memory address or nil pointer dereference

Case Analysis

Client Panic

A typical scenario involves a field referenced within the request being a global variable (or a cached object) that can be concurrently written to.

The simplified error message is as follows:

KITEX: panic, ..., error=runtime error: invalid memory address or nil pointer dereference
panic(...):
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProcotol.WriteBinaryNocopy(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField2(...)

The panic stack includes fastWriteField2, which is a typical case of concurrent read and write in business logic. The simplified business code is as follows:

key := "default" // reflect.StringHeader{Data=0xXXX, Len=7}
fallbackKey = "" // reflect.StringHeader{Data=nil, Len=0}
wg := sync.WaitGroup{}
for _, task := range taskList {
    wg.Add(1)
    go func() {
        defer wg.Done()
        if someCondition {
            key = fallbackKey // `key` may be read by another goroutine
        }
        kitexClient.GetByKey(ctx, &Request{Key: key})
    }()
}
wg.Wait()

Analysis:

  1. The key variable of type string may be read/written by different goroutines within the loop.
  2. Reading/writing a string is not thread-safe: In the Go Runtime implementation, a reflect.StringHeader{Data uintptr, Len int} is used with two fields that need to be assigned separately.
  3. Kitex’s FastWrite may read a “partially written string” (Data = nil, Len = 7) and pass it to the fastWriteField2 method.
  4. When fastWriteField2 attempts to read 7 characters from Data = nil, it triggers a nil pointer dereference panic.

Note:

  1. Similar issues can occur with slices, which correspond to reflect.SliceHeader and have Data, Len, and Cap fields.
  2. Concurrent read and write of maps can result in a fatal error in Go.
  3. The absence of errors or panics does not guarantee the absence of concurrency issues. In reality, inconsistent data may have been constructed and passed to upstream or downstream components: For example, if fallbackKey = "123456789" in the code above, the serialized data may be read as “1234567”. If fallbackKey = "123", it could result in out-of-bounds data access or even panics.

Server Panic

A typical scenario is when a field referenced in the response is a global variable (or a cached object), and that object may be concurrently written to.

The error message for this case is as follows (slightly simplified):

KITEX: panic happened, ..., error=<Error: runtime error: index out of range [3] with length 1>
panic(...)
encoding/binary.bigEndian.PutUint32(...)
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProtocol.WriteI32(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField3(...)

From the panic stack, it can be seen that the fastWriteField3 method generated by Kitex triggers the panic. It is a typical case of concurrent read and write in business logic. The simplified business code can be as follows:

var localCache sync.Map{}

func (*Handler) GetByKey(ctx context.Context, req *xxx.Request) (*xxx.Response, error) {
    resp := localCache.Get(req.Key)
    resp.UserID = req.UserID
    return resp, nil
}

Analysis:

  1. PutUint32(b []byte, v uint32) checks if b[3] is readable before writing v to b. The panic occurs here, indicating an out-of-range access.
  2. FastWrite has already traversed resp to determine the required length, indicating that between “(1) calculating required memory” and “(3) serialization,” a field in resp has been modified, requiring more memory space.
  3. Let’s assume the server receives two requests simultaneously (both with Key as “X”):
    1. A.Request: {Key="X", UserID = "123" }
    2. B.Request: {Key="X", UserID = "123456"}
  4. The process may execute in the following order:
    1. FastWrite for request A calculates that it needs N+3 bytes (N represents the space required by other fields in resp) and allocates the memory.
    2. Request B executes line 5 and replaces UserID with “123456” (note that the response is cached, which means A will also return this object).
    3. Request A starts serializing resp into the allocated space. When it tries to write the last field, UserID, it finds that there is not enough space available (it now needs N+6 bytes): index out of range [3] with length 1

Note:

  1. The absence of errors or panics does not imply the absence of concurrency issues. In reality, abnormal data may have been constructed and passed to upstream and downstream components: For example, if the execution order of A and B mentioned above is reversed, no error will occur, but the expected value to be written in the response message is “123456,” while the actual value written is “123,” and the Kitex client can decode it correctly.

Troubleshooting Suggestions

Self-inspect the code based on the aforementioned case

If there is concurrent read and write on the Request or Response (including objects directly or indirectly referenced by them), fix this issue first.

Use the Go Race Detector

If possible, use the -race flag to identify and eliminate concurrency issues in the code. Refer to https://go.dev/blog/race-detector for more details.

Note: Be cautious when using it in a production environment as it can significantly impact performance.

Identify the problematic field based on the error message

Based on the panic stack, you can identify the specific field by looking at the field type and fastWriteFieldN method.

For example, if the innermost method in the panic stack is: kitex_gen/some_package.(***Base**).**fastWriteField3**(...) , it means that the error is caused by the field with index 3 in the Base type. In the following IDL, the field at index 3 is the Addr field of type string:

struct Base {
    1: string LogID = "",
    2: string Caller = "",
    3: string Addr = "",
}

If the panic error message contains invalid memory address or nil pointer dereference, it indicates that the Addr field may have been concurrently written with a null value by another goroutine.

If the panic error message is different (e.g., index out of range [3] with length 3), it may indicate that another field is being concurrently read and written (variable length) and is occupying more memory, causing insufficient memory allocation when writing the field. You can refer to the “Comparing Two Sampling Results” section below to locate the issue.

Client Side

The proposed solution for the client side is as follows:

  1. First Sampling: Before sending the request, serialize the entire request body into JSON.

    • Note: This may impact performance. Consider sampling the requests proportionally to reduce the impact on the production environment, but it may take longer to capture relevant cases.
  2. Second Sampling (two approaches for reference):

    a. Panic Sampling: Check if a panic has occurred and then serialize the request body into JSON.

    • Check if the returned error from the client contains panic information.
    • Check if a panic has occurred in the middleware. Refer to the “Kitex - Panic 处理” section in the Kitex - Panic Handling documentation.

    b. Delayed Sampling: Create a goroutine that sleeps for a certain amount of time and then serialize the request body into JSON.

    • The longer the sleep time, the higher the probability of finding differences.
  3. Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.

Note:

  1. If a certain performance loss (additional JSON marshaling, comparison) is acceptable, consider comparing all requests. Some requests may not cause panics but could still produce inconsistent data.
  2. Injecting troubleshooting through middleware is simpler and does not require modifying the business code.

Server Side

On the server side, as the server has exited all middleware during encoding, it is not possible to capture panics or perform sampling comparisons using middleware.

The proposed solution for the server side is as follows:

  1. Custom PayloadCodec:
    • Perform sampling and panic comparison during the Marshal operation.
    • Note: This may impact performance. Consider sampling the requests proportionally to reduce the impact on the production environment, but it may take longer to capture relevant cases.
type codecForPanic struct {
    payloadCodec remote.PayloadCodec
}

func (c *codecForPanic) Marshal(ctx context.Context, message remote.Message, out remote.ByteBuffer) error {
    var before, after []byte
    var err error
    defer func() {
       if err := recover(); err != nil {
          klog.Errorf("panic: %v", err)
          after, _ = json.Marshal(message.Data())
          if bytes.Compare(before, after) != 0 {
             klog.Errorf("before = %s, after = %s", before, after)
          }
       }
    }()
    before, err = json.Marshal(message.Data()) // Note the performance loss
    if err != nil {
        klog.Errorf("json encode before Marshal failed: err = %v", err)
    }
    return c.payloadCodec.Marshal(ctx, message, out)
}

func (c *codecForPanic) Unmarshal(ctx context.Context, message remote.Message, in remote.ByteBuffer) error {
    // Recover and compare here
    return c.payloadCodec.Unmarshal(ctx, message, in)
}

func (c *codecForPanic) Name() string {
    return "codecForPanic"
}
  1. Specify the custom PayloadCodec during server initialization:
svr := test.NewServer(new(TestServiceImpl),
    server.WithPayloadCodec(&codecForPanic{
       payloadCodec: thrift.NewThriftCodecWithConfig(thrift.FastRead | thrift.FastWrite),
    }),
    // Other options
)
  1. Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.

If panics occur infrequently, you can also consider starting a new goroutine, sleeping for a period of time, and then performing the comparison. This approach makes it easier to identify the modified sections.

FastRead

Typical Symptoms

runtime error: slice bounds out of range [1600217702:1678338]

Cause

The received message (Client: response, Server: request) is encoded incorrectly.

Troubleshooting Suggestions

Refer to the troubleshooting suggestions for FastWrite and investigate the corresponding endpoint (if the error occurs on the server side, check the client; vice versa).

FAQ

How to Disable FastWrite/Read

It is not recommended to disable it: Race conditions in the business logic can lead to inconsistent business data and unpredictable consequences.

If there is a genuine need, you can choose one of the following options:

  1. Disable the FastCodec in Thrift using options:
// Server side
svr := xxxService.NewServer(handler, server.WithPayloadCodec(
    thrift.NewThriftCodecDisableFastMode(true, false)))

// Client side
cli, err := xxxService.NewClient("", client.WithPayloadCodec(
    thrift.NewThriftCodecDisableFastMode(true, false)))
  1. Do not generate FastCodec encoding/decoding code: Use the -no-fast-api parameter of the Kitex command-line tool to regenerate the code.

Last modified November 13, 2024 : chore: update ci (#1167) (79622b3)