Seeking Adoption: dotnet time-travel debugging for distributed systems

2 points by nikola-petrov 2 days ago

Hi,

A few years ago, I faced challenges while working with a complex microservices system. The system’s tests relied on non-deterministic data, and reproducing intricate business scenarios or bugs locally was almost impossible. These issues prompted me to create a tool that simplifies the workflow by enabling record and replay of service interactions.

The core idea is to record all calls and data associated with a specific request, allowing easy replay for two primary cases:

- Troubleshooting test failures: when tests fail, the tool makes it easier to pinpoint the exact service where the data diverged, and shows whether the failure stems from the use of external data.
- Reproducing business scenarios: a client or PM can record the workflow and hand the resulting state directly to a developer, who then has a reproducible scenario for troubleshooting.

The tool uses DispatchProxy to create decorators for every interface within the service, recording argument values on function entry and results on function exit. Replay is not entirely deterministic: values read inside a function body, such as `DateTime.Now`, are not intercepted. The function's final return value, however, is recorded as-is. The tool also propagates recording through HTTP calls, aggregating the internal microservice calls into a single zip file that contains the recorded data for the entire call chain.
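To make the idea concrete, here is a minimal sketch of the record-and-replay technique built on `System.Reflection.DispatchProxy`. It is not the actual Lide code: `IPriceService`, `PriceService`, `CallRecord`, `RecordingProxy`, `ReplayProxy` and `Demo` are hypothetical names invented for illustration, and the sketch ignores HTTP propagation, serialization and exception handling.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical service interface and implementation, for illustration only.
public interface IPriceService
{
    decimal GetPrice(string sku, int quantity);
}

public class PriceService : IPriceService
{
    public decimal GetPrice(string sku, int quantity) => 9.99m * quantity;
}

// One recorded call: method name, arguments on entry, result on exit.
public record CallRecord(string Method, object?[] Args, object? Result);

// Decorator that forwards every interface call to the real target and logs it.
public class RecordingProxy<T> : DispatchProxy where T : class
{
    private T _target = default!;
    private List<CallRecord> _log = default!;

    public static T Wrap(T target, List<CallRecord> log)
    {
        // Create<T, TProxy>() returns an object that implements T
        // and derives from TProxy, so the cast below is valid.
        T proxy = Create<T, RecordingProxy<T>>();
        var self = (RecordingProxy<T>)(object)proxy;
        self._target = target;
        self._log = log;
        return proxy;
    }

    protected override object? Invoke(MethodInfo? targetMethod, object?[]? args)
    {
        // Note: exceptions thrown by the target surface here wrapped in
        // TargetInvocationException; a real tool would unwrap and record them.
        object? result = targetMethod!.Invoke(_target, args);
        _log.Add(new CallRecord(targetMethod.Name, args ?? Array.Empty<object?>(), result));
        return result;
    }
}

// Decorator that never touches a real target: it returns recorded results in order.
public class ReplayProxy<T> : DispatchProxy where T : class
{
    private Queue<CallRecord> _calls = new();

    public static T From(IEnumerable<CallRecord> log)
    {
        T proxy = Create<T, ReplayProxy<T>>();
        ((ReplayProxy<T>)(object)proxy)._calls = new Queue<CallRecord>(log);
        return proxy;
    }

    protected override object? Invoke(MethodInfo? targetMethod, object?[]? args)
    {
        CallRecord next = _calls.Dequeue();
        if (next.Method != targetMethod!.Name)
            throw new InvalidOperationException(
                $"Replay mismatch: expected {next.Method}, got {targetMethod.Name}");
        return next.Result;
    }
}

public static class Demo
{
    // Records one live call, then replays it without the real service.
    public static (decimal Live, decimal Replayed, int Recorded) Run()
    {
        var log = new List<CallRecord>();
        IPriceService recorded = RecordingProxy<IPriceService>.Wrap(new PriceService(), log);
        decimal live = recorded.GetPrice("A-1", 3);

        IPriceService replayed = ReplayProxy<IPriceService>.From(log);
        decimal again = replayed.GetPrice("A-1", 3);
        return (live, again, log.Count);
    }
}
```

Calling `Demo.Run()` records a single `GetPrice` call and then serves the same result from the log. Because only the return value is captured, anything computed inside `PriceService` (a `DateTime.Now` read, for instance) is not re-executed on replay, which is the trade-off described above.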

Downsides:

- Requires interfaces and dependency injection for all components.
- Usability features are incomplete.
- Does not yet include a feature to compare two records and identify the first difference (e.g., the service, file, or line where changes occur).

On the positive side:

- The interface dependency can be removed by rewriting the IL at runtime so that calls on concrete types are intercepted too (though this is challenging and performance-intensive).
- Installation and usage are straightforward: adding a NuGet package and configuring services and middleware is sufficient.
- Negligible performance impact, except on the request being recorded, because the decorator is applied only within the scope of the recorded request.

All that said, when the first demo was ready, I failed to persuade the company to adopt it, or even to see any value in it. I'm a developer, not a business person, and I had no idea what to do with it, so I temporarily froze it. That was 3 years ago. I can't force myself to finish it just for the sake of it, without knowing when, if ever, anyone will see value in it and use it.

If anyone finds value in this tool, wishes to adopt or use it, or has ideas for it, please feel free to reach out. I'm open to further development, knowledge transfer, or assisting in its adoption.

Code: https://github.com/LideService/Lide
Contact: nikola.gamzakov@hotmail.com

Thank you!