![]() |
VOOZH | about |
We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.
Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.
Follow TNS on your favorite social media networks.
Become a TNS follower on LinkedIn.
Check out the latest featured and trending stories while you wait for your first TNS newsletter.
Anyone who’s worked with distributed systems knows there are far too many opportunities for something to go wrong. Pinpointing the exact spot in the source code is like looking for a needle in a haystack.
Kairux is a fault localization tool for Java-based systems, one that uses unit testing and adaptive dynamic slicing to work backward from the failure execution to identify what went wrong and why.
By implementing the Inflection Point Hypothesis — that the root cause of any failure is also the step where a failure execution and successful execution diverge — Kairux can identify the bug and present the source code needed to understand why it happened.
About 77% of failures in distributed systems are caused by more than just one thing. There’s no perfect solution, and picking the “right” root cause is subjective. While Kairux can’t guarantee that the “right” root cause is selected, it can identify the last root case so by continuing to apply Kairux repeatedly, multiple bugs in an execution could be identified.
Kairux first reduces the potential number of successful threads to compare against, then returns the one with the longest common prefix to the failure thread.
This research project was highlighted this month by Usenix, though the authors of the project have not indicated whether they plan to open source the code, which works for Java code, or embark on a commercial implementation.
The Inflection Point Hypothesis is simple and straightforward. The inflection point is the root cause of an issue, the first instruction where the failure execution diverges from the non-failure execution. So, the best way to pinpoint the inflection point is to find the successful execution with the longest common prefix to the failure execution.
The inflection point holds significance as the last step in the failure execution where failure is still avoidable. By using this hypothesis as the guiding principle, fault localization becomes a principled search problem.
A failure caused by a read-after-write data race. In example-failure, Thread 1 modified a = -1, which is no longer 0 causing a failure. failure-instr-seq is the instruction sequence that caused the bug in example-failure. If all successful combinations of instructions are created and compared to failure-instr-seq, eventually the sequence with the longest common prefix will surface. In this example, it’s instr-seq-n. (Source: Usenix)
Kairux is a set of algorithms for Java-based distributed systems that abstracts away all the heavy lifting. It takes the following inputs:
In order to output:
The “why” and “how” of the bug can be understood by comparing the failed execution’s source code with the longest successful source code. The “where” is more readily located by minimizing the number of potential sources. Kairux reduces the potentially infinite number of combinations with the following key concepts:
Greater detail about Kairux’s algorithms can be found in the technical paper.
Kairux uses static analysis over dynamic analysis due to the high overhead costs associated with dynamic analysis. The static slice is a super-set of the instructions that belong in the dynamic slice of any failure execution.
Kairux builds the dynamic slice it needs by setting breakpoints, then reproducing the failure execution. Each breakpoint it’s able to hit is recorded and a trace is obtained, followed by a dependency analysis, and then annotating network communication libraries. From this, a dynamic slice is obtained.
Additional work includes the use of breakpoints to enforce different thread scheduling and assigning unique tags to differentiate different runtime instances of a source code object and track data flow.
The short answer: it works. The root causes for failures were located 70% of the time. The number of sequences examined were reduced from a possible infinity to 0.2% in the process. Failures caused by “missing events,” something that should have happened but didn’t, and failures caused by anomalous events were both explained.
Kairux was evaluated on randomly sampled, real-world failures from HBase, HDFS and ZooKeeper.