Diaphora Heuristics
Diaphora uses multiple heuristics to find matches between different functions. The following list shows all the heuristics implemented in the current Diaphora release.
Best matches
The very first thing Diaphora tries is to determine whether everything in both databases, even the primary key values, is equal. If so, the databases are considered 100% equal and nothing else is done. After that, it runs a number of heuristics whose results are known to be very good and reliable.
- Equal pseudo-code. The pseudo-code generated by the Hex-Rays decompiler is equal for both functions. It can match code from x86, x86_64 and ARM interchangeably.
- Equal assembly. The assembly of both functions is exactly equal.
- Bytes hash and names. The first byte of each assembly instruction is equal, and the referenced true names (real names, not IDA's auto-generated ones) are the same.
- Same address, nodes, edges and mnemonics. The addresses, the number of basic blocks, the number of edges and the mnemonics are equal in both databases.
- Same RVA and hash. The RVA (Relative Virtual Address) and the bytes hash are the same in both databases.
- Same order and hash. Both functions have the same bytes hash and were discovered by IDA at the very same position in the database (e.g., both functions are the 100th function in their respective databases).
- Function hash. The calculated function hash is equal for both functions. The hash is calculated as the MD5 of the concatenation of all the instruction bytes in a function.
- Bytes hash. The calculated bytes hash is equal for both functions. The hash is calculated as the MD5 of the concatenation of all the instruction bytes in a function but, unlike the function hash, it first strips any byte that can vary with the address of the instruction, such as displacements or relative calls and jumps (a sketch contrasting both hashes follows this list).
- Bytes sum. Both the size of the function in bytes and the sum of all the bytes in the function are the same for both functions.
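To make the difference between the function hash and the bytes hash concrete, here is a minimal Python sketch. The instruction representation (raw bytes plus the offsets of address-dependent bytes) is a made-up illustration rather than Diaphora's actual data model; only the MD5-over-concatenated-instruction-bytes idea comes from the descriptions above.

```python
import hashlib

# Hypothetical instruction record: (raw bytes, offsets of the bytes that
# depend on the instruction's address, e.g. displacements or relative
# call/jump targets). This layout is illustrative, not Diaphora's.

def function_hash(instructions):
    """Function-hash style: MD5 of all raw instruction bytes, as-is."""
    md5 = hashlib.md5()
    for raw, _ in instructions:
        md5.update(raw)
    return md5.hexdigest()

def bytes_hash(instructions):
    """Bytes-hash style: the same, but address-dependent bytes are
    stripped first, so identical code at different addresses (or with
    relocated targets) still hashes to the same value."""
    md5 = hashlib.md5()
    for raw, variable_offsets in instructions:
        md5.update(bytes(b for i, b in enumerate(raw)
                         if i not in variable_offsets))
    return md5.hexdigest()

# "call rel32" with two different relative offsets: the function hashes
# differ, but the bytes hashes match because bytes 1..4 are stripped.
a = [(b"\xe8\x10\x00\x00\x00", {1, 2, 3, 4})]
b = [(b"\xe8\x80\xff\xff\xff", {1, 2, 3, 4})]
assert function_hash(a) != function_hash(b)
assert bytes_hash(a) == bytes_hash(b)
```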
Partial
Partial matches are those whose confidence ratio is greater than or equal to 0.5.
- All or most attributes. All, or most, of the attributes of a function (basic blocks, prime values, hashes, etc.) are equal in both functions.
- Switch structures. The cases and values of all the switch statements in both functions are equal.
- Same name. The mangled or demangled name is the same in both functions.
- Same address, nodes, edges and primes (re-ordered instructions). The address, the number of basic blocks, the edges and the prime corresponding to the cyclomatic complexity are the same in both functions. It typically matches functions with re-ordered instructions.
- Import names hash. The functions called from both functions are the same, matched by the demangled names.
- Nodes, edges, complexity, mnemonics, names, prototype, in-degree and out-degree. The number of basic blocks, edges, the cyclomatic complexity, the mnemonics, the names, the function's prototype, the in-degree (calls to the function) and the out-degree (calls made to other functions) are the same.
- Nodes, edges, complexity, mnemonics, names and prototype. The number of basic blocks, edges, the cyclomatic complexity, the mnemonics, the true names used in the function and even the prototype of the function (stripping the function name) are the same.
- Mnemonics and names. The functions have the same mnemonics and the same true names. This check is only performed for functions with the same number of instructions.
- Mnemonics small-primes-product. The SPPs, calculated by assigning a prime number to each mnemonic, are the same for both functions (see the SPP sketch after this list). It's sensitive to changes in IDA: if IDA's API GetInstructionList() ever reorders the instructions, previously exported Diaphora databases would no longer be comparable to new ones.
- Small names difference. At least 50% of the true names used in both functions are the same.
- Pseudo-code fuzzy hash. It checks the normal fuzzy hash (calculated with DeepToad's library kfuzzy.py) for both functions.
- Pseudo-code fuzzy hashes. It checks all 3 fuzzy hashes (calculated with DeepToad's library kfuzzy.py) for both functions. This is considered a slow heuristic.
- Similar pseudo-code. The pseudo-code generated by the Hex-Rays decompiler is similar, with a confidence ratio greater than or equal to 0.6. This is considered a slow heuristic.
- Similar pseudo-code and names. Same as before but also the true names used in both functions are equal.
- Pseudo-code fuzzy AST hash. The fuzzy hash, calculated via SPP (small-primes-product) from the AST of the Hex-Rays decompiled function, is the same for both functions. It typically catches C constructs that are re-ordered, not just re-ordered assembly instructions.
- Partial pseudo-code fuzzy hash. At least the first 16 bytes of the fuzzy hash (calculated with DeepToad's library kfuzzy.py) match for both functions. This is considered a slow heuristic.
- Topological sort hash. Both the strongly connected components and the topological sort hash of the graph are the same for both functions (see the graph-invariants sketch after this list).
- Same high complexity, prototype and names. The cyclomatic complexity is at least 20, and the prototype and the true names used in the function are the same for both databases.
- Same high complexity and names. Same as before but ignoring the function's prototype.
- Strongly connected components. The sets of strongly connected components of the graph are the same in both databases. This is considered a slow heuristic.
- Strongly connected components small-primes-product. The SPP calculated by assigning prime numbers to each strongly connected component for each set of strongly connected components in the graph is the same for both functions.
- Loop count. The number of loops (more than 1) in the function is the same for both databases and the confidence ratio is at least 0.49. This is considered a slow heuristic.
- Same nodes, edges and strongly connected components. The number of basic blocks, relationships between them and the sets of strongly connected components in the function graph are the same for both functions.
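The small-primes-product (SPP) trick behind the mnemonics heuristic above, and, applied to AST node types, behind the fuzzy AST hash, is easy to sketch. Since multiplication is commutative, the product is invariant under instruction re-ordering. The concrete mnemonic list and primes below are made up for illustration; in Diaphora the prime assigned to a mnemonic depends on its position in IDA's GetInstructionList(), which is exactly why a re-ordering of that list would break comparability with older databases.

```python
# Stand-in for IDA's GetInstructionList(): the prime assigned to a
# mnemonic depends on its position here, so this table must be shared
# by both databases being compared (hence the fragility noted above).
INSTRUCTION_LIST = ["mov", "add", "sub", "push", "pop", "call", "ret", "jmp"]
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]
MNEM_PRIME = dict(zip(INSTRUCTION_LIST, PRIMES))

def mnemonic_spp(mnemonics):
    """Product of the primes assigned to each mnemonic occurrence.
    Multiplication commutes, so re-ordered instructions score the same."""
    product = 1
    for mnem in mnemonics:
        product *= MNEM_PRIME[mnem]
    return product

# The same instructions in a different order still match...
assert mnemonic_spp(["push", "mov", "add", "ret"]) == \
       mnemonic_spp(["mov", "push", "ret", "add"])
# ...while a different multiset of mnemonics does not.
assert mnemonic_spp(["push", "mov", "sub", "ret"]) != \
       mnemonic_spp(["push", "mov", "add", "ret"])
```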
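The strongly-connected-components and topological-sort-hash heuristics likewise reduce to comparing invariants of the control flow graph. Here is a sketch using networkx (which Diaphora itself does not depend on); the exact hash construction is an illustrative guess, not Diaphora's implementation:

```python
import hashlib
import networkx as nx

def cfg_invariants(edges):
    """Return (multiset of SCC sizes, a hash of the condensation's
    topological order) for a CFG given as (src, dst) basic-block edges.
    Both invariants are stable under renaming of the basic blocks."""
    g = nx.DiGraph(edges)
    # Sorted SCC sizes: loops show up as components with more than 1 node.
    scc_sizes = sorted(len(c) for c in nx.strongly_connected_components(g))
    # Collapse each SCC to a single node (yielding a DAG) and hash the
    # SCC sizes in one topological order of that DAG.
    dag = nx.condensation(g)
    topo = [len(dag.nodes[n]["members"]) for n in nx.topological_sort(dag)]
    return scc_sizes, hashlib.md5(str(topo).encode()).hexdigest()

# A two-block loop followed by an exit block: one SCC of size 2, one of
# size 1, regardless of how the basic blocks are named.
print(cfg_invariants([("A", "B"), ("B", "A"), ("B", "C")])[0])  # [1, 2]
```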
Unreliable
Unreliable heuristics are those whose matches are sometimes useful, but which may generate a high number of false positives.
- Strongly connected components. The sets of strongly connected components are the same and there are at least 2 components. This is considered a slow heuristic.
- Loop count. The number of loops is the same for both functions. The comparison is made without checking the number of basic blocks. This is considered a slow heuristic.
- Nodes, edges, complexity and mnemonics. The number of basic blocks, the edges, the cyclomatic complexity (naturally) and the mnemonics are the same. It can match very similar functions that actually perform opposite operations (like add_XXX and sub_XXX). Besides, this is considered a slow heuristic.
- Nodes, edges, complexity and prototype. Same as before but the mnemonics are ignored and only the true names used in both functions are considered. This is considered a slow heuristic.
- Nodes, edges, complexity, in-degree and out-degree. The number of basic blocks, the edges, the cyclomatic complexity (naturally), the number of functions calling it and the number of functions called from it are the same for both functions. This is considered a slow heuristic.
- Nodes, edges and complexity. Same number of nodes, edges and, naturally, the same cyclomatic complexity values (see the note after this list). This is considered a slow heuristic.
- Similar pseudo-code. The pseudo-codes are considered similar, with a confidence ratio of 0.58 or less. This is considered a slow heuristic.
- Same high complexity. Both functions have the same high cyclomatic complexity, which is at least 50. This is considered a slow heuristic.
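The "(naturally)" remarks in this list point to the fact that cyclomatic complexity is not an independent attribute: it is computed directly from the edge and node counts of the control flow graph, so any heuristic that matches both counts matches the complexity for free. A one-line reminder of McCabe's formula:

```python
def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's M = E - N + 2P; a single function's CFG has P == 1."""
    return edges - nodes + 2 * components

# A CFG with 9 edges and 8 basic blocks has complexity 9 - 8 + 2 = 3.
assert cyclomatic_complexity(edges=9, nodes=8) == 3
```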
Experimental
Experimental heuristics are likely to be removed, moved or changed in the future. These are the heuristics that are under development and are used, mostly, for research. Some experimental heuristics are promoted to partial or even best matches after the research is done. However, a heuristic can also be dropped if, after researching it for some time, its results are bad.
- Call address sequence. Check for similar or equal functions by sequentially looking over the whole list of matches (both “Best” and “Partial”). The current implementation is far from “good” but still works. However, as it doesn't work properly all the time, it's considered an experimental heuristic.
- Small pseudo-code fuzzy AST hash. Same as “Pseudo-code fuzzy AST hash” but applied to functions with 5 or fewer lines of pseudo-code. Like the previous heuristic, it matches too many things and the calculated confidence ratio is typically bad.
- Similar small pseudo-code. Even worse than the previous one, as it tries to match similar functions with 5 or fewer lines of pseudo-code, matching almost anything and getting confidence ratios of 0.25 if you're lucky.
- Equal small pseudo-code. Even worse than before, as it matches functions with the same pseudo-code that is 5 or fewer lines long. Typically, you get 2 or 3 results that are, actually, wrong.
- Same low complexity, prototype and names. The prototype of the functions, the true names used in the functions and their cyclomatic complexity, which is less than 20, are the same. It worked for me once, I think.
- Same low complexity and names. The cyclomatic complexity, which is less than 20, and the true names used in the function are the same. It typically matches functions already matched by other heuristics, so its usefulness is really limited.
- Same graph. Judging by most attributes of the functions, the graph seems to be the same in both databases. It's a really slow heuristic that causes some false positives and, for huge databases, might even cause the comparison to crash because SQLite requires more than 3 GB of memory (if the IDA process is 32-bit, the default as of 2016).