Diaphora – An IDA Python BinDiffing plugin

 

Index

Introduction

Files distributed with the diaphora distribution

Running Diaphora

Diaphora quick start

Finding differences in new versions (Patch diffing)

Ignoring small differences (Finding new functionalities)

Porting symbols

Diffing huge databases (or exporting smaller .SQLite databases)

Heuristics

Best matches

Partial and unreliabe matches (according to the confidence's ratio):

Unreliable matches

Experimental (and likely to be removed or moved or changed in the future):

 

 

Introduction

 

Diaphora is a plugin for IDA Pro that aims to help in the typical BinDiffing tasks. It's similar to other competitor products and open source projects like Zynamics BinDiff, DarunGrim or TurboDiff. However, it's able to perform more actions than any of the previous IDA plugins or projects.

 

In the next paragraphs, I will describe how to use it in different scenarios.

 

Files distributed with the diaphora distribution

 

Diaphora is distributed as a compressed file with various files and folders inside it. The structure is similar to the following one:

 

  1. 1.diaphora.py: The main IDAPython plugin. It contains all the code of the heuristics, graphs displaying, export interface, etc... 

  2. 2.jkutils/kfuzzy.py: This is an unmodified version of the kfuzzy.py library, part of the DeepToad project, a tool and a library for performing fuzzy hashing of binary files. It's included because fuzzy hashes of pseudo-codes are used as part of the various heuristics implemented. 

  3. 3.jkutils/factor.py: This is a modified version of a private malware clusterization toolkit based on graphs theory. This library offers the ability to factor numbers quickly in Python and, also, to compare arrays of prime factors. Diaphora uses it to compare fuzzy AST hashes and call graph fuzzy hashes based on small-primes-products (an idea coined and implemented by Thomas Dullien and Rolf Rolles first, authors or former authors of the Zynamics BinDiff commercial product, in their “Graph-based comparison of Executable Objects – Zynamicspaper). 

  4. 4.Pygments/: This directoy contains an unmodified distribution of the Python pygments library, a “generic syntax highlighter suitable for use in code hosting, forums, wikis or other applications that need to prettify source code”. 

 

Running Diaphora

 

Diaphora can only be used by running the script, as of March 2015. Initially, during the BETA phases, there was support for installing it as a true IDA plugin. However, it causes a lot of maintenance problems, like finding workarounds for known IDA problems and bugs. As so, and because during the beta phase more time was expended in finding workarounds to different IDA bugs and problems with many different versions of IDA than actually fixing bugs on Diaphora, the support have been dropped. It may be added back again at some point in the future, but is unlikely.

 

So, in order to run Diaphora, simply, unpack the compressed distribution file wherever you prefer and directly execute “diaphora.py” from the IDA Pro menu File → Script file. Once the script diaphora.py is executed, a dialog like the following one will be opened:

 

 

This dialog, although it can be a bit confusing at first, is used for both exporting the current IDA database to SQLite format as well as for performing diffing against another SQLite exported format database.

 

The first field, is the path of the SQLite file format database that will be created with all the information extracted from the current database. The 2nd field is the other SQLite format database to diff the current database against. If this field is left empty, Diaphora will just export the current database to SQLite format. If the 2nd field is not empty, it will diff both databases.

 

The other fields, the check-boxes, are explained bellow:

 

  1. 1.Use the decompiler if available. If the Hex-Rays decompiler is installed with IDA and IDA Python bindings are available, Diaphora will use the decompiler to get many interesting information that will help during the bindiffing process. 

  2. 2.Export only non-IDA generated functions. Self-explanatory, only functions with non IDA autogenerated names will be exported. 

  3. 3.Do not export instructions and basic blocks. Export only function summaries. When exporting huge databases, it may help speeding up operations. However, the diffing capabilities will be more limited. 

  4. 4.Use probably unreliable methods. Diaphora uses many heuristics to try to match functions in both databases being compared. However, some heuristics are not really reliable or the ratio of similarity is very low. Check this box if you want to see also the likely unreliable matches Diaphora my find. Unreliable results are shown in a specific list, it doesn't mix the “Best results” (results with a ratio of 1.00) with the “Partial results” (results with a ratio of 0.50 or higher) or “Unreliable results”. 

  5. 5.Use slow heuristics. Some heuristics can be quite expensive and take long. For medium to big databases, it's disabled by default and is recommended to left unchecked unless the results from a execution with this option disabled are not good enough. It will likely find more better matches than the normal, not that slow, heuristics, but it will take significantly longer. 

  6. 6.Relaxed calculations of difference ratios. Diaphora uses, by default, a kind of aggressive method to calculate difference ratios between matches. It's possible to relax that aggressiveness level by checking this option. Under the hood, the function SequenceMatcher.quick_ratio is used when this option is unchecked and  SequenceMatcher.real_quick_ratio when this option is checked. Also, when the option is checked, Diaphora will use too the difference ratio of the primes numbers calculated from the AST of the pseudo-code of the 2 functions, calculating the highest ratio from the AST, assembly and pseudo-code comparisons. 

  7. 7.Use experimental heuristics. It says it all: experimental heuristics are enabled only if this check-box is marked. Disabled by default as they are likely not useful. 

  8. 8.Ignore automatically generated names. When performing the comparison between databases, it tells Diaphora to ignore in the “Same name” heuristic functions with the same IDA's autogenerated name (i.e., when there are two function sub_01020304 in both databases but they aren't actually the same function). Used only when comparing. 

  9. 9.Ignore all function names. Just disable the “Same name” heuristic. Used only when comparing. 

  10. 10.Ignore small functions. Ignore functions with less than 6 assembly instructions. Used only when comparing. 

     

Diaphora quick start

 

Finding differences in new versions (Patch diffing)

 

In order to use Diaphora we need at least two binary files to compare. I will take as example 2 different versions of the “avast” binary from Avast for Linux x86_64. The files has the following hashes:

 

  1. 1.c0092cf0cb1286cfd8399681bcab68a4  avast-2014  

  2. 2.716ff77e74e47d3ee619df49995536ec  avast 

 

The file “avast-2014” is an old version from 2014 and the binary “avast” is the latest version. Launch IDA Pro for 64 bits (idaq64) and open the file “avast-2014”. Once the initial auto-analysis finishes launch Diaphora by either running the script “diaphora.py”. The following dialog will open:

 

We only need to care about 2 things:

  1. 1.Field “Export current database to SQLite”. This is the path to the SQLite database that will be created with all the information extracted from the IDA database of this avast binary. 

  2. 2.Field “Use the decompiler if available”. If the Hex-Rays decompiler is available and we want to use it, we will leave this check-box marked, otherwise uncheck it. 

 

After correctly selecting the appropriate values, press OK. It will start exporting all the data from the IDA database. When export process finishes the message “Database exported.” will appear in the IDA's Output Window.

 

Now, close this database, save the changes and open the “avast” binary. Wait until the IDA's auto-analysis finishes and, after it, run Diaphora like with the previous binary file. This time, we will select in the 2nd field, the one name “SQLite database to diff”, the path to the .sqlite file we just exported in the previous step, as shown in the next figure:

 

After this, press the OK button. It will first export the current IDA database to the SQLite format as understood by Diaphora and, then, right after finishing, compare both databases. It will show an IDA's wait box dialog with the current heuristic being applied to match functions in both databases as shown in the next figure:

 

 

After a while a set of lists (choosers, in the HexRays workers language) will appear:

 

 

There is one more list that is not shown for this database, named “Unreliable matches”. This list holds all the matches that aren't considered reliable. However, in the case of this binary with symbols, there isn't even a single unreliable result. There are, however, unmatched functions in both the primary (the latest version) and the secondary database (the previous version):

 

 

The previous image shows the functions not matched in the secondary database, that is: the functions  removed in the latest version. The second figure shows the functions not matched in the previous database, the new functions added:

 

 

It seems they added various functions to check for SSE, AVX, etc... Intel instructions. Also, they added 2 new functions called handler and context. Let's take a look now to the “Best matches” tab opened:

 

 

There are many functions in the “Best matches” tab, 2556 functions, and in the primary database there are 2659 functions. The results shown in the “Best matches” tab are these functions matched with some heuristic (like “100% equal”, where all attributes are equal, or “Equal pseudo-code”, where the pseudo-code generated by the decompiler is equal) that, apparently, doesn't have any difference at all. If you're diffing these binaries to find vulnerabilities fixed, just skip this tab, you will be more interested in the “Partial matches” one ;) In this tab we have many results:

 

 

It shows the functions matched between both databases and, in the description field, it says which heuristic matched and the ratio of differences. If you're looking for functions where a vulnerability was likely fixed, this is where you want to look at. It seems that the function “handle_scan_item”, for example, was heavily modified: the ratio is 0.49, so it means that more than the 49% of the function differs between both databases. Let's see the differences: we can see then in an assembly graph, in plain assembly or we can diff pseudo-code too. Right click on the result and select “Diff assembly in a graph”, the following graph will appear:

 

 

The nodes in yellow colour, are these with only minor changes; pink ones, are these that are either new or heavily modified and the blank ones, the basic blocks that were not modified at all. Let's diff now the assembly in plain text: go back to the “Partial matches” tab, right click on the function “handle_scan_item” and select “Diff assembly”:

 

 
 

It shows the differences, in plain assembly, that one would see by using a tool like the Unix command “diff”. We can also dif the pseudo-code: go back to the “Partial matches” tab, right click in the function and select “Graph pseudo-code”:

 

 

As we can see, it shows all the differences in the pseudo-code in a 2 sides diff, like with the assembly diff. After you know how the 3 different ways to see differences work, you can choose your favourite or use all of the 3 for specific cases.

 

Saving and loading diffing results

 

Now that we have the diffing results we may want to store the results for checking them later on instead of re-launching the diffing process. This can be done by clicking on the “IDA View” tab (required because of IDA's behaviour) and then selecting from the main menu the option Edit → Plugins → Diaphora - Save Results.

 

In order to load the results, one need to execute the supplied script “diaphora_load.py” and then select the previously saved *.diaphora results file.

 

By the way: in case one closes a single tab it isn't required to relaunch the whole diffing process or reopen the *.diaphora stored results file, one can simply press the key “F3” or, alternatively, go to the menu Edit → Plugins → Diaphora – Show Results, it will display again any closed tab.

Ignoring small differences (Finding new functionalities)

 

Sometimes, you don't need to care about small changes when diffing 2 databases. For example, you maybe finding just the new features added to this or that program instead of finding bugs fixed in a product. We will continue with the previous binaries for this example. Go to the tab “Partial matches” and find the functions “respond” and “scan_reponse”:

 

 

According to the ratios shown it seems these functions are almost equal with small changes. Let's see the differences in the function respond: right click on the respond function and select “Diff pseudo-code”:

 

 

It seems that the only change in this function is, actually, the size of a stack variable and the given size. If we're looking for the new functionality added to the product, it can be irritating going through a big list of small changes. We will re-diff both databases: run again Diaphora by executing diaphora.py and, in the dialog select this time “Relaxed calculations on difference ratios” as shown bellow:

 

Press OK and wait for it to finish. When it's finished, go to the “Best matches” tab and find the “respond” or “scan_response” functions:

 

 

This time, as we can see, both functions appear in the “Best matches”, the list of functions that are considered equal, so you don't need to go through a big list with small changes here and there: the “Partial matches” tab will show only functions with bigger changes, making it easier to discover the new functionalities added to the program.

Porting symbols

 

One of the most common tasks in reverse engineering, at least from my experience, is porting symbols from previous versions of a program, library, etc... to the new version. It can be quite frustrating having to port function names, enumerations, comments, structure definitions, etc... manually to new versions, specially when talking about big reverse engineering projects.

 

In the following example, we will import the symbols, structures, enumerations, comments, prototypes, etc... from one version full of symbols to another version with symbols stripped. We will use Busybox 1.21-1, compiled in Ubuntu Linux for x86_64. After downloading and compiling it, we will have 2 different binaries: “busybox” and “busybox_unstripped”. The later, is the version with full symbols while the former is the one typically used for distribution, with all the symbols stripped. Launch IDA and open, first, the “busybox_unstripped” binary containing full symbols. Let's IDA finish the initial auto-analysis and, after this, run Diaphora by either running diaphora.py. In the dialog just opened, press OK:

 

 

Wait until Diaphora finishes exporting to SQLite the current database. When it finishes, close the current IDA database and open the binary “busybox”, wait until IDA finishes the initial auto-analysis and, then, launch again Diaphora. In the next dialog select as the SQLite database to diff the previous one we just created, the one with all the symbols from the previous binary:

 

 

Press OK and wait until it finishes comparing both databases. After a while, it will show various tabs with all the unmatched functions in both databases, as well as the “Best”, “Partial” and “Unreliable” matches tabs.

 

 

As we can see, Diaphora did a decent work matching 3112 functions labeled as “Best Matches” and 13 labeled as “Partial matches”, a total of 3125 functions out of 3630. Let's go to the “Best matches” tab. All the functions here are these that were matched with a high confidence ratio. Let's say that we want to import all the symbols for the “best matches”: right click on the list and select “Import *all* functions”. It will ask if we really want to do so: press YES. It will import all function names, comments at function and instruction level, function prototypes, structures, enumerations, IDA's type libraries (TILs) and even rename global variables and labels with names. When it's done it will ask us if we want to relaunch again the diffing process:

 

 

While Diaphora imports symbols, at the same time, it updates the database with the exported data from the primary database and, as so, with the new information it may be possible to match new functions not discovered before. In this case we will just say “NO” to this dialog.

 

Now, go to the “Partial matches” tab. In this list we have some matches that doesn't look like good ones:

 

 

As we can see, the ratios are pretty low: from 0.00 to 0.14. Let's diff the graphs of the “make_device” function (matched with the “same name” heuristic):

 

 

It doesn't look like a good match. And, if it's, it's rather big to verify yet. Let's delete this result: go back to the “Partial matches” tab, select the “make_device” match and, simply, press “DEL”. It will just remove this result. Now, do the same for the next results with a too low confidence ratio (i.e., 0.00). OK, we removed the bad results. Now, it's time to import all the partial matches: right click in the list and select “Import all data for sub_* functions”. It will import everything for functions that are IDA's auto-named, these that start with the “sub_” prefix but will not touch any function with a non IDA auto-generated name. It will ask us for confirmation:

 

 

Press “Yes”, and wait until it exports everything and updates the primary database. And, that's all! We just imported everything from function names and comments to structures and enumerations into the new database and we can just continue with our work with the new database and with all the data imported from the database we used to worked on before.

Diffing huge databases (or exporting smaller .SQLite databases)

 

Some IDA databases can be huge: for example, IDA databases for firmware images or IDA databases for >100MB binaries. In such cases, exporting and diffing big databases takes a lot of time and space. In order to make it a bit faster and requiring less disk space to store the .sqlite databases that Diaphora uses the following new options were added in the last release candidate:

 

 

Heuristics

 

Diaphora uses multiple heuristics to find matches between different functions. The next list shows all the heuristics implemented in the Diaphora Release Candidate 1:

Best matches

 

 

Partial and unreliabe matches (according to the confidence's ratio):

 

 

Unreliable matches

 

 

Experimental (and likely to be removed or moved or changed in the future):