帮助反向工程二进制文件格式的工具

有什么工具可以帮助解码未知的二进制数据格式?

我知道十六进制工作室和010编辑器都支持结构。对于已知的固定格式,这些在一定程度上是可以的,但是对于更复杂的格式,特别是对于未知的格式,就很难使用了。我想我正在寻找一个脚本语言模块或者一个可编写脚本的 GUI 工具。

例如,我希望能够从有限的已知信息(可能是一个神奇的数字)中找到一个数据块内的结构。一旦我找到了一个结构,然后按照已知的长度和偏移字找到其他结构。然后在有意义的地方递归地、迭代地重复这个步骤。

在我的梦中,甚至可能根据我已经告诉系统的信息自动识别可能的偏移量和长度!

77502 次浏览

My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.

The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.

The tool is free, works on all platforms (Win, Mac, Linux), but as it's personal tool which I just released to the public to share it, it's not much documented.

However, if you want to give it a try, and like to give feedback, I might add more useful features.

I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.

Link: iBored home page

Here are some tips that come to mind:

From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:

Do some statistical analysis on various parts. Random data, for example, will tell you that this part is probably compressed/encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings and so on. Try to spot various offsets. Try to convert parts of the binary into 2 or 4 byte integers or into floats, print them and see if they make sence. Write some functions that will search for repeating or very similar parts in the data, this way you can easily spot headers.

Try to find as many strings as possible, try different encodings (c strings, pascal strings, utf8/16, etc.). There are some good tools for that (I think that Hex Workshop has such a tool). Strings can tell you a lot.

Good luck!

I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.

Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.

I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...

The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.

Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):

Tupni: Automatic Reverse Engineering of Input Formats (@ ACM digital library)

Abstract

Recent work has established the importance of automatic reverse engineering of protocol or file format specifications. However, the formats reverse engineered by previous tools have missed important information that is critical for security applications. In this paper, we present Tupni, a tool that can reverse engineer an input format with a rich set of information, including record sequences, record types, and input constraints. Tupni can generalize the format specification over multiple inputs. We have implemented a prototype of Tupni and evaluated it on 10 different formats: five file formats (WMF, BMP, JPG, PNG and TIF) and five network protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all record sequences in the test inputs. We also show that, by aggregating over multiple WMF files, Tupni can derive a more complete format specification for WMF. Furthermore, we demonstrate the utility of Tupni by using the rich information it provides for zeroday vulnerability signature generation, which was not possible with previous reverse engineering tools.

A cut'n'paste of my answer to a similar question:

One tool is WinOLS, which is designed for interpreting and editing vehicle engine managment computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.

It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.

For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It! (http://www.synalysis.net/)

Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.

My project icebuddha.com supports this using python to describe the format in the browser.

There is Hachoir which is a Python library for parsing any binary format into fields, and then browse the fields. It has lots of parsers for common formats, but you can also write own parsers for your files (eg. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). Looks like the project is pretty much inactive by now, though.

Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.