Learning assembly

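I have decided to learn assembly language. The main reason is to be able to understand disassembled code, and maybe to write more efficient parts of code (for example, through C++), do things like code caves, and so on. I have seen that there are a zillion different flavors of assembly, so, for the purposes I mention, how should I start? What kind of assembly should I learn? I want to learn by first writing some easy programs (e.g. a calculator), but the goal itself is to get accustomed to it so that I can understand the code shown, for example, by IDA Pro.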

I'm using Windows (if that makes any difference).

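Edit: So, it seems everyone is pointing towards MASM. Although I get the point that it has high-level capabilities, which are all good for the assembly programmer, that's not what I'm looking for. It seems to have if, invoke, etc. instructions that are not shown in popular disassemblers (like IDA). So what I'd like to hear, if possible, is the opinion of anyone who uses ASM for the purpose I am asking about (reading the disassembled code of an exe in IDA), not just "general" assembly programmers.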

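Edit: OK. I am already learning assembly. I am learning MASM, without using the high-level stuff that doesn't matter to me. What I'm doing right now is trying out my code in __asm directives in C++, so I can try things out much faster than if I had to do everything from scratch with MASM.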

Start with MASM32 and from there look at FASM. But you'll have fun with MASM.

The assembly you would write by hand and the assembly generated by a compiler are often very different when viewed from a high level. Of course, the innards of the program will be very similar (there are only so many different ways to encode a = b + c, after all), but they're not the trouble when you're trying to reverse engineer something. The compiler will add a ton of boilerplate code to even simple executables: last time I compared, "Hello World" compiled by GCC was about 4kB, while if written by hand in assembly it's around 100 bytes. It's worse on Windows: last time I compared (admittedly, this was last century) the smallest "Hello World" I could get my Windows compiler of then-choice to generate was 52kB! Usually this boilerplate is only executed once, if at all, so it doesn't much affect program speed -- like I said above, the core of the program, the part where most execution time is spent, is usually pretty similar whether compiled or written by hand.

At the end of the day, this means that an expert assembly programmer and an expert disassembler are two different specialties. Commonly they're found in the same person, but they're really separate, and learning how to be an excellent assembly coder won't help you that much to learn reverse engineering.

What you want to do is grab the IA-32 and AMD64 (both are covered together) architecture manuals from Intel and AMD, and look through the early sections on instructions and opcodes. Maybe read a tutorial or two on assembly language, just to get the basics of assembly language down. Then grab a small sample program that you're interested in and disassemble it: step through its control flow and try to understand what it's doing. See if you can patch it to do something else. Then try again with another program, and repeat until you're comfortable enough to try to achieve a more useful goal. You might be interested in things like "crackmes", produced by the reverse engineering community, which are challenges for people interested in reverse engineering to try their hand at, and hopefully learn something along the way. They range in difficulty from basic (start here!) to impossible.

Above all, you just need to practice. As in many other disciplines, with reverse engineering, practice makes perfect... or at least better.

I found Hacking: The Art of Exploitation to be an interesting and useful way into this topic... can't say that I have ever used the knowledge directly, but that's really not why I read it. It gives you a much richer appreciation of the instructions that your code compiles to, which has occasionally been useful in understanding subtler bugs.

Don't be put off by the title. Most of the first part of the book is "Hacking" in the Eric Raymond sense of the word: creative, surprising, almost sneaky ways to solve tough problems. I (and maybe you) was a lot less interested in the security aspects.

I think you want to learn the ASCII-ized opcode mnemonics (and their parameters), which are output by a disassembler and which are understood by (can be used as input to) an assembler.

Any assembler (e.g. MASM) would do.

And/or it might be better for you to read a book about it (there have been books recommended on SO, I don't remember which).

I started out learning MIPS, which is a very compact 32-bit architecture. It is a reduced instruction set, but that's what makes it easy to grasp for beginners. You will still be able to understand how assembly works without getting overwhelmed by complexity. You can even download a nice little IDE that will let you compile your MIPS code. Once you get the hang of it, I think it will be much easier to move on to more complex architectures. At least that's what I thought :) At that point you will have the essential knowledge of memory allocation and management, logic flow, debugging, testing, etc.

I wouldn't focus on trying to write programs in assembly, at least not at first. If you're on x86 (which I assume you are, since you're using Windows), there are tons of weird special cases that it's kind of pointless to learn. For example, many instructions assume you're operating on a register that you don't explicitly name, and other instructions work on some registers but not others.

I would learn just enough about your intended architecture that you understand the basics, then just jump right in and try to understand your compiler's output. Arm yourself with the Intel manuals and just dive right into your compiler's output. Isolate the code of interest into a small function, so you can be sure to understand the entire thing.

I would consider the basics to be:

  • registers: how many are there, what are their names, and what are their sizes?
  • operand order: add eax, ebx means "Add ebx to eax and store the result in eax".
  • FPU: learn the basics of the floating-point stack and how you convert to/from fp.
  • addressing modes: [base + index * multiplier + displacement], where the multiplier can only be 1, 2, 4, or 8
  • calling conventions: how are parameters passed to a function?
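To make the addressing-mode item concrete, here is a minimal Python sketch of how the address in a memory operand is computed. The function name and parameters are made up for illustration; the only facts it relies on are that x86 scale factors are limited to 1, 2, 4 and 8 and that 32-bit address arithmetic wraps around.

```python
def effective_address(base=0, index=0, scale=1, disp=0, bits=32):
    """Compute base + index*scale + disp the way an x86 memory operand does.

    Hypothetical helper for illustration only; register values are passed
    in as plain integers.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    # Address arithmetic wraps around at the address size.
    return (base + index * scale + disp) & ((1 << bits) - 1)


# mov eax, [ebx + esi*4 + 8] with ebx = 0x1000 and esi = 3
print(hex(effective_address(base=0x1000, index=3, scale=4, disp=8)))  # 0x1014
```

This is also why compilers like this form for array accesses: with 4-byte elements, `[ebx + esi*4]` indexes element `esi` of an array starting at `ebx` in a single operand.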

A lot of the time it will be surprising what the compiler emits. Make it a puzzle of figuring out why the heck the compiler thought this would be a good idea. It will teach you a lot.

It will probably also help to arm yourself with Agner Fog's manuals, especially the instruction listing one. It will tell you roughly how expensive each instruction is, though this is harder to directly quantify on modern processors. But it will help explain why, for example, the compiler goes so far out of its way to avoid issuing an idiv instruction.
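As an illustration of the kind of thing compilers do to avoid idiv and div, here is a small Python sketch of division by a constant using a multiply and a shift. The constant is the usual one for unsigned 32-bit division by 10; the exact instruction sequence a given compiler emits will vary, so treat this as a sketch of the idea rather than what your compiler does.

```python
MAGIC = 0xCCCCCCCD  # ceil(2**35 / 10)
SHIFT = 35


def div10(x):
    """Unsigned 32-bit x // 10 computed with one multiply and one shift."""
    if not 0 <= x < 2**32:
        raise ValueError("x must fit in an unsigned 32-bit register")
    # In machine code this is a widening 32x32 -> 64-bit multiply followed
    # by a right shift of the high half of the product.
    return (x * MAGIC) >> SHIFT


for value in (0, 9, 10, 12345, 2**32 - 1):
    assert div10(value) == value // 10
```

When you see a strange large constant followed by a multiply and a shift in a disassembly, it is often a division by a constant in disguise.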

My only other piece of advice is to always use Intel syntax instead of AT&T when you have a choice. I used to be pretty neutral on this point, until the day I realized that some instructions are totally different between the two (for example, movslq in AT&T syntax is movsxd in Intel syntax). Since the manuals are all written using Intel syntax, just stick with that.

Good luck!

I have done this many times and continue to do this. In this case where your primary goal is reading and not writing assembler I feel this applies.

Write your own disassembler. Not for the purpose of making the next greatest disassembler; this one is strictly for you. The goal is to learn the instruction set. I do this whether I am learning assembler on a new platform or remembering assembler for a platform I once knew. Start with only a few lines of code, adding registers for example, and ping-pong between disassembling the binary output and adding more and more complicated instructions on the input side. In doing so you:

1) learn the instruction set for the specific processor

2) learn the nuances of how to write code in assembler for said processor, such that you can wiggle every opcode bit in every instruction

3) learn the instruction set better than most engineers who use that instruction set to make their living
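To give a feel for how small the starting point can be, here is a minimal Python sketch of such a disassembler. It assumes 32-bit x86 code with no prefixes and knows only a handful of opcodes (nop, ret, int 3, add eax, imm32 and mov r32, imm32); everything else is printed as a raw byte. The function names are my own, not from any tool.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_one(code, offset):
    """Decode one instruction at offset; return (length, text)."""
    op = code[offset]
    if op == 0x90:
        return 1, "nop"
    if op == 0xC3:
        return 1, "ret"
    if op == 0xCC:
        return 1, "int 3"
    if op == 0x05:  # add eax, imm32 (immediate is little-endian)
        (imm,) = struct.unpack_from("<I", code, offset + 1)
        return 5, f"add eax, {imm:#x}"
    if 0xB8 <= op <= 0xBF:  # mov r32, imm32; register is in the low 3 bits
        (imm,) = struct.unpack_from("<I", code, offset + 1)
        return 5, f"mov {REG32[op - 0xB8]}, {imm:#x}"
    return 1, f"db {op:#04x}"  # unknown opcode: emit it as a data byte


def disassemble(code):
    """Linear sweep: decode instructions one after another from offset 0."""
    offset, lines = 0, []
    while offset < len(code):
        length, text = decode_one(code, offset)
        lines.append((offset, code[offset:offset + length].hex(), text))
        offset += length
    return lines


for offset, raw, text in disassemble(bytes.fromhex("0500010000c3")):
    print(f"{offset:04x}  {raw:<12} {text}")
```

From here you add one instruction at a time on the input side and extend `decode_one` until the output matches what your assembler or compiler produced.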

In your case there are a couple of problems. I normally recommend the ARM instruction set to start with; there are more ARM-based products shipped today than any other (x86 computers included). But since you are probably not using ARM right now, and do not know enough of its assembler to write startup code or other routines, knowing ARM may or may not help with what you are trying to do. The second and more important reason for ARM first is that the instructions are fixed size and aligned. Disassembling variable-length instructions like x86 can be a nightmare as your first project, and the goal here is to learn the instruction set, not to create a research project. Third, ARM is a well-done instruction set: registers are created equal and don't have individual special nuances.

So you will have to figure out what processor you want to start with. I suggest the msp430 or ARM first, then ARM if you did not start with it, then the chaos of x86. No matter what platform, any platform worth using has data sheets or programmer's reference manuals free from the vendor that include the instruction set as well as the encoding of the opcodes (the bits and bytes of the machine language). For the purpose of learning what the compiler does, and how to write code that the compiler doesn't have to struggle with, it is good to know a few instruction sets and see how the same high-level code is implemented on each instruction set with each compiler at each optimization setting. You don't want to get into optimizing your code only to find that you have made it better for one compiler/platform but much worse for every other.

A note on disassembling variable-length instruction sets. With ARM you can simply start at the beginning and disassemble every four-byte word linearly through memory, and with the msp430 every two bytes (the msp430 has variable-length instructions, but you can still get by going linearly through memory if you start at the entry points from the interrupt vector table). For a variable-length instruction set you instead want to find an entry point, based on a vector table or knowledge of how the processor boots, and follow the code in execution order. You have to decode each instruction completely to know how many bytes it uses; then, if the instruction is not an unconditional branch, assume the next byte after that instruction starts another instruction. You have to store all possible branch addresses as well and assume those are the starting byte addresses of more instructions.

The one time I was successful I made several passes through the binary. Starting at the entry point, I marked that byte as the start of an instruction, then decoded linearly through memory until hitting an unconditional branch. All branch targets were tagged as starting addresses of an instruction. I made multiple passes through the binary until I found no new branch targets.

If at any time you find, say, a 3-byte instruction but for some reason you have tagged its second byte as the beginning of an instruction, you have a problem. If the code was generated by a high-level compiler this shouldn't happen unless the compiler is doing something evil; if the code contains hand-written assembler (like, say, an old arcade game) it is quite possible that there will be conditional branches that can never happen, like r0=0 followed by a jump if not zero. You may have to hand-edit those out of the binary to continue. For your immediate goals, which I assume will be on x86, I don't think you will have a problem.
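The procedure described above can be sketched in Python roughly as follows. This is not the code I used; it is a minimal illustration that understands only a tiny subset of x86 opcodes (nop, ret, jmp rel8, the jcc rel8 family and mov r32, imm32), and the function names are invented for the example. It uses a worklist instead of repeated passes, which finds the same set of instruction starts.

```python
def decode(code, addr):
    """Return (length, branch_targets, falls_through) for a tiny opcode subset."""
    op = code[addr]
    if op == 0x90:                       # nop
        return 1, [], True
    if op == 0xC3:                       # ret: execution does not continue
        return 1, [], False
    if 0xB8 <= op <= 0xBF:               # mov r32, imm32
        return 5, [], True
    if op == 0xEB or 0x70 <= op <= 0x7F:  # jmp rel8 / jcc rel8
        rel = int.from_bytes(code[addr + 1:addr + 2], "little", signed=True)
        target = addr + 2 + rel          # relative to the next instruction
        return 2, [target], op != 0xEB   # only conditional jumps fall through
    raise ValueError(f"unsupported opcode {op:#04x} at {addr:#x}")


def find_instruction_starts(code, entry=0):
    """Follow the code in execution order and collect instruction start offsets."""
    lengths = {}                         # start offset -> instruction length
    worklist = [entry]
    while worklist:
        addr = worklist.pop()
        while 0 <= addr < len(code) and addr not in lengths:
            length, targets, falls_through = decode(code, addr)
            lengths[addr] = length
            worklist.extend(targets)     # branch targets start instructions too
            if not falls_through:
                break
            addr += length
    # A start tagged in the middle of another instruction is the problem case.
    for addr, length in lengths.items():
        for inside in range(addr + 1, addr + length):
            if inside in lengths:
                raise ValueError(f"{inside:#x} is inside the instruction at {addr:#x}")
    return sorted(lengths)


# mov eax,1 / jz +2 / jmp +1 / nop / ret, followed by two data bytes
program = bytes.fromhex("b8010000007402eb0190c3ffff")
print(find_instruction_starts(program))  # [0, 5, 7, 9, 10]
```

Note that the two trailing 0xFF bytes are never decoded, because no execution path reaches them; a linear sweep would have tried to interpret them as code.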

I recommend the gcc tools; mingw32 is an easy way to use the gcc tools on Windows if x86 is your target. If not, mingw32 plus msys is an excellent platform for generating a cross compiler from binutils and gcc sources (generally pretty easy). mingw32 has some advantages over cygwin, like significantly faster programs, and you avoid the cygwin DLL hell. gcc and binutils will allow you to write in C or assembler and disassemble your code, and there are more web pages than you can read showing you how to do any one or all of the three. If you are going to be doing this with a variable-length instruction set, I highly recommend you use a tool set that includes a disassembler. A third-party disassembler for x86, for example, is going to be a challenge to use, as you never really know whether it has disassembled correctly. Some of this is operating-system dependent too; the goal is to compile the modules to a binary format that contains information separating instructions from data so the disassembler can do a more accurate job. Your other choice for this primary goal is to have a tool that can compile directly to assembler for your inspection, then hope that when it compiles to a binary format it creates the same instructions.

The short (okay, slightly shorter) answer to your question: write a disassembler to learn an instruction set. I would start with something RISCy and easy to learn like ARM. Once you know one instruction set, others become much easier to pick up, often in a few hours; by the third instruction set you can start writing code almost immediately using the datasheet/reference manual for the syntax. All processors worth using have a datasheet or reference manual that describes the instructions down to the bits and bytes of the opcodes. Learn a RISC processor like ARM and a CISC like x86 well enough to get a feel for the differences: things like having to go through registers for everything versus being able to perform operations directly on memory with fewer or no registers, three-operand instructions versus two, etc. As you tune your high-level code, compile for more than one processor and compare the output. The most important thing you will learn is that no matter how well the high-level code is written, the quality of the compiler and the optimization choices made make a huge difference in the actual instructions. I recommend llvm and gcc (with binutils); neither produces great code, but they are multi-platform and multi-target and both have optimizers. Both are free, and you can easily build cross compilers from sources for various target processors.

I'll go against the grain of most answers and recommend Knuth's MMIX variant of the MIPS RISC architecture. It won't be as practically useful as x86 or ARM assembly languages (not that they're all that crucial themselves in most real-life jobs these days...;-), but it WILL unlock for you the magic of Knuth's latest version of the greatest-ever masterpiece on deep low-level understanding of algorithms and data structures -- TAOCP, "The Art of Computer Programming". The links from the two URLs I've quoted are a great way to start exploring this possibility!

(I don't know about you but I was excited with assembly)

A simple tool for experimenting with assembly is already installed on your PC.

Go to Start menu->Run, and type debug

debug (command)

debug is a command in DOS, MS-DOS, OS/2 and Microsoft Windows (only x86 versions, not x64) which runs the program debug.exe (or DEBUG.COM in older versions of DOS). Debug can act as an assembler, disassembler, or hex dump program allowing users to interactively examine memory contents (in assembly language, hexadecimal or ASCII), make changes, and selectively execute COM, EXE and other file types. It also has several subcommands which are used to access specific disk sectors, I/O ports and memory addresses. MS-DOS Debug runs at a 16-bit process level and therefore it is limited to 16-bit computer programs. FreeDOS Debug has a "DEBUGX" version supporting 32-bit DPMI programs as well.

If you want to understand the code you see in IDA Pro (or OllyDbg), you'll need to learn how compiled code is structured. I recommend the book Reversing: Secrets of Reverse Engineering

I experimented a couple of weeks with debug when I started learning assembly (15 years ago).
Note that debug works at the base machine level, there are no high level assembly commands.

And now a simple example:

Enter a to start writing assembly code, type the program below, and finally enter g to run it.

[Image: screenshot of the debug session]

(INT 21 displays on screen the ASCII character stored in the DL register if the AH register is set to 2; INT 20 terminates the program.)

Are you doing other dev work on Windows? In which IDE? If it's VS, then there's no need for an additional IDE just to read disassembled code: debug your app (or attach to an external app), then open the disassembly window (in the default settings, that's Alt+8). Step and watch memory/registers as you would through normal code. You might also want to keep a registers window open (Alt+5 by default).

Intel gives free manuals, that give both a survey of basic architecture (registers, processor units etc.) and a full instruction reference. As the architecture matures and is getting more complex, the 'basic architecture' manuals grow less and less readable. If you can get your hands on an older version, you'd probably have a better place to start (even P3 manuals - they explain better the same basic execution environment).

If you care to invest in a book, here is a nice introductory text. Search amazon for 'x86' and you'd get many others. You can get several other directions from another question here.

Finally, you can benefit quite a bit from reading some low-level blogs. These byte-size info bits work best for me, personally.

This will not necessarily help you write efficient code!

i86 op codes are more or less a "legacy" format that persists because of the sheer volume of code and executable binaries for Windows and Linux out there.

It's a bit like the old scholars writing in Latin: an Italian speaker like Galileo would write in Latin, and his paper could be understood by a Polish speaker like Copernicus. This was still the most effective way to communicate even though neither was particularly good at Latin, and Latin is a rubbish language for expressing mathematical ideas.

So compilers generate x86 code by default, and modern chips read the ancient opcodes and translate what they see into parallel RISC instructions, with reordered execution, speculative execution, pipelining, etc. Plus they make full use of the 32 or 64 registers the processor actually has (as opposed to the pathetic 8 you see in x86 instructions).

Now, all optimising compilers know this is what really happens, so they code up sequences of opcodes which they know the chip can optimise efficiently, even though some of these sequences would look inefficient to a circa-1990 .asm programmer.

At some point you need to accept that the tens of thousands of man-years of effort compiler writers have put in have paid off, and trust them.

The simplest and easiest way to get a more efficient runtime is to buy the Intel C/C++ compiler. They have a niche market for efficient compilers, and they have the advantage of being able to ask the chip designers about what goes on inside.

To do what you're wanting to do, I just took the Intel Instruction Set Reference (might not be the exact one I used, but it looks sufficient) and some simple programs I wrote in Visual Studio and started throwing them into IDAPro/Windbg. When I out-grew my own programs, the software at crackmes was helpful.

I'm assuming that you have some basic understanding of how programs execute on Windows. But really, for reading assembly, there's only a few instructions to learn and a few flavors of those instructions (e.g., there's a jump instruction, jump has a few flavors like jump-if-equal, jump-if-ecx-is-zero, etc). Once you learn the basic instructions it's pretty simple to get the gist of the program execution. IDA's graph view helps, and if you're tracing the program with Windbg, it's pretty simple to figure out what the instructions are doing if you're not sure.
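To show what the "flavors" of jump boil down to, here is a small Python sketch that models the flags set by a 32-bit cmp and the conditions a few common conditional jumps test. The helper names are invented for the example and only a handful of jumps are covered; the flag conditions themselves are the ones listed in the instruction set reference.

```python
MASK = 0xFFFFFFFF


def sign(value):
    """Top bit of a 32-bit value (1 means negative when read as signed)."""
    return (value >> 31) & 1


def cmp_flags(a, b):
    """Flags produced by `cmp a, b`, which computes a - b and discards it."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "ZF": int(result == 0),            # operands were equal
        "CF": int(a < b),                  # unsigned borrow
        "SF": sign(result),                # result looks negative
        "OF": int(sign(a) != sign(b) and sign(result) != sign(a)),  # signed overflow
    }


CONDITIONS = {
    "je": lambda f: f["ZF"] == 1,
    "jne": lambda f: f["ZF"] == 0,
    "jb": lambda f: f["CF"] == 1,                     # unsigned less than
    "ja": lambda f: f["CF"] == 0 and f["ZF"] == 0,    # unsigned greater than
    "jl": lambda f: f["SF"] != f["OF"],               # signed less than
    "jg": lambda f: f["ZF"] == 0 and f["SF"] == f["OF"],  # signed greater than
}


def taken(mnemonic, a, b):
    """Would `cmp a, b` followed by this jump take the branch?"""
    return CONDITIONS[mnemonic](cmp_flags(a, b))


# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(taken("jl", 0xFFFFFFFF, 1))  # True
print(taken("jb", 0xFFFFFFFF, 1))  # False
```

Seeing jb/ja versus jl/jg after a comparison is a quick way to tell whether the original source compared unsigned or signed values.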

After a bit of playing like that, I bought Hacker Disassembly Uncovered. Generally, I stay away from books with the word "Hacker" in the title, but I really liked how this one went really in-depth about how compiled code looked disassembled. He also goes into compiler optimizations and some efficiency stuff that was interesting.

It all really depends on how deeply you want to be able to understand the program, too. If you're reverse engineering a target looking for vulnerabilities, if you're writing exploit code, or analyzing packed malware for capabilities, you'll need more of a ramp-up time to really get things going (especially for the more advanced malware). On the other hand, if you just want to be able to change your character's level on your favorite video game, you should be doing fine in a relatively short amount of time.

Lots of good answers here. Low-level programming, assembly etc are popular in the security community, so it is worthwhile looking for hints and tips there once you get going. They even have some good tutorials like this one on x86 assembly.

To actually reach your goal, you might consider starting with the IDE you are in. There is generally a disassembler window, so you can single-step through code. There is usually a view of some sort that lets you see the registers and look into memory areas.

Examining unoptimized C/C++ code will help you build a link to the kind of code that the compiler generates for your sources. Some compilers have some sort of ASM reserved word which lets you insert machine instructions in your code.

My advice would be to play around with those sorts of tools for a while and get your feet wet, then step up (down?) to straight assembler code on whatever platform you are running on.

There are a lot of great tools out there, but you might find it more fun to avoid the steep learning curve at first.

We learned assembly with a microcontroller development kit (Motorola HC12) and a thick datasheet.

I recently took a computer systems class. One of the topics was Assembly as a tool to communicate with the hardware.

For me, the knowledge of assembly wouldn't have been complete without understanding the details of how computer systems work. Understanding that brings a new understanding of why a sequence of assembly instructions can be great on one processor architecture but terrible on another.

Given this, I'm inclined to recommend my class text book:

Computer Systems: A Programmer's Perspective.

It does cover x86 assembly, but the book is much broader than that. It covers processor pipelining and memory as a cache, the virtual memory system and much more. All of this can affect how assembly could be optimized for the given features.

The suggestion to use debug is a fun one; many neat tricks can be done with that. However, for a modern operating system, learning 16-bit assembly may be slightly less useful. Consider, instead, using ntsd.exe. It's built into Windows XP (it was yanked in Server 2003 and above, unfortunately), which makes it a convenient tool to learn since it's so widely available.

That said, the original version in XP suffers from a number of bugs. If you really want to use it (or cdb, or windbg, which are essentially different interfaces with the same command syntax and debugging back-end), you should install the free Debugging Tools for Windows package.

The debugger.chm file included in that package is especially useful when trying to figure out the unusual syntax.

The great thing about ntsd is you can pop it up on any XP machine you're near and use it to assemble or disassemble. It makes a /great/ x86 assembly learning tool. For example (using cdb since it runs inline in the DOS prompt; it's otherwise identical):

(symbol errors skipped since they're irrelevant -- also, I hope this formatting works, this is my first post)

C:\Documents and Settings\User>cdb calc


Microsoft (R) Windows Debugger Version 6.10.0003.233 X86
Copyright (c) Microsoft Corporation. All rights reserved.


CommandLine: calc
Symbol search path is: *** Invalid ***
Executable search path is:
ModLoad: 01000000 0101f000   calc.exe
ModLoad: 7c900000 7c9b2000   ntdll.dll
ModLoad: 7c800000 7c8f6000   C:\WINDOWS\system32\kernel32.dll
ModLoad: 7c9c0000 7d1d7000   C:\WINDOWS\system32\SHELL32.dll
ModLoad: 77dd0000 77e6b000   C:\WINDOWS\system32\ADVAPI32.dll
ModLoad: 77e70000 77f02000   C:\WINDOWS\system32\RPCRT4.dll
ModLoad: 77fe0000 77ff1000   C:\WINDOWS\system32\Secur32.dll
ModLoad: 77f10000 77f59000   C:\WINDOWS\system32\GDI32.dll
ModLoad: 7e410000 7e4a1000   C:\WINDOWS\system32\USER32.dll
ModLoad: 77c10000 77c68000   C:\WINDOWS\system32\msvcrt.dll
ModLoad: 77f60000 77fd6000   C:\WINDOWS\system32\SHLWAPI.dll
(f2c.208): Break instruction exception - code 80000003 (first chance)
eax=001a1eb4 ebx=7ffd6000 ecx=00000007 edx=00000080 esi=001a1f48 edi=001a1eb4
eip=7c90120e esp=0007fb20 ebp=0007fc94 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
ntdll!DbgBreakPoint:
7c90120e cc              int     3
0:000> r eax
eax=001a1eb4
0:000> r eax=0
0:000> a eip
7c90120e add eax,0x100
7c901213
0:000> u eip
ntdll!DbgBreakPoint:
7c90120e 0500010000      add     eax,100h
7c901213 c3              ret
7c901214 8bff            mov     edi,edi
7c901216 8b442404        mov     eax,dword ptr [esp+4]
7c90121a cc              int     3
7c90121b c20400          ret     4
ntdll!NtCurrentTeb:
7c90121e 64a118000000    mov     eax,dword ptr fs:[00000018h]
7c901224 c3              ret
0:000> t
eax=00000100 ebx=7ffd6000 ecx=00000007 edx=00000080 esi=001a1f48 edi=001a1eb4
eip=7c901213 esp=0007fb20 ebp=0007fc94 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000206
ntdll!DbgUserBreakPoint+0x1:
7c901213 c3              ret
0:000>
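One detail worth noticing in the trace above: after the add, efl changes from 00000202 to 00000206 and the flag display changes from po to pe. The following Python sketch decodes those values. The helper names are made up for illustration; the bit positions are the documented EFLAGS layout, and bit 1 is a reserved bit that always reads as 1.

```python
# Bit positions of the commonly displayed EFLAGS bits.
FLAG_BITS = {0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
             8: "TF", 9: "IF", 10: "DF", 11: "OF"}


def decode_eflags(value):
    """Names of the flags that are set in an EFLAGS value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if (value >> bit) & 1]


def parity_flag(result):
    """PF is set when the low byte of a result has an even number of 1 bits."""
    return int(bin(result & 0xFF).count("1") % 2 == 0)


print(decode_eflags(0x202))  # ['IF']
print(decode_eflags(0x206))  # ['PF', 'IF']
print(parity_flag(0x100))    # 1
```

The add produced 0x100, whose low byte is zero, so it contains an even number (zero) of set bits and the parity flag goes on; that is the only flag the instruction changed here.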

Also -- while you're playing with IDA, make sure to check out the IDA Pro Book by Chris Eagle (unlinked since StackOverflow doesn't want to let me post more than two links for my first post). It's hands-down the best reference out there.

One of the standard pedagogic assembly languages out there is MIPS. You can get MIPS simulators (spim) and various teaching materials for it.

Personally, I'm not a fan. I rather like IA32.

Off topic I know, but since you are a Windows programmer I can't help but think that it may be a more appropriate and/or better use of your time to learn MSIL. No, it's not assembly, but it's probably more relevant in this .NET era.

Knowing assembly can be useful for debugging, but I wouldn't get too excited about using it for optimizing your code. Modern compilers are usually much better at optimizing than a human these days.

My personal favorite is NASM, mostly because it's multi-platform, and it compiles MMX, SSE, 64-bit...

I started compiling some simple C source file with gcc, and "trans-coding" the assembler instruction from gcc-format into NASM-format. Then you can change small portions of code, and verify the performance improvement it brings.

The NASM documentation is really complete, I never needed to search for information from books, or other sources.

Some links you might find useful for learning how assembly maps to source code:

Assembly And The Art Of Debugging

Debugging – Modifying Code At Runtime

Hope you find these useful.

You can check out xorpd x86 Assembly video course. (I wrote it). The course itself is paid, but the exercises are open sourced, on github. If you have some programming experience I think you should be able to work just with the exercises and understand everything.

Note that the code is for the Windows platform, and is written using the Fasm assembler. The course and the exercises do not contain any high level constructs, however you could use Fasm to create very complicated macros, if you ever wish to do so.