SPTH
October 2008
Code Mutation (whether it is obfuscation of a decryption engine or full metamorphism) is a technique to prevent the detection of the computer virus using analytic scan algorithm. Formal grammar gives us the basis for a theoretical analysis of code mutation; therefore we must not neglect or ignore it.
In 1999, Qozah has written an article called "POLYMORPHISM AND GRAMMARS" [1], analysing polymorphism (exaclty: a garbage generator) for the first time with this theoretical tool. Even his results for non-regular grammars are not correct, his article opened the door for a new way analysing virus technologies. In 2007, Eric Filiol wrote an article called "Metamorphism, Formal grammars and Undecidable Code Mutation" [2], dealing with the word problem and a description of a extended MetaPHOR mutation engine using a "Tzeitzin semi-Thue system" (a Type-0 Grammar).
In this article you will first read shourtly about the formal grammar and word problem including the reason why this is important for virus authors. (Chapter 2)
Then you will learn about the hierarchy of classes of formal grammars, the Chomsky Hierarchy and about its connection to code mutation including an analyse of the formal grammar of The Mental Driller's MetaPHOR mutation engine. Especially for type-1 grammars, you will see a surprising result. (Chapter 3-4)
Later you will see the full formal grammar of MetaPHOR;a formalisation of all possible mutations rules. (Chapter 5)
Before reading this article, you should read and understand the idea of formal language and grammar - for example read Qozah's article.
Definition 1:
A formal Grammar G is a 4-tuple G = (N, T, S, R) where:
Definition 2:
A word is a finite sequence of characters of an alphabeth T.
Definition 3.1:
A universal language UL over an alphabeth T is a subset T* of all words over that alphabeth.
Definition 3.2:
A formal language L is a subset of UL, contains all words generated by a formal grammar G := (N, T, S, R)
If the word is an element of L: It has a specific genotype (behaviour), but no specific phenotype (kernel code); it is a specific micro-function.
Standard Example:
G = (N, T, S, R)
N = {A, B}
T = {a, b}
R = {
S ::= bS | aA
A ::= bS | aB | a
B ::= bB | aB | a | b
}
Derivation:
S => bS => bbS => bbaA => bbaaB => bbaabB ==> bbaaba
Polymorphical Example:
G = (N, T, S, R)
N = {A, B}
T = {a, b, c, x, y}
a = "nop"
b = "sub edx, 0"
c = "push ebx"+"pop ebx"
x = "XOR [EDI], AL"
y = "INC AL"
e = ""
R = {
S::=aS | bS | cS | xA
A::=aA | bA | cA | yB
B::=aB | bB | cB | e
}
Derivation:
S => aS => acS => acxA => acxcA => acxcyB => acxcybB => acxcybe
Metamorphical Example:
G = (N, T, S, R)
N = {A, B, C}
T = {a, b, c, d, e, f}
a = "mov reg, 0"
b = "pop reg"
c = "push 0"
d = "mov reg2, 0"+"push reg2"
e = "xor reg, reg"
f = "sub reg, reg"
R = {
S ::= A | Bb | C | D
A ::= a
B ::= c | d
C ::= e
D ::= f
}
Derivation:
S => Bb => db
The polymorphical example have been already used in [1]. The metamorphic example is very trivial, but even very complex viruses as MetaPHOR use grammars like this (See Chapter 5).
The word problem can be formulate in mathematics with group theory, but we will use the formulation of computability theory.
The problem (a decidion problem) is to construct an algorithm to decide for a given word if it belongs to a formal language or not.
In connection with computer virus detection that means: An anti virus program scans a file, and has to decide whether a file is infected or not. Or: It has to decide wether a specific code of the file is part of a viruscode.
OR: It has to decide wether a specific word from universal language UL belongs to a formal language L generated by a formal grammar G := (N, T, S, R), which represents a polymorphic or metamorphic engine.
Example:
G = ({A, B}, {a, b, c}, S, T)
T = {
S ::= aA | bA | c
A ::= aA | B
B ::= bB | c
}
Is "aaabbbc" part of L? Yes, because:
S => aA => aaA => aaaA => aaaB => aaabB => aaabbB => aaabbbB => aaabbbc
Is "aabbca" part of L? No, because:
L={ (a|b)(a)*(b)*c, c }, and the given word has no "c" at the end.
The Chomsky Hierarchy is a containment hierarchy of classes of formal grammars.[3] All possible formal grammars are divided in Type-0 ... Type-3 grammars, where Type-3 is subset of Type-2 (is subset of Type-1 (is subset of Type-0)).
Restriction: no restrictions
Example:
S ::= aBBA AB ::= BA | Sc | C cBc ::= aaa | ab | c A ::= CBC B ::= cBca | cb | aB C ::= a | c
Derivation:
S => aBBA => acBcacbCBC => aabacbcBc => aabacbccbc
Restriction: For all (w1 => w2): |w1|<=|w2|
Example:
S ::= aBBA AB ::= BA | Sc | C cBc ::= aaa A ::= CBC B ::= cBca | cb | aB C ::= a | c
Derivation:
S => aBBA => acBcacbCBC => aaaaacbccbc
Restriction: For all (w1 => w2): w1 is an element of non-terminal symbols
Example:
S ::= aBBA A ::= CBC B ::= cBca | cb | aB C ::= a | c
Derivation:
S => aBBA => acbcbCBC => acbcbacba
Restriction: R = { A = aB, A = Ba, A = a }
Example:
S ::= aS | aA A ::= Bc B ::= Bc | c
Derivation:
S => aS => aaA => aaBc => aaBcc => aaccc
For an anti virus programm it is important that scanning a file does not monopolize CPU-speed for long time. We can describe the time usage of an algorithm with Landau notation O(n), where n is the amount of data or number of mutation.
A simple sort algorithm has a O(n) performance: For a list of 20 entries it needs t=c*20; for a list of 2.000 entries it needs t=c*2.000.
A usual algorithm for multiplication of a N x N matrix with a N Vector needs N(N-1)=N^2-N operations. The performance is O(n^2)
A usual algorithm for multiplication of a N x N matrix with a N x N matrix needs N(N(N-1) operations. The performance is O(n^3).
About the complexity of an algorithm, which solves the word problem for Chomsky Hierarchy:
Caution:
Here we use an asymtotically description of the given problem:
lim n -> infinity
It's obvious that in practise this is not possible. Nevertheless a virus can reach very high n, i. e. with deep (infinity) recursive mutation. Virus Scanners have to detect all variants of a virus, so that asymtotically description can be used.
This is the lowest type of possible code mutation, but unfortunatly used in most polymorphic or metamorphic viruses. In Chapter 5 you can see that most parts of MetaPHOR's grammar is Type-3.
For example:
N02 ::= t003 | t004 N32 ::= t078 | M21N32 => N32 ::= t078 | t079N32 M43 ::= t125 | t126
A big disadvanage is, that the shrinker of MetaPHOR is the inverse of expander (obfuscator). "The expander is the part that undoes what the shrinker did. It performs exactly the reverse operation." [4]. This result presents the determinism in the engine. Code mutation engines have to be created in a way, that linear time complexity will be prevented.
When this type of grammar is used in a metamorphic virus, the operations for deciding the word problem increases at least cubically. As (rarly) done in MetaPHOR, creating a Type-2 grammar requires good planning. The rules have to be recursively defined.
As example, the rewriting rule N24 has a recursion level > 5, maybe some rules have even higher recursive level. Because of that, Mental Driller has put a recursivity control for preventing the code to grow too much.
Such infitive recursive rules are our goal, for increasing the complexity of the word problem.
Using this formal grammar in a virus, you can mutate macro functions, not just the micro-functions aka. x86 instructions.
For example:
ABCD ::= EFGHI A ::= "PUSH [hWnd]" | ... B ::= "PUSH lpString" | ... C ::= "PUSH 0xFF" | ... D ::= "CALL GetWindowText | ... E ::= "PUSH [hWnd]" | ... F ::= "PUSH WM_GETTEXT" | ... G ::= "PUSH 0xFF" | ... H ::= "PUSH lpString" | ... I ::= "CALL SendMessage" | ...
Beside of the extremly increased complexity of the virus, the word problem will become not decideable in available time.
Combining this macro-function mutation with code integration to the host file, the virus can use parts of the host code and parts or its own code for mutation.
When you use shrinking-rules, a Type-0 Grammar may be created. Using the inverse rules of the obfuscator, these rules have no effect. For a valueable usage of deletion rules, the existence of big words is a condition, otherwise the virus code may loose information.
G = (N, T, S, R)
N = { N01, ..., N47, M01, ..., M49 }
T = { code of MetaPHOR }
R = {
S ::= [ T, such that the hardcoded X86 instructions t000-t135
are replaced by the behaviour functions N01..M49 ]
N01 ::= t001 | t002 | M24N36
N02 ::= t003 | t004
N03 ::= t000 | t005 | N10 | t007 | t008 | t009 | t010 | t011
t028 | M03M06 | M06M03 | t090 | t091 | M34M35
N04 ::= t012 | t013
N05 ::= t014 | t015
N06 ::= t016 | N09 | t018 | t019 | t020
N07 ::= t021 | t022 | t023
N08 ::= t024 | t025 | t026 | t027
N09 ::= t029 | t030 | M01M02
N10 ::= t031 | t032
N11 ::= t033 | t034
N12 ::= t035 | t036
N13 ::= t037 | t038
N14 ::= t041 | M01M03
N15 ::= t043 | t033 | M05M04 | N16M09
N16 ::= t046 | M05M03
N17 ::= t047 | M06M02
N18 ::= t049 | M07M03
N19 ::= t053 | N14N20 | N16M12N17 | N14M31N17
N20 ::= t054 | N18M15
N21 ::= t055 | N09N12 | N15N10
N22 ::= t056 | N15M10
N23 ::= t058 | N10N12 | N12N10
N24 ::= t059 | N19M11 | M12M13
N25 ::= t063 | N21M10
N26 ::= t065 | M14N12
N27 ::= t066 | N18M49
N28 ::= t069 | N18M16
N29 ::= t071 | N16M17
N30 ::= t073 | N16M18 | M05N38
N31 ::= t077 | N16N18
N32 ::= t078 | M21N32
N33 ::= t080 | M22M24 | M34M36
N34 ::= t083 | N01M24 | N10
N35 ::= t085 | N02M25
N36 ::= t087 | M24N01 | N10
N37 ::= t088 | M25M26
N38 ::= t092 | M03M18
N39 ::= t093 | N06M27 | M28M29
N40 ::= t097 | N16M30N17
N41 ::= t105 | t106 | N14M37 | N16M40
N42 ::= t109 | N16M38
N43 ::= t112 | N16M39
N44 ::= t121 | N18M42
N45 ::= t124 | N18M46
N46 ::= t134 | M46M47M48
N47 ::= t135 | M48M47M46
M01 ::= t039 | N14M06
M02 ::= t040 | M03N17
M03 ::= t042 | M08N18
M04 ::= t044
M05 ::= t045 | N16M06
M06 ::= t048
M07 ::= t050
M08 ::= t051
M09 ::= t052
M10 ::= t057
M11 ::= t060
M12 ::= t061 | N18M33N18
M13 ::= t062
M14 ::= t067
M15 ::= t068
M16 ::= t070
M17 ::= t072 | N18M19 | N17N26
M18 ::= t074 | N18M20
M19 ::= t075
M20 ::= t076
M21 ::= t079
M22 ::= t081
M24 ::= t084 | N01N34
M25 ::= t086 | N02N35
M26 ::= t089 | M25N37
M27 ::= t094
M28 ::= t095
M29 ::= t096
M30 ::= t098
M31 ::= t099 | N18M32N18
M32 ::= t100
M33 ::= t101
M34 ::= t102
M35 ::= t103
M36 ::= t104
M37 ::= t107 | t108 | N18M41
M38 ::= t110 | t111
M39 ::= t113 | t114
M40 ::= t115 | t116 | t117 | t118 | N18M44 | N18M45
M41 ::= t119 | t120
M42 ::= t122 | t123
M43 ::= t125 | t126
M44 ::= t127 | t128
M45 ::= t129 | t130
M46 ::= t131
M47 ::= t132
M48 ::= t133
M49 ::= t064
}
t000="NOP"
t001="XOR Reg,-1"
t002="NOT Reg"
t003="XOR Mem,-1"
t004="NOT Mem"
t005="MOV Reg,Reg"
t007="ADD Mem,0"
t008="OR Reg,0"
t009="OR Mem,0"
t010="AND Reg,-1"
t011="AND Mem,-1"
t012="SUB Reg,Imm"
t013="ADD Reg,-Imm"
t014="SUB Mem,Imm
t015="ADD Mem,-Imm
t016="XOR Reg,0"
t018="AND Reg,0"
t019="XOR Reg,Reg"
t020="SUB Reg,Reg"
t021="XOR Mem,0"
t022="MOV Mem,0"
t023="AND Mem,0"
t024="CMP Reg,0"
t025="OR Reg,Reg"
t026="AND Reg,Reg"
t027="TEST Reg,Reg"
t028="MOV Mem,Mem"
t029="MOV Reg,Imm"
t030="LEA Reg,[Imm]"
t031="ADD Reg,Imm"
t032="LEA Reg,[Reg+Imm]"
t033="LEA Reg2,[Reg]"
t035="LEA Reg,[Reg+Reg2]"
t036="ADD Reg,Reg2"
t037="LEA Reg,[Reg2+Reg2+xxx]"
t038="LEA Reg,[2*Reg2+xxx]"
t039="PUSH Imm"
t040="POP Reg"
t041="MOV Mem,Imm"
t042="POP Mem"
t043="MOV Reg2,Reg"
t044="POP Reg2"
t045="PUSH Reg"
t046="MOV Mem,Reg"
t047="MOV Reg,Mem"
t048="PUSH Mem"
t049="MOV Mem2,Mem"
t050="PUSH Mem2"
t051="POP Mem2"
t052="MOV Reg2,Mem
t053="OP Reg,Imm"
t054="OP Reg,Mem"
t055="LEA Reg,[Reg2+Imm]"
t056="LEA Reg,[Reg2+Reg3]"
t057="ADD Reg,Reg3"
t058="LEA Reg,[Reg+Reg2+Imm]"
t059="OP Reg,(Imm OP Imm2)"
t060="OP Reg,Imm2"
t061="OP Mem,Imm"
t062="OP Mem,Imm2"
t063="LEA Reg,[Reg2+Reg3+Imm]"
t064="LEA Reg,[(RegX+)Reg2+Imm]"
t065="LEA Reg,[(RegX+)2*Reg2+Imm]"
t066="MOV Mem3,Mem"
t067="MOV Mem3,Mem2"
t068="OP Reg,Mem2"
t069="MOV Mem,xxx"
t070="MOV Mem2,xxx"
t071="CALL Reg"
t072="CALL Mem"
t073="JMP Reg"
t074="JMP Mem"
t075="CALL Mem2"
t076="JMP Mem2"
t077="MOV Mem2,Reg"
t078="MOV Reg,yyy"
t079="OP Reg,xxx"
t080="JMP @xxx"
t081="Jcc @xxx"
t083="ADD Reg,1"
t084="NEG Reg"
t085="ADD Mem,1"
t086="NEG Mem"
t087="ADD Reg,-1"
t088="ADD Mem,-1"
t089="NOT Mem"
t090="CMP X,Y " - without Jcc
t091="TEST X,Y" - without Jcc
t092="RET"
t093="MOVZX Reg,byte ptr [Mem]"
t094="MOV Reg8,[Mem]"
t095="MOV Reg,[Mem]"
t096="AND Reg,0FFh"
t097="OP Reg,Reg2"
t098="OP Mem,Reg2"
t099="OP Mem,Reg"
t100="OP Mem2,Reg"
t101="OP Mem2,Imm"
t102="CMP Reg,Reg"
t103="JO/JB/JNZ/JA/JS/JNP/JL/JG @xxx" - without Jcc
t104="JNO/JAE/JZ/JBE/JNS/JP/JGE/JLE @xxx" - without Jcc
t105="TEST Reg,Imm" + Jcc
t106="CMP Reg,Imm" + Jcc
t107="TEST Reg,Mem" + Jcc
t108="CMP Reg,Mem" + Jcc
t109="CMP Reg,Reg2" + Jcc
t110="CMP Mem,Reg2" + Jcc
t111="SUB Mem,Reg2" + Jcc
t112="TEST Reg,Reg2" + Jcc
t113="TEST Mem,Reg2" + Jcc
t114="AND Mem,Reg2" + Jcc
t115="CMP Mem,Imm" + Jcc
t116="SUB Mem,Imm" + Jcc
t117="TEST Mem,Imm" + Jcc
t118="AND Mem,Imm" + Jcc
t119="TEST Reg,Mem2" + Jcc
t120="CMP Reg,Mem2" + Jcc
t121="TEST Mem,Reg" + Jcc
t122="TEST Mem2,Reg" + Jcc
t123="AND Mem2,Reg" + Jcc
t124="CMP Mem,Reg" + Jcc
t125="CMP Mem2,Reg" + Jcc
t126="SUB Mem2,Reg" + Jcc
t127="TEST Mem2,Imm" + Jcc
t128="AND Mem2,Imm" + Jcc
t129="CMP Mem2,Imm" + Jcc
t130="SUB Mem2,Imm" + Jcc
t131="PUSH EAX"
t132="PUSH ECX"
t133="PUSH EDX"
t134="APICALL_BEGIN"
t135="APICALL_END"
Efficient Code Mutation has always been a goal for virus writers. The analysis of these techniques on a mathematical background is an important venture to increase the complexity of mutation techniques.
The word problem and its connection to the antiviral detection problem gives just the right basis for our work.
Creating a metamorphic mutation engine needs very much time, and without using a type-2, type-1 or even type-0 grammar, the detection is still "trivial". Creating a concept for a valueable formal grammar of mutation engine does not need much time, but increases the complexity alot - the complexity may become cubical or even exponential.
My final advice: Invest some hours for planning a efficient grammar, and as result your engine will be harder to detect by analytic scanning.
- - - - - - - - - - - - - -
Second Part To Hell
www.spth.de.vu
written in October 2008
- - - - - - - - - - - - - -