Computer Viruses and Malware

John Aycock
Advances in Information Security, Vol. 22
ISBN 978-0-387-30236-2
2006

To all the two-legged critters
in my house

Preface

It seemed like a good idea at the time. In 2003, I started teaching a course on computer viruses and malicious software to senior undergraduate and graduate students at the University of Calgary. It's been an interesting few years. Computer viruses are a controversial and taboo topic, despite having such a huge impact on our society; needless to say, there was some backlash about this course from outside the University.

One of my initial practical concerns was whether or not I could find enough detailed material to teach a 13-week course at this level. There were some books on the topic, but (with all due respect to the authors of those books) there were none that were suitable for use as a textbook.

I was more surprised to find out that there was a lot of information about viruses and doing "bad" things, but there was very little information about anti-virus software. A few quality minutes with your favorite web search engine will yield virus writing tutorials, virus source code, and virus creation toolkits. In contrast, although it's comprised of some extremely nice people, the anti-virus community tends to be very industry-driven and insular, and isn't in the habit of giving out its secrets. Unless you know where to look.

Several years, a shelf full of books, and a foot-high stack of printouts later, I've ferreted out a lot of detailed material which I've assembled in this book. It's a strange type of research for a computer scientist, and I'm sure that my academic colleagues would cringe at some of the sources that I've had to use. Virus writers don't tend to publish in peer-reviewed academic journals, and anti-virus companies don't want to tip their hand. I would tend to characterize this detective work more like historical research than standard computer science research: your sources are limited, so you try and authenticate them; you piece a sentence in one document together with a sentence in another document, and you're able to make a useful connection. It's painstaking and often frustrating.

Technical information goes out of date very quickly, and in writing this book I've tried to focus on the concepts more than details. My hope is that the concepts will still be useful years from now, long after the minute details of operating systems and programming languages have changed. Having said that, I've included detail where it's absolutely necessary to explain what's going on, and used specific examples of viruses and malicious software where it's useful to establish precedents for certain techniques. Depending on why you're reading this, a book with more concrete details might be a good complement to this material.

Similarly, if you're using this as a textbook, I would suggest supplementing it with details of the latest and greatest malicious software that's making the rounds. Unfortunately there will be plenty of examples to choose from. In my virus course, I also have a large segment devoted to the law and ethics surrounding malicious software, which I haven't incorporated here - law is constantly changing and being reinterpreted, and there are already many excellent sources on ethics. Law and ethics are very important topics for any computer professional, but they are especially critical for creating a secure environment in which to work with malicious software.

I should point out that I've only used information from public sources to write this book. I've deliberately excluded any information that's been told to me in private conversations, and I'm not revealing anyone's trade secrets that they haven't already given away themselves.

I'd like to thank the students I've taught in my virus course, who pushed me with their excellent questions, and showed much patience as I was organizing all this material into some semi-coherent form. Thanks too to those in the anti-virus community who kept an open mind. I'd also like to thank the people who read drafts of this book: Jorg Denzinger, Richard Ford, Sarah Gordon, Shannon Jaeger, Cliff Marcellus, Jim Uhl, James Wolfe, and Mike Zastre. Their suggestions and comments helped improve the book as well as encourage me. Finally, Alan Aycock suggested some references for Chapter 10, Stefania Bertazzon answered my questions about rational economics, Moustafa Hammad provided an Arabic translation, and Maryam Mehri Dehnavi translated some Persian text for me. Of course, any errors that remain are my own.

John Aycock

Chapter 1 We've got problems

In ancient times, people's needs were simple: food, water, shelter, and the occasional chance to propagate the species. Our basic needs haven't changed, but the way we fulfill them has. Food is bought in stores which are fed by supply chains with computerized inventory systems; water is dispensed through computer-controlled water systems; parts for new shelters come from suppliers with computer-ridden supply chains, and old shelters are bought and sold by computer-wielding realtors. The production and transmission of energy to run all of these systems is controlled by computer, and computers manage financial transactions to pay for it all.

It's no secret that our society's infrastructure relies on computers now. Unfortunately, this means that a threat to computers is a threat to society. But how do we protect our critical infrastructure? What are the problems it faces?

1.1 Dramatis Personae

There are four key threats to consider. These are the four horsemen of the electronic apocalypse: spam, bugs, denials of service, and malicious software.

Spam
The term commonly used to describe the abundance of unsolicited bulk email which plagues the mailboxes of Internet users worldwide. The statistics vary over time, but suggest that over 70% of email traffic currently falls into this category.1
Bugs
These are software errors which, when they crop up, can kill off your software immediately, if you're lucky. They can also result in data corruption, security weaknesses, and spurious, hard-to-find problems.
Denials of service
Denial-of-service attacks, or DoS attacks,2 starve legitimate usage of resources or services. For example, a DoS attack could use up all available disk space on a system, so that other users couldn't make use of it; generating reams of network traffic so that real traffic can't get through would also be a denial of service. Simple DoS attacks are relatively easy to mount by simply overwhelming a machine with requests, as a toddler might overwhelm their parents with questions. Sophisticated DoS attacks can involve more finesse, and may trick a machine into shutting a service down instead of flooding it.
Malicious software
The real war is waged with malicious software, or malware. This is software whose intent is malicious, or whose effect is malicious. The spectrum of malware covers a wide variety of specific threats, including viruses, worms, Trojan horses, and spyware.

The focus of this book is malware, and the techniques which can be used to detect, detain, and destroy it. This is not accidental. Of the four threats listed above, malware has the deepest connection to the other three. Malware may be propagated using spam, and may also be used to send spam; malware may take advantage of bugs; malware may be used to mount DoS attacks. Addressing the problem of malware is vital for improving computer security. Computer security is vital to our society's critical infrastructure.

1.2 The Myth of Absolute Security

Obviously we want our computers to be secure against threats. Unfortunately, there is no such thing as absolute security, where a computer is either secure or it's not. You may take a great deal of technical precautions to safeguard your computers, but your protection is unlikely to be effective against a determined attacker with sufficient resources. A government-funded spy agency could likely penetrate your security, should they be motivated to do so. Someone could drive a truck through the wall of your building and steal your computers. Old-fashioned ways are effective, too: there are many ways of coercing people into divulging information.3

Even though there is no absolute computer security, relative computer security can be considered based on six factors:

Breaking down security in this way changes the problem. Security is no longer a binary matter of secure or not-secure; it becomes a problem of risk management,4 and implementing security can be seen as making tradeoffs between the level of protection, the usability of the resulting system, and the cost of implementation.

When you assess risks for risk management, you must consider the risks posed to you by others, and consider the risks posed to others by you. Everybody is your neighbor on the Internet, and it isn't farfetched to think that you could be found negligent if you had insufficient computer security, and your computers were used to attack another site.100

1.3 The cost of malware

Malware unquestionably has a negative financial impact, but how big an impact does it really have?101 It's important to know, because if computer security is to be treated as risk management, then you have to accurately assess how much damage a lapse in security could cause.

At first glance, gauging the cost of malware incidents would seem to be easy. After all, there are any number of figures reported on this, figures attributed to experts. They can vary from one another by an order of magnitude, so if you disagree with one number, you can locate another more to your liking. I use the gross domestic product of Austria, myself - it's a fairly large number, and it's as accurate an estimate as any other.

In all fairness, estimating malware cost is a very hard problem. There are two types of costs to consider: real costs and hidden costs.

Real costs
These are costs which are apparent, and which are relatively easy to calculate. If a computer virus reduced your computer to a bubbling puddle of molten slag,5 the cost to replace it would be straightforward to assess. Similarly, if an employee can't work because their computer is having malware removed from it, then the employee's lost productivity can be computed. The time that technical support staff spend tracking down and fixing affected computers can also be computed. Not all costs are so obvious, however.
Hidden costs
Hidden costs are costs whose impact can't be measured accurately, and may not even be known. Some businesses, like banks and computer security companies, could suffer damage to their reputation from a publicized malware incident. Regardless of the business, a leak of proprietary information or customer data caused by malware could result in enormous damage to a company, no different than industrial espionage. Any downtime could drive existing customers to a competitor, or turn away new, potential customers.

This has been cast in terms of business, but malware presents a cost to individuals, too. Personal information stolen by malware from a computer, such as passwords, credit card numbers, and banking information, can give thieves enough for that tropical vacation they've always dreamed of, or provide a good foundation for identity theft.

1.4 The number of threats

Even the exact number of threats is open to debate. A quick survey of competing anti-virus products shows that the number of threats they claim to detect can vary by as much as a factor of two. Curiously, the level of protection each affords is about the same, meaning that more is not necessarily better.

Why? There is no industry-wide agreement on what constitutes a "threat," to begin with. It's not surprising, given that fact alone, that different anti-virus products would have different numbers - they aren't all counting the same thing. For example, there is some dispute as to whether or not automatically-generated viruses produced by the same tool should be treated as individual threats, or as only one threat. This came to the fore in 1998, when approximately 15,000 new automatically-generated viruses appeared overnight.102 It is also difficult to amass and correctly maintain a malware collection,103 and inadvertent duplication or misclassification of malware samples is always a possibility. There is no single clearinghouse for malware.
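
To make the counting dispute concrete, here is a minimal sketch. The sample names and the generator labels are hypothetical, invented purely for illustration; this is not a description of any real product's database.

```python
from collections import namedtuple

# Hypothetical sample records; the names and generator labels are made up.
Sample = namedtuple("Sample", ["name", "generator"])

samples = [
    Sample("Gen.A.0001", "kit_a"),
    Sample("Gen.A.0002", "kit_a"),
    Sample("Gen.A.0003", "kit_a"),
    Sample("Handmade.X", None),  # not produced by a generating tool
    Sample("Handmade.Y", None),
]

def count_individually(samples):
    """Every sample counts as its own threat."""
    return len(samples)

def count_by_generator(samples):
    """All output of one generating tool counts as a single threat."""
    tools = {s.generator for s in samples if s.generator is not None}
    standalone = sum(1 for s in samples if s.generator is None)
    return len(tools) + standalone

print(count_individually(samples))  # 5
print(count_by_generator(samples))  # 3
```

The same collection yields two different "numbers of threats" depending only on the counting convention.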

Another consideration is that the reported numbers are only for threats that are known about. Ideally, computers should be protected from both known and unknown threats. It's impossible to know about unknown threats, of course, which means that it's impossible to precisely assess how well-protected your computers are against threats.

Different anti-virus products may employ different detection techniques, too. Not all methods of detection rely on exhaustive compilations of known threats, and generic detection techniques routinely find both known and unknown threats without knowing the exact nature of what they're detecting.

Even for known threats, not all may endanger your computers. The majority of malware is targeted to some specific combination of computer architecture and operating system, and sometimes even to a particular application. Effectively these act as preconditions for a piece of malware to run; if any of these conditions aren't true - for instance, you use a different operating system - then that malware poses no direct threat to you. It is inert with respect to your computers.
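
The idea of preconditions can be sketched as a simple predicate. The `Host` and `Threat` records and the function name below are assumptions made for this illustration; real targeting conditions can be far more detailed (versions, patch levels, configuration).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Host:
    architecture: str
    operating_system: str
    applications: frozenset

@dataclass(frozen=True)
class Threat:
    architecture: str
    operating_system: str
    application: Optional[str] = None  # None: no particular application needed

def poses_direct_threat(threat, host):
    """True only if every precondition of the threat holds on the host."""
    if threat.architecture != host.architecture:
        return False
    if threat.operating_system != host.operating_system:
        return False
    if threat.application is not None and threat.application not in host.applications:
        return False
    return True

host = Host("x86", "Unix", frozenset({"mail client"}))
print(poses_direct_threat(Threat("x86", "Windows"), host))  # False
print(poses_direct_threat(Threat("x86", "Unix"), host))     # True
```

A result of `False` only means the malware is inert on that host; it says nothing about the indirect risk of passing it along to others.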

Even if it can't run, malware may carry an indirect liability risk if it passes through your computers from one target to another. For example, one unaffected computer could provide a shared directory; someone else's compromised computer could deposit malware in that shared directory for later propagation. It is prudent to look for threats to all computers, not just to your own.

Figure 1.1. Worm propagation curve

1.5 Speed of propagation

Once upon a time, the speed of malware propagation was measured in terms of weeks or even months. This is no longer the case.

A typical worm propagation curve is shown in Figure 1.1. (For simplicity, the effects on the curve from defensive measures aren't shown.) At first, the worm spreads slowly to vulnerable machines, but eventually begins a period of exponential growth when it spreads extremely rapidly. Finally, once the majority of vulnerable machines have been compromised, the worm reaches a saturation point; any further growth beyond this point is minimal.
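
A curve of this shape can be reproduced with a simple logistic growth model. This is only a sketch: the choice of model, the function name, and every parameter value are assumptions for illustration, and like the figure it ignores defensive measures.

```python
def simulate_worm(vulnerable=1_000_000, initially_infected=1,
                  contact_rate=0.8, steps=40):
    """Discrete logistic growth of the infected population.

    New infections per step are proportional to the number of infected
    machines and to the fraction of machines still vulnerable.
    """
    infected = float(initially_infected)
    history = [infected]
    for _ in range(steps):
        new = contact_rate * infected * (1 - infected / vulnerable)
        infected = min(vulnerable, infected + new)
        history.append(infected)
    return history

history = simulate_worm()
for step in range(0, len(history), 5):
    print(step, round(history[step]))
```

The printed values start small, grow by a roughly constant factor per step during the exponential phase, and then flatten out as the supply of vulnerable machines is exhausted.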

For a worm to spread more quickly, the propagation curve needs to be moved to the left. In other words, the worm author wants the period of exponential growth to occur earlier, preferably before any defenses have been deployed. This is shown in Figure 1.2a.

Figure 1.2. Ideal propagation curves for attackers and defenders

On the other hand, a defender wants to do one of two things. First, the propagation curve could be pushed to the right, buying time to construct a defense before the worm's exponential growth period. Second, the curve could be compressed downwards, meaning that not all vulnerable machines become compromised by the worm. These scenarios are shown in Figure 1.2b.
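
The attacker's and defender's goals can be expressed as changes to two parameters of an S-shaped curve. The sketch below assumes a continuous logistic function; the names `midpoint` and `ceiling` and all of the numbers are hypothetical, and the time units are arbitrary.

```python
import math

def infected_fraction(t, midpoint, ceiling=1.0, rate=1.0):
    """Fraction of vulnerable machines compromised at time t."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

scenarios = {
    "baseline":         dict(midpoint=10.0, ceiling=1.0),
    "attacker (left)":  dict(midpoint=5.0,  ceiling=1.0),  # growth starts earlier
    "defender (right)": dict(midpoint=15.0, ceiling=1.0),  # growth is delayed
    "defender (down)":  dict(midpoint=10.0, ceiling=0.4),  # fewer machines fall
}

for name, params in scenarios.items():
    row = [round(infected_fraction(t, **params), 2) for t in (0, 5, 10, 15, 20)]
    print(f"{name:18s} {row}")
```

Lowering `midpoint` moves the curve left, raising it moves the curve right, and lowering `ceiling` compresses the curve downwards.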

The time axis on these figures has been deliberately left unlabeled, because the exact propagation rate will depend on the techniques that a particular worm uses. However, the theoretical maximum speed of a carefully-designed worm from initial release until saturation is startling: 510 milliseconds to 1.3 seconds.6 In less than two seconds, it's over. No defense that relies on any form of human intervention will be fast enough to cope with threats like this.

1.6 People

Humans are the weak link on several other fronts too, all of which are taken advantage of by malware.

By their nature, humans are trusting, social creatures. These are excellent qualities for your friends to have, and also for your victims to possess: an entire class of attacks, called social engineering attacks, are quick to exploit these desirable human qualities.

Social engineering aside, many people simply aren't aware of the security consequences of their actions. For example, several informal surveys of people on the street have found them more than willing to provide enough information for identity theft (even offering up their passwords) in exchange for chocolate, theater tickets, and coffee vouchers.104

Another problem is that humans - users - don't demand enough of software vendors in terms of secure software. Even for security-savvy users who want secure software, the security of any given piece of software is nearly impossible to assess.

Secure software is software which can't be exploited by an attacker. Just because some software hasn't been compromised is no indication that it's secure - like the stock market, past performance is no guarantee of future results. Unfortunately, that's really the only guideline users have to judge security: the absence of an attack. Software security is thus an anti-feature for vendors, because it's intangible. It's no wonder that vendors opt to add features rather than improve security. Features are easier to sell.

Features are also easier to buy. Humans are naturally wooed by new features, which forms a vicious cycle that gives software vendors little incentive to improve software security.

1.7 About this book

Malware poses an enormous problem in the context of faulty humans and faulty software security. It could be that malware is the natural consequence of the presence of these faults, like vermin slipping through building cracks in the real world. Indeed, names like "computer virus" and "computer worm" bring to mind their biological real-world counterparts.

Whatever the root cause, malware is a problem that needs to be solved. This book looks at malware, primarily viruses and worms, and its countermeasures. The next chapter lays the groundwork with some basic definitions and a timeline of malware. Then, on to viruses: Chapters 3, 4, and 5 cover viruses, anti-virus techniques, and anti-anti-virus techniques, in that order. Chapter 6 explains the weaknesses that are exploited by malware, both technical and social - this is necessary background for the worms in Chapter 7. Defenses against worms are considered in Chapter 8. Some of the possible manifestations of malware are looked at in Chapter 9, followed by a look at the people who create malware and defend against it in Chapter 10. Some final thoughts on defense are in Chapter 11.

The convention used for chapter endnotes is somewhat unusual. The notes tend to fall into two categories. First, there are notes with additional content related to the text. These have endnote numbers from 1-99 within a chapter. Second, there are endnotes that provide citations and pointers to related material. This kind of endnote is numbered 100 or above. The intent is to make the two categories of endnote easily distinguishable in the text.

A lot of statements in this book are qualified with "can" and "could" and "may" and "might." Software is infinitely malleable and can be made to do almost anything; it is hubris to make bold statements about what malware can and can't do.

Finally, this is not a programming book, and some knowledge of programming (in both high- and low-level languages) is assumed, although pseudocode is used where possible. A reasonable understanding of operating systems and networks is also beneficial.

1.8 Some words of warning

Self-replicating software like viruses and worms has proven itself to be very difficult to control, even from the very earliest experiments.7 While self-replicating code may not intentionally be malicious, it can have similar effects regardless. Of course, the risks of overtly malicious software should be obvious. Any experiments with malware, or analysis of malware, should be done in a secure environment designed specifically for that purpose. While it's outside the scope of this book to describe such a secure environment - the details would be quickly out of date anyway - there are a number of sources of information available.105

Another thing to consider is that creation and/or distribution of malware may violate local laws. Many countries have computer crime legislation now,8 and even if the law was violated in a different jurisdiction from where the perpetrator is physically located, extradition agreements may apply.106 Civil remedies for victims of malware are possible as well.

Ironically, some dangers lurk in defensive techniques too. Some of the material in this book is derived from patent documents; the intent is to provide a wide range of information, and is not in any way meant to suggest that these patents should be infringed. While every effort has been made to cite relevant patents, it is possible that some have been inadvertently overlooked. Furthermore, patents may be interpreted very broadly, and the applicability of a patent may depend greatly on the skill and financial resources of the patent holder's legal team. Seek legal advice before rushing off to implement any of the techniques described in this book.

Notes for Chapter 1

1 Based on MessageLabs' sample size of 12.6 billion email messages [203]. This has a higher statistical significance than 99% of statistics you would normally find.

2 Note the capitalization - "DOS" is an operating system, "DoS" is an attack.

3 In cryptography, this has been referred to as "rubber-hose" cryptanalysis [279].

4 Schneier has argued this point of view, and that computer security is an untapped market for insurance companies, who are in the business of managing risk anyway [280].

5 Before any urban legends are started, computer viruses can't do this.

6 These numbers (510 ms for UDP-based worms, 1.3 s for TCP-based worms) are the time it takes to achieve 95% saturation of a million vulnerable machines [303].

7 For example, Cohen's first viruses progressed surprisingly quickly [74], as did Duff's shell script virus [95], and an early worm at Xerox ran amok [287].

8 Computer crime laws are not strictly necessary for prosecuting computer crimes that are just electronic versions of "traditional" crimes like fraud [56], but the trend is definitely to enact computer-specific laws.

100 Owens [237] discusses liability potential in great detail.

101 This section is based on Garfink and Landesman [117], and Ducklin [94] touches on some of the same issues too.

102 Morley [213]. Ducklin [94] has a discussion of this issue, and of other ways to measure the extent of the virus problem.

103 Bontchev [39] talks about the care and feeding of a "clean" virus library.

104 The informal surveys were reported in [30] (chocolate), [31, 274] (theater tickets), and [184] (coffee vouchers). Less amusing, but more rigorous, surveys have been done which show similar problems [270, 305].

105 There are a wide range of opinions on working with malware, ranging from the inadequate to the paranoid. As a starting point, see [21, 75, 187, 282, 288, 312].

106 Although U.S.-centric, Soma et al. [295] give a good overview of the general features of extradition treaties.

Chapter 2 Definitions and timeline

It would be nice to present a clever taxonomy of malicious software, one that clearly shows how each type of malware relates to every other type. However, a taxonomy would give the quaint and totally incorrect impression that there is a scientific basis for the classification of malware.

In fact, there is no universally-accepted definition of terms like "virus" and "worm," much less an agreed-upon taxonomy, even though there have been occasional attempts to impose mathematical formalisms onto malware.100 Instead of trying to pin down these terms precisely, the common characteristics each type of malware typically has are listed.

2.1 Malware types

Malware can be roughly broken down into types according to the malware's method of operation. Anti-"virus" software, despite its name, is able to detect all of these types of malware.

There are three characteristics associated with these malware types.

  1. Self-replicating malware actively attempts to propagate by creating new copies, or instances, of itself. Malware may also be propagated passively, by a user copying it accidentally, for example, but this isn't self-replication.
  2. The population growth of malware describes the overall change in the number of malware instances due to self-replication. Malware that doesn't self-replicate will always have a zero population growth, but malware with a zero population growth may self-replicate.
  3. Parasitic malware requires some other executable code in order to exist. "Executable" in this context should be taken very broadly to include anything that can be executed, such as boot block code on a disk, binary code in applications, and interpreted code. It also includes source code, like application scripting languages, and code that may require compilation before being executed.
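
One way to keep these three characteristics straight is to record them per malware type. The sketch below is a hypothetical encoding; the record layout and the consistency check are assumptions for illustration, and the values are the ones listed under Sections 2.1.1 to 2.1.5.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MalwareType:
    name: str
    self_replicating: bool
    population_growth: str  # "zero" or "positive"
    parasitic: str          # "yes", "no", or "possibly"

TYPES = [
    MalwareType("logic bomb",   False, "zero",     "possibly"),
    MalwareType("Trojan horse", False, "zero",     "yes"),
    MalwareType("back door",    False, "zero",     "possibly"),
    MalwareType("virus",        True,  "positive", "yes"),
    MalwareType("worm",         True,  "positive", "no"),
]

def is_consistent(t):
    """Malware that doesn't self-replicate must have zero population growth.

    The converse does not hold: self-replicating malware may still have
    zero population growth.
    """
    return t.self_replicating or t.population_growth == "zero"

print(all(is_consistent(t) for t in TYPES))             # True
print([t.name for t in TYPES if t.self_replicating])    # ['virus', 'worm']
```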

2.1.1 Logic bomb

Self-replicating: no
Population growth: zero
Parasitic: possibly

A logic bomb is code which consists of two parts:

  1. A payload, which is an action to perform. The payload can be anything, but has the connotation of having a malicious effect.
  2. A trigger, a boolean condition that is evaluated and controls when the payload is executed. The exact trigger condition is limited only by the imagination, and could be based on local conditions like the date, the user logged in, or the operating system version. Triggers could also be designed to be set off remotely, or - like the "dead man's switch" on a train - be set off by the absence of an event.

Logic bombs can be inserted into existing code, or could be standalone. A simple parasitic example is shown below, with a payload that crashes the computer using a particular date as a trigger.

	legitimate code
	if date is Friday the 13th:
		crash_computer()
	legitimate code
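
The separation between trigger and payload can be shown in runnable form. This is a sketch only: the function names are hypothetical, and the payload is a deliberately harmless placeholder that returns a string rather than crashing anything.

```python
import datetime

def is_friday_the_13th(today):
    """Trigger: a boolean condition evaluated on the local date."""
    return today.day == 13 and today.weekday() == 4  # Monday is 0, Friday is 4

def payload():
    """Placeholder payload; deliberately harmless in this sketch."""
    return "payload executed"

def legitimate_code(today):
    """Host code with the trigger check embedded between legitimate steps."""
    log = ["legitimate work"]
    if is_friday_the_13th(today):
        log.append(payload())
    log.append("more legitimate work")
    return log

print(legitimate_code(datetime.date(2006, 1, 13)))  # a Friday the 13th
print(legitimate_code(datetime.date(2006, 1, 14)))  # an ordinary Saturday
```

Because the trigger is just a predicate, swapping in a different condition (a user name, an operating system version, or the absence of an expected event) leaves the rest of the structure unchanged.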

Logic bombs can be concise and unobtrusive, especially in millions of lines of source code, and the mere threat of a logic bomb could easily be used to extort money from a company. In one case, a disgruntled employee rigged a logic bomb on his employer's file server to trigger on a date after he was fired from his job, causing files to be deleted with no possibility of recovery. He was later sentenced to 41 months in prison.101 Another case alleges that an employee installed a logic bomb on 1000 company computers, date-triggered to remove all the files on those machines; the person allegedly tried to profit from the downturn in the company's stock prices that occurred as a result of the damage.1

2.1.2 Trojan horse

Self-replicating: no
Population growth: zero
Parasitic: yes

There was no love lost between the Greeks and the Trojans. The Greeks had besieged the Trojans, holed up in the city of Troy, for ten years. They finally took the city by using a clever ploy: the Greeks built an enormous wooden horse, concealing soldiers inside, and tricked the Trojans into bringing the horse into Troy. When night fell, the soldiers exited the horse and much unpleasantness ensued.102

In computing, a Trojan horse is a program which purports to do some benign task, but secretly performs some additional malicious task. A classic example is a password-grabbing login program which prints authentic-looking "username" and "password" prompts, and waits for a user to type in the information. When this happens, the password grabber stashes the information away for its creator, then prints out an "invalid password" message before running the real login program. The unsuspecting user thinks they made a typing mistake and reenters the information, none the wiser.

Trojan horses have been known about since at least 1972, when they were mentioned in a well-known report by Anderson, who credited the idea to D. J. Edwards.103

2.1.3 Back door

Self-replicating: no
Population growth: zero
Parasitic: possibly

A back door is any mechanism which bypasses a normal security check. Programmers sometimes create back doors for legitimate reasons, such as skipping a time-consuming authentication process when debugging a network server.

As with logic bombs, back doors can be placed into legitimate code or be standalone programs. The example back door below, the check for a hard-coded username on the third and fourth lines, circumvents a login authentication process.

	username = read_username()
	password = read_password()
	if username is "l33t h4ck0r":
		return ALLOW_LOGIN
	if username and password are valid:
		return ALLOW_LOGIN
	else:
		return DENY_LOGIN
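
For readers who want to step through the logic, the same check can be written as a runnable sketch. The credential table and names below are hypothetical, and a plaintext password table is used only to keep the example short; it is not a recommended way to store credentials.

```python
ALLOW_LOGIN, DENY_LOGIN = "allow", "deny"

# Hypothetical credential store, for illustration only.
VALID_USERS = {"alice": "correct horse"}

def check_login(username, password):
    """Return ALLOW_LOGIN or DENY_LOGIN for the given credentials."""
    if username == "l33t h4ck0r":  # back door: skips the real check entirely
        return ALLOW_LOGIN
    if username in VALID_USERS and VALID_USERS[username] == password:
        return ALLOW_LOGIN
    return DENY_LOGIN

print(check_login("alice", "correct horse"))  # allow
print(check_login("alice", "wrong"))          # deny
print(check_login("l33t h4ck0r", ""))         # allow, with no valid password
```

The back door is a single comparison, which is part of why such code can be hard to spot in a large code base.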

One special kind of back door is a RAT, which stands for Remote Administration Tool or Remote Access Trojan, depending on who's asked. These programs allow a computer to be monitored and controlled remotely; users may deliberately install these to access a work computer from home, or to allow help desk staff to diagnose and fix a computer problem from afar. However, if malware surreptitiously installs a RAT on a computer, then it opens up a back door into that machine.

2.1.4 Virus

Self-replicating: yes
Population growth: positive
Parasitic: yes

A virus is malware that, when executed, tries to replicate itself into other executable code; when it succeeds, the code is said to be infected.2 The infected code, when run, can infect new code in turn. This self-replication into existing executable code is the key defining characteristic of a virus.

When faced with more than one virus to describe, a rather silly problem arises. There's no agreement on the plural form of "virus." The two leading contenders are "viruses" and "virii;" the latter form is often used by virus writers themselves, but it's rare to see this used in the security community, who prefer "viruses."104

If viruses sound like something straight out of science fiction, there's a reason for that. They are. The early history of viruses is admittedly fairly murky, but the first mention of a computer virus is in science fiction in the early 1970s, with Gregory Benford's The Scarred Man in 1970, and David Gerrold's When Harlie Was One in 1972.105 Both stories also mention a program which acts to counter the virus, so this is the first mention of anti-virus software as well.

The earliest real academic research on viruses was done by Fred Cohen in 1983, with the "virus" name coined by Len Adleman.106 Cohen is sometimes called the "father of computer viruses," but it turns out that there were viruses written prior to his work. Rich Skrenta's Elk Cloner was circulating in 1982, and Joe Dellinger's viruses were developed between 1981 and 1983; all of these were for the Apple II platform.107 Some sources mention a 1980 glitch in Arpanet as the first virus, but this was just a case of legitimate code acting badly; the only thing being propagated was data in network packets.108 Gregory Benford's viruses were not limited to his science fiction stories; he wrote and released non-malicious viruses in 1969 at what is now the Lawrence Livermore National Laboratory, as well as in the early Arpanet.

Some computer games have featured self-replicating programs attacking one another in a controlled environment. Core War appeared in 1984, where programs written in a simple assembly language called Redcode fought one another; a combatant was assumed to be destroyed if its program counter pointed to an invalid Redcode instruction. Programs in Core War existed only in a virtual machine, but this was not the case for an earlier game, Darwin. Darwin was played in 1961, where a program could hunt and destroy another combatant in a non-virtual environment using a well-defined interface.109 In terms of strategy, successful combatants in these games were hard-to-find, innovative, and adaptive, qualities that can be used by computer viruses too.3

Traditionally, viruses can propagate within a single computer, or may travel from one computer to another using human-transported media, like a floppy disk, CD-ROM, DVD-ROM, or USB flash drive. In other words, viruses don't propagate via computer networks; networks are the domain of worms instead. However, the label "virus" has been applied to malware that would traditionally be considered a worm, and the term has been diluted in common usage to refer to any sort of self-replicating malware.

Viruses can be caught in various stages of self-replication. A germ is the original form of a virus, prior to any replication. A virus which fails to replicate is called an intended. This may occur as a result of bugs in the virus, or encountering an unexpected version of an operating system. A virus can be dormant, where it is present but not yet infecting anything - for example, a Windows virus can reside on a Unix-based file server and have no effect there, but can be exported to Windows machines.4

2.1.5 Worm

Self-replicating: yes
Population growth: positive
Parasitic: no

A worm shares several characteristics with a virus. The most important characteristic is that worms are self-replicating too, but self-replication of a worm is distinct in two ways. First, worms are standalone,5 and do not rely on other executable code. Second, worms spread from machine to machine across networks.

Like viruses, the first worms were fictional. The term "worm" was first used in 1975 by John Brunner in his science fiction novel The Shockwave Rider. (Interestingly, he used the term "virus" in the book too.)6 Experiments with worms performing (non-malicious) distributed computations were done at Xerox PARC around 1980, but there were earlier examples. A worm called Creeper crawled around the Arpanet in the 1970s, pursued by another called Reaper which hunted and killed off Creepers.7

A watershed event for the Internet happened on November 2, 1988, when a worm incapacitated the fledgling Internet. This worm is now called the Internet worm, or the Morris worm after its creator, Robert Morris, Jr. At the time, Morris had just started a Ph.D. at Cornell University. He had been intending for his worm to propagate slowly and unobtrusively, but what happened was just the opposite. Morris was later convicted for his worm's unauthorized computer access and the costs incurred to clean up from it. He was fined, and sentenced to probation and community service.8 Chapter 7 looks at this worm in detail.

2.1.6 Rabbit

Self-replicating: yes
Population growth: zero
Parasitic: no

Rabbit is the term used to describe malware that multiplies rapidly. Rabbits may also be called bacteria, for largely the same reason.

There are actually two kinds of rabbit.110 The first is a program which tries to consume all of some system resource, like disk space. A "fork bomb," a program which creates new processes in an infinite loop, is a classic example of this kind of rabbit. These tend to leave painfully obvious trails pointing to the perpetrator, and are not of particular interest.

The second kind of rabbit, which the characteristics above describe, is a special case of a worm. This kind of rabbit is a standalone program which replicates itself across a network from machine to machine, but deletes the original copy of itself after replication. In other words, there is only one copy of a given rabbit on a network; it just hops from one computer to another.9 Rabbits are rarely seen in practice.

2.1.7 Spyware

Self-replicating: no
Population growth: zero
Parasitic: no

Spyware is software which collects information from a computer and transmits it to someone else. Prior to its emergence in recent years as a threat, the term "spyware" was used in 1995 as part of a joke, and in a 1994 Usenet posting looking for "spy-ware" information.111

The exact information spyware gathers may vary, but can include anything which potentially has value:

  1. Usernames and passwords. These might be harvested from files on the machine, or by recording what the user types using a keylogger. A keylogger differs from a Trojan horse in that a keylogger passively captures keystrokes only; no active deception is involved.
  2. Email addresses, which would have value to a spammer.
  3. Bank account and credit card numbers.
  4. Software license keys, to facilitate software pirating.

Viruses and worms may collect similar information, but are not considered spyware, because spyware doesn't self-replicate.112 Spyware may arrive on a machine in a variety of ways, such as bundled with other software that the user installs, or exploiting technical flaws in web browsers. The latter method causes the spyware to be installed simply by visiting a web page, and is sometimes called a drive-by download.

2.1.8 Adware

Self-replicating: no
Population growth: zero
Parasitic: no

Adware has similarities to spyware in that both are gathering information about the user and their habits. Adware is more marketing-focused, and may pop up advertisements or redirect a user's web browser to certain web sites in the hopes of making a sale. Some adware will attempt to target the advertisement to fit the context of what the user is doing. For example, a search for "Calgary" may result in an unsolicited pop-up advertisement for "books about Calgary."

Adware may also gather and transmit information about users which can be used for marketing purposes. As with spyware, adware does not self-replicate.
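The three characteristics listed at the start of each of the preceding subsections can be gathered into a small lookup table. The sketch below is only an illustration: the names Traits, MALWARE_TYPES, and matching_types are made up for this example, and only the four types from Sections 2.1.5 to 2.1.8 are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traits:
    """The three characteristics used to describe each malware type."""
    self_replicating: bool
    population_growth: str  # "zero" or "positive"
    parasitic: bool


# Values copied from the characteristic lists in Sections 2.1.5 to 2.1.8.
# The rabbit entry describes the second kind of rabbit (the network hopper).
MALWARE_TYPES = {
    "Worm": Traits(True, "positive", False),
    "Rabbit": Traits(True, "zero", False),
    "Spyware": Traits(False, "zero", False),
    "Adware": Traits(False, "zero", False),
}


def matching_types(traits):
    """Return the names of all listed types with exactly these traits."""
    return sorted(name for name, t in MALWARE_TYPES.items() if t == traits)


print(matching_types(Traits(False, "zero", False)))  # ['Adware', 'Spyware']
```

Spyware and adware share the same three values, so these characteristics alone cannot tell them apart; the distinction lies in what the software does with the information it gathers.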

2.1.9 Hybrids, Droppers, and Blended Threats

The exact type of malware encountered in practice is not necessarily easy to determine, even given these loose definitions of malware types. The nature of software makes it easy to create hybrid malware which has characteristics belonging to several different types.10

A classic hybrid example was presented by Ken Thompson in his ACM Turing award lecture.11 He prepared a special C compiler executable which, besides compiling C code, had two additional features:

  1. When compiling the login source code, his compiler would insert a back door to bypass password authentication.
  2. When compiling the compiler's source code, it would produce a special compiler executable with these same two features.

His special compiler was thus a Trojan horse, which replicated like a virus, and created back doors. This also demonstrated the vulnerability of the compiler tool chain: since the original source code for the compiler and login programs wasn't changed, none of this nefarious activity was apparent.

Another hybrid example was a game called Animal, which played twenty questions with a user. John Walker modified it in 1975, so that it would copy the most up-to-date version of itself into all user-accessible directories whenever it was run. Eventually, Animals could be found roaming in every directory in the system.113 The copying behavior was unknown to the game's user, so it would be considered a Trojan horse. The copying could also be seen as self-replication, and although it didn't infect other code, it didn't use a network either - not really a worm, not really a virus, but certainly exhibiting viral behavior.

There are other combinations of malware too. For example, a dropper is malware which leaves behind, or drops, other malware.12 A worm can propagate itself, depositing a Trojan horse on all computers it compromises; a virus can leave a back door in its wake.

A blended threat is a virus that exploits a technical vulnerability to propagate itself, in addition to exhibiting "traditional" characteristics. This has considerable overlap with the definition of a worm, especially since many worms exploit technical vulnerabilities. These technical vulnerabilities have historically required precautions and defenses distinct from those that anti-virus vendors provided, and this rift may account for the duplication in terms.114 The Internet worm was a blended threat, according to this definition.

2.1.10 Zombies

Computers that have been compromised can be used by an attacker for a variety of tasks, unbeknownst to the legitimate owner; computers used in this way are called zombies. The most common tasks for zombies are sending spam and participating in coordinated, large-scale denial-of-service attacks.

Sending spam violates the acceptable use policy of many Internet service providers, not to mention violating laws in some jurisdictions. Sites known to send spam are also blacklisted, so that incoming email from them can be summarily rejected. It is therefore ill-advised for spammers to send spam directly, in such a way that it can be traced back to them and their machines. Zombies provide a windfall for spammers, because they are a free, throwaway resource: spam can be relayed through zombies, which obscures the spammer's trail, and a blacklisted zombie machine presents no hardship to the spammer.13

As for denials of service, one type of denial-of-service attack involves either flooding a victim's network with traffic, or overwhelming a legitimate service on the victim's network with requests. Launching this kind of attack from a single machine would be pointless, since one machine's onslaught is unlikely to generate enough traffic to take out a large target site, and traffic from one machine can be easily blocked by the intended victim. On the other hand, a large number of zombies all targeting a site at the same time can cause grief. A coordinated, network-based denial-of-service attack that is mounted from a large number of machines is called a distributed denial-of-service attack, or DDoS attack.

Networks of zombies need not be amassed by the person that uses them; the use of zombie networks can be bought for a price.14 Another issue is how to control zombie networks. One method involves zombies listening for commands on Internet Relay Chat (IRC) channels, which provides a relatively anonymous, scalable means of control. When this is used, the zombie networks are referred to as botnets, named after automated IRC client programs called bots.15

2.2 Naming

When a new piece of malware is spreading, the top priority of anti-virus companies is to provide an effective defense, quickly. Coming up with a catchy name for the malware is a secondary concern.

Typically the primary, human-readable name of a piece of malware is decided by the anti-virus researcher16 who first analyzes the malware.115 Names are often based on unique characteristics that malware has, either some feature of its code or some effect that it has. For example, a virus' name may be derived from some distinctive string that is found inside it, like "Your PC is now Stoned!"17 Virus writers, knowing this, may leave such clues deliberately in the hopes that their creation is given a particular name. Anti-virus researchers, knowing this, will ignore obvious naming clues so as not to play into the virus writer's hand.18

There is no central naming authority for malware, and the result is that a piece of malware will often have several different names. Needless to say, this is confusing for users of anti-virus software, trying to reconcile names heard in alerts and media reports with the names used by their own anti-virus software. To compound the problem, some sites use anti-virus software from multiple different vendors, each of whom may have different names for the same piece of malware.19 Common naming would benefit anti-virus researchers talking to one another too.20

Unfortunately, there isn't likely to be any central naming authority in the near future, for two reasons.21 First, the current speed of malware propagation precludes checking with a central authority in a timely manner.22 Second, it isn't always clear what would need to be checked, since one distinct piece of malware may manifest itself in a practically infinite number of ways.

Recommendations for malware naming do exist, but in practice are not usually followed,23 and anti-virus vendors maintain their own separately-named databases of malware that they have detected. It would, in theory, be possible to manually map malware names between vendors using the information in these databases, but this would be a tedious and error-prone task.

A tool called VGrep automates this process of mapping names.116 First, a machine is populated with the malware of interest. Then, as shown in Figure 2.1, each anti-virus product examines each file on the machine, and outputs what (if any) malware it detects. VGrep gathers all this anti-virus output and collates it for later searching. The real technical challenge is not collating the data, but simply getting usable, consistent output from a wide range of anti-virus products.
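The collation and search steps can be sketched in a few lines. This is a minimal illustration and not VGrep's actual implementation: the vendor labels, file names, the functions collate and aliases, and the pairing of names with files are all hypothetical. The sketch also assumes that the hard part, getting consistent output from each product, has already been done.

```python
from collections import defaultdict

# Hypothetical, already-normalized scanner output:
# (vendor, file, detected name or None if nothing was detected).
# The "Example" names are made up for illustration.
SCAN_RESULTS = [
    ("VendorA", "sample1.bin", "W32/Mytob.D@mm"),
    ("VendorB", "sample1.bin", "Net-Worm.Win32.Mytob.c"),
    ("VendorC", "sample1.bin", None),
    ("VendorA", "sample2.bin", "W32/Example.A"),
    ("VendorB", "sample2.bin", "Example.A"),
]


def collate(results):
    """Group detections by file: {file: {vendor: name}}."""
    by_file = defaultdict(dict)
    for vendor, path, name in results:
        if name is not None:
            by_file[path][vendor] = name
    return dict(by_file)


def aliases(collated, query):
    """Return every name given to any file that some vendor's name matches.

    Matching is a case-insensitive substring test on the query.
    """
    wanted = query.lower()
    found = set()
    for names in collated.values():
        if any(wanted in name.lower() for name in names.values()):
            found.update(names.values())
    return sorted(found)


table = collate(SCAN_RESULTS)
print(aliases(table, "mytob"))
```

Keying the table by file rather than by name is what makes the cross-vendor mapping possible: two names are treated as aliases only because two products reported them for the same file.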

Figure 2.1. VGrep operation

The naming problem and the need for tools like VGrep can be demonstrated using an example. Using VGrep and cross-referencing vendors' virus databases, the partial list of names below for the same worm can be found.24

These results highlight some of the key identifiers used for naming malware:117

Malware type.
This is the type of the threat which, for this example, is a worm.
Platform specifier.
The environment in which the malware runs; this worm needs the Windows 32-bit operating system API ("W32" and "Win32").25 More generally, the platform specifier could be any execution environment, such as an application's programming language (e.g., "VBS" for "Visual Basic Script"), or may even need to specify a combination of hardware and software platform.
Family name.
The family name is the "human-readable" name of the malware that is usually chosen by the anti-virus researcher performing the analysis. This example shows several different, but obviously related, names. The relationship is not always obvious: "Nachi" and "Welchia" are the same worm, for instance.
Variant.
Not unlike legitimate software, a piece of malware tends to be released multiple times with minor changes.26 This change is referred to as the malware's variant or, following the biological analogy, the strain of the malware.

Variants are usually assigned letters in increasing order of discovery, so this "C" variant is the third B[e]agle found. Particularly persistent families with many variants will have multiple letters, as "Z" gives way to "AA." Unfortunately, this is not unusual - some malware has dozens of variants.27

Modifiers.
Modifiers supply additional information about the malware, such as its primary means of propagation. For example, "mm" stands for "mass mailing."

The results also highlight the fact that not all vendors supply all these identifiers for every piece of malware, that there is no common agreement on the specific identifiers used, and that there is no common syntax used for names.
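The variant lettering described above can be made concrete with a short sketch. It assumes that the letters continue past "Z" the way spreadsheet columns do ("Z", "AA", "AB", ..., "AZ", "BA", ...); the text only states that "Z" gives way to "AA," so the rest of the sequence is an assumption, and the function names are made up for this example.

```python
def variant_ordinal(letters):
    """Convert a variant suffix such as "C" or "AA" to its discovery order.

    Assumes spreadsheet-style lettering: A=1, ..., Z=26, AA=27, AB=28, ...
    """
    if not letters or not (letters.isascii() and letters.isalpha()):
        raise ValueError(f"not a variant suffix: {letters!r}")
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def ordinal_variant(n):
    """Inverse of variant_ordinal: 1 -> "A", 26 -> "Z", 27 -> "AA"."""
    if n < 1:
        raise ValueError("variant numbers start at 1")
    out = []
    while n > 0:
        n, rem = divmod(n - 1, 26)
        out.append(chr(ord("A") + rem))
    return "".join(reversed(out))


print(variant_ordinal("C"))   # 3
print(ordinal_variant(27))    # AA
```

Some vendors write the variant in lower case (as in "Mytob.c"), which is why the conversion upper-cases its input first.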

Besides VGrep, there are online services where a suspect file can be uploaded and examined by multiple anti-virus products. Output from a service like this also illustrates the variety in malware naming:28

Worm/Mydoom.BC        Win32:Mytob-D     I-Worm/Mydoom
Win32.Worm.Mytob.C    Worm.Mytob.C      Win32.HLLM.MyDoom.22
W32/Mytob.D@mm        W32/Mytob.C-mm    Net-Worm.Win32.Mytob.c
Win32/Mytob.D         Mytob.D

Ultimately, however, the biggest concern is that the malware is detected and eliminated, not what it's called.

2.3 Authorship

People whose computers are affected by malware typically have a variety of colorful terms to describe the person who created the malware. This book will use the comparatively bland terms malware author and malware writer to describe people who create malware; when appropriate, more specific terms like virus writer may be used too.

There's a distinction to be made between the malware author and the malware distributor. Writing malware doesn't imply distributing malware, and vice versa, and there have been cases where the two roles are known to have been played by different people.29 Having said that, the malware author and distributor will be assumed to be the same person throughout this book, for simplicity.

Is a malware author a "hacker?" Yes and no. The term hacker has been distorted by the media and popular usage to refer to a person who breaks into computers, especially when some kind of malicious intent is involved. Strictly speaking, a person who breaks into computers is a cracker, not a hacker,118 and there may be a variety of motivations for doing so. In geek parlance, being called a hacker actually has a positive connotation, and means a person who is skilled at computer programming; hacking has nothing to do with computer intrusion or malware.

Hacking (in the popular sense of the word) also implies a manual component, whereas the study of malware is the study of large-scale, automated forms of attack. Because of this distinction and the general confusion over the term, this book will not use it in relation to malware.

2.4 Timeline

Figure 2.2 puts some important events in context. With the exception of adware and spyware, which appeared in the late 1990s, all of the different types of malware were known about in the early 1970s. The prevalence of viruses, worms, and other malware has been gradually building steam since the mid-1980s, leaving us with lots of threats - no matter how they're counted.

Figure 2.2. Timeline of events

Notes for Chapter 2

1 This case doesn't appear to have gone to trial yet, so the person may yet be found not guilty. Regardless, the charges in the indictment [327] serve as an example of how a logic bomb can be used maliciously.

2 The term "computer virus" is preferable if there's any possibility of confusion with biological viruses.

3 Bassham and Polk [28] note that innovation is important for the longevity of computer viruses, especially if the result is something that hasn't yet been seen by anti-virus software. They also point out that non-destructive viruses have an increased chance of survival, by not drawing attention to themselves.

4 These three definitions are based on Harley et al. [137]; Radatti [258] talks about viruses passing through unaffected platforms, which he calls Typhoid Mary Syndrome.

5 Insofar as a worm can be said to stand.

6 This farsighted book also included ideas about an internet and laser printers [50].

7 The Xerox work is described in Shoch and Hupp [287], and both they and Dewdney [91] mention Creeper and Reaper. There were two versions of Creeper, of which the first would be better called a rabbit, the second a worm.

8 This version of the event is from [329]. An interesting historical twist: Morris, Jr.'s father was one of the people playing Darwin in the early 1960s at Bell Labs, and created 'The species which eventually wiped out all opposition...' [9, page 95].

9 Nazario [229] calls this second kind of rabbit a "jumping executable worm."

10 "Hybrid" is used in a generic sense here; Harley et al. [137] use the term "hybrid viruses" to describe viruses that execute concurrently with the infected code.

11 From Thompson [322]; he simply calls it a Trojan horse.

12 This differs from Harley et al. [137], who define a dropper to be a program that installs malware. However, this term is so often applied to malware that this narrower definition is used here.

13 There are many other spamming techniques besides this; Spammer-X [300, Chapter 3] has more information. Back-door functionality left behind by worms has been used for sending spam in this manner [188].

14 Acohido and Swartz [2] mention a $2000-$3000 rental fee for 20,000 zombies, but prices have been dropping [300].

15 Cooke et al. [79] look at botnet evolution, and take the more general view that botnets are just zombie armies, and need a controlling communication channel, but that channel doesn't have to be IRC. There are also a wide variety of additional uses for botnets beyond those listed here [319].

16 In the anti-virus industry, people who analyze malware for anti-virus companies are referred to as "researchers." This is different from the academic use of the term.

17 This was one suggested way to find the Stoned virus [290].

18 Lyman [189], but this is common knowledge in the anti-virus community.

19 Diversity is usually a good thing when it comes to defense, and large sites will often use different anti-virus software on desktop machines than they use on their gateway machines. In a panel discussion at the 2003 Virus Bulletin conference, one company revealed that they used eleven different anti-virus products.

20 While the vast majority of interested parties want common naming, their motivations for wanting this may be different, and they may treat different parts of the name as being significant [182].

21 Having said this, an effort has been announced recently to provide uniform names for malware. The "Common Malware Enumeration" will issue a unique identifier for malware causing major outbreaks, so users can refer to highly mnemonic names like "CME-42," which intuitively may have been issued before "CME-40" and "CME-41" [176].

22 Of course, this begs the question of why such a central authority wasn't established in the early days of malware prevalence, when there was less malware and the propagation speeds tended to be much, much slower.

23 CARO, the Computer Antivirus Research Organization, produced virus-naming guidelines in 1991 [53], which have since been updated [109].

24 Vendor names have been removed from the results.

25 "API" stands for "application programming interface."

26 Not all variants necessarily come from the same source. For example, the "B" variant of the Blaster worm was released by someone who had acquired a copy of the "A" variant and modified it [330].

27 A few, like Gaobot, have hundreds of variants, and require three letters to describe their variant!

28 This example is from [47], again with vendor information removed.

29 Dellinger's "Virus 2" spread courtesy of the virus writer's friends [87], and secondhand stories indicate that Stoned was spread by someone besides its author [119,137,290]. Malware writers are rarely caught or come forward, so discovering these details is unusual.

100 For example, Adleman [3] and Cohen [75].

101 The details of the case may be found in [328]; [326] has sentencing information.

102 Paraphrased liberally from Virgil's Aeneid, Book II [336].

103 Anderson [12].

104 A sidebar in Harley et al. [137, page 60] has an amusing collection of suggested plural forms that didn't make the cut.

105 Benford [33] and Gerrold [118], respectively. Benford talks about his real computer viruses in this collection of reprinted stories.

106 As told in Cohen [74].

107 Skrenta [289] and Dellinger [87].

108 The whole sordid tale is in Rosen [267].

109 The original Core War article is Dewdney [91]; Darwin is described in [9, 201].

110 Bontchev [46].

111 Vossen [338] and van het Groenewoud [331], respectively.

112 This definition of spyware and adware follows Gordon [124].

113 Walker wrote a letter to Dewdney [340], correcting Dewdney's explanation of Animal in his column [92] (this column also mentions Skrenta's virus).

114 Chien and Szor [70] explain blended threats and the historical context of the anti-virus industry with respect to them.

115 Bontchev [44] and Lyman [189] describe the process by which a name is assigned.

116 VGrep was originally by Ian Whalley; this discussion of its operation is based on its online documentation [333].

117 This description is based on the CARO identifiers and terminology [109].

118 The Jargon File lists the many nuances of "hacker," along with a hitchhiker's guide to the hacker subculture [260].

Chapter 3 Viruses

A computer virus has three parts:100

Infection mechanism
How a virus spreads, by modifying other code to contain a (possibly altered) copy of the virus. The exact means through which a virus spreads is referred to as its infection vector. This doesn't have to be unique - a virus that infects in multiple ways is called multipartite.
Trigger
The means of deciding whether to deliver the payload or not.
Payload
What the virus does, besides spread. The payload may involve damage, either intentional or accidental. Accidental damage may result from bugs in the virus, encountering an unknown type of system, or perhaps unanticipated multiple viral infections.

The infection mechanism is the only mandatory part, because infection is one of the key defining characteristics of a virus; the trigger and payload are optional. In the absence of infection, only the trigger and payload remain, which is a logic bomb.

In pseudocode, a virus would have the structure below. The trigger function would return a boolean, whose value would indicate whether or not the trigger conditions were met. The payload could be anything, of course.

	def virus():
	    infect()
	    if trigger() is true:
	        payload()

Infection is done by selecting some target code and infecting it, as shown below. The target code is locally accessible to the machine where the virus runs, applying the definition of viruses from the last chapter. Locally accessible targets may include code in shared network directories, though, as these directories are made to appear locally accessible.

Generally, k targets may be infected each time the infection code below is run. The exact method used to select targets varies, and may be trivial, as in the case of the boot-sector infectors in Section 3.1.1. The tricky part of select_target is that the virus doesn't want to repeatedly re-infect the same code; that would be a waste of effort, and may reveal the presence of the virus. Select_target has to have some way to detect whether or not some potential target code is already infected, which is a double-edged sword. If the virus can detect itself, then so can anti-virus software. The infect_code routine performs the actual infection by placing some version of the virus' code in the target.

	def infect():
	    repeat k times:
	        target = select_target()
	        if no target:
	            return
	        infect_code(target)
Viruses can be classified in a variety of ways. The next two sections classify them along orthogonal axes: the type of target the virus tries to infect, and the method the virus uses to conceal itself from detection by users and anti-virus software. Virus creation need not be difficult, either; the virus classification is followed by a look at do-it-yourself virus kits for the programming-challenged.

3.1 Classification by Target

One way of classifying viruses is by what they try to infect. This section looks at three: boot-sector infectors, executable file infectors, and data file infectors (a.k.a. macro viruses).

3.1.1 Boot-Sector Infectors

Although the exact details vary, the basic boot sequence on most machines goes through these steps:

  1. Power on.
  2. ROM-based instructions run, performing a self-test, device detection, and initialization. The boot device is identified, and the boot block is read from it; typically the boot block consists of the initial block(s) on the device.1 Once the boot block is read, control is transferred to the loaded code. This step is referred to as the primary boot.
  3. The code loaded during the primary boot step loads a larger, more sophisticated program that understands the boot device's filesystem structure, and transfers control to it. This is the secondary boot.
  4. The secondary boot code loads and runs the operating system kernel.

Figure 3.1. Multiple boot sector infections

A boot-sector infector, or BSI, is a virus that infects by copying itself to the boot block. It may copy the contents of the former boot block elsewhere on the disk first,2 so that the virus can transfer control to it later to complete the booting process.

One potential problem with preserving the boot block contents is that block allocation on disk is filesystem-specific. Properly allocating space to save the boot block requires a lot of code, a luxury not available to BSIs. An alternate method is to always copy the original boot block to some fixed, "safe" location on disk. This alternate method can cause problems when a machine is infected multiple times by different viruses that happen to use that same safe location, as shown in Figure 3.1. This is an example of unintentional damage being done by a virus, and has actually occurred: Stoned and Michelangelo were BSIs that both picked the same disk block as their safe location.101

In general, infecting the boot sector is strategically sound: the virus may be in a known location, but it establishes itself before any anti-virus software starts or operating system security is enabled. But BSIs are rare now. Machines are rebooted less often, and there is very little use of bootable media like floppy disks.3 From a defensive point of view, most operating systems prevent writing to the disk's boot block without proper authorization, and many a BIOS4 has boot block protection that can be enabled.

3.1.2 File Infectors

Operating systems have a notion of files that are executable. In a broader sense, executable files may also include files that can be run by a command-line user "shell." A file infector is a virus that infects files which the operating system or shell consider to be executable; this could include batch files and shell scripts, but binary executables are the most common target.

There are two main issues for file infectors:

  1. Where is the virus placed?
  2. How is the virus executed when the infected file is run?

For BSIs, the answer to these questions was apparent. A BSI places itself in the boot block and gets executed through a machine's normal boot sequence. File infectors have a few more options at their disposal, though, and often the answers to these questions are interdependent. The remainder of this section is organized around the answer to the first question: where is the virus placed?

3.1.2.1 Beginning of File

Older, very simple executable file formats like the .COM MS-DOS format would treat the entire file as a combination of code and data. When executed, the entire file would be loaded into memory, and execution would start by jumping to the beginning of the loaded file.102

In this case, a virus that places itself at the start of the file gets control first when the infected file is run, as illustrated in Figure 3.2. This is called a prepending virus. Inserting itself at the start of a file involves some copying, which isn't difficult, but isn't the absolute easiest way to infect a file.

3.1.2.2 End of File

In contrast, appending code onto the end of a file is extremely easy. A virus that places itself at the end of a file is called an appending virus.

How does the virus get control? There are two basic possibilities:

Figure 3.3 shows an appending virus using the latter scheme.

Figure 3.2. Prepending virus

Figure 3.3. Appending virus

3.1.2.3 Overwritten into File

An overwriting virus places itself atop part of the original code.5 This avoids an obvious change in file size that would occur with a prepending or appending virus, and the virus' code can be placed in a location where it will get control.

Obviously, overwriting code blindly is almost certain to break the original code and lead to rapid discovery of the virus. There are several options, with varying degrees of complexity and risk.

None of these options is likely to yield a large amount of space, so overwriting viruses must be small.

3.1.2.4 Inserted into File

Another possibility is that a virus can insert itself into the target code, moving the target code out of the way, and even interspersing small pieces of virus code with target code. This is no easy feat: branch targets in the code have to be changed, data locations must be updated, and linker relocation information needs modification. Needless to say, this file infection technique is rarely seen.8

3.1.2.5 Not in File

A companion virus is one which installs itself in such a way that it is naturally executed before the original code. The virus never modifies the infected code, and gains control by taking advantage of the process by which the operating system or shell searches for executable files. Although this bears the hallmarks of a Trojan horse, a companion virus is a "real" virus by virtue of self-replication.

The easiest way to explain companion viruses is by example.103

3.1.3 Macro Viruses

Some applications allow data files, like word processor documents, to have "macros" embedded in them. Macros are short snippets of code written in a language which is typically interpreted by the application, a language which provides enough functionality to write a virus. Thus, macro viruses are better thought of as data file infectors, but since their predominant form has been macros, the name has stuck.

When a macro-containing document is loaded by the application, the macros can be caused to run automatically, which gives control to the macro virus. Some applications warn the user about the presence of macros in a document, but these warnings may be easily ignored.

A proof-of-concept of macro viruses was published in 1989,105 in response to rumors of their existence. Macro viruses didn't hit the mainstream until 1995, when the Concept virus was distributed, targeting Microsoft Word documents across multiple platforms.9

Concept's operation is shown in Figure 3.4. Word has a persistent, global set of macros which apply to all edited documents, and this is Concept's target: once installed in the global macros, it can infect all documents edited in the future. A document infected by Concept includes two macros that have special properties in Word.

Figure 3.4. Concept in action

AutoOpen
Any code in the AutoOpen macro is run automatically when the file is opened. This is how an infected document gains control.
FileSaveAs
The code in the FileSaveAs macro is run when its namesake menu item (File... Save As...) is selected. In other words, this code can be used to infect any as-yet-uninfected document that is being saved by the user.

From a technical standpoint, macro languages are easier to use than lower-level programming languages, so macro viruses drastically lower the barrier to virus creation.

3.2 Classification by Concealment Strategy

Another way of classifying viruses is by how they try to conceal themselves, both from users and from anti-virus software.

3.2.1 No Concealment

Not hiding at all is one concealment strategy which is remarkably easy to implement in a computer virus. It goes without saying, however, that it's not very effective - once the presence of a virus is known, it's trivial to detect and analyze.

Figure 3.5. Encrypted virus pseudocode

3.2.2 Encryption

With an encrypted virus, the idea is that the virus body (infection, trigger, and payload) is encrypted in some way to make it harder to detect. This "encryption" is not what cryptographers call encryption; virus encryption is better thought of as obfuscation. (Where it's necessary to distinguish between the two meanings of the word, I'll use the term "strong encryption" to mean encryption in the cryptographic sense.)

When the virus body is in encrypted form, it's not runnable until decrypted. What executes first in the virus, then, is a decryptor loop, which decrypts the virus body and transfers control to it. The general principle is that the decryptor loop is small compared to the virus body, and provides a smaller profile for anti-virus software to detect.

Figure 3.5 shows pseudocode for an encrypted virus. A decryptor loop can decrypt the virus body in place, or to another location; this choice may be dictated by external constraints, like the writability of the infected program's code. This example shows an in-place decryption.

How is virus encryption done? Here are six ways:106

Simple encryption.
No key is used for simple encryption, just basic parameterless operations, like incrementing and decrementing, bitwise rotation, arithmetic negation, and logical NOT:10
	Encryption	Decryption
	inc body_i	dec body_i
	rol body_i	ror body_i
	neg body_i	neg body_i
Static encryption key.
A static, constant key is used for encryption which doesn't change from one infection to the next. The operations used would include arithmetic operations like addition, and logical operations like XOR. Notice that the use of reversible operations is a common feature of simpler types of virus encryption. In pseudocode:
	Encryption	Decryption
	body_i + 123	body_i - 123
	body_i xor 42	body_i xor 42
Variable encryption key.
The key begins as a constant value, but changes as the decryption proceeds. For example:
	key = 123
	for i in 0...length(body):
	    body_i = body_i xor key
	    key = key + body_i
Substitution cipher.
A more general encryption could employ lookup tables which map byte values between their encrypted and decrypted forms. Here, encrypt and decrypt are 256-byte arrays, initialized so that if encrypt[j]=k, then decrypt[k]=j:
	Encryption			Decryption
	body_i = encrypt[body_i]	body_i = decrypt[body_i]

This substitution cipher is a 1:1 mapping, but in actual fact, the virus body may not contain all 256 possible byte values. A homophonic substitution cipher allows a 1:n mapping, increasing complexity by permitting multiple encrypted values to correspond to one decrypted value.

Strong encryption.
There is no reason why viruses cannot use strong encryption. Previously, code size might have been a factor, if the virus would have to carry strong decryption code with it, but this is no longer a problem: most systems now contain strong encryption libraries which can be used by viruses.107
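
The reversibility that the simpler schemes rely on is easy to check on ordinary data. The following is a minimal Python sketch, not taken from any real virus: the function names are hypothetical, the keys 42 and 123 are simply the ones used in the pseudocode above, arithmetic is assumed to be on 8-bit bytes, and the "body" is a harmless text string.

```python
import random

def rol8(b: int) -> int:
    """Rotate an 8-bit value left by one bit."""
    return ((b << 1) | (b >> 7)) & 0xFF

def ror8(b: int) -> int:
    """Rotate an 8-bit value right by one bit (undoes rol8)."""
    return ((b >> 1) | (b << 7)) & 0xFF

def static_xor(body: bytes, key: int = 42) -> bytes:
    """Static key: XOR is its own inverse, so this both encodes and decodes."""
    return bytes(b ^ key for b in body)

def variable_key_encrypt(body: bytes, key: int = 123) -> bytes:
    """Mirror image of the variable-key loop: the key advances by each plain byte."""
    out = bytearray()
    for plain in body:
        out.append(plain ^ key)
        key = (key + plain) & 0xFF
    return bytes(out)

def variable_key_decrypt(body: bytes, key: int = 123) -> bytes:
    """The loop from the pseudocode: decode a byte, then add it to the key."""
    out = bytearray()
    for enc in body:
        plain = enc ^ key
        out.append(plain)
        key = (key + plain) & 0xFF
    return bytes(out)

def make_tables(seed: int = 0):
    """Build encrypt/decrypt arrays so that encrypt[j] == k implies decrypt[k] == j."""
    encrypt = list(range(256))
    random.Random(seed).shuffle(encrypt)
    decrypt = [0] * 256
    for j, k in enumerate(encrypt):
        decrypt[k] = j
    return encrypt, decrypt

sample = b"harmless sample text"
encrypt, decrypt = make_tables()
substituted = bytes(encrypt[b] for b in sample)
restored = bytes(decrypt[b] for b in substituted)
```

In each case, applying the decryption operation to the encrypted bytes gives back `sample` unchanged, which is all that a decryptor loop needs.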

The major weakness in the encryption schemes above is that the encrypted virus body is the same from one infection to the next. That constancy makes a virus as easy to detect as one using no concealment at all! With random encryption keys,108 this error is avoided: the key used for encryption changes randomly with each new infection. This idea can be applied to any of the encryption types described here. Obviously, the virus' decryptor loop must be updated for each infection to incorporate the new key.
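
From the detector's side, the difference between a static key and random keys can be illustrated as follows. This is only a sketch under stated assumptions: `BODY` is a harmless stand-in string, single-byte XOR stands in for whichever scheme is used, and `SIGNATURE`, `xor_bytes`, and the other names are hypothetical. The "scanner" here just checks whether a copy starts with bytes captured from one earlier sample.

```python
import random

BODY = b"stand-in for a virus body (harmless text)"
STATIC_KEY = 42

def xor_bytes(data: bytes, key: int) -> bytes:
    """Single-byte XOR, standing in for any of the schemes above."""
    return bytes(b ^ key for b in data)

# Signature taken from one captured copy that used the static key.
SIGNATURE = xor_bytes(BODY, STATIC_KEY)[:16]

# Static key: every copy is byte-for-byte identical, so the signature
# matches all of them.
copies_static = [xor_bytes(BODY, STATIC_KEY) for _ in range(5)]
static_hits = sum(c.startswith(SIGNATURE) for c in copies_static)

# Random keys: each copy is encoded under a different key, so the same
# signature no longer matches the encrypted body.
rng = random.Random(1)
candidate_keys = [k for k in range(1, 256) if k != STATIC_KEY]
keys = rng.sample(candidate_keys, 5)
copies_random = [xor_bytes(BODY, k) for k in keys]
random_hits = sum(c.startswith(SIGNATURE) for c in copies_random)

print(static_hits, random_hits)
```

With random keys the body stops being a useful signature, which is why attention shifts to the one part that still does not change: the decryptor loop.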

3.2.3 Stealth

A stealth virus is a virus that actively takes steps to conceal the infection itself, not just the virus body. Furthermore, a stealth virus tries to hide from everything, not just anti-virus software. Some examples of stealth techniques are below.109

A variation is a reverse stealth virus, which makes everything look infected - the damage is done by anti-virus software frantically (and erroneously) trying to disinfect.111

Stealth techniques overlap with techniques used by rootkits. Rootkits were originally toolkits for people who had broken into computers; they used these toolkits to hide their tracks and avoid detection.112 Malware now uses rootkits too: for example, the Ryknos Trojan horse tried to hide itself using a rootkit intended for digital-rights management.113

3.2.4 Oligomorphism

Assuming an encrypted virus' key is randomly changed with each new infection, the only unchanging part of the virus is the code in the decryptor loop. Anti-virus software will exploit this fact for detection, so the next logical development is to change the decryptor loop's code with each infection.

An oligomorphic virus, or semi-polymorphic virus, is an encrypted virus which has a small, finite number of different decryptor loops at its disposal. The virus selects a new decryptor loop from this pool for each new infection. For example, Whale had 30 different decryptor variants, and Memorial had 96 decryptors.114

In terms of detection, oligomorphism only makes a virus marginally harder to spot. Instead of looking for one decryptor loop for the virus, anti-virus software can simply have all of the virus' possible decryptor loops enumerated, and look for them all.
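
Enumerating the decryptor loops amounts to searching for a small set of byte patterns instead of one. A minimal sketch of that idea follows; the pattern names and byte values are hypothetical placeholders, not signatures of any real virus, and a production scanner would use a multi-pattern search algorithm rather than one pass per pattern.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder patterns. Real entries would be byte sequences
# extracted from each known decryptor variant of a given virus.
DECRYPTOR_PATTERNS: Dict[str, bytes] = {
    "variant-1": bytes.fromhex("deadbeef01"),
    "variant-2": bytes.fromhex("deadbeef02"),
    "variant-3": bytes.fromhex("cafebabe03"),
}

def find_decryptors(data: bytes,
                    patterns: Dict[str, bytes] = DECRYPTOR_PATTERNS
                    ) -> List[Tuple[str, int]]:
    """Return (variant name, offset) for every known pattern found in data."""
    hits = []
    for name, pattern in patterns.items():
        offset = data.find(pattern)
        if offset != -1:
            hits.append((name, offset))
    return hits

suspect = b"\x00" * 10 + bytes.fromhex("deadbeef02") + b"\x00" * 4
print(find_decryptors(suspect))
```

The cost grows only linearly with the number of variants, so a pool of 30 or 96 decryptors is no real obstacle.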

3.2.5 Polymorphism

A polymorphic virus is superficially the same as an oligomorphic virus. Both are encrypted viruses, and both change their decryptor loop on each infection.115 However, a polymorphic virus has, for all practical purposes, an infinite number of decryptor loop variations. Tremor, for example, has almost six billion possible decryptor loops!116 Polymorphic viruses clearly can't be detected by listing all the possible combinations.

There are two questions that arise with respect to polymorphic viruses. First, how can a virus detect that it has previously infected a file, if its presence is hidden sufficiently well? Second, how does the virus change its decryptor loop from infection to infection?

3.2.5.1 Self-Detection

At first glance, it might seem easy for a polymorphic virus to detect if it has previously infected some code - when the virus morphs for a new infection, it can also change whatever aspect of itself that it looks for. This doesn't work, though, because a virus must be able to recognize infection by any of its practically-infinite forms. This means that the infection detection mechanism must be independent of the exact code used by the virus:

Figure 3.6. Fun with NTFS alternate data streams

File timestamp.
A virus could change the timestamp of an infected file, so that the sum of its time and date is some constant value K for all infections.117 A lot of software only displays the last two digits of the year, so an infected file's year could be increased by 100 without attracting attention.118
File size.
An infected file could have its size padded out to some meaningful size, such as a multiple of 1234.11
Data hiding.
In complex executable file formats, like ELF, not all parts of the file's information may be used by a system. A virus can hide a flag in unused areas, or look for an unusual combination of attributes that it has set in the file. For example, Zperm looks for the character "Z" as the minor linker version in an executable's file header on Windows.119
Filesystem features.
Some filesystems allow files to be tagged with arbitrary attributes, whose existence is not always made obvious. These can be used by a virus to store code, data, or flags which indicate that a file has been infected. Figure 3.6 shows such "alternate data streams" being used in an NTFS filesystem to attach a flag to a file; the presence of this flag doesn't show up in directory listings, the file size, or in the graphical filesystem browser.12
External storage.
The indication that a file is infected need not be directly associated with the file itself. For example, a virus could use a hash function to map an infected file's name into an obfuscated string, and use that string to create a key in the Windows Registry. The virus could then use the existence of that key as an infection indicator. Even if the Registry key was discovered, it wouldn't immediately reveal the name of the infected file (especially if a strong cryptographic hash function was used).

Note that none of these mechanisms need to work perfectly, because a false positive only means that the virus won't infect some code that it might have otherwise. Also, since all these infection-detection methods work for polymorphic viruses, they work for the more specific case of non-polymorphic viruses too. Viruses which retain some constancy can just look for one or two bytes of their own code,120 rather than resorting to more elaborate methods.
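
The tolerance for false positives can be made concrete with the file-size indicator. The sketch below is illustrative arithmetic only: the multiple 1234 is the one from the example above, the function names are hypothetical, and file sizes are assumed to be spread evenly over the range tested.

```python
MARKER = 1234  # the "meaningful size" multiple from the example above

def looks_marked(size: int) -> bool:
    """True if a file size carries the indicator (a multiple of MARKER)."""
    return size % MARKER == 0

def padded_size(size: int) -> int:
    """Smallest multiple of MARKER that is greater than or equal to size."""
    return -(-size // MARKER) * MARKER

# How many sizes from 1 byte to 1,000,000 bytes match purely by coincidence?
coincidences = sum(looks_marked(s) for s in range(1, 1_000_001))
rate = coincidences / 1_000_000

print(padded_size(5000), coincidences, rate)
```

Roughly one uninfected file in 1234 would be mistaken for an infected one and skipped, a loss that costs the virus almost nothing.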

It was once suggested that systems could be inoculated against specific viruses by faking the virus' self-detection indicator on an uninfected system.121 Unfortunately, there are too many viruses now to make this feasible.

3.2.5.2 Changing the Decryptor Loop

The code in a polymorphic virus is transformed for each fresh infection using a mutation engine.122 The mutation engine has a grab-bag of code transformation tricks which take as input one sequence of code and output another, equivalent, sequence of code. The engine can choose which technique to apply, and where to apply it, using a pseudo-random number generator.123 The result is an engine which is extensible and which can permute code in a large number of ways. Some sample transformations are shown below.124

Instruction equivalence.
Especially on CISC architectures like the Intel x86, there are often many single instructions which have the same effect. All these instructions would set register r1 to zero:
	
	clear r1
	xor r1,r1
	and 0,r1
	move 0,r1
Instruction sequence equivalence.
Instruction equivalence can be generalized to sequences of instructions. While single-instruction equivalence is at the mercy of the CPU's instruction set, instruction sequence equivalence is more portable, and applies to both high-level and low-level languages:
	x = 1    <=>	y = 21
			x = y - 20
Instruction reordering.
Instructions may have their order changed, so long as constraints imposed by inter-instruction dependencies are observed.
	r1 = 12			r2 = r3 + r2
	r2 = r3 + r2     <=>	r1 = 12
	r4 = r1 + r2		r4 = r1 + r2

Here, the calculation of r4 depends on the values of r1 and r2, but the assignments to r1 and r2 are independent of one another and may be done in any order.

Instruction reordering is well-studied, because it is an application of the instruction scheduling done by optimizing compilers to increase instruction-level parallelism.

Register renaming.
A minor, but significant, change can be introduced simply by changing the registers that instructions use. While this makes no difference from a high-level perspective, such as a human reading the code, renaming changes the bit patterns that encode the instructions; this complicates matters for anti-virus software looking for the virus' instructions. For example:
	
	r1 = 12			r3 = 12
	r2 = 34		<=>	r1 = 34
	r3 = r1 + r2		r2 = r3 + r1

The concept of register renaming naturally extends to variable renaming in higher-level languages, such as those a macro virus might employ.

Reordering data.
Changing the locations of data in memory alters instruction encodings in much the same way as register renaming. This would not necessarily have a corresponding transformation in a high-level language, as the variable names themselves would not be changed, just their order.
Making spaghetti.
Although some programmers are naturally gifted when it comes to producing "spaghetti code," others are not as fortunate. Happily, code can be automatically transformed so that formerly-consecutive instructions are scattered, and linked together by unconditional jumps:
start:				L1:
	r1 = 12				r2 = 34
	r2 = 34		=>		goto L2
	r3 = r1 + r2		start:
					r1 = 12
					goto L1
				L2:
					r3 = r1 + r2

The instructions executed, and their execution order, are the same in both pieces of code.

Inserting junk code.
"Junk" computations can be inserted which are inert with respect to the original code - in other words, running the junk code doesn't affect what the original code does. Two examples of adding junk code are below:
	r1 = 12			r1 = 12			r5 = 42
	inc r1		<=	r2 = 34		=>	r1 = 12
	inc r1			r3 = r1 + r2	     X:
	r1 = r1 - 2					r2 = 34
	r2 = 34						dec r5
	r3 = r1 + r2					bne X
							r3 = r1 + r2

The code on the left shows the difference between inserting junk code and using instruction sequence equivalence: with junk code, the original code isn't changed. The one on the right inserts a loop as junk code.

Run-time code generation.
One way to transform the code is to not have all of it present until it runs. Either fresh code can be generated at run time, or existing code can be modified.
r1 = 12			r1 = 12
r2 = 34		=>	r2 = 34
r3 = r1 + r2		generate r3 = r1 + r2
			call generated_code
Interpretive dance.
The way code is executed can be changed, from being directly executed to being interpreted by some application-specific virtual machine.125 A "classical" interpreter for such virtual machine code mimics the operation of a real CPU as it fetches, decodes, and executes instructions. In the example below, two of the real instructions are assigned different virtual machine opcodes. Another opcode forces the interpreter loop to exit, demonstrating the mixing of interpreted and real code. In the interpreter, the variable ipc is the interpreter's program counter, and controls the instruction fetched and executed from the CODE array.
r1 = 12			ipc = 0
r2 = 34		=>	loop:
r3 = r1 + r2			switch CODE[ipc]:
					case 0:
						exit loop
					case 1:
						r2 = 34
					case 2:
						r1 = 12
				inc ipc
			r3 = r1 + r2
			...
			CODE:
				2
				1
				0

This transformation can be repeated multiple times, giving multiple levels of interpreters.

Concurrency.
The original code can be separated into multiple threads of execution, which not only transforms the code, but can greatly complicate automatic analysis:13
r1 = 12			start thread T
r2 = 34		=>	r1 = 12
r3 = r1 + r2		wait for signal
			r3 = r1 + r2
				...
			T:
				r2 = 34
				send signal
				exit thread T
Inlining and outlining.
Code inlining is a technique normally employed to avoid subroutine call overhead,14 that replaces a subroutine call with the subroutine's code:
	...			...
	call S1			r1 = 12
	call S2			r2 = r3 + r2
	...		=>	r4 = r1 + r2
S1:
	r1 = 12			r1 = 12
	r2 = r3 + r2		r2 = 34
	r4 = r1 + r2		r3 = r1 + r2
	return			...
S2:
	r1 = 12
	r2 = 34
	r3 = r1 + r2
	return

Outlining is the reverse operation; it need not preserve any logical code grouping, however:

	...			...
	r1 = 12			r1 = 12
	r2 = r3 + r2		r2 = r3 + r2
	r4 = r1 + r2		call S12
			=>	r3 = r1 + r2
	r1 = 12			...
	r2 = 34			S12:
	r3 = r1 + r2			r4 = r1 + r2
					r1 = 12
					r2 = 34
					return

Another option is to convert the code into threaded code, which has nothing to do with threads used for concurrent programming, despite the name. Threaded code is normally used as an alternative way to implement programming language interpreters.126 Subroutines in threaded code don't return to the place from which they were invoked, but instead directly jump to the next subroutine; the threaded code itself is simply an array of code addresses:

	...			...
	r1 = 12			next = &CODE
	r2 = r3 + r2		goto [next]
	r4 = r1 + r2		CODE:
			=>		&I1
	r1 = 12				&I2
	r2 = 34				&X
	r3 = r1 + r2		X:
	...				r1 = 12
					r2 = 34
					r3 = r1 + r2
					...
				I1:
					r1 = 12
					inc next
					goto [next]
				I2:
					r2 = r3 + r2
					r4 = r1 + r2
					inc next
					goto [next]
Subroutine interleaving.
Inlining and outlining transformations maintain the original code, but rebundle it in different ways. Code can also be transformed by combining independent subroutines together, as in the following example.
	...			...
	call S1			call S12
	call S2			...
	...		=>	S12:
S1:					r5 = 12
	r1 = 12				r1 = 12
	r2 = r3 + r2			r6 = r3 + r2
	r4 = r1 + r2			r2 = 34
	return				r4 = r5 + r6
S2:					r3 = r1 + r2
	r1 = 12				return
	r2 = 34
	r3 = r1 + r2
	return

The code from S1 has had some registers renamed to avoid collisions with registers used by S2. The overall effect in the interleaved subroutine is the same as the original code in terms of the values computed.
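
Claims that two code sequences are equivalent can be spot-checked mechanically. The sketch below is a toy register machine in Python, written only for this purpose: the instruction tuples, `run`, and `same_effect` are hypothetical names, only four registers are modeled, and the check tests a handful of starting states rather than proving equivalence. The programs are the reordering, register renaming, and junk code examples from above.

```python
import random

REGS = ["r1", "r2", "r3", "r4"]

def run(program, regs):
    """Execute a list of instruction tuples on a copy of the register file."""
    regs = dict(regs)
    for op, dst, *args in program:
        if op == "set":        # dst = constant
            regs[dst] = args[0]
        elif op == "add":      # dst = src1 + src2
            regs[dst] = regs[args[0]] + regs[args[1]]
        elif op == "inc":      # dst = dst + 1
            regs[dst] += 1
        elif op == "subc":     # dst = dst - constant
            regs[dst] -= args[0]
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs

def same_effect(p, q, rename=None, trials=20, seed=0):
    """Test (not prove) that p and q leave matching values in every register.

    rename maps a register of p to the register that plays its role in q.
    """
    rename = rename or {}
    rng = random.Random(seed)
    starts = [{r: 0 for r in REGS}]
    starts += [{r: rng.randrange(1000) for r in REGS} for _ in range(trials)]
    for start in starts:
        start_q = {rename.get(r, r): v for r, v in start.items()}
        a, b = run(p, start), run(q, start_q)
        if any(a[r] != b[rename.get(r, r)] for r in REGS):
            return False
    return True

# Instruction reordering: the first two assignments are independent.
ORIGINAL = [("set", "r1", 12), ("add", "r2", "r3", "r2"), ("add", "r4", "r1", "r2")]
REORDERED = [("add", "r2", "r3", "r2"), ("set", "r1", 12), ("add", "r4", "r1", "r2")]
# Violates the dependency of r4 on r1 and r2.
BAD_ORDER = [("add", "r4", "r1", "r2"), ("set", "r1", 12), ("add", "r2", "r3", "r2")]

# Register renaming and junk insertion.
SUM = [("set", "r1", 12), ("set", "r2", 34), ("add", "r3", "r1", "r2")]
RENAMED = [("set", "r3", 12), ("set", "r1", 34), ("add", "r2", "r3", "r1")]
RENAME_MAP = {"r1": "r3", "r2": "r1", "r3": "r2"}
WITH_JUNK = [("set", "r1", 12), ("inc", "r1"), ("inc", "r1"), ("subc", "r1", 2),
             ("set", "r2", 34), ("add", "r3", "r1", "r2")]
```

For instance, `same_effect(ORIGINAL, REORDERED)` holds while `same_effect(ORIGINAL, BAD_ORDER)` does not, and `same_effect(SUM, RENAMED, rename=RENAME_MAP)` holds only once the register mapping is supplied.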

A number of these transformations are also used in the (legitimate) field of code obfuscation; code obfuscation research is used to try and prevent reverse engineering. There are also many, many elaborate code transformations performed by optimizing compilers. Not all compiler techniques and code obfuscation techniques have yet been used by virus writers.

Instead of supplying transformations for the mutation engine to pick from, a virus writer may create a mutation engine that will automatically produce a distinct, equivalent decryptor loop. In compilers, automatically searching for a code sequence is referred to as superoptimization, and the search may be implemented in a variety of ways: brute-force, automated theorem proving, or any technique for searching a large search space.127 Zellome, for example, uses a genetic algorithm in its mutation engine.128 Such a search makes enormous computational demands, although a clever algorithm may avoid generating too much illegal code and thus improve search time.15 For now, this mutation method is a curiosity only.
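
The brute-force flavor of superoptimization can be shown at toy scale, in the compiler sense of the term. The sketch below searches every sequence of up to two operations drawn from a tiny, made-up instruction set on 8-bit values, and keeps those that agree with a target operation on all 256 inputs. The operation set, `run_seq`, and `equivalents` are assumptions chosen for illustration; a real instruction set makes the search space vastly larger.

```python
from itertools import product

# A tiny, made-up instruction set acting on one 8-bit value.
OPS = {
    "inc": lambda x: (x + 1) & 0xFF,
    "dec": lambda x: (x - 1) & 0xFF,
    "neg": lambda x: (-x) & 0xFF,
    "not": lambda x: x ^ 0xFF,
}

def run_seq(seq, x):
    """Apply the named operations to x, in order."""
    for name in seq:
        x = OPS[name](x)
    return x

def equivalents(target, max_len=2):
    """All sequences of at most max_len operations that match target
    on every 8-bit input (exhaustive check, so no false matches)."""
    found = []
    for n in range(1, max_len + 1):
        for seq in product(OPS, repeat=n):
            if all(run_seq(seq, x) == target(x) for x in range(256)):
                found.append(seq)
    return found

print(equivalents(OPS["neg"]))
# [('neg',), ('dec', 'not'), ('not', 'inc')]
```

Even here the number of candidates grows as 4^n with sequence length n, which hints at why the computational demands become enormous for realistic code.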

3.2.6 Metamorphism

'Viruses aim to keep their size as small as possible and it is impractical to make the main virus body polymorphic'

- Tarkan Yetiser129


Metamorphic viruses are viruses that are polymorphic in the virus body.130 They aren't encrypted, and thus need no decryptor loop, but avoid detection by changing: a new version of the virus body is produced for each new infection.

The code-modifying techniques used by polymorphic viruses all apply to metamorphic viruses. Both employ a mutation engine, except a polymorphic virus need not change its engine on each infection, because it can reside in the encrypted part of the virus. In contrast, a metamorphic virus' mutation engine has to morph itself anew for each infection.

Some metamorphic viruses are very elaborate. Simile's mutation engine, about 12,000 lines of assembly code, translates Simile from machine code to a machine-independent intermediate code. Operating on the intermediate code, the mutation engine undoes old obfuscations, applies new transformations, and generates fresh machine code.131 Metamorphic mutation engines whose input and output are machine code must be able to disassemble and reassemble machine code.16

Metamorphism is relatively straightforward to implement in viruses that spread in source code form, such as macro viruses. A virus may rely on system tools for metamorphism, too. Apparition, for instance, is written in Pascal17 and carries its own source code; if a compiler is found on an infected system, the virus inserts junk code into its source and recompiles itself.

While polymorphic and metamorphic viruses are decidedly nontrivial to detect by anti-virus software, they are also hard for a virus writer to implement correctly - the numbers of these viruses are small in comparison to other types.

3.2.7 Strong Encryption

The encryption methods discussed so far result in viruses that, once captured, are susceptible to analysis. The major problem is not the encryption method, because that can always be strengthened; the major problem is that viruses carry their decryption keys with them.132

This might seem a necessary weakness, because if a virus doesn't have its key, it can't decrypt and run its code. There are, however, two other possibilities.

  1. The key comes from outside an infected system:
    • A virus can retrieve the key from a web site, but that would mean that the virus would then have to carry the web site's address with it, which could be blocked as a countermeasure. To avoid knowing a specific web site's name, a virus could use a web search engine to get the key instead.

      Generally, any electronic data stream that a virus can monitor would be usable for key delivery, especially ones with high volumes of traffic that are unlikely to be blocked: email messages, Usenet postings, instant messaging, IRC, file-sharing networks.

    • A binary virus is one where the virus is in two parts, and doesn't become virulent until both pieces are present on a system.133 There have only been a few binary viruses, such as Dichotomy and RMNS.18

      One manifestation of binary viruses would be where virus V_1 has strongly-encrypted code, and virus V_2 has its key. But this scheme is unlikely to work well in practice. If V_1 and V_2 travel together, then both will bear the same risk of capture and analysis, defeating the purpose of separating the encryption key. If V_1 and V_2 spread separately (e.g., V_2 is released a month after V_1, and uses a different infection vector) then their spread would be independent.

      Now, say that P_1 is the probability of V_1 reaching a given machine, and P_2 is that probability for V_2. With an independent spread, the probability of them both finding the same machine and becoming virulent is P_1 × P_2, i.e., smaller.19

  2. The key comes from inside an infected system. Using environmental key generation, the decryption key is constructed of elements already present in the target's environment, like:
    • the machine's domain name;
    • the time or date;
    • some data in the system (e.g., file contents);
    • the current user name;
    • the interface's language setting (e.g., Chinese, Hebrew).

    This makes it very easy to target viruses to particular individuals or groups. A target doesn't even know that they possess the key!

    Combined with strong encryption, environmental key generation would render a virus unanalyzable even if captured. To fully analyze an encrypted virus, it has to be decrypted, and while the elements comprising the key may be discovered, the exact value of the key will not.20 In this case, the only real hope of decryption lies in a poor choice of key. A poorly-chosen key with a relatively small range of possible values (e.g., the language setting) would be susceptible to a brute-force attack.

    How can the virus know that its decryption was successful? It doesn't. While the virus could carry a checksum with it to verify that the decryption worked,21 that might give away information to an analyst. An alternative method is to catch exceptions that invalid code may cause, then try to run the decrypted "code" and see if it works.
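
The brute-force weakness of a small key range can be seen from the analyst's side with a toy example. Everything in this sketch is an assumption made for illustration: the key is derived from a language-setting string, a SHA-256-based XOR keystream stands in for real strong encryption, the protected data is a harmless text message, and the analyst recognizes a correct guess because the result is printable text (real analysis of code would need a different plausibility test).

```python
import hashlib

def keystream(env_value: str, n: int) -> bytes:
    """Toy keystream derived from an environment value (not a real cipher)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(f"{env_value}:{counter}".encode()).digest()
        counter += 1
    return out[:n]

def xor_with_env_key(data: bytes, env_value: str) -> bytes:
    """Encrypts or decrypts: XOR with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(env_value, len(data))))

def looks_like_text(data: bytes) -> bool:
    """Assumed plausibility test: every byte is printable ASCII."""
    return all(32 <= b < 127 for b in data)

def brute_force(blob: bytes, candidates):
    """Try every candidate key source and keep those giving plausible output."""
    return [c for c in candidates if looks_like_text(xor_with_env_key(blob, c))]

# Hypothetical short list of language settings: the whole key range.
CANDIDATE_LANGUAGES = ["en", "fr", "de", "he", "zh", "ja", "es", "ru"]

blob = xor_with_env_key(b"harmless demonstration message", "he")
print(brute_force(blob, CANDIDATE_LANGUAGES))
```

Eight guesses exhaust the key range here, however strong the cipher; a key built from a machine's domain name or particular file contents would offer no such shortcut.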

3.3 Virus Kits

Humans love their tools, and it's not surprising that a variety of tools exists for writing viruses. A virus kit is a program which automatically produces all or part of a virus' code.134 They have different interfaces, from command-line tools to menu-based tools to full-blown graphical user interfaces. Figures 3.7 and 3.8 show two versions of a GUI-based virus kit.22

Programming libraries are available, too, such as add-on mutation engines which will turn any virus into a polymorphic virus. In an Orwellian twist, though, success is failure. The more popular a virus kit or library, the greater the chance