Winfried Gleissner
Computers & Security, 8, 1989, pp.35-41
ISSN 0167-4048
February 1989
Download PDF (400.13Kb) (You need to be registered on forum)A model is introduced to treat the spread of computer viruses mathematically. A recurrence formula is given which allows a closed expression to be derived for the probability that, starting from an initial state, a given viral state will be reached after executing exactly k programs. In some special cases this recurrence formula can be used for numeric computations, It is shown that the infection process does not stop before all programs are infected, which are visible for any infected program in the initial state.
Keywords: Computer viruses, Mathematical model of virus infection
0167-4048/89/$3.50 © 1989, Elsevier Science Publishers Ltd.
A computer virus is a program that can reproduce itself and modify other programs by including a possibly evolved copy of itself [1]. That means that whenever an infected program is called, the virus is implanted into another program, if there is any, of the account from where it was called. In refs 1 and 2 some experiments with this kind of program are described. The result was that within a few hours the whole computer system was infected. The aim of this paper is to develop a closed expression for the probability that any given program is infected after a given time. To achieve this aim it must be known which programs were initially infected and for each program one needs the probability that it is called. The use of a computer is regarded as a sequence of program calls.
For the purpose of the present paper there is no difference between the call of a system command, a standard program (editor, linkage editor, compiler), and a user-written program.
In the computer system there are
accounts. The
programs in account
are denoted by
. Let
denote the probability that user
calls the program
in the account
. This gives

Let
denote the set of all
-vectors with integer components, which is defined as follows:

For
the computer system is said to be in viral state
, if in account
there are
infected programs. One can further assume that these are the first
programs according to the alphabetic order in which the programs are written down. For convenience the viral state in which all programs are infected is denoted by
, i.e.:

On
one defines a modulus by

and an order by

Given the viral degree
, one needs the probability
that an infected program will be called from account 

The probability
that any not infected program is executed is calculated as

For purposes which will become clear later, one defines for
and for 

Let
denote the probability that starting from viral state
viral stae
will be reached after exactly the
program calls
The
can be calcualted recursively as follows:

For
denotes the vector whose
th component is reduced by unity

Using this notation one obtains


The following recursion formula holds for the
.
Lemma 1:

Proof: The assertion is proven by induction on
For 

and as 

Induction from
to 

This ends the proof of the lemma.
For
one defines

Let
denote the set of all sequences of length
of vectors
with

with

if
and
differ in the
th component then

For an element
one defines the Vandermond determinant
by

(For more details see ref [3] p.179.)
If the sequence
is obtained from the sequence
by deleting the last element, the following recursion formula holds:

For
one obtains the determinant
deleting the last line and the
th column of
. The following determinant
is derived from the Vandermond form substituting the exponent
for exponent
in the last line:

and if 

The
can be calculated using the
by the following lemma:
Lemma 2:

Proof: Adding the first line times
to the negative of the last line one obtains

Now the following theorem can be proven without undue effort:
Theorem:

Proof: The assertion is proven by induction on the distance
of
and
. For 

yields

Induction from
to
By Lemma 1

With Lemma 2 one infers

and with this and eqn (2)

This ends the proof of the theorem.
The conclusions will be stated in the form of three remarks.
Remark 1: Accounts that do not call any infected programs cannot be infected. If for
there exists an
such that
then for
with 

Proof: As
there must be an infection in the account of the
th user, i.e. for every
there must be an index
such that
. As

For
one defines

Remark 2: If
is the initial viral state, the infection process does not stop before the viral state
is reached. One has to show that for
with 

Proof: As
for all
. If all
are distinct

If some
are equal, once can choose sequences
with

Using de l'Hopital's rule one calculates

and hence

Remark 3: The maximum possible viral degree
is actually reached in the limit

Proof: The case where some
are equal can be treated as the limit of the case where all
are distinct. Hence only this case will be considered. The assertion shall be proven by induction on the distance
between
and
. For
and

Hence

For the induction
denotes the viral degree which is obtained from
by adding 1 to the
th component. The distance between
and
is assumed to be
. Then

Fig.1 Infection process with the probability model
Number of infection paths: Using the notation introduced above one can calculate the number
of possible infection paths leading from an initial viral state
to the final state
.
Theorem: The following formula holds for the number of infection paths:

Proof: Whenever a program is infected one writes down the number of the account in which it resides. Thus the infection process is represented as a finite sequence of integers with elements between 1 and
. The length of the sequence is

There are exactly
entries with value
in this sequence for
. This gives for 

The rest follows by induction on
.
Fig. 2 Infection process of a real system
This shows that it is not feasible to do calculations for more than once account and a sufficiently large number of programs.
One account with
programs: For convenience the assumption is made that all programs are called with equal frequency
. The computation proceeds as follows. Initially only one program is infected. After each call of a program the expectation for the number of infected programs is calculated. As soon as it exceeds the number of initially infected programs by more than one, the process is started anew but with one more program, which is initially infected.
On the
axis the number of runs of a program is plotted and on the
axis the percentage of infected programs. Supposing that there are 80 programs of which only one is infected in the beginning the model shows that all of them are infected after 378 calls (Fig. 1).
This is contrasted with an infection process on a real system, where the number of the program to be called was determined by a random number generator. Figure 2 shows a typical infection process on a real system. The number of program calls, after which the whole account was infected, lay in the range from 230 and 700. The nature of the curve shows an exponential growth of the infection process for the real system as well as for the probability model. Assuming an average number of ten runs per hour this shows that after about 40 hours all programs are infected. The computations show that
programs will be infected after approximately
runs.
Takeover of users: The model can be used to say something when all users are infected instead of the takeover of all programs. This problem can be treated with similar arguments as above. Assuming that each account has only one program or that the virus infectes all programs of the account from which it was called, this case is reduced to the situation discussed earlier.
Dr. W. Gleissner received a Masters degree in numerical analysis from Oxford University in 1972 and a Diploma in mathematics from the Technical University of Munich in 1973. His main fields of interest were interval arithmetic and approximation theory but later changed to pure mathematics and representation theory of infinite groups. He received a doctoral degree in 1978 from Technical University of Munich. He then joined the Munich Reinsurance Co. as a project leader for the computer-based administration of the life reinsurance business for client insurance companies from all over the world. Later he was in charge of introducing workstations and in 1985 he became responsible for guideline development and programming methodology in an institution produsing software for the administration of the Federal Republic of Germany. His main fields of interest now are computer viruses and security-related topics in Unix. He was published more than a dozen papers about mathematical models in economics (Industrieanlager-Betriebgesellschaft mbH, Einsteunstrasse 20, D-8012 Ottobrunn, F.R.G.)