Technology
1. Problem Description and Overview
When building systems for decision support we are very often faced with
(1) uncertain and incomplete information
(e.g. 'most birds fly', or it may be unknown whether a certain object X has property Y),
(2) heterogeneous information sources
(e.g. frequency data extractable from a database as well as rule knowledge obtainable from experts),
(3) resource constraints, e.g. time and cost
(e.g. 'a decision has to be taken within 30 minutes' or 'the cheapest solution has to be found').
Of course, in some simple cases of decision support either straightforward procedures are available or even ad hoc solutions will do. However, the increasing complexity of decision support problems entails an increasing demand for more general and more powerful solutions which are justified by application-independent, intelligible theoretical foundations and which also allow for efficient implementation. The problems mentioned above have therefore been the subject of intensive research.
In order to cope with point (1) above we pursue an approach that combines Probability Theory (ProbTh) (restricted to finite domains) with the Principle of Maximum Entropy Completion (ME) on the basis of Linear Probabilistic Constraints (LPC). We will argue in the following that our approach is the result of a rather compelling analysis of the above-mentioned challenges for decision support systems. Moreover, our approach does not labour under the shortcomings of most of its competitors, such as decision trees, score systems, fuzzy logic, Bayesian networks, neural nets, nonmonotonic logics and others.
Apart from interviewing experts we use our Entropy based Data Mining Methods (EDM) to enlarge the scope of accessible knowledge
(point 2).
Point (3) is taken into account by our use of Cost Matrices (CM) which translate probabilities into the financial consequences
of decisions and by our efficient implementation which yields decisions within a few seconds.
To allow user-oriented knowledge representation, an expressive knowledge representation language (HPL) was developed on top of Linear Probabilistic Constraints for the easy specification of the available information. To increase the scope of feasible problems, methods for anticipating some of the effects of ME completion were developed, allowing the completion computation to be modularized. In addition, special data structures were designed for handling large probability distributions.
Finally, we compare our technology and system to Competing Approaches and Technologies.
2. Probability Theory, Maximum Entropy
and Linear Probabilistic Constraints
2.1. Probability Theory (ProbTh)
ProbTh is a fully developed and generally applicable tool for specifying uncertain and incomplete information, which makes it an ideal starting point.
- Natural Expression of Conditional Frequency Knowledge
It is easy for most of us to express common sense knowledge by a set of 'conditional frequency expressions' (e.g. 'If I think of children with symptom X, I expect them to suffer in many cases from illness Y') and to estimate the probability of some simple conclusions without difficulty (Gig96).
- Context Sensitivity
ProbTh is a language which allows different information to be expressed consistently in different contexts. If we specify some information in ProbTh by its most prominent form, a conditional probability, we have to specify both the 'context' for which the information holds and the information itself. In the expression 'P(b | a) = x' (or, in our notation, 'P(a > b) = x'), 'a' is the context for which the information 'P(b) = x' holds. Now 'c' may be 'true' (highly probable) in the context 'a' and in the context 'b', but 'false' (probability near 0) in the context 'a and b' or in the context 'a or b' (see the examples under Context Sensitivity). (An important application of this is found in Bayes Nets, which allow the specification of different kinds of dependencies for different contexts.)
- Missing Truth Functionality
But ProbTh offers even more degrees of freedom in specifying information: if we have various pieces of information in the same context, we may not know how they combine. Therefore our language has to allow different kinds of combination of evidence (a trivial experience in common sense, as two poisons may double their effect or may be the antidotes of each other; consequently, their combined effect cannot be determined in general). This means that, in general, we shall not be able to determine the judgment (here, the probability) of a compound sentence from those of its constituents. On the other hand, theories which do offer such a calculation will not be able to express different situations of dependency. Example: given the conditional probabilities 'P(a > b) = x' and 'P(a > c) = y', we cannot determine z in 'P(a > b * c) = z' from x and y. (See also the example Concerning Truth Functionality.) This precludes in advance all problems with monotonic conclusions which contradict reality, a serious trouble spot of rule-based systems which gave rise, for instance, to some nonmonotonic logics. (E.g. Lifschitz (Lif89) demands such a combination rule to hold for reasoning with nonmonotonic logics.) As shown in (Sch96), most of the nonmonotonic inheritance problems given in the benchmark collection (Lif89) can be solved with ProbTh alone, while further principles (like ME below) are necessary to solve all of them. (For more information about nonmonotonic logics see e.g. Mak94, Ant97 or DP96.)
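The impossibility of determining z from x and y alone can be checked numerically. The following is a small illustrative sketch with hypothetical numbers (not PIT code): two joint distributions that agree on P(a > b) and P(a > c) but disagree on P(a > b * c).

```python
# Two conditional joint distributions over b and c within the same context a.
# Both satisfy P(a > b) = 0.5 and P(a > c) = 0.5, yet they disagree on
# P(a > b * c): truth functionality fails for probabilities.
# Atom keys: 'bc' = b*c, 'bn' = b*-c, 'nc' = -b*c, 'nn' = -b*-c.
coupled = {"bc": 0.5, "bn": 0.0, "nc": 0.0, "nn": 0.5}      # b and c coincide
product = {"bc": 0.25, "bn": 0.25, "nc": 0.25, "nn": 0.25}  # b, c independent

def p_b(m):   return m["bc"] + m["bn"]
def p_c(m):   return m["bc"] + m["nc"]
def p_b_c(m): return m["bc"]

for m in (coupled, product):
    assert abs(p_b(m) - 0.5) < 1e-12 and abs(p_c(m) - 0.5) < 1e-12

print(p_b_c(coupled), p_b_c(product))   # → 0.5 0.25
```

Both models are legitimate given only the two conditional statements, so no combination rule can recover the probability of the conjunction.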
- Probability as Formalizing Belief
Expert knowledge consists mostly of inductive experience knowledge in which experts believe to various degrees. That ProbTh is an excellent means to model belief is justified in detail in Par94.
- Notion of Independence
ProbTh offers the notion of (conditional) independence, which has a clear information-theoretic meaning even without using numbers. It allows us to distinguish different kinds of information flow and prepares the ground for efficient methods of reasoning.
Traditional objections to the use of ProbTh can be overcome as follows:
- Objection: ProbTh is of very limited deductive power, and the set of conclusions that can be drawn is so restricted that ProbTh alone is too weak for supporting decisions.
Refutation: The deductive power of ProbTh can be increased sufficiently by adding the principle of Maximum Entropy (see below).
- Objection: Computations in ProbTh are infeasible, as they are exponential in the number of knowledge attributes (variables).
Refutation: This criticism was refuted by Lauritzen (Lau88), who demonstrated efficient probabilistic reasoning via the use of dependency networks. (See also 'Implementation' below on the way we compute ME.)
- Objection: Knowledge modelling is impossible in ProbTh, as even the expert does not know precise probabilities of his/her field ('where do the numbers come from?').
Refutation: This objection is no longer valid due to the introduction of probability intervals, which allow uncertain probabilities to be represented. The objection by Kahneman and Tversky (Kah72), (Tve74), that we are not able to reason with probability theory, has been corrected by Gigerenzer (Gig96), who showed that we are well able to reason in frequencies.
Allowing the use of probability intervals, representing uncertain probabilities, refutes the objection that ProbTh obliges us to overspecify our knowledge by requiring us to assign to our knowledge (exact) probability values for which we have no justification. (Gig96) has shown that ProbTh is well understood in terms of frequency expressions.
2.2. Maximum Entropy (ME)
In order to increase the deductive power of ProbTh to a degree which is sufficient for supporting decisions we add the (context sensitive) principle of Maximum Entropy, which reduces the remaining
degrees of freedom without introducing any additional bias.
Main arguments for ME are as follows:
- Consistent Extension of ProbTh / Model Selection
Since ME selects one of the models provided by ProbTh on the basis of the given knowledge, it is obviously consistent.
Of course, the choice of the ME model is context sensitive, i.e. if we change (enlarge) the set of constraints, we possibly receive a different ME model (for the relation to 'nonmonotonic reasoning' see SF97). It also preserves the absence of truth functionality.
- Extension of Indifference
If there is no reason to consider two events as different, we do not want to introduce such a difference artificially, and we assign equal probability to the two events. This option for indifference (including a precise definition of the cases where we have no reason to make a difference) is respected by ME and used in PIT (see Implementation).
- Extension of Independence
(Conditional) independencies between parts of knowledge are extremely helpful for inferencing, as they decrease context influence and thus increase the number of possible conclusions to be drawn. It is therefore of particular importance that ME does not introduce new dependencies into the specified knowledge, but assumes independence where no other information is available (or, formulated sloppily, ME maximizes independence, given a set of constraints).
The Principle of Independence makes precise which kinds of independencies ME assumes, given a set of linear constraints. A subset of these independencies can be found by drawing an undirected graph, where each node corresponds to a variable and each arc between a node A and a node B corresponds to a constraint c which mentions both A and B. If we interpret the resulting graph as an (in)dependence map (see e.g. Pea88), in other words, assuming the Markov property for the probability models given the graph G, we have assumed a set of independence statements to hold for our models. Therefore, using ME means learning the independencies from the constraints, in contrast to standard methods which assume certain independencies to hold independently of the constraints (e.g. the simpler scoring decision systems).
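The graph construction just described can be sketched in a few lines. The constraint contents and variable names below are hypothetical stand-ins, since only the sets of variables mentioned by each constraint matter here.

```python
from itertools import combinations

# Each constraint is represented only by the set of variables it mentions
# (hypothetical example constraints). An edge A-B is drawn whenever some
# constraint mentions both A and B; the resulting undirected graph is then
# read as an (in)dependence map (Markov property) in the sense of Pea88.
constraints = [
    {"Fever", "Leukocytes"},
    {"Leukocytes", "Appendicitis"},
    {"Age", "Appendicitis"},
]

edges = set()
for c in constraints:
    for a, b in combinations(sorted(c), 2):
        edges.add((a, b))

print(sorted(edges))
# 'Fever' and 'Age' share no constraint and thus no direct edge: separated
# by the intermediate variables, they are assumed conditionally independent.
```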
- Minimal Information Increase
Since maximum entropy is equivalent to minimal increase of information, the ME principle chooses the most uninformative, or most neutral, model. In other words, the given knowledge is supplemented by ME in the weakest possible way.
- Risk Minimization
As shown in (Gru00), the choice of the ME model is strongly related to minimizing risk in bets. Since betting is quite similar to taking a decision, rooting our decisions in the ME model means always taking the decision which is least risky.
- Concentration Phenomenon and the Wallis Derivation
A further interesting property of the ME distribution is that it is the centre of gravity of all possible models, i.e. almost all combinations of events that satisfy the given constraints have frequencies which are extremely close to the ME distribution. Let us take, for instance, a probability distribution describing the proportions of different kinds of balls in an urn with a fixed (but unknown) number N of balls. Consider all the 'urns' (i.e. combinations of balls of different colours) where the combination fulfils certain constraints (premises, axioms) and each combination is itself generated by a random process of colouring balls. Then the most probable combinations will 'concentrate' around the ME distribution, the more we increase N. (For the formal description of the Wallis derivation see for instance (Jay95) or (Sch01).)
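The urn picture can be made concrete with a small enumeration. This is a sketch under assumed constraints (70% red spots, 80% green spots): for fixed N, every admissible combination is weighted by the number of colourings that realize it (a multinomial coefficient), and the weight peaks at the ME frequencies.

```python
from math import factorial

def multinomial(parts):
    # number of ways to colour sum(parts) distinguishable balls into these groups
    r = factorial(sum(parts))
    for n in parts:
        r //= factorial(n)
    return r

N = 100                           # balls in the urn (a multiple of 10)
red, green = 7 * N // 10, 8 * N // 10
weight = {}
for k in range(N // 2, red + 1):  # k = balls carrying both kinds of spots
    # atom counts: both spots, red only, green only, no spots
    weight[k] = multinomial((k, red - k, green - k, k - N // 2))

mode = max(weight, key=weight.get)
print(mode / N)                   # → 0.56, the ME value 0.7 * 0.8
```

Increasing N makes the weight around the mode ever more dominant, which is exactly the concentration phenomenon described above.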
- Axiomatics of Information Measures
As we said above, ME chooses the most uninformative model and thus avoids the introduction of any additional bias. Trying to determine axiomatically a function which measures the amount of lack of information (uncertainty), an attempt which goes back to Shannon's Information Theory, it can be shown that the entropy function is the only one which satisfies these axioms, i.e. the only one which chooses the most uninformative model (see e.g. (Jay95)).
In Prob. Reasoning we give an example for reasoning with ME compared to (monotonic) probabilistic reasoning.
2.3. Linear Constraints
For practical reasons it is important to find a type of constraints which is easy to deal with and also general enough to state all relevant expert knowledge. The ideal choice is a consistent set (conjunction) of linear (probabilistic inequality) constraints, i.e. linear combinations of probabilities of elementary events. They have the following nice properties:
- Allowing only linear constraints assures the existence of a unique Maximum Entropy distribution (although more general constraints would suffice to assure this uniqueness).
A unique ME model implies that every (e.g. user defined) probabilistic statement will be true or false in our model. For instance, in a medical application this would mean, that for every
combination of values of possible symptoms of a patient (including the case of unknown values) the probability model yields a numerical value (i.e. not an interval) for the probability of a certain
illness.
- In addition, linear constraints are sufficiently general to serve as a kind of assembler language on which a very expressive high-level probabilistic knowledge representation can be conceived.
- Finally, instead of approximating the ME distribution by general tools of nonlinear optimisation, linear constraints have the computational advantage that the ME distribution can be efficiently computed (e.g. by directly calculating the Lagrange multipliers or by using the Newton approximation for them).
Moreover, if the set is not consistent, our algorithm tries to identify a (not necessarily unique) subset of inconsistent constraints and to inform the user how to 'repair' this set of constraints.
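To illustrate the kind of computation involved, here is a minimal iterative-scaling sketch (not the PIT implementation) for the urn constraints P(red) = 0.7 and P(green) = 0.8 over four elementary events; each step rescales the current model so that one linear constraint holds exactly, which corresponds to the one-step multiplier update mentioned above.

```python
# Elementary events: red/green spot combinations; start with the uniform model.
atoms = ["rg", "r-", "-g", "--"]
p = {a: 0.25 for a in atoms}
constraints = [({"rg", "r-"}, 0.7),            # P(red spots)   = 0.7
               ({"rg", "-g"}, 0.8)]            # P(green spots) = 0.8

for _ in range(50):                            # sweep until convergence
    for inside, target in constraints:
        cur = sum(p[a] for a in inside)
        for a in atoms:                        # rescale inside and outside mass
            p[a] *= target / cur if a in inside else (1 - target) / (1 - cur)

print({a: round(p[a], 2) for a in atoms})
# {'rg': 0.56, 'r-': 0.14, '-g': 0.24, '--': 0.06}: the independent product
```

For this pair of marginal constraints the procedure converges to the product distribution, matching the independence property of ME discussed above.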
3. Extensions
3.1. High-level Probabilistic Language (HPL)
To express one's knowledge in the form of linear constraints is not feasible for the average expert. Therefore an HPL was designed on their basis, which permits a user-friendly specification of the available information and which is automatically translated into linear constraints. The central feature of this HPL is statements of conditional probabilities ranging over intervals, which are very useful for expressing probabilistic if-then propositions. In particular, PIT is able to handle such an expressive language (with probability intervals).
Additionally, the special type of linear constraints generated from statements of our HPL has the computational advantage that the ME distribution can be calculated without search (by directly calculating the Lagrange multipliers).
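As a sketch of the translation from HPL statements down to the 'assembler' level (the helper name below is hypothetical, not the actual PIT translator): a conditional statement P(a > b) = x means P(a*b)/P(a) = x, which multiplies out to the linear constraint P(a*b) - x * P(a) = 0 over the elementary events of a and b.

```python
# Translate P(a > b) = x into a linear constraint sum_i c_i * v_i = 0
# over the elementary events ordered (a*b, a*-b, -a*b, -a*-b).
# P(a*b) - x*P(a) = (1-x)*v1 - x*v2 + 0*v3 + 0*v4.
def conditional_to_linear(x):
    return [1.0 - x, -x, 0.0, 0.0]

coeffs = conditional_to_linear(0.25)
model = [0.1, 0.3, 0.2, 0.4]          # a distribution with P(b | a) = 0.25
residual = sum(c * v for c, v in zip(coeffs, model))
print(coeffs, abs(residual) < 1e-12)  # the constraint is satisfied
```

Interval statements translate analogously into a pair of linear inequality constraints, one for each endpoint.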
Combining Knowledge from Different Sources
While developing our application LEXMED, we found it necessary to combine knowledge from different sources, like data, experts and literature, as in most cases
- experts are not able to specify hundreds of rules, including complex dependencies between 3 or 4 variables, as their time is strictly limited and their dependency knowledge is not precise enough for coping with this task; moreover, different experts often contradict each other (a problem underestimated when the first expert systems were built);
- available data alone are in most cases not representative for the actual purpose; and
- the literature often does not contain enough systematic information of the necessary precision.
As our HPL (or some sublanguage of it) is (close to) the output language of several data-mining tools (e.g. Bayesian Nets, with some restrictions, decision trees, some neural nets and, of course, our own data-mining tool) and, on the other hand, frequency statements are quite common in the literature (see e.g. Dom91), it is relatively easy to combine the knowledge of different sources.
This would not be so easy, or even possible, if we chose data as the basic information source. Since there are many algorithms for generating 'knowledge' from data (the topic of the large machine learning community), but (to our knowledge) no algorithms for generating justified data from 'knowledge', we have to start with the knowledge representation language and look for a data-mining component which produces knowledge in our representation language.
Furthermore, constraint-based systems offer a knowledge base which is easy to read, to understand, to check and to maintain. This is important, as we want to be informed about what the system knows and what it has learned. All this is done without making the retrieval of knowledge less efficient.
3.2. Datamining Tool
Given our background in probability theory, Maximum Entropy (and the variant of Minimum Cross Entropy) and the concepts of independence, it is only a small step to corresponding data-mining algorithms. These algorithms (a basic concept, the multi-information function, can be found in SV98) look for correlations between different attributes and construct a dependency graph, which represents the most important dependencies between the variables. This dependency graph then serves as the basis for determining the (probabilistic) 'rules' which are necessary to build an appropriate probabilistic model (which then represents the most important dependencies of the data). For a detailed description of this algorithm see ES99.
3.3. Cost Matrix
Basically, when asking a query, the answer of PIT will be a probability. But how are probabilities related to decisions? Surprisingly, the answer is not trivial: we have to respect different types of errors and their different costs.
In our application LEXMED it makes a difference whether we diagnose 'perforated appendicitis', where 'no appendicitis' would be correct, or we diagnose 'no appendicitis' where 'perforated
appendicitis' would be correct. The latter case is of course a much bigger mistake, or in other words, much more expensive in view of the followup costs.
Additionally, some decisions are optimal just for uncertain cases, while others are optimal if we are (nearly) certain (see the example under Cost Matrix Structure).
We therefore use a general solution, working for any vector of faults and any vector of possible decisions, with the aim of finding the decision which causes minimum overall cost. Including such a cost calculation in the decision process is very simple. Let c_{ij} be the cost incurred if the real fault is of class i but we decide for j. Given a matrix (c_{ij}) of such (mis)classification costs and a probability p_{i} for each real fault class i, the query evaluation of PIT computes the average (mis)classification cost AC_{j} as
AC_{j} = Sum_{i=1,...,n} c_{ij} * p_{i}
and then selects the decision j for which AC_{j} is minimal. This result therefore combines the uncertain knowledge (leading to probabilities for the different faults) with the different types of errors (by calculating the minimum average cost based on these probabilities).
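The calculation can be sketched as follows, with hypothetical cost numbers for the appendicitis example (two fault classes, two decisions):

```python
# costs[i][j]: cost if the real fault is class i but we decide j (hypothetical
# numbers). Class/decision 0 = appendicitis/operate, 1 = none/send home.
# Missing an appendicitis (costs[0][1]) is far more expensive than a
# needless operation (costs[1][0]).
costs = [[0.0, 8.0],
         [1.0, 0.0]]
p = [0.2, 0.8]          # fault probabilities delivered by the query evaluation

n = len(p)
ac = [sum(costs[i][j] * p[i] for i in range(n)) for j in range(n)]
decision = min(range(n), key=lambda j: ac[j])
print(ac, decision)     # operating (decision 0) has the lower average cost
```

Note that decision 0 wins here although 'no appendicitis' is the more probable fault, illustrating the point above that some decisions are optimal precisely under uncertainty.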
4. Implementational Achievements
Our implementation mainly exploits properties of ME completion like indifference and independence to avoid the exponential growth of the problem size with the number of variables and constraints. These principles are used to simplify the domain (without any influence on the outcome of the ME completion) and therefore facilitate the task of ME completion and augment the scope of feasible problems.
- Support of multi-valued (textual) variables.
Example: P([Bloodpressure = high] * [Fever = low]) = 0.5;
- Support of uncertain probabilities (interval probabilities).
Example: P([Bloodpressure = high] > [Fever = low]) = [0.2, 0.3];
- Internal use of the (weak and strong) principle of indifference.
Our implementation allows us to efficiently calculate and store more than a million equivalence classes per node, each of which may contain an unlimited number of indifferent elementary events.
- Using the Principle of Independence allows (in most cases) the calculation of the joint probability distribution to be split into a network of smaller distributions, related by conditional independence.
The resulting data structure is called a decomposition tree. For generating and working in such a tree see e.g. Tar85, Tru00.
- Direct calculation of the Lagrange factors for elements of our HPL.
In every iteration step of calculating the ME model, we have to adjust the current probability model to satisfy a current constraint. It is therefore of great importance how fast this update can be performed. For most elements of our HPL the necessary update is done in one step; for the remaining ones it is carried out by efficient search (Newton algorithm).
- Last, but not least, we have developed very efficient data structures and algorithms for a fast query response. In our medical application LEXMED, we have shown that query evaluation on a knowledge base of more than 500 conditional probabilities can be done in less than a second on a standard PC.
5. Competing Approaches and Technologies
Other technologies conceived for the same aims can be subdivided into two groups: technologies which are qualitatively simpler, and technologies which are qualitatively competitive but practically less handy. We will shortly state our position on these technologies.
- Score systems implicitly make strong independence assumptions between the variables. As a consequence, these systems cannot handle dependencies between different variables or different illnesses (as PIT is able to do). Technically, we can transform every score system into a system working with PIT (but not the other way round) (see Sch00).
- Decision trees also represent tools which learn from data and not from expert knowledge. Decision trees are not applicable if certain information necessary for branching is missing, and they also require the information in a fixed order.
- As the well-known neural networks are based on data, they do not offer a general solution for reading and combining knowledge from different sources, e.g. databases and human experts.
- Derivatives of classical logic (e.g. default logic), which work by chaining rules to infer new statements, do not fully support the necessary features of context sensitivity and missing truth functionality.
Examples of such systems and their problems can be found in Poo89, McD87.
- As fuzzy logic is truth functional (i.e. it calculates the values of complex expressions from those of their parts), it cannot handle specifications as described in Example 3). Consequently, since different knowledge about dependencies cannot be expressed using fuzzy logic, and since we expect this kind of knowledge to play an important role (as it can be expressed by frequency knowledge), we cannot use this logic for our goals.
- Bayesian Networks are an advanced way of representing and retrieving uncertain knowledge which makes extended use of probability theory. They may be based on data (and automatically learn the rules and the independence structure from the data) or based on expert rules.
- Bayesian Networks based on data
For several years a growing community has worked on the topic of learning Bayesian Networks from data. However, in many applications the necessary knowledge cannot be determined from data alone, and human expert knowledge must be taken into account as well.
- Bayesian Networks based on expert knowledge
If we use the Bayesian Network language for the representation of the experts' knowledge, the specification has to be complete, i.e. the expert has to fully specify the independence relationships and all the conditional probabilities of the structure, which is very likely more than the expert can provide. Here the language of PIT offers a more general solution, as it allows arbitrary sets of constraints (in its language) to be specified and does not demand the specification of any independence structure (see also Luk00).
No rapid prototyping:
When modelling expert knowledge with Bayesian Networks, such knowledge has to be structured manually, which is more time-consuming and costly than the automatic construction of a probability model by PIT.
No incremental update:
For the same reason, if new expert knowledge becomes available, the structure of a Bayesian Network must be manually re-elaborated, while with PIT we just input the new knowledge and the update is done automatically.
Graphical Overview of a Standard PIT Application
Examples
SubSections:
Probabilistic Reasoning : example of reasoning with MaxEnt compared with classical (monotonic) reasoning.
Properties: examples concerning context sensitivity and missing truth functionality in probability theory.
Cost Matrix: an example of translating probabilities into decisions (as used and described in the PIT application LEXMED).
Probabilistic Reasoning
Probabilities deliver a well-researched method of reasoning with uncertain knowledge. They form a unified language for expressing knowledge, which may be inductively generated from data or result from experts or literature. Using probabilities means expressing the given knowledge in the form of probabilistic statements or 'rules'. Of course, the user is more interested in conclusions based on this knowledge. A first attempt might be to use the well-known concepts of monotonic reasoning for conclusions, where the term monotonic describes that all conclusions remain valid given additional knowledge (rules).
Probabilistic Reasoning I (monotonic)
In this type of logic, the conclusions should be true, whenever the premises (i.e. the given knowledge) are true. We give an example of this type of probabilistic reasoning:
We may have specified the following database DB of rules :
About 70% of the balls in urn U have red spots.
About 80% of the balls in urn U have green spots.
Additionally we might have the following query, based on the knowledge in DB:
How many balls in urn U could we expect to have red and green spots?
Now it is easy to calculate that there is a minimum of 50% (as the overlap is 70% + 80% - 100% = 50%) and a maximum of 70% (if all balls with red spots also show green spots) of balls with spots in two colours. So our answer is:
We could expect between 50% and 70% of the balls to have red and green spots.
This answer is correct, and it will stay correct even under additional knowledge. But this answer is not useful in the general case. Imagine a few more premises and slightly more complex queries, and it is easy to believe that we will end up with results like
We could expect between 0% and 100% of the balls to have properties x, y and z.
which of course is true by definition and not worth mentioning.
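The interval computed above is an instance of the general (Fréchet) bounds for a conjunction; a short check with the urn numbers:

```python
# Monotonic bounds for P(x and y) given only P(x) and P(y):
# max(0, P(x)+P(y)-1) <= P(x and y) <= min(P(x), P(y)).
def conjunction_bounds(p, q):
    return max(0.0, p + q - 1.0), min(p, q)

lo, hi = conjunction_bounds(0.7, 0.8)   # red spots, green spots
print(lo, hi)                           # → 0.5 0.7
```

Chaining such bounds over several premises quickly widens the interval towards [0, 1], which is exactly the uselessness described above.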
Probabilistic Reasoning II (nonmonotonic)
Therefore we are not satisfied by monotonic reasoning and prefer a different type of logic. In this logic, our conclusions will not be true in all cases where the premises are true. They will be true only in MOST cases. (For a more precise meaning see the publications.)
We therefore obtain a nonmonotonic behaviour: given additional knowledge, we might change our opinion. (The classical example is that, knowing an object to be a bird, you will conclude: it flies; receiving the additional knowledge that the object looks like a penguin, you will conclude: it does not fly.) To sketch the idea for the example above, let us first look at the following table:
in Symbols                                   P-Model 1   P-Model 2   P-Model 3
P([balls = red sp. and green sp.])           0.5         0.56        0.7
P([balls = red sp. and no green sp.])        0.2         0.14        0.0
P([balls = no red sp. and green sp.])        0.3         0.24        0.1
P([balls = no red sp. and no green sp.])     0.0         0.06        0.2
The first column lists the set of our 4 elementary events (in brackets). A probability measure assigns nonnegative numbers to these events which sum up to one. Therefore the columns named P-Model describe (together with the first column) probability measures on our set of events. The columns are named P-Model (for probability model) because these probability measures additionally fulfil our premises: in every column we have 70% of balls with red spots and 80% of balls with green spots. In fact, there are infinitely many P-models which fulfil our premises (i.e. rules, or constraints). The method of Maximum Entropy now says:
- Measure the (information-theoretic, i.e. Shannon) entropy of each probability model.
- Choose the P-model with maximal entropy.
(With our form of constraints it is theoretically guaranteed that there is exactly one model with that property, provided the constraints are consistent.)
Its main justification is that, assuming a discrete number of balls in the urn, the P-models are not distributed uniformly. Rather, they are 'concentrated around' the P-model with highest entropy (for the precise meaning of 'concentrated around' see any corresponding textbook). Therefore the P-model with highest entropy is the best choice we can make if we are forced to make a decision.
In our example it is not difficult to measure the entropy: for every column we have to compute Sum_i ( - x_i * ln x_i ) (where the base of the logarithm is not important for the comparison). We receive
for the column P-Model 1: 0.347 + 0.322 + 0.361 + 0 = 1.030,
for the column P-Model 2: 0.325 + 0.275 + 0.343 + 0.169 = 1.112,
for the column P-Model 3: 0.250 + 0.0 + 0.230 + 0.322 = 0.802.
If we have to decide between these three probability models, the method of maximum entropy chooses P-Model 2. Furthermore, P-Model 2 is also the maximum entropy model of all (infinitely many) P-models of our two rules, and it therefore reveals one of the properties of the method: as this P-model is (given the constraints above) simply calculated by the assumption of independence (e.g. 0.56 = 0.7 * 0.8), the choice of maximum entropy is closely related to the use of independence assumptions. For further details we refer to the publications.
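The entropy comparison above can be replayed in a few lines (natural logarithm, with 0 * ln 0 taken as 0):

```python
from math import log

models = {
    "P-Model 1": [0.5, 0.2, 0.3, 0.0],
    "P-Model 2": [0.56, 0.14, 0.24, 0.06],
    "P-Model 3": [0.7, 0.0, 0.1, 0.2],
}

def entropy(dist):
    # Shannon entropy -sum x ln x, skipping zero entries (0 * ln 0 = 0)
    return -sum(x * log(x) for x in dist if x > 0)

for name, dist in models.items():
    print(name, round(entropy(dist), 3))

best = max(models, key=lambda n: entropy(models[n]))
print(best)   # → P-Model 2
```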
Problems of the implementation
It should now be clear that, besides some deductive elements in our approach (the infinite set of probability models fulfilling the constraints), we mainly use an inductive logic (using the principle of maximum entropy to choose the model with minimal information from the infinite set of probability models). The problem for an implementation of this approach is not to find the maximum entropy model in an infinite set (it can be calculated in a direct manner) but to keep the size of the representation small. Adding a two-valued variable, for example, doubles the number of elementary events which have to be distinguished (i.e. we have exponential complexity in the number of variables).
To solve this problem for our applications, we use properties of Maximum Entropy to simplify the representation of the problem (without changing the result) already at the beginning of our
calculations (see the use of the Principle of Independence and the use of the Principle of Indifference in our publications).
Examples of Context Sensitivity of Probability Theory
Example 1
The aim of example 1 is to demonstrate that we may have different pieces of information about the same proposition (piece of knowledge) in different contexts. We do so by proving that a given piece of probability information (C1), about a proposition c in context a, does not imply anything about the possible values q1 and q2 in the statements (Q1) and (Q2) about c in the extended contexts 'a and b' and 'a or b'.
(C1)  If I know a (and only a), then the probability of c is p1 in (0,1).

(Q1)  If I know a and b, what is the probability q1 of c?
(Q2)  If I know a or b, what is the probability q2 of c?
Formally:
Using '*' for 'and', '+' for 'or' and '>' for the conditional, i.e. P(a > b) is another notation for P(b | a) (:= P(a * b) / P(a)), we get:
(C1)  P(a > c) = p1;

(Q1)  P(a * b > c) = q1;
(Q2)  P(a + b > c) = q2;
In the following table the columns P-Model 1 and P-Model 2 describe two probability measures which both fulfil the constraint (C1). (In our notation, we use '¬' for 'not' and 'e' for a small positive number close to 0 (epsilon) which, in particular, is chosen smaller than p1.) The rows of the table correspond to the 8 elementary events a*b*c, ..., ¬a*¬b*¬c, whose probabilities are given either referred to as P(elementary event) in column 1 or by the probability vector (v1,...,v8) in column 2.
ElEvent         | ElVariable | Values in PModel 1 | Values in PModel 2
P( a *  b *  c) | v1         | 0                  | e*p1/(1-p1)
P( a *  b * ¬c) | v2         | e*(1-p1)/p1        | 0
P( a * ¬b *  c) | v3         | e*p1/p1            | 0
P( a * ¬b * ¬c) | v4         | 0                  | e*(1-p1)/(1-p1)
P(¬a *  b *  c) | v5         | 0                  | (1-p1-e)/(1-p1)
P(¬a *  b * ¬c) | v6         | (p1-e)/p1          | 0
P(¬a * ¬b *  c) | v7         | 0                  | 0
P(¬a * ¬b * ¬c) | v8         | 0                  | 0
PModel 1 and PModel 2 are P-models:
We have to show that all v1,...,v8 are nonnegative and (v1 + v2 + ... + v8) = 1.
This holds because
p1 is in the interval (0,1) and
(e*(1-p1) + e*p1 + (p1-e))/p1 = 1 resp. (e*p1 + e*(1-p1) + (1-p1-e))/(1-p1) = 1.
Satisfying (C1):
With
P(a)=v1+v2+v3+v4, P(a*c)=v1+v3, P(a*b)=v1+v2, P(a*b*c)=v1, P(a+b)=v1+v2+v3+v4+v5+v6, P((a+b)*c)=v1+v3+v5
we compute easily from (C1), (Q1) and (Q2) that
p1 = (v1+v3) / (v1+v2+v3+v4)
q1 = (v1) / (v1+v2)
q2 = (v1+v3+v5)/(v1+v2+v3+v4+v5+v6)
We now get that (C1) is fulfilled in
- PModel 1, because
(v1+v3) / (v1+v2+v3+v4) = e*p1 / (e*(1-p1) + e*p1) = p1 / ((1-p1) + p1) = p1
- PModel 2, because
(v1+v3) / (v1+v2+v3+v4) = e*p1 / (e*p1 + e*(1-p1)) = p1 / (p1 + (1-p1)) = p1
Computing q1:
In PModel 1 holds: q1 = v1 / (v1+v2) = 0/v2 = 0
In PModel 2 holds: q1 = v1 / (v1+v2) = v1 / (v1+0) = 1
More generally, since the set of models satisfying the constraint (C1) is convex, q1 can take every value in [0,1] independently of the value of p1.
Computing q2:
In PModel 1 holds: q2 = v3 / (v2+v3+v6) = (e*p1/p1) / 1 = e, i.e. arbitrarily close to 0.
In PModel 2 holds: q2 = (v1+v3+v5) / (v1+v2+v3+v4+v5+v6) = (v1+v5) / (v1+v4+v5)
= (e*p1 + 1-p1-e) / (1-p1) = (e*(p1-1) + (1-p1)) / (1-p1) = 1-e, i.e. arbitrarily close to 1.
More generally, since the set of models satisfying the constraint (C1) is convex, q2 can take every value in (0,1) independently of the value of p1.
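The two models above can be checked numerically. The following Python sketch (with hypothetical values for p1 and e) builds the probability vectors of PModel 1 and PModel 2, verifies the constraint (C1), and computes q1 and q2 at their extremes:

```python
p1, e = 0.7, 0.001

# Probability vectors (v1,...,v8) over the elementary events
# a*b*c, a*b*~c, a*~b*c, a*~b*~c, ~a*b*c, ~a*b*~c, ~a*~b*c, ~a*~b*~c
model1 = [0, e*(1-p1)/p1, e*p1/p1, 0, 0, (p1-e)/p1, 0, 0]
model2 = [e*p1/(1-p1), 0, 0, e*(1-p1)/(1-p1), (1-p1-e)/(1-p1), 0, 0, 0]

def cond(v, num, den):
    """Conditional probability as a ratio of sums of elementary probabilities."""
    return sum(v[i] for i in num) / sum(v[i] for i in den)

for v in (model1, model2):
    assert abs(sum(v) - 1) < 1e-12                           # a probability model
    assert abs(cond(v, [0, 2], [0, 1, 2, 3]) - p1) < 1e-12   # (C1): P(a > c) = p1

q1_1 = cond(model1, [0], [0, 1])                       # 0
q1_2 = cond(model2, [0], [0, 1])                       # 1
q2_1 = cond(model1, [0, 2, 4], [0, 1, 2, 3, 4, 5])    # e
q2_2 = cond(model2, [0, 2, 4], [0, 1, 2, 3, 4, 5])    # 1 - e
print(q1_1, q1_2, round(q2_1, 6), round(q2_2, 6))
```

Both models satisfy (C1) exactly, yet q1 takes the values 0 and 1 and q2 comes arbitrarily close to 0 and 1 as e shrinks.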
Example 2
The aim of example 2 is to demonstrate that we may have different information about the same variable in different contexts. We demonstrate this by proving that even the combination of the
constraints (C2) and (C3) does not restrict the possible values of query (Q3) at all, and restricts query (Q4) only weakly.
(C2) If I know a (and only a), then the probability of c is p2.
(C3) If I know b (and only b), then the probability of c is p3.

(Q3) If I know a and b, what is the probability q3 of c?
(Q4) If I know a or b, what is the probability q4 of c?
Query (Q3) asks for the probability q3 of c given 'a and b' together with the probabilities p2 and p3, while query (Q4) asks for the probability q4 of c given 'a or b' together with the
probabilities p2 and p3.
Formally:
(We use '¬' for 'not', '*' for 'and', '+' for 'or', and the arrow in 's > t' as conditional, equivalent to the notation 't | s', where P(s > t) := P(s * t) / P(s).)

(C2) P(a > c) = p2;
(C3) P(b > c) = p3;

(Q3) P(a * b > c) = q3;
(Q4) P(a + b > c) = q4;
The following table describes two probability models.
ElEvent         | ElVariable | Values in PModel 3        | Values in PModel 4
P( a *  b *  c) | v1         | 0                         | p2*p3 / (p2+p3-p2*p3)
P( a *  b * ¬c) | v2         | (1-p2)*(1-p3) / (1-p2*p3) | 0
P( a * ¬b *  c) | v3         | p2*(1-p3) / (1-p2*p3)     | 0
P( a * ¬b * ¬c) | v4         | 0                         | (1-p2)*p3 / (p2+p3-p2*p3)
P(¬a *  b *  c) | v5         | p3*(1-p2) / (1-p2*p3)     | 0
P(¬a *  b * ¬c) | v6         | 0                         | (1-p3)*p2 / (p2+p3-p2*p3)
P(¬a * ¬b *  c) | v7         | 0                         | 0
P(¬a * ¬b * ¬c) | v8         | 0                         | 0
Recall that
- (p2) is equivalent to (v1+v3) / (v1+v2+v3+v4):
In PModel 3 (C2) is fulfilled as
(v1+v3) / (v1+v2+v3+v4) = (p2*(1-p3)/(1-p2*p3)) / (((1-p2)*(1-p3) + p2*(1-p3))/(1-p2*p3)) = p2*(1-p3) / ((1-p2)*(1-p3) + p2*(1-p3)) = p2 / ((1-p2) + p2) = p2
In PModel 4 (C2) is fulfilled as
(v1+v3) / (v1+v2+v3+v4) = p2*p3 / (p2*p3 + (1-p2)*p3) = p2
- (p3) is equivalent to (v1+v5) / (v1+v2+v5+v6):
In PModel 3 (C3) is fulfilled as
(v1+v5) / (v1+v2+v5+v6) = p3*(1-p2) / ((1-p2)*(1-p3) + p3*(1-p2)) = p3*(1-p2) / (1-p3-p2+p2*p3 + p3-p2*p3) = p3*(1-p2) / (1-p2) = p3
In PModel 4 (C3) is fulfilled as
(v1+v5) / (v1+v2+v5+v6) = p2*p3 / (p2*p3 + (1-p3)*p2) = p3
- (q3) is equivalent to (v1) / (v1+v2)
- (q4) is equivalent to (v1+v3+v5) / (v1+v2+v3+v4+v5+v6).
Concerning (Q3):
Query (Q3) demonstrates that we have complete freedom if we specialize the context, as q3 can take every value in [0,1] independently of any (fixed) values of p2 and p3 in (0,1).
Proof:
The table describes two probability models, i.e. two probability measures which fulfil the constraints (C2) and (C3).
The column of PModel 3 shows that q3 can take the value 0.
The column of PModel 4 shows that q3 can take the value 1.
As the set of models satisfying the constraints (C2) and (C3) is convex, and therefore any value between two solutions can be taken, we have proved that q3 can take any value in [0,1].
Common Sense Example
This common sense example (from D. Dubois) models high values of q3, given low values of p2 and p3.

If you take a shower, the probability that you risk your life is in [0.0, 0.01].
If you use a hairdryer, the probability that you risk your life is in [0.0, 0.01].
If you take a shower and use a hairdryer (at the same time), the probability that you risk your life is in [0.9, 1.00].
Concerning (Q4):
With query (Q4) we want to demonstrate, how extending the context means loosing information, as the probability of (Q4) ranges over the following interval: [(p2*p3 / (p2+p3p2*p3),
(p2+p32p2*p3)/(1p2*p3) ], which we obtain if we use PModel 4 to calculate the lower bound and
P_Model 3 to calculate the upper bound. Again we recall that PModel 3 and PModel 4 fulfill the
constraints (C2) and (C3). As the constraints (C2) and (C3) are convex and therefore any value between two solutions can occur, we proved that q4 can take any value in the described interval.
Example: For p2 = p3 = 0,5 we receive for q4 the interval [1/3, 2/3].
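A numerical check of these claims in Python (using the hypothetical values p2 = p3 = 0.5 from the example): both models satisfy (C2) and (C3), q3 reaches both 0 and 1, and q4 attains the endpoints of the stated interval.

```python
p2, p3 = 0.5, 0.5
den3 = 1 - p2*p3           # normalizer of PModel 3
den4 = p2 + p3 - p2*p3     # normalizer of PModel 4

# v = (v1,...,v8) over a*b*c, a*b*~c, a*~b*c, a*~b*~c, ~a*b*c, ~a*b*~c, ...
model3 = [0, (1-p2)*(1-p3)/den3, p2*(1-p3)/den3, 0, p3*(1-p2)/den3, 0, 0, 0]
model4 = [p2*p3/den4, 0, 0, (1-p2)*p3/den4, 0, (1-p3)*p2/den4, 0, 0]

def cond(v, num, den):
    """Conditional probability as a ratio of sums of elementary probabilities."""
    return sum(v[i] for i in num) / sum(v[i] for i in den)

for v in (model3, model4):
    assert abs(sum(v) - 1) < 1e-12
    assert abs(cond(v, [0, 2], [0, 1, 2, 3]) - p2) < 1e-12   # (C2): P(a > c) = p2
    assert abs(cond(v, [0, 4], [0, 1, 4, 5]) - p3) < 1e-12   # (C3): P(b > c) = p3

q3_lo = cond(model3, [0], [0, 1])                     # 0
q3_hi = cond(model4, [0], [0, 1])                     # 1
q4_lo = cond(model4, [0, 2, 4], [0, 1, 2, 3, 4, 5])  # p2*p3/(p2+p3-p2*p3)
q4_hi = cond(model3, [0, 2, 4], [0, 1, 2, 3, 4, 5])  # (p2+p3-2*p2*p3)/(1-p2*p3)
print(q3_lo, q3_hi, round(q4_lo, 4), round(q4_hi, 4))
```

For p2 = p3 = 0.5 the last two values are 1/3 and 2/3, matching the interval given in the text.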
Examples Concerning Truth Functionality
Example 3: The aim of example 3 is to demonstrate that we can specify certain information for single statements while still having additional freedom to specify different
information for their combination. The classical definition of missing truth functionality is included in this case (see remark). In general terms, truth functionality is absent if there is no
function which allows one to calculate the degree of belief of a compound sentence from those of the single sentences.
(C5) If I know a (and only a), then the probability of b is p5.
(C6) If I know a (and only a), then the probability of c is p6.

(Q5) If I know a (and only a), what is the probability q5 of (b and c)?
Query (Q5) asks for the probability q5 of (b and c) given a and the probabilities p5 and p6.
Formally:
(We use '¬' for 'not', '*' for 'and', '+' for 'or', and the arrow in 's > t' as conditional, equivalent to 't | s', where P(s > t) := P(s * t) / P(s).)

(C5) P(a > b) = p5;
(C6) P(a > c) = p6;

(Q5) P(a > b*c) = q5;
Remark:
The classical concept of absent truth functionality is implied by the example above, as the statement
'P(a*b) is not a function of P(a) and P(b)'
can be reformulated as
'P(True > a*b) is not a function of P(True > a) and P(True > b)'.
This means that we consider the conditional probability as the basic construct and the unconditional probability as a special case of conditional probability.
The following table describes two probability models.
ElEvent         | ElVariable | Values in PModel 5 (p5+p6 <= 1) | Values in PModel 6 (p5+p6 > 1) | Values in PModel 7
P( a *  b *  c) | v1         | 0                               | p5+p6-1                        | min(p5,p6)
P( a *  b * ¬c) | v2         | p5                              | 1-p6                           | p5 - min(p5,p6)
P( a * ¬b *  c) | v3         | p6                              | 1-p5                           | p6 - min(p5,p6)
P( a * ¬b * ¬c) | v4         | 1-p5-p6                         | 0                              | 1 - max(p5,p6)
P(¬a *  b *  c) | v5         | 0                               | 0                              | 0
P(¬a *  b * ¬c) | v6         | 0                               | 0                              | 0
P(¬a * ¬b *  c) | v7         | 0                               | 0                              | 0
P(¬a * ¬b * ¬c) | v8         | 0                               | 0                              | 0
Recall that
- (C5) is equivalent to (v1+v2) / (v1+v2+v3+v4) = p5
- (C6) is equivalent to (v1+v3) / (v1+v2+v3+v4) = p6
- (q5) is equivalent to (v1) / (v1+v2+v3+v4).
Concerning (Q5):
The query (Q5) demonstrates that the probability q5 is not determined by the constraints (C5) and (C6).
We first show that the lower bound of q5 is max(0, p5+p6-1).
Proof:
Case p5+p6 <= 1:
In PModel 5, q5 = 0 / (0 + p5 + p6 + (1-p5-p6)) = 0.
Case p5+p6 > 1:
In PModel 6, q5 = (p5+p6-1) / (p5+p6-1 + (1-p6) + (1-p5)) = p5+p6-1.
We therefore obtain a lower bound of max(0, p5+p6-1) for q5. This value is minimal, as the overlap of the two events b and c (within a) is as small as possible: they overlap only where the constraints leave no other
possibility.
We also show that the upper bound for q5 is min(p5,p6).
Proof:
In PModel 7, q5 takes the value min(p5,p6). This is the extreme case, as the overlap of b and c (within a) is maximal if (a and b) is contained in (a and c) (in which case we obtain q5 = p5) or (a and c) is contained in (a and b) (with q5 = p6).
As the set of models satisfying the constraints is convex, q5 can take every value in the interval [max(0, p5+p6-1), min(p5,p6)] for fixed values p5 and p6 in [0,1].
Numerical Example for Query (Q5)
For p5 = 0.45 and p6 = 0.55 we obtain values in [0.0, 0.45], which is not very useful information.
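These bounds can be verified numerically. The following Python sketch (using the hypothetical values p5 = 0.45, p6 = 0.55 from above) checks that PModel 5 attains the lower bound and PModel 7 the upper bound:

```python
def bounds_q5(p5, p6):
    """Interval of possible values of q5 = P(b*c | a), given P(b|a)=p5, P(c|a)=p6."""
    return max(0.0, p5 + p6 - 1), min(p5, p6)

def q5_of(v):
    # v = (v1, v2, v3, v4): probabilities of a*b*c, a*b*~c, a*~b*c, a*~b*~c;
    # all probability mass lies on a, so P(a) = v1+v2+v3+v4.
    return v[0] / sum(v)

p5, p6 = 0.45, 0.55
lo, hi = bounds_q5(p5, p6)

# PModel 5 attains the lower bound (valid here since p5 + p6 <= 1) ...
model5 = (0.0, p5, p6, max(0.0, 1 - (p5 + p6)))   # max() guards against rounding
# ... and PModel 7 attains the upper bound.
model7 = (min(p5, p6), p5 - min(p5, p6),
          p6 - min(p5, p6), 1 - max(p5, p6))

print(round(lo, 6), round(hi, 6), round(q5_of(model5), 6), round(q5_of(model7), 6))
```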
Common Sense Examples
We categorize classes (e.g. Germans, birds, etc.) by typical properties.
Nevertheless, in many cases there will be no individual in the class that has all the typical properties. Formally, this means that
P(a > b*c*d*...)
may be 0 while the single probabilities
P(a > b), P(a > c), ...
are very close to 1.
Another example is the so-called lottery paradox:
We do not expect a single ticket i to win, i.e. for some small e
P(ticket i does not win) = 1-e
(written as a conditional: P(TRUE > [ticket i does not win]) = 1-e),
but if we conclude the conjunction of all these statements (for the different tickets), we make an error, as
P(TRUE > (ticket i does not win) and (ticket j does not win) and ...) = 0;
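The lottery paradox can be made concrete with a tiny model (a Python sketch, with a hypothetical number of tickets): each single statement 'ticket i does not win' has probability close to 1, yet the conjunction of all of them has probability 0.

```python
n = 10                 # hypothetical number of tickets; exactly one of them wins
p = [1 / n] * n        # elementary events: 'ticket j wins', each with probability 1/n

def prob_does_not_win(i):
    # P(ticket i does not win) = mass of all events where another ticket wins
    return sum(p[j] for j in range(n) if j != i)

singles = [prob_does_not_win(i) for i in range(n)]   # each equals (n-1)/n
p_all_lose = 0.0   # no elementary event has every ticket losing

print(round(singles[0], 6), p_all_lose)
```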

Structure of the Cost Matrix
Explanation of the Cost Matrix
Our choice for translating probabilities into decisions can be described as searching for the decision for which the expected average costs are minimal.
To this end, each position of our matrix describes the costs (theoretical, global, independent of institutions) incurred for a given actual fault situation and a given
decision.

Hypotheses and Decisions | (Certified) Fault 1                 | (Certified) Fault 2
Decision / Treatment 1   | Costs of Treatment 1, given Fault 1 | Costs of Treatment 1, given Fault 2
Decision / Treatment 2   | Costs of Treatment 2, given Fault 1 | Costs of Treatment 2, given Fault 2
Decision / Treatment 3   | Costs of Treatment 3, given Fault 1 | Costs of Treatment 3, given Fault 2
(N.B.: Under 'Certified Fault' the entry 'no fault' is also possible; correspondingly, the decision 'no treatment' is included under 'treatment'.)
An (invented) example with numbers:

                       | Fault / Certified Diagnosis 1 | Fault / Certified Diagnosis 2
Decision / Treatment 1 | 2500                          | 3800
Decision / Treatment 2 | 4000                          | 3000
Decision / Treatment 3 | 2000                          | 7000
From this matrix 'the additional costs of non-optimal decisions' can be calculated by subtracting from each column its smallest element (which corresponds to the treatment of choice when the fault
associated with that column is present):

                       | Fault / Certified Diagnosis 1 | Fault / Certified Diagnosis 2
Decision / Treatment 1 | 500                           | 800
Decision / Treatment 2 | 2000                          | 0
Decision / Treatment 3 | 0                             | 4000
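The subtraction step can be sketched in a few lines of Python, using the invented numbers from the matrix above:

```python
# Cost matrix: rows are treatments, columns are certified faults.
costs = [
    [2500, 3800],   # Treatment 1 under Fault 1, Fault 2
    [4000, 3000],   # Treatment 2
    [2000, 7000],   # Treatment 3
]

# Subtract each column's minimum to obtain the additional costs
# of non-optimal decisions.
col_min = [min(row[j] for row in costs) for j in range(len(costs[0]))]
regret = [[row[j] - col_min[j] for j in range(len(row))] for row in costs]

for row in regret:
    print(row)
```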
When Fault 1 is certain, the correct choice is clearly Treatment 3; when Fault 2 is certain, the obvious choice is Treatment 2.
What role can Treatment 1 then play? This example already shows that uncertainty about the fault in question can by itself lead to the choice of a different treatment: Suppose we add to the
matrix the probability with which a specialist classifies the existing faults in a specific situation, and the specialist considers Fault 1 only somewhat more likely than Fault 2 (which we
formalize as probabilities of 60 % and 40 %). If we further extend the matrix by the calculation of the average costs of each treatment, we get the following extended matrix:

                                       | Fault / Certified Diagnosis 1 | Fault / Certified Diagnosis 2 | Average costs per treatment
Probability of the Certified Diagnosis | 60 %                          | 40 %                          |
Decision / Treatment 1                 | 500                           | 800                           | 0.6 * 500 + 0.4 * 800 = 620
Decision / Treatment 2                 | 2000                          | 0                             | 0.6 * 2000 + 0.4 * 0 = 1200
Decision / Treatment 3                 | 0                             | 4000                          | 0.6 * 0 + 0.4 * 4000 = 1600
Here Treatment 1 is the right alternative, as it leads to costs of 620 units, while in this situation Treatment 2 costs 1200 units and Treatment 3 costs 1600 units. (We want to point
out again that the term 'costs' here includes later effects and other risks.)
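The final decision step, choosing the treatment with minimal expected costs, can be sketched in Python (with the regret values and probabilities from the tables above):

```python
# Additional costs of non-optimal decisions (from the subtracted matrix).
regret = [
    [500, 800],     # Treatment 1 under Fault 1, Fault 2
    [2000, 0],      # Treatment 2
    [0, 4000],      # Treatment 3
]
probs = [0.6, 0.4]  # specialist's probabilities for Fault 1 and Fault 2

# Expected cost of each treatment, then pick the minimizer.
expected = [sum(p * c for p, c in zip(probs, row)) for row in regret]
best = min(range(len(expected)), key=expected.__getitem__)

print([round(c, 2) for c in expected], "-> choose Treatment", best + 1)
```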