

Reference Project


LEXMED is a successfully working system for the diagnosis of appendicitis. Here the expert is a physician, the professional knowledge is a set of statistical rules, the actual data are the patient's symptoms, and the decision is a proposal for medical treatment provided by LEXMED.

LEXMED was tested at the Hospital 14 Nothelfer in Weingarten (Baden-Württemberg). While still experimental, it already achieves an accuracy of 88% -- not quite as good as a highly experienced surgeon, but definitely better than, for example, the average general practitioner.





1. Problem Description and Overview

When building systems for decision support we will very often be faced with

  1. uncertain and incomplete information
    (e.g. 'most birds fly', or it may be unknown whether a certain object X has property Y)
  2. heterogeneous information sources
    (e.g. frequency data extractable from a database as well as rule knowledge obtainable from experts)
  3. resource constraints, e.g. time and cost
    (e.g. 'a decision has to be taken within 30 minutes' or 'the cheapest solution has to be found')

Of course, in some simple cases of decision support either straightforward procedures are available or even ad hoc solutions will do. However, increasing complexity of decision support problems entails increasing demand for more general and more powerful solutions which are justified by application-independent, intelligible theoretical foundations and which also allow for efficient implementation. Therefore, the problems mentioned above have been a subject of intensive research.

In order to cope with point (1) above we pursue the approach consisting of the combination of Probability Theory (ProbTh) (restricted to finite domains) and the Principle of Maximum Entropy Completion (ME) on the basis of Linear Probabilistic Constraints (LPC). We will argue in the following that our approach is the result of a rather compelling analysis of the above mentioned challenges for decision support systems. Moreover, our approach does not suffer from the shortcomings of most of its competitors, like decision trees, score systems, fuzzy logic, Bayesian networks, neural nets, non-monotonic logics and others.
Apart from interviewing experts, we use our Entropy-based Data Mining Methods (EDM) to enlarge the scope of accessible knowledge (point 2).
Point (3) is taken into account by our use of Cost Matrices (CM), which translate probabilities into the financial consequences of decisions, and by our efficient implementation, which yields decisions within a few seconds.

To allow user-oriented knowledge representation, an expressive knowledge representation language (HPL) was developed on top of Linear Probabilistic Constraints for the easy specification of the available information. To increase the scope of feasible problems, methods for anticipating some of the effects of ME Completion were developed, which allow the completion computation to be modularized. In addition, special data structures were designed for handling large probability distributions.
Finally, we compare our technology and system to Competing Approaches and Technologies.

2. Probability Theory, Maximum Entropy
and Linear Probabilistic Constraints

2.1. Probability Theory (ProbTh)

ProbTh is a fully developed and generally applicable tool for specifying uncertain and incomplete information which makes ProbTh an ideal starting point.

  • Natural Expression of Conditional Frequency Knowledge
    It is easy for most of us to express common sense knowledge by a set of 'conditional frequency expressions' (e.g. 'If I think of children with symptom X, I expect them to suffer in many cases from illness Y') and to estimate the probability of some simple conclusions without difficulty (Gig96).
  • Context Sensitivity
    ProbTh is a language which allows us to consistently express different information in different contexts. If we specify some information in ProbTh by its most prominent form, a conditional probability, we have to specify both the 'context' for which the information holds and the information itself. In the expression 'P(b | a) = x' (or in our notation 'P(a -|> b) = x'), 'a' is the context for which the information 'P(b) = x' holds. Now some 'c' may be 'true' (highly probable) in the context 'a' and in the context 'b', but 'false' (probability near 0) in the context 'a and b' or in the context 'a or b'. (See the examples under Context Sensitivity.)
    (An important application of this is found with Bayes Nets, which allow the specification of different kinds of dependencies for different contexts).
  • Missing Truth Functionality
    But ProbTh offers even more degrees of freedom in specifying information: if we have various pieces of information in the same context, we may not know how they combine. Therefore our language has to allow different kinds of combination of evidence (a trivial experience in common sense, as two poisons may double their effect or may be antidotes of each other; consequently, their combined effect cannot be determined in general). This means that in general we shall not be able to determine a judgment - here the probability - of a compound sentence from those of its constituents. Conversely, theories which do offer such a calculation will not be able to express different situations of dependency. Example: given the conditional probabilities 'P(a -|> b) = x' and 'P(a -|> c) = y', we cannot determine z in 'P(a -|> b * c) = z' from x and y. (See also the example Concerning Truth Functionality.) This precludes in advance all problems with monotonic conclusions which contradict reality - a serious trouble spot of rule based systems which gave rise, for instance, to some nonmonotonic logics (e.g. Lifschitz (Lif89) demands such a combination rule to hold for reasoning with non-monotonic logics). As shown in (Sch96), most of the nonmonotonic inheritance problems given in the benchmark collection (Lif89) can be solved with ProbTh alone, while further principles (like ME below) are necessary to solve all of them. (For more information about non-monotonic logics see e.g. Mak94, Ant97 or DP96.)
  • Probability as formalizing belief
    Expert knowledge consists mostly of inductive experience in which experts believe to various degrees. That ProbTh is an excellent means of modelling belief is justified in detail in Par94.
  • Notion of Independence
    ProbTh offers the notion of (conditional) independence, which has a clear information theoretic meaning even without using numbers. It allows us to distinguish different kinds of information flow and prepares the ground for efficient methods of reasoning.
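The missing truth functionality mentioned above can be checked by direct calculation. The following sketch (the numbers are purely illustrative, not taken from any application) exhibits two conditional distributions over 'b' and 'c' in the context 'a' which agree on P(a -|> b) and P(a -|> c) but disagree on P(a -|> b * c):

```python
# Two conditional distributions (given context a) over the four events
# b*c, b*-c, -b*c, -b*-c. Both satisfy P(a -|> b) = 0.7 and P(a -|> c) = 0.8,
# yet they disagree on P(a -|> b * c): probability is not truth functional.
model_x = {"b*c": 0.5, "b*-c": 0.2, "-b*c": 0.3, "-b*-c": 0.0}
model_y = {"b*c": 0.7, "b*-c": 0.0, "-b*c": 0.1, "-b*-c": 0.2}

def marginals(m):
    p_b = m["b*c"] + m["b*-c"]   # P(a -|> b)
    p_c = m["b*c"] + m["-b*c"]   # P(a -|> c)
    return p_b, p_c

for m in (model_x, model_y):
    assert abs(sum(m.values()) - 1.0) < 1e-9
    p_b, p_c = marginals(m)
    assert abs(p_b - 0.7) < 1e-9 and abs(p_c - 0.8) < 1e-9

# Same x and y, but z = P(a -|> b * c) differs: 0.5 versus 0.7.
print(model_x["b*c"], model_y["b*c"])
```

No rule of the form z = f(x, y) could reproduce both models, which is exactly the freedom a probabilistic language needs.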

Traditional objections to the use of ProbTh can be overcome as follows:

  • Objection: ProbTh is of very limited deductive power, and the number of conclusions that can be drawn is so restricted that ProbTh alone is too weak for supporting decisions.
    Refutation: The deductive power of ProbTh can be increased sufficiently by adding the principle of Maximum Entropy (see below).
  • Objection: Computations in ProbTh are impossible, as they are exponential in the number of knowledge attributes (variables).
    Refutation: This criticism was refuted by Lauritzen (Lau88), who demonstrated efficient probabilistic reasoning via the use of dependency networks. (See as well 'Implementation' below on the way we compute ME.)
  • Objection: Knowledge Modelling is impossible in ProbTh, as even the expert does not know precise probabilities of his/her field.
    Refutation: This objection is no longer valid due to the introduction of probability intervals, which allow us to represent uncertain probabilities. The related objection by Kahneman and Tversky (Kah72), (Tve74) - that people are unable to reason with probability theory ('where do the numbers come from?') - has been answered by Gigerenzer (Gig96), who showed that we are well able to reason in terms of frequencies.
    Allowing the use of probability intervals - representing uncertain probabilities - also refutes the objection that ProbTh obliges us to overspecify our knowledge by requiring us to assign (exact) probability values for which we have no justification.

    2.2. Maximum Entropy (ME)

    In order to increase the deductive power of ProbTh to a degree which is sufficient for supporting decisions we add the (context sensitive) principle of Maximum Entropy, which reduces the remaining degrees of freedom without introducing any additional bias.
    Main arguments for ME are as follows:
    • Consistent Extension of ProbTh
      • Model Selection
        Since ME selects one of the models provided by ProbTh on the basis of the given knowledge it is obviously consistent.
        Of course, the choice of the ME-Model is context sensitive, i.e. if we change (enlarge) the set of constraints, we possibly receive a different ME-Model (for the relation to 'non-monotonic reasoning' see SF97). It also preserves the absence of truth functionality.
      • Extension of Indifference
        If there is no reason to consider two events as different, we do not want to introduce such a difference artificially, and we assign equal probability to the two events. This option for indifference (including a precise definition of the cases where we have no reason to make a difference) is respected by ME and used in PIT (see Implementation).
      • Extension of independence
        (Conditional) Independencies between parts of knowledge are extremely helpful for inferencing, as they decrease context influence and thus increase the number of possible conclusions that can be drawn. It is therefore of particular importance that ME does not introduce new dependencies within the specified knowledge, but assumes independence where no other information is available (or, formulated sloppily: ME maximizes independence, given a set of constraints).
        The Principle of Independence makes precise which kinds of independencies ME assumes, given a set of linear constraints. A subset of these independencies can be found by drawing an undirected graph, where each node corresponds to a variable and each arc between a node A and a node B corresponds to a certain constraint c which mentions both A and B. If we interpret the resulting graph as an (in)dependence map (see e.g. Pea88) (in other words, assuming the Markov property for the probability models, given the graph G), we have assumed a set of independence statements to hold for our models. Therefore, using ME means learning the independencies from the constraints, in contrast to standard methods, which assume certain independencies to hold independently of the constraints (e.g. the simpler scoring decision systems).
      • Minimal Information Increase
        Since maximum entropy is equivalent to minimal increase of information, the ME principle chooses the most uninformative model, or most neutral model. In other words, the given knowledge is supplemented by ME in the weakest possible way.
    • Risk Minimization
      As shown in (Gru00), the choice of the ME-Model is strongly related to minimizing risk in bets. Since betting is quite similar to taking a decision, rooting our decisions in the ME model means always taking the decision which is least risky.
    • Concentration Phenomenon and the Wallis Derivation
      A further interesting property of the ME distribution is that it is the center of gravity of all possible models, i.e. almost all events that satisfy certain given constraints have frequencies which are extremely close to the ME distribution. Take, for instance, a probability distribution describing the proportions of different kinds of balls in an urn with a fixed (but unknown) number N of balls. Consider all the 'urns' (i.e. combinations of balls of different colours) where the combination fulfills certain constraints (premises, axioms), each combination itself being generated by a random process of colouring balls. Then the most probable combinations will 'concentrate' around the ME distribution, and increasingly so as N grows. (For the formal description of the Wallis derivation see for instance (Jay95) or (Sch01).)
    • Axiomatics of Information Measures
      As we said above, ME chooses the most uninformative model and thus avoids the introduction of any additional bias. If one tries to determine axiomatically a function which measures the amount of missing information (uncertainty) - an attempt which goes back to Shannon's Information Theory - it can be shown that the entropy function is the only one which satisfies these axioms, i.e. it is the only one which chooses the most uninformative model (see e.g. (Jay95)).
    In Prob. Reasoning we give an example for reasoning with ME compared to (monotonic) probabilistic reasoning.

    2.3. Linear Constraints

    For practical reasons it is important to find a type of constraints which is easy to deal with and also general enough to state all relevant expert knowledge. The ideal choice is a consistent set (conjunction) of linear (probabilistic inequality) constraints, i.e. linear combinations of probabilities of elementary events. They have the following nice properties:
    • Allowing only Linear Constraints assures the existence of a unique Maximum Entropy distribution, (although more general constraints would be sufficient to assure this uniqueness.)
      A unique ME model implies that every (e.g. user-defined) probabilistic statement will be true or false in our model. For instance, in a medical application this means that for every combination of values of possible symptoms of a patient (including the case of unknown values) the probability model yields a numerical value (i.e. not an interval) for the probability of a certain illness.
    • In addition, Linear Constraints are sufficiently general to use them as a kind of assembler language on which a very expressive High-level Probabilistic Knowledge Representation can be conceived.
    • Finally, instead of approximating the ME distribution by general tools of non-linear optimisation, Linear Constraints have the computational advantage that the ME distribution can be efficiently computed (e.g. by directly calculating the Lagrange multipliers or by using the Newton approximation for them).
    Moreover, if the set is not consistent, our algorithm tries to identify a (not necessarily unique) subset of inconsistent constraints and to inform the user how to 'repair' this set of constraints.
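To illustrate how an ME model can be computed from linear constraints, the following sketch uses iterative proportional fitting, a standard scaling scheme whose fixed point is the ME distribution for marginal equality constraints (this is an illustration only, not PIT's Lagrange/Newton computation). The constraints P(red) = 0.7 and P(green) = 0.8 are those of the urn example discussed later under Probabilistic Reasoning:

```python
# Four elementary events over two binary attributes (red, green),
# initialised with the uniform (maximum entropy) distribution.
atoms = [(r, g) for r in (0, 1) for g in (0, 1)]
p = {a: 1.0 / len(atoms) for a in atoms}

# Linear equality constraints: P(red) = 0.7 and P(green) = 0.8.
constraints = [(lambda a: a[0] == 1, 0.7),
               (lambda a: a[1] == 1, 0.8)]

# Iterative proportional fitting: repeatedly rescale each event and its
# complement so that the constraint holds; the limit is the ME model.
for _ in range(100):
    for event, target in constraints:
        mass = sum(p[a] for a in atoms if event(a))
        for a in atoms:
            p[a] *= (target / mass) if event(a) else (1 - target) / (1 - mass)

# For pure marginal constraints ME assumes independence: P(red*green) = 0.7*0.8.
print(round(p[(1, 1)], 4), round(p[(1, 0)], 4))   # 0.56 0.14
```

For constraints of this simple marginal form the procedure reproduces the independent product model, in line with the Principle of Independence above.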

    3. Extensions

    3.1. High-level Probabilistic Language (HPL)

    To express one's knowledge in the form of linear constraints is not feasible for the average expert. Therefore an HPL was designed on top of them which permits a user-friendly specification of the available information and which is automatically translated into linear constraints. The central features of this HPL are statements of conditional probabilities ranging over intervals, which are very useful for expressing probabilistic if-then propositions. PIT in particular is able to handle such an expressive language (with probability intervals).
    Additionally, the special type of linear constraints generated from statements of our HPL have the computational advantage that the ME distribution can be calculated without search (by directly calculating the Lagrange multiplier).
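The translation of an interval conditional into linear constraints can be sketched as follows. A statement P(a -|> b) = [l, u] is not linear in the conditional probability itself, but multiplying out the definition P(b | a) = P(a * b) / P(a) yields two inequalities that are linear in the elementary-event probabilities. The encoding below is our own illustration of this standard algebra, not PIT's internal format:

```python
# P(a -|> b) in [l, u] becomes, after multiplying by P(a):
#     l * P(a) - P(a*b) <= 0    and    P(a*b) - u * P(a) <= 0,
# both linear over the elementary events. Atom order: a*b, a*-b, -a*b, -a*-b.
def interval_conditional(l, u):
    p_a  = [1, 1, 0, 0]   # indicator of a:   P(a)   = v1 + v2
    p_ab = [1, 0, 0, 0]   # indicator of a*b: P(a*b) = v1
    lower = [l * x - y for x, y in zip(p_a, p_ab)]   # l*P(a) - P(a*b) <= 0
    upper = [y - u * x for x, y in zip(p_a, p_ab)]   # P(a*b) - u*P(a) <= 0
    return lower, upper

# Coefficient vectors for the HPL statement P(a -|> b) = [0.2, 0.3]:
low, up = interval_conditional(0.2, 0.3)
print(low, up)
```

Each returned vector c represents the constraint Sum_i c_i * v_i <= 0 over the elementary-event probabilities v_i, which is exactly the 'assembler language' of Section 2.3.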

    Combining Knowledge from Different Sources

    While developing our application LEXMED, we found it necessary to combine knowledge from different sources, like data, experts and literature, as in most cases
    • experts are not able to specify hundreds of rules, including complex dependencies between 3 or 4 variables, as their time is strictly limited and their dependency knowledge is not precise enough to cope with this task. Moreover, different experts often contradict each other.
      (This problem was underestimated when the first expert systems were built.)
    • available data alone are in most cases not representative for the actual purpose, and
    • the literature often does not contain enough systematic information of the necessary precision.
    As our HPL (or some sublanguage of it) is (close to) the output language of several datamining tools (e.g. Bayesian Nets; with some restrictions, Decision Trees and some Neural Nets; and, of course, our own Datamining Tool) and, on the other hand, frequency statements are quite common in the literature (see e.g. Dom91), it is relatively easy to combine the knowledge from different sources.
    This would be much harder, or even impossible, if we chose data as the basic information source. Since there are many algorithms for generating 'knowledge' from data (the topic of the large machine learning community), but (to our knowledge) no algorithms for generating justified data from 'knowledge', we have to start with the knowledge representation language and look for a datamining component which produces knowledge in our representation language.
    Furthermore, constraint based systems offer a knowledge base which is easy to read, to understand, to check and to maintain. This is important, as we want to be informed about what the system knows and what it has learned. All this is done without making the retrieval of knowledge less efficient.

    3.2. Datamining Tool

    From our background in probability theory, Maximum Entropy (and its variant Minimum Cross Entropy) and the concepts of independence, it is only a small step to corresponding datamining algorithms. These algorithms (a basic concept, the multi-information function, can be found in SV98) look for correlations between different attributes and construct a dependency graph which represents the most important dependencies between the variables. This dependency graph then serves as the basis for determining the (probabilistic) 'rules' which are necessary to build an appropriate probabilistic model (which then represents the most important dependencies of the data). For a detailed description of this algorithm see ES99.
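The correlation-search step can be sketched with empirical mutual information: attribute pairs whose mutual information exceeds a threshold become arcs of the dependency graph. The attribute names, data and threshold below are invented for illustration; the actual EDM algorithm (see ES99) uses the multi-information function and more refined statistics:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy records: fever tracks illness, blood pressure is independent noise.
data = {
    "illness":  [1, 1, 1, 1, 0, 0, 0, 0],
    "fever":    [1, 1, 1, 0, 0, 0, 0, 1],
    "pressure": [1, 0, 1, 0, 1, 0, 1, 0],
}

# Arcs of the dependency graph: pairs with noticeable mutual information.
edges = [(a, b) for a, b in combinations(data, 2)
         if mutual_information(data[a], data[b]) > 0.05]
print(edges)   # only the illness-fever dependency survives the threshold
```

In a second step, conditional probabilities ('rules') would be estimated along these arcs to build the probabilistic model.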

    3.3. Cost Matrix

    Basically, when posing a query, the answer of PIT will be a probability. But how are probabilities related to decisions? Surprisingly, the answer is not trivial. We have to respect different types of errors and their different costs.
    In our application LEXMED it makes a difference whether we diagnose 'perforated appendicitis', where 'no appendicitis' would be correct, or we diagnose 'no appendicitis' where 'perforated appendicitis' would be correct. The latter case is of course a much bigger mistake, or in other words, much more expensive in view of the follow-up costs.
    Additionally, some decisions are optimal just for uncertain cases, while others are optimal if we are (nearly) certain (see the example under Cost Matrix Structure).
    We therefore use a general solution, working for any vector of faults and any vector of possible decisions, with the aim of finding the decision which causes minimum overall cost. Including such a cost calculation in the decision process is very simple. Let cij be the cost if we decide for class i while the real fault is class j. Given a matrix (cij) of such (mis)classification costs and a probability pj for each real fault j, the query evaluation of PIT computes the average (mis)classification cost   ACi = Sum_{j=1,...,n} cij*pj   and then selects the decision i for which ACi is minimal. This result therefore incorporates both the uncertain knowledge (leading to probabilities for the different faults) and the different types of errors (by calculating the minimum average cost based on these probabilities).
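This cost calculation can be sketched in a few lines. The cost figures and probabilities below are invented for illustration; they are not LEXMED's actual cost matrix:

```python
# Hypothetical two-class setting: 0 = "no appendicitis", 1 = "perforated".
# cost[i][j] = cost of deciding i when the real fault is j.
cost = [
    [0.0, 100.0],   # deciding "no appendicitis": disastrous if perforated
    [10.0, 0.0],    # deciding "perforated": an unnecessary operation
]
p = [0.85, 0.15]    # fault probabilities delivered by the query evaluation

# Average (mis)classification cost of each decision i: AC_i = sum_j c_ij * p_j.
avg_cost = [sum(cost[i][j] * p[j] for j in range(len(p)))
            for i in range(len(cost))]
decision = min(range(len(avg_cost)), key=avg_cost.__getitem__)
print(avg_cost, decision)   # [15.0, 8.5] 1
```

Note that even with only a 15% fault probability the costly error dominates, so the sketch decides to operate: a concrete case of a decision being optimal precisely under uncertainty.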

    4. Implementational Achievements

    Our implementation mainly exploits the properties of ME completion, like Indifference and Independence, to avoid the exponential growth of the problem size in the number of variables and constraints. These principles are used to simplify the domain (without having any influence on the outcome of the ME completion) and therefore facilitate the task of ME completion and augment the scope of feasible problems.
    • Support of multivalued (textual) variables
      Example: P([Bloodpressure = high] * [Fever=low] ) = 0.5;
    • Support of uncertain probabilities (Interval probabilities)
      Example: P([Bloodpressure = high] -|> [Fever=low] ) = [0.2, 0.3];
    • Internal use of the (weak and strong) principle of indifference.
      Our implementation allows us to efficiently calculate and store more than a million equivalence classes per node, each of which may contain an unlimited number of indifferent elementary events.
    • Using the Principle of Independence allows (in most cases) to split the calculation of the joint probability distribution into a network of smaller distributions, related by conditional independence.
      The resulting datastructure is called a decomposition tree. For generating and working in such a tree see e.g. Tar85, Tru00.
    • Direct Calculation of the Lagrange Factors for Elements of our HPL.
      In every iteration step of calculating the ME model, we have to adjust a current probability model to satisfy a current constraint. Therefore, it is of great importance, how fast this update can be performed. For most elements of our HPL, the necessary update is done in one step, for the remaining ones it is carried out by efficient search (Newton Algorithm).
    • Last, but not least, we have developed very efficient data structures and algorithms for fast query response. In our medical application LEXMED, we have shown that query evaluation on a knowledge base of more than 500 conditional probabilities can be done in less than a second on a standard PC.

    5. Competing Technologies and Systems

    Other technologies conceived for the same aims can be subdivided into two groups: Qualitatively Simpler Technologies, and technologies which are Qualitatively Competitive, but Practically Less Handy. We will shortly state our position on these technologies.

    5.1. Qualitatively Simpler Technologies

    • Score systems implicitly make strong independence assumptions between the variables. As a consequence, these systems cannot handle dependencies between different variables or different illnesses (as PIT is able to do). Technically, we can transform every Score system into a system working with PIT (but not the other way round) (see Sch00).
    • Decision Trees also represent tools which learn from data and not from expert knowledge. Decision Trees are not applicable if certain information necessary for branching is missing, and they also require the information in a fixed order.
    • As the well known neural networks are based on data, they do not offer a general solution for reading and combining knowledge from different sources, e.g. data bases and human experts.
    • Derivatives of classical logic (e.g. default logic) (which work by chaining rules to infer new statements), do not fully support the necessary features of context sensitivity and missing truth functionality.
      Examples of such systems and their problems can be found in Poo89, McD87.
    • As Fuzzy Logic is truth functional (i.e. it calculates the values of complex expressions from those of their parts), it cannot handle specifications as described in Example 3. Consequently, since different knowledge about dependencies cannot be expressed using fuzzy logic, and since we expect this kind of knowledge to play an important role (as it can be expressed by frequency knowledge), we cannot use this logic for our goals.

    5.2. Qualitatively Competitive, but Practically Less Handy Approaches

    • Bayesian Networks are an advanced way of representing and retrieving uncertain knowledge, which makes extended use of probability theory. They may be based on data (and automatically learn the rules and the independence structure from data) or based on expert rules.
      • Bayesian Networks based on Data
        For several years a growing community has been working on the topic of learning Bayesian Networks from data. However, in many applications the necessary knowledge cannot be determined from data alone, and human expert knowledge must be taken into account as well.
      • Bayesian Networks based on Expert Knowledge
        If we use the Bayesian Network language for the representation of the experts' knowledge, the specification has to be complete, i.e. the expert has to fully specify the independence relationships and all the conditional probabilities of the structure, which is very likely more than the expert can provide. Here, the language of PIT offers a more general solution, as it allows the specification of arbitrary sets of constraints (in its language) and does not demand any independence structure (see also Luk00).
        No rapid prototyping:
        In case of modelling expert knowledge with Bayesian Networks such knowledge has to be structured manually, which is more time consuming and costly than the automatic construction of a probability model by PIT.
        No incremental update:
        For the same reason, if new expert knowledge becomes available, the structure of a Bayesian Network must be manually reworked, while with PIT we just input the new knowledge and the update is done automatically.

    Graphical Overview of a Standard PIT Application





Probabilistic Reasoning: an example of reasoning with MaxEnt compared with classical (monotonic) reasoning.

Properties: examples concerning context sensitivity and missing truth functionality in probability theory.

Cost Matrix: an example of translating probabilities into decisions (as used and described in the PIT application LEXMED).




Probabilistic Reasoning

Probabilities deliver a well-researched method of reasoning with uncertain knowledge. They form a unified language for expressing knowledge, which may be generated inductively from data or may come from experts or the literature. Using probabilities means expressing the given knowledge in the form of probabilistic statements or 'rules'. Of course, the user is more interested in conclusions based on this knowledge. A first attempt might be to use the well-known concepts of monotonic reasoning, where the term monotonic means that all conclusions remain valid given additional knowledge (rules).

Probabilistic Reasoning I (monotonic)

In this type of logic, the conclusions should be true whenever the premises (i.e. the given knowledge) are true. We give an example of this type of probabilistic reasoning:
We may have specified the following database DB of rules:
   About 70% of balls in an urn U have red spots.
   About 80% of balls in an urn U have green spots.
Additionally we might have the following query, based on the knowledge in DB:

How many balls in urn U could we expect to have red and green spots ?

Now it is easy to calculate that there is a minimum of 50% (as the overlap is 70% + 80% - 100% = 50%) and a maximum of 70% (if all balls with red spots also show green spots) of balls with spots in both colours. So our answer is

We could expect between 50% and 70% of balls with red and green spots.

This answer is correct, and it will stay correct even under additional knowledge. But this answer is not useful in the general case. With a few more premises and slightly more complex queries, we will typically end up with results like

We could expect between 0% and 100% of balls with property x, y and z.

which of course is a fact by definition and not worth mentioning.
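The interval in the answer above is an instance of the well-known Fréchet bounds for a conjunction. A small sketch shows both the urn example and how the bounds degenerate as premises accumulate:

```python
# Fréchet bounds for the conjunction of events with given probabilities:
# max(0, sum - (n-1)) below and min(p_1,...,p_n) above are the best bounds
# derivable monotonically from the marginal probabilities alone.
def conjunction_bounds(probs):
    lower = max(0.0, sum(probs) - (len(probs) - 1))
    upper = min(probs)
    return lower, upper

print(conjunction_bounds([0.7, 0.8]))            # the urn example: (0.5, 0.7)
print(conjunction_bounds([0.7, 0.8, 0.6, 0.6]))  # lower bound already 0.0
```

With four premises the lower bound has already collapsed to 0, illustrating why purely monotonic conclusions quickly become uninformative.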

Probabilistic Reasoning: II (non-monotonic)

Therefore we are not satisfied with monotonic reasoning and prefer a different type of logic. In this logic, our conclusions will not be true in all cases where the premises are true; they will be true only in MOST cases. (For a more precise meaning see the publications.)
Thus we obtain non-monotonic behavior: given additional knowledge, we might change our opinion. (The classical example: knowing that an object is a bird, you will conclude that it flies; receiving the additional knowledge that the object looks like a penguin, you will conclude that it does not fly.) To sketch the idea for the example above, let us first look at the following table:

in Symbols                                  P-Model 1  P-Model 2  P-Model 3
P([balls = red sp. and green sp.])             0.5       0.56       0.7
P([balls = red sp. and no green sp.])          0.2       0.14       0.0
P([balls = no red sp. and green sp.])          0.3       0.24       0.1
P([balls = no red sp. and no green sp.])       0.0       0.06       0.2

The first column lists the set of our 4 elementary events (in brackets). A probability measure assigns nonnegative numbers to these events which sum up to one. Therefore, the columns named P-Model describe (together with the first column) probability measures on our set of events. The columns are named P-Model (for probability model) because these probability measures additionally fulfill our premises: in every column we have 70% of balls with red spots and 80% of balls with green spots. In fact, there are infinitely many P-Models which fulfill our premises (resp. rules, resp. constraints). The method of Maximum Entropy now says:

  • Measure the (information theoretic, i.e. Shannon) entropy of each probability model.
  • Choose the P-Model with maximal entropy.
    (For our form of constraints it is theoretically guaranteed that there is exactly one model with that property, given that the constraints are consistent.)

The main argument is that, assuming a discrete number N of balls in the urn, the P-Models are not distributed uniformly: they are 'concentrated around' the P-Model with highest entropy (for the precise meaning of 'concentrated around' see any corresponding textbook). Therefore the P-Model with highest entropy is the best choice we can make if we are forced to make a decision.
In our example it is not difficult to measure the entropy: for every column we compute Sum_i ( - x_i * ln x_i ) (terms with x_i = 0 contribute 0, and the base of the logarithm does not matter for the comparison). We receive
for the column P-Model 1   0.347 + 0.322 + 0.361 + 0 = 1.030 .
for the column P-Model 2   0.325 + 0.275 + 0.343 + 0.169 = 1.112 .
for the column P-Model 3   0.250 + 0.0 + 0.230 + 0.322 = 0.802 .
If we have to decide between these three probability models, the method of maximum entropy chooses P-Model 2. Furthermore, P-Model 2 is also the maximum entropy model among all (infinitely many) P-Models of our two rules, and it reveals one of the properties of the method: as this P-Model is (given the constraints above) simply calculated by the assumption of independence (e.g. 0.56 = 0.7 * 0.8), the choice of maximum entropy is closely related to the use of independence assumptions. For further details we refer to the publications.
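The comparison above can be recomputed in a few lines (natural logarithm, with 0 * ln 0 taken as 0):

```python
import math

# The three candidate P-models from the table, as probability vectors over
# the four elementary events (same row order as in the table).
models = {
    "P-Model 1": [0.5, 0.2, 0.3, 0.0],
    "P-Model 2": [0.56, 0.14, 0.24, 0.06],
    "P-Model 3": [0.7, 0.0, 0.1, 0.2],
}

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

best = max(models, key=lambda name: entropy(models[name]))
print(best)   # P-Model 2, the independence model (0.56 = 0.7 * 0.8)
```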

Problems of the implementation

Now it should be clear that besides some deductive elements in our approach (the infinite set of probability models fulfilling the constraints), we mainly use an inductive logic (using the principle of maximum entropy to choose the model with minimal information from the infinite set of probability models). The problem for an implementation of this approach is not to find the maximum entropy model in an infinite set (it can be calculated in a direct manner) but to keep the size of the representation small. Adding a two-valued variable, for example, doubles the number of elementary events which have to be distinguished (i.e. we have exponential complexity in the number of variables).
To solve this problem for our applications, we use properties of Maximum Entropy to simplify the representation of the problem (without changing the result) already at the beginning of our calculations (see the use of the Principle of Independence and the use of the Principle of Indifference in our publications).




Examples of Context Sensitivity of Probability Theory


Example 1

The aim of example 1 is to demonstrate that we may have different pieces of information about the same proposition (piece of knowledge) in different contexts. We demonstrate this by the following example, where we prove that a given piece of probability information (C1) about a proposition c in context a does not imply anything about the possible values q1 and q2 in the queries (Q1) and (Q2) about c in the extended contexts 'a and b' resp. 'a or b'.


(C1) If I know a (and only a), then the probability of c is p1 in (0,1).
(Q1) If I know a and b, what is the probability q1 of c ?
(Q2) If I know a or b, what is the probability q2 of c ?


  Using '*' for 'and', '+' for 'or',
  and '-|>' for the conditional - i.e. P(a -|> b) is another notation for P(b | a) (:= P(a * b) / P(a)) - we get:


(C1) P(a -|> c) = p1;
(Q1) P(a * b -|> c) = q1;
(Q2) P(a + b -|> c) = q2;



In the following table the columns P-Model 1 and P-Model 2 describe two probability measures which both fulfill the constraint (C1). (In our notation we use '-' for 'not' and 'e' for a small positive number close to 0 (epsilon), which in particular is chosen smaller than p1.) The rows of the table correspond to the 8 elementary events a*b*c, ..., -a*-b*-c, whose probabilities are given either as P(elementary event) in column 1 or via the probability vector (v1,...,v8) in column 2.

El-Event         El-Variable   Values in P-Model 1   Values in P-Model 2
P( a *  b *  c)  v1            0                     e*p1/(1-p1)
P( a *  b * -c)  v2            e*(1-p1)/p1           0
P( a * -b *  c)  v3            e*p1/p1               0
P( a * -b * -c)  v4            0                     e*(1-p1)/(1-p1)
P(-a *  b *  c)  v5            0                     (1-p1-e)/(1-p1)
P(-a *  b * -c)  v6            (p1-e)/p1             0
P(-a * -b *  c)  v7            0                     0
P(-a * -b * -c)  v8            0                     0


P-Model 1 and P-Model 2 are P-models:
We have to show that all v1,...,v8 are nonnegative and v1 + v2 + ... + v8 = 1.
This holds because
p1 is in the interval (0,1) and
(e*(1-p1) + e*p1 + (p1-e))/p1 = 1   resp.   (e*p1 + e*(1-p1) + (1-p1-e))/(1-p1) = 1.

Satisfying (C1):
Using P(a)=v1+v2+v3+v4, P(a*c)=v1+v3, P(a*b)=v1+v2, P(a*b*c)=v1, P(a+b)=v1+v2+v3+v4+v5+v6, P((a+b)*c)=v1+v3+v5,
we easily compute from (C1), (Q1) and (Q2) that
p1 = (v1+v3) / (v1+v2+v3+v4)
q1 = (v1) / (v1+v2)
q2 = (v1+v3+v5)/(v1+v2+v3+v4+v5+v6)
We now get that (C1) is fulfilled in

  • P-Model 1, because
        (v1+v3) / (v1+v2+v3+v4) = e*p1 / ( e*(1-p1) + e*p1 ) = p1 / ( (1-p1) + p1 ) = p1
  • P-Model 2, because
        (v1+v3) / (v1+v2+v3+v4) = e*p1 / ( e*p1 + e*(1-p1)) = p1 / ( p1 + (1-p1) ) = p1

Computing q1:
In P-Model 1 we have: q1 = (v1) / (v1+v2) = 0/v2 = 0
In P-Model 2 we have: q1 = (v1) / (v1+v2) = v1 / (v1+0) = 1

More generally, since the set of P-Models satisfying the constraint (C1) is convex, q1 can take every value in [0,1] independently of the value of p1.

Computing q2:
In P-Model 1 we have: q2 = (v1+v3+v5) / (v1+v2+v3+v4+v5+v6) = v3 / (v2+v3+v6) = (e*p1/p1) / 1 = e, i.e. arbitrarily close to 0.
In P-Model 2 we have: q2 = (v1+v3+v5) / (v1+v2+v3+v4+v5+v6) = (v1+v5) / (v1+v4+v5) = (v1+v5) / 1
  = (e*p1 + 1-p1-e) / (1-p1) = (e*(p1-1) + (1-p1)) / (1-p1) = 1-e, i.e. arbitrarily close to 1.
(In both models the respective denominator sums to 1, as shown above.)

More generally, since the set of P-Models satisfying the constraint (C1) is convex, q2 can take every value in (0,1) independently of the value of p1.
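The two table columns of example 1 can be verified numerically. A small sketch; p1 and e are chosen freely subject to 0 < e < p1 < 1 (the concrete numbers are our own choice, not from the text):

```python
p1, e = 0.6, 0.01

# The two table columns (v1, ..., v8) from above.
model_1 = [0, e*(1-p1)/p1, e*p1/p1, 0, 0, (p1-e)/p1, 0, 0]
model_2 = [e*p1/(1-p1), 0, 0, e*(1-p1)/(1-p1), (1-p1-e)/(1-p1), 0, 0, 0]

def q1(v):  # P(a*b -|> c) = v1 / (v1+v2)
    return v[0] / (v[0] + v[1])

def q2(v):  # P(a+b -|> c) = (v1+v3+v5) / (v1+...+v6)
    return (v[0] + v[2] + v[4]) / sum(v[:6])

for v in (model_1, model_2):
    assert all(x >= 0 for x in v) and abs(sum(v) - 1) < 1e-12  # a P-model
    assert abs((v[0] + v[2]) / sum(v[:4]) - p1) < 1e-12        # (C1) holds

print(q1(model_1), q1(model_2))  # 0.0 1.0
print(q2(model_1), q2(model_2))  # ~e and ~(1-e)
```

Varying e (and p1) shows the claimed freedom of q1 and q2 despite the fixed constraint (C1).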

Example 2

The aim of example 2 is to demonstrate that we may have different information about the same variable in different contexts. We demonstrate this by proving that even the combination of the constraints (C2) and (C3) does not restrict the possible values of query (Q3) at all, and restricts those of query (Q4) only slightly.


(C2) If I know a (and only a), then the probability of c is p2.
(C3) If I know b (and only b), then the probability of c is p3.
(Q3) If I know a and b, what is the probability q3 of c ?
(Q4) If I know a or b, what is the probability q4 of c ?


Query (Q3) asks for the probability q3 of c given 'a and b' together with the probabilities p2 and p3, while query (Q4) asks for the probability q4 of c given 'a or b' together with the probabilities p2 and p3.
Formally :
  (We use '-' for 'not', '*' for 'and' and '+' for 'or' and the arrow in 's -|> t' as conditional, equivalent to the notation 't | s' where 'P(s -|> t) := P(s and t) / P(s)' )


(C2) P(a -|> c) = p2;
(C3) P(b -|> c) = p3;
(Q3) P(a * b -|> c) = q3;
(Q4) P(a + b -|> c) = q4;



The following table describes two probability models.


El-Event         El-Variable   Values in P-Model 3         Values in P-Model 4
P( a *  b *  c)  v1            0                           p2*p3 / (p2+p3-p2*p3)
P( a *  b * -c)  v2            (1-p2)*(1-p3) / (1-p2*p3)   0
P( a * -b *  c)  v3            p2*(1-p3) / (1-p2*p3)       0
P( a * -b * -c)  v4            0                           (1-p2)*p3 / (p2+p3-p2*p3)
P(-a *  b *  c)  v5            p3*(1-p2) / (1-p2*p3)       0
P(-a *  b * -c)  v6            0                           (1-p3)*p2 / (p2+p3-p2*p3)
P(-a * -b *  c)  v7            0                           0
P(-a * -b * -c)  v8            0                           0


Recall that

  • (C2) requires (v1+v3) / (v1+v2+v3+v4) = p2
    In P-Model 3 (C2) is fulfilled, as
    (v1+v3) / (v1+v2+v3+v4) = ( p2*(1-p3)/(1-p2*p3) ) / ( ((1-p2)*(1-p3) + p2*(1-p3)) / (1-p2*p3) ) = p2*(1-p3) / ( (1-p3)*((1-p2)+p2) ) = p2
    In P-Model 4 (C2) is fulfilled, as
    (v1+v3) / (v1+v2+v3+v4) = p2*p3 / ( p2*p3 + (1-p2)*p3 ) = p2
  • (C3) requires (v1+v5) / (v1+v2+v5+v6) = p3
    In P-Model 3 (C3) is fulfilled, as
    (v1+v5) / (v1+v2+v5+v6) = p3*(1-p2) / ( (1-p2)*(1-p3) + p3*(1-p2) ) = p3*(1-p2) / ( (1-p2)*((1-p3)+p3) ) = p3
    In P-Model 4 (C3) is fulfilled, as
    (v1+v5) / (v1+v2+v5+v6) = p2*p3 / ( p2*p3 + (1-p3)*p2 ) = p3
  • (Q3) asks for q3 = (v1) / (v1+v2)
  • (Q4) asks for q4 = (v1+v3+v5) / (v1+v2+v3+v4+v5+v6)


Concerning (Q3):

Query (Q3) demonstrates that we have complete freedom if we specialize the context: q3 can take every value in [0,1] independently of any (fixed) values of p2 and p3 in (0,1).


We have described two probability models (i.e. probability measures which fulfil the constraints (C2) and (C3)).
The column of P-Model 3 shows that q3 can take the value 0.
The column of P-Model 4 shows that q3 can take the value 1.
As the set of solutions of the constraints (C2) and (C3) is convex, any value between two solutions can be taken, so we have proved that q3 can take any value in [0,1].



Common Sense Example


This common sense example (from D. Dubois) models high values of q3 (here: the probability of risking your life in the combined context), given low values of p2 and p3:

If you take a shower, you do not risk your life [0.0, 0.01]
If you use a hairdryer, you do not risk your life [0.0, 0.01]
If you take a shower and use a hairdryer (at the same time), you do risk your life. [0.9, 1.00]

Concerning (Q4):

With query (Q4) we want to demonstrate how extending the context means losing information, as the probability asked for in (Q4) ranges over the interval [ p2*p3/(p2+p3-p2*p3), (p2+p3-2*p2*p3)/(1-p2*p3) ], which we obtain by using P-Model 4 for the lower bound and P-Model 3 for the upper bound. Again we recall that P-Model 3 and P-Model 4 fulfill the constraints (C2) and (C3). As the set of solutions of (C2) and (C3) is convex, any value between two solutions can occur, so we have proved that q4 can take any value in the described interval. Example: for p2 = p3 = 0.5 we obtain for q4 the interval [1/3, 2/3].
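The interval for q4 can be evaluated directly. A minimal sketch of the bounds derived above (the function name is ours):

```python
def q4_interval(p2, p3):
    """Bounds on q4: lower from P-Model 4, upper from P-Model 3."""
    lower = p2 * p3 / (p2 + p3 - p2 * p3)
    upper = (p2 + p3 - 2 * p2 * p3) / (1 - p2 * p3)
    return lower, upper

lo, hi = q4_interval(0.5, 0.5)
print(lo, hi)  # 1/3 and 2/3, as in the worked example
```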


Examples Concerning Truth Functionality

Example 3: The aim of example 3 is to demonstrate that we can specify certain information for single statements while still having additional freedom to specify different information for their combination. The classical notion of missing truth functionality is included as a special case (see the remark below). In general terms, truth functionality is absent if there is no function which allows one to calculate the degree of belief of a compound sentence from the degrees of belief of the single sentences.


(C5) If I know a (and only a), then the probability of b is p5.
(C6) If I know a (and only a), then the probability of c is p6.
(Q5) If I know a (and only a), what is the probability q5 of (b and c) ?


Query (Q5) asks for the probability q5 of (b and c) given a and the probabilities p5 and p6.


  (We use '-' for 'not', '*' for 'and' and '+' for 'or',
  and the arrow in 's -|> t' as conditional, equivalent to 't | s', where 'P(s -|> t) := P(s and t) / P(s)'.)


(C5) P(a -|> b) = p5;
(C6) P(a -|> c) = p6;
--- ---------- ---
(Q5) P(a -|> b*c) = q5;




The classical concept of absent truth functionality is implied by the example above, as the statement
  'P(a*b) is no function of P(a) and P(b)'  
can be reformulated as
  'P(True -|> a*b) is no function of P(True -|> a) and P(True -|> b)'.
This means that we consider conditional probability as the basic construct and unconditional probability as a special case of conditional probability.


The following table describes two probability models.


El-Event         El-Variable   P-Model 5 (p5+p6 <= 1)   P-Model 6 (p5+p6 > 1)   P-Model 7
P( a *  b *  c)  v1            0                        p5+p6-1                 min(p5,p6)
P( a *  b * -c)  v2            p5                       1-p6                    p5 - min(p5,p6)
P( a * -b *  c)  v3            p6                       1-p5                    p6 - min(p5,p6)
P( a * -b * -c)  v4            1-p5-p6                  0                       1 - max(p5,p6)
P(-a *  b *  c)  v5            0                        0                       0
P(-a *  b * -c)  v6            0                        0                       0
P(-a * -b *  c)  v7            0                        0                       0
P(-a * -b * -c)  v8            0                        0                       0


Recall that

  • (C5) requires (v1+v2) / (v1+v2+v3+v4) = p5
  • (C6) requires (v1+v3) / (v1+v2+v3+v4) = p6
  • (Q5) asks for q5 = (v1) / (v1+v2+v3+v4)


Concerning (Q5):

Query (Q5) demonstrates that the probability q5 is not determined by the constraints (C5) and (C6).


We first show that the lower bound of q5 is max(0, p5+p6-1).


Case p5+p6 <= 1:
  In P-Model 5, q5 = 0 / (0 + p5 + p6 + 1-p5-p6) = 0.
Case p5+p6 > 1:
  In P-Model 6, q5 = (p5+p6-1) / (p5+p6-1 + (1-p6) + (1-p5)) = p5+p6-1.
We therefore obtain the lower bound max(0, p5+p6-1) for q5. This bound corresponds to the minimal overlap of b and c (within a): the two sets overlap only if there is no other possibility.


We also show that the upper bound for q5 is min(p5,p6).
In P-Model 7, q5 = min(p5,p6). This is the other extreme case, as the overlap of b and c (within a) is maximal if (a and b) is contained in (a and c) (in which case q5 = p5) or (a and c) is contained in (a and b) (in which case q5 = p6).


As the set of solutions of the constraints is convex, q5 can take every value in the interval [max(0,p5+p6-1), min(p5,p6)] for fixed values p5 and p6 in [0,1].


Numerical Example for Query (Q5)


For p5=0.45 and p6=0.55 we obtain values in [0.0, 0.45], which is not very useful information.
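The interval for q5 can be computed in one line. A minimal sketch of the bounds derived above (the function name is ours):

```python
def q5_interval(p5, p6):
    """Interval [max(0, p5+p6-1), min(p5, p6)] for q5 = P(a -|> b*c)."""
    return max(0.0, p5 + p6 - 1), min(p5, p6)

lo, hi = q5_interval(0.45, 0.55)
print(lo, hi)  # the interval [0.0, 0.45] of the numerical example
```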



Common Sense Examples


We categorize classes (e.g. Germans, birds, etc.) by typical properties.
Nevertheless, in many cases there will be no individual in the class that possesses all the typical properties. Formally this means that

'P(a -|> b*c*d...)'

may be 0, while the single probabilities

P(a -|> b), P(a -|> c), ...

are very close to 1.


Another example is the so-called lottery paradox:
We do not expect any single ticket i to win, i.e. for some small e

P(ticket i does not win) = (1-e)

(written as a conditional: P(TRUE -|> [ticket i does not win]) = (1-e) ),
but if we conclude that no ticket wins at all, i.e. assert the conjunction of all these statements for the different tickets, we make an error, as

P(TRUE -|> (ticket i does not win) and (ticket j does not win) and ...) = 0;
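The lottery paradox can be sketched numerically, assuming a uniform lottery in which exactly one of n tickets is certain to win (n = 1000 is our own choice):

```python
from fractions import Fraction

n = 1000
# Each single statement 'ticket i does not win' is almost certain ...
p_single = 1 - Fraction(1, n)
# ... but the conjunction 'no ticket at all wins' is the empty event:
p_conjunction = Fraction(0)

assert p_single > Fraction(99, 100)  # close to 1 for every ticket
assert p_conjunction == 0            # yet the conjunction is impossible
```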
Structure of the Cost Matrix

Explanation of the Cost Matrix

Our method of translating probabilities into decisions can be described as searching for the decision whose expected average costs are minimal.
To this end, each position of our matrix describes the costs (theoretical, global, independent of institutions) given an actual fault situation and an actual decision.
Hypotheses and Decisions   (Certified) Fault 1                   (Certified) Fault 2
Decision / Treatment 1     Costs of Treatment 1, given Fault 1   Costs of Treatment 1, given Fault 2
Decision / Treatment 2     Costs of Treatment 2, given Fault 1   Costs of Treatment 2, given Fault 2
Decision / Treatment 3     Costs of Treatment 3, given Fault 1   Costs of Treatment 3, given Fault 2

(N.B.: Under 'Certified Fault' the entry 'no fault' is also possible; correspondingly, 'treatment' also includes the decision 'no treatment'.)

An (invented) example with numbers:

                         Fault / Certified Diagnosis 1   Fault / Certified Diagnosis 2
Decision / Treatment 1   2500                            3800
Decision / Treatment 2   4000                            3000
Decision / Treatment 3   2000                            7000


From this matrix the 'additional costs of non-optimal decisions' can be calculated by subtracting from each column its smallest element (which describes the treatment of choice when the fault associated with that column is present).

                         Fault / Certified Diagnosis 1   Fault / Certified Diagnosis 2
Decision / Treatment 1   500                             800
Decision / Treatment 2   2000                            0
Decision / Treatment 3   0                               4000


When Fault 1 is certain, the correct choice is clearly Treatment 3; when Fault 2 is certain, the obvious choice is Treatment 2. What role can Treatment 1 play? This example already shows that the uncertainty about the fault in question can by itself lead to the choice of a new treatment: suppose we add to the matrix the probabilities with which a specialist classifies the existing faults in a specific situation, and assume the specialist holds Fault 1 to be only somewhat more likely than Fault 2 (which we formalize as probabilities of 60% and 40%). If we further extend the matrix by the calculation of the average costs of each treatment, we get the following extended matrix:

                                         Fault / Certified Diagnosis 1   Fault / Certified Diagnosis 2   Average costs per treatment
Probability of the Certified Diagnosis   60 %                            40 %
Decision / Treatment 1                   500                             800                             0.6 *  500 + 0.4 *  800 =  620
Decision / Treatment 2                   2000                            0                               0.6 * 2000 + 0.4 *    0 = 1200
Decision / Treatment 3                   0                               4000                            0.6 *    0 + 0.4 * 4000 = 1600


Here Treatment 1 is the right alternative, as it leads to costs of 620 units, while in this situation Treatment 2 costs 1200 units and Treatment 3 costs 1600 units. (We want to point out again that the term 'costs' here includes later effects and other risks.)
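The expected-cost rule applied to the extended matrix above can be sketched as follows (function and variable names are ours):

```python
def best_treatment(costs, fault_probs):
    """Return the 0-based index and expected cost of the treatment
    whose expected additional costs are minimal."""
    expected = [sum(p * c for p, c in zip(fault_probs, row))
                for row in costs]
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected[best]

# The 'additional costs of non-optimal decisions' matrix from above.
additional_costs = [
    [500, 800],    # Treatment 1
    [2000, 0],     # Treatment 2
    [0, 4000],     # Treatment 3
]
idx, cost = best_treatment(additional_costs, [0.6, 0.4])
print(idx + 1, cost)  # Treatment 1, with expected costs of 620 units
```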





Publications and Further Literature



Publications (and technical reports) concerning PIT

  • B. Fronhöfer, M. Schramm;
    A Probability Theoretic Analysis of Score Systems.
    Workshop Uncertainty in AI at KI 2001, Vienna, Austria. FS01
  • M. Schramm, B. Fronhöfer;
    On Efficient Translations of Score Systems into Probabilistic Systems.
    Workshop Uncertainty in AI at KI 2001, Vienna, Austria. SF01
  • M. Schramm, B. Fronhöfer;
    PIT: A System for Reasoning with Probabilities.
    Workshop Uncertainty in AI at KI 2001, Vienna, Austria. SF01b
  • M. Schramm and B. Fronhöfer.
    Completing Incomplete Bayesian Networks.
    Workshop Conditionals, Information, and Inference, Hagen, Germany, May 13-15, 2002. SF02
    Springer 2004, LNCS 3301, pp. 200-218;
  • M. Schramm, W. Ertel, W. Rampf;
    Bestimmung der Wahrscheinlichkeit einer Appendizitis mit LEXMED, Biomedical Journal, Heft 57, April 2001;
  • M.Schramm, W.Ertel, W.Rampf;
    Automatische Diagnose von Appendizitis mit LEXMED.
    FH Ravensburg-Weingarten, Technical Report, January 2001, 10 pp. SER01
  • M.Schramm, W.Ertel, W.Rampf;
    Automatische Diagnose von Appendizitis mit LEXMED.
    FH Ravensburg-Weingarten, Technical Report (long version), April 2000; SER00
  • P. Trunk;
    Applying Graphical Models to Reasoning with Maximum Entropy
    Diploma Thesis at TU Munich, Chair of Computer Science (Prof. E. Jessen); supervised by M. Schramm and V. Fischer; January 2000; Tru00
  • W.Ertel, M.Schramm; ES00
    Combining Expert Knowledge and Data via Probabilistic Rules with an Application to Medical Diagnosis;
    UAI 2000 Workshop on Fusion of Domain Knowledge with Data for Decision Support, June 2000, Stanford CA.
  • M. Schramm;
    Über das Prinzip der Minimierung der relativen Entropie.
    FH Ravensburg-Weingarten, Technical Report, January 2000. Sch00b
  • M.Schramm, W.Ertel; SE99
    Reasoning with Probabilities and Maximum Entropy: The System PIT and its Application in LEXMED; in: K. Inderfurth et al (eds), Operations Research Proceedings 1999, pp. 274-280, Springer
  • W.Ertel, M.Schramm; ES99
    Combining Data and Knowledge by MaxEnt-Optimisation of Probability Distributions,
    PKDD-99 (3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, Prague, September 1999), Springer
  • M.Schramm, V. Fischer:
    Probabilistic Reasoning with Maximum Entropy: The System PIT.
    SF97 in Proc. of Workshop für Logikprogrammierung WLP97
  • M. Schramm,   Indifferenz, Unabhängigkeit und maximale Entropie: Eine wahrscheinlichkeitstheoretische Semantik für Nicht-Monotones Schliessen, in: Dissertationen zur Informatik, Nr. 4, CS-Press, München, 1996,  (Sch96)
  • M.Schramm, St. Schulz.
    Combining Propositional Logic with Maximum Entropy Reasoning on Probability Models; ECAI '96 Workshop on   Integrating Nonmonotonicity into Automated Reasoning Systems SS96
  • V. Fischer, M. Schramm;
    tabl - A Tool for efficient Compilation of probabilistic Constraints, TechReport der TU München, FS96
  • M. Schramm, M. Greiner,   Foundations: Indifference, Independence and Maxent; in: Proc. of the 14th Int. Maximum Entropy Workshop 1994, Maximum Entropy and Bayesian Methods.   Kluwer Academic Publ. 1994, Ed.: J. Skilling SG95


General Literature


  • Aci96
  • Silvia Acid and Luis M. de Campos.
    An algorithm for finding minimum d-separating sets in belief networks.
    Technical report, Departamento de Ciencias de la Computación e I.A., Universidad de Granada, Spain, 1996


  • Ada66
  • E.W. Adams,
    `Probability and the Logic of Conditionals', in Aspects of Inductive Logic, eds., J. Hintikka and P.Suppes, North Holland, Amsterdam, (1966).


  • Ant97
  • G. Antoniou.
    Nonmonotonic reasoning.
    MIT Press, Cambridge, Massachusetts, 1997.


  • Bly72
  • C.R. Blyth,
    `On Simpson's Paradox and the Sure-Thing Principle', Journal of the American Statistical Association, 67(338), 363-381, (June 1972).


  • Bry92
  • R.E. Bryant,
    `Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams', Technical Report CMU-CS-92-160, School of Computer Science, Carnegie Mellon University, (1992).


  • Cen90
  • Yair Censor, Alvaro R. De Pierro, Tommy Elfving, Gabor T. Herman, and Alfredo N. Iusem.
    On Iterative Methods for Linearly Constrained Entropy Maximization, volume 24 of Numerical Analysis and mathematical Mod.
    PWN Polish Scientific Publishers, 1990.


  • CGH97
  • Enrique Castillo, Jose Manuel Gutierrez, and Ali S. Hadi.
    Expert Systems and Probabilistic Network Models.
    Springer Verlag, 1997.


  • Che83
  • P. Cheeseman,
    `A Method of Computing Generalized Bayesian Probability Values for Expert Systems', in Proceedings of the 8th International Joint Conference on Artificial Intelligence (IJCAI-83), Karlsruhe, volume 1, pp. 198-202. Morgan Kaufmann, (1983).


  • Cow99
  • Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter.
    Probabilistic Networks and Expert Systems.
    Springer, 1999.


  • Dom91
  • F.T. de Dombal.
    Diagnosis of Acute Abdominal Pain.
    Churchill Livingstone, 1991.
    First edition: 1980.


  • DP96
  • D. Dubois and H. Prade.
    Non-standard theories of uncertainty in plausible reasoning.
    In G. Brewka, editor, Principles of Knowledge Representation. CSLI Publications, 1996.


  • Edw95
  • David Edwards.
    Introduction to graphical Modelling.
    Springer, 1995.


  • EGS95
  • W.Ertel, C. Goller, M.Schramm, S. Schulz;
    Rule Based Reasoning / Neural Nets
    - Analysis Report, ESPRIT 9119 - MIX Deliverable D4.2, 1995


  • ES99 (ps) (pdf)
  • W. Ertel and M. Schramm.
    Combining Data and Knowledge by MaxEnt-Optimization of Probability Distributions.
    In PKDD'99 (3rd European Conference on Principles and Practice of Knowledge Discovery in Databases), LNCS, Prague, 1999. Springer Verlag.


  • ES00 (ps) (pdf)
  • W. Ertel and M. Schramm.
    Combining Expert Knowledge and Data via Probabilistic Rules with an Application to Medical Diagnosis. In UAI-2000 Workshop on Fusion of Domain Knowledge with Data for Decision Support, Stanford CA.


  • FS96
  • (ps) (pdf) Volker Fischer and Manfred Schramm.
    tabl - a Tool for Efficient Compilation of Probabilistic Constraints.
    Technical Report TUM-I9636, TU München, November 1996, 38 S.


  • FS01
  • B. Fronhöfer and M. Schramm.
    A Probability Theoretic Analysis of Score Systems.
    Workshop Uncertainty in AI at KI 2001, Vienna, Austria.
    Technical Report of the Fernuniversität Hagen
    Informatik Berichte 287 - 8/2001 (ps) (pdf)


  • Gam96
  • Johann Gamper and Friedrich Steinmann.
    Medizinische Expertensysteme - eine kritische Betrachtung.
    APIS Zeitschrift für Politik, Ethik, Wissenschaft und Kultur im Gesundheitswesen, 1996.


  • Gig95
  • G. Gigerenzer and Ulrich Hoffrage.
    How to Improve Bayesian Reasoning Without Instruction: Frequency Formats.
    Psychological Review, 102(4), pp. 684-704, 1995.


  • Gig96
  • G. Gigerenzer.
    The psychology of good judgment: frequency formats and simple algorithms.
    Medical Decision Making, 1996.


  • GHK92
  • A.J. Grove, J.Y. Halpern, and D. Koller,
    `Random Worlds and Maximum Entropy', in Proceedings of the 7th Annual IEEE Symposium on Logic in Computer Science (LICS). IEEE, (1992).


  • GMP90
  • M. Goldszmidt, P. Morris, and J. Pearl
    `A Maximum Entropy Approach to Non-Monotonic Reasoning', in Proc. of the Eighth National Conference on Artificial Intelligence, Boston, pp. 646-652. AAAI, AAAI Press/MIT Press, (1990).


  • Gru98
  • Peter Gruenwald.
    The minimum description length principle and reasoning under uncertainty.


  • Gru00
  • Peter Gruenwald.
    Maximum entropy and the glasses you are looking through.
    UAI, 2000.


  • GS96
  • (ps), (pdf), M.Greiner, M. Schramm,
    A Probabilistic Stopping Criterion for the Evaluation of Benchmarks,
    Technical Report, TU München, TUM-I9638, November 1996   (rev. Sep. 98)


  • GS97
  • Charles M.Grinstead and J.Laurie Snell.
    Introduction to Probability, 2nd Ed. (revised).
    American Mathematical Society (AMS), 1997.


  • Hal99
  • Joseph Y. Halpern.
    A counterexample to theorems of Cox and Fine.
    J. of Artificial Intelligence Research, 10:67-85, 1999.


  • HC73
  • Arthur Hobson and Bin-Kang Cheng.
    A comparison of the Shannon and Kullback Information Measures.
    J. of Statistical Physics, 7(4):301-310, 1973.


  • HH88
  • D.E.Heckerman and E.J.Horvitz.
    The myth of modularity in rule-based systems for reasoning with uncertainty.
    Uncertainty in Artificial Intelligence 2, 1988.


  • HUG
  • Homepage of HUGIN.


  • Jae96
  • Manfred Jaeger.
    Representation independence of nonmonotonic inference relations.
    In KR Morgan Kaufmann, San Francisco, 1996.


  • Jae97
  • Manfred Jaeger.
    Relational Bayesian networks.
    In UAI, 1997.


  • Jay57
  • E.T. Jaynes.
    Information Theory and Statistical Mechanics I (1957).
    In R.D. Rosenkrantz, editor, E.T. Jaynes: Papers on Probability, Statistics and Statistical Physics, pages 4-16. Kluwer Academic Publishers, 1989.


  • Jay78
  • E.T. Jaynes.
    Where do we stand on Maximum Entropy?
    In R.D. Rosenkrantz, editor, Papers on Probability, Statistics and Statistical Physics, pages 210-314. Kluwer Academic Publishers, 1978.


  • Jay79
  • E.T. Jaynes.
    Concentration of distributions at entropy maxima.
    In Rosenkrantz, editor, Papers on Probability, Statistics and statistical Physics. D. Reidel Publishing Company, 1982.


  • Jay82
  • E.T. Jaynes,
    `On the Rationale of Maximum Entropy Methods', Proc. of the IEEE, 70(9), 939-952, (1982).


  • Jay90
  • E.T. Jaynes.
    Probability Theory as Logic.
    In Paul F. Fougere, editor, Maximum Entropy and Bayesian Methods. Kluwer Academic, MIT Cambridge USA, 1990.


  • Jay95
  • E.T. Jaynes,
    `Probability Theory: The Logic of Science'.
    Continually updated, fragmentary edition of approximately 700 pages, 1995.


  • Kah72
  • Daniel Kahneman and Amos Tversky.
    Subjective Probability: A Judgement of Representativeness .
    Cognitive Psychology; 3, pp. 430-454, 1972.


  • Kah82
  • Daniel Kahneman and Amos Tversky.
    On the study of statistical intuitions.
    Cognition; 14?, pp. 123-141, 1982.


  • Kane89
  • Thomas B. Kane.
    Maximum Entropy in Nilsson's Probabilistic Logic.
    In IJCAI, volume 1, 1989.


  • Ker97a
  • G. Kern-Isberner.
    A conditional-logical approach to minimum cross-entropy,
    in: Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science (STACS-97), Springer 1997.
    Abstract, PS, BibTeX-Entry


  • Ker97b
  • G. Kern-Isberner.
    A logically sound method for uncertain reasoning with quantified conditionals;
    in: Proceedings of the International Joint Conference on Qualitative & Quantitative Practical Reasoning (ECSQARU/FAPR '97), 1997.
    Abstract, PS, BibTeX-Entry


  • Ker98
  • G. Kern-Isberner.
    Characterizing the Principle of Minimum Cross-entropy within a Conditional-logical Framework.
    AI, 98:169-208, 1998.


  • Ker99
  • Gabriele Kern-Isberner.
    Revising and updating probabilistic beliefs.
    Frontiers in belief revision, Kluwer, pages 329-344, 1999.


  • Ker00
  • Gabriele Kern-Isberner.
    Conditional Indifference and Conditional Preservation.
    In Proceedings 8th International Workshop on Nonmonotonic Reasoning (NMR'2000) , 2000.


  • KK92
  • J. N. Kapur and H. K. Kesavan.
    Entropy Optimization Principles with Applications.
    Academic Press, Inc., Harcourt Brace Jovanovich, Publishers, 1992.


  • KL99
  • G. Kern-Isberner and T. Lukasiewicz.
    Probabilistic logic programming under maximum entropy.
    Technical Report 9903, Justus-Liebig Universität Gießen, 1999.


  • Lau88
  • S.L. Lauritzen and D.J. Spiegelhalter (1988).
    Local computations with probabilities on graphical structures and their application to expert systems.
    In J. Royal Statistical Society B, 50(2), pages 157-224.


  • Lau96
  • S.L. Lauritzen.
    Graphical Models.
    Oxford Science Publications, 1996.


  • LB82
  • John F. Lemmer and S. W. Barth.
    Efficient Minimum Information Updating for Bayesian Inferencing in Expert Systems.
    AAAI,1982, p. 424-427,


  • Lem83
  • John F. Lemmer.
    Generalized Bayesian updating of incompletely specified distributions,
    Large Scale Systems 5,1983, p. 51-68, Elsevier Science, North Holland


  • Lif89
  • V. Lifschitz, `Benchmark Problems for formal Non-Monotonic Reasoning', in Non-Monotonic Reasoning: 2nd International Workshop, ed., Reinfrank et al, volume 346 of LNAI, pp. 202-219. Springer, (1989).


  • Luk00
  • Th. Lukasiewicz.
    Credal networks under maximum entropy.
    Uncertainty in AI, pages 363-370, 2000.


  • Mal88
  • F.M. Malvestuto.
    Existence of extensions and product extensions for discrete probability distributions.
    Computational Statistics and Data Analysis, ?:61-77, 1988.


  • Mal89
  • F.M. Malvestuto.
    Computing the maximum-entropy extension of given discrete probability distributions.
    Computational Statistics and Data Analysis, 8:299-311, 1989.


  • Mak94
  • D. Makinson.
    General patterns in nonmonotonic reasoning.
    In D.M. Gabbay, C.H. Hogger, and J.A. Robinson, editors, Handbook of Logic in Artificial Intelligence and Logic Programming, volume 3, pages 35-110. Oxford University Press, 1994.


  • McD87
  • Drew McDermott,
    A critique of pure reason
    Computational Intelligence 3, p. 151-160, 1987.


  • NH90
  • E. Neufeld and J.D. Horton,
    `Conditioning on Disjunctive Knowledge: Simpson's Paradox in Default Logic', in Uncertainty in Artificial Intelligence, eds., M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, volume 5, 117-125, Elsevier Science, (1990).


  • Nil86
  • Nils J. Nilsson,
    Probabilistic Logic,
    Artificial Intelligence , 1986,28,p. 71-87


  • Par94
  • J.B. Paris.
    The Uncertain Reasoner's Companion: A Mathematical Perspective.
    Cambridge University Press, 1994


  • Pea88
  • J. Pearl.
    Probabilistic Reasoning in Intelligent Systems.
    Morgan Kaufmann, San Mateo, CA, 1988.


  • PIT
  • Homepage of PIT.


  • Poo89
  • D. Poole .
    What the Lottery Paradox Tells Us about Default Reasoning
    KR'89: Proc. of the First International Conference on Principles of Knowledge Representation and Reasoning, 1989, Kaufmann, eds.: R. J. Brachman and H. J. Levesque and R. Reiter, pp. 333-340,


  • PV90
  • J.B. Paris and A. Vencovska.
    A Note on the Inevitability of Maximum Entropy.
    International Journal of Approximate Reasoning, 3:183-223, 1990.


  • Qui98
  • J.R. Quinlan.


  • Ren82
  • A. Rényi.
    Tagebuch über die Informationstheorie.
    WK (Wissenschaft und Kultur) Band 34. Birkhäuser Verlag, 1982.


  • RM96
  • W. Rödder and C.-H. Meyer.
    Coherent knowledge processing at maximum entropy by spirit.
    KI 96 (Dresden), 1996.


  • Sch96
  • M. Schramm.
    Indifferenz, Unabhängigkeit und maximale Entropie: Eine Wahrscheinlichkeitstheoretische Semantik für nichtmonotones Schließen.
    Number 4 in Dissertationen zur Informatik. CS Press, München, 1996.


  • Sch00
  • M. Schramm
    Simulation von Scoresystemen durch probabilistische Systeme.
    Technical Report, University of Applied Science Ravensburg-Weingarten, May 2000.


  • Sch00b
  • M. Schramm
    Über das Prinzip der Minimierung der relativen Entropie.
    Technical Report, PIT-Systems, January 2000, submitted


  • Sch01
  • M. Schramm
    On Minimisation of Relative Entropy .
    Technical Report, PIT-Systems, Jan 2001.


  • SF01
  • M. Schramm and B. Fronhöfer.
    On Efficient Decision Preserving Translations of Score Systems into Probabilistic Systems.
    Workshop Uncertainty in AI at KI 2001, Vienna, Austria.
    Technical Report of the Fernuniversität Hagen
    Informatik Berichte 287 - 8/2001 (ps) (pdf)


  • SF01b (ps) (pdf)
  • M. Schramm and B. Fronhöfer.
    PIT: A System for Reasoning with Probabilities.
    Workshop Uncertainty in AI at KI 2001, Vienna.


  • SF02 (ps) (pdf)
  • M. Schramm and B. Fronhöfer.
    Completing Incomplete Bayesian Networks.
    Workshop Conditionals, Information, and Inference Hagen, Germany, May 13-15, 2002


  • SE99 (ps) (pdf)
  • M. Schramm and W. Ertel
    Reasoning with Probabilities and Maximum Entropy: The System PIT and its Application in LEXMED.
    In SOR, Operations Research Proceedings 1999, ed.: Inderfurth K., Springer, 1999.


  • SER00
  • M. Schramm and W. Ertel and W. Rampf
    Automatische Diagnose von Appendizitis mit LEXMED (long version)
    FH Ravensburg-Weingarten, Technical Report, April 2000;


  • SER01
  • M. Schramm and W. Ertel and W. Rampf
    Automatische Diagnose von Appendizitis mit LEXMED
    January 2001, submitted.


  • SF97
  • M. Schramm, V. Fischer
    Probabilistic Reasoning with Maximum Entropy - The System PIT
    Workshop für Logikprogrammierung 1997 , (ps) (pdf)
    in Proc. WLP 97
    extended summary (in German) (ps) (pdf)


  • SG95
  • M. Schramm and M. Greiner.
    Foundations: Indifference, Independence & Maxent.
    In J. Skilling, editor, Maximum Entropy and Bayesian Methods in Science and Engineering (Proc. of the MaxEnt'94). Kluwer Academic Publishers, 1995.
    corresponding Technical Report of the TU Munich
    (ps) (pdf)


  • Sha96
  • Glenn Shafer.
    Probabilistic Expert Systems.
    CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1996.


  • Sha48
  • Claude E. Shannon.
    The mathematical theory of communication.
    The Bell System Technical Journal, 27:379-423, 1948.


  • SS96
  • M. Schramm and St. Schulz.
    Combining Propositional Logic with Maximum Entropy Reasoning on Probability Models; ECAI '96 Workshop on Integrating Nonmonotonicity into Automated Reasoning Systems (ps) (pdf)


  • SJ80
  • J.E. Shore and R.W. Johnson.
    Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross Entropy.
    IEEE Transactions on Information Theory, IT-26(1):26-37, 1980.


  • Ski88
  • J. Skilling.
    The Axioms of Maximum Entropy.
    In G.J. Erickson and C.R. Smith, editors, Maximum Entropy and Bayesian Methods in Science and Engineering, volume 1 - Foundations. Kluwer Academic Publishers, 1988.


  • SL88
  • S. Lauritzen and D.J. Spiegelhalter.
    Local computations with probabilities on graphical structures and their application to expert systems.
    Journal of the Royal Statistical Society, Series B, 50(2):157-224, 1988.


  • SPI
  • Homepage of SPIRIT.


  • Stu89
  • M. Studeny.
    Multiinformation and the problem of characterization of conditional independence relations.
    Problems of Control and Information Theory, 18(1):3-16, 1989.


  • Stu92
  • Milan Studeny.
    Conditional independence relations have no finite complete characterization.
    Information Theory, Statistical Decision Functions and Random Processes. Transactions of the 11th Prague Conference vol. B, pages 377-396, 1992.


  • Stu93
  • Milan Studeny.
    Formal properties of conditional independence in different calculi of AI.
    Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 341-348, 1993.


  • Stu97
  • Milan Studeny.
    Semigraphoids and structures of probabilistic conditional independence.
    Annals of Mathematics and Artificial Intelligence 21, pages 71-98, 1997.


  • SV98
  • M. Studeny and J. Vejnarova.
    The multiinformation function as a tool for measuring stochastic dependence.
    In Michael I. Jordan, editor, Learning in Graphical Models, pages 261-297. Kluwer Academic Publishers, 1998.


  • SW76
  • Claude E. Shannon and Warren Weaver.
    Mathematische Grundlagen der Informationstheorie.
    Reihe Scientia Nova. R. Oldenbourg, 1976.
    First edition 1948.


  • Tar85
  • Robert E. Tarjan.
    Decomposition by clique separators.
    Discrete Mathematics, 55:221-232, 1985.


  • Tru00
  • P. Trunk.
    Applying Graphical Models to Reasoning with Maximum Entropy.
    Diploma thesis, Lehrstuhl für Informatik VIII, Prof. E. Jessen, TU München, January 2000.
    (Advisors: M. Schramm, V. Fischer)


  • Tve74
  • Amos Tversky and Daniel Kahneman.
    Judgment under Uncertainty: Heuristics and Biases.
    Science, 185:1124-1131, 1974.


  • Uff95
  • Jos Uffink.
    Can the maximum entropy principle be explained as a consistency requirement?
    Studies in History and Philosophy of Modern Physics, 26B:223-261, 1995.


  • Uff96
  • Jos Uffink.
    The Constraint Rule of the Maximum Entropy Principle.
    Studies in History and Philosophy of Modern Physics, 27:47-79, 1996.


  • Wen90
  • W. X. Wen.
    Minimum cross entropy reasoning in recursive causal networks.
    In R.D. Shachter, T.S. Levitt, L.N. Kanal, and J.F. Lemmer, editors, Uncertainty in Artificial Intelligence 4, pages 105-119. Elsevier Science, 1990.


  • Wil98
  • Jon Williamson.
    The Philosophy of Diagnosis in relation to Artificial Intelligence.
    PhD thesis, King's College London, 1998.


  • Wil00
  • Jon Williamson.
    Approximating discrete probability distributions with Bayesian networks.
    In Proceedings of the International Conference on Artificial Intelligence in Science and Technology, 2000.


  • Whi90
  • J. Whittaker.
    Graphical Models in Applied Multivariate Statistics.
    John Wiley, 1990.



Literature and links concerning Probabilistic Reasoning (Theory)

Literature (and contacts) concerning reasoning with probabilities and maximum entropy:

Further Links concerning Entropy

Probabilities and Common Sense Reasoning

  • Gerd Gigerenzer, Laura Martignon, Research Area "Adaptive Behavior and Cognition", Max-Planck-Institut für Bildungsforschung, Berlin (publications)

Probabilities and Graphs: Bayesian Networks, Learning of Bayesian Networks, etc.

Further links of probabilistic reasoning


Uncertainty in AI: UAI




Literature concerning Probabilistic Reasoning (Applications, Implementations)

Expert systems in Medicine:

Related Systems

  • Another system working with probabilities and the principle of maximum entropy has been developed at the FernUniversität Hagen, supervised by Prof. Rödder and Dr. Meyer (Chair of Operations Research): the system SPIRIT.
  • A well-known shell implementing Bayesian networks is Hugin.
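The systems listed above (SPIRIT, and PIT in the references) all compute a maximum-entropy completion of linear probabilistic constraints. As a purely illustrative sketch — not the code of any of these systems; the function name and the toy constraints are invented for this example — iterative proportional fitting (IPF) computes the ME distribution for a set of simple event constraints:

```python
from itertools import product

# Toy maximum-entropy completion over two binary variables A and B,
# subject to the linear probabilistic constraints P(A=1)=0.6 and P(B=1)=0.7.
# IPF starts from the uniform (maximum-entropy) distribution and repeatedly
# rescales the probability mass inside/outside each constraint event.

def max_entropy_ipf(constraints, n_iter=100):
    """constraints: list of (event, target) pairs; an event is a set of
    worlds (tuples), target the required probability of that event."""
    worlds = list(product([0, 1], repeat=2))    # all four (a, b) worlds
    p = {w: 1.0 / len(worlds) for w in worlds}  # uniform starting point
    for _ in range(n_iter):
        for event, target in constraints:
            mass = sum(p[w] for w in event)
            for w in worlds:                    # rescale inside vs. outside
                p[w] *= (target / mass) if w in event else ((1 - target) / (1 - mass))
    return p

A1 = {(1, 0), (1, 1)}   # worlds where A = 1
B1 = {(0, 1), (1, 1)}   # worlds where B = 1
p = max_entropy_ipf([(A1, 0.6), (B1, 0.7)])
# With only marginal constraints, maximum entropy yields independence:
# P(A=1, B=1) = 0.6 * 0.7 = 0.42
```

Real systems such as PIT handle conditional constraints and much larger world sets, which requires the graph-based decompositions discussed in the references above; this sketch only shows the core fixed-point idea.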

Further Literature (expert systems):


© Dr. Manfred Schramm, published under the open-source GPL 3 license