Entropy as a Measure of Uncertainty, as von Neumann Interpreted Boltzmann's Meaning

"Information Theory, Relative Entropy and Statistics," François Bavaud

Introduction: the relative entropy as an epistemological functional

"Shannon’s Information Theory (IT) (1948) definitely established the purely mathematical nature of entropy and relative entropy, in contrast to the previous identification by Boltzmann (1872) of his “H-functional” as the physical entropy of earlier thermodynamicians (Carnot, Clausius, Kelvin). The following declaration is attributed to Shannon (Tribus and McIrvine 1971):

*"My greatest concern was what to call it. I thought of calling it “information”, but the word was overly used, so I decided to call it “uncertainty”. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.'*”

"I thought of calling it 'information,' but ... I decided to call it 'uncertainty' ... Von Neumann told me ... your uncertainty function has been used in statistical mechanics under that name, so it already has a name"

"You should call it entropy"

*"In a nutshell, the relative entropy K(f||g) has two arguments f and g, which both are probability distributions belonging to the same simplex. Despite formally similar, the arguments are epistemologically contrasted: f represents the observations, the data, what we see, while g represents the expectations, the models, what we believe. K(f||g) is an asymmetrical measure of dissimilarity between empirical and theoretical distributions, able to capture the various aspects of the confrontation between models and data, that is the art of classical statistical inference, including Popper’s refutationism as a particular case. Here lies the dialectic charm of K(f||g), which emerges in that respect as an epistemological functional."*

*an asymmetrical measure of dissimilarity between the observations, the data ("what we see") and "the expectations, the models, what we believe."*

*This is thus "able to capture the various aspects of the confrontation between models and data, that is **the art of classical statistical inference**"*

*"Here lies the dialectic charm of K(f||g)," "also known as the Kullback-Leibler divergence," "which emerges ... as an epistemological functional"*

*"K(f||g), also known as the Kullback-Leibler divergence ... constitutes a non-symmetric measure of the dissimilarity between the distributions f and g"*

*"Furthermore, the asymmetry of the relative entropy does not constitute a defect, but perfectly matches the asymmetry between data and models"*
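To make the asymmetry concrete, here is a minimal Python sketch. It assumes the standard discrete definition K(f||g) = Σ_j f_j ln(f_j / g_j), which the excerpts above do not spell out; the function name `relative_entropy` and the two example distributions are illustrative choices, not taken from the paper.

```python
import math

def relative_entropy(f, g):
    """Discrete relative entropy K(f||g) = sum_j f_j * ln(f_j / g_j), in nats.

    f plays the role of the data ("what we see"),
    g the role of the model ("what we believe").
    """
    total = 0.0
    for fj, gj in zip(f, g):
        if fj == 0.0:
            continue  # convention: 0 * ln(0 / g_j) = 0
        if gj == 0.0:
            return math.inf  # data observed where the model forbids it
        total += fj * math.log(fj / gj)
    return total

# Hypothetical three-category example (values chosen for illustration only).
f = [0.5, 0.3, 0.2]   # empirical distribution
g = [0.8, 0.1, 0.1]   # theoretical distribution

k_fg = relative_entropy(f, g)  # data measured against the model
k_gf = relative_entropy(g, f)  # arguments swapped
k_ff = relative_entropy(f, f)  # a distribution against itself

print(k_fg, k_gf, k_ff)
```

Swapping the arguments gives a different value, and the divergence vanishes when the two distributions coincide, which is why K(f||g) is a dissimilarity but not a distance.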

*"Asymmetry of the relative entropy and hard falsificationism"*

*"the theory 'All crows are black' is refuted by the single observation of a white crow, while the theory 'Some crows are black' is not refuted by the observation of a thousand white crows. In this spirit, Popper’s falsificationist mechanisms (Popper 1963) are captured by the properties of the relative entropy, and can be further extended to probabilistic or 'soft falsificationist' situations, beyond the purely logical true/false context"*
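The crow example can be sketched numerically. Encoding each theory as a two-category distribution over (black, white) is an assumption made here for illustration: "All crows are black" becomes the model (1, 0), and (0.001, 0.999) stands in for one hypothetical model compatible with "Some crows are black". The definition of K(f||g) used is the standard discrete one.

```python
import math

def relative_entropy(f, g):
    """Discrete relative entropy K(f||g) = sum_j f_j * ln(f_j / g_j), in nats."""
    total = 0.0
    for fj, gj in zip(f, g):
        if fj == 0.0:
            continue  # convention: 0 * ln(0 / g_j) = 0
        if gj == 0.0:
            return math.inf  # the model rules out something that was observed
        total += fj * math.log(fj / gj)
    return total

# Categories are (black, white). All numbers below are illustrative.
all_black_model = [1.0, 0.0]        # "All crows are black"
one_white_crow = [0.999, 0.001]     # data: a single white crow among many

# Hard falsification: one forbidden observation makes the divergence infinite.
hard = relative_entropy(one_white_crow, all_black_model)

# Hypothetical model compatible with "Some crows are black".
some_black_model = [0.001, 0.999]
only_white_crows = [0.0, 1.0]       # data: a thousand white crows, no black one

# No refutation: the divergence stays finite (and small).
not_refuted = relative_entropy(only_white_crows, some_black_model)

# Soft falsification: a model that tolerates rare white crows is
# penalised by a finite amount rather than rejected outright.
soft_model = [0.99, 0.01]
soft = relative_entropy(one_white_crow, soft_model)

print(hard, not_refuted, soft)
```

The infinite value arises only when the data put mass where the model puts none; the reverse configuration stays finite, which mirrors the asymmetry between refuting a universal statement and failing to refute an existential one.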

*"Competition between simple hypotheses: Bayesian selection"*

*"the heart of the so-called model selection procedures, with the introduction of penalties ... increasing with the number of free parameters. In the alternative minimum description length (MDL) and algorithmic complexity theory approaches ... richer models necessitate a longer description and should be penalised accordingly. All those procedures, together with Vapnik’s Structural Risk Minimization (SRM) principle (1995), aim at controlling the problem of overparametrization in statistical modelling."*

*"Suppose data to be **incompletely observed**"*

*"For decades (ca. 1950-1990), the 'maximum entropy' principle, also called 'minimum discrimination information (MDI) principle' by Kullback (1959), has largely been used in science and engineering as a first-principle, 'maximally non-informative' method of generating models, **maximising our ignorance (as represented by the entropy) under our available knowledge** ... (see in particular Jaynes (1957), (1978)). However, (18) shows the maximum entropy construction ... points towards the empirical (rather than theoretical) nature of the latter. In the present setting, $\tilde{f}^D$ appears as the most likely data reconstruction under the prior model and **the incomplete observations** (see also section 5.3)."*
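As a rough illustration of reconstructing data from incomplete observations, the sketch below treats the case where only an average value is observed. It relies on the standard result that minimising K(f||g) under a fixed mean yields an exponentially tilted version of the prior, f_j ∝ g_j exp(θ a_j); with a uniform prior this coincides with the maximum entropy solution. The die example, the function names, and the bisection bounds are assumptions made for illustration and are not taken from the paper.

```python
import math

def tilted(prior, values, theta):
    """Exponentially tilted distribution: f_j proportional to g_j * exp(theta * a_j)."""
    weights = [p * math.exp(theta * a) for p, a in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(dist, values):
    """Expected value of `values` under `dist`."""
    return sum(p * a for p, a in zip(dist, values))

def reconstruct_from_mean(prior, values, target, lo=-50.0, hi=50.0):
    """Minimise K(f||prior) subject to mean(f) == target.

    The mean of the tilted distribution increases with theta,
    so theta can be found by bisection on [lo, hi].
    """
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean(tilted(prior, values, mid), values) < target:
            lo = mid
        else:
            hi = mid
    return tilted(prior, values, (lo + hi) / 2.0)

# Hypothetical setting: a die whose faces were not recorded individually;
# only the average of the throws (4.5) is known.
faces = [1, 2, 3, 4, 5, 6]
uniform_prior = [1.0 / 6.0] * 6

reconstructed = reconstruct_from_mean(uniform_prior, faces, 4.5)
print([round(p, 4) for p in reconstructed])
print(mean(reconstructed, faces))
```

If the observed average equals the prior's own mean (3.5 for a uniform die), the reconstruction returns the prior unchanged, since the observation then carries no information against the model.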

*Examples:*

Unobserved Category

Coarse Grained Observations

Symmetrical Observations

Average Value

Statistical mechanics

Decompositions

Conditional Independence