Proof Mean and Variance of Exponential Families Canonical Form

1. Introduction

In accordance with its historical roots, classical statistical theory has been formulated to describe repeatable experiments in terms of random variables. A drawback that accompanies this language is the difficulty of integrating and describing abstract structural knowledge. Rising theories like deep learning and complex network dynamics, however, impressively demonstrate that statistical modeling and inference can greatly benefit from the integration of abstract structural assumptions, in particular in the domains of complex natural data.

It is therefore not surprising that this development led to a growing interest in alternative approaches to formulating statistical theory. In particular Shun'ichi Amari [Amari1987] pursued a fundamentally different approach by focusing on the functional space of the probability distributions. This view motivated the reformulation of statistical theory by means of structural statistics [Michl2019, Michl2020]. An important application and showcase are exponential families, which can be completely characterized by their geometric structure in terms of dually flat statistical manifolds.

2. Primary affine structure of Exponential Families

Definition (Exponential family).

Let be a statistical model over a measurable space . Then is termed an exponential family if and only if there exists an invertible function

a sufficient statistic

and a scalar function

such that for any and it holds that:

Since is a sufficient statistic, a Markov morphism given by is globally invertible, and its restriction to yields a statistical isomorphism to a statistical model over a measurable space . Then is an identifiable parametrisation of and by the definition

it follows that:

Without loss of generality, any exponential family as defined in 2.1 may therefore be assumed to be given by probability distributions with a representation 2.2. This representation is termed the canonical form of an exponential family and the parametrisation the canonical, or natural, parametrisation.

Definition (Natural parametrisation).

Let be an exponential family in canonical form, then the corresponding canonical parametrisation is termed a natural parametrisation of

and the parameter vectors

are termed natural parameters. Remark: Exponential families given by the notation imply a canonical form and a natural parametrisation .

The function , given by an exponential family in canonical form is known as the cumulant generating function and may be regarded as a normalisation factor, that implements the normalisation condition of the probability distribution:

Since is independent of it may be pulled out of the integral, and a rearrangement of equation 2.3 yields:
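The display that belongs here did not survive extraction. The following is a hedged reconstruction of the standard rearrangement, in assumed notation: $\theta$ for the natural parameter, $T$ for the sufficient statistic, $\psi$ for the cumulant generating function and $\mu$ for the reference measure (with any base-measure factor absorbed into $\mu$). The symbols and equation layout of the original may differ.

```latex
\begin{align*}
1 &= \int \exp\bigl(\theta \cdot T(x) - \psi(\theta)\bigr)\,\mathrm{d}\mu(x)
   = e^{-\psi(\theta)} \int \exp\bigl(\theta \cdot T(x)\bigr)\,\mathrm{d}\mu(x) \\
\Longrightarrow\quad
\psi(\theta) &= \log \int \exp\bigl(\theta \cdot T(x)\bigr)\,\mathrm{d}\mu(x)
\end{align*}
```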

Due to this dependency the cumulant generating function relates different statistical properties.

Lemma 1 .

Let be the cumulant generating function of an exponential family , then is convex with respect to and its first and second order derivatives are given by:

where and

respectively denote the vectorial expectation and variance of

with respect to .

Proof.

Let and let be the density function of over . Then the normalization condition is:

The partial derivative of equation 2.7 with respect to the natural parameter yields:

Therefore:

This proves equation 2.5. A further partial derivative of 2.8 with respect to the natural parameter yields:

Therefore:

This proves equation 2.6. Furthermore since:

The Hessian matrix of is positive definite and therefore is convex. The convexity of may therefore be used to induce a Riemannian metric by the Bregman divergence.
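Lemma 1 can be checked numerically on a toy example. The sketch below is an illustration under assumptions that are not part of the text: a finite sample space {0, 1, 2, 3} with counting reference measure, the statistic T(x) = x, and the names `psi`, `density` and `check`. It compares finite-difference derivatives of the cumulant generating function with the directly computed mean and variance of T.

```python
import numpy as np

# Assumed toy setup: finite sample space with counting measure, T(x) = x.
T = np.arange(4.0)

def psi(theta):
    """Cumulant generating function: log of the normalising sum."""
    return np.log(np.sum(np.exp(theta * T)))

def density(theta):
    """Canonical-form density p_theta(x) = exp(theta * T(x) - psi(theta))."""
    return np.exp(theta * T - psi(theta))

def check(theta, h=1e-4):
    """Return (mean, variance, psi', psi'') with derivatives by central differences."""
    p = density(theta)
    mean = np.sum(p * T)
    var = np.sum(p * T**2) - mean**2
    d1 = (psi(theta + h) - psi(theta - h)) / (2 * h)
    d2 = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h**2
    return mean, var, d1, d2

mean, var, d1, d2 = check(0.3)
print(mean, d1)  # first derivative of psi against the expectation of T
print(var, d2)   # second derivative of psi against the variance of T
```

In one dimension the Hessian is the scalar variance, so its positivity in this example is what makes `psi` convex.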

Lemma 2 .

Let be the cumulant generating function of an exponential family . Then the Bregman divergence given by

, is the dual Kullback-Leibler divergence, such that:

Proof.

Let and their respective density functions over . By calculating:

the Bregman divergence is derived by:
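As a numerical illustration of lemma 2, the following sketch checks the standard identity that the Bregman divergence of the cumulant generating function between two natural parameters equals the Kullback-Leibler divergence with its arguments swapped. The four-state sample space, the statistic matrix `T` and all function names are assumptions made for the example; the argument order used in the original display is not recoverable here, so the convention B(θ1, θ2) = ψ(θ1) − ψ(θ2) − ∇ψ(θ2)·(θ1 − θ2) is assumed.

```python
import numpy as np

# Assumed toy statistic on four states, two natural parameters.
T = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def psi(theta):
    return np.log(np.sum(np.exp(T @ theta)))

def density(theta):
    return np.exp(T @ theta - psi(theta))

def grad_psi(theta):
    # By lemma 1 the gradient of psi is the expectation of T.
    return density(theta) @ T

def bregman(theta1, theta2):
    return psi(theta1) - psi(theta2) - grad_psi(theta2) @ (theta1 - theta2)

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

theta1 = np.array([0.5, -1.0])
theta2 = np.array([-0.2, 0.8])
print(bregman(theta1, theta2), kl(density(theta2), density(theta1)))
```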

Lemma 3 .

Let be the cumulant generating function of an exponential family . Then the Bregman divergence given by , induces a Riemannian metric, which is given by the Fisher information:

Proof.

Let and

the probability density function of

over , then:

The Riemannian metric which is induced by the Bregman divergence is therefore given by:

Lemma 4 .

Let be the cumulant generating function of an exponential family and the Riemannian statistical manifold, given by the Bregman divergence . Then the -affine geodesics in are given by exponential families.

Proof.

The -affine geodesics in are given by affine linear curves in the -parametrisation. For and , with let be the -affine geodesic connecting with and the probability density function of over . Then the representation of in the -parametrisation is given by:

Let be the respective probability densities of and over . A -transformation and a subsequent substitution of and by equation 2.11 yields:

A further -transformation and subsequent integration over a measurable set gives:

For varying this yields a generic representation of -affine geodesics in , which are exponential families with respect to the curve parameter . ∎
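The statement of lemma 4 can be illustrated on a finite sample space. The sketch below assumes a four-state toy family (the statistic matrix `T` and all names are hypothetical) and checks that the density along a straight line in natural parameters coincides with the normalised geometric mixture of the two endpoint densities, which is itself a one-parameter exponential family in the curve parameter t.

```python
import numpy as np

# Assumed toy statistic on four states, two natural parameters.
T = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def psi(theta):
    return np.log(np.sum(np.exp(T @ theta)))

def density(theta):
    return np.exp(T @ theta - psi(theta))

def geodesic_density(theta_p, theta_q, t):
    """Density at curve parameter t on the straight line between two natural parameters."""
    return density((1.0 - t) * theta_p + t * theta_q)

def normalized_geometric_mixture(p, q, t):
    """p^(1-t) * q^t, renormalised to a probability vector."""
    w = p ** (1.0 - t) * q ** t
    return w / w.sum()

theta_p = np.array([0.5, -1.0])
theta_q = np.array([-0.2, 0.8])
for t in (0.0, 0.25, 0.7, 1.0):
    a = geodesic_density(theta_p, theta_q, t)
    b = normalized_geometric_mixture(density(theta_p), density(theta_q), t)
    print(t, np.max(np.abs(a - b)))
```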

Lemma 5 .

Let be the cumulant generating function of an exponential family and the Riemannian statistical manifold, given by the Bregman divergence . Then the -affine geodesics in are flat with respect to the Fisher information.

3. Dual affine structure of Exponential Families

In order to study the structure obtained by a Legendre transformation, a further parametric family is introduced, which is shown to embrace this dual structure. This parametric family is the mixture family.

Definition (Mixture family).

Let

be a statistical model over a measurable space . Then is a mixture family if and only if there is an invertible function with and and pairwise independent probability distributions over , such that for all :

The normalization condition can be used to restrict the number of parameters, by the definition:

Therefore:

This transformation is a globally invertible Markov morphism, and its restriction to yields a statistical isomorphism to a statistical model . Then is an identifiable parametrisation of and since defines a probability distribution over it may be regarded as the expected value with respect to . By the definition of it follows that:

Without loss of generality any mixture family may therefore be assumed to be given by equations 3.3. This representation is termed the canonical form of a mixture family and the parametrisation the expectation parametrisation.

Definition (Expectation parametrisation).

Let

Then a parametrisation with , which is given by , where denotes the vectorial expectation of with respect to , is termed an expectation parametrisation with respect to and parameter vectors are termed expectation parameters.

Lemma 6 .

Let be the cumulant generating function of an exponential family and the negative entropy . Then the dual parametrisation is given by the expectation parametrisation and the Legendre dual function by , such that:

Proof.

Let and the density function of over . Then from equation 2.5 it follows that:

the dual parametrisation directly follows from . The Legendre dual function is derived by:
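As a hedged numerical illustration of lemma 6, the sketch below evaluates the Legendre dual of the cumulant generating function at the expectation parameters and compares it with the negative entropy of the corresponding density. It assumes a four-state toy family with counting reference measure and no extra base-measure factor; the statistic matrix `T` and the function names are hypothetical.

```python
import numpy as np

# Assumed toy statistic on four states, two natural parameters.
T = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def psi(theta):
    return np.log(np.sum(np.exp(T @ theta)))

def density(theta):
    return np.exp(T @ theta - psi(theta))

def expectation_parameters(theta):
    # Dual coordinates: the expectation of T, i.e. the gradient of psi (lemma 1).
    return density(theta) @ T

def legendre_dual(theta):
    """Value of the Legendre dual function at the point with natural parameter theta."""
    eta = expectation_parameters(theta)
    return theta @ eta - psi(theta)

def negative_entropy(p):
    return np.sum(p * np.log(p))

theta = np.array([0.5, -1.0])
print(legendre_dual(theta), negative_entropy(density(theta)))
```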

Lemma 7 .

Let be the cumulant generating function of an exponential family . Then the Bregman divergence given by the Legendre dual function is the Kullback-Leibler divergence:

Proof.

From lemma LABEL:lem:3.9, equation LABEL:eq:divergence_dual_divergence, it follows that:

Lemma 8 .

Let be the cumulant generating function of an exponential family . Then the Bregman divergence of the Legendre dual function , induces a Riemannian metric, which is given by the inverse Fisher information:

Proof.

By applying theorem LABEL:prop:3.2 the Bregman divergence induces the dual Riemannian metric , which by definition is inverse to the Riemannian metric induced by , such that . From lemma 3 it follows that

Lemma 9 .

Let be the cumulant generating function of an exponential family and the Riemannian statistical manifold, given by the Bregman divergence of the Legendre dual function . Then the -affine geodesics in are given by mixture families.

Proof.

-affine geodesics in are given by affine linear curves in the -parametrisation. For and , with let be the -affine geodesic connecting with and the probability density of over , then:

An integration over yields:

This is a mixture of probability distributions with respect to a mixing parameter . ∎

Lemma 10 .

Let be the cumulant generating function of an exponential family . Then the -affine geodesics are flat with regard to the Riemannian metric, induced by the Bregman divergence of the Legendre dual function .

4. Dually flat structure

Theorem 11 (Structure of Exponential Families).

Let be an exponential family. Then there exists a Bregman divergence such that is a dually flat statistical manifold with respect to the Riemannian metric, induced by .

Proof.

Let be the natural parametrisation and the cumulant generating function of . Let further be the Bregman divergence over with respect to , then is a Riemannian statistical manifold with a Bregman divergence . By lemma 3 it follows that is an affine parametrisation and by lemma 5 that the -affine geodesics are flat with respect to the Riemannian metric induced by . Let be the expectation parametrisation of with respect to , then by lemma 1 it follows that , by lemma 8 that is an affine parametrisation of and by lemma 10 that the -affine geodesics are flat with respect to the Riemannian metric induced by , where . Therefore the conditions of lemma LABEL:lem:3.10 are satisfied and is a dually flat statistical manifold. ∎

Together the primary and dual affine structure of an exponential family induce a dually flat structure, which is characterized by the -affine and -affine geodesics with respect to the Fisher information metric and its dual metric. Thereby the -affine geodesics are represented by exponential families over the curve parameter and the -affine geodesics by mixture families. This relationship allows a characterisation of the intrinsic dually flat structure which is independent of its parametrisation. This structure is given by an -affine structure that preserves the exponential family representation within its primary affine structure and an -affine structure that preserves the dual representation within the dual affine structure. Therefore geodesics and geodesic projections in the -affine structure are respectively termed -geodesics, denoted by , and -affine projections, denoted by . Furthermore the geodesics and geodesic projections in the -affine structure are respectively termed -geodesics, denoted by , and -affine projections, denoted by . With respect to submanifolds, a smooth submanifold is termed -flat if it has a linear embedding inside the -affine structure and -flat if it has a linear embedding inside the -affine structure.

Proof.

Let be an exponential family. Then theorem 11 states that the structure of is given by a dually flat statistical manifold , where the Riemannian metric is induced by a Bregman divergence with respect to the natural parametrisation . Let be an -flat submanifold, then is flat with regard to the Riemannian metric induced by the expectation parametrisation and therefore flat with respect to . Conversely let be an -flat submanifold, then is flat with regard to the Riemannian metric induced by the natural parametrisation and therefore flat with respect to . Therefore the conditions of corollary LABEL:cor:3.2 are satisfied, which proves the corollary. ∎

5. Maximum Entropy Estimation in Exponential Families

The sample space of a statistical model is generated by a statistic , that induces a probability distribution from the probability space of an underlying statistical population . To this end the sample space may also be regarded as a probability space, whereby the occurrence of the probability distribution is hypothetical.

Lemma 13 .
Proof.

Let be a -finite reference measure over , then the probability density of the uniform distribution is given by:

Let be the canonical parametrisation of , then the densities of any may be written as:

Where the cumulant generating function is given by:

Let , then:

This is the density of the uniform distribution over . ∎

Lemma 14 .

Let be a measurable space and the uniform distribution over . Then let be an arbitrary probability distribution over , then:

Proof.

Let be the probability density of , then the entropy of over is:

Let be an arbitrary probability distribution over with density , then:

Therefore:

Lemma 15 .

Let be an exponential family over a sample space

. Then a maximum entropy estimation of

is given by the uniform probability distribution over .

Proof.

From lemma 13 it follows, that . Furthermore lemma 14 proves that maximizes the entropy among all . ∎
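For a finite sample space the content of lemma 14 reduces to the identity H(u) − H(p) = KL(p ‖ u) ≥ 0 for the uniform distribution u, which is easy to check numerically. The sketch below is a finite-space illustration only; the five-point space, the Dirichlet-sampled test distributions and the function names are assumptions made for the example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_to_uniform(p):
    """Kullback-Leibler divergence from p to the uniform distribution on the same support size."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * (np.log(p[nz]) + np.log(p.size)))

rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(5), size=1000)  # arbitrary distributions on 5 points
uniform = np.full(5, 0.2)

gaps = np.array([entropy(uniform) - entropy(p) for p in samples])
print(gaps.min())  # never negative: no sampled distribution beats the uniform one
```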

Theorem 16 .

Let be an exponential family over a sample space and a smooth submanifold, then a maximum entropy estimation of is given by a geodesic projection of the uniform probability distribution over to .


Figure 5.1. ME estimation in Exponential Families

By assuming a given event inside a sample space

, the principle of maximum entropy emphasizes a probability distribution that minimizes additional assumptions. Lemma 15 then states that the maximum entropy estimation of a single observation within the set of all probability distributions is given by the uniform probability distribution over this observation, such that:

Dispensing with any additional knowledge except the observation itself determines the uniform probability distribution as the empirical probability of a single observation. This shall be extended to repeated observations. Let be a repeated observation in , then the density function of a finite measure over is given by the arithmetic mean of the uniform distributions of the individual single observations:

Where is a -finite reference measure and the Dirac measure and defined by:

Then is a probability density function over , since and:

Furthermore the assumptions given by are equal to the knowledge which is given by the repeated observation . This determines as the density of an empirical probability distribution.

Definition (Empirical probability distribution).

Let be a measurable space, a repeated observation in and a -finite reference measure over . Then the empirical probability distribution of over is given by:

Proposition 17 .

Let be a sample space, be the observable distribution of and the partial sequences of a repeated observation in . Then the sequence of the empirical probability distributions converges to as .

Proof.

Let be the probability density function of

. Then due to the strong law of large numbers it follows that:

Let , then the limit of the empirical probability distributions is given by:

6. Maximum Likelihood Estimation in Exponential Families

Lemma 18 .

Let be a statistical model over a sample space and a finite repeated observation in . Then the MLE has the following representation:

Proof.

Since the transformation is strictly monotonic over it directly follows that:

Furthermore, due to the pairwise independence of the individual observations, it follows that:

And therefore:

Lemma 19 .

Let be a statistical model over a sample space and a finite repeated observation in . Let further be the empirical probability density with respect to . Then the MLE has the following representation:

Proof.

By substitution of the empirical probability density it follows that:

The maximization of equation 6 with respect to therefore yields the following identity:

By lemma 18, equation 6.1, it follows that:
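The relation behind lemma 19 can be illustrated on a small discrete example: the mean log-likelihood of a repeated observation equals the negative entropy of the empirical distribution minus the Kullback-Leibler divergence from the empirical distribution to the model, so maximising the likelihood is the same as minimising that divergence. The sketch below assumes a hypothetical sample `obs` and a Binomial(3, mu) model family searched over a grid; none of these choices come from the text.

```python
import numpy as np
from math import comb

obs = np.array([0, 1, 1, 2, 3, 1, 2, 2, 0, 1])   # hypothetical repeated observation
n_trials = 3
support = np.arange(n_trials + 1)
p_hat = np.bincount(obs, minlength=n_trials + 1) / len(obs)  # empirical distribution

def model(mu):
    """Binomial(n_trials, mu) probabilities on the support {0, ..., n_trials}."""
    return np.array([comb(n_trials, k) * mu**k * (1 - mu)**(n_trials - k) for k in support])

def mean_log_lik(mu):
    return np.mean(np.log(model(mu)[obs]))

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * (np.log(p[nz]) - np.log(q[nz])))

grid = np.linspace(0.01, 0.99, 981)   # step 0.001
mu_ml = grid[np.argmax([mean_log_lik(m) for m in grid])]
mu_kl = grid[np.argmin([kl(p_hat, model(m)) for m in grid])]
print(mu_ml, mu_kl, obs.mean() / n_trials)
```

Within this model family the maximiser also matches the sample mean of the statistic up to the grid resolution, which is the moment-matching form of the estimate in expectation parameters.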

Theorem 20 .

Let be an exponential family over a sample space . Let further be a smooth submanifold of and a repeated observation in . Then a maximum likelihood estimation of corresponding to is given by the geodesic projection of the empirical probability to in .


Figure 6.1. ML estimation in Exponential Families
Proof.

Let be an exponential family over a sample space and a smooth submanifold of . Then there exists a Bregman divergence , such that is a dually flat statistical manifold and is the -affine parametrisation of . Then is a smooth Riemannian submanifold of with respect to the induced Riemannian metric . Let be a repeated observation over and its empirical probability distribution.

Then is given in expectation parameters in and therefore in an -affine parametrisation within the Riemannian manifold . By applying the projection theorem, the geodesic projection of to equals the dual affine projection and is therefore given by a point that minimizes the Bregman divergence . The geodesic distance is therefore given by:

The minimization of with respect to the natural parametrisation therefore yields the following identity:

By equation 6.2 and lemma, equation 6.4, it follows that:

7. Latent variable models

Exponential families, as introduced in the previous sections, statistically relate random variables over a common statistical population by their common probability distribution in the sample space. In many cases, however, the intrinsic structure of this relationship has a natural decomposition by the introduction of latent random variables, which are not directly observable from the statistical population but are assumed to affect the observations. This is of particular importance for the modelling of statistical populations with complex network structures. In this case the properties of the network may be incorporated by the conditional transition probabilities between observable and latent random variables.

In completely observable statistical models, the probability distributions may be estimated by the empirical probability distributions of repeated observations. In latent variable models, however, the conditional transition probabilities and between the observables and the latent variables in general prevent this inference. The only exception is given if and are uniformly distributed, such that for any given any has the same probability with respect to and vice versa. In this case estimations decompose into independent estimations of the observable variables and the latent variables. If the conditional transition probabilities, however, are not uniformly distributed, they have to be taken into account for estimations. This also applies to empirical distributions. Let be the partial sequences of a repeated observation in , then by proposition 17 it follows that the empirical probabilities converge in distribution to the true probability distribution of . Since however is the marginal distribution of the observables in , the common empirical probabilities over are constituted by an empirical probability of a repeated observation and a conditional transition probability. The probability density of an empirical probability over is therefore given by:

In the presence of continuous variables it is useful to restrict the empirical probability distributions to "non-pathological" cases. This restriction defines a statistical model with respect to empirical observations.

Definition (Empirical model).

Let be a partially observable measurable space with . Then a statistical model is termed an empirical model over if comprises all empirical probability distributions over which are constituted by a finite repeated observation and a conditional transition probability of a given set . If is the set of all absolutely continuous conditional transition probabilities, then is termed an absolutely continuous empirical model.

8. Exponential families with latent variables

Let be an exponential family over a partially observable measurable space . Then due to theorem 11 the structure of is that of a dually flat manifold, such that there exists a parametrisation and a convex function that allow to be regarded as a dually flat statistical manifold, given by . For the purpose of observation-based estimations in the presence of latent variables, the obstacle that has to be overcome is the extension of to a dually flat embedding space that covers as well as the empirical model . For arbitrary empirical models, however, this embedding does generally not exist, for which reason is assumed to be absolutely continuous. Then is generally an infinite dimensional exponential family.

Lemma 21 .

Let be an absolutely continuous empirical model over a partially observable measurable space . Then is an infinite dimensional exponential family and its probability densities are given by:

where and respectively denote the observable and latent variables, scalar coefficients and the cumulant generating function, given by:

Proof.

Due to the definition of an empirical model the probability density of any may be written as:

Furthermore any conditional transition probability is given by:

Therefore by equations 8.2 and 8.3 it follows that:

Since are absolutely continuous, equation 8.4 may be rewritten to:

By the substitution

it follows that the empirical probabilities in may be written as a mixture with mixing coefficients :

This shows that is an infinite dimensional mixture family. By a further transformation:

a dual representation of equation 8.6 is obtained, with:

This shows that is also an infinite dimensional exponential family with natural variables and the cumulant generating function is given by

Lemma 22 .

Let be an exponential family over a partially observable measurable space and an absolutely continuous empirical model over . Then there exists a dually flat statistical manifold that covers and as dually flat submanifolds.

Proof.

With respect to the partially observable measurable space any may be written as and any as . Let be given by:

Since is an exponential family it has a natural parametrisation and due to lemma 21 the continuous empirical model is an infinite dimensional exponential family with natural coefficients . Then a parametrisation of is given by:

Let further be the cumulant generating function of and the cumulant generating function of , then and are convex functions and therefore a convex function over is given by:

This allows the definition of a Bregman divergence , such that induces a Riemannian metric over . By substitution of equation 8.7 it follows that the application of to yields a parametric representation, which is given by:

This shows that is an exponential family and furthermore that is a dually flat statistical manifold that covers and as smooth submanifolds. Since the projections and are linear in the -parametrisation, and are -flat with respect to the induced metric . Let be the expectation parameters of and expectation coefficients , then the dual parametrisation is given by:

Since the projections and are linear in the -parametrisation, and are -flat with respect to the induced metric . Therefore and are dually flat submanifolds of . ∎

9. Maximum Likelihood Estimation in Exponential Families with Latent Variables

Due to the existence of a simply connected embedding space that covers as well as , arbitrary smooth submanifolds of may also be connected to submanifolds of , given by a repeated observation . Since is furthermore a Riemannian statistical manifold and and are smooth submanifolds of , geodesics between and are given by the induced Riemannian metric .

Theorem 23 .

Let be an exponential family over a partially observable measurable space and an absolutely continuous empirical model over . Let further be a smooth submanifold of , and a repeated observation in . Then a maximum likelihood estimation of corresponding to is given by a minimal geodesic projection of to .


Figure 9.1. ML estimation in latent variable Exponential Families
Proof.

Since is an exponential family and a continuous empirical model, there exists a common dually flat embedding space , such that and are dually flat submanifolds with respect to the Bregman divergence . Since the maximum likelihood estimation of with respect to the repeated observation is independent of its chosen parametrisation, it may be obtained by the natural parametrisation of the embedding space , such that:

Then the theorem has an equivalent formulation, given by:

Since is a smooth submanifold of and of it follows that is also a smooth submanifold of , and since is a dually flat statistical manifold, the projection theorem is satisfied. Therefore the geodesic projection of a fixed point to is given by the minimal dual affine projection, and thus by a point that minimizes , such that:

Then the minimal geodesic projection of to is given by points and that minimize :

Since is the natural parametrisation of the exponential family , the Bregman divergence is given by the dual Kullback-Leibler divergence in natural parameters. Therefore it follows that:

Without loss of generality let be the vectorial observable random variable and the vectorial latent random variable. Then and denote the Lebesgue measures in and and:

The common empirical distributions are defined by the marginal empirical densities of the observable variables and conditional transition probabilities of the latent variables by , such that:

Furthermore the function allows the product to be written as a sum and the addends to be substituted by functionals, such that:

With:

Then it follows, that:

And furthermore:

Then and are completely determined by , such that their minima do not depend on the choice of and . Therefore:

Since are the empirical probabilities over , the conditions of lemma , equation 6.2, are satisfied, such that

10. Alternating minimization

Let be a dually flat statistical manifold and and smooth submanifolds of . Then the divergence between and is defined by the minimal divergence of their corresponding points, such that:

where and minimize . By applying Amari's projection theorem it follows that the pair also minimizes the geodesic distance and therefore has to be regarded as a pair of closest points between and . Furthermore the dually flat structure allows an iterative alternating geodesic projection between and to asymptotically approximate and . This is termed alternating minimization.

Definition 24 (Alternating minimization).

Let be an exponential family with smooth submanifolds and . Then the alternating minimization from to iteratively defines a sequence of elements in , which is given by:

Begin:

Let be arbitrary

Iteration:

( -step) is given by a geodesic projection of to :

( -step) is given by a dual geodesic projection of to :

Theorem 25 (Alternating minimization).

Let be an exponential family model with smooth submanifolds and . Then the alternating minimization algorithm converges to a pair of probability distributions that locally minimize the geodesic distance between and .

Figure 10.1. Alternating minimization in Exponential Families
Proof.

Let be an exponential family, then the Bregman divergence of its cumulant generating function induces a Riemannian metric and is a simply connected dually flat statistical manifold. Then and are Riemannian submanifolds of with respect to the induced Riemannian metric. In the -step, the geodesic projection is given by the dual affine projection in the -parametrisation. Then the dual affine projection minimizes the Bregman divergence and the Pythagorean theorem yields:

In the -step the dual geodesic projection is given by the affine projection in the -parametrisation. Then the affine projection minimizes the dual Bregman divergence and the Pythagorean theorem yields:

This proves that decreases monotonically, since:

Furthermore is bounded below by:

This proves that converges to a local minimum. ∎
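The monotone decrease established in the proof can be observed in a small numerical sketch. The example below is not the construction of the text: it swaps in the familiar EM iteration for a two-component binomial mixture as one concrete instance of alternating projections between a data-side set (all joint distributions whose observable marginal equals a given empirical distribution) and a model-side set (joint distributions of a latent class and a binomial observable). The empirical marginal `p_hat`, the model family and all names are hypothetical.

```python
import numpy as np
from math import comb

n_trials = 4
xs = np.arange(n_trials + 1)
p_hat = np.array([0.30, 0.15, 0.10, 0.15, 0.30])  # hypothetical empirical marginal of x

def joint_model(pi, mu):
    """Model-side joint q(x, z) = pi[z] * Binomial(x; n_trials, mu[z]); rows x, columns z."""
    binom = np.array([[comb(n_trials, x) * m**x * (1 - m)**(n_trials - x) for m in mu]
                      for x in xs])
    return binom * pi

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

def alternating_minimization(pi, mu, steps=50):
    history = []
    for _ in range(steps):
        q = joint_model(pi, mu)
        # Projection onto the data side: keep the observed marginal, take q(z | x).
        p = p_hat[:, None] * q / q.sum(axis=1, keepdims=True)
        history.append(kl(p, q))
        # Projection onto the model side: match the expected statistics of p.
        pi = p.sum(axis=0)
        mu = (xs @ p) / (n_trials * pi)
    return pi, mu, np.array(history)

pi, mu, history = alternating_minimization(np.array([0.5, 0.5]), np.array([0.3, 0.7]))
print(history[0], history[-1])  # divergence between the two sides at start and end
```

The recorded divergences form a non-increasing sequence bounded below by zero, which mirrors the two facts used in the proof; as in theorem 25, the limit is in general only a local minimum.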

Corollary 26 .

Let be an exponential family with an -flat submanifold and an -flat submanifold . Then the alternating minimization algorithm from to converges to points that globally minimize the geodesic distance between and .

Proof.

Let be a sequence given by the alternating minimization algorithm from to , then due to theorem 25 converges to points that locally minimize the geodesic distance between and . Since is -flat and is -flat, corollary LABEL:cor:3.2 is satisfied, such that this geodesic distance is unique and therefore converges to points that globally minimize the geodesic distance between and . ∎

Corollary 27 .

Let be an exponential family over a partially observable measurable space and an absolutely continuous empirical model over . Let further be an -flat submanifold of , and a repeated observation over . Then a maximum likelihood estimation of corresponding to is given by the limit of the alternating minimization algorithm from to .

Proof.

Since is an exponential family and an absolutely continuous empirical model, there exists a common dually flat embedding space , such that and are dually flat submanifolds with respect to the Bregman divergence . Then is -flat in and by definition is -flat in and therefore also in . This allows the application of corollary 26. ∎

Source: https://deepai.org/publication/applications-of-structural-statistics-geometrical-inference-in-exponential-families
