VOOZH about

URL: https://en.wikipedia.org/wiki/Conditional_entropy

⇱ Conditional entropy - Wikipedia


Jump to content
From Wikipedia, the free encyclopedia
Measure of relative information in probability theory
πŸ‘ Image
Venn diagram showing additive and subtractive relationships various information measures associated with correlated variables πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
. The area contained by both circles is the joint entropy πŸ‘ {\displaystyle \mathrm {H} (X,Y)}
. The circle on the left (red and violet) is the individual entropy πŸ‘ {\displaystyle \mathrm {H} (X)}
, with the red being the conditional entropy πŸ‘ {\displaystyle \mathrm {H} (X|Y)}
. The circle on the right (blue and violet) is πŸ‘ {\displaystyle \mathrm {H} (Y)}
, with the blue being πŸ‘ {\displaystyle \mathrm {H} (Y|X)}
. The violet is the mutual information πŸ‘ {\displaystyle \operatorname {I} (X;Y)}
.
Information theory
πŸ‘ Image

In information theory, the conditional entropy quantifies the amount of information needed to describe the outcome of a random variable πŸ‘ {\displaystyle Y}
given that the value of another random variable πŸ‘ {\displaystyle X}
is known. Here, information is measured in shannons, nats, or hartleys. The "entropy of πŸ‘ {\displaystyle Y}
conditioned on πŸ‘ {\displaystyle X}
" is denoted as πŸ‘ {\displaystyle \mathrm {H} (Y|X)}
.

Definition

[edit]

The conditional entropy of πŸ‘ {\displaystyle Y}
given πŸ‘ {\displaystyle X}
is defined as

πŸ‘ {\displaystyle \mathrm {H} (Y|X)\ =-\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}p(x,y)\log {\frac {p(x,y)}{p(x)}}}

where πŸ‘ {\displaystyle {\mathcal {X}}}
and πŸ‘ {\displaystyle {\mathcal {Y}}}
denote the support sets of πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
.

Note: Here, the convention is that the expression πŸ‘ {\displaystyle 0\log 0}
should be treated as being equal to zero. This is because πŸ‘ {\displaystyle \lim _{\theta \to 0^{+}}\theta \,\log \theta =0}
.[1]

Intuitively, notice that by definition of expected value and of conditional probability, πŸ‘ {\displaystyle \displaystyle H(Y|X)}
can be written as πŸ‘ {\displaystyle H(Y|X)=\mathbb {E} [f(X,Y)]}
, where πŸ‘ {\displaystyle f}
is defined as πŸ‘ {\displaystyle \displaystyle f(x,y):=-\log \left({\frac {p(x,y)}{p(x)}}\right)=-\log(p(y|x))}
. One can think of πŸ‘ {\displaystyle \displaystyle f}
as associating each pair πŸ‘ {\displaystyle \displaystyle (x,y)}
with a quantity measuring the information content of πŸ‘ {\displaystyle \displaystyle (Y=y)}
given πŸ‘ {\displaystyle \displaystyle (X=x)}
. This quantity is directly related to the amount of information needed to describe the event πŸ‘ {\displaystyle \displaystyle (Y=y)}
given πŸ‘ {\displaystyle (X=x)}
. Hence by computing the expected value of πŸ‘ {\displaystyle \displaystyle f}
over all pairs of values πŸ‘ {\displaystyle (x,y)\in {\mathcal {X}}\times {\mathcal {Y}}}
, the conditional entropy πŸ‘ {\displaystyle \displaystyle H(Y|X)}
measures how much information, on average, the variable πŸ‘ {\displaystyle X}
encodes about πŸ‘ {\displaystyle Y}
.

Motivation

[edit]

Let πŸ‘ {\displaystyle \mathrm {H} (Y|X=x)}
be the entropy of the discrete random variable πŸ‘ {\displaystyle Y}
conditioned on the discrete random variable πŸ‘ {\displaystyle X}
taking a certain value πŸ‘ {\displaystyle x}
. Denote the support sets of πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
by πŸ‘ {\displaystyle {\mathcal {X}}}
and πŸ‘ {\displaystyle {\mathcal {Y}}}
. Let πŸ‘ {\displaystyle Y}
have probability mass function πŸ‘ {\displaystyle p_{Y}{(y)}}
. The unconditional entropy of πŸ‘ {\displaystyle Y}
is calculated as πŸ‘ {\displaystyle \mathrm {H} (Y):=\mathbb {E} [\operatorname {I} (Y)]}
, i.e.

πŸ‘ {\displaystyle \mathrm {H} (Y)=\sum _{y\in {\mathcal {Y}}}{\mathrm {Pr} (Y=y)\,\mathrm {I} (y)}=-\sum _{y\in {\mathcal {Y}}}{p_{Y}(y)\log _{2}{p_{Y}(y)}},}

where πŸ‘ {\displaystyle \operatorname {I} (y_{i})}
is the information content of the outcome of πŸ‘ {\displaystyle Y}
taking the value πŸ‘ {\displaystyle y_{i}}
. The entropy of πŸ‘ {\displaystyle Y}
conditioned on πŸ‘ {\displaystyle X}
taking the value πŸ‘ {\displaystyle x}
is defined by:

πŸ‘ {\displaystyle \mathrm {H} (Y|X=x)=-\sum _{y\in {\mathcal {Y}}}{\Pr(Y=y|X=x)\log _{2}{\Pr(Y=y|X=x)}}.}

Note that πŸ‘ {\displaystyle \mathrm {H} (Y|X)}
is the result of averaging πŸ‘ {\displaystyle \mathrm {H} (Y|X=x)}
over all possible values πŸ‘ {\displaystyle x}
that πŸ‘ {\displaystyle X}
may take. Also, if the above sum is taken over a sample πŸ‘ {\displaystyle y_{1},\dots ,y_{n}}
, the expected value πŸ‘ {\displaystyle E_{X}[\mathrm {H} (y_{1},\dots ,y_{n}\mid X=x)]}
is known in some domains as equivocation.[2]

Given discrete random variables πŸ‘ {\displaystyle X}
with image πŸ‘ {\displaystyle {\mathcal {X}}}
and πŸ‘ {\displaystyle Y}
with image πŸ‘ {\displaystyle {\mathcal {Y}}}
, the conditional entropy of πŸ‘ {\displaystyle Y}
given πŸ‘ {\displaystyle X}
is defined as the weighted sum of πŸ‘ {\displaystyle \mathrm {H} (Y|X=x)}
for each possible value of πŸ‘ {\displaystyle x}
, using πŸ‘ {\displaystyle p(x)}
as the weights:[3]:β€Š15β€Š

πŸ‘ {\displaystyle {\begin{aligned}\mathrm {H} (Y|X)\ &\equiv \sum _{x\in {\mathcal {X}}}\,p(x)\,\mathrm {H} (Y|X=x)\\&=-\sum _{x\in {\mathcal {X}}}p(x)\sum _{y\in {\mathcal {Y}}}\,p(y|x)\,\log _{2}\,p(y|x)\\&=-\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}\,p(x)p(y|x)\,\log _{2}\,p(y|x)\\&=-\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}\,p(x)p(y|x)\,\log _{2}\,\left(p(y|x){\frac {p(x)}{p(x)}}\right)\\&=-\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}p(x,y)\log _{2}{\frac {p(x,y)}{p(x)}}.\end{aligned}}}

Properties

[edit]

Conditional entropy equals zero

[edit]
πŸ‘ {\displaystyle \mathrm {H} (Y|X)=0}
if and only if the value of πŸ‘ {\displaystyle Y}
is completely determined by the value of πŸ‘ {\displaystyle X}
.

Conditional entropy of independent random variables

[edit]

Conversely, πŸ‘ {\displaystyle \mathrm {H} (Y|X)=\mathrm {H} (Y)}
if and only if πŸ‘ {\displaystyle Y}
and πŸ‘ {\displaystyle X}
are independent random variables.

Chain rule

[edit]

Assume that the combined system determined by two random variables πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
has joint entropy πŸ‘ {\displaystyle \mathrm {H} (X,Y)}
, that is, we need πŸ‘ {\displaystyle \mathrm {H} (X,Y)}
bits of information on average to describe its exact state. Now if we first learn the value of πŸ‘ {\displaystyle X}
, we have gained πŸ‘ {\displaystyle \mathrm {H} (X)}
bits of information. Once πŸ‘ {\displaystyle X}
is known, we only need πŸ‘ {\displaystyle \mathrm {H} (X,Y)-\mathrm {H} (X)}
bits to describe the state of the whole system. This quantity is exactly πŸ‘ {\displaystyle \mathrm {H} (Y|X)}
, which gives the chain rule of conditional entropy:

πŸ‘ {\displaystyle \mathrm {H} (Y|X)\,=\,\mathrm {H} (X,Y)-\mathrm {H} (X).}
[3]:β€Š17β€Š

The chain rule follows from the above definition of conditional entropy:

πŸ‘ {\displaystyle {\begin{aligned}\mathrm {H} (Y|X)&=\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}p(x,y)\log \left({\frac {p(x)}{p(x,y)}}\right)\\[4pt]&=\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}p(x,y)(\log(p(x))-\log(p(x,y)))\\[4pt]&=-\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}p(x,y)\log(p(x,y))+\sum _{x\in {\mathcal {X}},y\in {\mathcal {Y}}}{p(x,y)\log(p(x))}\\[4pt]&=\mathrm {H} (X,Y)+\sum _{x\in {\mathcal {X}}}p(x)\log(p(x))\\[4pt]&=\mathrm {H} (X,Y)-\mathrm {H} (X).\end{aligned}}}

In general, a chain rule for multiple random variables holds:

πŸ‘ {\displaystyle \mathrm {H} (X_{1},X_{2},\ldots ,X_{n})=\sum _{i=1}^{n}\mathrm {H} (X_{i}|X_{1},\ldots ,X_{i-1})}
[3]:β€Š22β€Š

It has a similar form to chain rule in probability theory, except that addition instead of multiplication is used.

Bayes' rule

[edit]

Bayes' rule for conditional entropy states

πŸ‘ {\displaystyle \mathrm {H} (Y|X)\,=\,\mathrm {H} (X|Y)-\mathrm {H} (X)+\mathrm {H} (Y).}

Proof. πŸ‘ {\displaystyle \mathrm {H} (Y|X)=\mathrm {H} (X,Y)-\mathrm {H} (X)}
and πŸ‘ {\displaystyle \mathrm {H} (X|Y)=\mathrm {H} (Y,X)-\mathrm {H} (Y)}
. Symmetry entails πŸ‘ {\displaystyle \mathrm {H} (X,Y)=\mathrm {H} (Y,X)}
. Subtracting the two equations implies Bayes' rule.

If πŸ‘ {\displaystyle Y}
is conditionally independent of πŸ‘ {\displaystyle Z}
given πŸ‘ {\displaystyle X}
we have:

πŸ‘ {\displaystyle \mathrm {H} (Y|X,Z)\,=\,\mathrm {H} (Y|X).}

Other properties

[edit]

For any πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
:

πŸ‘ {\displaystyle {\begin{aligned}\mathrm {H} (Y|X)&\leq \mathrm {H} (Y)\,\\\mathrm {H} (X,Y)&=\mathrm {H} (X|Y)+\mathrm {H} (Y|X)+\operatorname {I} (X;Y),\qquad \\\mathrm {H} (X,Y)&=\mathrm {H} (X)+\mathrm {H} (Y)-\operatorname {I} (X;Y),\,\\\operatorname {I} (X;Y)&\leq \mathrm {H} (X),\,\end{aligned}}}

where πŸ‘ {\displaystyle \operatorname {I} (X;Y)}
is the mutual information between πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
.

For independent πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
:

πŸ‘ {\displaystyle \mathrm {H} (Y|X)=\mathrm {H} (Y)}
and πŸ‘ {\displaystyle \mathrm {H} (X|Y)=\mathrm {H} (X)\,}

Although the specific-conditional entropy πŸ‘ {\displaystyle \mathrm {H} (X|Y=y)}
can be either less or greater than πŸ‘ {\displaystyle \mathrm {H} (X)}
for a given random variate πŸ‘ {\displaystyle y}
of πŸ‘ {\displaystyle Y}
, πŸ‘ {\displaystyle \mathrm {H} (X|Y)}
can never exceed πŸ‘ {\displaystyle \mathrm {H} (X)}
.

Conditional differential entropy

[edit]

Definition

[edit]

The above definition is for discrete random variables. The continuous version of discrete conditional entropy is called conditional differential (or continuous) entropy. Let πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
be a continuous random variables with a joint probability density function πŸ‘ {\displaystyle f(x,y)}
. The differential conditional entropy πŸ‘ {\displaystyle h(X|Y)}
is defined as[3]:β€Š249β€Š

πŸ‘ {\displaystyle h(X|Y)=-\int _{{\mathcal {X}},{\mathcal {Y}}}f(x,y)\log f(x|y)\,dxdy}
.

Properties

[edit]

In contrast to the conditional entropy for discrete random variables, the conditional differential entropy may be negative.

As in the discrete case there is a chain rule for differential entropy:

πŸ‘ {\displaystyle h(Y|X)\,=\,h(X,Y)-h(X)}
[3]:β€Š253β€Š

Notice however that this rule may not be true if the involved differential entropies do not exist or are infinite.

Joint differential entropy is also used in the definition of the mutual information between continuous random variables:

πŸ‘ {\displaystyle \operatorname {I} (X,Y)=h(X)-h(X|Y)=h(Y)-h(Y|X)}
πŸ‘ {\displaystyle h(X|Y)\leq h(X)}
with equality if and only if πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
are independent.[3]:β€Š253β€Š

Relation to estimator error

[edit]

The conditional differential entropy yields a lower bound on the expected squared error of an estimator. For any Gaussian random variable πŸ‘ {\displaystyle X}
, observation πŸ‘ {\displaystyle Y}
and estimator πŸ‘ {\displaystyle {\widehat {X}}}
the following holds:[3]:β€Š255β€Š

πŸ‘ {\displaystyle \mathbb {E} \left[{\bigl (}X-{\widehat {X}}{(Y)}{\bigr )}^{2}\right]\geq {\frac {1}{2\pi e}}e^{2h(X|Y)}}

This is related to the uncertainty principle from quantum mechanics.

Generalization to quantum theory

[edit]

In quantum information theory, the conditional entropy is generalized to the conditional quantum entropy. The latter can take negative values, unlike its classical counterpart.

See also

[edit]

References

[edit]
  1. ^ "David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book". www.inference.org.uk. Retrieved 2019-10-25.
  2. ^ Hellman, M.; Raviv, J. (1970). "Probability of error, equivocation, and the Chernoff bound". IEEE Transactions on Information Theory. 16 (4): 368–372. CiteSeerX 10.1.1.131.2865. doi:10.1109/TIT.1970.1054466.
  3. ^ a b c d e f g T. Cover; J. Thomas (1991). Elements of Information Theory. Wiley. ISBN 0-471-06259-6.