VOOZH about

URL: https://en.wikipedia.org/wiki/Variation_of_information

⇱ Variation of information - Wikipedia


Jump to content
From Wikipedia, the free encyclopedia
Measure of distance between two clusterings related to mutual information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1][2][3]

πŸ‘ Image
Information diagram illustrating the relation between information entropies, mutual information and variation of information.

Definition

[edit]

Suppose we have two partitions πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
of a set πŸ‘ {\displaystyle A}
, namely πŸ‘ {\displaystyle X=\{X_{1},X_{2},\ldots ,X_{k}\}}
and πŸ‘ {\displaystyle Y=\{Y_{1},Y_{2},\ldots ,Y_{l}\}}
.

Let:

πŸ‘ {\displaystyle n=\sum _{i}|X_{i}|=\sum _{j}|Y_{j}|=|A|}
πŸ‘ {\displaystyle p_{i}=|X_{i}|/n}
and πŸ‘ {\displaystyle q_{j}=|Y_{j}|/n}
πŸ‘ {\displaystyle r_{ij}=|X_{i}\cap Y_{j}|/n}

Then the variation of information between the two partitions is:

πŸ‘ {\displaystyle \mathrm {VI} (X;Y)=-\sum _{i,j}r_{ij}\left[\log(r_{ij}/p_{i})+\log(r_{ij}/q_{j})\right]}
.

This is equivalent to the shared information distance between the random variables i and j with respect to the uniform probability measure on πŸ‘ {\displaystyle A}
defined by πŸ‘ {\displaystyle \mu (B):=|B|/n}
for πŸ‘ {\displaystyle B\subseteq A}
.

Explicit information content

[edit]

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set form a compact lattice where the partial order induces two operations, the meet πŸ‘ {\displaystyle \wedge }
and the join πŸ‘ {\displaystyle \vee }
, where the maximum πŸ‘ {\displaystyle {\overline {\mathrm {1} }}}
is the partition with only one block, i.e., all elements grouped together, and the minimum is πŸ‘ {\displaystyle {\overline {\mathrm {0} }}}
, the partition consisting of all elements as singletons. The meet of two partitions πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
is easy to understand as that partition formed by all pair intersections of one block of, πŸ‘ {\displaystyle X_{i}}
, of πŸ‘ {\displaystyle X}
and one, πŸ‘ {\displaystyle Y_{i}}
, of πŸ‘ {\displaystyle Y}
. It then follows that πŸ‘ {\displaystyle X\wedge Y\subseteq X}
and πŸ‘ {\displaystyle X\wedge Y\subseteq Y}
.

Let's define the entropy of a partition πŸ‘ {\displaystyle X}
as

where πŸ‘ {\displaystyle p_{i}=|X_{i}|/n}
. Clearly, πŸ‘ {\displaystyle H({\overline {\mathrm {1} }})=0}
and πŸ‘ {\displaystyle H({\overline {\mathrm {0} }})=\log \,n}
. The entropy of a partition is a monotonous function on the lattice of partitions in the sense that πŸ‘ {\displaystyle X\subseteq Y\Rightarrow H(X)\geq H(Y)}
.

Then the VI distance between πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
is given by

The difference πŸ‘ {\displaystyle d(X,Y)\equiv |H\left(X\right)-H\left(Y\right)|}
is a pseudo-metric as πŸ‘ {\displaystyle d(X,Y)=0}
doesn't necessarily imply that πŸ‘ {\displaystyle X=Y}
. From the definition of πŸ‘ {\displaystyle {\overline {\mathrm {1} }}}
, it is πŸ‘ {\displaystyle \mathrm {VI} (X,\mathrm {1} )\,=\,H\left(X\right)}
.

If in the Hasse diagram we draw an edge from every partition to the maximum πŸ‘ {\displaystyle {\overline {\mathrm {1} }}}
and assign it a weight equal to the VI distance between the given partition and πŸ‘ {\displaystyle {\overline {\mathrm {1} }}}
, we can interpret the VI distance as basically an average of differences of edge weights to the maximum

For πŸ‘ {\displaystyle H(X)}
as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet

and we also have that πŸ‘ {\displaystyle d(X,X\wedge Y)\,=\,H(X\wedge Y|X)}
coincides with the conditional entropy of the meet (intersection) πŸ‘ {\displaystyle X\wedge Y}
relative to πŸ‘ {\displaystyle X}
.

Identities

[edit]

The variation of information satisfies

πŸ‘ {\displaystyle \mathrm {VI} (X;Y)=H(X)+H(Y)-2I(X,Y)}
,

where πŸ‘ {\displaystyle H(X)}
is the entropy of πŸ‘ {\displaystyle X}
, and πŸ‘ {\displaystyle I(X,Y)}
is mutual information between πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
with respect to the uniform probability measure on πŸ‘ {\displaystyle A}
. This can be rewritten as

πŸ‘ {\displaystyle \mathrm {VI} (X;Y)=H(X,Y)-I(X,Y)}
,

where πŸ‘ {\displaystyle H(X,Y)}
is the joint entropy of πŸ‘ {\displaystyle X}
and πŸ‘ {\displaystyle Y}
, or

πŸ‘ {\displaystyle \mathrm {VI} (X;Y)=H(X|Y)+H(Y|X)}
,

where πŸ‘ {\displaystyle H(X|Y)}
and πŸ‘ {\displaystyle H(Y|X)}
are the respective conditional entropies.

The variation of information can also be bounded, either in terms of the number of elements:

πŸ‘ {\displaystyle \mathrm {VI} (X;Y)\leq \log(n)}
,

Or with respect to a maximum number of clusters, πŸ‘ {\displaystyle K^{*}}
:

πŸ‘ {\displaystyle \mathrm {VI} (X;Y)\leq 2\log(K^{*})}

Triangle inequality

[edit]

To verify the triangle inequality πŸ‘ {\displaystyle \mathrm {VI} (X;Z)\leq \mathrm {VI} (X;Y)+\mathrm {VI} (Y;Z)}
, expand using the identity πŸ‘ {\displaystyle \mathrm {VI} (X;Y)=H(X|Y)+H(Y|X)}
. It suffices to prove πŸ‘ {\displaystyle H(X|Z)\leq H(X|Y)+H(Y|Z)}
The right side has a lower bound πŸ‘ {\displaystyle H(X|Y)+H(Y|Z)\geq H(X|Y,Z)+H(Y|Z)=H(X,Y|Z)}
which is no less than the left side.

References

[edit]
  1. ^ Arabie, P.; Boorman, S. A. (1973). "Multidimensional scaling of measures of distance between partitions". Journal of Mathematical Psychology. 10 (2): 148–203. doi:10.1016/0022-2496(73)90012-6.
  2. ^ W.H. Zurek, Nature, vol 341, p119 (1989); W.H. Zurek, Physics Review A, vol 40, p. 4731 (1989)
  3. ^ MeilΔƒ, Marina (2003). "Comparing Clusterings by the Variation of Information". In Bernhard SchΓΆlkopf; Manfred K. Warmuth (eds.). Learning Theory and Kernel Machines. 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop. Lecture Notes in Computer Science, vol. 2777. Springer. pp. 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1. S2CID 4341039.

Further reading

[edit]

External links

[edit]