Quantum Mechanics: Probability Theory Basics
and we will use this more compact expression henceforth. It will sometimes be useful to consider the probability simplex $\Delta_N$, which is a subset of $\mathbb{R}^N$: $\Delta_N$ consists of all nonnegative vectors with entries summing to one. Then we can write $\vec{p} \in \Delta_N$.
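As a small sanity check in code (our own illustration; the helper name `in_simplex` is ours), membership in $\Delta_N$ is easy to test numerically:

```python
import numpy as np

def in_simplex(p, tol=1e-12):
    # A vector lies in the probability simplex Delta_N when all of its
    # entries are nonnegative and they sum to one.
    p = np.asarray(p, dtype=float)
    return bool(np.all(p >= -tol) and abs(p.sum() - 1.0) <= tol)

print(in_simplex([0.25, 0.75]))  # True
print(in_simplex([0.5, 0.6]))    # False: entries sum to 1.1
```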
Next we consider a rudimentary version of dynamics. That is, what kinds of transformations on $\vec{p}$ will map it into another valid probability distribution? The first observation is that if we have $k$ probability distributions $\vec{p}_1, ..., \vec{p}_k$, then we can form a new probability distribution by forming a convex combination
$$\vec{p}^{\,\prime} = \sum_{j=1}^{k} r_j\, \vec{p}_j \tag{3}$$
where $r_j \ge 0$ and $\sum_{j=1}^{k} r_j = 1$. To see this, notice that $\vec{p}^{\,\prime}$ has nonnegative entries and that $\vec{1}^T \cdot \vec{p}^{\,\prime} = \sum_{j=1}^{k} r_j (\vec{1}^T \cdot \vec{p}_j) = \sum_{j=1}^{k} r_j = 1$. We can interpret $r_1, ..., r_k$ as a probability distribution over $k$ items in its own right, and say of (3) that we have a probabilistic mixture of $k$ probability distributions wherein we sample from $\vec{p}_j$ with probability $r_j$. That is, $r_1, ..., r_k$ is a probability distribution over probability distributions. (You can use this 'meta' statement to impress your friends, if you like.) To make this concrete, consider the following example:
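For instance, the following minimal NumPy sketch (our own illustration) mixes two distributions over three outcomes and confirms the result is again a probability distribution:

```python
import numpy as np

# Two distributions over the same three outcomes.
p1 = np.array([0.5, 0.5, 0.0])
p2 = np.array([0.1, 0.2, 0.7])

# Mixture weights r_1, r_2: themselves a probability distribution.
r = np.array([0.25, 0.75])

p_mix = r[0] * p1 + r[1] * p2
print(p_mix)        # [0.2   0.275 0.525]
print(p_mix.sum())  # 1.0: nonnegative entries summing to one
```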
Another example of a transformation that carries probability distributions to probability distributions would be a Bayesian update. There are clearly a vast infinitude of other possibilities as well. Among this infinitude of transformations $T$ there is a natural class that interfaces well with convex combinations of probability distributions. In particular, suppose we mandate that $T$ satisfies
$$T\!\left( \sum_{j=1}^{k} r_j\, \vec{p}_j \right) = \sum_{j=1}^{k} r_j\, T(\vec{p}_j) \tag{4}$$
for any $\vec{p}_1, ..., \vec{p}_k$ and any valid $r_1, ..., r_k$. In words, we are requiring that a transformation of a probabilistic mixture is a probabilistic mixture of transformations (and specifically, the same transformation). Such $T$'s satisfy a nice structure theorem:
Theorem. A transformation $T$ satisfies (4) if and only if $T(\vec{p}) = M \cdot \vec{p}$ for a Markov matrix $M$, that is, a matrix with nonnegative entries each of whose columns sums to one.

Proof. A Markov matrix acts linearly and so certainly satisfies (4). Conversely, let $M$ be the matrix whose $j$th column is $T(\vec{e}_j)$. Since $\vec{p} = \sum_i p_i \vec{e}_i$ is itself a convex combination of the basis vectors, (4) gives $T(\vec{p}) = \sum_i p_i\, T(\vec{e}_i) = M \cdot \vec{p}$. Each column $T(\vec{e}_j)$ is a probability vector, so $M_{ij} \ge 0$ and $\vec{1}^T \cdot M = \vec{1}^T$. Thus $M$ is a Markov matrix, as claimed. $\square$
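As a quick numerical sanity check of the theorem (a sketch of our own, not from the text), a column-normalized random matrix is Markov, acts convex-linearly, and maps distributions to distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 3x3 Markov matrix: nonnegative entries, each column sums to one.
M = rng.random((3, 3))
M /= M.sum(axis=0, keepdims=True)

p1 = np.array([0.2, 0.3, 0.5])
p2 = np.array([0.6, 0.1, 0.3])
r = 0.25

# Equation (4) with k = 2: mixing then transforming equals
# transforming then mixing.
lhs = M @ (r * p1 + (1 - r) * p2)
rhs = r * (M @ p1) + (1 - r) * (M @ p2)
print(np.allclose(lhs, rhs))          # True
print(np.isclose((M @ p1).sum(), 1))  # True: the output is a distribution
```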
Mixture-preserving transformations are natural from a physical point of view. Imagine a preparation device that, with probabilities $r_1, ..., r_k$, produces one of the distributions $\vec{p}_1, ..., \vec{p}_k$ by consulting some randomly tossed coins you do not get to see. If dynamics could distinguish whether this randomization happened "before" or "after" the transformation, then the timing of the unseen coin flips would be observable from the output statistics alone. Requiring that they not be observable is exactly the statement of (4).
Two simple consequences are worth keeping in mind. First, the admissible dynamics are closed under randomized control: if with probability $r_j$ you implement a Markov matrix $M_j$, then the overall map is
$$M' = \sum_{j=1}^{k} r_j\, M_j,$$
which is again a Markov matrix since $\vec{1}^T \cdot M' = \sum_{j=1}^{k} r_j (\vec{1}^T \cdot M_j) = \vec{1}^T$ and all entries are nonnegative. Second, if one further insists that deterministic states are carried to deterministic states, so that $\vec{e}_j$ never acquires additional randomness, then each column $T(\vec{e}_j)$ must itself be a basis vector. Equivalently, $M$ has exactly one 1 (and zeros elsewhere) in each column. Such matrices are sometimes called deterministic or functional Markov matrices. If in addition the mapping $j \mapsto i(j)$ is injective (no two distinct columns point to the same basis vector), then $M$ is a permutation matrix.
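For concreteness, here is a small sketch (ours; the helper `functional_markov` is hypothetical) of a deterministic Markov matrix built from a map $j \mapsto i(j)$:

```python
import numpy as np

def functional_markov(f, n):
    # Column j is the basis vector e_{f(j)}: a deterministic Markov matrix.
    M = np.zeros((n, n))
    for j in range(n):
        M[f(j), j] = 1.0
    return M

# A non-injective map: outcomes 0 and 1 are both sent to 0,
# so M merges probability mass and is not a permutation matrix.
M = functional_markov(lambda j: 0 if j < 2 else 2, 3)
p = np.array([0.2, 0.3, 0.5])
print(M @ p)  # [0.5 0.  0.5]
```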
By contrast, nonlinear updates arise when you condition on a revealed outcome and then renormalize; the rule in that case depends on which outcome was announced, so it is not a single fixed map on $\Delta_N$ and does not represent closed-system dynamics. This classical discussion sets the stage for the quantum case,
which we will treat soon. (There, the state space becomes the convex set of density
operators, mixture-preserving maps become convex-linear “channels,” and the role
of Markov matrices is played by completely positive, trace-preserving maps.)
Before we take up the quantum theory itself, we need to build up some multilinear algebra. The key operation will be the tensor product, which is an operation for joining two or more vector spaces.
We will proceed by motivating the tensor product informally through simple
examples, and then give the abstract definition. It is worth paying close attention
as the tensor product will serve as an essential piece of mathematical architecture
for almost everything in quantum learning theory.
Consider two vectors $\vec{v}, \vec{w}$ in $\mathbb{R}^N$. We denote their tensor product by $\vec{v} \otimes \vec{w}$. To develop what this means, consider the example below.
Example 3. Let $\vec{v} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ and $\vec{w} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$. Then their tensor product $\vec{v} \otimes \vec{w}$ is represented by
$$\vec{v} \otimes \vec{w} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} \otimes \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 \cdot \begin{pmatrix} 3 \\ 4 \end{pmatrix} \\[4pt] 2 \cdot \begin{pmatrix} 3 \\ 4 \end{pmatrix} \end{pmatrix} = \begin{pmatrix} 3 \\ 4 \\ 6 \\ 8 \end{pmatrix}.$$
In words, $\vec{w}$ gets 'sucked in' to $\vec{v}$. Now let us take the tensor product in the other order, namely $\vec{w} \otimes \vec{v}$:
$$\vec{w} \otimes \vec{v} = \begin{pmatrix} 3 \\ 4 \end{pmatrix} \otimes \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 3 \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix} \\[4pt] 4 \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix} \end{pmatrix} = \begin{pmatrix} 3 \\ 6 \\ 4 \\ 8 \end{pmatrix}.$$
The same stacking rule applies to vectors of different lengths: tensoring $\vec{v} \in \mathbb{R}^2$ with $\vec{w} \in \mathbb{R}^3$, for instance, produces a vector with six entries, and we write $\vec{v} \otimes \vec{w} \in \mathbb{R}^2 \otimes \mathbb{R}^3 \cong \mathbb{R}^6$.
From these examples we see the general rule that if $\vec{v} \in \mathbb{R}^N$ and $\vec{w} \in \mathbb{R}^M$, then $\vec{v} \otimes \vec{w} \in \mathbb{R}^N \otimes \mathbb{R}^M \cong \mathbb{R}^{NM}$. So upon taking the tensor product of two vector spaces, the dimensions multiply. We can generalize this further by contemplating another example:
Example 5. Let $\vec{v} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$, $\vec{w} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$, and $\vec{u} = \begin{pmatrix} 5 \\ 6 \end{pmatrix}$. Then we have
$$\vec{v} \otimes \vec{w} \otimes \vec{u} = (\vec{v} \otimes \vec{w}) \otimes \vec{u} = \begin{pmatrix} 3 \\ 4 \\ 6 \\ 8 \end{pmatrix} \otimes \begin{pmatrix} 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 15 \\ 18 \\ 20 \\ 24 \\ 30 \\ 36 \\ 40 \\ 48 \end{pmatrix}$$
and $\vec{v} \otimes \vec{w} \otimes \vec{u} \in \mathbb{R}^2 \otimes \mathbb{R}^2 \otimes \mathbb{R}^2 \cong \mathbb{R}^8$.
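These stacked vectors are exactly what a Kronecker product computes; the following sketch (our own illustration) reproduces Examples 3 and 5 with NumPy:

```python
import numpy as np

v = np.array([1, 2])
w = np.array([3, 4])
u = np.array([5, 6])

print(np.kron(v, w))              # [3 4 6 8]     (Example 3)
print(np.kron(w, v))              # [3 6 4 8]     order matters
print(np.kron(np.kron(v, w), u))  # [15 18 20 24 30 36 40 48] (Example 5)
```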
If $\{\vec{e}_i\}$ denotes the standard basis of $\mathbb{R}^N$ and $\{\vec{f}_j\}$ the standard basis of $\mathbb{R}^M$, then the $NM$ simple tensors $\{\vec{e}_i \otimes \vec{f}_j\}_{i,j}$ form a basis of $\mathbb{R}^N \otimes \mathbb{R}^M$, and so $\dim(\mathbb{R}^N \otimes \mathbb{R}^M) = NM$. If $\vec{v} = \sum_i v_i \vec{e}_i$ and $\vec{w} = \sum_j w_j \vec{f}_j$, then
$$\vec{v} \otimes \vec{w} = \sum_{i,j} v_i w_j\, (\vec{e}_i \otimes \vec{f}_j),$$
which recovers the stacking rules seen in the earlier examples and realizes the identification $\mathbb{R}^N \otimes \mathbb{R}^M \cong \mathbb{R}^{NM}$.
Identifying $\mathbb{R}$ with the one-dimensional space spanned by $1$, there are canonical isomorphisms $V \otimes \mathbb{R} \cong V \cong \mathbb{R} \otimes V$ given by $\vec{v} \otimes a \mapsto a\vec{v}$ and $a \otimes \vec{v} \mapsto a\vec{v}$. Hence $\mathbb{R}^N \otimes \mathbb{R}^1 \cong \mathbb{R}^N \cong \mathbb{R}^1 \otimes \mathbb{R}^N$.
Linear maps interact nicely with tensor products. If $A : \mathbb{R}^N \to \mathbb{R}^{N'}$ and $B : \mathbb{R}^M \to \mathbb{R}^{M'}$ are linear, there is a linear map $A \otimes B : \mathbb{R}^N \otimes \mathbb{R}^M \to \mathbb{R}^{N'} \otimes \mathbb{R}^{M'}$ defined by
$$(A \otimes B)(\vec{v} \otimes \vec{w}) = (A\vec{v}) \otimes (B\vec{w}),$$
which in matrix form is the familiar Kronecker product.
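In coordinates this is the Kronecker product of the matrices; a quick check of the defining identity (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 2))  # a linear map R^2 -> R^4
B = rng.random((5, 3))  # a linear map R^3 -> R^5
v = rng.random(2)
w = rng.random(3)

# (A tensor B)(v tensor w) == (A v) tensor (B w)
lhs = np.kron(A, B) @ np.kron(v, w)
rhs = np.kron(A @ v, B @ w)
print(np.allclose(lhs, rhs))  # True
```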
Remark 9 (Associativity of tensor products). For our purposes, it does not matter whether we first form $(V \otimes W)$ and then tensor with $U$ from the right, or first form $(W \otimes U)$ and then tensor with $V$ from the left. There is a canonical identification between
$$(V \otimes W) \otimes U \quad \text{and} \quad V \otimes (W \otimes U),$$
and so we will simply write
$$V \otimes W \otimes U$$
without worrying about parentheses. This scales to many tensor factors. For a vector space $V$ we write
$$V^{\otimes k} := \underbrace{V \otimes \cdots \otimes V}_{k \text{ copies}},$$
which has dimension $(\dim V)^k$ and a basis $\{\vec{e}_{i_1} \otimes \cdots \otimes \vec{e}_{i_k}\}$. We will use this to model multi-part systems: for example, a register of $k$ $N$-ary variables naturally lives in $(\mathbb{R}^N)^{\otimes k} \cong \mathbb{R}^{N^k}$.
As a word of caution, order still matters. As we explained before, in general we have $\vec{v} \otimes \vec{w} \neq \vec{w} \otimes \vec{v}$. When we want to swap the order of a tensor product we will use the linear map $\mathrm{SWAP} : V \otimes W \to W \otimes V$, acting by
$$\mathrm{SWAP} \cdot (\vec{v} \otimes \vec{w}) = \vec{w} \otimes \vec{v}.$$
In summary, associativity lets us ignore parentheses; SWAP lets us reorder factors
when needed.
Going from the abstract back to the concrete, we have the example below:
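As one such concrete example, the following NumPy sketch (our own; the helper `swap_matrix` is hypothetical) builds the matrix of SWAP on $\mathbb{R}^2 \otimes \mathbb{R}^3$ explicitly:

```python
import numpy as np

def swap_matrix(n, m):
    # Permutation matrix S on R^(n*m) with S (v tensor w) = w tensor v.
    S = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(m):
            # Entry i*m + j of v tensor w holds v_i * w_j;
            # it must land at entry j*n + i of w tensor v.
            S[j * n + i, i * m + j] = 1.0
    return S

v = np.array([1, 2])      # in R^2
w = np.array([3, 4, 5])   # in R^3
S = swap_matrix(2, 3)
print(S @ np.kron(v, w))  # [ 3  6  4  8  5 10]
print(np.kron(w, v))      # the same vector
```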
With some basic tensor product definitions at hand, we can now leverage them
to discuss joint probability distributions in a slick vector space formalism.
Respecting historical tradition,¹ suppose we have two urns, where the first urn has $N$ objects and the second urn has $M$ objects. Suppose that the probability that we select one of the $N$ items in the first urn is described by the probability vector $\vec{p} \in \mathbb{R}^N$, and the probability that we select one of the $M$ items in the second urn is described by the probability vector $\vec{q} \in \mathbb{R}^M$. Then if we select an item from the first urn followed by the second urn, what is the probability that we sampled item $i$ from the first urn and item $j$ from the second urn? The answer is encoded in the tensor product $\vec{p} \otimes \vec{q}$, and in particular its $((i-1)M + j)$th entry:
$$[\vec{p} \otimes \vec{q}\,]_{(i-1)M + j} = p_i\, q_j.$$
We can extract this entry by dotting $\vec{p} \otimes \vec{q}$ against $\vec{e}_i^{\,T} \otimes \vec{e}_j^{\,T}$, namely
$$(\vec{e}_i^{\,T} \otimes \vec{e}_j^{\,T}) \cdot (\vec{p} \otimes \vec{q}) = p_i\, q_j.$$
The vector $\vec{p} \otimes \vec{q}$ is itself a probability vector living in $\Delta_{NM} \subset \mathbb{R}^{NM}$; thus it is a probability distribution on $NM$ outcomes, as we wanted.

¹See Ars Conjectandi by Jacob Bernoulli, published posthumously in 1713.
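A short sketch of this bookkeeping (ours; note that Python indexes from 0, so the text's 1-indexed entry $(i-1)M + j$ becomes index $(i-1)M + (j-1)$):

```python
import numpy as np

p = np.array([0.3, 0.7])       # N = 2 outcomes in the first urn
q = np.array([0.2, 0.5, 0.3])  # M = 3 outcomes in the second urn
joint = np.kron(p, q)          # product distribution on N*M = 6 outcomes

i, j = 2, 3                    # 1-indexed outcomes, as in the text
M = len(q)
print(joint[(i - 1) * M + (j - 1)])  # 0.21
print(p[i - 1] * q[j - 1])           # 0.21 = p_i * q_j
print(np.isclose(joint.sum(), 1.0))  # True: still a probability vector
```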
So far we have examined $\vec{p} \otimes \vec{q}$, which is a product distribution: in our example, the sampling from each of the two urns is uncorrelated. Below we show in an example that convex combinations of tensor products can represent a correlated joint distribution.
Example 7. Suppose the first urn has two items (N = 2), say a ring and a
watch, and the second urn has three items (M = 3), say a tissue, a match, and a
rubber band. The urns were prepared by the ghost of Jacob Bernoulli. We are told
that with probability 1/3 he put a ring in the first urn and a rubber band in the
second urn, and with probability 2/3 he put a watch in the first urn and a match
in the second urn. Then the joint distribution over the urns is described by
$$\frac{1}{3} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \otimes \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} + \frac{2}{3} \begin{pmatrix} 0 \\ 1 \end{pmatrix} \otimes \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 1/3 \\ 0 \\ 2/3 \\ 0 \end{pmatrix}.$$
This distribution does not factorize into a tensor product of two individual vectors.
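One way to check this numerically (a sketch of our own): reshape the joint vector into an $N \times M$ table. The distribution factorizes as $\vec{p} \otimes \vec{q}$ exactly when that table equals $\vec{p}\,\vec{q}^{\,T}$, i.e., has rank one:

```python
import numpy as np

joint = np.array([0.0, 0.0, 1/3, 0.0, 2/3, 0.0])
table = joint.reshape(2, 3)          # rows: first urn; columns: second urn
print(np.linalg.matrix_rank(table))  # 2, so no product form p tensor q exists
```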
More generally, a vector of the form
$$\sum_{i_1, i_2, \ldots, i_k} r_{i_1 i_2 \cdots i_k}\; \vec{e}_{i_1} \otimes \vec{e}_{i_2} \otimes \cdots \otimes \vec{e}_{i_k}$$
is a joint distribution so long as $r_{i_1 i_2 \cdots i_k} \ge 0$ for all $i_1, i_2, ..., i_k$ and additionally $\sum_{i_1, i_2, ..., i_k} r_{i_1 i_2 \cdots i_k} = 1$. Here we have used multi-index notation, in which we are putting subscripts on subscripts; this is to avoid notation like $\sum_{a,b,c,...} r_{abc\cdots}$, which does not specify the total number of subscripts, which in our case is $k$. (Moreover, there are only 26 letters in the Latin alphabet.) Multi-index notation may initially seem like gross notation, but you will soon grow accustomed to it, like generations have before you.
Joint distributions interface nicely with the $\vec{1}^T$ row vector in a number of ways. For clarity, let us write $\vec{1}_N^T$ to denote the all-ones row vector with $N$ entries. Then we have the nice identity
$$\vec{1}_{N_1}^T \otimes \vec{1}_{N_2}^T \otimes \cdots \otimes \vec{1}_{N_k}^T = \vec{1}_{N_1 N_2 \cdots N_k}^T.$$
We can also use the all-ones row vector to formulate a nice way of computing marginal distributions. Given a subset $S$ of the subsystems $\{1, ..., k\}$, let $M_S = B_1 \otimes B_2 \otimes \cdots \otimes B_k$, where $B_j$ is the $N_j \times N_j$ identity matrix if $j \in S$ and $B_j = \vec{1}_{N_j}^T$ otherwise; applying $\vec{1}^T$ sums over a subsystem and thereby discards it, and so $M_S : \mathbb{R}^{N_1 \cdots N_k} \to \mathbb{R}^{\prod_{j \in S} N_j}$. Then $M_S \cdot \vec{p}$ is the marginal over the subsystems indexed by $S$.
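A sketch of this recipe for $k = 2$ subsystems (ours; the helper `marginal` is hypothetical), applied to the joint distribution of Example 7:

```python
import numpy as np

def marginal(joint, dims, keep):
    # Build M_S = B_1 tensor ... tensor B_k, where B_j is the identity on
    # subsystem j if j is kept and the all-ones row vector otherwise.
    Bs = [np.eye(n) if j in keep else np.ones((1, n))
          for j, n in enumerate(dims)]
    M_S = Bs[0]
    for B in Bs[1:]:
        M_S = np.kron(M_S, B)
    return M_S @ joint

joint = np.array([0.0, 0.0, 1/3, 0.0, 2/3, 0.0])  # Example 7, dims (2, 3)
print(marginal(joint, (2, 3), keep={0}))  # first urn:  [1/3 2/3]
print(marginal(joint, (2, 3), keep={1}))  # second urn: [0, 2/3, 1/3]
```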
To summarize, we have recast ordinary probability theory (on discrete probability spaces) in a linear-algebraic language, which has motivated us to develop the fundamentals of multi-linear algebra and tensor products. This mathematical technology certainly illuminates aspects of multi-linearity lurking in ordinary probability theory. But our true motivation was to set up probability theory in such a way as to make (finite-dimensional) quantum mechanics appear as a natural generalization, using many of the same ingredients. We turn to that generalization in the next section.