Advanced Analysis, Notes 5: Hilbert spaces (application: Von Neumann’s mean ergodic theorem)

by Orr Shalit

In this lecture we give an application of elementary operators-on-Hilbert-space theory, by proving von Neumann’s mean ergodic theorem. See also this treatment by Terry Tao on his blog.

For today’s lecture we will require the following simple fact which I forgot to mention in the previous one.

Exercise A: Let A, B \in B(H). Then \|AB\| \leq \|A\| \|B\|.

1. The basic problem of ergodic theory

In the study of discrete dynamical systems, one considers the action of some map T on a space X. Ergodic theory is the part of dynamical systems theory in which one is interested in the action of a measure preserving transformation T on a measure space X.

Perhaps surprisingly, the origins of ergodic theory are in mathematical physics – statistical mechanics, to be precise. If you are interested in learning more about how what we discuss here is related to physics, I suggest you take a look at the section “Ergodic theory: an introduction” in Reed and Simon’s Volume I of the series Methods of Modern Mathematical Physics.

Since our goal here is merely to illustrate how operator theory can be applied to ergodic theory, we will work in the simplest possible setup: our space X will be the unit interval [0,1], and our transformation T : [0,1] \rightarrow [0,1] will be piecewise continuous (we will not, however, need to assume that T is invertible). Anybody who took a course in measure theory can replace X and T by whatever they desire. The operator theoretic details will remain the same.

We assume further that T is measure preserving. By this we mean that for all f \in PC[0,1],

\int_0^1 f\circ T (x) dx = \int_0^1 f(x) dx .

It is not entirely clear at first whether there are interesting examples of measure preserving maps. Here are a few.


  1. For \alpha \in (0,1), let T(x) = x + \alpha (mod 1). Then \int_0^1 f\circ T (x) dx = \int_0^{1-\alpha} f(x+\alpha) dx + \int_{1-\alpha}^1 f(x+\alpha -1) dx and this is equal to \int_\alpha^1 f(x) dx + \int_0^\alpha f(x) dx = \int_0^1 f(x) dx .
  2. Let T(x) = 2x for x \in [0,1/2] and T(x) = 2x-1 for x \in (1/2,1]. Then \int_0^1 f\circ T (x) dx = \int_0^{1/2} f(2x) dx + \int_{1/2}^1 f(2x-1)dx = \frac{1}{2}\int_0^1 f(t) dt + \frac{1}{2}\int_{0}^1 f(t) dt = \int_0^1 f(t) dt, so T is measure preserving. (Remark: Note that T([0,1/2]) = [0,1], so T does not do what one might naively think that a “measure preserving” map should do. However, T does satisfy that the measure of T^{-1}(A) is equal to the measure of A for every measurable A \subset [0,1], and this turns out to be the important property).
  3. T(x) = 2x for x \in [0,1/2] and T(x) = 2-2x for x \in (1/2,1].
  4. etc.
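The defining identity can also be checked numerically for the three maps above. The following is only a rough sketch: the test function f, the choice \alpha = \sqrt{2} - 1 and the helper names are all assumptions made for illustration, and a midpoint Riemann sum only approximates the integrals.

```python
import math

def rotation(alpha):
    """Example 1: x -> x + alpha (mod 1)."""
    return lambda x: (x + alpha) % 1.0

def doubling(x):
    """Example 2: x -> 2x on [0,1/2], 2x - 1 on (1/2,1]."""
    return 2 * x if x <= 0.5 else 2 * x - 1

def tent(x):
    """Example 3: x -> 2x on [0,1/2], 2 - 2x on (1/2,1]."""
    return 2 * x if x <= 0.5 else 2 - 2 * x

def midpoint_integral(g, n=100_000):
    """Midpoint Riemann sum of g over [0,1] with n subintervals."""
    h = 1.0 / n
    return h * sum(g((k + 0.5) * h) for k in range(n))

def f(x):
    # Hypothetical test function; any piecewise continuous f would do.
    return math.cos(3 * x) + x * x

space_integral = midpoint_integral(f)
maps = [("rotation", rotation(math.sqrt(2) - 1)),
        ("doubling", doubling),
        ("tent", tent)]
for name, T in maps:
    composed = midpoint_integral(lambda x: f(T(x)))
    # The two printed numbers should agree up to discretization error.
    print(f"{name:9s} {composed:.6f}  vs  {space_integral:.6f}")
```

Agreement for one particular f proves nothing, of course; it is only a sanity check on the change of variables computations above.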

We pick a point x \in [0,1], and we start moving it around the space [0,1] by applying T again and again. We get a sequence x, T(x), T^2(x), \ldots in [0,1]. The basic problem in ergodic theory is to determine the statistical behavior of this sequence. To quantify the phrase “statistical behavior”, one may study the large N behavior of the so-called time averages

\frac{1}{N+1} \sum_{n=0}^N f (T^n (x)) ,

for functions f \in PC[0,1], say.

Why would this be interesting?

Suppose, for example, that f is the indicator function of some interval: f(x) = 1 if and only if x \in (a,b), otherwise f(x) = 0. In this case the sum \sum_{n=0}^N f(T^n(x)) counts the number of times that T^n(x) visited the interval (a,b) in the first N+1 steps that x takes along the sequence x, T(x), T^2(x), \ldots. The time averages therefore measure the fraction of the “time” that the sequence T^n(x) spends inside (a,b). When one takes the limit N \rightarrow \infty, if that limit exists, one gets a measure of how much time the sequence spends inside (a,b) in the long run. If the sequence T^n(x) behaves in a completely “random” manner, what would be your best guess for the limit of \frac{1}{N+1} \sum_{n=0}^N f (T^n (x)) ? If we think of the probability of T^n(x) being at a certain point in [0,1] as being uniformly distributed on [0,1], then the best guess, intuitively, would be that

\lim_{N \rightarrow \infty} \frac{1}{N+1} \sum_{n=0}^N f(T^n(x)) = b-a .

Note that  b-a = \int_0^1 f(t) dt, so our guess is

(*)   \lim_{N \rightarrow \infty} \frac{1}{N+1} \sum_{n=0}^N f(T^n(x)) = \int_0^1 f(t) dt

for indicator functions. Now maybe we would like to use some more complicated function f to measure the distribution of the sequence T^n(x). But if (*) holds for indicator functions then, by linearity, it holds for step functions, and then also presumably for other functions by some limiting process.

The equality (*) is sometimes also called the “Ergodic Hypothesis”. It describes a situation where the time average (that is, starting at a point x and taking the average of repeated measurements f(T^n(x))) is equal to the space average \int_0^1 f(t) dt (which is the expected value of f on the probability space [0,1]). I certainly do not claim that we have justified (*); in fact (*) does not always hold. All that we said is that we might expect (*) to hold if the sequence T^n(x) is spread out on the interval in a random or uniform way. The various ergodic theorems proved in ergodic theory make this very loose discussion precise.
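To get a feeling for the time averages, here is a small numerical experiment with the rotation T(x) = x + \alpha (mod 1) and the indicator function of (a,b). This is a sketch under assumptions: the starting point x = 0.1, the interval (0.2, 0.5) and the two values of \alpha are choices made here for illustration, and floating point arithmetic only approximates the true orbit.

```python
import math

def time_average(alpha, x, a, b, N):
    """Fraction of the orbit points x, T(x), ..., T^N(x) lying in (a, b),
    for the rotation T(x) = x + alpha (mod 1)."""
    # T^n(x) = x + n*alpha (mod 1); computing it directly avoids
    # accumulating rounding error over many iterations.
    hits = sum(1 for n in range(N + 1) if a < (x + n * alpha) % 1.0 < b)
    return hits / (N + 1)

a, b, x0 = 0.2, 0.5, 0.1   # hypothetical interval and starting point

# Irrational alpha: the time average should be close to b - a = 0.3.
irrational = time_average(math.sqrt(2) - 1, x0, a, b, 100_000)

# alpha = 1/2: the orbit only visits the two points 0.1 and 0.6,
# neither of which lies in (0.2, 0.5), so the time average is 0.
rational = time_average(0.5, x0, a, b, 100_000)

print(irrational, rational)
```

The second case is a concrete instance where (*) fails even though the map is measure preserving: the orbit is simply not spread out over the interval.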

2. The mean ergodic theorem

The mean ergodic theorem discusses the validity of (*) in the setting of L^2. (The reason why it is called mean is that convergence in the L^2 norm used to be called convergence in the mean). The first thing we have to do is to make sense of the composition f \circ T in L^2. The problem is that if f \in L^2[0,1], then f is not defined by the values it attains at points x \in [0,1], so it is not clear what we mean by f \circ T. This problem is solved very smoothly in the setting of Hilbert space.

Define a transformation U : C[0,1] \rightarrow PC[0,1] by Uf = f \circ T. By our assumptions on T, U is well defined, linear and isometric on C[0,1], which is a dense subspace of L^2[0,1]. (To see that U is isometric, apply the measure preserving property to |f|^2: \|Uf\|^2 = \int_0^1 |f|^2 \circ T (x) dx = \int_0^1 |f(x)|^2 dx = \|f\|^2.) By Exercise B from the previous lecture, U extends in a unique way to an isometry U: L^2[0,1] \rightarrow L^2[0,1]. We continue to write U f = f \circ T even for f \in L^2[0,1].

Definition 1: A transformation T : [0,1] \rightarrow [0,1] is said to be ergodic if f = f \circ T implies that f = const.

Theorem 2 (Mean ergodic theorem): Let T : [0,1] \rightarrow [0,1] be a measure preserving (piecewise continuous) ergodic transformation. Then for all f \in L^2[0,1]

\lim_{N \rightarrow \infty} \frac{1}{N+1} \sum_{n=0}^N f \circ T^n = \int_0^1 f(t) dt

in the L^2 norm. 

Remark: Recalling (*), one immediately sees the weakness of this theorem. It does not tell us what happens for particular x in the time averages, but only gives us norm convergence of the sequence of functions \frac{1}{N+1} \sum_{n=0}^N f \circ T^n. Typically, pointwise (or almost everywhere pointwise) convergence theorems are harder to prove.

The operator U that we defined above is an isometry, and in particular it satisfies \|U\| \leq 1. An operator A satisfying \|A\|\leq 1 is said to be a contraction. Theorem 2 is an immediate consequence of the following theorem.

Theorem 3: Let H be a Hilbert space and let A \in B(H) be a contraction. Denote M = \{x : Ax = x\}. Then for all x \in H

\lim_{N \rightarrow \infty} \frac{1}{N+1}\sum_{n=0}^N A^n x = P_M x

in norm.

To see how Theorem 2 follows from Theorem 3, note simply that if T is ergodic, then M = \{f : Uf = f\} is the one dimensional space of constant functions, so (by Theorem 15 in Notes 2) P_M f = (f,1)1 = \int_0^1 f(t) dt.

Proof: The first step of the proof requires a bit of inspiration. Recall that \|A^*\| = \|A\|\leq 1. Let x \in M be a unit vector, and consider (A^*x,x) = (x, Ax) = (x,x) = 1. From the Cauchy-Schwarz inequality, 1=|(A^*x,x)| \leq \|A^*x\| \|x\| \leq 1, so |(A^*x,x)| = \|A^*x\| \|x\|. From the equality part of Cauchy-Schwarz this can only happen when A^*x = c x. But reconsidering (A^*x,x) it is evident that c = 1. Thus A^*x = x. Since A^* is also a contraction and (A^*)^* = A, the same argument with the roles of A and A^* exchanged gives the converse. We get a nice little result: for a contraction A,

Ax = x \Leftrightarrow A^*x = x .
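The assumption that A is a contraction cannot simply be dropped here. As a quick sanity check, consider the following 2 \times 2 example (the matrix and the helper functions are choices made here for illustration, not part of the lecture): A has norm greater than 1, it fixes x, but A^* does not.

```python
def matvec(A, x):
    """Multiply the matrix A (a list of rows) by the vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    """Transpose of A; for a real matrix this is the adjoint A^*."""
    return [list(col) for col in zip(*A)]

# Not a contraction: A maps (0, 1) to (1, 1), which has norm sqrt(2) > 1.
A = [[1.0, 1.0],
     [0.0, 1.0]]
x = [1.0, 0.0]

print(matvec(A, x))             # [1.0, 0.0], so A x = x
print(matvec(transpose(A), x))  # [1.0, 1.0], so A^* x != x
```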

Next, we try to understand what the decomposition H = M \oplus M^\perp looks like. Note that M = N(I-A). By our nice little result, N(I-A) = N(I-A^*). Therefore (by Proposition 9 in Notes 4), M^\perp = N(I-A^*)^\perp = \overline{R(I-A)}. So M induces the decomposition

H = N(I-A) \oplus \overline{R(I-A)}.

If x \in N(I-A), then \frac{1}{N+1}\sum_{n=0}^N A^n x = x = P_M x, and the conclusion of the theorem holds for this x.

Let w \in R(I-A). Then w has the form w = z - Az for some z. We compute

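\frac{1}{N+1}\sum_{n=0}^N A^n (z - Az) = \frac{1}{N+1}\left(\sum_{n=0}^N A^n z - \sum_{n=1}^{N+1} A^n z\right) = \frac{1}{N+1}(z-A^{N+1}z) \rightarrow 0.

We have used the fact that \|A^{N+1}\| \leq \|A\|^{N+1} \leq 1 (by Exercise A), so that \|z-A^{N+1}z\| \leq 2\|z\| is bounded independently of N.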

Now consider y \in \overline{R(I-A)}, and fix \epsilon >0. There is some w \in R(I-A) such that \|w-y\| < \epsilon. For some N_0, \|\frac{1}{N+1} \sum_{n=0}^N A^n w\|< \epsilon for all N \geq N_0. Then for N \geq N_0 we break up \sum_{n=0}^N A^n y  as \sum_{n=0}^N A^n (y-w) + \sum_{n=0}^N A^n w, and use \|A^n (y-w)\| \leq \|y-w\| < \epsilon for every n, to find that

\| \frac{1}{N+1} \sum_{n=0}^N A^n y \| < 2 \epsilon

demonstrating that \frac{1}{N+1} \sum_{n=0}^N A^n y \rightarrow 0 = P_M y . Thus the theorem holds for all y \in M^\perp.

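Finally, since H = M \oplus M^\perp, every x \in H can be written as x = P_M x + (x - P_M x) with x - P_M x \in M^\perp, so by linearity the theorem holds on all of H.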
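Theorem 3 is easy to watch in action in finite dimensions. The sketch below is only an illustration under assumptions: the space is R^4, and the contraction A (identity on the first coordinate, a rotation by one radian on the next two, multiplication by 1/2 on the last) and the vector x are choices made here, not taken from the lecture. For this A the fixed space M is spanned by the first basis vector, so P_M x keeps only the first coordinate of x.

```python
import math

def matvec(A, x):
    """Multiply the matrix A (a list of rows) by the vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cesaro_average(A, x, N):
    """Return (1/(N+1)) * sum_{n=0}^{N} A^n x."""
    total = [0.0] * len(x)
    power = list(x)                      # holds A^n x
    for _ in range(N + 1):
        total = [t + p for t, p in zip(total, power)]
        power = matvec(A, power)
    return [t / (N + 1) for t in total]

theta = 1.0                              # hypothetical rotation angle (radians)
c, s = math.cos(theta), math.sin(theta)
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0,   c,  -s, 0.0],
     [0.0,   s,   c, 0.0],
     [0.0, 0.0, 0.0, 0.5]]
x = [2.0, 1.0, -1.0, 3.0]
projection = [2.0, 0.0, 0.0, 0.0]        # P_M x, since M is spanned by e_1

def distance(u, v):
    """Euclidean distance between the vectors u and v."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

for N in (10, 100, 1000, 10000):
    # The error should shrink roughly like 1/N.
    print(N, distance(cesaro_average(A, x, N), projection))
```

The roughly 1/N decay of the error is what the telescoping computation in the proof predicts for vectors in R(I-A); in infinite dimensions the convergence on \overline{R(I-A)} can be arbitrarily slow.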