Proof

Not really. But maybe.

"Happiness can be discrete
( ... even when melancholy seems continuous)."

### Setup: States and Actions

Our 'way of doing things' is contained in the object $\theta$. This decides *how we react* when given some state / situation in life $s_t$ (at time $t$). We respond to a situation by taking an action sampled from the probability distribution $a_t \sim \pi_{\theta}(a_t|s_t)$. This distribution (which depends on $\theta$) represents our "behaviour" and spits out our responding "action" given the situation $s_t$.

The object $\theta$ is the "configuration" (*or* strategy) for our behaviour. We can tune our behaviour by changing $\theta$, which in turn changes our distribution over actions $\pi_\theta(a_t|s_t)$.

Let's assume our first state $s_1$ is sampled from some distribution of initial states $p(s_1)$. This state would be the cards we're dealt in life:

$$s_1 \sim p(s_1)$$

Now our *response*, as explained above, would be sampled from our behaviour distribution:

$$a_1 \sim \pi_{\theta}(a_t = a_1 | s_t = s_1)$$

After taking action $a_1$ we land up in some state $s_2$. Which state is this? Let's assume that this state is sampled from some "state transition" probability distribution that spits out a state *given the current state* and the *action taken*. This represents the randomness of the world. So we have:

$$ s_{t+1} \sim p(s_{t+1} | s_t, a_t)$$

And so the cycle continues. This collection of specific actions taken and states landed on is a trajectory $\tau$:

$$\tau = (s_1, a_1, s_2, a_2, \dots, s_T, a_T, s_{T+1}) $$

Another way of saying the above is that we are sampling our trajectories from some "trajectory" distribution, $\tau \sim p_{\theta}(\tau)$ (instead of breaking the sampling down into sampling from the policy {or *action distribution*} and the transition distribution {or *state distribution*}).
Since both approaches are saying the same thing, we can say that:

$$ p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \dots) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t | s_t)\, p(s_{t+1} | s_t, a_t) $$

### Goal: Happiness and fulfillment

Our "fulfillment" (*read* happiness) is captured by the function $J(\pi_{\theta})$. This "fulfillment" function $J$ depends on the actions we take in life and hence depends on our behaviour $\pi_{\theta}$. More precisely, $J(\pi_\theta)$ is:

$$ J(\pi_\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r_t\right]$$

where $r_t$ is the reward / happiness / fulfillment we get at time $t$. $J$ is then basically the sum of mini-"happinesses" we get over all time. The expectation is there because we're navigating a probabilistic world.

As sincere and curious learners we want to figure out the best *way of doing things*, the one that would make us happiest. More mathematically, we want to find the optimal behaviour $\pi^*_{\theta}$ that maximizes $J$:

$$\begin{aligned}
\pi^*_{\theta} &= \arg\max_{\pi_\theta} J(\pi_{\theta}) \\
&= \arg\max_{\pi_{\theta}} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r_t\right] \\
&= \arg\max_{\pi_{\theta}} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[r_{\tau}\right]
\end{aligned}$$

Here I've just written the cumulative happiness over time $\sum_t r_t$ as $r_{\tau}$, the cumulative happiness over the trajectory (same thing).

### Trying and learning

So, to reiterate, we're navigating this stochastic world using our current behaviour $\pi_\theta$. If we somehow had the gradient $\nabla_{\theta}J(\pi_\theta)$, we would get a sense of how to tune our $\theta$ so that our fulfillment $J(\pi_\theta)$ increases: moving $\theta$ in the direction of $\nabla_{\theta} J(\pi_{\theta})$ moves us in the direction of increasing $J$.
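
To make "navigating this stochastic world" concrete, here is a minimal sketch of sampling trajectories and estimating $J(\pi_\theta)$ by averaging. Everything in it is an assumption made up for illustration: a toy world with 2 states and 2 actions, a tabular softmax policy, deterministic rewards `R[s, a]`, and the names `policy`, `sample_trajectory` and `estimate_J`.

```python
import numpy as np

# Hypothetical toy world (all numbers are made up for illustration).
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, T = 2, 2, 5
p_s1 = np.array([0.5, 0.5])                    # p(s_1): the cards we're dealt
# p(s_{t+1} | s_t, a_t), indexed as P[s, a, s_next]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])                     # assumed reward r(s, a)

def policy(theta, s):
    """Softmax policy pi_theta(. | s); theta has shape (N_STATES, N_ACTIONS)."""
    z = theta[s] - theta[s].max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_trajectory(theta, rng):
    """Sample T steps of (s_t, a_t, r_t) by alternating policy and transition draws."""
    states, actions, rewards = [], [], []
    s = rng.choice(N_STATES, p=p_s1)           # s_1 ~ p(s_1)
    for _ in range(T):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))   # a_t ~ pi_theta(a_t | s_t)
        states.append(s)
        actions.append(a)
        rewards.append(R[s, a])
        s = rng.choice(N_STATES, p=P[s, a])    # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return np.array(states), np.array(actions), np.array(rewards)

def estimate_J(theta, n, rng):
    """Monte Carlo estimate of J: average total reward over n sampled trajectories."""
    returns = [sample_trajectory(theta, rng)[2].sum() for _ in range(n)]
    return float(np.mean(returns))

theta = np.zeros((N_STATES, N_ACTIONS))        # theta = 0 gives a uniform policy
J_hat = estimate_J(theta, 2000, rng)
print(J_hat)
```

In this particular toy world a uniform policy earns an expected reward of 0.5 per step in either state, so the estimate should hover around $0.5 \times T = 2.5$.
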
The gradient $\nabla_{\theta}J(\pi_\theta)$ is (writing $r(\tau)$ for $r_\tau$):

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[r(\tau)\right] \\
&= \nabla_\theta \int p_\theta(\tau)\, r(\tau)\,d\tau\\
&= \int \nabla_\theta p_\theta(\tau)\, r(\tau)\,d\tau\\
&= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau) \,d\tau\\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]
\end{aligned}$$

The fourth line uses the identity $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$.

A subtle but beautiful point here is that we're actually tuning our *continuous* $\theta$ knob for *discrete* rewards [Mario can collect chunky coins, Atari can have discrete points, and we can be happy in discrete bursts] that we collect along our trajectory (this applies to continuous rewards too).

Our integral above is a sum in $\tau$ space, that is, a sum over all possible trajectories. This is not nice at all^* because we'd have to sum over all possible trajectories to get our gradient at the current $\theta$. Let's approximate the expectation by sampling a few ($N$) trajectories $\tau_1, \dots, \tau_N$. Since $\log p_\theta(\tau) = \log p(s_1) + \sum_t \log \pi_\theta(a_t | s_t) + \sum_t \log p(s_{t+1} | s_t, a_t)$ and only the policy terms depend on $\theta$, we have $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t)$, so:

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\, r(\tau_i)\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t})\right) \left(\sum_{t'=1}^{T} r_{i,t'}\right)
\end{aligned}$$

This estimator in its current form has no bias but extremely high variance. Why is it high variance? Because the quantity inside the sum $\sum^{N}_{i=1}$ can vary a lot depending on which trajectory we landed up in. Life trajectories can be pretty diverse, and so the corresponding happiness over each trajectory can be pretty diverse too. The approximation approaches the true gradient only as $N \to \infty$, that is, once we've effectively sampled all trajectories. Because of this high variance, we can very easily veer away from potential optimal behaviours $\pi^*_{\theta}$ when optimizing with this approximation.

But what is the expression actually saying?
It tells me how to change my 'ways' $\theta$. It gives me the direction to move in $\theta$ space such that my expected fulfillment $J$ increases. We go through each $s_t$ observed in our trajectory, take note of the direction in $\theta$ space that makes the action we took more likely, and *reweigh* (increase/decrease) the probability of that action: increase it if the trajectory gave us "happiness" and decrease it if the trajectory gave us "sadness".

One could argue that the true value of a trajectory can only be known if one actually follows it to the end - a thought both scary and reassuring at the same time. No one knows the right answer. But we're going to resort to extrapolating our *optimized* behaviour from only a few trajectories, which we could then use in both previously seen and unseen states.

### Can we ignore the past?

Going back to our expression:

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\, r(\tau_i)\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t})\right) \left(\sum_{t'=1}^{T} r_{i,t'}\right)
\end{aligned}$$

Examining the terms inside the sum, we can say that at each timestep $t$ we're reweighing (increasing / decreasing) the probability of the "action" taken by the total happiness obtained over the entire trajectory. More precisely, we're moving in $\theta$-space according to $\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)$ multiplied by the cumulative reward over the entire trajectory, which is $\sum^{T}_{t'=1} r_{t'}$.

To reduce variance, we need to shrink the quantity inside our estimator. A suggestion is to only count future happiness when evaluating the importance of the current action, thereby ignoring rewards / happinesses earned in the past. This is an acknowledgement of the fact that the action we take now should only be weighed by the happiness we get from here on in the future; it has *nothing to do with the rewards earned in the past*.
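
Before any proof, the suggestion can be probed empirically. The sketch below is not a definitive implementation: it assumes a made-up toy world (2 states, 2 actions, deterministic rewards `R[s, a]`) and a tabular softmax policy, and the names `grad_log_pi` and `one_sample_gradients` are hypothetical. It computes the single-trajectory gradient estimate twice, once weighting each $\nabla_\theta \log \pi_\theta(a_t|s_t)$ by the full return and once by the reward-to-go $\sum_{t' \ge t} r_{t'}$, and compares their means and variances.

```python
import numpy as np

# Hypothetical toy world (all numbers are made up for illustration).
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, T = 2, 2, 5
p_s1 = np.array([0.5, 0.5])                    # p(s_1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[s, a, s_next] = p(s_next | s, a)
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])                     # assumed reward r(s, a)

def policy(theta, s):
    """Softmax policy pi_theta(. | s)."""
    z = theta[s] - theta[s].max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a | s) for a tabular softmax: onehot(a) - pi(. | s) in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

def one_sample_gradients(theta, rng):
    """Return (full-return estimate, reward-to-go estimate) from one sampled trajectory."""
    s = rng.choice(N_STATES, p=p_s1)
    scores, rewards = [], []
    for _ in range(T):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        scores.append(grad_log_pi(theta, s, a))
        rewards.append(R[s, a])
        s = rng.choice(N_STATES, p=P[s, a])
    rewards = np.array(rewards)
    to_go = np.cumsum(rewards[::-1])[::-1]     # to_go[t] = sum of rewards from step t onward
    g_full = sum(sc * rewards.sum() for sc in scores)
    g_togo = sum(sc * w for sc, w in zip(scores, to_go))
    return g_full, g_togo

theta = np.zeros((N_STATES, N_ACTIONS))
samples = [one_sample_gradients(theta, rng) for _ in range(4000)]
g_full = np.array([pair[0] for pair in samples])
g_togo = np.array([pair[1] for pair in samples])

# Largest entrywise gap between the two averaged gradient estimates.
mean_gap = np.abs(g_full.mean(axis=0) - g_togo.mean(axis=0)).max()
# Total variance (summed over the entries of theta) of each single-trajectory estimate.
var_full, var_togo = g_full.var(axis=0).sum(), g_togo.var(axis=0).sum()
print(mean_gap, var_full, var_togo)
```

In this toy setting the two averages should agree up to sampling noise, while the reward-to-go version should show a noticeably smaller variance. That is only evidence from one made-up world, not an argument.
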
But can we show mathematically that we can ignore past rewards when evaluating current actions? More precisely, can we say:

$$\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=1}^{T} r_{t'} \right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \biggl(\sum_{t'=1}^{t-1} r_{t'} + \sum_{t'=t}^{T} r_{t'} \biggr)\right] \\
&\stackrel{?}{=} \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=t}^{T} r_{t'} \right]
\end{aligned}$$

In other words, we have to show:

$$\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=1}^{t-1} r_{t'} \right] = 0$$

Let's go ahead and show that.

### Epiphanies emergent

**Lemma:** Given a probability distribution $P_{\theta}$ and a random variable $X \sim P_{\theta}$, we have:

$$\mathbb{E}_{X \sim P_\theta}\left[\nabla_\theta \log P_\theta(X)\right] = 0$$

Proof: $P_\theta$ is a probability distribution, so it integrates to one:

$$\int P_\theta(x)\,dx = 1$$

Taking the gradient of both sides:

$$\nabla_\theta \int P_\theta(x)\,dx = \nabla_\theta 1 = 0$$

Given that the gradient of $P_\theta$ exists almost everywhere (and that we may move the gradient inside the integral):

$$\begin{aligned}
\int \nabla_\theta P_\theta(x)\,dx &= 0 \\
\int P_\theta(x)\, \nabla_\theta \log P_\theta(x)\,dx &= 0 \\
\mathbb{E}_{X \sim P_\theta}\left[\nabla_\theta \log P_\theta(X)\right] &= 0
\end{aligned}$$

---

Now, let's look at our original expectation.
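
First, though, a quick numerical sanity check of the lemma. This is a sketch under an assumption of my choosing: $P_\theta$ is taken to be a categorical softmax distribution over three outcomes, for which $\nabla_\theta \log P_\theta(x) = \text{onehot}(x) - P_\theta$, so the expectation is a finite sum we can compute exactly. The helper names `softmax` and `expected_score` are hypothetical.

```python
import numpy as np

def softmax(theta):
    """Categorical distribution P_theta over len(theta) outcomes."""
    z = theta - theta.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def expected_score(theta):
    """E_{X ~ P_theta}[grad_theta log P_theta(X)], computed as an exact finite sum."""
    p = softmax(theta)
    # Row x of `scores` is grad_theta log P_theta(x) = onehot(x) - p.
    scores = np.eye(len(theta)) - p
    # Weight each row by its probability P_theta(x) and add them up.
    return p @ scores

theta = np.array([0.3, -1.2, 2.0])   # arbitrary made-up parameters
print(expected_score(theta))          # should be zero up to floating-point error
```

The same check works for any other $\theta$: the probability-weighted rows always cancel, which is exactly what the lemma says.
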
The happiness / reward $r_{t'}$ at time $t'$ is the one obtained after taking action $a_{t'}$ in state $s_{t'}$; for example, $r_{t-1}$ is obtained at time step $t-1$ after taking action $a_{t-1}$ in state $s_{t-1}$. The quantity we want to show is zero is:

$$\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=1}^{t-1} r_{t'} \right]$$

Using linearity of expectation we can pull the sum over $t$ outside, and using iterated expectations we can condition each term on the state exactly at $t$, $s_t$. So the quantity above can be written as:

$$\sum_{t=1}^{T} \mathbb{E}_{s_t}\left[\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=1}^{t-1} r_{t'} \,\middle|\, s_t\right]\right]$$

Now given that $s_t$ is fixed, the random variable $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is conditionally independent of the random variable $\sum_{t'=1}^{t-1} r_{t'}$, which only involves rewards with $t' < t$. In plain English, the value of an action taken now for total happiness is independent of past happiness if we already know our current state. By definition, the 'state' is Markovian in nature and contains all the information needed to predict future states, and our action $a_t$ depends on the past only through $s_t$. So we can factor the inner expectation as:

$$\sum_{t=1}^{T} \mathbb{E}_{s_t}\left[\mathbb{E}_{a_t \sim \pi_\theta(\cdot | s_t)}\left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \right] \, \mathbb{E}\left[ \sum_{t'=1}^{t-1} r_{t'} \,\middle|\, s_t\right] \right]$$

Using the Lemma with $P_\theta = \pi_\theta(\cdot | s_t)$, the first factor inside is zero:

$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot | s_t)}\left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \right] = 0$$

so the entire expectation is zero. $\square$

The past is gone. The future is conditionally independent of the past given the present.
What could be gained (or lost) in the past has no bearing on how you change your behaviour now (to be happier). Now matters. And the future matters.

^*unless quantum computers figure out Monte Carlo integration?

The (mathematical) why for not giving up.