import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.arange(-2, 2.01, 0.01),
    columns=["x"]
).set_index("x")

# first equation, y = x + 4
df["y1"] = df.index + 4

# second equation, x^2 + 2y^2 = 1
# this is done over two series to handle the + and the - roots
df["y2.1"] = ((1 - df.index**2) / 2) ** 0.5
df["y2.2"] = -((1 - df.index**2) / 2) ** 0.5

df.plot(figsize=(4, 7)) ; None
The orange and green lines together form the \(x^2 + 2y^2 = 1\) curve. The gap in the ellipse is caused by plotting the curves as a collection of discrete points.
In my poor plot of these equations we can see that there is no point that satisfies both equations. Instead we should find the point that minimizes the distance between the two curves. I can restrict the second equation to the positive branch, and then it should be possible to write an equation for the distance between the lines.
The vertical distance between the curves is the difference between these two expressions:
\[
\begin{aligned}
y &= x + 4 \\
y &= \sqrt{\frac{1 - x^2}{2}}
\end{aligned}
\]
We can express this as:
\[
\text{distance} = x + 4 - \sqrt{\frac{1 - x^2}{2}}
\]
And plotting this:
Code
import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.arange(-1, 1.01, 0.01),
    columns=["x"]
).set_index("x")
df["distance"] = df.index + 4 - ((1 - df.index**2) / 2) ** 0.5
df.plot() ; None
We can see that the minimum distance occurs somewhere between \(x = -1\) and \(x = -0.75\). It should be possible to find the exact point by calculating the derivative of this equation; Wolfram Alpha tells me that the minimum is at \(x = -\sqrt{2/3} \approx -0.816\).
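As a check, differentiating the distance expression and setting it to zero recovers the same point:
\[
\begin{aligned}
\frac{d}{dx} \left( x + 4 - \sqrt{\frac{1 - x^2}{2}} \right) &= 1 + \frac{x}{\sqrt{2(1 - x^2)}} = 0 \\
\implies 3x^2 = 2 \implies x &= -\sqrt{\tfrac{2}{3}} \approx -0.816
\end{aligned}
\]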
The above problem was then extended to finding the probability distribution that best fits a weighted die with an average roll of 4.5. The best probabilistic model is the one that maximizes the entropy: it imposes the fewest additional assumptions on the system, since a system with lower entropy is more predictable.
We can define the entropy of the model \(p\) with:
\[
H(\text{p}) = -\sum_i{p_i \log p_i}
\]
Our model is that a 6-sided die has an average roll of 4.5. That can be expressed as two equations:
\[
\begin{aligned}
\sum_{i=1}^{6} p_i &= 1 \\
\sum_{i=1}^{6} i \, p_i &= 4.5
\end{aligned}
\]
The first establishes that the total probability across the different outcomes sums to one, which means we aren't dealing with a die with a different shape. The second expresses that the average value rolled is 4.5.
We can then calculate the differential of all of these based on two simple facts:
\[
\frac{d}{dx} x = 1
\]
and
\[
\frac{d}{dx} x \log x = \log x + 1
\]
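These facts are enough to differentiate a function that combines the entropy with the two constraints through multipliers \(\lambda\) and \(\mu\) (the same \(\lambda\) and \(\mu\) that the code below optimizes; the sign convention here is chosen to match it):
\[
\begin{aligned}
L(p, \lambda, \mu) &= -\sum_i p_i \log p_i - \lambda \left( \sum_i p_i - 1 \right) - \mu \left( \sum_i i \, p_i - 4.5 \right) \\
\frac{\partial L}{\partial p_i} &= -\log p_i - 1 - \lambda - i \mu
\end{aligned}
\]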
Setting the derivative with respect to each \(p_i\) to zero gives \(p_i = e^{-(\lambda + i\mu + 1)}\), and solving for the \(\lambda\) and \(\mu\) that satisfy the two constraints gives the optimal distribution:
\[
p = (0.054, 0.079, 0.114, 0.165, 0.24, 0.347)
\]
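As a quick sanity check that these (rounded) values really do satisfy the constraints:

import numpy as np

p = np.array([0.054, 0.079, 0.114, 0.165, 0.24, 0.347])
print(p.sum())                      # 0.999 - one, up to rounding
print((p * np.arange(1, 7)).sum())  # 4.496 - approximately 4.5
print(-(p * np.log(p)).sum())       # the entropy of this distribution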
The problem here isn’t the problem specification, it’s the method of solving it…
The hard problem is to solve this using pure PyTorch. To do this we have to think about how PyTorch works.
Optimization Approach
The first way that I would solve this problem is to define the two variables, \(\lambda\) and \(\mu\), as PyTorch parameters and then define a loss function which measures how closely they adhere to the constraints. This would involve defining the values of \(p_1, \ldots, p_6\) in terms of those two variables and then finding the distance from the correct value for the constraint functions \(f\) and \(g\).
Using the absolute value in a loss function is generally discouraged because its gradient is discontinuous at zero (the function itself is not differentiable there). Instead I will square the value, which keeps it positive while remaining smooth. That leads to a final pair of loss functions.
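In symbols, these are (they match the loss_f and loss_g methods in the code below):
\[
\begin{aligned}
f(p) &= \left( \sum_{i=1}^{6} p_i - 1 \right)^2 \\
g(p) &= \left( \sum_{i=1}^{6} i \, p_i - 4.5 \right)^2
\end{aligned}
\]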
#collapse
from typing import Tuple

import pandas as pd
import torch
from torch import nn

# The six probabilities corresponding to p1,...,p6
Answer = Tuple[float, float, float, float, float, float]

class LambdaMuOptimizer(nn.Module):
    def __init__(self, average_roll: float, lambda_: float, mu: float) -> None:
        super().__init__()
        self.average_roll = average_roll
        self.lambda_ = torch.nn.Parameter(torch.tensor(lambda_))
        self.mu = torch.nn.Parameter(torch.tensor(mu))

    def optimize(self, iterations: int = 10_000) -> Answer:
        df = self.train(iterations)
        # Show how the loss has changed over the training
        df[["loss"]].plot(logy=True)
        # Find the best performing row and show it
        best_row = df[df.loss == df.loss.min()]
        print(best_row)
        # Then calculate and return the different probabilities
        return self.ps().tolist()

    def train(self, iterations: int, lr: float = 0.01) -> pd.DataFrame:
        optimizer = torch.optim.AdamW(self.parameters(), lr=lr)
        steps = []
        for _ in range(iterations):
            optimizer.zero_grad()
            loss = self.loss()
            steps.append((
                loss.item(),
                self.lambda_.item(),
                self.mu.item(),
            ))
            loss.backward()
            optimizer.step()
        return pd.DataFrame(steps, columns=["loss", "𝜆", "𝜇"])

    def loss(self) -> torch.Tensor:
        ps = self.ps()
        return self.loss_f(ps) + self.loss_g(ps, self.average_roll)

    @staticmethod
    def loss_f(ps: torch.Tensor) -> torch.Tensor:
        # squared distance from "the probabilities sum to one"
        return (ps.sum() - 1) ** 2

    @staticmethod
    def loss_g(ps: torch.Tensor, target_average_roll: float) -> torch.Tensor:
        # squared distance from "the average roll is the target"
        average_roll = (ps * torch.tensor([1, 2, 3, 4, 5, 6])).sum()
        return (average_roll - target_average_roll) ** 2

    def ps(self) -> torch.Tensor:
        return torch.cat([
            self.p1()[None],
            self.p2()[None],
            self.p3()[None],
            self.p4()[None],
            self.p5()[None],
            self.p6()[None],
        ], dim=0)

    def pn(self, n: int) -> torch.Tensor:
        # These calculate the value of p1,...,p6 given 𝜆 and 𝜇
        exponent = -(self.lambda_ + (n * self.mu) + 1)
        return torch.exp(exponent)

    def p1(self) -> torch.Tensor:
        return self.pn(1)

    def p2(self) -> torch.Tensor:
        return self.pn(2)

    def p3(self) -> torch.Tensor:
        return self.pn(3)

    def p4(self) -> torch.Tensor:
        return self.pn(4)

    def p5(self) -> torch.Tensor:
        return self.pn(5)

    def p6(self) -> torch.Tensor:
        return self.pn(6)
Code
model = LambdaMuOptimizer(
    average_roll=4.5,
    lambda_=0.15,
    mu=0.15,
)
probabilities = model.optimize(iterations=10_000)

total_probability = sum(probabilities)
average_roll = sum(
    value * probability
    for value, probability in enumerate(probabilities, start=1)
)
print(f"Probabilities sum to: {total_probability:0.5f}")
print(f"Average roll is: {average_roll:0.5f}")
probabilities
loss 𝜆 𝜇
9676 0.000071 2.200761 -0.35447
Probabilities sum to: 1.00832
Average roll is: 4.49824
With 10,000 iterations and a sophisticated optimizer we can arrive at values which are approximately correct. The correct values are:
\[
p=(0.054,0.079,0.114,0.165,0.24,0.347)
\]
and this has produced
\[
p=(0.058,0.083,0.118,0.168,0.24,0.342)
\]
which is close but is not completely correct. The loss curve also shows that the progress towards the correct solution became very noisy towards the end.
Manifold Approach
Optimising over 𝜆 and 𝜇 is difficult because the constraints \(\sum_i p_i = 1\) and \(\sum_i i p_i = 4.5\) get broken as the values change. Defining those constraints as loss functions means that the solution can move close to them but is never likely to satisfy them exactly, because the gradient of the squared penalty shrinks as the loss approaches zero, so the updates stall just short of the constraints.
Instead we could try to parameterize the “model” in terms of the p values and then calculate the loss according to how well the points lie on the line defined by 𝜆 and 𝜇. The two equations are linear with respect to the inputs so it should be possible to map any set of values to the nearest point which satisfies both equations.
To practice this let's imagine that we are dealing with a 4 sided die instead (which will now average 2.5). This way there will be two degrees of freedom, as the two constraints remove two of the four. If we randomly set \(p_1\) and \(p_2\) then setting \(p_4 = p_1\) and \(p_3 = p_2\) keeps the average at 2.5 by symmetry, and normalizing the values then satisfies \(\sum_i p_i = 1\).
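A quick numerical check of that claim (the specific starting values here are my own arbitrary choice):

import numpy as np

# pick p1 and p2 freely, then mirror them so that p4 = p1 and p3 = p2
p1, p2 = 0.3, 0.1
p = np.array([p1, p2, p2, p1])
p = p / p.sum()  # normalize so the probabilities sum to one

print(p.sum())                      # 1.0
print((p * np.arange(1, 5)).sum())  # 2.5, for any positive p1 and p2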
That is a well-defined way to satisfy the two constraints, but is it the closest point? We can define closeness as minimizing the difference between the original values \(o_1, \ldots, o_4\) and the plane values \(p_1, \ldots, p_4\), defined as \(\text{distance} = \sum_i \left| o_i - p_i \right|\). If we square the differences instead then we have a differentiable expression which should allow us to derive the minimum-distance point.
Would this end up being a partial derivative over each individual value? It seems that way. Furthermore it is difficult to work with this as I have no proposed point to measure against.
If we were only dealing with one constraint then it would be easy to calculate the closest point as we could project the current point onto that surface. The probability mass constraint merely requires normalizing the vector.
Thinking about this further we can say something about the plane with respect to the original point. The best point on the plane will be connected to the original point by a vector normal to the plane. I know that this plane is flat because it is dictated by two linear equations, and if there is a non-orthogonal vector between a plane value and the original value then there is a closer plane value (think triangles).
Given this constraint does the differential equation approach make more sense? Or is there a way to define this normal vector? I think I need to be able to more easily define the plane itself.
Thinking about this further, what I have is a 4-d point that I wish to map to a flat 2-d plane. A matrix would be capable of doing this. If I can define three non-collinear points on the plane it should be possible to construct the matrix of the plane.
It’s easy to define points on the plane. The first can be the central point: \(p_i = \frac{1}{4}\). The second can be a die that can only roll 1 or 4: \(p_{1,4} = \frac{1}{2} ; p_{2,3} = 0\). The third can be a die that can only roll 2 or 3: \(p_{2,3} = \frac{1}{2} ; p_{1,4} = 0\).
Any 4x2 matrix will map points onto some plane, but would it be the correct plane?
Calculating the correct plane can be done instead by treating this as a simultaneous equation problem. We have the two equations \(\sum_{i=1}^{6} p_i = 1\) and \(\sum_{i=1}^{6} i \, p_i = 4.5\), which can be solved for \(p_5\) and \(p_6\) in terms of \(p_1, \ldots, p_4\).
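Spelling out the elimination for an average roll of 4.5 (the resulting expressions match the ps method in the code below):
\[
\begin{aligned}
p_5 + p_6 &= 1 - (p_1 + p_2 + p_3 + p_4) \\
5 p_5 + 6 p_6 &= 4.5 - (p_1 + 2 p_2 + 3 p_3 + 4 p_4) \\
\implies p_6 &= 4 p_1 + 3 p_2 + 2 p_3 + p_4 + (4.5 - 5) \\
\implies p_5 &= -5 p_1 - 4 p_2 - 3 p_3 - 2 p_4 + (6 - 4.5)
\end{aligned}
\]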
This doesn't completely describe the space of valid points though: the expressions above are unbounded, so they permit negative probabilities. Even so, it is enough to create a new model. What would the loss function be though?
In a classification model we use cross entropy. It happens that we have an entropy equation here too. The entropy has previously been defined as:
\[
H(\text{p}) = -\sum_i{p_i \log p_i}
\]
Given that this uses logarithms I must ensure that the values never become negative, or exceed 1. The model will have to cap the values every iteration.
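One way to do that capping (a sketch only; the optimizer below instead just relies on starting in a valid position, as its constructor comment notes) would be to clamp the free parameters after each optimizer step:

import torch

# hypothetical free parameters (p1..p4 in the model below)
ps_1234 = torch.nn.Parameter(torch.tensor([0.01, 0.01, 0.24, 0.24]))

# ... after optimizer.step() ...
with torch.no_grad():
    # keep every value strictly inside (0, 1) so torch.log stays finite
    ps_1234.clamp_(min=1e-6, max=1.0 - 1e-6)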
Code
#collapse
from typing import List, Tuple

import pandas as pd
import torch
from torch import nn

# The six probabilities corresponding to p1,...,p6
Answer = Tuple[float, float, float, float, float, float]

class FourDimensionalOptimizer(nn.Module):
    def __init__(
        self,
        average_roll: float,
        # MUST start in a valid position!
        # this assumes average roll of 4.5
        initial_values: List[float] = [0.01, 0.01, 1/4 - 0.01, 1/4 - 0.01],
    ) -> None:
        super().__init__()
        self.average_roll = average_roll
        self.ps_1234 = torch.nn.Parameter(torch.tensor(initial_values))

    def optimize(self, iterations: int = 10_000, lr: float = 0.01) -> Answer:
        df = self.train(iterations, lr=lr)
        # Show how the loss has changed over the training
        df[["loss"]].plot(logy=True)
        # Find the best performing row and show it
        best_row = df[df.loss == df.loss.min()]
        print(best_row)
        # Then calculate and return the different probabilities
        return self.ps().tolist()

    def train(self, iterations: int, lr: float = 0.01) -> pd.DataFrame:
        optimizer = torch.optim.AdamW(self.parameters(), lr=lr)
        steps = []
        for _ in range(iterations):
            optimizer.zero_grad()
            loss = self.loss()
            steps.append((
                loss.item(),
                *self.ps().tolist(),
            ))
            loss.backward()
            optimizer.step()
        return pd.DataFrame(
            steps, columns=["loss", "p1", "p2", "p3", "p4", "p5", "p6"]
        )

    def loss(self) -> torch.Tensor:
        ps = self.ps()
        # (ps * log ps).sum() is the negative entropy, so minimizing its
        # exponential maximizes the entropy
        return torch.exp((ps * torch.log(ps)).sum())

    def ps(self) -> torch.Tensor:
        # −5𝑝1 − 4𝑝2 − 3𝑝3 − 2𝑝4 + (6 - average roll)
        ps_5 = (
            self.ps_1234 * torch.tensor([-5, -4, -3, -2])
        ).sum() + (6 - self.average_roll)
        # 4𝑝1 + 3𝑝2 + 2𝑝3 + 𝑝4 + (average roll - 5)
        ps_6 = (
            self.ps_1234 * torch.tensor([4, 3, 2, 1])
        ).sum() + (self.average_roll - 5)
        return torch.cat([
            self.ps_1234,
            ps_5[None],
            ps_6[None],
        ], dim=0)
Code
model = FourDimensionalOptimizer(average_roll=4.5)
probabilities = model.optimize(iterations=100)

total_probability = sum(probabilities)
average_roll = sum(
    value * probability
    for value, probability in enumerate(probabilities, start=1)
)
print(f"Probabilities sum to: {total_probability:0.5f}")
print(f"Average roll is: {average_roll:0.5f}")
probabilities
loss p1 p2 p3 p4 p5 p6
91 0.199173 0.054391 0.07858 0.114577 0.165202 0.239592 0.347659
Probabilities sum to: 1.00000
Average roll is: 4.50000
model = FourDimensionalOptimizer(
    average_roll=3.5,
    initial_values=[1/6, 1/6, 1/6 + 0.1, 1/6 - 0.1],
)
probabilities = model.optimize(iterations=100)

total_probability = sum(probabilities)
average_roll = sum(
    value * probability
    for value, probability in enumerate(probabilities, start=1)
)
print(f"Probabilities sum to: {total_probability:0.5f}")
print(f"Average roll is: {average_roll:0.5f}")
probabilities
loss p1 p2 p3 p4 p5 p6
99 0.166667 0.166787 0.166698 0.166547 0.16665 0.166334 0.166985
Probabilities sum to: 1.00000
Average roll is: 3.50000
This approach converges much faster (100 iterations instead of 10,000) and we can see that it works for different roll averages too. It also produces values that are far more accurate.
          1         2         3         4         5         6
model     0.054391  0.07858   0.114577  0.165202  0.239592  0.347659
correct   0.054     0.079     0.114     0.165     0.24      0.347
Looking at these, the model is only out on \(p_3\) and \(p_6\) by 0.001, and that is due to rounding. This looks really solid.
TED Talk
This puzzle was set by a colleague and naturally he has his own solution. The solution he used allowed the model to optimize freely through the space and then mapped the points back onto the correct surface. I mentioned how I had found it hard to reason through the construction of the smaller space and he wrote this to me:
This is one of those rare cases where having a lot of mathematical training actually accomplishes something. Your approach of encoding the constraints into the objective function doesn't have any major drawbacks that I can think of, except that it might be challenging to generalize it to nonlinear constraints - e.g. if the constraint is \(x^2 + y^2 = 1\) then you have to handle both the cases \(y = \sqrt{1 - x^2}\) and \(y = -\sqrt{1 - x^2}\).
Here's how I thought through the projection approach (and again, there's an element of training here - constructing projection maps is like what you do for breakfast every morning in functional analysis). First, we can express the constraint as \(Ax = b\). It is always easier to start with equations of the form \(Ax = 0\) and then translate everything back, so we'll start with that. Let's pretend we're working in 3 dimensions for a moment, and your equation looks like \(ax + by + cz = 0\). A good first step to constructing any matrix is to start with a nice coordinate system: we want coordinates \((u, v, w)\) with the properties that:
\(u\) and \(v\) lie in the plane
\(|u| = |v| = |w|\) (unit vectors)
dot products are all zero: \(\langle u,v \rangle = 0\), \(\langle u,w \rangle = 0\), \(\langle v,w \rangle = 0\)
This is called an “orthonormal basis” - the trick is starting with an orthonormal basis for the plane - in this case \(u\) and \(v\) - and extending it to the whole space. If you have such a thing, then the projection map is easy to construct: you just take dot products against \(u\) and \(v\): \(Px := \langle x,u \rangle u + \langle x,v \rangle v\). This lies in the plane because it is a linear combination of \(u\) and \(v\), and \(x - Px = \langle x,w \rangle w\) which is orthogonal to the plane. So the question is: “how do we come up with an orthonormal basis for a plane and extend it to the whole space?”. The answer is called the “Gram-Schmidt algorithm”, and it works like this:
Start with ANY basis \(b_1\), \(b_2\), \(b_3\) such that \(b_1\) and \(b_2\) are in the plane (just by solving the equation \(Ax = 0\)). Then iteratively build it up into an orthonormal basis by projecting:
\[
\begin{aligned}
u &= \frac{b_1}{|b_1|} \\
v &= b_2 - \operatorname{proj}_{u}(b_2) = b_2 - \langle b_2, u \rangle u \\
v &= \frac{v}{|v|} \\
w &= b_3 - \operatorname{proj}_{u, v}(b_3) = \dots \\
w &= \frac{w}{|w|}
\end{aligned}
\]
Now obviously I didn't compute all that stuff in my solution, and the reason is that the other part of my training taught me that it is more numerically efficient to encode all of this algebra into matrix factorizations. So I went back and googled "which matrix factorizations use the gram-schmidt algorithm" and then pieced together the rest. Thank you for attending my TED talk.
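For completeness, here is a minimal sketch of that projection idea in PyTorch. It is not my colleague's actual code: rather than building the orthonormal basis explicitly with Gram-Schmidt or a QR factorization, it computes the same orthogonal projection onto the constraint set \(Ap = b\) directly from the normal equations, and the variable names are my own.

import torch

# The two linear constraints written as A p = b
A = torch.tensor([
    [1., 1., 1., 1., 1., 1.],  # the probabilities sum to one
    [1., 2., 3., 4., 5., 6.],  # the expected roll is 4.5
])
b = torch.tensor([1.0, 4.5])

def project(p: torch.Tensor) -> torch.Tensor:
    # Orthogonal projection onto {x : A x = b}:
    #   p - A^T (A A^T)^{-1} (A p - b)
    correction = A.T @ torch.linalg.solve(A @ A.T, A @ p - b)
    return p - correction

p = torch.full((6,), 1 / 6)   # start from a fair die
projected = project(p)
print(A @ projected)          # tensor([1.0000, 4.5000])

Note that this only enforces the two linear constraints; as discussed earlier, nothing here stops an individual probability from going negative.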