Closest Points 2D

1 Learning Goals

Describe Closest Points Problem
Practice analyzing ethical implications of algorithm implementations
Learn how to set a runtime goal
Practice reasoning (and writing proofs for) the combine step of the D&C Closest Points algorithm
Analyze the runtime of the D&C Closest Points and use preprocessing to speed-up the algorithm

2 Introduction to the Closest Points Problem

2.1 Description

Input:

A list of 2-D points, P, and those points plotted on a 2-D axes.

Assumption: No two points have the same $x$ coordinate or the same $y$ coordinate. For example, $(1,2)$ and $(1,4)$ would not be allowed.

Output: The distance between the two closest points in $P$, where distance is calculated using the Euclidean distance: \[ d(p_1,p_2)=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\]

2.2 Ethical Implications

The algorithm itself is essentially a mathematical object. But once it gets implemented for a particular task, it has ethical implications.

Applications of closest points (usually variants of this algorithm):

Robotics (find the closest points between robot and environment)
Designing roofs
Air traffic control (identify planes that are closest together and so most likely to have a collision)

We will be using a tool called the “Ethical Matrix” developed for algorithmic applications by O’Neil and Gunn.

To set up the matrix, you should create a table with the following columns: Stakeholders, Well-Being, Autonomy, and Justice:

Stakeholders Each row should list a stakeholder: someone or something that is affected by the implementation of the algorithm in the specified scenario. It is important to try to think of as many stakeholders as possible - otherwise you will miss possible impacts of the algorithm.
Well-Being In the Well-Being column, you should note what harms or benefits the stakeholder will likely derive from the algorithm.
Autonomy In the Autonomy column, you should note whether the stakeholder has a choice in whether to use the algorithm (or have the algorithm be used on them). Also, you should note if the stakeholder is informed enough about the workings of the algorithm to responsibly use. (This last is more important for machine learning algorithms, which we will not be looking at much in this class, but can sometimes be important for simpler algorithms.)
Justice In this column, you should think about whether there is unfair treatment of different groups, either within a stakeholder group, or across stakeholder groups.

For example, if we consider using the closest points algorithm to improve air traffic control, a row of the ethical matrix might look like:

Example of an Ethical Matrix
Stakeholders	Well-Being	Autonomy	Justice
Airplane passengers	Safety, Cheaper, Environment	Alternative transit, virtual options	Unequal access (rich/poor, geographic location)

Group Exercise

Ice breaker: Where do you want to travel, and why?

Create at least two more rows of the ethical matrix for additional stakeholders. Think broadly when you think about stakeholders.

After filling in the ethical matrix, you can highlight blocks of the matrix as green (positive), white (not strongly positive or negative), yellow (some problem), and red (very problematic). This coloring can help highlight the potential advantages and disadvantages of using the algorithm.

Creating an ethical matrix does not tell you what to do. As with most ethical decisions, there is not always a clear answer. However, this tool helps you think about the consequences of implementing an algorithm, before actually implementing it. In particular you might consider:

How can you mitigate negative consequences? Can you modify the algorithm to do so? Can you provide additional aid to those who might be negatively affected?
If the effects are so negative, might you try to convince your bosses not to use this algorithm? Would you leave your company rather than work to implement an algorithm?

3 Designing an Algorithm for the Closest Points Problem

3.1 Targeting performance

Before designing a sophisticated algorithm, it is good to try to figure out what kind of performance (time complexity) we might like our algorithm to achieve. We can do this by analyzing simpler brute force approaches (worst-case) and analyzing algorithms for easier problems (best-case). We would like our sophisticated algorithm to be better than brute force, but we don’t expect it to do better than the best algorithm for an easier problem.

A Brute Force algorithm is one that does the simplest thing possible. In this case, that means checking every pair. Here is pseudocode that does this:

$\texttt{BruteForceCloPts}(P)$

$\min\gets \infty$
For $i\gets 1$ to $n-1$:
$\quad$ For $i\gets i+1$ to $n$:
$\qquad$ If $dist(p_i,p_j)<\min$, then $\min\gets dist(p_i,p_j)$
return $\min$

An easier problem than 2D Closest Points is 1D Closest Points. We would not expect a 2D algorithm to be faster than a 1D Algorithm.

Group Exercise

Ice breaker: Fun weekend experience

Use pseudocode to describe a $O(nlog n)$ algorithm for 1D Closest Points, and explain why it has that runtime and why it is correct.

Note - you may use $\texttt{Sort}$ as a line of pseudocode in this class, and you should assume it takes $O(n\log n)$ time to do this sorting. (See CS 201.)

3.2 Sketching a Divide and Conquer Algorithm:

Here is a first pass at an algorithm:

$\texttt{CloPts}(P)$

Base case: we’ll do this later!
Divide: Let’s sort the points by $x$-coordinates, and then divide into a left and right half. For example, if we have 6 points sorted by $x$-coordinate: \[P=((1,5), (2,2),(4,4), (10,20), (12,3), (19,6)) \] then we’ll divide into a left half and right half \[L=((1,5), (2,2),(4,4))\] \[R=((10,20), (12,3), (19,6)).\] We’ll call $7$ the midline, because $7$ is midway between $4$ and $10$, the $x$-coordinates of the two points on either side of the divide. Note that $7$ is not literally in the middle of the space of points. The $x$-coordinates of the points range from $1$ to $19$, so if we were to just divide that space, we might get a midline of $10$, which would not correctly divide the points.
Conquer:
- $\delta_1=\texttt{CloPts}(L)$
- $\delta_2=\texttt{CloPts}(R)$
- $\delta=min\{\delta_1,\delta_2\}$

From this point on, it is important that you don’t think about the details of what happens in those recursive calls. Like in an inductive proof, we are just going to assume from here on that they will return the correct output. We will not think about how they do this. Then, as long as we get the combine and base cases to be correct with this assumption, these recursive calls will indeed output the correct thing. This is the power of inductive thinking.

Combine: This is too complex for this sketch…we need a new section

3.3 The Combine Step

Assuming the recursive calls work correctly, we will have found the distance between the closest pair of points both on the left of the midline, and the distance between the closest pair of points both on the right of the midline, so the only thing we need to worry about is if the closest pair of points in the whole set is a pair where one point is in $L$ and one point is in $R$, crossing the midline.

However, it seems like points that are very far away from the midline can be ignored. As shown in the picture, if we have already found a pair in $R$ with distance $1$, then a point in $R$ that is $10$ away from the midline can’t possibly be part of the closest pair with a point in $L$.

Illustration of closest points in left and right halfs, pair that crossed midline, and point that can be ignored far from midline.

ABCD Question

Recalling that $d(p_1,p_2)=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$, which points do we need to worry about in the combine step? Those within

$\delta/2$ of midline
$\sqrt{\delta}$ of midline
$\delta$ of midline
$2\delta$ of midline

To understand which is correct, we can use the following Lemma:

Lemma 1 If the $x$-coordinate of a point $p_2$ in $R$ is more than $r$ greater than the $x$-coordinate of the midline, $p_2$’s distance to any point $p_1$ in $L$ is more than $r$.

Proof. (We will give a proof sketch. A real proof can not just be equations. You need to have complete English sentences explaining the equations, and any equation should be part of a sentence, with English first, and then the equation.)

\[\begin{align} (x_1-x_2)^2&=(\textrm{ difference in x-coordinate from }x_1 \textrm{ to midline }&+ & \textrm{difference in x-coordinate from }x_2 \textrm{ to midline })^2\\ &> (0 & + & r)^2\\ &> r^2 && \\ ~\\ (y_1-y_2)^2&\geq 0&& \end{align}\]

\[\begin{align} (x_1-x_2)^2+(y_1-y_2)^2&> r^2\\ \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}&> r \end{align}\]

$\square$

Since we only care about points that have distance less than $\delta$, when we set $r$ to be $\delta$, Lemma 1 tells us that if a point’s $x$-coordinate is more than $\delta$ away from the midline, then its distance to any point on the other side of the midline will be more than $\delta$, so we can ignore those points in our combine step.

If you squint, the region close to the midline almost looks like points on a line, as in the picture

Illustration of long narrow region around midline, highlighted in yellow.

Why don’t we try using an approach similar to a line:

Combine Step
$Y_\delta\leftarrow y$-sorted list of points in $P$ within $\delta$ of midline
$p\in Y_\delta$:
$\quad$ Check distance from $p$ to next ???? points
$\quad$ Save if smallest distance found

While in our example of points on a line, we only needed to look at the next sorted point, in this case, looking at just the next point is not sufficient. To see this, consider the following picture:

Illustration showing that if you sort the points near the midline by y-coordinate, the closest point is not the next sorted point.

In this picture, the closest point to the top point is actually not the next point in $Y_\delta$, but the one after that.

Lemma 2 Given a point in $Y_\delta$, one only needs to look at the next ??? points in $Y_\delta$ in order to find the closest point

Group Exercise

What should ??? be replaced by in the above lemma?

Proof. (See post-class pdf at top of this page for picture and missing info in below proof Lemma 2. Note that this Lemma is likely not optimal. By that, I mean that you could probably get away with checking fewer points. However, it is easy to prove!)

Imagine dividing the region within $\delta$ of midline into $\delta/2\times\delta/2$ squares starting at the current point, which we call $q$. Each of these squares can contain at most one point. To see this, for contradiction, suppose there are 2 points in a square. Then the difference in their $x$-coordinates is at most $\delta/2$, and the difference in their $y$-coordinates is at most $\delta/2$, so their distance is at most \[\sqrt{\left(\frac{\delta}{2}\right)^2+\left(\frac{\delta}{2}\right)^2}=\frac{\delta}{\sqrt{2}}.\] This is a contradiction, since every square contains only points in $L$ or only points in $R$, and we know that any two points that are both in $L$ or both in $R$ must have a distance of at least $\delta$, since that is the smallest distance found in our recursive calls which we get to assume are correct.

Now if we consider rows of boxes below $q$, since each row has height $\delta/2$, any points that are in the $3^\textrm{rd}$ row of boxes will have a difference in $y$-coordinate from $q$ that is at least $\delta$. This means it will have a distance from $q$ that is at least $\delta$, and so we need not check any points in rows 3 or more below $q$.

Thus there are only xxxx relevant boxes, and $q$ is in one box, so there are at most ??? boxes that might contain the next closest point to $q$, and each of these boxes contains at most one point, and all of these points would appear immediately after $q$ in $Y_\delta$ since $Y_\delta$ is sorted by $y$-coordinate.

$\square$

3.4 Base Case

What size set of pts should trigger the base case?

$0$
$\leq 1$
$\leq 2$
$\leq 3$

3.5 Algorithm

Finally, we can write down the full pseudocode for the algorithm. (See post-class pdf at top of page for missing values.)

CloPts(P)
// Base Case
If $|P|\leq ...$, then do brute force

//Divide
Sort $P$ by $x$-coordinate, and then divide into arrays $L$ and $R$
Pick midline to be midway between $L$ and $R$

// Conquer
$\delta=min\{\texttt{CloPts}(L),\texttt{CloPts}(R)\}$

// Combine
$Y_$ pts within *** of midline, sorted by $y$-coordinate
$\texttt{for}$ $p_i\in Y_\delta$:
$\quad$ $\texttt{for}$ $j\leftarrow i+1$ to $i+???$:
$\qquad$ if $d(p_i,p_j)\leq \delta$, then $\delta\leftarrow d(p_i,p_j)$
return $\delta$

Try to explain in your own words why this is correct? What questions do you have about why this algorithm works?

4 Runtime

In groups, create a recurrence relation for the runtime of the Closest Points algorithm, using the pseudocode above.

It turns out the approach described above is not optimal. We are repeatedly sorting by $x$ and $y$ at each recursive step, and this adds extra factors to the runtime. Instead, we’ll presort the points before running the recursive algorithm:

** PreSort(P)
$X\leftarrow P$ sorted by $x$
$Y\leftarrow P$ sorted by $y$
return $X,Y$

CloPts(X, Y)
// Note that $X$ and $Y$ should contain the same set of points, just one array is sorted by $x$ and one by $y$.

// Base Case
1. If $|P|\leq ...$, then do brute force

//Divide
2. Divide $X, Y$ into $X_L,X_R,Y_L, Y_R$, and pick midline

// Conquer
3. $\delta=min\{\texttt{CloPts}(X_L,Y_L),\texttt{CloPts}(X_R,Y_R)\}$

// Combine
4. Create $Y_\delta$ from $Y$ and midline
5. $\texttt{for}$ $p_i\in Y_\delta$:
$\quad$ $\texttt{for}$ $j\leftarrow i+1$ to $i+???$:
$\qquad$ if $d(p_i,p_j)\leq \delta$, then $\delta\leftarrow d(p_i,p_j)$
6. return $\delta$

We will go through and analyze the runtime of each step.

$O(1)$
This step can be done in $O(n)$ time. $X$ can be split by doing a loop through the elements and copying the first half in-order into $X_L$ and copying the second half in-order into $X_R$. $Y$ can be split by doing a loop through the elements of $Y$ and checking the $x$-coordinate of each point, and if the $x$-coordinate of the point is less than the midline value, copy it to the next empty spot in $Y_L$, and if the $x$-coordinate of the point is more than the midline value, copy it to the next empty spot in $Y_R$. Each of these loops only goes through $X$ or $Y$ one time and does constant work within each loop, so the runtime is $O(n)$.
$2T(n/2)+O(1)$, for the two recursive calls, each on an input half the size of the original, plus constant time to take the minimum
We again loop through $Y$, and check the $x$-coordinate of each point, and if the $x$-coordinate of the point within $\delta$ of the midline, we copy it $Y_\delta$, and otherwise we do nothing with the point and just continue to the next for loop iteration.
Since $Y_\delta$ can have at most $n$ points in it, the outer loop will run at most $n$ times. The inner for loop only iterates a constant number of times, and then the if statement only takes a constant amount of time, so the entire process takes $O(n)$ time.
O(1)

Combining the contributions of each step and using big-O rules for adding runtimes, we get that the runtime is \[T(n)= \begin{cases} O(1), \textrm{ if }n\leq ...\\ 2T(n/2)+O(n), \textrm{ else} \end{cases}\] We can use the tree formula for this recurrence relation, and we find that the runtime is $O(n\log n)$!! This means we have achieved the same runtime as for 1-D, the best we could have possibly hoped for. Wow!

Where did we need the assumption that $x$, $y$ points are unique?