How to Compute the Similarity Between Two Independent Sets of Points


Suppose we have two sets of points, points1 and points2, in a D-dimensional space. Each point in these sets consists of D numeric values.

For each point in points1, we want to find the closest point in points2, where "closest" is measured by Euclidean distance.
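As a concrete example of the distance measure, the Euclidean distance between two points can be computed with NumPy (a minimal sketch; the point values are made up for illustration):

```python
import numpy as np

# Two 3-dimensional points (values chosen purely for illustration).
a = np.array([0.0, 0.0, 0.0])
b = np.array([3.0, 4.0, 0.0])

# Euclidean distance: the square root of the sum of squared
# coordinate differences.
distance = np.linalg.norm(a - b)
print(distance)  # 5.0
```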

2. Solutions


2.1 Brute force

The simplest approach is to iterate over every point in points1 and compute its distance to every point in points2. Then, for each point in points1, we pick the point in points2 with the smallest distance as its closest point.

import numpy as np

def find_closest_points(points1, points2):
  """
  Finds the closest point in points2 to each point in points1.

  Args:
    points1: A Numpy array of shape (N, D), where N is the number of points and D is the number of dimensions.
    points2: A Numpy array of shape (M, D), where M is the number of points and D is the number of dimensions.

  Returns:
    A Numpy array of shape (N, D), where each row contains the closest point in points2 to the corresponding point in points1.
  """

  closest_points = np.zeros((points1.shape[0], points2.shape[1]))

  for i, point1 in enumerate(points1):
    min_distance = float('inf')
    for j, point2 in enumerate(points2):
      distance = np.linalg.norm(point1 - point2)
      if distance < min_distance:
        min_distance = distance
        closest_points[i] = point2

  return closest_points

This approach has a time complexity of O(N*M), where N and M are the numbers of points in points1 and points2, respectively. When N and M are both large, it becomes very slow.
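The same O(N*M) computation can also be written without explicit Python loops by letting NumPy broadcast the pairwise differences. This does not change the asymptotic complexity, but it typically runs much faster in practice because the loops happen in compiled code. A sketch (not part of the original solution; the function name is our own):

```python
import numpy as np

def find_closest_points_vectorized(points1, points2):
    # Pairwise differences via broadcasting: shape (N, M, D).
    diffs = points1[:, np.newaxis, :] - points2[np.newaxis, :, :]
    # Squared Euclidean distances: shape (N, M). Minimizing the
    # squared distance gives the same nearest neighbor.
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Index of the nearest point in points2 for each point in points1.
    nearest = np.argmin(sq_dists, axis=1)
    return points2[nearest]
```

Note that the intermediate (N, M, D) array can use a lot of memory for large inputs, which is another reason to prefer a kd-tree.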

2.2 kd-tree

To improve efficiency, we can use a kd-tree to accelerate the nearest-neighbor search. A kd-tree is a binary tree over the data points: each internal node splits the space with an axis-aligned hyperplane, dividing the points into two subspaces, and the points themselves are stored in the leaves.

To search for a nearest neighbor, we start at the root of the kd-tree. At each node we compare the query point against that node's splitting hyperplane: if the query point lies on the left side, we descend into the left subtree; if it lies on the right side, we descend into the right subtree.

We repeat this process until we reach a leaf. The point in that leaf is a good candidate for the nearest neighbor, but not necessarily the true one: an exact search must backtrack and also examine any subtree whose splitting hyperplane lies closer to the query point than the best candidate found so far.

With a kd-tree, a single nearest-neighbor query takes O(log M) time on average, where M is the size of points2, so answering all N queries costs roughly O(N log M) instead of O(N*M).

import numpy as np
from scipy.spatial import KDTree

def find_closest_points_with_kd_tree(points1, points2):
  """
  Finds the closest point in points2 to each point in points1 using a kd tree.

  Args:
    points1: A Numpy array of shape (N, D), where N is the number of points and D is the number of dimensions.
    points2: A Numpy array of shape (M, D), where M is the number of points and D is the number of dimensions.

  Returns:
    A Numpy array of shape (N, D), where each row contains the closest point in points2 to the corresponding point in points1.
  """

  # Build a kd tree for points2.
  kd_tree = KDTree(points2)

  # Find the closest point in points2 to each point in points1.
  # query() returns a (distances, indices) pair, and it accepts a
  # whole array of query points, so no explicit Python loop is needed.
  distances, indices = kd_tree.query(points1)
  return points2[indices]
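As a small end-to-end check of the kd-tree approach, here is SciPy's KDTree run directly on some made-up data (the point values are arbitrary and only for illustration):

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical sample data: 3 query points and 4 reference points in 2D.
points1 = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])
points2 = np.array([[1.0, 0.0], [4.0, 6.0], [8.0, 0.0], [0.0, 9.0]])

tree = KDTree(points2)
# query() accepts a whole array and returns (distances, indices).
distances, indices = tree.query(points1)
closest = points2[indices]
print(closest)  # [[1. 0.], [4. 6.], [8. 0.]]
```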