Faiss: Installation on Ubuntu 20.04 and Study Notes


Lip0041

License: CC 4.0 BY-SA

Tags: python

Article link: https://blog.csdn.net/Lip0041/article/details/108010899

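This article describes how to install Faiss on Ubuntu 20.04, including installing Anaconda and configuring the environment, and how to work around problems that may come up during installation. It then covers Faiss's basic concepts and uses, walks through several index types (IndexFlatL2, IndexIVFFlat, and IndexIVFPQ), and discusses the trade-offs they make between search speed and memory use. It also touches on Faiss basics such as MetricType, clustering, PCA, and quantization, and on how to choose a suitable index.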


Faiss GitHub link: https://github.com/facebookresearch/faiss/wiki

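Installation on Ubuntu 20.04 (virtual machine)

Installing Anaconda
1. Download the installer from the Tsinghua mirror: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/
2. Change to the download directory and run the installer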
```
bash Anaconda3-5.3.1-Linux-x86_64.sh
```
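3. Press Enter through the prompts and answer yes where asked
4. Add Anaconda to the PATH environment variable
```
sudo vim /etc/profile
# append this line at the end of the file:
export PATH={your Anaconda install path}/bin:$PATH
# then reload the file so the change takes effect:
source /etc/profile
```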

Switch conda to a mirror in China (reference link)
Add the following to ~/.condarc (for example with vim ~/.condarc):

```
channels:
  - defaults
show_channel_urls: true
channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
```

Install openblas and faiss-cpu with conda

```
conda install openblas
conda install faiss-cpu -c pytorch
```

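If import faiss runs in Python without errors, the installation succeeded.
If the installation is terminated with the message "Killed" (已杀死), don't panic. Update conda with the command below to make sure it is the latest version: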
```
conda update -n base conda
```
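Try a few more times; if it still fails, try rebooting.
The "Killed" message is probably caused by the virtual machine's limited resources (typically too little memory).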
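
A quick way to confirm the installation is to import the package and run one tiny search. The snippet below is a minimal sketch and not part of the original notes: the helper name `faiss_is_usable` is made up for illustration, and it only relies on `faiss.IndexFlatL2`, `add`, and `search`, the same calls used later in these notes. It returns False instead of raising when Faiss or NumPy cannot be imported.

```python
def faiss_is_usable():
    """Return True if faiss imports and a tiny exact search works (hypothetical helper)."""
    try:
        import numpy as np
        import faiss
    except ImportError:
        return False
    d = 4
    xb = np.eye(d, dtype='float32')     # four unit vectors act as the database
    index = faiss.IndexFlatL2(d)
    index.add(xb)
    D, I = index.search(xb[:1], 1)      # query with the first database vector
    return int(I[0, 0]) == 0            # the vector should find itself

print("faiss usable:", faiss_is_usable())
```

If this prints False right after a seemingly successful install, it is worth checking that the Python being run is the one from the conda environment.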

Learning Faiss

Tutorial

Getting to know Faiss

Faiss is a library for similarity search: a retrieval tool for dense multi-dimensional vectors.

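Similarity search works as follows: given a set of vectors x_i of dimension d, Faiss builds a data structure, the index, from them. Then, for a new vector x of the same dimension d, it computes argmin_i ||x - x_i||, that is, the position i of the stored vector with the smallest Euclidean distance to x. This is the search operation on the index.
Retrieval with Faiss takes three steps: training, building the index (adding the vectors), and querying.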
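
To make the argmin concrete, here is a small pure-Python sketch of the same computation on a toy database. It does not use Faiss at all; the function name `nearest` and the sample vectors are made up for illustration.

```python
import math

def nearest(x, vectors):
    """Return (i, distance) for the stored vector closest to x in L2 distance."""
    best_i, best_dist = -1, math.inf
    for i, xi in enumerate(vectors):
        dist = math.dist(x, xi)          # Euclidean distance ||x - x_i||
        if dist < best_dist:
            best_i, best_dist = i, dist
    return best_i, best_dist

vectors = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]   # toy "database", d = 2
i, dist = nearest((0.9, 1.2), vectors)
print(i, dist)                                   # i is the argmin over the database
```

An index does the same job, but organizes the vectors so that the answer can be found faster than with this linear scan.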

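Getting started: IndexFlatL2, exact but slow

Faiss works on collections of d-dimensional vectors stored as 32-bit floating-point matrices. Two matrices are needed:

xb: the database, containing all the vectors to be indexed and searched; its size is nb * d
xq: the query vectors whose nearest neighbors we want to find; its size is nq * d, where nq is the number of query vectors

Let us analyze the following IndexFlatL2.py: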

```
# Copyright (c) 2015-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD+Patents license found in the
# LICENSE file in the root directory of this source tree.

import numpy as np  # library for multi-dimensional arrays

d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
# np.random.random((x, y)) returns an x*y matrix of random floats in [0, 1)
xb = np.random.random((nb, d)).astype('float32')  # nb rows, d columns, float32
# np.arange(n) returns the integers in [0, n)
xb[:, 0] += np.arange(nb) / 1000.  # add row_index / 1000 to the first component of every vector
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
# brute-force L2 search: no training is needed to learn the distribution of the database vectors
import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build a flat L2 index for 64-dimensional vectors
print(index.is_trained)        # True here: a flat index needs no training
index.add(xb)                  # add the database vectors to the index
print(index.ntotal)            # number of vectors in the index
# k-nearest-neighbor search
k = 4                          # we want to see 4 nearest neighbors
# D, I = index.search(xb[:5], k) # sanity check
# print(I)  # positions of the vectors found
# print(D)  # distance matrix
D, I = index.search(xq, k)     # search for the query vectors
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries
```

Result:

[Figure: output of IndexFlatL2.py]

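Similarity search means, for example, finding the k images in a given set that are most similar to a target image; this is the k-nearest-neighbor (k-NN) problem. Reference: similarity search. In other words, for each query vector in xq the search in the demo above returns the 4 database vectors with the smallest Euclidean distance to it.
k-nearest-neighbor search finds, for each query vector in xq, the k most similar vectors in the database xb.
The return values of the search method:
1. search returns D and I: the distance matrix and the matrix of positions (indices) of the vectors found
2. A position is the row number in xb of a vector close to the query (the dimension d is fixed, so a row identifies a vector). I has size nq * k; for every entry to be valid, k must be at most nb
3. D has the same shape and, for IndexFlatL2, stores squared Euclidean (L2) distances; a single check by hand agreed with this. The smaller a value, the closer to 0, the nearer the corresponding vector in I is to the query vector.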
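
The shapes and meaning of D and I can be reproduced with a brute-force search written in plain Python. This is only an illustrative sketch of the exhaustive strategy described above, not Faiss's implementation; the function `search` and the toy data are assumptions made for the example, and the distances are deliberately left squared to match what IndexFlatL2 reports.

```python
def search(xb, xq, k):
    """Brute-force k-NN: return (D, I), each with len(xq) rows of k entries."""
    D, I = [], []
    for q in xq:
        # squared L2 distance from q to every database vector, paired with its row index
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, x)), i) for i, x in enumerate(xb)
        )
        D.append([dist for dist, _ in scored[:k]])
        I.append([i for _, i in scored[:k]])
    return D, I

xb = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]   # nb = 4, d = 2
xq = [[0.0, 0.0], [1.0, 1.0]]                           # nq = 2
D, I = search(xb, xq, k=2)
print(I)   # row indices into xb, one row per query
print(D)   # squared distances, ascending within each row
```

The first query is itself a database vector, so its first distance is 0 and its first index is its own row, which is the same sanity check the commented-out lines in the demo perform.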

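Faster search: IndexIVFFlat

Approach: partition the dataset. Voronoi cells are defined in the d-dimensional space, and every database vector falls into exactly one cell. At search time, only the database vectors y in the cell that the query x falls into, plus those in a few neighboring cells, are compared with the query.

A Voronoi diagram (also known as Thiessen polygons) divides space into contiguous cells whose boundaries are the perpendicular bisectors of the segments joining neighboring points; each cell contains everything that is closer to its own point than to any other.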
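
Deciding which Voronoi cell a point belongs to is simply a nearest-centroid lookup. A minimal sketch, with made-up 2D centroids and a hypothetical helper name `voronoi_cell`:

```python
import math

def voronoi_cell(point, centroids):
    """Index of the centroid nearest to point, i.e. the Voronoi cell it falls in."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]          # one centroid per cell
points = [(1.0, 2.0), (8.0, 1.0), (2.0, 7.0), (4.0, 4.9)]
cells = [voronoi_cell(p, centroids) for p in points]
print(cells)   # one cell number per point
```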

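This is what the IndexIVFFlat index does. It requires a training stage, which can be run on the database vectors themselves. Besides training, IndexIVFFlat needs another index, the quantizer, which assigns vectors to Voronoi cells. Each cell is defined by a centroid, and finding the cell a vector belongs to amounts to finding the vector's nearest neighbor among the centroids. An IndexFlatL2 does this job.

IndexIVFFlat has two parameters: nlist, the number of cells the space is divided into, and nprobe, the number of cells visited when running a search.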
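
The roles of nlist and nprobe can be illustrated with a toy inverted-file index in plain Python. This is a sketch under loud assumptions, not how Faiss is implemented: the centroids are picked by random sampling as a crude stand-in for the k-means training Faiss performs, and `build_ivf`, `ivf_search`, and `nearest_centroids` are hypothetical names.

```python
import math
import random

def nearest_centroids(x, centroids, n):
    """Indices of the n centroids closest to x, nearest first."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
    return order[:n]

def build_ivf(xb, nlist, seed=0):
    """Pick nlist centroids and fill one inverted list (of row ids) per cell."""
    rng = random.Random(seed)
    centroids = rng.sample(xb, nlist)        # stand-in for k-means training (assumption)
    lists = [[] for _ in range(nlist)]
    for i, x in enumerate(xb):
        lists[nearest_centroids(x, centroids, 1)[0]].append(i)
    return centroids, lists

def ivf_search(xq, xb, centroids, lists, k, nprobe):
    """Scan only the nprobe cells nearest to xq; return the k best (dist, id) pairs."""
    candidates = []
    for c in nearest_centroids(xq, centroids, nprobe):
        candidates.extend((math.dist(xq, xb[i]), i) for i in lists[c])
    return sorted(candidates)[:k]

rng = random.Random(1234)
xb = [[rng.random() for _ in range(8)] for _ in range(2000)]
xq = [rng.random() for _ in range(8)]
centroids, lists = build_ivf(xb, nlist=20)

exact = ivf_search(xq, xb, centroids, lists, k=4, nprobe=20)   # all cells: exhaustive
approx = ivf_search(xq, xb, centroids, lists, k=4, nprobe=2)   # only 2 of 20 cells
print([i for _, i in exact])
print([i for _, i in approx])
```

With nprobe equal to nlist every cell is scanned, so the result matches a brute-force search; a smaller nprobe scans fewer vectors and may miss some true neighbors, which is the speed versus accuracy trade-off of this index.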

Let us analyze the following IndexIVFFlat.py:

# Copyright (c) 2015-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD+Patents license found in the
# LICENSE file in the root directory of this source tree.

import numpy as np

d = 64                              # vector dimension
nb = 100000                         # database size
nq = 10000                          # number of queries
np.random.seed(1234)                # random seed, makes the results reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

import faiss

nlist