2025.06.17 [WGS] | Counting SNPs per Sample in a Multi-Sample VCF File: A Detailed Guide (with Code and Comments)

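In population genetics, RAD-seq, and other high-throughput sequencing analyses, counting the SNPs carried by each sample is a routine step in quality control and downstream analysis. This article explains how to count SNPs per sample accurately from a merged VCF file and provides a complete, commented Python script.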

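I. Counting Principle

VCF (Variant Call Format) is the standard output format for variant calling. In a multi-sample VCF file, each data line represents one variant site: the first 9 columns hold the site-level fields (CHROM through FORMAT), and each subsequent column holds one sample's genotype (GT) and other per-sample information.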

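The core idea of SNP counting:

For each sample, walk through all variant sites; whenever the sample's GT field is not 0/0 or 0|0 (homozygous reference) and not ./. or .|. (missing), the sample is considered to carry a SNP at that site.
Sum these sites to get each sample's SNP count.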
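
The rule above can be sketched as a small predicate. This is only an illustration of the four-genotype exclusion described in the text; the names `NON_VARIANT_GTS` and `is_snp_call` are hypothetical and do not come from the original script.

```python
# Genotypes that do NOT count as a SNP call: homozygous reference and missing.
NON_VARIANT_GTS = {"0/0", "0|0", "./.", ".|."}

def is_snp_call(sample_field):
    """Return True if a sample column such as '0/1:5,5' carries a variant genotype."""
    gt = sample_field.split(":")[0]  # GT is the first colon-separated subfield
    return gt not in NON_VARIANT_GTS

print(is_snp_call("0/1:5,5"))   # True
print(is_snp_call("0/0:10,0"))  # False
print(is_snp_call("./.:."))     # False
```

Because this is a literal string comparison, haploid or half-missing calls such as `.` or `0/.` would also be counted; whether that matters depends on how the VCF was produced.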

II. Example VCF Structure

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1 sample2 sample3
chr1    100     .       A       G       99      PASS    ...     GT:AD   0/1:5,5 0/0:10,0 1/1:0,10
chr1    200     .       T       C       99      PASS    ...     GT:AD   0/0:8,0  0/1:4,4  ./.:.
chr1    300     .       G       A       99      PASS    ...     GT:AD   1/1:0,9  0/0:9,0  0/1:5,5

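The first line is the header; each subsequent line is one SNP site.
The GT field holds the genotype: 0/0 is homozygous reference, 0/1 or 1/0 is heterozygous, 1/1 is homozygous alternate, and ./. is missing.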
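
To make the column layout concrete, here is a minimal sketch that pairs the sample names from the header with the GT values of the first example record. The helper name `genotypes_by_sample` is hypothetical, and the INFO column is written as `.` because the example above elides it.

```python
# Header and first record from the example, tab-separated (INFO shown as "." here).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3"
record = "chr1\t100\t.\tA\tG\t99\tPASS\t.\tGT:AD\t0/1:5,5\t0/0:10,0\t1/1:0,10"

def genotypes_by_sample(header_line, record_line):
    """Map each sample name to the GT value of one record."""
    samples = header_line.rstrip("\n").split("\t")[9:]   # sample names start at column 10
    columns = record_line.rstrip("\n").split("\t")[9:]   # matching per-sample columns
    return {sample: column.split(":")[0] for sample, column in zip(samples, columns)}

print(genotypes_by_sample(header, record))
# {'sample1': '0/1', 'sample2': '0/0', 'sample3': '1/1'}
```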

III. Counting Workflow

1. Get the sample names

The bcftools query -l command lists all sample names in a VCF file.
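
If bcftools is not at hand, the same names can be read from the `#CHROM` header line in plain Python. The sketch below is an alternative to the bcftools call, not what the script in this article does; `read_sample_names` is a hypothetical helper, and it assumes the file is either plain text or gzip/bgzip compressed with a `.gz` suffix.

```python
import gzip

def read_sample_names(vcf_path):
    """Return the sample names from the #CHROM header line of a VCF file."""
    # Assumption: compressed files end in ".gz"; everything else is plain text.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed; sample names follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the data lines without seeing a header
    raise ValueError(f"no #CHROM header line found in {vcf_path}")
```

A call such as `read_sample_names("merged_filtered_snps.vcf.gz")` should return the names in the same column order that `bcftools query -l` prints.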

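2. Walk through every SNP line and count SNPs per sample

Skip comment lines (those starting with #).
Parse the GT field of each line.
For each sample, add 1 to its SNP count if the GT is not 0/0, 0|0, ./., or .|..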
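
The three steps above can be traced on the example from Section II. The sketch below works on an in-memory list of lines rather than a real file; `count_snps_from_lines` and `example_lines` are hypothetical names, and the INFO column is written as `.` because the example elides it.

```python
def count_snps_from_lines(lines):
    """Count variant genotypes per sample from an iterable of VCF lines."""
    samples = []
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                  # sample names from the header
            counts = {sample: 0 for sample in samples}
            continue
        for sample, column in zip(samples, fields[9:]):
            gt = column.split(":")[0]             # keep only the GT subfield
            if gt not in ("0/0", "0|0", "./.", ".|."):
                counts[sample] += 1
    return counts

example_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample3",
    "chr1\t100\t.\tA\tG\t99\tPASS\t.\tGT:AD\t0/1:5,5\t0/0:10,0\t1/1:0,10",
    "chr1\t200\t.\tT\tC\t99\tPASS\t.\tGT:AD\t0/0:8,0\t0/1:4,4\t./.:.",
    "chr1\t300\t.\tG\tA\t99\tPASS\t.\tGT:AD\t1/1:0,9\t0/0:9,0\t0/1:5,5",
]
print(count_snps_from_lines(example_lines))
# {'sample1': 2, 'sample2': 1, 'sample3': 2}
```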

IV. Complete Python Script with Comments

#!/usr/bin/env python3
"""
Count the number of SNPs per sample in a multi-sample VCF file
Author: 生信小助手
"""

import sys
import os
import subprocess

def get_sample_names(vcf_file):
    """
    Return the sample names in the VCF file, in column order
    """
    # Pass the command as a list (no shell) so paths with spaces or special characters are safe
    cmd = ["bcftools", "query", "-l", vcf_file]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return [line.strip() for line in result.stdout.split('\n') if line.strip()]
    else:
        print(f"Error reading sample names: {result.stderr}", file=sys.stderr)
        sys.exit(1)

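def count_snps(vcf_file, sample_names):
    """
    Count the SNPs carried by each sample.
    A genotype counts as a SNP when at least one allele is a called
    non-reference allele; 0/0, 0|0, ./. and .|. are therefore excluded.
    """
    snp_counts = {sample: 0 for sample in sample_names}
    # bcftools view -H prints the data lines only (no header lines)
    cmd = ["bcftools", "view", "-H", vcf_file]
    # Stream the output line by line instead of loading the whole VCF into memory
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            if not line.strip():
                continue
            fields = line.rstrip('\n').split('\t')
            # The first 9 VCF columns are fixed; sample columns follow
            for idx, gt in enumerate(fields[9:]):
                sample = sample_names[idx]
                gt_val = gt.split(':')[0]  # keep only the GT subfield
                alleles = gt_val.replace('|', '/').split('/')
                # Skip calls with no called alternate allele (0/0, ./., 0/. and so on)
                if any(a not in ('0', '.') for a in alleles):
                    snp_counts[sample] += 1
    if proc.returncode != 0:
        print("Error extracting SNP data with bcftools view", file=sys.stderr)
        sys.exit(1)
    return snp_counts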

def main():
    if len(sys.argv) < 2:
        print("Usage: python count_snp_per_sample.py <vcf_file> [output_file]")
        sys.exit(1)
    vcf_file = sys.argv[1]
    output_file = sys.argv[2] if len(sys.argv) > 2 else "snp_count_per_sample.txt"
    if not os.path.exists(vcf_file):
        print(f"Error: VCF file {vcf_file} does not exist")
        sys.exit(1)
    sample_names = get_sample_names(vcf_file)
    snp_counts = count_snps(vcf_file, sample_names)
    with open(output_file, 'w') as f:
        f.write("Sample\tSNP_Count\n")
        for sample in sample_names:
            f.write(f"{sample}\t{snp_counts[sample]}\n")
    print(f"SNP count per sample written to {output_file}")

if __name__ == "__main__":
    main()

V. How to Run

Install bcftools (if it is not already installed):
```
conda install -c bioconda bcftools
```

Run the script:

python count_snp_per_sample.py merged_filtered_snps.vcf.gz

The output file snp_count_per_sample.txt has the following format:

Sample	SNP_Count
sample1	12345
sample2	12001
sample3	11098

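VI. Advanced: Separating Heterozygous and Homozygous SNP Counts

To count heterozygous calls (e.g. 0/1, 1/0) and homozygous alternate calls (e.g. 1/1) separately, extend the script as follows: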

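# Add inside the sample loop of count_snps; initialize both dicts next to snp_counts:
#   het_counts = {sample: 0 for sample in sample_names}
#   hom_counts = {sample: 0 for sample in sample_names}
alleles = gt_val.replace('|', '/').split('/')
if len(alleles) == 2 and '.' not in alleles:
    if alleles[0] != alleles[1]:
        het_counts[sample] += 1   # heterozygous: 0/1, 1/0, 1/2, ...
    elif alleles[0] != '0':
        hom_counts[sample] += 1   # homozygous alternate: 1/1, 2/2, ...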
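
As a self-contained illustration of the same idea, the sketch below classifies a single GT string and tallies the classes for one sample. The function name `classify_gt` and the four category labels are hypothetical; treating any genotype that contains `.` as missing and a haploid alternate call as homozygous are assumptions made here, not rules stated in the original script.

```python
from collections import Counter

def classify_gt(gt):
    """Classify a GT string as 'missing', 'hom_ref', 'hom_alt', or 'het'."""
    alleles = gt.replace("|", "/").split("/")    # treat phased and unphased alike
    if not gt or "." in alleles:
        return "missing"                         # assumption: half-missing counts as missing
    if all(allele == "0" for allele in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"                         # e.g. 1/1 or 2/2
    return "het"                                 # e.g. 0/1, 1|0, 1/2

# GT values of sample1 across the three example sites in Section II
tally = Counter(classify_gt(gt) for gt in ["0/1", "0/0", "1/1"])
print(dict(tally))
# {'het': 1, 'hom_ref': 1, 'hom_alt': 1}
```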

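VII. Summary

The core of per-sample SNP counting is parsing the GT field and excluding homozygous-reference and missing genotypes.
bcftools combined with a Python script is a recommended way to process large VCF files efficiently.
The approach can be extended to count heterozygous, homozygous, and missing genotypes as needed.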
