Tutorial: Using marker, a PDF-to-Markdown Conversion Tool

Original post · last modified 2025-03-25 07:56:02

License: CC 4.0 BY-SA

Tags: #python #pdf #open-source #github #open-source-software

First published 2025-03-24 23:20:13

Introduction

Project used: https://github.com/VikParuchuri/marker

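I have been tinkering again with my all-in-one Obsidian setup and found that I badly needed a tool to convert PDFs to Markdown.
Among the GitHub projects in this area, marker has the most stars.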

However, the author mainly works on Linux, and the documentation does not seem very friendly to Windows users.
Others have posted solutions online, for example: https://www.bilibili.com/opus/885187425386627079

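That said, the author has since updated the repository, and I personally find that write-up a bit long-winded (no offense intended).

So today I hope to get you set up with this tool in as few words as possible.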

Minimal tutorial

Make sure git is installed.
Pick a folder you like and run:

git clone https://github.com/VikParuchuri/marker

Create a virtual environment with conda (make sure you have miniconda or anaconda installed) by running:

conda create -n py310 python=3.10
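
Before going further, it can help to confirm that both command-line tools from the steps above are actually reachable. The following is only a minimal sketch: the function name find_missing_tools is hypothetical, and it assumes the tools are expected on PATH (on Windows, conda is often only visible from inside the Anaconda terminal, so run it there).

```python
import shutil


def find_missing_tools(tools=("git", "conda")):
    """Return the names of command-line tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("git and conda are both available")
```

If conda is reported as missing in a normal terminal but works in the Anaconda terminal, that is expected and not a problem for this tutorial.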

Open an Anaconda terminal in the marker folder, activate the environment created above, and enter:

conda activate py310
pip install poetry
poetry install

This step downloads the required dependencies.
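
One rough way to check that the dependency step worked is to ask Python whether the key packages can be located in the active environment. This is a sketch under assumptions: the helper name missing_packages is hypothetical, and the package list ("marker" and "torch") is just what the script later in this post imports.

```python
import importlib.util


def missing_packages(names=("marker", "torch")):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("marker and torch can both be imported")
```

Run it from the same terminal, with the same environment activated, that you used for the install commands; otherwise it checks a different Python installation.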

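Edit marker\settings.py and insert the following lines directly above the line that reads class Settings(BaseSettings):

# Print whether PyTorch can see a CUDA device
print(torch.cuda.is_available())
# Define the cuda variable
cuda = 'cuda' if torch.cuda.is_available() else 'cpu'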

Finally, you also need to install CUDA and a matching version of PyTorch. Plenty of people have already written about this,
for example: https://zhuanlan.zhihu.com/p/672526561
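
Once CUDA and PyTorch are installed, a quick report like the one below can show whether the PyTorch build actually sees the GPU. It is a sketch rather than an official check: the function name cuda_report and the dictionary keys are my own, and it deliberately returns a result instead of raising when torch is missing.

```python
def cuda_report():
    """Collect basic facts about the PyTorch/CUDA install without raising."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    available = torch.cuda.is_available()
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
    }
    if available:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_build is None, the installed PyTorch is most likely a CPU-only build and needs to be replaced with a CUDA build that matches your driver.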

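Run

Now create a Python file with the code below, change the book path to the path of your own PDF, and you are ready to go.
If you run out of GPU memory (CUDA out of memory), try lowering num_workers.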

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered, save_output
import multiprocessing
import torch
import os

if __name__ == '__main__':
    # Create the output directory
    output_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "book")
    os.makedirs(output_dir, exist_ok=True)
    
    # Check whether CUDA is available
    if torch.cuda.is_available():
        print(f"Using CUDA device: {torch.cuda.get_device_name(0)}")
        # On CUDA, use fewer workers to avoid running out of GPU memory
        num_workers = 2  # each worker uses roughly 5 GB of GPU memory
    else:
        print("CUDA not detected, using CPU")
        num_workers = max(1, multiprocessing.cpu_count() // 2)
    
    converter = PdfConverter(
        artifact_dict=create_model_dict(),
        config={
            "workers": num_workers,  # number of workers
            "disable_multiprocessing": False,  # keep multiprocessing enabled
            "output_dir": output_dir  # output directory
        }
    )
    
    try:
        input_file = "H:/Obsidian/Myrepo/00 课内学习/数据库/数据库系统概论（第5版） (王珊 萨师煊) (Z-Library).pdf"
        base_name = os.path.splitext(os.path.basename(input_file))[0]
        
        # Convert the PDF
        rendered = converter(input_file)

        # Save the output
        save_output(rendered, output_dir, base_name)

        # Get the text content
        text, _, images = text_from_rendered(rendered)
        print(text)
        print(f"Output saved to: {output_dir}")

    except Exception as e:
        print(f"An error occurred during conversion: {e}")
        import traceback
        print(traceback.format_exc())  # print the full traceback
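
If the fixed value of 2 workers still runs out of GPU memory, or wastes a larger card, the worker count can be estimated from the card's total memory instead. The sketch below is only an illustration: the 5 GB per worker figure is the rough estimate from the comment in the script above, while reserve_gb and both function names are assumptions of mine, not part of marker.

```python
def workers_for_vram(total_gb, per_worker_gb=5.0, reserve_gb=1.0):
    """Estimate how many workers fit in total_gb of GPU memory.

    per_worker_gb is the rough per-worker figure from the script above;
    reserve_gb is a hypothetical safety margin left for other processes.
    Always returns at least 1.
    """
    usable_gb = total_gb - reserve_gb
    return max(1, int(usable_gb // per_worker_gb))


def detect_total_vram_gb():
    """Return the total memory of GPU 0 in GiB, or None if it cannot be detected."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    vram = detect_total_vram_gb()
    if vram is None:
        print("No CUDA device detected; the estimate does not apply")
    else:
        print(f"{vram:.1f} GiB detected, suggested workers: {workers_for_vram(vram)}")
```

The result could be assigned to num_workers in the CUDA branch of the script. Treat it as a starting point and lower it further if out-of-memory errors persist, since actual usage depends on the PDF.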