I browsed a few blog posts and saw that some authors had written their own HTML-to-text conversion functions, but the project deadline was tight and I didn't feel like writing one from scratch.
So I reached for the clean_html() function in the nltk module, used like this:
import nltk
html="""
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a simple HTML page</h1>
<p>Hello World!</p>
</body>
</html>
"""
print(nltk.clean_html(html))
You might think that's the end of it? No — instead you'll be greeted with a red error message:
"To remove HTML markup, use BeautifulSoup's get_text() function"
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
The error message is telling you to abandon this function and use BeautifulSoup's get_text() instead. Still, I wasn't ready to give up: I wanted to solve this with nltk rather than learn BeautifulSoup from scratch, since I already had a little prior experience with nltk.
(If you would rather use BeautifulSoup, here is the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
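For reference, here is a minimal sketch of the approach the error message recommends, assuming the beautifulsoup4 package is installed; the sample HTML and parameters below are my own illustration, not taken from the nltk docs:

```python
# Minimal BeautifulSoup sketch (assumes: pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Hello World!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
# get_text() concatenates all text nodes; separator/strip tidy the whitespace.
print(soup.get_text(separator=" ", strip=True))  # -> Title Hello World!
```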
To get to the bottom of it, I opened the util.py file mentioned in the traceback, which contains:
def clean_html(html):
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
As you can see, the function is simply no longer implemented. After some digging, however, I learned that the old implementation of clean_html() is still on GitHub, in this commit:
https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb
It looks like this:
def clean_html(html):  # parse an HTML string into plain text, using nltk's old clean_html() implementation
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()
In short, there is no need to import nltk at all — just drop the clean_html() implementation into your own project and use it directly, like this:
import re

def clean_html(html):  # parse an HTML string into plain text, using nltk's old clean_html() implementation
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()
html="""
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a simple HTML page</h1>
<p>Hello World!</p>
</body>
</html>
"""
print(clean_html(html))
All of this is mostly to retrace my own thought process and record what I learned, in the hope that it helps other learners as well.
For completeness, here is the custom HTML-to-text function from one of the blogs I came across (with a few changes of my own, since the original code raises errors).
Source: https://www.jb51.net/article/59904.htm
from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text

print(dehtml(html))  # just call dehtml() directly
That said, I don't recommend this custom function: it cannot strip out the style/script code embedded in the HTML, so the result has a small flaw. If that doesn't matter for your use case, it's perfectly usable — the impact is minor.
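To see the limitation concretely, here is a small self-contained check; the sample string is my own, and the regex-based clean_html() from earlier in this post is repeated so the snippet runs on its own. The HTMLParser-based dehtml() would keep the CSS rules as ordinary text, while clean_html() strips them first:

```python
import re

def clean_html(html):
    # Same regex-based implementation as shown earlier in this post.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()

sample = "<style>body { color: red; }</style><p>Hello&nbsp;World!</p>"
print(clean_html(sample))  # -> Hello World! (the CSS rules are gone)
```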
Reference: http://www.voidcn.com/article/p-uwvcknuy-buu.html