I browsed a few blog posts and saw that some authors had written their own HTML-to-text conversion functions, but the project deadline was tight and I didn't feel like writing one from scratch.
So I reached for the clean_html() function in the nltk module, used like this:
import nltk
html="""
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a simple HTML page</h1>
<p>Hello World!</p>
</body>
</html>
"""
print(nltk.clean_html(html))
You might think that's the end of it? No — instead you'll be greeted with a red error message:
"To remove HTML markup, use BeautifulSoup's get_text() function"
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function
The error message is telling you to abandon this function and use BeautifulSoup's get_text() instead. Still, I wasn't ready to give up: I wanted to solve this with nltk rather than learn BeautifulSoup from scratch, since I already had a little prior experience with nltk.
(If you would rather use BeautifulSoup, here is the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
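For reference, here is a minimal sketch of the approach the error message recommends, assuming the beautifulsoup4 package is installed; the sample HTML and parameters below are my own illustration, not taken from the nltk docs:

```python
# Minimal BeautifulSoup sketch (assumes: pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p>Hello World!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
# get_text() concatenates all text nodes; separator/strip tidy the whitespace.
print(soup.get_text(separator=" ", strip=True))  # -> Title Hello World!
```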
To get to the bottom of it, I opened the util.py file mentioned in the traceback, which contains:
def clean_html(html):
    raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")
As you can see, the function is simply no longer implemented. After some digging, however, I learned that the old implementation of clean_html() is still on GitHub, in this commit:
https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb
It looks like this:
def clean_html(html):  # parse an HTML string into plain text, using nltk's old clean_html() implementation
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()
In short, there is no need to import nltk at all — just drop the clean_html() implementation into your own project and use it directly, like this:
import re

def clean_html(html):  # parse an HTML string into plain text, using nltk's old clean_html() implementation
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()
html="""
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a simple HTML page</h1>
<p>Hello World!</p>
</body>
</html>
"""
print(clean_html(html))
All of this is mostly to retrace my own thought process and record what I learned, in the hope that it helps other learners as well.
For completeness, here is the custom HTML-to-text function from one of the blogs I came across (with a few changes of my own, since the original code raises errors).
Source: https://www.jb51.net/article/59904.htm
from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text

print(dehtml(html))  # just call dehtml() directly
That said, I don't recommend this custom function: it cannot strip out the style/script code embedded in the HTML, so the result has a small flaw. If that doesn't matter for your use case, it's perfectly usable — the impact is minor.
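To see the limitation concretely, here is a small self-contained check; the sample string is my own, and the regex-based clean_html() from earlier in this post is repeated so the snippet runs on its own. The HTMLParser-based dehtml() would keep the CSS rules as ordinary text, while clean_html() strips them first:

```python
import re

def clean_html(html):
    # Same regex-based implementation as shown earlier in this post.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()

sample = "<style>body { color: red; }</style><p>Hello&nbsp;World!</p>"
print(clean_html(sample))  # -> Hello World! (the CSS rules are gone)
```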
Reference: http://www.voidcn.com/article/p-uwvcknuy-buu.html