With a file named 'foo.html' containing:
Blah blah blah
**Catalina 320**
Blah
**Catalina 320**Blah Blah
**These boats** are fully booked for the day
Blah blah blah
Catalina 320
Catalina 320
Code:
from time import clock
n = 1000
########################################################################
import lxml.etree as ET
from lxml.etree import XMLParser
parser = XMLParser(ns_clean=True, recover=True)
etree = ET.parse('foo.html', parser)
te = clock()
for i in xrange(n):
    resultsArray = []
    for thing in etree.findall("//"):
        if "These boats" in thing.text:
            break
        elif "Catalina 320" in thing.text:
            resultsArray.append(ET.tostring(thing).strip())
tf = clock()
print 'Solution with lxml'
print tf-te,'\n',resultsArray
########################################################################
with open('foo.html') as f:
    text = f.read()
import re
print '\n\n----------------------------------'
rigx = re.compile('(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',re.DOTALL)
te = clock()
for i in xrange(n):
    yi = rigx.findall(text)
tf = clock()
print 'Solution 1 with a regex'
print tf-te,'\n',yi
print '\n----------------------------------'
ragx = re.compile('(Catalina 320)|(These boats)')
te = clock()
for i in xrange(n):
    li = []
    for mat in ragx.finditer(text):
        if mat.group(2):
            break
        else:
            li.append(mat.group(1))
tf = clock()
print 'Solution 2 with a regex, similar to solution with lxml'
print tf-te,'\n',li
print '\n----------------------------------'
regx = re.compile('(Catalina 320)')
te = clock()
for i in xrange(n):
    ye = regx.findall(text, 0, text.find('These boats') if 'These boats' in text else len(text))
tf = clock()
print 'Solution 3 with a regex'
print tf-te,'\n',ye

Result
Solution with lxml
0.30324105438
['**Catalina 320**', '
**Catalina 320**']
----------------------------------
Solution 1 with regex
0.0245033935877
['Catalina 320', 'Catalina 320']
----------------------------------
Solution 2 with a regex, similar to solution with lxml
0.0233258696287
['Catalina 320', 'Catalina 320']
----------------------------------
Solution 3 with regex
0.00784708671074
['Catalina 320', 'Catalina 320']

So, what is wrong with my regex solutions?
Times:
lxml - 100 %
solution 1 - 8.1 %
solution 2 - 7.7 %
solution 3 - 2.6 %
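
The percentages above are each measured time divided by the lxml time. As a rough way to repeat the comparison of the two fastest approaches on a current interpreter, here is a minimal sketch written for Python 3 (an assumption, since the code above is Python 2, and `time.clock` was removed in Python 3.8). `TEXT` is a hypothetical stand-in for the contents of foo.html with the markup omitted, and the names `solution_2`, `solution_3` and `relative_times` are illustrative only. It uses `timeit` instead of `clock`, and it does not time lxml.

```python
import re
import timeit

# Hypothetical stand-in for the contents of foo.html (markup omitted).
TEXT = (
    "Blah blah blah\n"
    "Catalina 320\n"
    "Blah\n"
    "Catalina 320Blah Blah\n"
    "These boats are fully booked for the day\n"
    "Blah blah blah\n"
    "Catalina 320\n"
    "Catalina 320\n"
)

BOTH = re.compile(r'(Catalina 320)|(These boats)')
NAME = re.compile(r'Catalina 320')


def solution_2(text):
    """Collect matches one by one and stop at the sentinel phrase."""
    found = []
    for mat in BOTH.finditer(text):
        if mat.group(2):
            break
        found.append(mat.group(1))
    return found


def solution_3(text):
    """Search only the part of the text that precedes the sentinel phrase."""
    end = text.find('These boats')
    if end == -1:
        end = len(text)
    return NAME.findall(text, 0, end)


def relative_times(n=1000):
    """Return each solution's time as a percentage of the slowest one."""
    timings = {
        'solution 2': timeit.timeit(lambda: solution_2(TEXT), number=n),
        'solution 3': timeit.timeit(lambda: solution_3(TEXT), number=n),
    }
    slowest = max(timings.values())
    return {name: 100.0 * t / slowest for name, t in timings.items()}


if __name__ == '__main__':
    print(solution_2(TEXT))
    print(solution_3(TEXT))
    print(relative_times())
```

Absolute timings depend on the machine and the interpreter, so only the ratios are worth comparing.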
Using a regex does not require the text to be XML or HTML.
So, what arguments remain for claiming that regexes are inferior to lxml for this problem?

EDIT 1
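The solution with rigx = re.compile('(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',re.DOTALL) is not good:
this regex will also catch the occurrences of 'Catalina 320' located after 'These boats' if there is no occurrence of 'Catalina 320' before 'These boats'.
The pattern must be:
rigx = re.compile('(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?|These boats.*\Z',re.DOTALL)
But compared with the other solutions, this is a rather complicated pattern.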
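
The failure case described in EDIT 1 can be checked with a small sketch, written for Python 3 (an assumption, since the code above is Python 2). The sample `text` is hypothetical: it has no 'Catalina 320' before 'These boats'. With the corrected pattern, `findall` returns an empty string for the match produced by the second alternative, so those empty strings are filtered out here.

```python
import re

# Hypothetical sample: 'Catalina 320' occurs only AFTER 'These boats'.
text = (
    "Blah\n"
    "These boats are fully booked for the day\n"
    "Catalina 320\n"
    "Catalina 320\n"
)

# First pattern from the answer: wrongly matches after the sentinel here.
original = re.compile(
    r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?',
    re.DOTALL)

# Corrected pattern: the added alternative swallows everything from
# 'These boats' to the end, leaving group 1 empty for that match.
corrected = re.compile(
    r'(Catalina 320)(?:(?:.(?!Catalina 320))*These boats.*\Z)?'
    r'|These boats.*\Z',
    re.DOTALL)

original_hits = original.findall(text)
corrected_hits = [hit for hit in corrected.findall(text) if hit]

print(original_hits)
print(corrected_hits)
```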