1. parse_company.py
Suited to extracting infoboxes with a regular key-value structure. Some of the infobox_type values it handles:
- ‘Infobox company’
- ‘Infobox book’
- ‘Infobox person’
- ‘Infobox university’
- ‘Infobox military person’
- ‘Infobox artist’
- ‘Infobox scientist’
- ‘Infobox football club’
- ‘Infobox Organization’
- ‘Infobox award’
- ‘Infobox video game’
- ‘Infobox ice hockey player’
- ‘Infobox protected area’
- ‘Infobox lake’
- ‘Infobox radio station’
- ‘Infobox software’
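- ‘Infobox writer’ (special case; see the code comments)
- ‘Infobox football match’ (special case; see the code comments)
- ‘Infobox prepared food’ (special case; see the code comments)
- ‘Infobox ship begin’ (special case; see the code comments)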
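A minimal sketch of this kind of key-value extraction, using only the Python standard library. It is not taken from parse_company.py: the names KeyValueInfobox and parse_key_value are hypothetical, and it assumes the infobox is rendered HTML in which each data row holds one th (key) and one td (value).

```python
from html.parser import HTMLParser


class KeyValueInfobox(HTMLParser):
    """Collect th/td text pairs from the rows of an infobox table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._cell = None  # tag of the cell currently being read: "th" or "td"
        self._row = {}     # text collected so far for the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag in ("th", "td"):
            self._cell = tag
            self._row.setdefault(tag, "")

    def handle_data(self, data):
        # Text inside nested tags (for example links) is still part of the cell.
        if self._cell:
            self._row[self._cell] += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr" and "th" in self._row and "td" in self._row:
            # Only rows with both a key and a value are kept; title rows are skipped.
            key = " ".join(self._row["th"].split())
            self.pairs[key] = " ".join(self._row["td"].split())


def parse_key_value(html):
    parser = KeyValueInfobox()
    parser.feed(html)
    return parser.pairs
```

Rows that contain only a th, such as a title row spanning both columns, are ignored, which is why this approach only fits the regular key-value types listed above.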
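Entries that failed under another template can be re-parsed with this one. An example query for finding the failed entries:
for item in db.wiki_en2.find({'info_box_type': box_type, 'infobox21_flag': -1},
                             no_cursor_timeout=True).batch_size(50):
    ...  # handle each failed entry here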
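2. parse_Officeholder.py
Common template for government officials. It handles both key-value rows and groups; the distinguishing feature of these infoboxes is the "in office" entries.
Some of the infobox_type values it handles: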
- ‘Infobox Officeholder’
- ‘Infobox officeholder’
- ‘Infobox Congressman’
- ‘Infobox congressman’
- ‘Infobox State Representative’
- ‘Infobox Politician’
- ‘Infobox politician’
- ‘Infobox MP’
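3. untility.py
Shared module imported by the parse_*.py files.
It holds the regular expressions, information extraction, th/tr handling, superscript conversion, and similar helpers.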
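As an illustration of what a superscript-conversion helper might look like, here is a sketch. It is an assumption, not the code in untility.py: the function name convert_superscripts and the choice of caret notation are hypothetical, and citation markers such as [1] are simply dropped.

```python
import re

_SUP = re.compile(r"<sup\b[^>]*>(.*?)</sup>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_CITATION = re.compile(r"\[[^\]]*\]")


def convert_superscripts(html):
    """Rewrite <sup>x</sup> as ^x and drop citation markers like <sup>[1]</sup>."""

    def replace(match):
        # Remove any tags nested inside the superscript, e.g. the citation link.
        inner = _TAG.sub("", match.group(1)).strip()
        if not inner or _CITATION.fullmatch(inner):
            return ""
        return "^" + inner

    return _SUP.sub(replace, html)
```

Nested sup elements are not handled; the non-greedy pattern stops at the first closing tag.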
4. parse_xx.py
Second-generation scripts; most can be run as-is. Each one handles the infobox type ‘Infobox xx’.
parse_Election.py extracts infoboxes of type ‘Infobox Election’
parse_planet.py extracts infoboxes of type ‘Infobox planet’
parse_Album.py extracts infoboxes of type ‘Infobox Album’ and ‘Infobox single’
parse_NCAA_team_season.py extracts infoboxes of type ‘Infobox NCAA team season’
…
5. parse_infobox_xx_en.py
First-generation scripts. They have some known problems and can be used as a reference or improved further.
parse_infobox_station_en.py extracts infoboxes of type ‘Infobox station’
…
Note: the first-generation scripts do not extract the internal links inside values.
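For anyone extending the first-generation scripts, the sketch below shows one way to collect the internal links of a value cell. It is not code from this repository: the names are hypothetical, and it assumes the value is rendered HTML in which internal links are a elements whose href starts with /wiki/.

```python
from html.parser import HTMLParser


class InternalLinkCollector(HTMLParser):
    """Collect (anchor text, href) pairs for links that point at /wiki/ pages."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the internal link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/wiki/"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def internal_links(value_html):
    collector = InternalLinkCollector()
    collector.feed(value_html)
    return collector.links
```

External links and anchors without an href are skipped, so only the links that identify other entries are returned.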
6. infobox/main.py
For complex infoboxes that mix key-value rows, tables, and groups, run the infobox project.
Set box_type, choose a suitable template (album/sportsperson/settlement), then run main.py.
box_type = 'Infobox Album'
moduleName = "album" # choose the template
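How main.py turns moduleName into a template module is not shown here. One plausible approach, sketched below as an assumption, is a dynamic import with importlib; the function name load_pattern and the default package name "pattern" are hypothetical (the latter guessed from the infobox/pattern/ directory).

```python
import importlib


def load_pattern(module_name, package="pattern"):
    """Import <package>.<module_name> and return the module object."""
    return importlib.import_module(f"{package}.{module_name}")


# Hypothetical usage, run from inside the infobox directory:
#   pattern = load_pattern(moduleName)   # imports pattern.album
```

Switching templates then only requires changing the moduleName string.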
6. infobox/pattern/album.py
Suited to infoboxes with clearly separated groups. Some of the infobox_type values it handles:
- ‘Infobox baseball biography’
- ‘Infobox building’
- ‘Infobox cricketer’
- ‘Infobox school’
- ‘Infobox road’
- ‘Infobox journal’
- ‘Infobox mountain’
- ‘Infobox NFL player’
- ‘Infobox rugby league biography’
- ‘Infobox boxer’
- ‘Infobox golfer’
- ‘Infobox college coach’
- ‘Infobox political party’
- ‘Infobox AFL biography’
- ‘Infobox park’
- ‘Infobox royalty’
- ‘Infobox church’
- ‘Infobox Australian place’
- ‘Infobox Christian leader’
- ‘Infobox song’
- ‘Infobox rockunit’
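The grouping idea can be sketched as follows. This is an illustration, not the logic in album.py: it assumes the rows have already been reduced to (key, value) pairs in which a group header row carries a value of None, and the name group_rows and the default group label are hypothetical.

```python
def group_rows(rows, default_group="general"):
    """Nest key-value rows under the most recent group header row."""
    groups = {}
    current = default_group
    for key, value in rows:
        if value is None:
            # A header row opens a new group.
            current = key
            groups.setdefault(current, {})
        else:
            groups.setdefault(current, {})[key] = value
    return groups
```

Rows that appear before the first header land in the default group, so no value is lost.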
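7. infobox/pattern/sportsperson.py
Similar to the album.py template. Suited to infoboxes with clearly separated groups and a record section at the bottom,
typically for athletes. Some of the infobox_type values it handles: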
- ‘Infobox sportsperson’
- ‘Infobox swimmer’
- ‘Infobox athlete’
- ‘Infobox tennis biography’
8. infobox/pattern/settlement.py
Suited to infoboxes whose groups are separated by horizontal rules. Some of the infobox_type values it handles:
- ‘Infobox settlement’
- ‘Infobox Settlement’
- ‘Infobox Italian comune’
- ‘Infobox UK place’
- ‘Infobox German location’
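A sketch of splitting rows into groups at each horizontal rule. It is an assumption, not the logic in settlement.py: it supposes each row has been reduced to (css_class, key, value) and that the row drawn with a rule above it carries a marker class, here "mergedtoprow"; the function name split_on_rule is hypothetical.

```python
def split_on_rule(rows, rule_class="mergedtoprow"):
    """Start a new group whenever a row carries the rule marker class."""
    groups = []
    for css_class, key, value in rows:
        if rule_class in css_class.split() or not groups:
            groups.append({})
        groups[-1][key] = value
    return groups
```

Unlike the header-based grouping used for album-style infoboxes, the groups here have no names, so they are returned as an ordered list.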
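9. infobox/pattern/common.py
Shared module imported by the other *.py files.
It holds the regular expressions, tr pattern-matching methods, th/tr handling, superscript conversion, and similar helpers.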