Open-source code on GitHub: https://github.com/lhyxcxy/nlp
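The raw document (chinese.txt) is structured as follows: the data is stored line by line, one document per line. An excerpt of the corpus is given below.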
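A file in this one-document-per-line layout could be loaded roughly as in the sketch below. The helper name `load_documents`, the temporary file `sample.txt`, and its two sample sentences are assumptions made for illustration; they are not taken from the repository or from chinese.txt.

```python
import codecs
import os
import tempfile


def load_documents(path, encoding="utf-8"):
    """Return the non-empty lines of `path`, treating each line as one document."""
    documents = []
    with codecs.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            text = line.strip()
            if text:  # skip blank lines
                documents.append(text)
    return documents


# Hypothetical stand-in for chinese.txt, written to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"第一句话\n\n第二句话\n")

docs = load_documents(sample_path)
print(len(docs))
```

Skipping blank lines is a design choice of this sketch; if empty lines carry meaning in the real corpus, that check should be removed.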
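The program that trains the models and writes the output files follows.
It generates the dictionary, the various corpora, and the trained models; this article covers LDA and LSI.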
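To make the "dictionary" and "corpus" steps concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind them. It is a stand-in, not the gensim implementation: gensim's `corpora.Dictionary` and its `doc2bow` method play a conceptually similar role, but may assign ids differently. The function names `build_dictionary` and `to_bow` and the pre-tokenized sample sentences are assumptions; a real run would first segment each line with a tokenizer such as jieba.

```python
from collections import defaultdict


def build_dictionary(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = defaultdict(int)
    for token in tokens:
        if token in token2id:  # ignore tokens outside the dictionary
            counts[token2id[token]] += 1
    return sorted(counts.items())


# Hypothetical, already segmented documents (assumed for illustration).
tokenized_docs = [[u"我", u"喜欢", u"苹果"], [u"我", u"喜欢", u"喜欢", u"香蕉"]]
token2id = build_dictionary(tokenized_docs)
corpus = [to_bow(tokens, token2id) for tokens in tokenized_docs]
print(corpus)
```

Each document becomes a sparse vector of (id, count) pairs, which is the kind of input that topic models such as LSI and LDA are trained on.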
#coding:utf-8
from gensim import corpora, similarities, models
import os
from collections import defaultdict
import codecs
import json
import jieba
documents = []
"""Sentence similarity"""
f=