阿里天池学习赛 新闻文本分类

本文介绍参与阿里天池学习赛的新闻文本分类项目,包括数据预处理、词向量训练、使用神经网络和深度学习模型进行文本分类。提供完整和简化版代码链接。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

当时我的成绩  阿里天池 

原始数据  训练的词向量  模型 代码都在下面这两个链接里 一个完整版 一个不完整版  

 链接:https://pan.baidu.com/s/1I8l-5f0-IlrSPa3aP6nY2A 
提取码:1111 
复制这段内容后打开百度网盘手机App,操作更方便哦

链接:https://pan.baidu.com/s/1XNaM7fc96aSBi-sML-_vEw 
提取码:1111 
复制这段内容后打开百度网盘手机App,操作更方便哦

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import pandas as pd
import numpy as np 
import torch
from torch import nn
import torch.utils.data as data
import torch.nn.functional as F
from torch import tensor
from sklearn.metrics import f1_score
from datetime import datetime 
import time 

#csv数据量的数目测试  一共有20000条
with open("train_set839.csv", 'r') as f: #计算长度
    hang_count
AG's News Topic Classification Dataset Version 3, Updated 09/09/2015 ORIGIN AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html . The AG's news topic classification dataset is constructed by Xiang Zhang ([email protected]) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015). DESCRIPTION The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600. The file classes.txt contains a list of classes corresponding to each label. The files train.csv and test.csv contain all the training samples as comma-sparated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

甜辣uu

谢谢关注再接再厉

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值