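Today I suddenly felt like crawling all the blog posts I have written so far, partly to practice and consolidate my Scrapy skills. The target is each post's title, URL, and content.
I created the spider with the scrapy genspider -t crawl command and then edited the generated spider file. The main code is short.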
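Before the spider itself, here is a minimal standalone sketch of how the paginated list-page URLs used as start_urls can be built with str.format. The page count, the starting page number of 1, and the helper name build_start_urls are assumptions made for illustration only; the real number of list pages is not stated here.

```python
# Standalone sketch: expand a URL template into one URL per list page.
# PAGE_COUNT and build_start_urls are hypothetical, chosen for illustration.
PAGE_COUNT = 3  # assumption: the actual number of list pages may differ

LIST_URL = "https://blog.csdn.net/weixin_44521703/article/list/{}?"


def build_start_urls(page_count):
    """Return one list-page URL per page number, assuming pages start at 1."""
    return [LIST_URL.format(page) for page in range(1, page_count + 1)]


start_urls = build_start_urls(PAGE_COUNT)
for url in start_urls:
    print(url)
```

Each generated URL is a list page; in a CrawlSpider, links found on those pages are then followed according to the spider's rules.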
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class BlogSpider(CrawlSpider):
    name = 'blog'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/weixin_44521703/article/list/{}?'.format(i) for i in range