Crawling Second-Hand Housing Data with Scrapy
### Using the Scrapy Framework to Crawl Second-Hand Housing Data
To crawl second-hand housing listings with the Scrapy framework, the steps and code examples below walk through the process.
#### Creating the Project Structure
First, create a new Scrapy project and change into its directory:
```bash
scrapy startproject secondhand_housing
cd secondhand_housing
```
#### Defining the Item Class
In `items.py`, define the item that holds the scraped data. Based on the required fields, it can be set up like this:
```python
import scrapy


class SecondHandHousingItem(scrapy.Item):
    price = scrapy.Field()      # listing price
    mode = scrapy.Field()       # layout, e.g. "3室2厅"
    area = scrapy.Field()       # floor area in square meters
    floor = scrapy.Field()      # floor information
    age = scrapy.Field()        # year built
    location = scrapy.Field()   # address of the listing
    district = scrapy.Field()   # district the listing belongs to
```
This section lists the specific attributes to be extracted[^1].
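Optionally, the whitespace cleaning can be centralized with an `ItemLoader` instead of calling `.strip()` on every field inside the spider. The sketch below is a convenience added here, not part of the original walkthrough; it assumes a recent Scrapy version that ships the `itemloaders` package, and it would live in the same `items.py` below the item class:

```python
# items.py (optional) - a loader that strips whitespace and keeps the first match per field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose


class SecondHandHousingLoader(ItemLoader):
    default_item_class = SecondHandHousingItem
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
```

In the spider you would then build items with `loader = SecondHandHousingLoader(selector=house)` and `loader.add_xpath('price', ...)` instead of assigning fields by hand.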
#### Writing the Spider
Next, create a new Python file under the `spiders` folder to hold the spider logic (`scrapy genspider anjuke anjuke.com` can scaffold one for you). Assuming the target site is Anjuke, name it `anjuke_spider.py`:
```python
import scrapy
from ..items import SecondHandHousingItem


class AnjukeSpider(scrapy.Spider):
    name = "anjuke"
    allowed_domains = ["anjuke.com"]
    start_urls = [
        'http://www.anjuke.com/sy-city.html',  # replace with the actual URL
    ]

    def parse(self, response):
        # follow the links found on the start page to the listing pages
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, self.parse_list)

    def parse_list(self, response):
        houses = response.xpath('//div[@class="house-title"]')
        for house in houses:
            item = SecondHandHousingItem()
            # note: extract_first() returns None when a field is missing, which would
            # make .strip() raise; adjust the selectors to the actual page structure
            item['price'] = house.xpath('.//span[@class="unit-price"]/text()').extract_first().strip()
            item['mode'] = house.xpath('.//dd[contains(@class,"huxing")]/text()').re(r'\d室\d厅')[0].strip()
            item['area'] = house.xpath('.//dd[contains(@class,"mianji")]/text()').re(r'(\d+\.?\d*)平米')[0].strip()
            item['floor'] = house.xpath('.//dd[contains(@class,"louceng")]/text()').extract_first().strip()
            item['age'] = house.xpath('.//dd[contains(@class,"jianzhu-nianfen")]/text()').extract_first().strip()
            item['location'] = house.xpath('.//address/text()').extract_first().strip()
            item['district'] = house.xpath('.//p[@class="content__title--wrap"]/a[last()]/text()').extract_first().strip()
            yield item

        # follow pagination until there is no "next" link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_list)
```
This script handles parsing the page links as well as extracting the content of the individual listing pages.
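The usual way to run the spider is `scrapy crawl anjuke -o houses.csv` from the project root. If you prefer launching it from a plain Python script, a minimal sketch could look like this (the module path assumes the file layout described above):

```python
# run_spider.py - launch the crawler programmatically using the project settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from secondhand_housing.spiders.anjuke_spider import AnjukeSpider

process = CrawlerProcess(get_project_settings())
process.crawl(AnjukeSpider)
process.start()  # blocks until crawling is finished
```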
#### Configuring Middleware and the Downloader
Edit `settings.py` to adjust a few parameters that improve performance and help avoid anti-scraping measures:
```python
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

ITEM_PIPELINES = {
    'secondhand_housing.pipelines.SecondhandHousingPipeline': 300,
}

FEED_EXPORT_ENCODING = 'utf-8'
LOG_LEVEL = 'INFO'
CONCURRENT_REQUESTS = 16
COOKIES_ENABLED = False
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # requires the third-party package: pip install scrapy-fake-useragent
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
```
These settings help improve efficiency while reducing the risk of being blocked[^2].
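If installing `scrapy-fake-useragent` is not an option, a hand-rolled downloader middleware that rotates a small pool of user agents works as a fallback. The class name, file location, and agent strings below are illustrative assumptions, not part of the original setup:

```python
# middlewares.py - a minimal user-agent rotation middleware (sketch)
import random


class RotateUserAgentMiddleware:
    # a small illustrative pool; extend with real browser strings as needed
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    ]

    def process_request(self, request, spider):
        # pick a random user agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
```

Register it in `DOWNLOADER_MIDDLEWARES` (for example `'secondhand_housing.middlewares.RotateUserAgentMiddleware': 400`) in place of the fake-useragent entry.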
#### Data Processing Pipeline
The last step is to write a custom pipeline in `pipelines.py` that stores the scraped data in MySQL or another persistent storage backend:
```python
import mysql.connector


class SecondhandHousingPipeline(object):
    def __init__(self):
        # adjust the connection parameters to match your MySQL setup
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            passwd='',
            db='housing_data'
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            sql_query = """INSERT INTO housing_info (
                               price,
                               mode,
                               area,
                               floor,
                               age,
                               location,
                               district) VALUES (%s, %s, %s, %s, %s, %s, %s);"""
            values = (item["price"],
                      item["mode"],
                      item["area"],
                      item["floor"],
                      item["age"],
                      item["location"],
                      item["district"])
            self.cur.execute(sql_query, values)
            self.conn.commit()
        except Exception as e:
            print(f"Error inserting into database: {e}")
        return item

    def close_spider(self, spider):
        # release the database resources when the spider finishes
        self.cur.close()
        self.conn.close()
```
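The pipeline assumes a `housing_data` database with a `housing_info` table already exists. A one-off helper like the following can create them; the column names match the pipeline above, but the column types are assumptions and should be adjusted to the actual data:

```python
# create_table.py - one-off setup script; column types are illustrative assumptions
import mysql.connector

conn = mysql.connector.connect(host='localhost', user='root', passwd='')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS housing_data CHARACTER SET utf8mb4")
cur.execute("USE housing_data")
cur.execute("""
    CREATE TABLE IF NOT EXISTS housing_info (
        id INT AUTO_INCREMENT PRIMARY KEY,
        price VARCHAR(50),
        mode VARCHAR(50),
        area VARCHAR(50),
        floor VARCHAR(100),
        age VARCHAR(50),
        location VARCHAR(255),
        district VARCHAR(100)
    ) DEFAULT CHARSET=utf8mb4
""")
conn.commit()
cur.close()
conn.close()
```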
With the steps above, the design and implementation of the full workflow is complete.