Python coroutines: how to decide whether a function should be a coroutine function (if it performs I/O, define it as one)

Article link: https://blog.csdn.net/m0_55284524/article/details/149098195

Question: how do you decide whether a function should be defined as a coroutine function?

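If the function performs I/O (network reads and writes, file reads and writes, database reads and writes, and so on), define it as a coroutine function; otherwise make it a regular function.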

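A coroutine is a fragment of an asynchronous operation. Programs are usually split into functions by responsibility to keep the structure clear. Asynchronous programming goes one step further: the functions that involve time-consuming I/O (reading and writing files, databases, the network) are made asynchronous with async def.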
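As a minimal sketch of that split, the example below keeps the pure computation in a regular function and marks only the I/O step with async def. The names fetch_record, parse_record and load are hypothetical, and asyncio.sleep stands in for a real network or database read.

```python
import asyncio


def parse_record(raw: str) -> dict:
    # Pure computation, no I/O: a regular function is enough.
    name, value = raw.split("=")
    return {"name": name.strip(), "value": int(value)}


async def fetch_record(key: str) -> str:
    # Hypothetical I/O step; asyncio.sleep simulates waiting on a remote service.
    await asyncio.sleep(0.1)
    return f"{key} = 42"


async def load(key: str) -> dict:
    raw = await fetch_record(key)  # I/O: awaited
    return parse_record(raw)       # computation: called like any function


result = asyncio.run(load("answer"))
print(result)
```

Only the function that waits is a coroutine function; the parser can still be called and tested without an event loop.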

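These asynchronous functions (coroutine functions) are scheduled by the event loop through a message mechanism. The whole program runs in a single thread, but while coroutine A is waiting on I/O, the event loop runs the non-I/O code of other coroutines. When the event loop is notified that A's I/O has finished, it comes back and resumes A. By switching between coroutines like this, the event loop makes full use of the time that would otherwise be spent idle waiting on I/O. This is asynchronous I/O.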
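The switching described above can be observed with a small sketch. It assumes asyncio.sleep as a stand-in for real I/O; the worker names and delays are made up for illustration.

```python
import asyncio
import time

events = []  # records the order in which the coroutines run


async def worker(name: str, delay: float) -> None:
    events.append(f"{name} starts I/O")
    await asyncio.sleep(delay)  # simulated I/O: control returns to the event loop here
    events.append(f"{name} finished I/O")


async def main() -> None:
    # Both coroutines are scheduled on the same event loop, in one thread.
    await asyncio.gather(worker("A", 0.4), worker("B", 0.2))


start = time.perf_counter()
asyncio.run(main())
elapsed = time.perf_counter() - start

for line in events:
    print(line)
print(f"elapsed: {elapsed:.2f}s")
```

B starts while A is still waiting, and the total time is close to the longest single wait rather than the sum of both waits.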

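Keep one rule in mind when writing asynchronous I/O programs: make the places that need I/O asynchronous. Anywhere else, using a coroutine function brings no benefit.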
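A rough way to see this rule in action is to compare a coroutine that never yields with one that does. In the sketch below, time.sleep stands in for blocking or non-I/O work and asyncio.sleep for awaitable I/O; both job functions are hypothetical.

```python
import asyncio
import time


async def blocking_job() -> None:
    # Declared with async def, but time.sleep never yields to the event loop.
    time.sleep(0.2)


async def async_job() -> None:
    # Awaitable wait: the event loop can run other coroutines in the meantime.
    await asyncio.sleep(0.2)


async def timed(job) -> float:
    start = time.perf_counter()
    await asyncio.gather(job(), job(), job())
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed(blocking_job))
async_elapsed = asyncio.run(timed(async_job))
print(f"blocking: {blocking_elapsed:.2f}s, async: {async_elapsed:.2f}s")
```

The three blocking jobs run one after another even though they are coroutines, while the three awaitable jobs overlap.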

Web scraping is a natural fit for asynchronous I/O.

For example, here is the original code for scraping the Quotes to Scrape website, without coroutines:
Synchronous, blocking code

import requests
import csv  # store the data in a CSV table
from bs4 import BeautifulSoup  # HTML parsing
import time


def extract_details(page):
    response = requests.get(f'{base_url}/{page}')
    soup = BeautifulSoup(response.text, 'html.parser')

    text_list = []
    for quote in soup.select(".quote"):
        text_list.append({
            'text': quote.find(class_='text').text.strip(),
            'author': quote.find(class_='author').text.strip(),
            # Only the first tag of each quote is kept; empty string if there is none
            'tag': quote.find(class_='tag').text.strip() if quote.find(class_='tag') else '',
            'page_url': quote.find(class_='tag').get("href") if quote.find(class_="tag") else '',
        })
    return text_list

def store_results(list_of_lists):
    # Flatten the per-page lists into a single list of rows
    text_list = sum(list_of_lists, [])

    # utf-8 is set explicitly because the quotes contain non-ASCII characters
    with open('text.csv', 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = text_list[0].keys()
        file_writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        file_writer.writeheader()
        file_writer.writerows(text_list)

start_time = time.time()

base_url = "http://quotes.toscrape.com/page"

pages = range(1, 11)

list_of_lists = [
    extract_details(page)
    for page in pages
]

store_results(list_of_lists)

end_time = time.time()

print(f'Execution time: {end_time - start_time}')

Execution time: 8.88 seconds.

Here is the same scraper optimized with coroutines:

import asyncio

import aiohttp  # third-party asynchronous HTTP client library
import csv  # store the data in a CSV table
from bs4 import BeautifulSoup  # HTML parsing
import time


async def extract_details(page, session):
    async with session.get(f'{base_url}/{page}') as response:
        soup = BeautifulSoup(await response.text(), 'html.parser')

        text_list = []
        for quote in soup.select(".quote"):
            text_list.append({
                'text': quote.find(class_='text').text.strip(),
                'author': quote.find(class_='author').text.strip(),
                # Only the first tag of each quote is kept; empty string if there is none
                'tag': quote.find(class_='tag').text.strip() if quote.find(class_='tag') else '',
                'page_url': quote.find(class_='tag').get("href") if quote.find(class_="tag") else '',
            })
        return text_list


def store_results(list_of_lists):
    # Flatten the per-page lists into a single list of rows
    text_list = sum(list_of_lists, [])

    # utf-8 is set explicitly because the quotes contain non-ASCII characters
    with open('text.csv', 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = text_list[0].keys()
        file_writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        file_writer.writeheader()
        file_writer.writerows(text_list)


async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [
            extract_details(page, session)
            for page in pages
        ]
        list_of_lists = await asyncio.gather(*tasks)
        store_results(list_of_lists)


start_time = time.time()
base_url = "http://quotes.toscrape.com/page"
pages = range(1, 11)
asyncio.run(main())
end_time = time.time()

print(f'Execution time: {end_time - start_time}')
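One detail the coroutine version relies on: asyncio.gather returns its results in the order the awaitables were passed in, not in the order they finish, so the rows written to the CSV stay in page order. A small sketch of this, with a hypothetical fetch_page in which asyncio.sleep simulates later pages responding sooner:

```python
import asyncio


async def fetch_page(page: int) -> str:
    # Simulated request: page 3 "responds" first, page 1 last.
    await asyncio.sleep(0.05 * (4 - page))
    return f"page {page}"


async def main() -> list:
    # Results come back in argument order, regardless of completion order.
    return await asyncio.gather(*(fetch_page(p) for p in range(1, 4)))


results = asyncio.run(main())
print(results)
```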