C++ Concurrent Programming - 6. Designing Lock-Free Concurrent Data Structures

Article link: https://blog.csdn.net/qq_45604814/article/details/145685712

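In multithreaded programming, synchronization primitives such as mutexes (std::mutex) guarantee thread safety, but they can become a performance bottleneck under high contention. Lock-free programming uses atomic operations provided by the hardware, such as Compare-And-Swap (CAS), to build thread-safe data structures without the overhead of a mutex.

This article shows how to implement a lock-free stack in C++, walking through the code and the reasoning behind it.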

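1. Basic Lock-Free Stack

A stack is a last-in, first-out (LIFO) data structure. Its basic operations are:

push: add an element to the top of the stack.
pop: remove the element at the top of the stack.

If the four elements 1, 2, 3, 4 are pushed in that order, they are popped in the order 4, 3, 2, 1.

1.1 Stack node

Each stack node contains a field that stores the data and a pointer to the next node.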

template<typename T>
struct node
{
    T data;
    node* next;
    node(T const& data_) : 
            data(data_)
    {}
};

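Without contention, a push is expected to go through these steps:

step1 Create a new node.

step2 Point the new node's next pointer at the current head node.

step3 Update head to point to the new node.

Under contention, things can go wrong:

Thread A finishes step2 and is about to run step3. Meanwhile thread B runs step2 and step3 successfully and updates head. When thread A then runs step3, the head it read in step2 is stale, so it overwrites head and thread B's node is lost.

To prevent this, head is made an atomic variable and updated with compare_exchange (a compare-and-swap operation), which makes the update thread-safe.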
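The behavior of compare-exchange can be illustrated with a minimal single-threaded sketch. The function name cas_semantics_demo is made up for this illustration; only std::atomic and compare_exchange_strong come from the standard library.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: returns true if both observations about CAS hold.
bool cas_semantics_demo() {
    std::atomic<int> value{10};

    // Case 1: expected does not match, so nothing is stored and
    // expected is overwritten with the value actually found.
    int expected = 5;
    bool swapped = value.compare_exchange_strong(expected, 20);
    bool failure_ok = !swapped && expected == 10 && value.load() == 10;

    // Case 2: expected now matches, so the new value is stored.
    swapped = value.compare_exchange_strong(expected, 20);
    bool success_ok = swapped && value.load() == 20;

    assert(expected == 10);  // a successful CAS leaves expected untouched
    return failure_ok && success_ok;
}
```

The fact that a failed call refreshes expected is what lets the push and pop loops in this article retry without reloading head by hand.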

1.2 The push operation

step1 Create a new node that stores the data.

step2 Point the new node at head.

step3 Make the new node the head.

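template<typename T>
void push(const T& value){
    // step1: create the node (node<T> is the template from 1.1)
    auto new_node = new node<T>(value);
    do{
        // step2: link the new node in front of the current head
        new_node->next = head.load();
        // step3: publish the new node, retrying if head changed in between
    }while(!head.compare_exchange_weak(new_node->next, new_node));
}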

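A do-while loop is recommended here because it leaves room for custom logic inside the loop body. compare_exchange_weak is also recommended: it can fail spuriously even when head equals the expected value, but it is cheaper on some platforms, so when it returns false we simply retry.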
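The retry pattern is not specific to stacks. As a small sketch under assumed names (store_max and concurrent_max are illustrative helpers, not part of the article's stack), the same compare_exchange_weak loop can keep a running maximum:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Raise target to candidate unless it already holds something larger.
// A failed or spurious CAS reloads current, so the loop just re-checks.
void store_max(std::atomic<int>& target, int candidate) {
    int current = target.load();
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate)) {
    }
}

// Each of the threads proposes a different value; the largest one must win.
int concurrent_max(int threads) {
    assert(threads > 0);
    std::atomic<int> best{0};
    std::vector<std::thread> workers;
    for (int t = 1; t <= threads; ++t) {
        workers.emplace_back([&best, t] { store_max(best, t * 10); });
    }
    for (auto& w : workers) w.join();
    return best.load();
}
```

Because the loop condition is re-evaluated after every failure, a spurious failure of compare_exchange_weak costs one extra iteration and nothing else.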

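Does this work with multiple threads?

Suppose threads A and B have both executed new_node->next = head.load() and hold the same head. Thread B performs its read-modify-write first and makes nodeB the head. When thread A then performs its CAS, head (now nodeB) no longer matches thread A's new_node->next (the old head), so the CAS fails and returns false. The loop reloads head into new_node->next, and the next CAS succeeds.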
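The argument above can be checked with a small experiment. The sketch below is an assumption-laden test harness (int_node and concurrent_push_count are hypothetical names): several threads push concurrently with the same CAS loop, and afterwards the nodes are counted to confirm that none was lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct int_node {
    int value;
    int_node* next;
};

// Pushes threads * per_thread nodes concurrently, then walks the list
// (single-threaded) and returns how many nodes are actually linked.
int concurrent_push_count(int threads, int per_thread) {
    assert(threads > 0 && per_thread > 0);
    std::atomic<int_node*> head{nullptr};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&head, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                int_node* n = new int_node{i, head.load()};
                // On failure n->next is refreshed with the current head.
                while (!head.compare_exchange_weak(n->next, n)) {
                }
            }
        });
    }
    for (auto& w : workers) w.join();

    int count = 0;
    int_node* cur = head.load();
    while (cur) {
        int_node* next = cur->next;
        delete cur;
        cur = next;
        ++count;
    }
    return count;
}
```

If a push could overwrite another thread's node, the final count would be smaller than the number of pushes.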

1.3 The pop operation

step1 Read the head node.

step2 Update head to the next node.

step3 Return the data stored in the removed node.

template<typename T>
void pop(T& value){
    // declared outside the loop so the while condition can see it
    node<T>* old_head = nullptr;
    do{
        old_head = head.load(); //1
    }while(!head.compare_exchange_weak(old_head, old_head->next)); //2
    value = old_head->data; //3
}

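The CAS compares head with old_head. If they are equal, head is set to old_head's next node. Otherwise the call returns false and old_head is updated to the current value of head (the compare-exchange function does this for us).

Does this work with multiple threads?

Following the same reasoning as for push, it does.

However, this version still has several practical problems:

1. The popped node is never freed, so memory accumulates.

2. Boundary conditions are not checked, for example popping from an empty stack or pushing onto a stack that has grown too large.

3. An exception thrown while copying the popped value out of the stack loses the data (discussed in Chapter 5). The fix is to return the data through a smart pointer.

1.4 Smart-pointer version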

template<typename T>
class lock_free_stack
{
private:
    struct node
    {
        std::shared_ptr<T> data;
        node* next;
        node(T const& data_) : // the data lives on the heap, so pop never copies T
            data(std::make_shared<T>(data_))
        {}
    };
    lock_free_stack(const lock_free_stack&) = delete;
    lock_free_stack& operator = (const lock_free_stack&) = delete;
    std::atomic<node*> head;
public:
    lock_free_stack() : head(nullptr) {}
    void push(T const& data)
    {
        node* const new_node = new node(data);    // create the new node
        new_node->next = head.load();             // link it in front of the current head
        // a failed CAS reloads head into new_node->next, so just retry
        while (!head.compare_exchange_weak(new_node->next, new_node));
    }
    std::shared_ptr<T> pop()
    {
      node* old_head = nullptr; //1        
      do {
          old_head = head.load();
          if (old_head == nullptr) {
              return nullptr; 
          }
      } while (!head.compare_exchange_weak(old_head, old_head->next)); //2        
      std::shared_ptr<T> res;   //3
      res.swap(old_head->data); //4
      delete old_head;  //5 
      return res;  //6    
    }
};

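At 3 we define a smart pointer that takes over the data held in old_head->data, so returning the result never copies T and cannot throw because of a copy, however large the data is. At 5 we delete old_head, which avoids the memory build-up.

Consider, however, what happens if old_head is deleted at 5 while another thread is just reaching the CAS at 2 with the same old_head and evaluates old_head->next: the program can crash.

In more detail: one might argue that because thread 1 has already changed head to old_head->next, thread 2's compare_exchange_weak will see that head is no longer the original old_head, return false, and restart the loop. Thread 2's old_head is then reloaded with the new head, so the freed node's next pointer is never used. That reasoning suggests there is no problem, but in fact, even though the comparison fails and the loop retries, the call has already read old_head->next, which is undefined behavior.

The arguments of compare_exchange_weak/strong are evaluated before the compare-and-exchange is attempted. So even when `head` and `old_head` differ, `old_head->next` is still read (as the desired new value).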
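That the desired argument is evaluated even when the CAS fails can be shown with a small sketch. The function name and the counting lambda are hypothetical; they stand in for the read of old_head->next.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical demo: counts how often the "desired" argument is evaluated.
bool desired_argument_is_always_evaluated() {
    int evaluations = 0;
    // Stands in for reading old_head->next in pop().
    auto next_value = [&evaluations] { ++evaluations; return 42; };

    std::atomic<int> head{1};
    int expected = 2;  // deliberately wrong, so the CAS must fail
    bool swapped = head.compare_exchange_strong(expected, next_value());

    assert(expected == 1);  // the failed CAS reloaded the real value
    return !swapped && evaluations == 1 && head.load() == 1;
}
```

In pop, the equivalent of next_value() is a read through a pointer that may already have been freed, which is why the failed CAS is still dangerous.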

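A few additional points. CAS on any atomic type directly compares the binary value stored in the variable:

1. When the target is a pointer type, the value of the pointer (the address it stores) is compared, not the address of the pointer variable itself.

2. When the target is a non-pointer type, the value of the data is compared.

3. A variable can hold the same value at time A and at time B while representing different states. CAS sees the same value and wrongly concludes that nothing has changed, which can lead to the ABA problem.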
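Point 3 can be made concrete with a single-threaded simulation. This is only a sketch: the versioned struct and the function name are assumptions for illustration, and pairing the value with a version counter is one common countermeasure, not something the code in this article does.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Value paired with a version that is bumped on every update.
struct versioned {
    std::uint32_t value;
    std::uint32_t version;
};

bool aba_demo() {
    // Plain value: A -> B -> A goes unnoticed by CAS.
    std::atomic<std::uint32_t> plain{1};
    std::uint32_t seen = plain.load();
    plain.store(2);
    plain.store(1);
    bool plain_cas_succeeds = plain.compare_exchange_strong(seen, 3);

    // Versioned value: the value is back to 1, but the version differs.
    std::atomic<versioned> tagged{versioned{1, 0}};
    versioned seen_tagged = tagged.load();
    tagged.store(versioned{2, 1});
    tagged.store(versioned{1, 2});
    bool tagged_cas_succeeds =
        tagged.compare_exchange_strong(seen_tagged, versioned{3, 3});

    assert(seen_tagged.version == 2);  // the failed CAS reloaded the current state
    return plain_cas_succeeds && !tagged_cas_succeeds;
}
```

The plain CAS succeeds although the variable was modified twice in between, while the versioned CAS detects the change.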

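We therefore need a mechanism that does not delete a node immediately but puts it on a to-be-deleted list, and deletes it only once no thread can still be using it. This is called deferred deletion.

Deferred deletion distinguishes the following cases:

Only one thread is in pop at this moment: old_head and the to-be-deleted list can both be deleted.
Several threads are in pop, thread A has completed its CAS and is about to delete old_head, and thread B has just entered pop and has not yet reached the CAS that reads old_head->next: old_head can be deleted, but the to-be-deleted list cannot.
Several threads are in pop, thread A has completed its CAS and is about to delete old_head, and thread B is in the CAS that reads old_head->next: old_head must be put on the to-be-deleted list.

In summary, the design is:

If head has been updated and no other thread can reference the old head, the old head can be deleted. Otherwise it is put on the to-be-deleted list.
If only one thread is executing pop, the to-be-deleted list can be deleted.
If several threads are executing pop, the to-be-deleted list cannot be deleted.

We introduce an atomic variable, threads_in_pop, that records how many threads are currently executing pop, and another atomic variable, to_be_deleted, that holds the first node of the to-be-deleted list.

1.4.1 Reworking pop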

std::shared_ptr<T> pop() {
    // 1 increment the counter before doing anything else
    ++threads_in_pop;
    node* old_head = nullptr;
    do {
        // 2 load head into old_head
        old_head = head.load();
        if (old_head == nullptr) {
            --threads_in_pop;
            return nullptr;
        }
    } while (!head.compare_exchange_weak(old_head, old_head->next)); // 3
    // 3 CAS: set head to the node after old_head
    std::shared_ptr<T> res;
    if (old_head)
    {
        // 4 extract the data from the node rather than copying the pointer
        res.swap(old_head->data);
    }
    // 5 reclaim deleted nodes if possible
    try_reclaim(old_head);
    return res;
}

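Incrementing threads_in_pop at the entry of pop records how many threads are currently inside pop.

(This approach has a major weakness: because of thread scheduling, old_head->next can still be accessed while we act on the value of this atomic variable. The book's author also develops pop by patching one problem after another. That is a good way to deepen the reader's understanding, but the resulting code is far from production-ready. I will present a lock-free structure of real practical value in a later chapter.)

1.4.2 Implementing try_reclaim

try_reclaim is designed as follows:

Check whether threads_in_pop is 1.

If it is 1, only the current thread is in pop, so old_head can be deleted. As a double check, threads_in_pop is examined again after decrementing it: if no other thread has entered pop in the meantime, the to-be-deleted list is deleted; otherwise the claimed nodes are chained back onto the to-be-deleted list.

If it is not 1, the node is added to the to-be-deleted list and threads_in_pop is decremented.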

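void try_reclaim(node* old_head)
{
    //1 check whether the current thread is the only one in pop
    if(threads_in_pop == 1)
    {
        //2 claim the to-be-deleted list for the current thread
        node* nodes_to_delete = to_be_deleted.exchange(nullptr);
        //3 decrement the counter and check whether pop is still
        //  being called by the current thread only
        if(!--threads_in_pop)
        {
            //4 no other thread is in pop: delete the claimed list
            delete_nodes(nodes_to_delete);
        }else if(nodes_to_delete)
        {
            //5 another thread entered pop and the claimed list is not empty:
            //  chain the nodes back onto to_be_deleted
            chain_pending_nodes(nodes_to_delete);
        }
        delete old_head;
    }
    else {
        // several threads are competing in pop, so old_head cannot be deleted;
        // put it on the to-be-deleted list
        chain_pending_node(old_head);
        --threads_in_pop;
    }
}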

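As noted above, another thread can enter pop while delete_nodes is running after the second check of threads_in_pop. If the scheduler suspends a thread at the wrong moment and another thread performs its read-modify-write in pop, the sequence is as follows:

Thread A completes the CAS in pop, and head becomes old_head->next.
Thread A enters try_reclaim, finds threads_in_pop == 1, and deletes old_head directly.
Thread B loaded the old old_head through head.load() before thread A deleted it (the node had not yet been freed at that point).
Thread B executes compare_exchange_weak(old_head, old_head->next), which reads old_head->next, but old_head has already been deleted by thread A. The result is a use-after-free.

Later examples use hazard pointers or reference counting to solve this problem.

1.4.3 Implementing delete_nodes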

static void delete_nodes(node* nodes)
{
    while (nodes)
    {
        node* next = nodes->next;
        delete nodes;
        nodes = next;
    }
}

1.4.4 Implementing chain_pending_node

void chain_pending_node(node* n)
{
    chain_pending_nodes(n, n);
}

void chain_pending_nodes(node* nodes)
{
    node* last = nodes;
    //1 follow the next pointers to the end of the chain
    while (node* const next = last->next)
    {
        last = next;
    }
    //2 put the whole chain on the to-be-deleted list
    chain_pending_nodes(nodes, last);
}

void chain_pending_nodes(node* first, node* last)
{
    //1 point last->next at the current head of the to-be-deleted list
    last->next = to_be_deleted;
    //2 loop so that last->next is correct when the CAS succeeds,
    //  then make first the new head of the to-be-deleted list
    while (!to_be_deleted.compare_exchange_weak(
        last->next, first));
}

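If several threads are in pop when try_reclaim runs, old_head is put on the to-be-deleted list.

In void chain_pending_nodes(node* first, node* last), we first set last->next to to_be_deleted, so that the chain being added ends at the current head of the to-be-deleted list. Even if that head is already out of date at this point, the CAS loop refreshes last->next with the latest value before the update succeeds.

This requires last to be the final node of the chain, so void chain_pending_nodes(node* nodes) first walks to the tail of the list with while (node* const next = last->next).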
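The splice performed by chain_pending_nodes(first, last) can be tried out in isolation. The following single-threaded sketch uses hypothetical names (pending_node, splice_front, splice_demo) and passes the list head explicitly instead of using the to_be_deleted member.

```cpp
#include <atomic>
#include <cassert>

struct pending_node {
    int id;
    pending_node* next;
};

// Same shape as chain_pending_nodes(first, last): the chain first..last
// is placed in front of whatever the list currently contains.
void splice_front(std::atomic<pending_node*>& list,
                  pending_node* first, pending_node* last) {
    last->next = list.load();
    while (!list.compare_exchange_weak(last->next, first)) {
    }
}

// Puts node 3 on the list, then splices the chain 1 -> 2 in front of it.
// Returns the ids in list order, encoded as decimal digits.
int splice_demo() {
    std::atomic<pending_node*> list{nullptr};
    pending_node c{3, nullptr};
    pending_node b{2, nullptr};
    pending_node a{1, &b};

    splice_front(list, &c, &c);  // single node: first == last
    splice_front(list, &a, &b);  // chain of two nodes
    assert(b.next == &c);        // the tail of the chain now points at the old head

    int digits = 0;
    for (pending_node* p = list.load(); p != nullptr; p = p->next) {
        digits = digits * 10 + p->id;
    }
    return digits;
}
```

Passing the true tail as last matters: if last were not the final node of the chain, the nodes after it would be cut off when last->next is overwritten.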

1.4.5 Complete code


#include <iostream>
#include <thread>
#include <set>
#include <mutex>
#include <cassert>
#include <chrono>
#include <memory>
#include <atomic>

using namespace std;
using namespace std::chrono;


template<typename T>
class lock_free_stack {
private:
	struct node {
		std::shared_ptr<T> data;
		node* next;
		node(T const& data_) :data(std::make_shared<T>(data_)) {}
	};
	lock_free_stack(const lock_free_stack&) = delete;
	lock_free_stack& operator = (const lock_free_stack&) = delete;
	std::atomic<node*> head;
	std::atomic<node*> to_be_deleted;
	std::atomic<int> threads_in_pop;
public:
	lock_free_stack(): head(nullptr),to_be_deleted(nullptr),threads_in_pop(0){}

	void push(T const& data) {
		node* const new_node = new node(data);    // create the new node
		new_node->next = head.load();             // link it in front of the current head
		while (!head.compare_exchange_weak(new_node->next, new_node));    // retry until head is swapped
	}

	std::shared_ptr<T> pop() {
		++threads_in_pop;   //1 increment the counter before doing anything else
		node* old_head = nullptr;
		do {
			old_head = head.load();  //2 load head into old_head
			if (old_head == nullptr) {
				--threads_in_pop;
				return nullptr;
			}
		} while (!head.compare_exchange_weak(old_head, old_head->next)); // 3 CAS: set head to the node after old_head

		std::shared_ptr<T> res;
		if (old_head)
		{
			// 4 extract the data from the node rather than copying the pointer
			res.swap(old_head->data);
		}
		// 5 reclaim deleted nodes if possible
		try_reclaim(old_head);
		return res;
	}

    void try_reclaim(node* old_head)
	{
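		//1 check whether the current thread is the only one in pop
		if(threads_in_pop == 1)
		{
			//2 claim the to-be-deleted list for the current thread
			node* nodes_to_delete = to_be_deleted.exchange(nullptr);
			//3 decrement the counter and check whether pop is still being called by the current thread only
			if(!--threads_in_pop)
			{
				//4 no other thread is in pop: delete the claimed list
				delete_nodes(nodes_to_delete);
			}else if(nodes_to_delete)
			{
				//5 another thread entered pop and the claimed list is not empty:
				//  chain the nodes back onto to_be_deleted
				chain_pending_nodes(nodes_to_delete);
			}
			delete old_head;
		}
		else {
			// several threads are competing in pop, so old_head cannot be deleted;
			// put it on the to-be-deleted list
			chain_pending_node(old_head);
			--threads_in_pop;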
        }
	}

	static void delete_nodes(node* nodes)
	{
		while (nodes)
		{
			node* next = nodes->next;
			delete nodes;
			nodes = next;
		}
	}

	void chain_pending_node(node* n)
	{
		chain_pending_nodes(n, n);   
	}

	void chain_pending_nodes(node* first, node* last)
	{
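		//1 point last->next at the current head of the to-be-deleted list
		last->next = to_be_deleted;
		//2 loop so that last->next is correct when the CAS succeeds,
		//  then make first the new head of the to-be-deleted list
		while (!to_be_deleted.compare_exchange_weak(
			last->next, first));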
	}

	void chain_pending_nodes(node* nodes)
	{
		node* last = nodes;
		//1 follow the next pointers to the end of the chain
		while (node* const next = last->next)
		{
			last = next;
		}
		//2 put the whole chain on the to-be-deleted list
		chain_pending_nodes(nodes, last);
	}
};




void TestLockFreeStack() {

    lock_free_stack<int> lk_free_stack;
    std::set<int>  rmv_set;
    std::mutex set_mtx;
	auto start = high_resolution_clock::now();

    std::thread t1([&]() {
        for (int i = 0; i < 200000; i++) {
            lk_free_stack.push(i);
            // std::cout << "push data " << i << " success!" << std::endl;
            // std::this_thread::sleep_for(std::chrono::milliseconds(5));
            }
        });

    std::thread t2([&]() {
		for (int i = 0; i < 100000;) {
            auto head = lk_free_stack.pop();
            if (!head) {
				std::this_thread::sleep_for(std::chrono::milliseconds(10));
				continue;
            }
			std::lock_guard<std::mutex> lock(set_mtx);
			//rmv_set.insert(*head);
            // std::cout << "pop data " << *head << " success!" << std::endl;
            i++;
		}
      });

	std::thread t3([&]() {
		for (int i = 0; i < 100000;) {
			auto head = lk_free_stack.pop();
            if (!head) {
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                continue;
            }
            std::lock_guard<std::mutex> lock(set_mtx);
            //rmv_set.insert(*head);
            // std::cout << "pop data " << *head << " success!" << std::endl;
            i++;
		}
		});

    t1.join();
    t2.join();
    t3.join();
	auto end = high_resolution_clock::now();
	auto duration = duration_cast<milliseconds>(end - start).count();
	cout << "Time elapsed: " << duration << " ms" << endl;
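    // rmv_set stays empty in this timing run because the insert calls above are
    // commented out; re-enable them before re-enabling this check.
    // assert(rmv_set.size() == 200000);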
}

int main()
{
    TestLockFreeStack();
    std::cout << "Hello World!\n";
}

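In the same environment, one pushing thread and one popping thread take roughly 70 ms for 200000 items.

One pushing thread and two popping threads take roughly 90 ms for 200000 items.

With a mutex + condition variable structure:

One pushing thread and one popping thread take roughly 130 ms for 200000 items.

One pushing thread and two popping threads take roughly 350 ms for 200000 items.

The measurement method may not be rigorous, but the trend is correct.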
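The mutex + condition variable baseline used for the comparison is not shown in this article. A minimal sketch of what such a baseline might look like is given below; the class name locked_stack and its interface are assumptions, not the code that produced the numbers above.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <stack>
#include <utility>

// Hypothetical baseline: a std::stack guarded by a mutex,
// with a condition variable for blocking pops.
template <typename T>
class locked_stack {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            items_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Blocks until an element is available.
    std::shared_ptr<T> wait_and_pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        auto res = std::make_shared<T>(std::move(items_.top()));
        items_.pop();
        return res;
    }

    // Returns nullptr immediately if the stack is empty.
    std::shared_ptr<T> try_pop() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (items_.empty()) return nullptr;
        auto res = std::make_shared<T>(std::move(items_.top()));
        items_.pop();
        return res;
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::stack<T> items_;
};

// Pushes 1 then 2 and pops twice; LIFO order gives 2 then 1.
int locked_stack_demo() {
    locked_stack<int> s;
    s.push(1);
    s.push(2);
    std::shared_ptr<int> first = s.wait_and_pop();
    std::shared_ptr<int> second = s.try_pop();
    assert(first && second);
    return *first * 10 + *second;
}
```

Every push and pop here serializes on the same mutex, which is the overhead the lock-free versions try to avoid.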

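2. Lock-Free Concurrent Stack with Hazard Pointers

The basic lock-free stack has two problems:

Problem 1: with deferred deletion, the to-be-deleted list can only be freed at a moment when exactly one thread is in pop. In practice several threads may be in pop all the time, so the to-be-deleted list may never be freed.

Problem 2: the atomicity of threads_in_pop in try_reclaim only guarantees that the counting operations themselves are atomic. It cannot stop another thread from accessing an already freed node within the time window, so it does not solve the use-after-free problem.

This section uses hazard pointers to solve both problems.

2.1 What is a hazard pointer

"Hazard pointers" are a technique invented by Maged Michael, on which IBM later filed a patent. In short, a node that is about to be deleted gets special treatment: if a thread is still using it, the pointer to that node is marked as a hazard pointer, and other threads must not delete the node.

2.2 Design

threads_in_pop fails because it cannot stop other threads from accessing a freed node within the time window. The root cause is that, at the point where old_head is deleted, nothing guarantees that no other thread still reads the original old_head->next. The idea is this: while a thread operates on old_head, it marks that node with its hazard pointer; several threads may mark the same old_head at the same time. When a thread has completed its CAS, it clears its own hazard pointer. It then scans the hazard pointer array to check whether old_head is still marked by another thread. If so, old_head is moved to the deferred to-be-deleted list; if not, it can be deleted directly. This is safe because a thread that has not yet completed its CAS never clears its hazard pointer. The following sections develop this idea:

2.3 Data structure definitions

We implement a hazard_pointer struct that holds a hazard pointer together with the id of the thread that owns it.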

struct hazard_pointer {
    std::atomic<std::thread::id> id;
    std::atomic<void*>  pointer;
};

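id is the id of the thread currently using this hazard pointer, and pointer stores the address of the node being protected.
When a thread claims a free entry in the hazard pointer array, it sets id to its own thread id; it later stores the address of the node it is accessing in pointer.

A global array holds the hazard pointers.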

hazard_pointer hazard_pointers[max_hazard_pointers];

2.4 Managing a hazard pointer with the hp_owner class

class hp_owner {
public:
    hp_owner(hp_owner const&) = delete;
    hp_owner& operator=(hp_owner const&) = delete;
    hp_owner():hp(nullptr){
        for (unsigned i = 0; i < max_hazard_pointers; ++i) {
            std::thread::id  old_id;
            if (hazard_pointers[i].id.compare_exchange_strong(old_id, std::this_thread::get_id())) {
                hp = &hazard_pointers[i];
                break;
            }
        }
        if (!hp) {
            throw std::runtime_error("No hazard pointers available");
        }
    }
    ~hp_owner() {
        hp->pointer.store(nullptr);
        hp->id.store(std::thread::id());
    }
private:
    hazard_pointer* hp;
};

2.5 Returning an available entry (not yet in use) from the global hazard pointer array

std::atomic<void*>& get_hazard_pointer_for_current_thread() {
    // thread-local variable: every thread has its own hazard pointer
    thread_local static hp_owner hazzard;  
    return hazzard.get_pointer();
}

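The first call to get_hazard_pointer_for_current_thread in a thread constructs the hp_owner, which binds that thread's id to an entry in the hazard pointer array. Because hazzard is a thread-local static variable, it is visible only to its own thread and is initialized only once per thread.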
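The once-per-thread initialization can be observed with a small experiment. This is a sketch with hypothetical names (slot_owner, owner_for_current_thread, count_constructions); slot_owner merely counts its constructions and does not claim a real hazard pointer entry.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> constructions{0};

// Stand-in for hp_owner: only records that it was constructed.
struct slot_owner {
    slot_owner() { ++constructions; }
};

slot_owner& owner_for_current_thread() {
    // Constructed on the first call in each thread, then reused.
    thread_local slot_owner owner;
    return owner;
}

// Starts the given number of threads, each calling the accessor several
// times, and returns how many slot_owner objects were constructed.
int count_constructions(int threads, int calls_per_thread) {
    assert(threads > 0 && calls_per_thread > 0);
    constructions = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                owner_for_current_thread();
            }
        });
    }
    for (auto& w : workers) w.join();
    return constructions.load();
}
```

The number of constructions equals the number of threads, not the number of calls, which is why each thread scans the hazard pointer array for a free entry only once.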

2.6 Getting the hazard pointer

std::atomic<void*>& get_pointer() {
    return hp->pointer;
}

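The get_pointer member function of hp_owner returns a reference to the pointer member of the hazard pointer entry that hp points to.

2.7 Checking whether a node is referenced by a hazard pointer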

bool outstanding_hazard_pointers_for(void* p)
{
    for (unsigned i = 0; i < max_hazard_pointers; ++i)
    {
        if (hazard_pointers[i].pointer.load() == p)
        {
            return true;
        }
    }
    return false;
}

2.8 Deferring deletion of a node that is still referenced by a hazard pointer

void reclaim_later(node* old_head) {
    add_to_reclaim_list(new data_to_reclaim(old_head));
}

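To put the node on the to-be-deleted list, we wrap it in a data_to_reclaim entry.

2.9 Adding a node to the to-be-deleted list

The struct for a to-be-deleted entry is defined as: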

struct data_to_reclaim {
    node* data;
    std::function<void(node*)> deleter;
    data_to_reclaim* next;
    data_to_reclaim(node * p):data(p), next(nullptr){}
    ~data_to_reclaim() {
        delete data;
    }
};

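The lock-free stack holds the first entry of the to-be-deleted list. Because the stack is used by several threads, the to-be-deleted list is also accessed by several threads, so this first entry is stored in an atomic variable.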

std::atomic<data_to_reclaim*>  nodes_to_reclaim;

Adding an entry to the to-be-deleted list:

void add_to_reclaim_list(data_to_reclaim* reclaim_node) {
    reclaim_node->next = nodes_to_reclaim.load();
    while (!nodes_to_reclaim.compare_exchange_weak(reclaim_node->next, reclaim_node));
}

Deleting the entries on the to-be-deleted list that are no longer referenced by any hazard pointer:

void delete_nodes_with_no_hazards() {
    data_to_reclaim* current = nodes_to_reclaim.exchange(nullptr);
    while (current) {
        data_to_reclaim* const next = current->next;
        if (!outstanding_hazard_pointers_for(current->data)) {
                delete current;
        }
        else {
            add_to_reclaim_list(current);
        }
        current = next;
    }
}

2.10 pop with hazard pointers

std::shared_ptr<T> pop()
{
    //1 get the hazard pointer entry that belongs to the current thread
    std::atomic<void*>& hp=get_hazard_pointer_for_current_thread();
    node* old_head=head.load();
    do
    {
        node* temp;
        do
        {
            temp=old_head;
            hp.store(old_head);
            old_head=head.load();
        }//2 if old_head and temp differ, another thread has updated head: retry
        while(old_head!=temp);
    }//3 set head to old_head->next; retry if the CAS fails
    while(old_head&&
          !head.compare_exchange_strong(old_head,old_head->next));
    // 4 once head has been updated, clear the hazard pointer
    hp.store(nullptr);
    std::shared_ptr<T> res;
    if(old_head)
    {
        res.swap(old_head->data);
        //5 before deleting the old head, check whether a hazard pointer still references it
        if(outstanding_hazard_pointers_for(old_head))
        {
            //6 defer the deletion
            reclaim_later(old_head);
        }
        else
        {
            //7 delete the old head node
            delete old_head;
        }
        //8 delete the deferred nodes that are no longer hazardous
        delete_nodes_with_no_hazards();
    }
    return res;
}

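First the thread obtains its hazard pointer entry. Before popping, it reads head into old_head and points its hazard pointer hp at old_head, which announces that this thread is using the node. Every other thread in pop does the same.

When a thread's CAS in pop succeeds, that thread has finished updating head, so it clears its hazard pointer (from that thread's point of view the node has been removed from the stack, and it no longer needs the protection).

Next, outstanding_hazard_pointers_for checks whether old_head is still referenced by another thread's hazard pointer. If so, old_head is added to the to-be-deleted list; if not, it is deleted directly.

Finally, the to-be-deleted list is scanned and every entry that is no longer hazardous is deleted.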
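The inner loop of pop (publish the pointer, then re-read head to confirm it is still current) can be factored out. The helper below is a sketch under assumed names (protect and protect_demo are not part of the article's code); the demo is single-threaded and only shows what the helper leaves behind.

```cpp
#include <atomic>
#include <cassert>

// Publishes the value of source in hazard and returns it, but only
// after confirming that source still held that value once the hazard
// pointer was visible.
template <typename Node>
Node* protect(std::atomic<Node*>& source, std::atomic<void*>& hazard) {
    Node* current = source.load();
    Node* published;
    do {
        published = current;
        hazard.store(published);
        current = source.load();
    } while (current != published);  // source changed in between: publish again
    return current;
}

bool protect_demo() {
    int payload = 7;
    std::atomic<int*> head{&payload};
    std::atomic<void*> hazard{nullptr};

    int* p = protect(head, hazard);
    assert(p != nullptr);
    return p == &payload && hazard.load() == &payload;
}
```

The re-read is the important step: a pointer that was loaded before the hazard pointer was published may already have been retired by another thread, so it is only trusted once head is seen unchanged afterwards.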

2.11 Complete code

#include <iostream>
#include <thread>
#include <set>
#include <mutex>
#include <cassert>
#include <chrono>
#include <memory>
#include <atomic>
#include <stdexcept>

using namespace std;

// maximum number of hazard pointers
unsigned const max_hazard_pointers = 100;
// hazard pointer
struct hazard_pointer {
	std::atomic<std::thread::id> id;
	std::atomic<void*>  pointer;
};
// hazard pointer array

hazard_pointer hazard_pointers[max_hazard_pointers];
// owner class for a hazard pointer
class hp_owner {
public:
	hp_owner(hp_owner const&) = delete;
	hp_owner& operator=(hp_owner const&) = delete;
	hp_owner():hp(nullptr){
		bind_hazard_pointer();
	}

	std::atomic<void*>& get_pointer() {
		return hp->pointer;
	}

	~hp_owner() {
		hp->pointer.store(nullptr);
		hp->id.store(std::thread::id());
	}
private:
	void bind_hazard_pointer() {
		for (unsigned i = 0; i < max_hazard_pointers; ++i) {
			std::thread::id  old_id;
			if (hazard_pointers[i].id.compare_exchange_strong(old_id, std::this_thread::get_id())) {
				hp = &hazard_pointers[i];
				break;
			}
		}

		if (!hp) {
			throw std::runtime_error("No hazard pointers available");
		}
	}
	hazard_pointer* hp;
};

std::atomic<void*>& get_hazard_pointer_for_current_thread() {
	// thread-local variable: every thread has its own hazard pointer
	thread_local static hp_owner hazzard;  
	return hazzard.get_pointer();
}

template<typename T>
class hazard_pointer_stack {
private:
	
	// stack node
	struct node {
		std::shared_ptr<T> data;
		node* next;
		node(T const& data_) :data(std::make_shared<T>(data_)) {}
	};

	// entry of the to-be-deleted list
	struct data_to_reclaim {
		node* data;
		data_to_reclaim* next;
		data_to_reclaim(node * p):data(p), next(nullptr){}
		~data_to_reclaim() {
			delete data;
		}
	};

	hazard_pointer_stack(const hazard_pointer_stack&) = delete;
	hazard_pointer_stack& operator = (const hazard_pointer_stack&) = delete;
	std::atomic<node*> head;
	std::atomic<data_to_reclaim*>  nodes_to_reclaim;
public:
	hazard_pointer_stack():head(nullptr), nodes_to_reclaim(nullptr){}

	void push(T const& data) {
		node* const new_node = new node(data);    // create the new node
		new_node->next = head.load();             // link it in front of the current head
		while (!head.compare_exchange_weak(new_node->next, new_node));    // retry until head is swapped
	}

	bool outstanding_hazard_pointers_for(void* p)
	{
		for (unsigned i = 0; i < max_hazard_pointers; ++i)
		{
			if (hazard_pointers[i].pointer.load() == p)
			{
				return true;
			}
		}
		return false;
	}

	void add_to_reclaim_list(data_to_reclaim* reclaim_node) {
		reclaim_node->next = nodes_to_reclaim.load();
		while (!nodes_to_reclaim.compare_exchange_weak(reclaim_node->next, reclaim_node));
	}

	void reclaim_later(node* old_head) {
		add_to_reclaim_list(new data_to_reclaim(old_head));
	}

	void delete_nodes_with_no_hazards() {
		data_to_reclaim* current = nodes_to_reclaim.exchange(nullptr);
			while (current) {
				data_to_reclaim* const next = current->next;
				if (!outstanding_hazard_pointers_for(current->data)) {
					delete current;
				}
				else {
					add_to_reclaim_list(current);
				}

				current = next;
			}
	}

	std::shared_ptr<T> pop() {
		//1 get the hazard pointer entry that belongs to the current thread
		std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
		node* old_head = head.load();
		do
		{
			node* temp;
			do
			{
				temp = old_head;
				hp.store(old_head);
				old_head = head.load();
			}//2 if old_head and temp differ, another thread has updated head: retry
			while (old_head != temp);
		}//3 set head to old_head->next; retry if the CAS fails
		while (old_head &&
			!head.compare_exchange_strong(old_head, old_head->next));
		// 4 once head has been updated, the hazard pointer can be cleared: the head seen by
		// other threads in pop is no longer the old_head stored in hp, so this is thread-safe
		hp.store(nullptr);
		std::shared_ptr<T> res;
		if (old_head)
		{
			res.swap(old_head->data);
			//5 before deleting the old head, check whether a hazard pointer still references it
			if (outstanding_hazard_pointers_for(old_head))
			{
				//6 defer the deletion
				reclaim_later(old_head);
			}
			else
			{
				//7 delete the old head node
				delete old_head;
			}
			//8 delete the deferred nodes that are no longer hazardous
			delete_nodes_with_no_hazards();
		}
		return res;
	}

};


int main() {
    hazard_pointer_stack<int> hazard_stack;
    std::set<int>  rmv_set;
    std::mutex set_mtx;

    std::thread t1([&]() {
        for (int i = 0; i < 20000; i++) {
            hazard_stack.push(i);
            std::cout << "push data " << i << " success!" << std::endl;
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
        });

    std::thread t2([&]() {
        for (int i = 0; i < 10000;) {
            auto head = hazard_stack.pop();
            if (!head) {
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                continue;
            }
            std::lock_guard<std::mutex> lock(set_mtx);
            rmv_set.insert(*head);
            std::cout << "pop data " << *head << " success!" << std::endl;
            i++;
        }
        });

    std::thread t3([&]() {
        for (int i = 0; i < 10000;) {
            auto head = hazard_stack.pop();
            if (!head) {
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                continue;
            }
            std::lock_guard<std::mutex> lock(set_mtx);
            rmv_set.insert(*head);
            std::cout << "pop data " << *head << " success!" << std::endl;
            i++;
        }
        });

    t1.join();
    t2.join();
    t3.join();

    assert(rmv_set.size() == 20000);

    return 0;
}

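The hazard pointer design acts as a guard: it ties hp to the CAS, and a non-null hp that points at a node means one or more threads are still working on that node, so it must not be deleted. Hazard pointers ensure that a node is reclaimed at a safe time, but they cost performance: before deleting a node, the whole hazard pointer array has to be scanned to check whether the node is still referenced. In addition, each thread has to scan the array to claim a free entry in which to record the address of the node it protects.

A better reclamation strategy for hazard pointers trades space for time: retired nodes are kept in preallocated storage, and the scan-and-reclaim pass runs only once a certain number of them has accumulated. This reduces the number of scans, at the cost of more memory.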
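The batching idea can be sketched as follows. Everything here is an assumption made for illustration (the names retire_list, hazard_slots, the threshold of 8, and the use of plain int nodes); it is single-threaded and only shows when the scan runs and what it frees.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kHazardSlots = 4;    // hypothetical sizes for the sketch
constexpr std::size_t kScanThreshold = 8;
// Static storage, so every slot starts out as nullptr.
std::atomic<void*> hazard_slots[kHazardSlots];

// Per-thread list of retired nodes. The hazard slots are scanned only
// once kScanThreshold nodes have accumulated, not on every retirement.
struct retire_list {
    std::vector<int*> retired;
    std::size_t freed = 0;

    void retire(int* p) {
        retired.push_back(p);
        if (retired.size() >= kScanThreshold) scan();
    }

    void scan() {
        std::vector<int*> still_hazardous;
        for (int* p : retired) {
            bool hazardous = false;
            for (auto& slot : hazard_slots) {
                if (slot.load() == p) hazardous = true;
            }
            if (hazardous) {
                still_hazardous.push_back(p);  // keep for a later pass
            } else {
                delete p;
                ++freed;
            }
        }
        retired.swap(still_hazardous);
    }
};

bool batch_reclaim_demo() {
    retire_list list;
    int* guarded = new int(0);
    hazard_slots[0].store(guarded);  // pretend another thread still uses this node
    list.retire(guarded);
    for (int i = 1; i < 8; ++i) list.retire(new int(i));  // the 8th retirement triggers one scan
    bool first_pass_ok = list.freed == 7 && list.retired.size() == 1;

    hazard_slots[0].store(nullptr);  // protection released
    list.scan();
    assert(list.retired.empty());
    return first_pass_ok && list.freed == 8;
}
```

With a threshold of N, the cost of scanning the hazard array is paid once per N retirements instead of once per pop, while up to N nodes per thread stay allocated longer.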

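So is there a memory reclamation technique that is not patented and that most people can use? Fortunately there is: reference counting.

3. Lock-Free Concurrent Stack with Reference Counting

C++ Concurrency in Action uses two counts, an external count and an internal count, which together make up the effective reference count. Why are two counts needed? Let us work it out step by step.

3.1 A basic reference-counting design and its problems

3.1.1 Defining the structure

Reference counting is introduced to solve the use-after-free caused by delete and the problem of deferred nodes waiting too long to be deleted. Inspired by the threads_in_pop thread counter, we first define a stack in which the reference count lives inside each node.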

template<typename T>
class single_ref_stack {
public:
    single_ref_stack():head(nullptr) {
    }
    ~single_ref_stack() {
        // pop until the stack is empty
        while (pop());
    }
private:
    struct ref_node {
        //1 smart pointer to the data
        std::shared_ptr<T>  _data;
        //2 reference count
        std::atomic<int> _ref_count;
        //3 next node
        ref_node* _next;
        ref_node(T const& data_) : _data(std::make_shared<T>(data_)),
            _ref_count(1), _next(nullptr) {}
    };
    // head node
    std::atomic<ref_node*> head;
};

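3.1.2 Implementing push

void push(T const& data) {
    auto new_node = new ref_node(data);
    new_node->_next = head.load();
    while (!head.compare_exchange_weak(new_node->_next, new_node));
}

push is simple: create a ref_node pointer new_node, point its _next at the current head node, then retry (in case another thread changes head in the meantime) until head has been updated to new_node.

3.1.3 Implementing pop

The general idea is as follows:

Each thread that enters pop increments _ref_count on the node it reads. It then attempts the CAS. When a thread's CAS returns true (meaning it saw the latest state of head), it subtracts 2 from the node's reference count: 1 for its own increment and 1 for the initial value. The subtraction is done in a CAS loop so that it is applied to the latest count. If _ref_count is then 0, old_head can be deleted.

A thread whose CAS fails simply decrements the reference count by 1. If the value before the decrement was 1, no other thread is using the node and old_head can be deleted.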

std::shared_ptr<T> pop() {
    ref_node* old_head = head.load();
    for (;;) {
        if (!old_head) {
            return std::shared_ptr<T>();
        }
        //1 every pop attempt increments the reference count
        ++(old_head->_ref_count);
        //2 if head equals old_head, swap; otherwise another thread has already updated head
        if (head.compare_exchange_strong(old_head, old_head->_next)) {
            auto cur_count = old_head->_ref_count.load();
            int new_count;
            //3 retry in a loop so the reference count is updated safely
            do {
                //4 subtract this thread's increment and the initial count
                new_count = cur_count - 2;
            } while (!old_head->_ref_count.compare_exchange_weak(cur_count,  new_count));
            // data of the head node to return
            std::shared_ptr<T> res;
            //5 swap out the data
            res.swap(old_head->_data);
            //6
            if (old_head->_ref_count == 0) {
                delete old_head;
            }
            return res;
        }
        else {
            //7
            if (old_head->_ref_count.fetch_sub(1) == 1) {
                delete old_head;
            }
        }
    }
}

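When a resource is created, it always has at least one referrer: the code or object that created it. The reference count therefore starts at 1, meaning the resource is already referenced once. If it started at 0, the resource would count as destroyable the moment it was created, which makes no sense.

The code above has an obvious use-after-free problem. Suppose thread 1 and thread 2 are both about to execute the code at 1, but thread 2 gets there first, completes its pop, brings _ref_count to 0, and deletes old_head. When thread 1 then executes the code at 1, it touches freed memory and crashes. When designing lock-free data structures, changes to the pointer and changes to the state behind it must therefore be synchronized.

The root cause of the crash is that old_head is deleted as soon as the reference count reaches 0 while some threads are still accessing it. Deferred deletion is therefore still necessary.

3.2 Decoupling the data node from the reference count

Arguably the best approach today is to reuse node memory to avoid use-after-free and to use a versioned head to solve the ABA problem. In practice the choice depends on the scenario and the architecture. I will introduce some better implementation strategies in later articles.

Continuing with the approach of the author of C++ Concurrency in Action: to get something like deferred deletion, we can decouple the reference count from the pointer it is bound to. The first CAS in pop then only updates the reference count, and the reference count keeps the data node from being touched too early, which avoids the use-after-free.

3.2.1 Defining the structure

The original node is split into two structs: one holds the reference count and a pointer to the node; the other holds the node's data and the address of the next node.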

struct ref_node;
struct node {
    //1 smart pointer to the data
    std::shared_ptr<T>  _data;
    //2 next node (a pointer, because head is a std::atomic<ref_node*> in this version)
    ref_node* _next;
    node(T const& data_) : _data(std::make_shared<T>(data_)), _next(nullptr) {}
};
struct ref_node {
    // reference count
    std::atomic<int> _ref_count;
    node* _node_ptr;
    ref_node( T const & data_):_ref_count(1), _node_ptr(new node(data_)){}
    ref_node():_ref_count(0), _node_ptr(nullptr){}
};

Assume head stores a pointer, declared as

std::atomic<ref_node*> head;

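Note: whether head should be a pointer or a by-value copy in a lock-free C++ design is discussed in the next subsection.

3.2.2 Implementing push

void push(T const& data) {
	auto new_node = new ref_node(data);
	new_node->_node_ptr->_next = head.load();
	while (!head.compare_exchange_weak(new_node->_node_ptr->_next, new_node));
}

3.2.3 Implementing pop

The idea has been explained above, so it is not repeated here.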

std::shared_ptr<T> pop() {
    //0
    ref_node* old_head = head.load();
    for (;;) {
        //1 every pop increments the reference count and publishes it through head
        ref_node* new_head;
        do {
            new_head = old_head;
            //7
            new_head->_ref_count += 1;
        } while (!head.compare_exchange_weak(old_head, new_head));
        //4
        old_head = new_head;
        auto* node_ptr = old_head->_node_ptr;
        if (node_ptr == nullptr) {
            return  std::shared_ptr<T>();
        }
        //2 if head equals old_head, swap; otherwise another thread has already updated head
        if (head.compare_exchange_strong(old_head, node_ptr->_next)) {
            // value to return
            std::shared_ptr<T> res;
            // swap the smart pointers
            //5
            res.swap(node_ptr->_data);
            //6 value of the count before 2 is subtracted
            int increase_count = old_head->_ref_count.fetch_sub(2);
            //3 delete the node if only the current thread still held it
            if (increase_count == 2) {
                delete node_ptr;
            }
            return res;
        }else {
            if (old_head->_ref_count.fetch_sub(1) == 1) {
                delete node_ptr;
            }
        }
    }
}

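On the surface, the reference count does avoid the crash of the version in 3.1.3. But because head is a pointer shared between threads, new problems appear.

Problem 1: suppose threads A and B both reach 2. Thread A's CAS succeeds and updates head. Thread B's CAS fails and B enters the else branch, but the failed CAS has already overwritten old_head with the new head. Because the new head's reference count is 1, B's decrement deletes the data node of the new head too early, and later threads that access it crash.

Problem 2: suppose thread A is about to execute 6 when thread B reaches 7 and increments the reference count (B's CAS finds head != old_head, so its old_head is updated to head). From thread A's point of view the reference count is now 3, so when A continues, fetch_sub(2) returns 3, the check against 2 fails, and A does not free the node. Thread B, for its part, sees the count changed by thread A's decrement; after its CAS at 2 fails it enters the else branch, where the count does not reach the value that triggers the release, so B does not free the node either.

Both problems are caused by sharing the head pointer between threads. To avoid them, we switch to operating on copies: head becomes a ref_node value. The external reference count tracks how many threads are accessing the node; for example, when a node is removed from the top of the stack during pop, other threads may still be accessing it, and the node must not be freed early. Because each thread holds its own copy of head, an internal reference count is added to reconcile how many of those threads have finished with the node. The node is freed only when no thread references its data any more.

3.3 Completing the concurrent stack with external and internal reference counts

The state changes of the two counts during pop are listed below:

Scenario	external_count	internal_count
A thread attempts to operate on the node	incremented, to claim access	unchanged
Node popped successfully	no longer used (the node has left the stack)	increased by `external_count - 2`
Pop attempt failed	unchanged	decremented; the node is freed if it reaches zero
Condition for freeing memory	not directly involved	the node is freed when the count reaches zero

3.3.1 Defining the structure

In the code, the external count is _ref_count in ref_node. A new member _dec_count, the internal count that records how many references have been given up, is added to the node struct.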

    struct node;
    // defined first, because node stores a ref_node by value
    struct ref_node {
        // external reference count
        int _ref_count;
        node* _node_ptr;
        ref_node( T const & data_):_ref_count(1), _node_ptr(new node(data_)){}
        ref_node():_ref_count(0), _node_ptr(nullptr){}
    };
    struct node {
        //1 smart pointer to the data
        std::shared_ptr<T>  _data;
        //2 next node
        ref_node _next;
        // internal count: number of references given up, starts at 0
        std::atomic<int>  _dec_count;
        node(T const& data_) : _data(std::make_shared<T>(data_)), _dec_count(0) {}
    };

3.3.2 Implementing push

void push(T const& data) {
    auto new_node =  ref_node(data);
    new_node._node_ptr->_next = head.load();
     while (!head.compare_exchange_weak(new_node._node_ptr->_next, new_node));
 }

3.3.3 Implementing pop

std::shared_ptr<T> pop() {
    ref_node old_head = head.load();
    for (;;) {
        //1 every pop increments the external count and publishes it through head
        ref_node new_head;
        //2
        do {
            new_head = old_head;
            new_head._ref_count += 1;
        } while (!head.compare_exchange_weak(old_head, new_head));
        old_head = new_head;
        //3
        auto* node_ptr = old_head._node_ptr;
        if (node_ptr == nullptr) {
            return  std::shared_ptr<T>();
        }
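        //4 if head equals old_head, swap; otherwise another thread has already updated head
        if (head.compare_exchange_strong(old_head, node_ptr->_next)) {
            // value to return
            std::shared_ptr<T> res;
            // swap the smart pointers
            res.swap(node_ptr->_data);
            //5 amount to add to the internal count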
            int increase_count = old_head._ref_count - 2;
            //6  
            if (node_ptr->_dec_count.fetch_add(increase_count) == -increase_count) {
                delete node_ptr;
            }
            return res;
        }else {
            //7
            if (node_ptr->_dec_count.fetch_sub(1) == 1) {
                delete node_ptr;
            }
        }
    }
}

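Let us now revisit problems 1 and 2 from 3.2.3.

Problem 1: suppose threads A and B both reach 4. Thread A's CAS succeeds and updates head. Thread B's CAS fails, so old_head is overwritten with the new head and B enters the else branch. B now operates on the internal reference count of the node it captured in node_ptr, whose initial value is 0: fetch_sub returns 0 and _dec_count becomes -1. For thread A, increase_count = _ref_count (3) - 2 = 1, and fetch_add returns -1, which equals -increase_count, so thread A deletes the node.

Problem 2: suppose thread A is about to execute 6, and thread B has executed 2 and incremented the external count, so the count thread A sees is 3. Thread A continues: increase_count = 3 - 2 = 1; _dec_count is 0, which is not equal to -1, so A does not delete the node, and after fetch_add _dec_count is 1. Thread B's CAS at 4 fails and it enters the branch at 7. Because it operates on the internal count rather than the external count, fetch_sub returns 1, the check 1 == 1 holds, and thread B deletes node_ptr.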
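The arithmetic in the two scenarios can be replayed in a single-threaded simulation. This is only a sketch: the function who_frees and its return codes are hypothetical, and the two lambdas mimic steps 6 and 7 of pop for one winner and one loser with an external count of 3.

```cpp
#include <atomic>
#include <cassert>

// Replays the counter updates of one winning and one losing pop attempt.
// Returns 'W' if the winner frees the node, 'L' if the loser does,
// and '?' if the bookkeeping is wrong (nobody or both).
char who_frees(bool loser_first) {
    std::atomic<int> internal{0};  // plays the role of _dec_count
    const int external = 3;        // 1 initial + 2 threads that claimed the node
    bool winner_frees = false;
    bool loser_frees = false;

    auto winner = [&] {            // step 6 in pop
        int increase = external - 2;
        if (internal.fetch_add(increase) == -increase) winner_frees = true;
    };
    auto loser = [&] {             // step 7 in pop
        if (internal.fetch_sub(1) == 1) loser_frees = true;
    };

    if (loser_first) {
        loser();
        winner();
    } else {
        winner();
        loser();
    }

    assert(internal.load() == 0);  // all references accounted for
    if (winner_frees == loser_frees) return '?';
    return winner_frees ? 'W' : 'L';
}
```

Whichever thread performs the last update sees the internal count reach zero and frees the node, so it is deleted exactly once in both orders.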

3.3.4 Complete code

#include <iostream>
#include <thread>
#include <set>
#include <mutex>
#include <atomic>
#include <memory>
#include <cassert>
#include <chrono>

using namespace std;


template<typename T>
class single_ref_stack {
public:
	single_ref_stack(){
	
	}

	~single_ref_stack() {
		while (pop());
	}

	void push(T const& data) {
		auto new_node =  ref_node(data);
		new_node._node_ptr->_next = head.load();
		while (!head.compare_exchange_weak(new_node._node_ptr->_next, new_node));
	}

	std::shared_ptr<T> pop() {
		ref_node old_head = head.load();
		for (;;) {
			ref_node new_head;
			do {
				new_head = old_head;
				new_head._ref_count += 1;
			} while (!head.compare_exchange_weak(old_head, new_head));

			old_head = new_head;

			auto* node_ptr = old_head._node_ptr;
			if (node_ptr == nullptr) {
				return  std::shared_ptr<T>();
			}

			if (head.compare_exchange_strong(old_head, node_ptr->_next)) {
				
				std::shared_ptr<T> res;
				res.swap(node_ptr->_data);

				int increase_count = old_head._ref_count - 2;
				
				if (node_ptr->_dec_count.fetch_add(increase_count) == -increase_count) {
					delete node_ptr;
				}

				return res;
			}else {
				if (node_ptr->_dec_count.fetch_sub(1) == 1) {
					delete node_ptr;
				}
			}
		}
	}

private:
	struct node;
	// defined first, because node stores a ref_node by value
	struct ref_node {
		// external reference count
		int _ref_count;

		node* _node_ptr;
		ref_node( T const & data_):_ref_count(1), _node_ptr(new node(data_)){}

		ref_node():_ref_count(0), _node_ptr(nullptr){}
	};

	struct node {
		std::shared_ptr<T>  _data;
		ref_node _next;
		// internal count, starts at 0
		std::atomic<int>  _dec_count;
		node(T const& data_) : _data(std::make_shared<T>(data_)), _dec_count(0) {}
	};

	std::atomic<ref_node> head;
};


int main() {
	single_ref_stack<int>  single_ref_stack;
	std::set<int>  rmv_set;
	std::mutex set_mtx;

	std::thread t1([&]() {
		for (int i = 0; i < 20000; i++) {
			single_ref_stack.push(i);
			std::cout << "push data " << i << " success!" << std::endl;
			std::this_thread::sleep_for(std::chrono::milliseconds(5));
		}
		});

	std::thread t2([&]() {
		for (int i = 0; i < 10000;) {
			auto head = single_ref_stack.pop();
			if (!head) {
				std::this_thread::sleep_for(std::chrono::milliseconds(10));
				continue;
			}
			std::lock_guard<std::mutex> lock(set_mtx);
			rmv_set.insert(*head);
			std::cout << "pop data " << *head << " success!" << std::endl;
			i++;
		}
		});

	std::thread t3([&]() {
		for (int i = 0; i < 10000;) {
			auto head = single_ref_stack.pop();
			if (!head) {
				std::this_thread::sleep_for(std::chrono::milliseconds(10));
				continue;
			}
			std::lock_guard<std::mutex> lock(set_mtx);
			rmv_set.insert(*head);
			std::cout << "pop data " << *head << " success!" << std::endl;
			i++;
		}
		});

	t1.join();
	t2.join();
	t3.join();

	assert(rmv_set.size() == 20000);
}

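3.4 Summary

At this point we have improved the lock-free stack step by step. Let us recap.

We started with a basic pop built on CAS, but experience tells us it has several shortcomings:

a. The memory of popped nodes is never freed, so memory piles up.

b. Boundary cases are not handled, for example popping from an empty stack or pushing onto an overly large stack.

c. An exception may be thrown while the popped value is copied (discussed in chapter 5); the fix is to use smart pointers.

We then introduced smart pointers to address these points. With smart pointers we have to manage node memory ourselves, and deleting old_head too early causes use-after-free, so we introduced deferred deletion.

Deferred deletion distinguishes the following cases:

a. Only one thread is in pop at that moment: old_head and the pending-deletion list can both be deleted.

b. Several threads are in pop. Thread A has completed its CAS and is about to delete old_head, while thread B has just entered pop and has not yet reached the CAS that reads old_head->next: old_head can be deleted, but the pending-deletion list cannot.

c. Several threads are in pop. Thread A has completed its CAS and is about to delete old_head, while thread B is in the CAS that reads old_head->next: old_head must be put on the pending-deletion list.

We implemented try_reclaim with threads_in_pop, but it still had gaps.

So we introduced hazard pointers, which fix the problem that threads_in_pop cannot stop other threads from accessing an already freed node within the race window. However, each use of a hazard pointer has to obtain a free slot from the list, and the hazard list has to be scanned around each CAS, which inevitably lowers the performance of the lock-free structure. The technique is also covered by an IBM patent, which may restrict its use.

To address the shortcomings of hazard pointers, we turned to reference counting, tying a count to the threads that access a node, which avoids the problem to some extent. A single reference count, however, cannot solve Problem 1 and Problem 2 from 3.2.3.

Finally, we arrived at passing a copy of head that carries an external count, combined with an internal count in the node. The external count keeps the node from being freed too early, but because each thread works on a copy, an internal count is needed to synchronize state among the threads.

Even reference counting has its issues: it causes frequent CAS operations, and deferred reclamation can effectively leak memory in extreme cases (if a thread holds a reference for a long time, for example because it is suspended or has low priority, the memory cannot be reclaimed promptly).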

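3.5 Discussion: in a lock-free design, should head be a pointer type or a copy type?

head usually denotes the front of the data structure, such as the top of a stack or the head of a queue.

A pointer type points to the node directly, for example std::atomic<Node*>. Atomic loads, stores and CAS on a bare pointer are cheap, which suits simple scenarios, and the approach works with any data structure. The drawback is that an address may be reused after the node is freed, which leads to the ABA problem (thread A reads pointer T; thread B deletes T and a new node T' is allocated at the same address; thread A wrongly believes T was never deleted and keeps operating on it, causing errors).

A copy type avoids many of the problems of sharing a raw pointer between threads, but the atomic object becomes wider, and on platforms where is_lock_free() returns false for it the operations are not actually lock-free. Where it is supported, it suits general-purpose lock-free designs.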

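Modern CPUs support atomic operations only up to certain widths:

x86-64: 8/16/32/64-bit atomic operations.

ARM: some architectures rely on LL/SC (Load-Linked/Store-Conditional) instructions to implement wide atomic operations, which can add retry overhead.

128-bit atomic operations: require the CMPXCHG16B instruction on x86-64, and the memory address must be 16-byte aligned.

If the platform supports the width natively (as x86-64 does), performance is close to that of 64-bit operations. If it does not, the implementation may emulate the operation with a lock (such as a mutex), so the operation is no longer lock-free and performance drops sharply. This can be checked at run time with is_lock_free().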
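
As a minimal sketch of that check, the snippet below compares a pointer-type head with a copy-type head at compile time. It assumes C++17 (`is_always_lock_free`); `demo_node`, `counted_ptr` and `describe` are hypothetical names used only for illustration, and the result for the 16-byte struct depends on the compiler, its flags and the target.

```cpp
#include <atomic>
#include <cstdint>

struct demo_node;                 // hypothetical node type, never defined here

struct counted_ptr {              // "copy type": count + pointer, 16 bytes on 64-bit targets
    std::int64_t external_count;
    demo_node*   ptr;
};

// Compile-time answers (C++17). No atomic operation is executed here.
constexpr bool pointer_head_lock_free =
    std::atomic<demo_node*>::is_always_lock_free;
constexpr bool counted_head_lock_free =
    std::atomic<counted_ptr>::is_always_lock_free;

// Turns the answer into a short message for logging.
const char* describe(bool lock_free) {
    return lock_free ? "lock-free" : "may fall back to a lock";
}
```

Printing `describe(counted_head_lock_free)` on the build machine is a cheap way to find out whether the copy-type design is actually lock-free there before committing to it.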

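As a rough estimate in such environments, a pointer-type (64-bit) atomic operation typically completes in 1-2 clock cycles when uncontended (real costs vary with the CPU and with contention).

A copy-type (128-bit) atomic operation on x86-64 costs roughly 10-20 cycles; on ARM64 it needs an LL/SC loop and may retry several times under contention.

Note: on x86-64, 128-bit atomic operations implemented with CMPXCHG16B require explicit 16-byte alignment (some cache-line optimizations also require 16-byte-aligned sub-blocks).

Forcing alignment in C++11 and later: alignas(16)

Dynamic memory: aligned_alloc(16, size)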
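
Both alignment methods can be sketched as follows. This assumes C++17 and a standard library that provides `std::aligned_alloc` (MSVC does not); `aligned_counted_ptr`, `allocate_aligned` and `is_16_byte_aligned` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct demo_node;   // hypothetical node type, never defined here

// alignas(16) forces every object of this type onto a 16-byte boundary.
struct alignas(16) aligned_counted_ptr {
    std::int64_t external_count;
    demo_node*   ptr;
};

static_assert(alignof(aligned_counted_ptr) == 16, "must be 16-byte aligned");

// Returns raw storage for `count` objects, 16-byte aligned; release it with std::free.
aligned_counted_ptr* allocate_aligned(std::size_t count) {
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // which holds here because sizeof(aligned_counted_ptr) is a multiple of 16.
    void* raw = std::aligned_alloc(16, count * sizeof(aligned_counted_ptr));
    return static_cast<aligned_counted_ptr*>(raw);
}

// Helper for checking an address.
bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```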

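The copy type is safer and suits general-purpose lock-free designs, especially where the ABA risk is high.

The pointer type is more efficient, but ABA and memory lifetime must be handled carefully. The actual choice should be based on performance testing and platform characteristics.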

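4. Optimizing the lock-free stack with the memory model

The memory model is covered in detail in "C++ Concurrent Programming - 4. Atomic Operations and the Memory Model", so it is not repeated here.

The basic optimizations come down to two suggestions:

Make pop and push form a release-acquire pair. push (release) publishes elements to other threads, and pop (acquire) reads those elements in time to consume them.
When reclaiming memory or deleting, the data field and the pointer field may have to be handled separately. The usual approach is to swap out node->data first and then delete the node. With multiple threads we must keep deletion consistent, avoiding both double deletion and leaks. The thread that completes the deletion path must publish its changes with release so that other threads synchronize promptly, and threads that lose the race must read the latest values written by the others.

We now optimize the code from 3.3.4.

For the first point:

If the CAS succeeds, we use release so that the update becomes visible to other threads. If it fails, relaxed ordering is enough, because a failed attempt publishes nothing to other threads.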

void push(T const& data) {
        ....
		while (!head.compare_exchange_weak(new_node.ptr->next, new_node, 
			std::memory_order_release, std::memory_order_relaxed));
	}

In pop's increase_head_count function, we need acquire so that changes made by other threads are observed in time.

This makes pop and push a release-acquire pair.

void increase_head_count(counted_node_ptr& old_counter) {
		counted_node_ptr new_counter;
		do {
			new_counter = old_counter;
			++new_counter.external_count;
		} while (!head.compare_exchange_strong(old_counter,  new_counter, 
			std::memory_order_acquire, std::memory_order_relaxed));
		old_counter.external_count = new_counter.external_count;
	}
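
To see the release-acquire pairing on its own, here is a deliberately reduced sketch. It is not the reference-counted stack of this article: it assumes many pushing threads but only one popping thread, which is what makes the plain delete safe, and `mpsc_stack` is a hypothetical name. It needs C++17 for `std::optional`.

```cpp
#include <atomic>
#include <optional>
#include <utility>

// Sketch: many threads may push, but only ONE thread may call pop().
template <typename T>
class mpsc_stack {
    struct node {
        T value;
        node* next;
    };
    std::atomic<node*> head{nullptr};

public:
    mpsc_stack() = default;
    mpsc_stack(const mpsc_stack&) = delete;
    mpsc_stack& operator=(const mpsc_stack&) = delete;
    ~mpsc_stack() { while (pop()) {} }

    void push(const T& value) {
        node* n = new node{value, head.load(std::memory_order_relaxed)};
        // release on success publishes n->value and n->next;
        // relaxed on failure, because a failed attempt publishes nothing.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }

    std::optional<T> pop() {
        // acquire pairs with the release in push, so the node contents are visible.
        node* old_head = head.load(std::memory_order_acquire);
        // A failed CAS reloads old_head, which is dereferenced on the next
        // iteration, so the failure order is acquire as well.
        while (old_head &&
               !head.compare_exchange_weak(old_head, old_head->next,
                                           std::memory_order_acquire,
                                           std::memory_order_acquire)) {}
        if (!old_head) return std::nullopt;
        std::optional<T> result(std::move(old_head->value));
        delete old_head;   // safe only because no other thread pops
        return result;
    }
};
```

Restricting pop to one thread sidesteps the reclamation problem entirely, so the only thing left to look at is which side releases and which side acquires.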

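For the second point:

At mark 1 the operation is retried in a loop. std::memory_order_relaxed guarantees atomicity only, not synchronization. Modifications to an atomic object do eventually become visible to other threads, only at an unspecified time.

At mark 2, reaching this branch means res.swap(ptr->data) has been executed, so this change must be published to the other threads. release ensures that they can observe it, which avoids use-after-free.

At mark 3, a thread in the else-if branch has lost the race and needs to acquire the most recent changes.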

std::shared_ptr<T> pop() {
	for (;;) {
		......
		//1
		if (head.compare_exchange_strong(old_head, ptr->next, std::memory_order_relaxed)) {
			......
			//2
			if (ptr->internal_count.fetch_add(count_increase, std::memory_order_release) == -count_increase) {
				delete ptr;
			}
			.....
		//3
		} else if (ptr->internal_count.fetch_add(-1, std::memory_order_acquire) == 1) {
			delete ptr;
		}
	}
}
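
The "publish with release, observe with acquire before deleting" rule is easier to see on a plain reference count. The sketch below uses the common idiom of a release decrement followed by an acquire fence in the thread that drops the last reference. This is a swapped-in simplification, not the internal_count code above, and `ref_block` and `drop_reference` are hypothetical names.

```cpp
#include <atomic>

// Hypothetical shared block with a plain reference count.
struct ref_block {
    std::atomic<int> refs;
    int payload;
    ref_block(int initial_refs, int value) : refs(initial_refs), payload(value) {}
};

// Drops one reference. Returns true if the caller held the last reference
// and is therefore responsible for deleting the block.
bool drop_reference(ref_block& block) {
    // release: everything this thread wrote to the block happens-before the decrement
    if (block.refs.fetch_sub(1, std::memory_order_release) == 1) {
        // acquire fence: the deleting thread now sees the writes of every
        // thread that dropped its reference earlier
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
```

Only the thread that receives true may delete the block; all the others have merely published their last writes.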

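5. Lock-free queue

5.1 Basic version

We now implement a lock-free queue. It differs from the lock-free stack in that a stack is LIFO (last in, first out), whereas a queue is FIFO (first in, first out).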

#include<atomic>
#include<memory>
template<typename T>
class SinglePopPush
{
private:
    struct node
    {
        std::shared_ptr<T> data;
        node* next;
        node() :
            next(nullptr)
        {}
    };
    std::atomic<node*> head;
    std::atomic<node*> tail;
    node* pop_head()
    {
        node* const old_head = head.load();
        // ⇽-- - 1
        if (old_head == tail.load())   
        {
            return nullptr;
        }
        head.store(old_head->next);
        return old_head;
    }
public:
    SinglePopPush() :
        head(new node), tail(head.load())
    {}
    SinglePopPush(const SinglePopPush& other) = delete;
    SinglePopPush& operator=(const SinglePopPush& other) = delete;
    ~SinglePopPush()
    {
        while (node* const old_head = head.load())
        {
            head.store(old_head->next);
            delete old_head;
        }
    }
    std::shared_ptr<T> pop()
    {
        node* old_head = pop_head();
        if (!old_head)
        {
            return std::shared_ptr<T>();
        }
        // ⇽-- -2
        std::shared_ptr<T> const res(old_head->data);  
            delete old_head;
        return res;
    }
    void push(T new_value)
    {
        std::shared_ptr<T> new_data(std::make_shared<T>(new_value));
        // ⇽-- - 3
        node* p = new node;    
        //⇽-- - 4
        node* const old_tail = tail.load(); 
        //⇽-- - 5
        old_tail->data.swap(new_data);   
        //⇽-- - 6
        old_tail->next = p; 
        //⇽-- - 7
        tail.store(p);    
    }
};

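The queue above only works for SPSC (single producer, single consumer). With more threads it has many problems. For example, if threads A and B call push at the same time and both reach old_tail->data.swap, they race and one value overwrites the other. Concurrent pop calls have the same problem.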

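5.2 External + internal reference counts

The book notes that there is an attractive way to solve the problems above: add an external counter to the tail node, just as for the head node. However, every node in the queue already has an external counter, stored in the next pointer of its predecessor.

If one node is to have two external counters, the reference-counting scheme has to change to avoid deleting the node too early. To meet this requirement, we record the number of external counters in the node structure. Whenever an external counter is destroyed, that number is decremented and the external counter's value is added to the internal counter. For any given node, once the internal counter reaches zero and no external counters remain, the node can be deleted safely.

5.3 Defining the structures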

struct counted_node_ptr
{
    int external_count;
    node* ptr;
};
std::atomic<counted_node_ptr> head;
//⇽--- 1
std::atomic<counted_node_ptr> tail;    
struct node_counter
{
    unsigned internal_count:30;
    //⇽--- 2
    unsigned external_counters:2;   
};
struct node
{
    std::atomic<T*> data;
    //⇽---  3
    std::atomic<node_counter> count;    
    counted_node_ptr next;
    node()
    {
        node_counter new_count;
        new_count.internal_count=0;
        //⇽---  4
        new_count.external_counters=2;    
        count.store(new_count);
        next.ptr=nullptr;
        next.external_count=0;
    }
};

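counted_node_ptr binds a node pointer to an external reference count so that the pointer can be accessed safely.

external_count tracks how many external parties (head, tail, or other threads) hold this pointer.

node_counter uses bit-fields; at 30 + 2 = 32 bits, atomic operations on it stay efficient.

Note that node_counter occupies 30 + 2 = 32 bits (4 bytes), which fits within a single machine word. As mentioned earlier, lock-free operations on anything wider than a machine word depend on architecture support. As long as the structure fits in a single machine word, its atomic operations are much more likely to be lock-free on many hardware platforms.

internal_count: manages the node's internal state (such as the condition for releasing it).

external_counters: records how many counted_node_ptr instances refer to this node.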
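
The size claim can be checked at compile time. The sketch below repeats the article's node_counter and assumes the usual ABI in which both bit-fields share one 32-bit unsigned; `node_counter_is_lock_free` and `make_initial_counter` are hypothetical helpers, and `is_always_lock_free` needs C++17.

```cpp
#include <atomic>

struct node_counter {
    unsigned internal_count : 30;
    unsigned external_counters : 2;
};

// Both bit-fields are expected to share a single unsigned (implementation-defined,
// but true on the mainstream ABIs).
static_assert(sizeof(node_counter) == sizeof(unsigned),
              "node_counter should occupy one 32-bit word");

// True when the target can always update the whole struct without a lock.
constexpr bool node_counter_is_lock_free =
    std::atomic<node_counter>::is_always_lock_free;

// A freshly initialized counter, matching the node constructor in the article.
node_counter make_initial_counter() {
    node_counter c;
    c.internal_count = 0;
    c.external_counters = 2;
    return c;
}
```

Because the two fields fill all 32 bits, the struct has no padding bits that could make a compare-exchange fail on otherwise equal values.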

5.4 Implementing push

void push(T new_value)
{
    std::unique_ptr<T> new_data(new T(new_value));
    counted_node_ptr new_next;
    new_next.ptr=new node;
    new_next.external_count=1;
    counted_node_ptr old_tail=tail.load();
    for(;;)
    {
        // 5
        increase_external_count(tail,old_tail);    
        T* old_data=nullptr;
        // 6
        if(old_tail.ptr->data.compare_exchange_strong(   
           old_data,new_data.get()))
        {
            old_tail.ptr->next=new_next;
            old_tail=tail.exchange(new_next);
            //  7
            free_external_counter(old_tail);    
            new_data.release();
            break;
        }
        old_tail.ptr->release_ref();
    }
}

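First, each node is initialized with internal_count set to zero and external_counters set to 2 (mark 4), because every new node added to the queue is initially referenced both by the tail pointer and by the next pointer of the previous node.

In push, std::unique_ptr<> works together with std::atomic<T*> for the atomic update. Following the Michael-Scott queue design, a new node is allocated as the placeholder for the next insertion. A CAS then tries to store the new data in the data field of the current tail node.

If it succeeds:

Link the new node to the end of the list.
Atomically exchange tail for the new node.
Free the external counter of the old tail.

If it fails:

Release this thread's reference to the node and retry.

5.5 Implementing pop

For pop, the main work is manipulating the external and internal reference counts correctly.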

template<typename T>
class lock_free_queue
{
private:
    struct node
    {
        void release_ref();
        //the rest of node is the same as in listing 7.16
    };
public:
    std::unique_ptr<T> pop()
    {
        // 1
        counted_node_ptr old_head=head.load(std::memory_order_relaxed);    
        for(;;)
        {
            //2
            increase_external_count(head,old_head);    
            node* const ptr=old_head.ptr;
            if(ptr==tail.load().ptr)
            {
                //3
                ptr->release_ref();    
                return std::unique_ptr<T>();
            }
            // 4
            if(head.compare_exchange_strong(old_head,ptr->next))    
            {
                T* const res=ptr->data.exchange(nullptr);
                // 5
                free_external_counter(old_head);   
                return std::unique_ptr<T>(res);
            }
            // 6
            ptr->release_ref();    
        }
    }
};

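Popping a node starts by loading old_head (mark 1). We then enter an infinite loop and increment the external count on the loaded pointer (mark 2). If the head node happens to be the tail node, the queue is empty, so we release the reference (mark 3) and return a null pointer.

Otherwise the queue contains data, and the current thread calls compare_exchange_strong() to claim it (mark 4). The call compares head with old_head as a whole, external count and pointer together. If either member has changed and the comparison fails, the code releases the reference (mark 6) and loops again.

If the compare-exchange succeeds, the current thread has claimed the node's data, so we free the external counter of the popped node (mark 5) and return the data to the caller of pop(). Once both external counters have been freed and the internal count has dropped to 0, the node itself can be deleted. Several helper functions handle the reference counting.

5.6 Decrementing the reference count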

template<typename T>
class lock_free_queue
{
private:
    struct node
    {
        void release_ref()
        {
            node_counter old_counter=
                count.load(std::memory_order_relaxed);
            node_counter new_counter;
            do
            {
                new_counter=old_counter;
                //1
                --new_counter.internal_count;    
            }
            //2
            while(!count.compare_exchange_strong(    
                  old_counter,new_counter,
                  std::memory_order_acquire,std::memory_order_relaxed));
            if(!new_counter.internal_count &&
               !new_counter.external_counters)
            {
                //3
                delete this;    
            }
        }
    };
};

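Although only the bit-field member internal_count changes here (mark 1), the whole counter structure must be updated atomically, so the update uses a compare-exchange in a loop (mark 2).

After internal_count has been decremented, if both counters are zero, the caller of release_ref() held the last reference to the node (the ptr at marks 5 and 6 of the pop listing), and the node should be deleted (mark 3).

5.7 Incrementing the reference count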

template<typename T>
class lock_free_queue
{
private:
    static void increase_external_count(
        std::atomic<counted_node_ptr>& counter,
        counted_node_ptr& old_counter)
    {
        counted_node_ptr new_counter;
        do
        {
            new_counter=old_counter;
            ++new_counter.external_count;
        }
        while(!counter.compare_exchange_strong(
              old_counter,new_counter,
              std::memory_order_acquire,std::memory_order_relaxed));
        old_counter.external_count=new_counter.external_count;
    }
};

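increase_external_count() is now a static member function. The counter to update is no longer a fixed member of the object but an external counter, which is passed in as the first argument.

5.8 Freeing the external counter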

template<typename T>
class lock_free_queue
{
private:
    static void free_external_counter(counted_node_ptr &old_node_ptr)
    {
        node* const ptr=old_node_ptr.ptr;
        int const count_increase=old_node_ptr.external_count-2;
        node_counter old_counter=
            ptr->count.load(std::memory_order_relaxed);
        node_counter new_counter;
        do
        {
            new_counter=old_counter;
            //⇽---  1
            --new_counter.external_counters;  
            //⇽---  2  
            new_counter.internal_count+=count_increase;    
        }
        //⇽---  3
        while(!ptr->count.compare_exchange_strong(    
              old_counter,new_counter,
              std::memory_order_acquire,std::memory_order_relaxed));
        if(!new_counter.internal_count &&
           !new_counter.external_counters)
        {
            //⇽---  4
            delete ptr;    
        }
    }
};

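free_external_counter() is the counterpart of increase_external_count(). It updates both counters with a single compare_exchange_strong() on the whole counter structure (mark 3), much as release_ref() does when it decrements internal_count.

internal_count is adjusted by the external count carried in the pointer (mark 2), and external_counters is decremented at the same time (mark 1). If both values are now zero, nothing refers to the node any more and it can be deleted safely (mark 4).

To avoid a race condition, these updates must be done as a single operation, hence the compare-exchange loop. If the two updates were made independently and two threads called the function at the same time, both might conclude that they were the last one and both delete the node, resulting in undefined behavior.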
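
The life of one node's counters can be replayed single-threaded. The sketch below is a model only: `counter_model` is a hypothetical stand-in for a node that keeps just the counter, and its two member functions mirror release_ref() and free_external_counter() but return whether the caller would delete the node instead of deleting anything.

```cpp
#include <atomic>

struct node_counter {
    unsigned internal_count : 30;
    unsigned external_counters : 2;
};

// Hypothetical stand-in for a queue node: only the counter is modelled.
struct counter_model {
    std::atomic<node_counter> count;

    counter_model() {
        node_counter c;
        c.internal_count = 0;
        c.external_counters = 2;   // referenced by tail and by the predecessor's next
        count.store(c);
    }

    // Mirrors release_ref(): returns true when the caller would delete the node.
    bool release_ref() {
        node_counter old_c = count.load(std::memory_order_relaxed);
        node_counter new_c;
        do {
            new_c = old_c;
            --new_c.internal_count;            // wraps modulo 2^30 when it was 0
        } while (!count.compare_exchange_strong(old_c, new_c,
                     std::memory_order_acquire, std::memory_order_relaxed));
        return !new_c.internal_count && !new_c.external_counters;
    }

    // Mirrors free_external_counter(): external_count comes from the counted pointer.
    bool free_external_counter(int external_count) {
        int const count_increase = external_count - 2;
        node_counter old_c = count.load(std::memory_order_relaxed);
        node_counter new_c;
        do {
            new_c = old_c;
            --new_c.external_counters;
            new_c.internal_count += count_increase;   // both fields change in one CAS
        } while (!count.compare_exchange_strong(old_c, new_c,
                     std::memory_order_acquire, std::memory_order_relaxed));
        return !new_c.internal_count && !new_c.external_counters;
    }
};
```

A typical sequence: the tail moves on and calls free_external_counter(2), which leaves one external counter. Two poppers then raise the head's external count to 3. The one that loses the CAS calls release_ref(), the winner calls free_external_counter(3), and whichever of those two calls comes last returns true.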

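5.9 Optimization

Although the code above works and is free of race conditions, it still has a performance problem. Once a thread has started push() and its compare_exchange_strong() on old_tail.ptr->data has succeeded (mark 6 in the push code), no other thread can make progress in push(). Any other thread trying to push sees the value stored by the first thread rather than nullptr, so its compare_exchange_strong() fails and it can only loop again. This is busy-waiting: it burns CPU cycles without achieving anything and amounts to a lock. The first push() call blocks the other threads until it finishes, so the code is not lock-free. That is not the only problem. When a thread blocks on a mutex, the operating system can give priority to the thread holding the lock so that it finishes sooner. Here that is impossible, and the blocked threads keep consuming CPU cycles until the first push() completes. This leads to the next trick: let the waiting threads help the thread that is in push(), which yields a lock-free queue.

What needs to be done is clear: set the next pointer of the tail node to a new dummy node, and then update the tail pointer. Since all dummy nodes are equivalent, it does not matter whether the dummy node was created by the thread that successfully pushed data or by a thread waiting to push. If the next pointer in a node is made atomic, compare_exchange_strong() can be used to set it. Once next is set, a compare_exchange_weak() loop sets tail while making sure that it still refers to the same original node. If tail has changed, another thread has already updated it, so we stop looping and do not retry.

pop() needs a small change to load the atomic next pointer: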

template<typename T>
class lock_free_queue
{
private:
    struct node
    {
        std::atomic<T*> data;
        std::atomic<node_counter> count;
        //⇽---  1
        std::atomic<counted_node_ptr> next;    
    };
public:
    std::unique_ptr<T> pop()
    {
        counted_node_ptr old_head=head.load(std::memory_order_relaxed);
        for(;;)
        {
            increase_external_count(head,old_head);
            node* const ptr=old_head.ptr;
            if(ptr==tail.load().ptr)
            {
                ptr->release_ref();   // drop the reference taken above, as in the earlier pop()
                return std::unique_ptr<T>();
            }
            //  ⇽---  2
            counted_node_ptr next=ptr->next.load();   
            if(head.compare_exchange_strong(old_head,next))
            {
                T* const res=ptr->data.exchange(nullptr);
                free_external_counter(old_head);
                return std::unique_ptr<T>(res);
            }
            ptr->release_ref();
        }
    }
};

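The change is simple: next is now an atomic variable (mark 1), so the load at mark 2 is atomic. The example uses the default memory_order_seq_cst ordering. Since ptr->next is a std::atomic<counted_node_ptr>, an implicit conversion to counted_node_ptr at mark 2 would also perform an atomic load, so the explicit load() call is not strictly needed. We call it explicitly anyway as a reminder that an explicit memory order should be specified here when optimizing later.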
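
A small sketch of the two ways of reading an atomic, using a hypothetical 8-byte struct `tagged_value` in place of counted_node_ptr:

```cpp
#include <atomic>

struct tagged_value {      // hypothetical small struct, 8 bytes, no padding
    int count;
    int id;
};

// Reads the atomic through its implicit conversion operator,
// which performs a sequentially consistent load.
tagged_value read_implicit(const std::atomic<tagged_value>& a) {
    tagged_value copy = a;
    return copy;
}

// Same result, but the memory order is spelled out and can be tuned later.
tagged_value read_explicit(const std::atomic<tagged_value>& a,
                           std::memory_order order = std::memory_order_seq_cst) {
    return a.load(order);
}
```

The explicit form costs nothing extra today and leaves an obvious place to put a weaker ordering once the code has been shown to work.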

The new version of push() is more complex:

template<typename T>
class lock_free_queue
{
private:
    // ⇽---  1
    void set_new_tail(counted_node_ptr &old_tail,   
                      counted_node_ptr const &new_tail)
    {
        node* const current_tail_ptr=old_tail.ptr;
        // ⇽---  2
        while(!tail.compare_exchange_weak(old_tail,new_tail) &&   
              old_tail.ptr==current_tail_ptr);
        // ⇽---  3
        if(old_tail.ptr==current_tail_ptr)
            //⇽---  4   
            free_external_counter(old_tail);    
        else
            //⇽---  5
            current_tail_ptr->release_ref();    
    }
public:
    void push(T new_value)
    {
        std::unique_ptr<T> new_data(new T(new_value));
        counted_node_ptr new_next;
        new_next.ptr=new node;
        new_next.external_count=1;
        counted_node_ptr old_tail=tail.load();
        for(;;)
        {
            increase_external_count(tail,old_tail);
            T* old_data=nullptr;
            //⇽---  6
            if(old_tail.ptr->data.compare_exchange_strong(    
                   old_data,new_data.get()))
            {
                counted_node_ptr old_next={0};
                //⇽---  7
                if(!old_tail.ptr->next.compare_exchange_strong(    
                       old_next,new_next))
                {
                    //⇽---  8
                    delete new_next.ptr;    
                    new_next=old_next;   // ⇽---  9
                }
                set_new_tail(old_tail, new_next);
                new_data.release();
                break;
            }
            else    // ⇽---  10
            {
                counted_node_ptr old_next={0};
                // ⇽--- 11
                if(old_tail.ptr->next.compare_exchange_strong(    
                       old_next,new_next))
                {
                    // ⇽--- 12
                    old_next=new_next;    
                    // ⇽---  13
                    new_next.ptr=new node;    
                }
                //  ⇽---  14
                set_new_tail(old_tail, old_next);   
            }
        }
    }
};

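Because we do want to set the data pointer at mark 6, and we also need to handle the case where another thread has helped, there is an else branch for that case (mark 10). The new push() first sets the data pointer in the node at mark 6 and then updates the next pointer with compare_exchange_strong() (mark 7), which avoids a loop.

If the exchange fails, another thread has already set the next pointer, so the node allocated at the start of the function is no longer needed and can be deleted (mark 8).

Even though next was set by another thread, the code keeps its value for the tail update that follows (mark 9). The tail update is factored out into set_new_tail() (mark 1), which uses a compare_exchange_weak() loop (mark 2).

If other threads are calling push(), external_count may change, and the loop is there so that such a change is not mistaken for failure. But we must also take care not to replace the value if another thread has successfully changed tail. If the current thread updated tail again, the queue would end up with a loop in it, which would be a serious error.

Consequently, if the compare-exchange fails, the ptr part of the loaded value must be unchanged for the loop to continue. On leaving the loop, if ptr is still the same (mark 3), tail has been set and the old external counter must be freed (mark 4). If ptr has changed, another thread will free the counter, and the current thread just releases the single reference it holds (mark 5).

If several threads call push() at the same time, only one succeeds in setting the data pointer in the loop; the others help the successful thread complete its update. Each thread allocates a new node when it enters push(), and a helping thread first tries to update next to point at that node (mark 11). If that succeeds, the node becomes the new tail node (mark 12), and another node must be allocated for the data that is still waiting to be pushed (mark 13). The code then calls set_new_tail() to try to set the tail (mark 14) before looping again.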
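
The exit condition of the loop in set_new_tail() can be tried out on a reduced model. In the sketch below `counted_ref` is a hypothetical 8-byte stand-in for counted_node_ptr in which `id` plays the role of ptr, and `try_set_new_tail` keeps only the loop and reports which way it ended; the counter bookkeeping is left out.

```cpp
#include <atomic>
#include <cstdint>

// Model of counted_node_ptr: `id` plays the role of ptr (hypothetical simplification).
struct counted_ref {
    std::uint32_t external_count;
    std::uint32_t id;
};

// Returns true if THIS call moved tail to new_tail,
// false if another thread had already moved it.
bool try_set_new_tail(std::atomic<counted_ref>& tail,
                      counted_ref& old_tail,
                      counted_ref new_tail) {
    std::uint32_t const current_id = old_tail.id;
    // A failed CAS reloads old_tail. Retry only while the failure was caused
    // by a changed count on the same node; stop as soon as the node differs.
    while (!tail.compare_exchange_weak(old_tail, new_tail) &&
           old_tail.id == current_id) {}
    return old_tail.id == current_id;
}
```

A caller whose snapshot has a stale count but the right node still succeeds after one reload, while a caller whose node has been replaced leaves tail untouched.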


#include <atomic>
#include <memory>
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>


template<typename T>
class lock_free_queue
{
private:

    struct node_counter
    {
        unsigned internal_count : 30;
        //⇽--- 2
        unsigned external_counters : 2;
    };

    struct node;

    struct counted_node_ptr
    {
        // Left commented out: may risk breaking the trivial-class property
        /*bool operator == (const counted_node_ptr& cnp) {
            return (external_count == cnp.external_count && ptr == cnp.ptr);
        }*/

        // The constructor initializes every member
        counted_node_ptr():external_count(0), ptr(nullptr) {}
        int external_count;
        node* ptr;
    };

    struct node
    {
        std::atomic<T*> data;
        std::atomic<node_counter> count;
        //⇽---  1
        std::atomic<counted_node_ptr> next;

        node(int external_count = 2)
        {
            node_counter new_count;
            new_count.internal_count = 0;
            //⇽---  4
            new_count.external_counters = external_count;
            count.store(new_count);

            counted_node_ptr node_ptr;
			node_ptr.ptr = nullptr;
			node_ptr.external_count = 0;

            next.store(node_ptr);
        }


        void release_ref()
        {
            std::cout << "call release ref " << std::endl;
            node_counter old_counter =
                count.load(std::memory_order_relaxed);
            node_counter new_counter;
            do
            {
                new_counter = old_counter;
                //1
                --new_counter.internal_count;
            }
            //2
            while (!count.compare_exchange_strong(
                old_counter, new_counter,
                std::memory_order_acquire, std::memory_order_relaxed));
            if (!new_counter.internal_count &&
                !new_counter.external_counters)
            {
                //3
                delete this;
                std::cout << "release_ref delete success" << std::endl;
                destruct_count.fetch_add(1);
            }
        }
    };

    std::atomic<counted_node_ptr> head;
    //⇽--- 1
    std::atomic<counted_node_ptr> tail;

    // ⇽---  1
    void set_new_tail(counted_node_ptr& old_tail,
        counted_node_ptr const& new_tail)
    {
        node* const current_tail_ptr = old_tail.ptr;
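        // ⇽---  2  Only one thread can set tail to new_tail here; a failed CAS reloads old_tail with the current tail.
        // The extra && check stops a failed thread from retrying and updating tail a second time.
        // If tail and old_tail differ, either the reference count differs or tail has already moved. If tail has moved, old_tail.ptr differs from current_tail_ptr and the loop can exit.
        // So once tail has been set to new_tail, another thread that retries sees tail != old_tail and gets old_tail overwritten; old_tail no longer matches current_tail_ptr, so there is no point in retrying.
        // Without this check, new_tail would be set repeatedly and extra nodes would be inserted.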
        while (!tail.compare_exchange_weak(old_tail, new_tail) &&
            old_tail.ptr == current_tail_ptr);
        // ⇽---  3
        if (old_tail.ptr == current_tail_ptr)
            //⇽---  4   
            free_external_counter(old_tail);
        else
            //⇽---  5
            current_tail_ptr->release_ref();
    }

    static void free_external_counter(counted_node_ptr& old_node_ptr)
    {
        std::cout << "call  free_external_counter " << std::endl;
        node* const ptr = old_node_ptr.ptr;
        int const count_increase = old_node_ptr.external_count - 2;
        node_counter old_counter =
            ptr->count.load(std::memory_order_relaxed);
        node_counter new_counter;
        do
        {
            new_counter = old_counter;
            //⇽---  1
            --new_counter.external_counters;
            //⇽---  2  
            new_counter.internal_count += count_increase;
        }
        //⇽---  3
        while (!ptr->count.compare_exchange_strong(
            old_counter, new_counter,
            std::memory_order_acquire, std::memory_order_relaxed));
        if (!new_counter.internal_count &&
            !new_counter.external_counters)
        {
            //⇽---  4
            destruct_count.fetch_add(1);
            std::cout << "free_external_counter delete success" << std::endl;
            delete ptr;
        }

    }


    static void increase_external_count(
        std::atomic<counted_node_ptr>& counter,
        counted_node_ptr& old_counter)
    {
        counted_node_ptr new_counter;
        do
        {
            new_counter = old_counter;
            ++new_counter.external_count;
        } while (!counter.compare_exchange_strong(
            old_counter, new_counter,
            std::memory_order_acquire, std::memory_order_relaxed));
        old_counter.external_count = new_counter.external_count;
    }

public:
    lock_free_queue() {
       
		counted_node_ptr new_next;
		new_next.ptr = new node();
		new_next.external_count = 1;
		tail.store(new_next);
		head.store(new_next);
        std::cout << "new_next.ptr is " << new_next.ptr << std::endl;
    }

    ~lock_free_queue() {
        while (pop());
        auto head_counted_node = head.load();
        delete head_counted_node.ptr;
    }

    void push(T new_value)
    {
        std::unique_ptr<T> new_data(new T(new_value));
        counted_node_ptr new_next;
        new_next.ptr = new node;
        new_next.external_count = 1;
        counted_node_ptr old_tail = tail.load();
        for (;;)
        {
            increase_external_count(tail, old_tail);
            T* old_data = nullptr;
            //⇽---  6
            if (old_tail.ptr->data.compare_exchange_strong(
                old_data, new_data.get()))
            {
                counted_node_ptr old_next;
                counted_node_ptr now_next = old_tail.ptr->next.load();
                //⇽---  7 link the new node
                if (!old_tail.ptr->next.compare_exchange_strong(
                    old_next, new_next))
                {
                    //⇽---  8
                    delete new_next.ptr;
                    new_next = old_next;   // ⇽---  9
                }
                set_new_tail(old_tail, new_next);
                new_data.release();
                break;
            }
            else    // ⇽---  10
            {
                counted_node_ptr old_next ;
                // ⇽--- 11
                if (old_tail.ptr->next.compare_exchange_strong(
                    old_next, new_next))
                {
                    // ⇽--- 12
                    old_next = new_next;
                    // ⇽---  13
                    new_next.ptr = new node;
                }
                //  ⇽---  14
                set_new_tail(old_tail, old_next);
            }
        }

        construct_count++;
    }


    std::unique_ptr<T> pop()
    {
        counted_node_ptr old_head = head.load(std::memory_order_relaxed);
            for (;;)
            {
                increase_external_count(head, old_head);
                node* const ptr = old_head.ptr;
                if (ptr == tail.load().ptr)
                {
                    // head == tail means the queue is empty; drop the reference taken above
                    ptr->release_ref();
                    return std::unique_ptr<T>();
                }
                //  ⇽---  2
                counted_node_ptr next = ptr->next.load();
                if (head.compare_exchange_strong(old_head, next))
                {
                    T* const res = ptr->data.exchange(nullptr);
                    free_external_counter(old_head);
                    return std::unique_ptr<T>(res);
                }
                ptr->release_ref();
            }
    }

    static std::atomic<int> destruct_count;
    static std::atomic<int> construct_count;
};

template<typename T>
std::atomic<int> lock_free_queue<T>::destruct_count = 0;

template<typename T>
std::atomic<int> lock_free_queue<T>::construct_count = 0;



#define TESTCOUNT 10

int main() {
    lock_free_queue<int>  que;
    std::thread t1([&]() {
        for (int i = 0; i < TESTCOUNT; i++) {
            que.push(i);
            std::cout << "push data is " << i << std::endl;
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
        });

   

	std::thread t2([&]() {
		for (int i = 0; i < TESTCOUNT;) {
			auto p = que.pop();
			if (p == nullptr) {
				std::this_thread::sleep_for(std::chrono::milliseconds(10));
				continue;
			}
			i++;
			std::cout << "pop data is " << *p << std::endl;
		}
		});

    t1.join();
	t2.join();

    assert(que.destruct_count == TESTCOUNT);
}

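6. Lock-free design principles

6.1 Memory model optimization

By default, when no memory order is specified, the strictest ordering, the sequentially consistent memory_order_seq_cst, is used. It is much easier to analyze and reason about than the other orderings. All examples in this chapter start from std::memory_order_seq_cst and relax the ordering only once the basic operations work. Using another memory order is then an optimization, and it should not be done prematurely. In general, you can only decide which operations can be relaxed once you can see the complete code that operates on the data structure; trying to relax orderings earlier just makes life harder. Matters are complicated by the fact that relaxed code may work when tested and still not be guaranteed correct. Unless you have an algorithm checker that systematically tests every combination of values that threads may observe under the specified orderings (such tools do exist), just running the code is not enough.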

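6.2 Memory reclamation

This chapter presented three techniques for reclaiming memory safely:

Wait until no thread is accessing the data structure, then delete all objects pending deletion.
Use hazard pointers to mark objects that a thread is currently accessing.
Reference-count objects and delete them once no thread refers to them.

Even with all these schemes, problems remain, such as the cost of scanning hazard pointers, and the frequent CAS operations and ABA exposure that come with reference counting. My own suggestion: if memory allows, allocate space up front and reuse it instead of allocating and freeing frequently. Some library implementations will be introduced later.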
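
One way to read that suggestion is a bounded queue over storage that is allocated once and reused. The sketch below is an assumption about what such a design could look like, not code from the book or from any particular library: `spsc_ring` is a hypothetical name, it supports exactly one producer thread and one consumer thread, T must be default-constructible, and it needs C++17 for `std::optional`.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
#include <utility>

// Bounded single-producer/single-consumer queue over preallocated storage.
template <typename T, std::size_t Capacity>
class spsc_ring {
    std::array<T, Capacity + 1> slots{};   // one slot stays empty to tell full from empty
    std::atomic<std::size_t> head{0};      // next slot to read  (written by the consumer)
    std::atomic<std::size_t> tail{0};      // next slot to write (written by the producer)

    static std::size_t next(std::size_t i) { return (i + 1) % (Capacity + 1); }

public:
    bool try_push(const T& value) {        // producer thread only
        std::size_t const t = tail.load(std::memory_order_relaxed);
        if (next(t) == head.load(std::memory_order_acquire)) return false;   // full
        slots[t] = value;
        tail.store(next(t), std::memory_order_release);   // publish the filled slot
        return true;
    }

    std::optional<T> try_pop() {           // consumer thread only
        std::size_t const h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return std::nullopt;  // empty
        std::optional<T> result(std::move(slots[h]));
        head.store(next(h), std::memory_order_release);   // hand the slot back
        return result;
    }
};
```

No node is ever allocated or freed after construction, so the reclamation and ABA questions discussed above do not arise; the price is a fixed capacity and the single-producer, single-consumer restriction.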

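6.3 Avoiding the ABA problem

Compare-and-exchange based algorithms must watch out for the ABA problem. It goes like this:

Thread 1 reads atomic variable x and finds that its value is A.
Thread 1 performs some operation based on this value, such as dereferencing it (if it is a pointer) or doing a lookup.
The operating system suspends thread 1.
Another thread operates on x and changes its value to B.
A thread then changes the data associated with A (which thread 1 still holds) so that it is no longer valid. This may be as drastic as freeing the pointed-to memory, or merely a change to an associated value.
A thread then changes x back to A. If A is a pointer, it may now point to a new object that merely shares the address of the old one.
Thread 1 resumes and performs a compare-exchange on x, comparing against A. The compare-exchange succeeds (because the value is A again), but this is the wrong A. The data read in step 2 is no longer valid, yet thread 1 has no way of knowing this and goes on to corrupt the data structure.

For example: thread 1 calls pop, loads A and is about to CAS head to B (but has not done so yet). The operating system schedules thread 2, which calls pop, loads A and completes the CAS to B. While thread 1 is still suspended, thread 2 calls pop again, loads B and completes the CAS to C. The operating system then schedules thread 3, which calls push and pushes A back as the head node. Thread 1 resumes and runs its CAS: the A it loaded earlier equals the current head A, so it assumes that nobody has touched the head and sets head to B, although B has in fact already been freed.

Our examples avoid the ABA problem because each push constructs a smart pointer at a different address, so Aold->next and Anew->next always differ, the CAS check fails, and no harm is done.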
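
A common way to make the stale A detectable is to attach a version tag that changes on every successful update, which is the same idea as the external count in the copy-type head. The sketch below packs a 32-bit version and a 32-bit node index into one 64-bit word. The layout and the names `pack`, `index_of`, `version_of` and `update_head` are hypothetical, and the index stands in for a node pointer.

```cpp
#include <atomic>
#include <cstdint>

// Head word = 32-bit version tag (high half) + 32-bit node index (low half).
constexpr std::uint64_t pack(std::uint32_t version, std::uint32_t index) {
    return (static_cast<std::uint64_t>(version) << 32) | index;
}
constexpr std::uint32_t index_of(std::uint64_t word) {
    return static_cast<std::uint32_t>(word & 0xFFFFFFFFu);
}
constexpr std::uint32_t version_of(std::uint64_t word) {
    return static_cast<std::uint32_t>(word >> 32);
}

// Installs new_index only if head is still exactly `expected` (index AND version).
// Every successful update bumps the version, so a recycled index cannot
// match a stale snapshot.
bool update_head(std::atomic<std::uint64_t>& head,
                 std::uint64_t expected,
                 std::uint32_t new_index) {
    std::uint64_t const desired = pack(version_of(expected) + 1, new_index);
    return head.compare_exchange_strong(expected, desired);
}
```

Replaying the example above: thread 1 takes a snapshot while head is A, other threads change head to B, then C, then back to A. The index matches the snapshot again, but the version is now 3 instead of 0, so thread 1's update_head() fails instead of installing a dangling B.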

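6.4 Avoiding use-after-free

A common source of use-after-free is the evaluation of the last argument of a CAS. In section 1.4.2 (implementing try_reclaim), for example, that argument is read from the node even when the CAS ends up not updating anything.

6.5 Avoiding busy-waiting

In the final queue example we saw that a thread in push had to wait for another push to finish. The waiting thread is left spinning in a busy-wait loop: each failed attempt sends it round the loop again, wasting CPU cycles. When the busy-wait finally ends, the effect is the same as a blocking operation being released, just like with a mutex. By changing the algorithm so that a waiting thread carries out the unfinished steps of the thread it is waiting for, the busy-waiting thread is no longer blocked. In the queue example this required turning a data member into an atomic variable and setting it with compare-exchange operations; in more complex data structures it may require more extensive changes.

Note: for both the lock-free stack and the lock-free queue, the code above only follows and explains the book author's approach and is still a long way from production use. In later chapters I will present classic lock-free algorithms used in real projects.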