Glossary of System Design Basics

Load Balancing

We’ll cover the following

  • Benefits of Load Balancing
  • Load Balancing Algorithms
  • Redundant Load Balancers

Load Balancer (LB) is another critical component of any distributed system. It helps to spread the traffic across a cluster of servers to improve responsiveness and availability of applications, websites, or databases. The LB also keeps track of the status of all the resources while distributing requests. If a server is not available to take new requests, is not responding, or has an elevated error rate, the LB will stop sending traffic to it.

Typically, a load balancer sits between the client and the server, accepting incoming network and application traffic and distributing it across multiple backend servers using various algorithms. By balancing application requests across multiple servers, a load balancer reduces individual server load and prevents any one application server from becoming a single point of failure, thus improving overall application availability and responsiveness.

To utilize full scalability and redundancy, we can try to balance the load at each layer of the system. We can add LBs at three places:

  • Between the user and the web server
  • Between web servers and an internal platform layer, like application servers or cache servers
  • Between the internal platform layer and the database.

Benefits of Load Balancing

Users experience faster, uninterrupted service. Users won’t have to wait for a single struggling server to finish its previous tasks. Instead, their requests are immediately passed on to a more readily available resource.

Service providers experience less downtime and higher throughput. Even a full server failure won’t affect the end user experience as the load balancer will simply route around it to a healthy server.

Load balancing makes it easier for system administrators to handle incoming requests while decreasing wait time for users.

Smart load balancers provide benefits like predictive analytics that determine traffic bottlenecks before they happen. As a result, the smart load balancer gives an organization actionable insights. These are key to automation and can help drive business decisions.

System administrators experience fewer failed or stressed components. Instead of a single device performing a lot of work, load balancing has several devices perform a little bit of work.

Load Balancing Algorithms

How does the load balancer choose the backend server?
Load balancers consider two factors before forwarding a request to a backend server. They will first ensure that the server they choose is actually responding appropriately to requests and then use a pre-configured algorithm to select one from the set of healthy servers. We will discuss these algorithms shortly.

Health Checks - Load balancers should only forward traffic to “healthy” backend servers. To monitor the health of a backend server, “health checks” regularly attempt to connect to backend servers to ensure that servers are listening. If a server fails a health check, it is automatically removed from the pool, and traffic will not be forwarded to it until it responds to the health checks again.

There is a variety of load balancing methods, which use different algorithms for different needs.

Least Connection Method — This method directs traffic to the server with the fewest active connections. This approach is quite useful when there are a large number of persistent client connections which are unevenly distributed between the servers.
Least Response Time Method — This algorithm directs traffic to the server with the fewest active connections and the lowest average response time.
Least Bandwidth Method - This method selects the server that is currently serving the least amount of traffic measured in megabits per second (Mbps).
Round Robin Method — This method cycles through a list of servers and sends each new request to the next server. When it reaches the end of the list, it starts over at the beginning. It is most useful when the servers are of equal specification and there are not many persistent connections.
Weighted Round Robin Method — The weighted round-robin scheduling is designed to better handle servers with different processing capacities. Each server is assigned a weight (an integer value that indicates its processing capacity). Servers with higher weights receive new connections before those with lower weights, and servers with higher weights get more connections than those with lower weights.
IP Hash — Under this method, a hash of the IP address of the client is calculated to redirect the request to a server.
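
A minimal sketch of how a balancer might implement a few of these selection strategies is shown below. The `Server` fields, names, and weights are hypothetical; a real load balancer would feed these functions with live connection counts and health-check results rather than static values.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    weight: int = 1               # used by weighted round robin
    active_connections: int = 0   # used by least connections

servers = [Server("app-1", weight=3), Server("app-2"), Server("app-3")]

# Round Robin: cycle through the servers in order, wrapping at the end.
_rr = itertools.cycle(servers)
def round_robin() -> Server:
    return next(_rr)

# Weighted Round Robin: a server appears in the rotation in proportion to its weight.
_wrr = itertools.cycle([s for s in servers for _ in range(s.weight)])
def weighted_round_robin() -> Server:
    return next(_wrr)

# Least Connection: pick the server with the fewest active connections.
def least_connections() -> Server:
    return min(servers, key=lambda s: s.active_connections)

# IP Hash: the same client IP is always routed to the same server.
def ip_hash(client_ip: str) -> Server:
    return servers[hash(client_ip) % len(servers)]
```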

Redundant Load Balancers

The load balancer can be a single point of failure; to overcome this, a second load balancer can be connected to the first to form a cluster. Each LB monitors the health of the other and, since both of them are equally capable of serving traffic and failure detection, in the event the main load balancer fails, the second load balancer takes over.
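
As a conceptual sketch of this active/passive setup, the standby load balancer below health-checks the primary and takes over after a few consecutive failures. The address, threshold, and `claim_virtual_ip` step are assumptions standing in for whatever heartbeat and failover mechanism (e.g., VRRP or keepalived) a real deployment would use.

```python
import socket
import time

PRIMARY_HEALTH_ADDR = ("10.0.0.1", 80)   # hypothetical health-check address of the primary LB
FAILURE_THRESHOLD = 3                    # consecutive failed checks before failing over
CHECK_INTERVAL = 1.0                     # seconds between checks

def check_primary() -> bool:
    """Return True if the primary LB accepts a TCP connection on its health port."""
    try:
        with socket.create_connection(PRIMARY_HEALTH_ADDR, timeout=0.5):
            return True
    except OSError:
        return False

def claim_virtual_ip() -> None:
    """Placeholder: take over the shared virtual IP so clients reach this LB instead."""
    print("failover: standby load balancer is now active")

def standby_loop() -> None:
    failures = 0
    while True:
        failures = 0 if check_primary() else failures + 1
        if failures >= FAILURE_THRESHOLD:
            claim_virtual_ip()
            break
        time.sleep(CHECK_INTERVAL)
```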

Caching

We’ll cover the following

  • Application server cache
  • Content Delivery (or Distribution) Network (CDN)
  • Cache Invalidation
  • Cache eviction policies

Load balancing helps you scale horizontally across an ever-increasing number of servers, but caching will enable you to make vastly better use of the resources you already have as well as making otherwise unattainable product requirements feasible. Caches take advantage of the locality of reference principle: recently requested data is likely to be requested again. They are used in almost every computing layer: hardware, operating systems, web browsers, web applications, and more. A cache is like short-term memory: it has a limited amount of space, but is typically faster than the original data source and contains the most recently accessed items. Caches can exist at all levels in architecture, but are often found at the level nearest to the front end, where they are implemented to return data quickly without taxing downstream levels.

Application server cache

Placing a cache directly on a request layer node enables the local storage of response data. Each time a request is made to the service, the node will quickly return locally cached data if it exists. If it is not in the cache, the requesting node will fetch the data from the disk. The cache on one request layer node could also be located both in memory (which is very fast) and on the node’s local disk (faster than going to network storage).

What happens when you expand this to many nodes? If the request layer is expanded to multiple nodes, it’s still quite possible to have each node host its own cache. However, if your load balancer randomly distributes requests across the nodes, the same request will go to different nodes, thus increasing cache misses. Two choices for overcoming this hurdle are global caches and distributed caches.
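
A minimal read-through cache for a single request-layer node might look like the sketch below; `fetch_from_backing_store` is a stand-in for whatever slower source (local disk, network storage, or the database) the node falls back to on a miss.

```python
class NodeCache:
    """Simple in-memory cache local to one request-layer node."""

    def __init__(self, backing_fetch):
        self._store = {}
        self._fetch = backing_fetch    # called only on a cache miss

    def get(self, key):
        if key in self._store:         # hit: serve from memory
            return self._store[key]
        value = self._fetch(key)       # miss: go to the slower backing store
        self._store[key] = value
        return value

def fetch_from_backing_store(key):
    # Placeholder for reading from disk, network storage, or the database.
    return f"value-for-{key}"

cache = NodeCache(fetch_from_backing_store)
cache.get("user:42")   # miss: fetched from the backing store and cached
cache.get("user:42")   # hit: served from the node's local memory
```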

Content Delivery (or Distribution) Network (CDN)

CDNs are a kind of cache that comes into play for sites serving large amounts of static media. In a typical CDN setup, a request will first ask the CDN for a piece of static media; the CDN will serve that content if it has it locally available. If it isn’t available, the CDN will query the back-end servers for the file, cache it locally, and serve it to the requesting user.

If the system we are building is not large enough to have its own CDN, we can ease a future transition by serving the static media off a separate subdomain (e.g., static.yourservice.com) using a lightweight HTTP server like Nginx, and cut over the DNS from our servers to a CDN later.

Cache Invalidation

While caching is fantastic, it requires some maintenance to keep the cache coherent with the source of truth (e.g., database). If the data is modified in the database, it should be invalidated in the cache; if not, this can cause inconsistent application behavior.

Solving this problem is known as cache invalidation; there are three main schemes that are used:

Write-through cache: Under this scheme, data is written into the cache and the corresponding database simultaneously. The cached data allows for fast retrieval and, since the same data gets written in the permanent storage, we will have complete data consistency between the cache and the storage. Also, this scheme ensures that nothing will get lost in case of a crash, power failure, or other system disruptions.

Although write-through minimizes the risk of data loss, every write operation must be done twice before returning success to the client, so this scheme has the disadvantage of higher latency for write operations.

Write-around cache: This technique is similar to write-through cache, but data is written directly to permanent storage, bypassing the cache. This can reduce the cache being flooded with write operations that will not subsequently be re-read, but has the disadvantage that a read request for recently written data will create a “cache miss” and must be read from slower back-end storage and experience higher latency.

Write-back cache: Under this scheme, data is written to cache alone, and completion is immediately confirmed to the client. The write to the permanent storage is done after specified intervals or under certain conditions. This results in low-latency and high-throughput for write-intensive applications; however, this speed comes with the risk of data loss in case of a crash or other adverse event because the only copy of the written data is in the cache.
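
To make the trade-offs concrete, here is a toy sketch of the three schemes. The dictionaries stand in for the cache and the permanent storage; the only point is where and when each scheme writes before acknowledging the client.

```python
cache, storage = {}, {}
dirty_keys = set()   # written to the cache but not yet persisted (write-back only)

def write_through(key, value):
    # Write to the cache and to permanent storage before acknowledging:
    # full consistency and no data-loss window, but every write pays double latency.
    cache[key] = value
    storage[key] = value

def write_around(key, value):
    # Write only to permanent storage; the next read of this key misses the cache.
    storage[key] = value

def write_back(key, value):
    # Acknowledge after writing to the cache alone; persist later via flush().
    cache[key] = value
    dirty_keys.add(key)          # at risk of loss until flushed

def flush():
    # Persist dirty entries, e.g. on a timer or when entries are evicted.
    for key in list(dirty_keys):
        storage[key] = cache[key]
        dirty_keys.discard(key)
```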

Cache eviction policies

Following are some of the most common cache eviction policies:

First In First Out (FIFO): The cache evicts the first block accessed first without any regard to how often or how many times it was accessed before.
Last In First Out (LIFO): The cache evicts the block accessed most recently first without any regard to how often or how many times it was accessed before.
Least Recently Used (LRU): Discards the least recently used items first.
Most Recently Used (MRU): Discards, in contrast to LRU, the most recently used items first.
Least Frequently Used (LFU): Counts how often an item is needed. Those that are used least often are discarded first.
Random Replacement (RR): Randomly selects a candidate item and discards it to make space when necessary.
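
As an example of one of these policies, an LRU cache with a fixed capacity can be sketched with Python's `OrderedDict`, which keeps insertion order and lets us move an entry to the "most recently used" end on every access.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)            # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)     # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes the most recently used
cache.put("c", 3)    # evicts "b"
```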

Data Partitioning

We’ll cover the following

  1. Partitioning Methods
  2. Partitioning Criteria
  3. Common Problems of Data Partitioning

Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is the process of splitting up a DB/table across multiple machines to improve the manageability, performance, availability, and load balancing of an application. The justification for data partitioning is that, after a certain scale point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow it vertically by adding beefier servers.

1. Partitioning Methods

There are many different schemes one could use to decide how to break up an application database into multiple smaller DBs. Below are three of the most popular schemes used by various large scale applications.

a. Horizontal partitioning: In this scheme, we put different rows into different tables. For example, if we are storing different places in a table, we can decide that locations with ZIP codes less than 10000 are stored in one table and places with ZIP codes greater than 10000 are stored in a separate table. This is also called range-based partitioning, as we are storing different ranges of data in separate tables. Horizontal partitioning is also called Data Sharding.

The key problem with this approach is that if the value whose range is used for partitioning isn't chosen carefully, the partitioning scheme will lead to unbalanced servers. In the previous example, splitting locations based on their ZIP codes assumes that places will be evenly distributed across the different ZIP codes. This assumption is not valid, as there will be a lot more places in a densely populated area like Manhattan than in its suburbs.
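
A sketch of this ZIP-code example, with a routing layer picking the shard from the range a key falls into (the ranges and shard names here are made up): nothing forces these ranges to hold similar amounts of data, which is exactly the imbalance problem described above.

```python
# (lower bound inclusive, upper bound exclusive, shard) -- hypothetical ranges
ZIP_RANGES = [
    (0,     10000,  "places_db_1"),
    (10000, 50000,  "places_db_2"),
    (50000, 100000, "places_db_3"),
]

def shard_for_zip(zip_code: int) -> str:
    for low, high, shard in ZIP_RANGES:
        if low <= zip_code < high:
            return shard
    raise ValueError(f"no shard configured for ZIP {zip_code}")

shard_for_zip(10016)   # a Manhattan ZIP -> "places_db_2", likely a hot shard
```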

b. Vertical Partitioning: In this scheme, we divide our data to store tables related to a specific feature in their own server. For example, if we are building Instagram like application - where we need to store data related to users, photos they upload, and people they follow - we can decide to place user profile information on one DB server, friend lists on another, and photos on a third server.

Vertical partitioning is straightforward to implement and has a low impact on the application. The main problem with this approach is that if our application experiences additional growth, then it may be necessary to further partition a feature specific DB across various servers (e.g. it would not be possible for a single server to handle all the metadata queries for 10 billion photos by 140 million users).

c. Directory Based Partitioning: A loosely coupled approach to work around issues mentioned in the above schemes is to create a lookup service which knows your current partitioning scheme and abstracts it away from the DB access code. So, to find out where a particular data entity resides, we query the directory server that holds the mapping between each tuple key to its DB server. This loosely coupled approach means we can perform tasks like adding servers to the DB pool or changing our partitioning scheme without having an impact on the application.
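
A toy version of such a lookup service is sketched below; in practice the directory itself would be a replicated service or table, but the application-facing idea is just `lookup(key) -> db_server`.

```python
class PartitionDirectory:
    """Toy lookup service mapping each tuple key to the DB server that holds it."""

    def __init__(self):
        self._mapping = {}            # key -> DB server name

    def assign(self, key, db_server):
        self._mapping[key] = db_server

    def lookup(self, key):
        return self._mapping[key]

    def move(self, key, new_db_server):
        # Rebalancing: after the data is copied, only this directory entry changes;
        # callers keep using lookup() and are unaffected by the move.
        self._mapping[key] = new_db_server

directory = PartitionDirectory()
directory.assign("user:42", "db-1")
directory.lookup("user:42")          # -> "db-1"
directory.move("user:42", "db-7")    # shard moved; application code is unchanged
```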

2. Partitioning Criteria

a. Key or Hash-based partitioning: Under this scheme, we apply a hash function to some key attributes of the entity we are storing; that yields the partition number. For example, if we have 100 DB servers and our ID is a numeric value that gets incremented by one each time a new record is inserted. In this example, the hash function could be ‘ID % 100’, which will give us the server number where we can store/read that record. This approach should ensure a uniform allocation of data among servers. The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service. A workaround for this problem is to use Consistent Hashing.

b. List partitioning: In this scheme, each partition is assigned a list of values, so whenever we want to insert a new record, we will see which partition contains our key and then store it there. For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries.

c. Round-robin partitioning: This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).

d. Composite partitioning: Under this scheme, we combine any of the above partitioning schemes to devise a new scheme. For example, first applying a list partitioning scheme and then a hash based partitioning. Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed.
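
Since both the hash-based scheme and composite partitioning point to consistent hashing as the way to avoid re-hashing everything when servers change, here is a minimal consistent-hash ring (one point per server, no virtual nodes) as an illustration; a production implementation would add virtual nodes for better balance.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        self._ring = sorted((_hash(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        points = [p for p, _ in self._ring]
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)  # first point clockwise
        return self._ring[idx][1]

    def add_server(self, server: str):
        # Only the keys between the new point and its predecessor move;
        # every other key keeps its current server.
        bisect.insort(self._ring, (_hash(server), server))

ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
ring.server_for("user:42")
ring.add_server("db-4")     # most keys stay on their original server
ring.server_for("user:42")
```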

3. Common Problems of Data Partitioning

On a partitioned database, there are certain extra constraints on the different operations that can be performed. Most of these constraints are due to the fact that operations across multiple tables or multiple rows in the same table will no longer run on the same server. Below are some of the constraints and additional complexities introduced by partitioning:

a. Joins and Denormalization: Performing joins on a database which is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database partitions. Such joins will not be performance efficient since data has to be compiled from multiple servers. A common workaround for this problem is to denormalize the database so that queries that previously required joins can be performed from a single table. Of course, the service now has to deal with all the perils of denormalization such as data inconsistency.

b. Referential integrity: As we saw that performing a cross-partition query on a partitioned database is not feasible, similarly, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult.

Most RDBMS do not support foreign key constraints across databases on different database servers, which means that applications requiring referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.

c. Rebalancing: There could be many reasons we have to change our partitioning scheme:

The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition.
There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos.
In such cases, either we have to create more DB partitions or have to rebalance existing partitions, which means the partitioning scheme changed and all existing data moved to new locations. Doing this without incurring downtime is extremely difficult. Using a scheme like directory based partitioning does make rebalancing a more palatable experience at the cost of increasing the complexity of the system and creating a new single point of failure (i.e. the lookup service/database).

Proxies

  • Proxy Server Types
  • Open Proxy
  • Reverse Proxy

A proxy server is an intermediate server between the client and the back-end server. Clients connect to proxy servers to make a request for a service like a web page, file, connection, etc. In short, a proxy server is a piece of software or hardware that acts as an intermediary for requests from clients seeking resources from other servers.

Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compressing a resource). Another advantage of a proxy server is that its cache can serve a lot of requests. If multiple clients access a particular resource, the proxy server can cache it and serve it to all the clients without going to the remote server.

Proxy Server Types

Proxies can reside on the client’s local server or anywhere between the client and the remote servers. Here are a few famous types of proxy servers:

Open Proxy

An open proxy is a proxy server that is accessible by any Internet user. Generally, a proxy server only allows users within a network group (i.e. a closed proxy) to store and forward Internet services such as DNS or web pages to reduce and control the bandwidth used by the group. With an open proxy, however, any user on the Internet is able to use this forwarding service. There are two famous open proxy types:

Anonymous Proxy - This proxy reveals its identity as a server but does not disclose the initial IP address. Though this proxy server can be discovered easily, it can be beneficial for some users as it hides their IP address.
Transparent Proxy - This proxy server again identifies itself, and with the support of HTTP headers, the first IP address can be viewed. The main benefit of using this sort of server is its ability to cache the websites.

Reverse Proxy

A reverse proxy retrieves resources on behalf of a client from one or more servers. These resources are then returned to the client, appearing as if they originated from the proxy server itself.

SQL vs. NoSQL

In the world of databases, there are two main types of solutions: SQL and NoSQL (or relational databases and non-relational databases). Both of them differ in the way they were built, the kind of information they store, and the storage method they use.

Relational databases are structured and have predefined schemas like phone books that store phone numbers and addresses. Non-relational databases are unstructured, distributed, and have a dynamic schema like file folders that hold everything from a person’s address and phone number to their Facebook ‘likes’ and online shopping preferences.

SQL

Relational databases store data in rows and columns. Each row contains all the information about one entity and each column contains all the separate data points. Some of the most popular relational databases are MySQL, Oracle, MS SQL Server, SQLite, Postgres, and MariaDB.

NoSQL

Following are the most common types of NoSQL:

Key-Value Stores: Data is stored in an array of key-value pairs. The ‘key’ is an attribute name which is linked to a ‘value’. Well-known key-value stores include Redis, Voldemort, and Dynamo.

Document Databases: In these databases, data is stored in documents (instead of rows and columns in a table) and these documents are grouped together in collections. Each document can have an entirely different structure. Document databases include CouchDB and MongoDB.

Wide-Column Databases: Instead of ‘tables,’ in columnar databases we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase.

Graph Databases: These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph databases include Neo4J and InfiniteGraph.

High level differences between SQL and NoSQL

Storage: SQL stores data in tables where each row represents an entity and each column represents a data point about that entity; for example, if we are storing a car entity in a table, different columns could be ‘Color’, ‘Make’, ‘Model’, and so on.

NoSQL databases have different data storage models. The main ones are key-value, document, graph, and columnar. We will discuss differences between these databases below.

Schema: In SQL, each record conforms to a fixed schema, meaning the columns must be decided and chosen before data entry and each row must have data for each column. The schema can be altered later, but it involves modifying the whole database and going offline.

In NoSQL, schemas are dynamic. Columns can be added on the fly and each ‘row’ (or equivalent) doesn’t have to contain data for each ‘column.’

Querying: SQL databases use SQL (structured query language) for defining and manipulating the data, which is very powerful. In a NoSQL database, queries are focused on a collection of documents. Sometimes it is also called UnQL (Unstructured Query Language). Different databases have different syntax for using UnQL.

Scalability: In most common situations, SQL databases are vertically scalable, i.e., by increasing the horsepower (higher Memory, CPU, etc.) of the hardware, which can get very expensive. It is possible to scale a relational database across multiple servers, but this is a challenging and time-consuming process.

On the other hand, NoSQL databases are horizontally scalable, meaning we can add more servers easily in our NoSQL database infrastructure to handle a lot of traffic. Any cheap commodity hardware or cloud instances can host NoSQL databases, thus making it a lot more cost-effective than vertical scaling. A lot of NoSQL technologies also distribute data across servers automatically.

Reliability or ACID Compliancy (Atomicity, Consistency, Isolation, Durability): The vast majority of relational databases are ACID compliant. So, when it comes to data reliability and safe guarantee of performing transactions, SQL databases are still the better bet.

Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability.

SQL vs. NoSQL - Which one to use?

When it comes to database technology, there’s no one-size-fits-all solution. That’s why many businesses rely on both relational and non-relational databases for different needs. Even as NoSQL databases are gaining popularity for their speed and scalability, there are still situations where a highly structured SQL database may perform better; choosing the right technology hinges on the use case.

Reasons to use SQL database

Here are a few reasons to choose a SQL database:

We need to ensure ACID compliance. ACID compliance reduces anomalies and protects the integrity of your database by prescribing exactly how transactions interact with the database. Generally, NoSQL databases sacrifice ACID compliance for scalability and processing speed, but for many e-commerce and financial applications, an ACID-compliant database remains the preferred option.
Your data is structured and unchanging. If your business is not experiencing massive growth that would require more servers and if you’re only working with data that is consistent, then there may be no reason to use a system designed to support a variety of data types and high traffic volume.

Reasons to use NoSQL database

When all the other components of our application are fast and seamless, NoSQL databases prevent data from being the bottleneck. Big data is contributing to a large success for NoSQL databases, mainly because it handles data differently than the traditional relational databases. A few popular examples of NoSQL databases are MongoDB, CouchDB, Cassandra, and HBase.

Storing large volumes of data that often have little to no structure. A NoSQL database sets no limits on the types of data we can store together and allows us to add new types as the need changes. With document-based databases, you can store data in one place without having to define what “types” of data those are in advance.
Making the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution but requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software, and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box, without a lot of headaches.
Rapid development. NoSQL is extremely useful for rapid development as it doesn’t need to be prepped ahead of time. If you’re working on quick iterations of your system which require making frequent updates to the data structure without a lot of downtime between versions, a relational database will slow you down.
