I'm trying to understand what shards and replicas are in Elasticsearch, but I haven't managed to figure it out. If I download Elasticsearch and run the startup script, then as far as I know I have started a cluster with a single node. Now this node (my PC) has 5 shards (?) and some replicas (?).

What are they? Do I have 5 duplicate copies of my index? If so, why? I could use some explanation.


Current answer

I will explain this using a real-world scenario. Imagine you are running an e-commerce website. As you become more popular, more sellers and products are added to your website. At some point you will realize that the number of products you need to index has grown too large to fit on the hard disk of a single node. Even if it does fit on the disk, searching through all the documents on one machine is extremely slow, and a single index on one node does not take advantage of the distributed cluster configuration on which Elasticsearch works.

So Elasticsearch splits the documents of the index across multiple nodes in the cluster. Each such split of the index is called a shard, and a node carrying a shard holds only a subset of the documents. Suppose you have 100 products and 5 shards: each shard will hold roughly 20 products. This sharding of the data is what makes low-latency search possible in Elasticsearch: the search runs in parallel on multiple nodes, and the results are aggregated and returned. However, shards by themselves do not provide fault tolerance. If a node containing a shard goes down, that part of the data becomes unavailable and the cluster health is no longer green.
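
If you want to see this distribution for yourself, the cat shards API lists every shard, the node it lives on, and its document count (sampleindex is just an assumed index name, matching the example later in this answer):

# Which shard is on which node, and how many documents each holds
curl -XGET 'localhost:9200/_cat/shards/sampleindex?v'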

To increase fault tolerance, replicas come into the picture. By default, Elasticsearch creates a single replica of each shard. A replica is always allocated to a node other than the one holding its primary shard. So to make the system fault tolerant, you may have to increase the number of nodes in your cluster, and how many you need depends on the number of shards and replicas of your index. The total number of shard copies to place is "number of shards * (number of replicas + 1)", and you need at least "number of replicas + 1" nodes so that every copy of a shard can be hosted on a different node. The standard practice is to have at least one replica for fault tolerance.
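
You can verify this with the cluster health API: on a single-node cluster, an index with one replica stays yellow because the replica cannot be allocated to a second node (assuming Elasticsearch is running on localhost:9200, as elsewhere in this answer):

# status is "green" when all primaries and replicas are assigned, "yellow" when replicas are unassigned
curl -XGET 'localhost:9200/_cluster/health?pretty'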

Setting the number of shards is a static operation, which means you have to specify it when the index is created. Changing it afterwards requires completely reindexing the data, which takes time. Setting the number of replicas, however, is a dynamic operation and can be done at any time after the index has been created.

You can set the number of shards and replicas for an index with the following command.

curl -XPUT 'localhost:9200/sampleindex?pretty' -H 'Content-Type: application/json' -d '
{
  "settings":{
    "number_of_shards":2,
    "number_of_replicas":1
  }
}'
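
Since the replica count is dynamic, you can change it later through the index settings API; a minimal sketch, reusing the sampleindex from above:

curl -XPUT 'localhost:9200/sampleindex/_settings?pretty' -H 'Content-Type: application/json' -d '
{
  "index":{
    "number_of_replicas":2
  }
}'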

Other answers

In Elasticsearch, at the top level we index documents into indices. Each index has a number of shards across which the data is distributed, and inside each shard there are Lucene segments, which are the core storage of the data. So if an index has 5 shards, the data has been distributed across those shards, and different shards hold different data.
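
For a look at the Lucene segments inside each shard, the cat segments API can be used (sampleindex is an assumed index name):

# Lucene segments per shard of the index
curl -XGET 'localhost:9200/_cat/segments/sampleindex?v'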

Watch this video explaining the core of ES: https://www.youtube.com/watch?v=PpX7J-G2PEo

On multiple indices vs. multiple shards, see: Elasticsearch, multiple indexes vs one index and types for different data sets?

In the simplest terms, a shard is just a part of an index that is stored in a separate folder on disk:

This screenshot shows the entire Elasticsearch directory.

As you can see, all the data goes into the data directory.

Looking into the index C-mAfLltQzuas72iMiIXNw, we can see that it has five shards (folders 0 to 4).

The index JH_A8PgCRj-GK0GeQ0limw, on the other hand, has only one shard (folder 0).

The pri column shows the number of primary shards.
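
The pri column comes from the cat indices API, which also shows the replica count (rep) and the health of each index:

# One line per index: health, status, pri (primary shards), rep (replicas), doc counts, store size
curl -XGET 'localhost:9200/_cat/indices?v'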

An index is broken into shards so that they can be distributed and scaled.

Replicas are copies of the shards and provide reliability if a node is lost. This number often causes confusion: a replica count == 1 means the cluster must have both the primary and the replica copy of a shard available in order to be in a green state.

For replicas to actually be allocated, you must have at least 2 nodes in your cluster.
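
To check how many nodes your cluster currently has:

# One line per node in the cluster
curl -XGET 'localhost:9200/_cat/nodes?v'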

You may find the definitions here easier to understand: http://www.elasticsearch.org/guide/reference/glossary/

Elasticsearch is superbly scalable, and all the credit goes to its distributed architecture, which is made possible by sharding. Before moving further into it, let us consider a simple and very common use case. Suppose you have an index that contains a huge number of documents, and for the sake of simplicity assume the size of that index is 1 TB (i.e., the sum of the sizes of all the documents in that index is 1 TB). Also assume that you have two nodes, each with 512 GB of space available for storing data. Clearly, the entire index cannot be stored on either of the two available nodes, so we need to distribute the index among them.

In cases like this, where the size of an index exceeds the hardware limits of a single node, sharding comes to the rescue. Sharding solves the problem by dividing the index into smaller pieces, which are called shards.
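
To see how the shards and the disk space are spread across the nodes of the cluster, the cat allocation API gives a quick per-node overview:

# Shard count and disk usage per node
curl -XGET 'localhost:9200/_cat/allocation?v'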