Comprendre le fonctionnement d'Elasticsearch

Accueil » chora : solution de gestion d'informations » Support d'Elasticsearch »

Pour aller plus loin avec Elasticsearch, vous devrez en comprendre le fonctionnement. Ce document présente les 10 points essentiels à connaître pour pouvoir travailler avec l'API.

Fields

Fields are the smallest individual unit of data in Elasticsearch. Each field has a defined type and contains a single piece of data that can be, for example, a boolean, string or array expression. A collection of fields are together a single Elasticsearch document.

Starting with Elasticsearch version 2.X, field names cannot start with special characters and cannot contain dots.

Documents

Documents are JSON objects that are stored within an Elasticsearch index and are considered the base unit of storage. In the world of relational databases, documents can be compared to a row in table.

For example, let’s assume that you are running an e-commerce application. You could have one document per product or one document per order. There is no limit to how many documents you can store in a particular index.

Data in documents is defined with fields comprised of keys and values. A key is the name of the field, and a value can be an item of many different types such as a string, a number, a boolean expression, another object, or an array of values.

Documents also contain reserved fields that constitute the document metadata such as:

_index – the index where the document resides
~~_type – the type that the document represents~~
_id – the unique identifier for the document

An example of a document:

1

2

3

4

5

6

7

{

"_id": 3,

“_type”: [“user”],

"age": 28,

"name": ["daniel”],

"year":1989,

}

Types

Elasticsearch types are used within documents to subdivide similar types of data wherein each type represents a unique class of documents. Types consist of a name and a mapping (see below) and are used by adding the _type field. This field can then be used for filtering when querying a specific type.

An index can have any number of types, and you can store documents belonging to these types in the same index.

Note : les types sont actuellement en cours de suppression et il est conseillé d'utiliser des indexes différents pour stocker différentes classes de documents : https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch#mapping-type

Mapping

Like a schema in the world of relational databases, mapping defines the different types that reside within an index. It defines the fields for documents of a specific type — the data type (such as text (formerly string) and integer) and how the fields should be indexed and stored in Elasticsearch.

A mapping can be defined explicitly or generated automatically when a document is indexed using templates. (Templates include settings and mappings that can be applied automatically to a new index.)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

# Example

curl -XPUT localhost:9200/example -d '{

"mappings": {

"mytype": {

"properties": {

"name": {

"type": "text"

},

"age": {

"type": "long"

}

}'

Index

Indices, the largest unit of data in Elasticsearch, are logical partitions of documents and can be compared to a database in the world of relational databases.

Shards

Put simply, shards are a single Lucene index. They are the building block of Elasticsearch and are what facilitate its scalability.

Index size is a common cause of Elasticsearch crashes. Since there is no limit to how many documents you can store on each index, an index may take up an amount of disk space that exceeds the limits of the hosting server. As soon as an index approaches this limit, indexing will begin to fail.

One way to counter this problem is to split up indices horizontally into pieces called shards. This allows you to distribute operations across shards and nodes to improve performance.

When you create an index, you can define how many shards you want. Each shard is an independent Lucene index that can be hosted anywhere in your cluster:

# Example
curl -XPUT localhost:9200/example -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 2, 
      "number_of_replicas" : 1 
    }
  }
}'

1

2

3

4

5

6

7

8

9

# Example

curl -XPUT localhost:9200/example -d '{

"settings" : {

"index" : {

"number_of_shards" : 2,

"number_of_replicas" : 1

}

}'

Replicas

Replicas, as the name implies, are Elasticsearch fail-safe mechanisms and are basically copies of your index’s shards. This is a useful backup system for a rainy day — or, in other words, when a node crashes. Replicas also serve read requests, so adding replicas can help to increase search performance.

To ensure high availability, replicas are not placed on the same node as the original shards (called the “primary” shard) from which they were replicated.

Like with shards, the number of replicas can be defined per index when the index is created. Unlike shards, however, you may change the number of replicas anytime after the index is created.

See the example in the “Shards” section above.

Analyzers

Analyzers are used during indexing to break down phrases or expressions into terms. Defined within an index, an analyzer consists of a single tokenizer and any number of token filters. For example, a tokenizer could split a string into specifically defined terms when encountering a specific expression.

By default, Elasticsearch will apply the “standard” analyzer, which contains a grammar-based tokenizer that removes common English words and applies additional filters. Elasticsearch comes bundled with a series of built-in tokenizers as well, and you can also use a custom tokenizer.

A token filter is used to filter or modify some tokens. For example, a ASCII folding filter will convert characters like ê, é, è to e.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

# Example

curl -XPUT localhost:9200/example -d '{

"mappings": {

"mytype": {

"properties": {

"name": {

"type": "string",

"analyzer": "whitespace"

}

}'

Nodes

The heart of any ELK setup is the Elasticsearch instance, which has the crucial task of storing and indexing data.

In a cluster, different responsibilities are assigned to the various node types:

Data nodes — stores data and executes data-related operations such as search and aggregation
Master nodes — in charge of cluster-wide management and configuration actions such as adding and removing nodes
Client nodes — forwards cluster requests to the master node and data-related requests to data nodes
Tribe nodes — act as a client node, performing read and write operations against all of the nodes in the cluster
Ingestion nodes (this is new in Elasticsearch 5.0) — for pre-processing documents before indexing

By default, each node is automatically assigned a unique identifier, or name, that is used for management purposes and becomes even more important in a multi-node, or clustered, environment.

When installed, a single node will form a new single-node cluster entitled “elasticsearch,” but it can also be configured to join an existing cluster (see below) using the cluster name. Needless to say, these nodes need to be able to identify each other to be able to connect.

In a development or testing environment, you can set up multiple nodes on a single server. In production, however, due to the number of resources that an Elasticsearch node consumes, it is recommended to have each Elasticsearch instance run on a separate server.

Cluster

An Elasticsearch cluster is comprised of one or more Elasticsearch nodes. As with nodes, each cluster has a unique identifier that must be used by any node attempting to join the cluster. By default, the cluster name is “elasticsearch,” but this name can be changed, of course.

One node in the cluster is the “master” node, which is in charge of cluster-wide management and configurations actions (such as adding and removing nodes). This node is chosen automatically by the cluster, but it can be changed if it fails. (See above on the other types of nodes in a cluster.)

Any node in the cluster can be queried, including the “master” node. But nodes also forward queries to the node that contains the data being queried.

As a cluster grows, it will reorganize itself to spread the data.

There are a number of useful cluster APIs that can query the general status of the cluster.

For example, the cluster health API returns health status reports of either “green” (all shards are allocated), “yellow” (the primary shard is allocated but replicas are not), or “red” (the shard is not allocated in the cluster). More about cluster APIs is here.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

# Output Example

{

"cluster_name" : "elasticsearch",

"status" : "yellow",

"timed_out" : false,

"number_of_nodes" : 1,

"number_of_data_nodes" : 1,

"active_primary_shards" : 5,

"active_shards" : 5,

"relocating_shards" : 0,

"initializing_shards" : 0,

"unassigned_shards" : 5,

"delayed_unassigned_shards" : 0,

"number_of_pending_tasks" : 0,

"number_of_in_flight_fetch" : 0,

"task_max_waiting_in_queue_millis" : 0,

"active_shards_percent_as_number" : 50.0

}

Aller plus loin

—cartographie : https://www.elastic.co/blog/found-mapping-workflow
—analyse : https://www.elastic.co/blog/found-text-analysis-part-1 et https://www.elastic.co/blog/found-text-analysis-part-2