
ElasticSearch

tip

This note is based on ElasticSearch v8.11.3.

Prerequisites

Managing Documents

Create Index
PUT /car
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}
Deleting Index
DELETE /car
Indexing document with auto-generated ID
POST /car/_doc
{
  "brand": "McLaren",
  "model": "Senna",
  "stock": 2
}
Indexing document with custom ID
POST /car/_doc/mclaren_senna
{
  "brand": "McLaren",
  "model": "Senna",
  "stock": 2
}
Retrieving documents by ID
GET /car/_doc/mclaren_senna
Update document

Using the Update API saves us API calls: instead of retrieving the document, modifying it, and re-indexing it ourselves with separate requests, Elasticsearch performs these steps internally in a single request.

// Modify existing field
POST /car/_update/mclaren_senna
{
  "doc": {
    "stock": 1
  }
}

// Add new field
POST /car/_update/mclaren_senna
{
  "doc": {
    "vin": "SBM11AAA9CW000199"
  }
}
tip

Elasticsearch documents are in fact immutable due to the implementation of Apache Lucene, on which Elasticsearch is built; an update actually retrieves the existing document, applies the change, and re-indexes the result as a new document.

Scripted updates
// Reducing the current value of `stock` by one
POST /car/_update/mclaren_senna
{
  "script": {
    "source": "ctx._source.stock--"
  }
}

// Assigning an arbitrary value to `stock`
POST /car/_update/mclaren_senna
{
  "script": {
    "source": "ctx._source.stock = 10"
  }
}

// Using parameters within scripts
POST /car/_update/mclaren_senna
{
  "script": {
    "source": "ctx._source.stock -= params.quantity",
    "params": {
      "quantity": 4
    }
  }
}

// Conditionally set the operation to `noop`, in which case no update will be performed
POST /car/_update/mclaren_senna
{
  "script": {
    "source": """
      if (ctx._source.stock == 0) {
        ctx.op = 'noop';
      }

      ctx._source.stock--;
    """
  }
}

// Conditionally update a field value
POST /car/_update/mclaren_senna
{
  "script": {
    "source": """
      if (ctx._source.stock > 0) {
        ctx._source.stock--;
      }
    """
  }
}

// Conditionally delete a document by setting ctx.op to 'delete'
POST /car/_update/mclaren_senna
{
  "script": {
    "source": """
      if (ctx._source.stock < 0) {
        ctx.op = 'delete';
      }

      ctx._source.stock--;
    """
  }
}
Upserts
POST /car/_update/mclaren_720s
{
  "script": {
    "source": "ctx._source.stock++"
  },
  "upsert": {
    "brand": "McLaren",
    "model": "720s",
    "stock": 2
  }
}
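
If the document mclaren_720s already exists, the script runs against it; otherwise the document under the upsert key is indexed as a new document.
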
Replace Documents
PUT /products/_doc/100
{
  "name": "Toaster",
  "price": 79,
  "in_stock": 4
}
Delete Documents
DELETE /products/_doc/101

Routing

  • Routing is the process of resolving a shard for a document by using a formula. By default the formula uses the document's _id metadata, but it can be customized, in which case an additional _routing metadata field is added to the document (see the example after this list).
  • The formula takes the shard count into account, so the number of shards must not be changed after the index is created. If needed, use the Shrink/Split API.
  • Routing should distribute documents evenly among shards.
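
As a sketch, the simplified default routing formula and a request with a custom routing value might look like this (the document ID and routing value are just examples; the same routing value must be provided when retrieving the document):

// Simplified default formula (the _routing value defaults to the document's _id)
shard_num = hash(_routing) % num_primary_shards

// Indexing with a custom routing value
PUT /car/_doc/mclaren_p1?routing=mclaren
{
  "brand": "McLaren",
  "model": "P1",
  "stock": 1
}

// The same routing value is required to retrieve the document by ID
GET /car/_doc/mclaren_p1?routing=mclaren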

Text Analysis

Text analysis, performed when indexing or searching text fields, is the process of converting unstructured text (e.g. the body of an email, a product description, etc.) into a structured format that's optimized for search.

Text analysis is performed by an analyzer, which contains a set of rules that govern the analysis process of tokenization and normalization.

  • Tokenization: Breaking text down into smaller chunks, called tokens.
  • Normalization: Normalizing tokens into a standard format, e.g. lowercasing, word stemming, synonym handling, etc.

Elasticsearch includes a default analyzer called the standard analyzer, but the analysis can be customized by choosing a different built-in analyzer or even configuring a custom one, which gives control over:

  • Changes to the text before tokenization
  • How text is converted to tokens
  • Normalization changes made to tokens before indexing or search
tip

The _source field is not used when searching for documents; searches run against the inverted index instead.

An analyzer contains three lower-level building blocks which can be customized individually: character filters, tokenizers, and token filters.

  • Character Filters: Receive the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. An analyzer may have zero or more character filters, which are applied in order.
  • Tokenizers: Receive a stream of characters, perform the tasks below, and output a stream of tokens. An analyzer has exactly one tokenizer.
    • Breaking the text up into individual tokens (usually individual words)
    • Recording the order or position of each term and the start and end character offsets of the original word the term represents
  • Token Filters: Receive the token stream and may add, remove, or change tokens, but are not allowed to change the position or character offsets of each token. An analyzer may have zero or more token filters, which are applied in order.

By default, no character filters are applied, the standard tokenizer is used (which tokenizes according to the Unicode Text Segmentation algorithm), and the lowercase token filter is applied.
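
As a minimal sketch, a custom analyzer combining the three building blocks can be declared in the index settings (the index and analyzer names below are just examples):

PUT /car_reviews
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}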

Text analysis occurs at two times, each using its own analyzer:

  • Index time: When a document is indexed, any text field values are analyzed with an index analyzer.
  • Search time (Query time): When running a full-text search on a text field, the query string (the text the user is searching for) is analyzed with a search analyzer.

In most cases, the same analyzer should be used at index and search time, unless doing so causes unexpected or irrelevant search matches. This ensures the field values and query strings are converted into the same form of tokens, which in turn ensures the tokens match as expected during a search.
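
For reference, a field mapping can specify which analyzer is used at each time via the analyzer and search_analyzer parameters; when search_analyzer is omitted it defaults to the value of analyzer. A minimal sketch (the index and field names are assumptions):

PUT /car_listings
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}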

Elasticsearch provides the Analyze API to test the result of an analyzer:

POST /_analyze
{
  "text": "This is test string",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}
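
The response lists the resulting tokens together with their character offsets and positions; for the request above it looks roughly like this:

{
  "tokens": [
    { "token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 },
    { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "test", "start_offset": 8, "end_offset": 12, "type": "<ALPHANUM>", "position": 2 },
    { "token": "string", "start_offset": 13, "end_offset": 19, "type": "<ALPHANUM>", "position": 3 }
  ]
}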

Inverted Indices

The output tokens of an analyzer are then stored in a data structure called an inverted index: each text field has its own dedicated inverted index, which maps the terms within the field back to the documents that contain them. Terms within an inverted index are sorted alphabetically for performance reasons.
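
For example, if document 1's model field contains "McLaren Senna" and document 2's contains "McLaren 720S", the inverted index for that field maps each term to the documents containing it: 720s → [2], mclaren → [1, 2], senna → [1] (terms lowercased by the standard analyzer and kept in sorted order).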

tip

Note that the inverted index is created and maintained by Apache Lucene, not Elasticsearch. Other data structures are used for other data types; for example, BKD trees are used for numeric values, dates, and geospatial data.

Mapping

References