Cleaning Elasticsearch Data being indexed

Sometimes we just don’t have control over the source of the data coming into our Elasticsearch indices.  In such cases it helps to clean the data, removing unwanted content such as HTML tags, before it is put into your Elasticsearch index.  This prevents unwanted and unpredictable search behaviour.

For instance, given the text below:

<a href="http://somedomain.com>">website</a>

If the above is indexed without cleaning the HTML, a search for “somedomain.com”, or even “href”, will match documents containing the above link.  That might be what you want, but in most cases users do not expect it.  To prevent this, you can use a custom analyser to clean your data.
Below is an example solution, along with some useful techniques to debug and analyse your analyser, such as querying the actual terms that are in your index. Note that this is not the Elasticsearch document _source field, which always holds the raw, unmodified data that was sent to Elasticsearch.
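
To make the problem concrete outside of Elasticsearch, here is a rough Python sketch of what happens to a link with and without an HTML-stripping step. It is only an approximation: strip_html and tokenize are hypothetical helpers built on the standard library, not Elasticsearch's html_strip filter or standard tokenizer, and the sample link is a simplified version of the one above.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps text nodes and drops tags together with their attributes."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; in text nodes
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Hypothetical stand-in for an HTML-stripping character filter."""
    parser = TextOnly()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return " ".join(parser.parts)


def tokenize(text):
    """Crude word tokenizer (assumption): word characters, keeping inner dots."""
    return re.findall(r"\w+(?:\.\w+)*", text)


link = '<a href="http://somedomain.com">website</a>'

print(tokenize(link))              # markup leaks in: 'href', 'somedomain.com', ...
print(tokenize(strip_html(link)))  # only the visible text: ['website']
```

Without the stripping step, the tag name, the attribute name and the URL all become searchable terms, which is exactly the behaviour the custom analyser is meant to avoid.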

Cleaning Elasticsearch Data

Create a new index with the required html_strip character filter configured

PUT /html_poc_v3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "html_poc_type": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_html_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "standard"
        },
        "title": {
          "type": "string",
          "index_analyzer": "my_html_analyzer"
        },
        "urlTitle": {
          "type": "string"
        }
      }
    }
  }
}

 

 

Post Some Data

POST /html_poc_v3/html_poc_type/02
{
  "description": "Description <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "title": "Title <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>",
  "body": "Body <p>Some d&eacute;j&agrave; vu <a href=\"http://somedomain.com>\">website</a>"
}

Now retrieve the indexed data

This bypasses the _source field and uses script_fields to fetch the actual indexed terms (tokens) for each field

GET /html_poc_v3/html_poc_type/_search?pretty=true
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "title": {
      "script": "doc[field].values",
      "params": {
        "field": "title"
      }
    },
    "description": {
      "script": "doc[field].values",
      "params": {
        "field": "description"
      }
    },
    "body": {
      "script": "doc[field].values",
      "params": {
        "field": "body"
      }
    }
  }
}

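Example Response

Note the difference between the fields: title and body, which were indexed with my_html_analyzer, contain only the visible text with the HTML entities decoded, while description, indexed with the standard analyzer, also contains fragments of the markup such as “a”, “href”, “http” and “somedomain.com”. The terms are returned de-duplicated and sorted, not in their original order.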

{
  "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "html_poc_v3",
            "_type": "html_poc_type",
            "_id": "02",
            "_score": 1,
            "fields": {
               "title": [
                  [
                     "Some",
                     "Title",
                     "déjà",
                     "vu",
                     "website"
                  ]
               ],
               "body": [
                  [
                     "Body",
                     "Some",
                     "déjà",
                     "vu",
                     "website"
                  ]
               ],
               "description": [
                  [
                     "a",
                     "agrave",
                     "d",
                     "description",
                     "eacute",
                     "href",
                     "http",
                     "j",
                     "p",
                     "some",
                     "somedomain.com",
                     "vu",
                     "website"
                  ]
               ]
            }
         }
      ]
   }
}
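
The ordering in the response is easier to follow with a small sketch. The lists appear to hold each field's unique terms in sorted order rather than the tokens in document order. The Python below reproduces that from hand-derived token sequences (an assumption based on the sample document, not output captured from Elasticsearch); indexed_terms is a hypothetical helper.

```python
def indexed_terms(tokens, lowercase=False):
    """Return unique terms in sorted order, optionally lowercased first."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return sorted(set(tokens))


# Tokens in document order, derived by hand from the sample document (assumption).
# title: tags stripped and entities decoded, no lowercase filter configured.
title_tokens = ["Title", "Some", "déjà", "vu", "website"]
# description: standard analyzer, so markup fragments survive and terms are lowercased.
description_tokens = [
    "Description", "p", "Some", "d", "eacute", "j", "agrave", "vu",
    "a", "href", "http", "somedomain.com", "website", "a",
]

print(indexed_terms(title_tokens))
# ['Some', 'Title', 'déjà', 'vu', 'website']
print(indexed_terms(description_tokens, lowercase=True))
# ['a', 'agrave', 'd', 'description', 'eacute', 'href', 'http', 'j', 'p',
#  'some', 'somedomain.com', 'vu', 'website']
```

This also explains why "a" shows up only once for description even though the sample text contains both an opening and a closing anchor tag.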

Further references for cleaning Elasticsearch data:

Use this tool to test your analyser: elasticsearch-inquisitor
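
Elasticsearch also exposes an _analyze endpoint that returns the tokens an analyser produces for a piece of text, so the same check can be scripted. The Python below is a minimal sketch under stated assumptions: a node reachable at http://localhost:9200, and the older request form in which the analyser is passed as a query-string parameter and the text is the request body (newer releases expect a JSON body with "analyzer" and "text" instead). analyze_request and analyze are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node


def analyze_request(index, analyzer, text, base_url=ES_URL):
    """Build an _analyze request (query-string form used by older releases)."""
    query = urllib.parse.urlencode({"analyzer": analyzer})
    url = f"{base_url}/{index}/_analyze?{query}"
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def analyze(index, analyzer, text):
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(analyze_request(index, analyzer, text)) as resp:
        payload = json.load(resp)
    return [t["token"] for t in payload["tokens"]]
```

Against a running node, analyze("html_poc_v3", "my_html_analyzer", "<p>Some text</p>") should return the token strings that my_html_analyzer produces for that text.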

 
