Ten Elasticsearch Metrics By @Seti321 | @DevOpsSummit [#DevOps #Containers]

Top 10 Elasticsearch Metrics to Watch

By Stefan Thies

Elasticsearch is booming.  Together with Logstash, a tool for collecting and processing logs, and Kibana, a tool for searching and visualizing data in Elasticsearch (aka the “ELK” stack), adoption of Elasticsearch continues to grow by leaps and bounds.  When it comes to actually using Elasticsearch, there are tons of metrics generated.  Instead of taking on the formidable task of tackling all-things-metrics in one blog post, we’re going to serve up something that we at Sematext have found to be extremely useful in our work as Elasticsearch consultants, production support providers, and monitoring solution builders: the top 10 Elasticsearch metrics to watch.  This should be especially helpful to those readers new to Elasticsearch, and also to experienced users who want a quick start into performance monitoring of Elasticsearch.

Here are the Top 10 Elasticsearch metrics:

  1. Cluster Health – Nodes and Shards
  2. Node Performance – CPU
  3. Node Performance – Memory Usage
  4. Node Performance – Disk I/O
  5. Java – Heap Usage and Garbage Collection
  6. Java – JVM Pool Size
  7. Search Performance – Request Latency and Request Rate
  8. Search Performance – Filter Cache
  9. Search Performance – Field Data Cache
  10. Indexing Performance – Refresh Times and Merge Times

Most of the charts in this piece group metrics either by displaying multiple metrics in one chart or organizing them into dashboards. This is done to provide context for each of the metrics we’re exploring.

To start, here’s a dashboard view of the 10 Elasticsearch metrics we’re going to discuss.

Top_10_dashboard

This dashboard image, and all images in this post, are from Sematext’s SPM Performance Monitoring tool.

Now, let’s dig into each of the top 10 metrics one by one and see how to interpret them.

1. Cluster Health – Nodes and Shards

Like OS metrics for a server, the cluster health status is a basic metric for Elasticsearch. It provides an overview of running nodes and the status of shards distributed to the nodes.

Top_10_cluster_health

Tracking running nodes by node type

Putting the counters for the shard allocation status together in one graph visualizes how the cluster recovers over time. Especially in the case of upgrade procedures with round-robin restarts, it’s important to know the time your cluster needs to allocate the shards.

Top_10_shard_allocation_status

Shard allocation status over time

The process of allocating shards after restarts can take a long time, depending on the specific settings of the cluster. The Cluster API gives you some control over shard allocation.
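
To make the shard counters concrete, here is a minimal Python sketch that summarizes one cluster health response. The field names are the ones returned by the `_cluster/health` endpoint; the `summarize_health` helper, the sample values, and the way "percent active" is computed are assumptions for illustration only.

```python
def summarize_health(health):
    """Summarize shard allocation from a _cluster/health response dict."""
    # Shards that are not serving requests yet.
    not_started = health["initializing_shards"] + health["unassigned_shards"]
    total = health["active_shards"] + not_started
    percent_active = 100.0 * health["active_shards"] / total if total else 100.0
    return {
        "status": health["status"],
        "nodes": health["number_of_nodes"],
        "shards_not_started": not_started,
        "percent_active": round(percent_active, 1),
        # Treat the cluster as recovered only when nothing is still moving.
        "recovered": (health["status"] == "green"
                      and not_started == 0
                      and health["relocating_shards"] == 0),
    }


# Hypothetical response captured during a rolling restart.
sample = {
    "status": "yellow",
    "number_of_nodes": 3,
    "number_of_data_nodes": 3,
    "active_primary_shards": 10,
    "active_shards": 15,
    "relocating_shards": 0,
    "initializing_shards": 2,
    "unassigned_shards": 3,
}

print(summarize_health(sample))
```

Recording this summary at a fixed interval during a restart gives a rough measure of how long the cluster needs to allocate its shards.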

2. Node Performance – CPU

As with any other server, Elasticsearch performance depends strongly on the machine it is installed on. CPU, Memory Usage and Disk I/O are basic operating system metrics for each Elasticsearch node.

In the context of Elasticsearch (or any other Java application) it is recommended that you look into Java Virtual Machine (JVM) metrics when CPU usage spikes. In the following example the reason for the spike was higher garbage collection activity.  It can be recognized in the garbage collection chart, where both the collection time and the ‘collection count’ increased.

Top_10_higher_CPU_usage

Higher CPU usage (user space where Elasticsearch lives)

Top_10_increased_garbage_collection

Increased garbage collection activity causing increased CPU usage
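
One way to check whether a CPU spike lines up with garbage collection is to compare two samples of the collector counters. The sketch below assumes each sample is the dict for a single collector (for example the old generation collector) taken from the `jvm.gc.collectors` section of the node stats output; the `gc_activity` helper, the 60-second interval, and the sample numbers are hypothetical.

```python
def gc_activity(prev, curr, interval_seconds):
    """Return (GC runs, percent of wall-clock time spent in GC) between two samples."""
    runs = curr["collection_count"] - prev["collection_count"]
    gc_millis = curr["collection_time_in_millis"] - prev["collection_time_in_millis"]
    overhead_percent = 100.0 * gc_millis / (interval_seconds * 1000.0)
    return runs, round(overhead_percent, 2)


# Hypothetical counters sampled 60 seconds apart.
prev = {"collection_count": 120, "collection_time_in_millis": 4800}
curr = {"collection_count": 150, "collection_time_in_millis": 7800}

runs, overhead = gc_activity(prev, curr, interval_seconds=60)
print("GC runs: %d, GC overhead: %.2f%%" % (runs, overhead))
```

If both numbers climb at the same time as user-space CPU, garbage collection is a likely cause of the spike.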

3. Node Performance – Memory Usage

The following graph shows a good balance. There is some spare memory and nearly 60% of memory is used, which leaves enough space for cached memory (e.g. file system cache).  What you’d see more typically is actually a chart that shows no free memory.  People new to looking at memory metrics often panic thinking that having no free memory means the server doesn’t have enough RAM.  That is actually not so.  It is good not to have free memory.  It is good if the server is making use of all the memory.  The question is just whether there is any buffered & cached memory (this is a good thing) or if it’s all used.  If it’s all used and there is very little or no buffered & cached memory, then indeed the server is low on RAM.  Because Elasticsearch runs inside the Java Virtual Machine, JVM memory and garbage collection are the areas to look at for Elasticsearch-specific memory utilization.

Top_10_balanced_memory

A balanced memory usage
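
The distinction between used, buffered/cached, and free memory can be computed directly on Linux. This is a minimal sketch that assumes `/proc/meminfo`-style input with the `MemTotal`, `MemFree`, `Buffers`, and `Cached` fields; the helper names and the sample values are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            values[name.strip()] = int(rest.split()[0])
    return values


def memory_breakdown(meminfo):
    """Split total memory into used, buffered/cached and free percentages."""
    total = meminfo["MemTotal"]
    cache = meminfo["Buffers"] + meminfo["Cached"]
    used = total - meminfo["MemFree"] - cache
    return {
        "used_percent": round(100.0 * used / total, 1),
        "buffered_cached_percent": round(100.0 * cache / total, 1),
        "free_percent": round(100.0 * meminfo["MemFree"] / total, 1),
    }


# Hypothetical values; on a real Linux host, read the text from /proc/meminfo.
SAMPLE = """MemTotal:       16000000 kB
MemFree:          800000 kB
Buffers:          400000 kB
Cached:          5200000 kB
"""

print(memory_breakdown(parse_meminfo(SAMPLE)))
```

A low free percentage combined with a healthy buffered/cached percentage is the good case described above; a low buffered/cached percentage is the one to worry about.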

——-

Summary Dashboard

Top_10_summary_nodes_mem_cpu

Number of nodes (left),  Memory Usage, CPU usage (right)

4. Node Performance – Disk I/O

A search engine makes heavy use of storage devices, and watching the disk I/O ensures that this basic need gets fulfilled. As there are so many reasons for reduced disk I/O, it’s considered a key metric and a good indicator for many kinds of problems. It is a good metric to check the effectiveness of indexing and query performance. Distinguishing between read and write operations directly indicates what the system needs most in the specific use case. Typically there are many more reads from queries than writes, although a popular use case for Elasticsearch is log management, which typically has high writes and low reads. When writes are higher than reads, optimizations for indexing are more important than query optimizations. This example shows a logging system with more writes than reads:

Top_10_node_perf_disk_io

Disk I/O – read vs. write operations

The operating system settings for disk I/O are a base for all other optimizations – tuning disk I/O can avoid potential problems.  If the disk I/O is still not sufficient, counter-measures such as optimizing the number of shards and their size, throttling merges, replacing slow disks, moving to SSDs or adding more nodes should be evaluated according to the circumstances causing the I/O bottlenecks. For example: while searching, disks get thrashed if the indices don’t fit in the OS cache. This can be solved in a number of different ways: by adding more RAM or data nodes, by reducing the index size (e.g. using time-based indices and aliases), by being smarter about limiting searches to only specific shards or indices instead of searching all of them, by caching, etc.

5. Java – Heap Usage and Garbage Collection

Elasticsearch runs in a JVM, so the optimal settings for the JVM and monitoring of the garbage collector and memory usage are critical. There are several things to consider with regard to JVM and operating system memory settings:

  • Avoid the JVM process getting swapped to disk. On Unix, Linux, and Mac OS X systems: lock the process address space into RAM by setting bootstrap.mlockall=true in the Elasticsearch configuration file and the environment variable MAX_LOCKED_MEMORY=unlimited (e.g. in /etc/default/elasticsearch). To reduce swapping globally on Linux, set “vm.swappiness=1” in /etc/sysctl.conf. On Windows one can simply disable virtual memory.
  • Define the heap memory for Elasticsearch by setting the ES_HEAP_SIZE environment variable (-Xmx java option) and following these rules:
    • Choose a reasonable minimum heap memory to avoid ‘out of memory’ errors. The best practice is setting the minimum (-Xms) equal to the maximum heap size (-Xmx); so there is no need to allocate additional memory during runtime.  Example: ./bin/elasticsearch -Xmx16g -Xms16g
    • As a rule of thumb: set the maximum heap size to 50% of available physical RAM. Typically, one does not want to allocate more than 50-60% of total RAM to the JVM heap.  JVM memory tuning is not trivial and requires one to monitor used and cached main memory, as well as JVM memory heap, memory pool utilization, and garbage collection.
    • Don’t cross the 32 GB limit – if you have servers with lots of memory it is generally better to run more Elasticsearch nodes than going over the 32 GB limit for maximal heap size. In short, using -Xmx32g or higher results in the JVM using larger, 64-bit pointers that need more memory.  If you don’t go over -Xmx31g the JVM will use smaller, 32-bit pointers by using compressed Ordinary Object Pointers aka OOPs.
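
The heap sizing rules above can be sketched as a small Python helper. It assumes the 50% rule of thumb and a 31 GB cap to stay under the compressed OOPs limit; the function names and the 1 GB floor are hypothetical choices, and the result is a starting point rather than a substitute for monitoring.

```python
def recommended_heap_gb(physical_ram_gb, ram_fraction=0.5, max_heap_gb=31):
    """Apply the rule of thumb: half of RAM, capped below the 32 GB limit."""
    half_of_ram = int(physical_ram_gb * ram_fraction)
    # Never go below 1 GB, and never above the compressed OOPs cap.
    return max(1, min(half_of_ram, max_heap_gb))


def heap_flags(physical_ram_gb):
    """Build JVM flags with the minimum heap equal to the maximum heap."""
    heap = recommended_heap_gb(physical_ram_gb)
    return "-Xms{0}g -Xmx{0}g".format(heap)


for ram in (8, 32, 128):
    print("%4d GB RAM -> %s" % (ram, heap_flags(ram)))
```

On a 128 GB server the helper stops at 31 GB, which is the case where running more than one Elasticsearch node per machine is worth considering.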

The report below should be obvious to all Java developers who know how the JVM manages memory. Here we see the relative sizes of all memory spaces and their total size.  If you are troubleshooting performance of the JVM (which one does with pretty much every Java application) this is one of the key places to check first, in addition to looking at the Garbage Collection and Memory Pool Utilization reports (see graph “Pool Utilization”).  In this graph we see a healthy sawtooth pattern clearly showing when major garbage collection kicked in.

Top_10_GC_sawtooth

Typical Garbage Collection Sawtooth

When we watch the summary of multiple Elasticsearch nodes, the sawtooth pattern is not as sharp as usual because garbage collection happens at different times on different machines. Nevertheless, the pattern can still be recognized, probably because all nodes in this cluster were started at the same time and are following similar garbage collection cycles.

Top_10_agg_view_multiple_jvm

Aggregate view of multiple Elasticsearch JVMs in the cluster

6. Java – JVM Pool Size

The Memory Pool Utilization graph shows what percentage of each pool is being used over time.  When some of these Memory Pools, especially Old Gen or Perm Gen, approach 100% utilization and stay there, it’s time to worry.  Actually, it’s already too late by then.  You have alerts set on these metrics, right?  When that happens you might also find increased garbage collection times and higher CPU usage, as the JVM keeps trying to free up some space in any pools that are (nearly) full. If there’s too much garbage collection activity, it could be due to one of the following causes:

  • One particular pool is stressed, and you can get away with tuning pools.
  • The JVM needs more memory than has been allocated to it. In this case, you can either lower your requirements or add more heap memory (ES_HEAP_SIZE). Lowering the utilized heap in Elasticsearch could theoretically be done by reducing the field and filter cache. In practice, doing so would have a negative impact on query performance. A much better way to reduce memory requirements for “not_analyzed” fields is to use the “doc_values” format in the index mapping / type definitions – it reduces the memory footprint at query time.

Top_10_jvm_pool_utilization

JVM Pool Utilization
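
An alert on pool utilization should fire when a pool stays near full, not on a single high sample. Here is a minimal sketch of that check; the `sustained_high` helper, the 90% threshold, and the five-sample window are hypothetical values that would need tuning for a real cluster.

```python
def sustained_high(utilization_percent, threshold=90.0, min_samples=5):
    """Return True if utilization stays at or above threshold for min_samples in a row."""
    run = 0
    for value in utilization_percent:
        # Extend the current run of high samples, or reset it.
        run = run + 1 if value >= threshold else 0
        if run >= min_samples:
            return True
    return False


# Hypothetical Old Gen utilization samples, in percent.
healthy_sawtooth = [60, 75, 92, 40, 65, 80, 93, 45]
stuck_near_full = [70, 95, 96, 97, 98, 99]

print(sustained_high(healthy_sawtooth))
print(sustained_high(stuck_near_full))
```

A healthy sawtooth briefly crosses the threshold and drops again after a collection, so it does not trigger; a pool that stays near 100% does.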

A drastic change in memory usage or long garbage collection runs may indicate a critical situation. For example, in a summarized view of JVM Memory over all nodes a drop of several GB in memory might indicate that nodes left the cluster, restarted or got reconfigured for lower heap usage.

——-

Summary Dashboard

Top_10_summary_jvm_pool_size_GC

JVM Pool Size (left) and Garbage collection (right)

7. Search Performance – Request Latency and Request Rate

When it comes to search applications, the user experience is typically highly correlated to the latency of search requests.  For example, the request latency for simple queries is typically below 100 ms. We say “typically” because Elasticsearch is often used for analytical queries, too, and humans seem to still tolerate slower queries in such scenarios.  There are numerous things that can affect your query performance – poorly constructed queries, an improperly configured Elasticsearch cluster, JVM memory and garbage collection issues, disk I/O, and so on.  Through our Elasticsearch consulting practice we have seen many cases where request latency is low and then suddenly jumps as a consequence of something else starting to misbehave in the cluster.  Needless to say, since query latency is the metric that directly impacts users, make sure you put some alerts on it.  Alerts based on query latency anomaly detection will be helpful here.  The following charts illustrate just such a case.  A spike like the blue 95th percentile query latency spike will trip any anomaly detection-based alerting system worth its salt.  A word of caution – the query latency metrics that Elasticsearch exposes are actually per-shard query latency metrics.  They are not latency values for the overall query.

Top_10_number_ES_nodes

Number of Elasticsearch nodes dropping (left) causing increase in query latency (right)

Putting the request latency together with the request rate into a graph immediately provides an overview of how much the system is used and how it responds to it.
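
Request rate and average latency can both be derived from two samples of the search counters. The sketch below assumes each sample is the `search` section of the index or node stats output, which carries the cumulative `query_total` and `query_time_in_millis` counters; the helper name, the interval, and the sample values are hypothetical. As noted above, these counters are per-shard, so the result is not the latency of the overall request.

```python
def search_rate_and_latency(prev, curr, interval_seconds):
    """Return (queries per second, average per-shard latency in ms) between two samples."""
    queries = curr["query_total"] - prev["query_total"]
    millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    rate = queries / float(interval_seconds)
    # Avoid dividing by zero when no queries ran in the interval.
    latency = millis / float(queries) if queries else 0.0
    return rate, latency


# Hypothetical cumulative counters sampled 60 seconds apart.
prev = {"query_total": 10000, "query_time_in_millis": 250000}
curr = {"query_total": 16000, "query_time_in_millis": 430000}

rate, latency = search_rate_and_latency(prev, curr, interval_seconds=60)
print("%.1f queries/s, %.1f ms average per-shard latency" % (rate, latency))
```

An average hides outliers, so percentile-based latency from a monitoring tool remains the better signal for alerting.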

8. Search Performance – Filter Cache

Most of the filters in Elasticsearch are cached by default. That means that during the first execution of a query with a filter, Elasticsearch will find documents matching the filter and build a structure called a bitset using that information. Data stored in the bitset is really simple – it contains the document identifier and whether a given document matches the filter. Subsequent executions of queries having the same filter will reuse the information stored in the bitset, thus making query execution faster by saving I/O operations and CPU cycles. Even though cached filters are relatively small, they can take up large portions of the JVM heap if you have a lot of data and numerous different filters. Because of that it is wise to set the “indices.cache.filter.size” property to limit the amount of heap to be used for the filter cache.  To find the best setting for this property, keep an eye on the filter cache size and filter cache eviction metrics shown in the chart below.

Top_10_filter_cache_size

Filter cache size and evictions help optimize filter cache size setting

9. Search Performance – Field Data Cache

Field data cache size and evictions are typically important for search performance if aggregation queries are used. Field data is expensive to build – it requires pulling data from disk into memory.  Field data is also used for sorting and for scripted fields. Remember that by default (because of how costly it is to build) the field data cache is unbounded.  This, of course, could make your JVM heap explode.  To avoid nasty surprises, consider limiting the size of the field data cache by setting the “indices.fielddata.cache.size” property and keeping an eye on it to understand the actual size of the cache.

Top_10_field_cache_size

Field cache size (yellow) and field cache evictions (green)

Circuit breakers are also worth configuring, to limit the possibility of breaking the cluster with a set of expensive queries. Finally, you may want to use doc values, special structures built during indexing that Elasticsearch can use instead of the field data cache for non-analyzed string and numeric fields.

——-

Summary Dashboard

Top_10_summary_filter_cache_etc

Request Latency (left), Field Cache (center) and Filter Cache (right)

10. Indexing Performance – Refresh Times and Merge Times

Several different things take place in Elasticsearch during indexing and there are many metrics to monitor its performance. When running indexing benchmarks, a fixed number of records is typically used to calculate the indexing rate. In production, though, you’ll typically want to keep an eye on the real indexing rate.  Elasticsearch doesn’t expose the rate itself, but it does expose the number of documents, from which one can compute the rate, as shown here:

Top_10_indexing_rate

Indexing rate  (documents/second) and document count
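
Deriving the indexing rate from document counts is simple arithmetic on consecutive samples. The sketch below assumes a list of (timestamp in seconds, document count) pairs collected by some external poller; the `indexing_rates` helper and the sample values are hypothetical.

```python
def indexing_rates(samples):
    """Compute documents/second between consecutive (timestamp, doc_count) samples."""
    rates = []
    for (t0, count0), (t1, count1) in zip(samples, samples[1:]):
        rates.append((count1 - count0) / float(t1 - t0))
    return rates


# Hypothetical samples taken every 10 seconds.
samples = [(0, 1000), (10, 6000), (20, 11500), (30, 11500)]

print(indexing_rates(samples))
```

A rate that drops to zero, as in the last interval here, is the kind of dip that may point to a problem with a data source. Because the document count also falls when documents are deleted, a negative value should be read as deletions rather than as an indexing rate.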

This is another metric worth considering for alerts and/or anomaly detection.  Sudden spikes and dips in indexing rate could indicate issues with data sources.  Refresh time and merge time are closely related to indexing performance, plus they affect overall cluster performance. Refresh time increases with the number of file operations for the Lucene index (shard). Reduced refresh times can be achieved by setting the refresh interval to higher values (e.g. from 1 second to 30 seconds). When Elasticsearch (really, Apache Lucene, which is the indexing/searching library that lives at the core of Elasticsearch) merges many segments, or simply a very large index segment, the merge time increases.  Merge time is therefore a good indicator of whether the right merge policy, shard, and segment settings are in place. In addition, during merges disk I/O shows intensive write activity and CPU usage spikes as well. Thus, merges should be as quick as possible.  Alternatively, if merges are affecting the cluster too much, one can limit the merge throughput and increase “indices.memory.index_buffer_size” (to more than 10% on nodes with a small heap) to reduce disk I/O and leave more CPU cycles for concurrently executing queries.

Top_10_summarized_merge_times

Growing summarized merge times over all nodes while indexing

Segment merging is a very important process for index performance, but it is not without side effects. Higher indexing performance usually means allowing more segments to be present, which makes queries slightly slower.

——-

Summary Dashboard

Top_10_refresh_merge_times

Refresh and Merge times (left) compared with Disk I/O (right)

So there you have it — the top Elasticsearch metrics to watch for those of you who find yourselves knee deep — or even deeper! — in charts, graphs, dashboards, etc.  Elasticsearch is a high-powered platform that can serve your organization’s search needs extremely well, but, like a blazing fast sports car, you’ve got to know what dials to watch and how to shift gears on the fly to keep things running smoothly.  Staying focused on the top 10 metrics and corresponding analysis presented here will get and/or keep you on the road to a successful Elasticsearch experience.

So, those are our top Elasticsearch metrics — what are YOUR top 10 metrics? We’d love to know so we can compare and contrast them with ours in a future post.  Please leave a comment, or send them to us via email or hit us on Twitter: @sematext.

And…if you’d like to try SPM to monitor Elasticsearch yourself, check out a free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

[Note: this post originally appeared on Radar.com]

Filed under: Monitoring Tagged: elasticsearch, metrics, performance monitoring, spm

Sematext is a globally distributed organization that builds innovative Cloud and On Premises solutions for performance monitoring, alerting and anomaly detection (SPM), log management and analytics (Logsene), and search analytics (SSA). We also provide Search and Big Data consulting services and offer 24/7 production support for Solr and Elasticsearch.