Friday, March 4, 2022

High CPU Utilization and High CPU Saturation (Load)

There are 4 core resources in a system:
  • CPU (Central Processing Unit)
  • Memory
  • Disk (storage)
  • Network
CPU Load: The number of processes that are being executed by the CPU or waiting to be executed by the CPU.
CPU Utilization: The share of time the CPU spends doing work, i.e. how much of its processing capacity is in use.
Various factors influence CPU utilization, such as processor speed, the number of cores, and application requirements.
High CPU utilization can result from various factors, including poorly optimized applications, insufficient hardware resources, or malware and other security threats.
Low CPU utilization, on the other hand, could indicate inefficient scheduling or unused hardware resources. Strategies for improving low CPU utilization include implementing load balancing techniques to distribute workloads evenly across available resources, consolidating underutilized resources to maximize efficiency, and upgrading software or hardware components to better align with system requirements.
CPU utilization = (Total time spent on non-idle tasks / Total time) * 100
For example, if the total time is 10,000 milliseconds and the processor spends 8,000 milliseconds on non-idle tasks, the CPU utilization is (8,000 / 10,000) * 100 = 80%. In other words, the processor was busy for 80% of the measured period, which gives a first indication of system performance and of where to look for optimization.
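The formula above can be sketched in a few lines of Java. This is only an illustration of the arithmetic; the class and method names (CpuUtilization, utilizationPercent) are made up for this example and do not come from any library.

```java
final class CpuUtilization {

    /** Returns CPU utilization as a percentage of the measured interval. */
    static double utilizationPercent(long nonIdleMillis, long totalMillis) {
        if (totalMillis <= 0) {
            throw new IllegalArgumentException("totalMillis must be positive");
        }
        return 100.0 * nonIdleMillis / totalMillis;
    }

    public static void main(String[] args) {
        // Same numbers as the worked example: 8,000 ms busy out of 10,000 ms.
        System.out.println(utilizationPercent(8_000, 10_000)); // prints 80.0
    }
}
```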

The CPU - Processor Queue Length alarm becomes active when the number of Windows threads waiting for CPU resources exceeds a threshold. Sustained high processor queue length is a good indicator that you have a CPU bottleneck.

If the % Processor Time value is constantly high, check the disk and network metrics first. If they are low, the processor might be under stress. To confirm this, check the average and current Processor Queue Length values. If they are higher than the recommended values (given below), that points to a processor bottleneck. The usual remedy is to add more processors, since that lets more threads run at the same time. If the Processor Queue Length is low while % Processor Time stays high, consider using more powerful (faster) processors instead.

CPU Load: CPU load is a count, not a percentage: it is the number of processes running on or waiting for the CPU. Each process that starts waiting for the CPU increments the load by 1, and each process that has been served decrements it by 1. Load average is that count averaged over a period of time (commonly reported for the last 1, 5, and 15 minutes). On Linux it covers tasks in the kernel run queue and also tasks blocked in uninterruptible waits such as disk I/O, so it reflects more than CPU demand alone.
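From inside a JVM, the 1-minute load average can be read through the standard java.lang.management API. The sketch below assumes a platform where the JVM can obtain the value; getSystemLoadAverage() returns a negative number where it cannot (as far as we know, that is the case on Windows). The class name LoadAverageProbe and the describe helper are made up for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.Locale;

final class LoadAverageProbe {

    /** Formats a load average reading; negative readings mean "not available". */
    static String describe(double loadAverage, int cores) {
        if (loadAverage < 0) {
            return "load average not available on this platform";
        }
        return String.format(Locale.ROOT,
                "1-min load average %.2f across %d cores", loadAverage, cores);
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // getSystemLoadAverage() is the average over the last minute.
        System.out.println(describe(os.getSystemLoadAverage(), os.getAvailableProcessors()));
    }
}
```

Reading the load next to the core count matters because a load of 4 means something very different on a 2-core machine than on a 16-core one.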

CPU Utilization (usage): A measure of the time the CPU is not idle. It can also be defined as a measure of how busy the CPU is right now.

[Processor] "% Processor time" :  if this counter is constantly high, say above 90%, then you'll need to use other counters (described below) in order to further determine the root cause of the CPU pressure.

[Processor] "% Privileged time" : a high percentage of privileged time, anything consistently above 25%, can usually point to a driver or hardware issue and should be investigated. 
 
[System] "Processor Queue Length"
This is the number of threads that are ready to execute but waiting for a core to become available. On a single-core machine, a sustained value greater than 2-3 can mean that you have some CPU pressure. Similarly, for a multicore machine, divide the queue length by the number of cores; if the result is continuously greater than 2-3, there might be CPU pressure.
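The per-core rule of thumb can be written down as a small check. This is a sketch only: the threshold of 2.0 is the lower end of the 2-3 range quoted above, and the names QueuePressure and indicatesCpuPressure are invented for the example. The queue length itself has to come from your monitoring tool.

```java
final class QueuePressure {

    // Lower end of the "2-3 per core" guideline; treat it as a tunable assumption.
    static final double PER_CORE_THRESHOLD = 2.0;

    /** True when the sustained queue length per core exceeds the threshold. */
    static boolean indicatesCpuPressure(double sustainedQueueLength, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return sustainedQueueLength / cores > PER_CORE_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(indicatesCpuPressure(12, 4)); // 3.0 per core, prints true
        System.out.println(indicatesCpuPressure(6, 4));  // 1.5 per core, prints false
    }
}
```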

High CPU and Low Throughput:
For any load test, if you observe high CPU and low throughput, what would you conclude? Things to check:
  1. Whether the CPU spike is on a single core or on all cores.
  2. Whether core-level utilization is above 75%.
  3. Generate a thread dump to understand the thread states.
  4. There could be a deadlock.
  5. Garbage collection running too rarely or too frequently can cause the CPU to spike.
  6. Check whether the spike lasts for a longer duration, say 10 minutes or more, or goes away on its own.
  7. The spike could also come from an incorrect thread configuration in the app server or an incorrect DB connection configuration in the database server.
  8. Check a heap dump to understand the memory management.
  9. It could also be because classes, methods, or objects are consumed and not released properly. That can be a code-level issue, a framework-level issue, a missing patch, or a side effect of security fixes.
  10. Sometimes very long-running queries in the DB server are tied to code that opens connections without releasing them, which can cause a CPU spike. A DB report can help you understand the issue.
  11. A CPU spike with low throughput has many interacting causes. For example, if the load balancer is not properly configured and routes traffic to a single server instead of splitting it across multiple servers, that could cause high CPU and memory issues as well.
A single question like this can be linked to many parts of the application or its subsystems. The interviewer wants to know your debugging skills.
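For the deadlock check in point 4, a JVM can be asked directly whether any of its threads are deadlocked. The sketch below uses the standard ThreadMXBean API; the class name DeadlockCheck is made up for this example. Note that deadlocked threads are blocked rather than busy, so on their own they would explain the low throughput more than the high CPU.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class DeadlockCheck {

    /** Returns the names of deadlocked threads, or an empty list if there are none. */
    static List<String> deadlockedThreadNames() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is found
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have died since the check
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadNames());
    }
}
```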
================================
Troubleshooting CPU problems in production, and in a cloud environment at that
Our application might have millions of lines of code. Trying to identify the exact line of code that is causing the CPU to spike can be like finding a needle in a haystack.
To help readers better understand this troubleshooting technique, we built a sample application and deployed it to an AWS EC2 instance. Once this application was launched, it caused CPU consumption to spike up to 199.1%. Now let’s walk through the steps that we followed while troubleshooting this problem. Basically, there are 3 simple steps:
  1. Identify threads that consume CPU
  2. Capture thread dumps
  3. Identify lines of code that are causing CPU to spike up
1. Identify threads that are causing CPU to spike
In the EC2 instance, multiple processes could be running. The first step is to identify the process that is causing the CPU to spike up. The best way to do this is to use the ‘top’ command, which is present in *nix flavors of operating systems.

Issue the command ‘top’ from the console:

$ top
This command displays the processes that are running in the EC2 instance, with the highest CPU consumers at the top. When we issued the command in the EC2 instance, we saw the output below:
Fig: ‘top’ command issued from an AWS EC2 instance
From the output, you can see that process 31294 is consuming 199.1% of CPU. That is pretty high. (By default top reports CPU relative to a single core, so a multi-threaded process can exceed 100%.) Now that we have identified the process in the EC2 instance that is causing the CPU to spike up, the next step is to identify the threads within this process that are responsible.

Issue the command ‘top -H -p {pid}’ from the console. Example: $ top -H -p 31294
This command displays the threads of process 31294 along with the CPU each one consumes. When we issued this command in the EC2 instance, we saw the output below:
Fig: ‘top -H -p {pid}’ command issued from an AWS EC2 instance
From the output you can see:

Thread Id 31306 consuming 69.3% of CPU
Thread Id 31307 consuming 65.6% of CPU
Thread Id 31308 consuming 64.0% of CPU
All the remaining threads consume a negligible amount of CPU.
This is a good step forward, as we have identified the threads that are causing the CPU to spike. As the next step, we need to capture thread dumps so that we can identify the lines of code that are causing the CPU to spike up.
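One practical detail when carrying these thread Ids over to a thread dump: ‘top -H’ prints them in decimal, while the thread dumps we have seen from HotSpot’s jstack label each thread with a native Id in hexadecimal (nid=0x...). If you search a raw dump by hand rather than through a tool, a conversion like the sketch below helps. The class name NativeThreadId is made up for this example.

```java
final class NativeThreadId {

    /** Converts a decimal thread Id from 'top -H' into the 0x... form used for nid. */
    static String toNid(int decimalThreadId) {
        return "0x" + Integer.toHexString(decimalThreadId);
    }

    public static void main(String[] args) {
        // The three hot threads identified above.
        for (int id : new int[] {31306, 31307, 31308}) {
            System.out.println(id + " -> nid=" + toNid(id));
        }
        // prints:
        // 31306 -> nid=0x7a4a
        // 31307 -> nid=0x7a4b
        // 31308 -> nid=0x7a4c
    }
}
```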

2. Capture thread dumps
A thread dump is a snapshot of all threads that are present in the application. For each thread, the dump reports its state, its stack trace (i.e. the code path the thread is executing), and thread Id related information.

There are 8 different options to capture thread dumps. You can choose the option that is convenient for you. One of the simplest is the ‘jstack’ tool, which is packaged in the JDK and can be found in the $JAVA_HOME/bin folder. Below is the command to capture a thread dump:

 jstack -l {pid} > {file-path}
where
pid: is the process Id of the application whose thread dump should be captured

file-path: is the file path where the thread dump will be written

Example: jstack -l 31294 > /opt/tmp/threadDump.txt
As per the example, the thread dump of the process would be written to the /opt/tmp/threadDump.txt file.
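If running jstack against the process is not possible, a rough equivalent can be produced from inside the application itself. The sketch below is not what jstack does internally and its output format is our own; it only uses the standard Thread.getAllStackTraces() call to list every live thread with its state and stack. The class name InProcessThreadDump is made up for this example.

```java
import java.util.Map;

final class InProcessThreadDump {

    /** Builds a simple text dump: one header line per thread, then its stack frames. */
    static String capture() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```

Unlike jstack output, this dump does not include native thread Ids, so it cannot be matched against ‘top -H’ directly; it is mainly useful for seeing which code paths RUNNABLE threads are in.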

3. Identify lines of code that are causing CPU to spike up
The next step is to analyze the thread dump to identify the lines of code that are causing the CPU to spike up. We would recommend analyzing thread dumps through fastThread, a free online thread dump analysis tool.

We uploaded the captured thread dump to the fastThread tool, which generated a visual report with multiple sections. In the top right corner of the report there is a search box. There we entered the Ids of the threads that were consuming high CPU, i.e. the thread Ids we identified in step #1: ‘31306, 31307, 31308’.

The fastThread tool displayed the stack traces of all 3 threads.

You can see that all 3 threads are in the RUNNABLE state and executing this line of code:

 com.buggyapp.cpuspike.Object1.execute(Object1.java:13)
Following is the application source code:

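 package com.buggyapp.cpuspike;

 /**
  * Sample class that reproduces the CPU spike.
  *
  * @author Test User
  */
 public class Object1 {
     public static void execute() {
         // Non-terminating loop: this is what keeps the thread on the CPU.
         while (true) {
             doSomething();
         }
     }

     public static void doSomething() {
         // Intentionally empty.
     }
 }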
The stack trace points at line 13 of Object1.java, which is the ‘doSomething();’ call. The ‘doSomething()’ method does nothing, but it is invoked an infinite number of times because of the non-terminating while loop that encloses it. If a thread loops an infinite number of times, it never gives up the CPU, and CPU consumption spikes. That is exactly what is happening in this sample program. If the non-terminating loop is fixed, this CPU spike problem will go away.
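What "fixing the non-terminating loop" looks like depends on what the loop was meant to do, which the sample does not say. One common shape, shown here purely as an assumption, is to give the loop an exit condition that another thread can trigger. The class BoundedWorker and its methods are invented for this sketch and are not part of the sample application.

```java
final class BoundedWorker implements Runnable {

    // volatile so that a stop() call from another thread is seen by the loop.
    private volatile boolean running = true;

    @Override
    public void run() {
        // The loop now ends when stop() is called or the thread is interrupted.
        while (running && !Thread.currentThread().isInterrupted()) {
            doSomething();
        }
    }

    private void doSomething() {
        // Placeholder for real work, empty as in the sample application.
    }

    void stop() {
        running = false;
    }

    /** Starts a worker, asks it to stop, and reports whether it ended in time. */
    static boolean stopsWithin(long timeoutMillis) {
        BoundedWorker worker = new BoundedWorker();
        Thread thread = new Thread(worker, "bounded-worker");
        thread.start();
        worker.stop();
        try {
            thread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !thread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + stopsWithin(1_000));
    }
}
```

Keep in mind that this loop still spins at full speed while it runs. If the real work involves waiting for something, blocking on a queue or sleeping between iterations would also be needed to bring CPU usage down.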

Conclusion
To summarize: first we use the ‘top’ tool to identify the thread Ids that are causing the CPU to spike, then we capture thread dumps, and finally we analyze the thread dumps to identify the exact lines of code that are causing the CPU to spike. Enjoy troubleshooting, happy hacking!