Saturday, March 26, 2022

GC (Garbage Collection).


PermGen is non-heap memory; it contains run-time class data.
Method Area: it is also a non-heap area and is a part of PermGen; it is used to store class structure.
A memory pool can belong to the Young generation, the Old generation or PermGen (i.e., it spans both heap and non-heap); it mostly holds immutable objects (immutable means the state of the object is constant),
so memory pools are created by the JVM memory manager for the sake of immutable objects.
Stack Memory: it contains the method-specific values and all the short-lived references to other objects in the heap.
Native Thread: any thread which is mapped to an OS thread is called a native thread.
GC: collecting unused objects.
Ways to make an object eligible for the Garbage Collector are listed later in these notes.
If the young generation fails and cannot free enough memory to create new objects, it is completely a developer issue (objects being retained in code), not a JVM issue.
GC Mechanism :- 
  • Half GC (Minor GC): deallocates memory from the younger generation
  • Full GC (Major GC): deallocates memory from both generations
Key memory values (see the sketch below):
  • Initial Memory (Total)
  • Free Memory
  • Max Memory
  • Used Memory
                        Free Memory < Total Memory
                        Used (Consumed) Memory = Initial (Total) Memory - Free Memory
                        Available Memory = Max Memory - Used Memory
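These values can be read at run time through the java.lang.Runtime API. Below is a minimal sketch (class name is illustrative) showing how the terms above map onto totalMemory(), freeMemory(), and maxMemory():

public class MemorySnapshot {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();   // memory currently reserved by the JVM (Initial/Total)
        long free  = rt.freeMemory();    // unused portion of the reserved memory
        long max   = rt.maxMemory();     // upper limit the heap may grow to (-Xmx)

        long used      = total - free;   // Used (Consumed) Memory = Total - Free
        long available = max - used;     // Available Memory = Max - Used

        System.out.println("Total=" + total + " Free=" + free
                + " Max=" + max + " Used=" + used + " Available=" + available);
    }
}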
Even though the programmer is not responsible for destroying useless objects, it is highly recommended to make an object unreachable (and thus eligible for GC) when it is no longer required.

What are Garbage Collection Roots in Java?
GCs work on the concept of Garbage Collection Roots (GC Roots) to identify live and dead objects.  Examples of such Garbage Collection roots are:
Classes loaded by system class loader (not custom class loaders)
Live threads
Local variables and parameters of the currently executing methods
Local variables and parameters of JNI methods
Global JNI reference
Objects used as a monitor for synchronization
Objects held from garbage collection by JVM for its purposes
How does GC work?
GC involves 3 steps:
1. Mark
2. Delete or Sweep
3. Compaction
So the phases of Garbage Collection in Java are:
Mark the objects that are live
Sweep the dead objects
Compact the remaining objects :- the process of arranging the surviving objects in order (the GC pauses the application during this phase)
Some important terms :-
  • Live objects : objects which are still reachable, i.e. referenced (ultimately from a GC root)
  • Dead objects : objects which are no longer referenced and are unreachable
  • Daemon thread : the GC process is carried out by a daemon thread
  • System.gc() : a static method used to request the GC; on invocation,
    • the GC will run to reclaim the unused memory space.
There are generally four ways to make an object eligible for garbage collection (a short sketch follows this list):
  1. Nullifying the reference variable
  2. Re-assigning the reference variable
  3. Objects created inside a method (eligible once the method completes)
  4. Island of Isolation
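A minimal sketch of the first three cases (class and variable names are illustrative):

public class GcEligibility {
    public static void main(String[] args) {
        StringBuilder a = new StringBuilder("first");
        StringBuilder b = new StringBuilder("second");

        a = null;                              // 1. Nullifying: "first" becomes unreachable
        b = new StringBuilder("third");        // 2. Re-assigning: "second" becomes unreachable

        createScratch();                       // 3. Object created inside a method
    }

    private static void createScratch() {
        byte[] scratch = new byte[1024];       // eligible for GC once this method returns
    }
}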

GC Pattern :- 

  1. Healthy saw-tooth pattern
  2. Heavy caching pattern
  3. Acute memory leak pattern
  4. Consecutive Full GC pattern
  5. Memory Leak Pattern

GC Types :-

  1. Serial GC :  java -XX:+UseSerialGC -jar Application.jar
  2. Parallel GC : java -XX:+UseParallelGC -jar Application.jar
  3. CMS GC : java -XX:+UseConcMarkSweepGC -jar Application.jar (deprecated since Java 9)
  4. G1 GC : java -XX:+UseG1GC -jar Application.jar
  5. Z GC : java -XX:+UseZGC -jar Application.jar

Serial GC :- This GC implementation freezes all application threads when it runs.
Parallel GC :- This GC uses multiple threads for managing heap space, but it also freezes other application threads while performing GC.
5. Z Garbage Collector (ZGC) is scalable, with low latency. It is a completely new GC, written from scratch. It can mark memory, copy and relocate it, all concurrently, and it can work with heap sizes ranging from relatively small heaps up to multi-terabyte heaps. As a concurrent garbage collector, ZGC aims to keep GC pause times below 10 milliseconds, even for larger heap sizes. ZGC was initially released as an experimental GC in Java 11 (Linux), and more changes followed over time in JDK 11, 13, and 14.

The stop-the-world pauses are limited to root scanning in ZGC. It uses load barriers with colored pointers to perform concurrent operations while application threads are running, and these are used to keep track of heap usage. Colored pointers are one of the core concepts of ZGC: they enable ZGC to find, mark, locate, and remap objects. Compared to G1, ZGC deals better with very large object allocations, is highly performant when it comes to reclaiming and reallocating memory, and it is a single-generation GC.

ZGC divides memory into regions, also called ZPages. These ZPages can be dynamically created and destroyed and can also be dynamically sized. Unlike other GCs, the physical heap regions of ZGC can map into a bigger heap address space (which can include virtual memory) which can avoid memory fragmentation issues.
Troubleshoot performance issues
The first step is to determine whether the issue is actually garbage collection. If you determine that it is, select from the following list to troubleshoot the problem.
  1. An out-of-memory exception is thrown
  2. The process uses too much memory
  3. The garbage collector does not reclaim objects fast enough
  4. The managed heap is too fragmented
  5. Garbage collection pauses are too long
  6. Generation 0 is too big
  7. CPU usage during a garbage collection is too high
Note :- If GC pause time increases, response time increases, throughput goes down, and the overall performance of the application automatically decreases; CPU and memory utilization also tend to go up.
eg : if GC pause time is 9 sec and response time is 4 sec, then the total time the transaction waits is 13 sec, which is not acceptable.
Whenever GC takes a high pause time, it means the GC is taking more time to collect all the dead objects and compact the surviving objects (i.e. arrange them in order).
Note : If you try to decrease the Old area size to decrease the Full GC execution time, OutOfMemoryError may occur or the number of Full GC cycles may increase.
Alternatively, if you try to decrease the number of Full GCs by increasing the Old area size, each Full GC's execution time will increase.
Qus :- What did you do to improve the GC mechanism?
Ans : Reduce or optimize Major GC, because Major GC pauses are what keep the application unresponsive for a duration.
Qus : What is the issue if Full GC Execution Time is High and Minor GC Execution Time Is also High ?
Ans : 
1. Healthy saw-tooth pattern
Fig 1: Healthy saw-tooth GC pattern
You will see a beautiful saw-tooth GC pattern when an application is healthy, as shown in the above graph. Heap usage will keep rising; once a ‘Full GC’ event is triggered, heap usage will drop all the way to the bottom. 
In Fig 1, You can notice that when the heap usage reaches ~5.8GB, ‘Full GC’ event (red triangle) gets triggered. When the ‘Full GC’ event runs, memory utilization drops all the way to the bottom i.e., ~200MB. Please see the dotted black arrow line in the graph. It indicates that the application is in a healthy state & not suffering from any sort of memory problems. 

2. Heavy caching pattern
Fig 2: Heavy caching GC pattern
When an application is caching many objects in memory, ‘GC’ events wouldn’t be able to drop the heap usage all the way to the bottom of the graph (like you saw in the earlier ‘Healthy saw-tooth’ pattern). 
In Fig 2, you can notice that heap usage keeps growing. When it reaches around ~60GB, GC event (depicted as a small green square in the graph) gets triggered. However, these GC events aren’t able to drop the heap usage below ~38GB. Please refer to the dotted black arrow line in the graph. In contrast, in the earlier ‘Healthy saw-tooth pattern’, you can see that heap usage dropping all the way to the bottom ~200MB. When you see this sort of pattern (i.e., heap usage not dropping till all the way to the bottom), it indicates that the application is caching a lot of objects in memory. 
When you see this sort of pattern, you may want to investigate your application’s heap using heap dump analysis tools like yCrash, HeapHero, Eclipse MAT and figure out whether you need to cache these many objects in memory. Several times, you might uncover unnecessary objects to be cached in the memory.  
Here is the real-world GC log analysis report, which depicts this ‘Heavy caching’ pattern.

3. Acute memory leak pattern
Fig 3: Acute memory leak GC pattern
Several applications suffer from this ‘Acute memory leak pattern’. When an application suffers from this pattern, heap usage will climb up slowly, eventually resulting in OutOfMemoryError. 
In Fig 3, you can notice that the ‘Full GC’ (red triangle) event gets triggered when heap usage reaches around ~43GB. In the graph, you can also observe that the amount of heap that Full GC events could recover starts to decline over a period of time, i.e., you can notice that
a. When the first Full GC event ran, heap usage dropped to 22GB 
b. When the second Full GC event ran, heap usage dropped only to 25GB
c. When the third Full GC event ran, heap usage dropped only to 26GB
d. When the final full GC event ran heap usage dropped only to 31GB

Please see the dotted black arrow line in the graph. You can notice the heap usage gradually climbing up. If this application runs for a prolonged period (days/weeks), it will experience OutOfMemoryError (please refer to Section #5 – ‘Memory Leak Pattern’). 

Here is the real-world GC log analysis report, which depicts this ‘Acute memory leak’ pattern.

4. Consecutive Full GC pattern
Fig 4: Consecutive Full GC pattern
When the application's traffic volume increases beyond what the JVM can handle, this consecutive Full GC pattern becomes pervasive. 
In Fig 4, please refer to the black arrow mark in the graph. From 12:02pm to 12:30 pm on Oct’ 06, Full GCs (i.e., ‘red triangle’) are consecutively running; however, heap usage isn’t dropping during that time frame. It indicates that traffic volume spiked up in the application during that time frame, thus the application started to generate more objects, and Garbage Collection couldn’t keep up with the object creation rate. Thus, GC events started to run consecutively. Please note that when a GC event runs, it has two side effects: 

a. CPU consumption will go high (as GC does an enormous amount of computation).
b. Entire application will be paused; no customers will get response. 
Thus, during this time frame, 12:02pm to 12:30pm on Oct’ 06, since GC events are consecutively running, application’s CPU consumption would have been skyrocketing and customers wouldn’t be getting back any response. When this kind of pattern surfaces, you can resolve it using one of the solutions outlined in this post.
Here is the real-world GC log analysis report, which depicts this ‘Consecutive Full GC’ pattern.

5. Memory Leak Pattern
Fig 5: Memory Leak GC pattern
This is a ‘classic pattern’ that you will see whenever the application suffers from memory problems. In Fig 5, please observe the black arrow mark in the graph. You can notice that Full GC (i.e., ‘red triangle’) events are continuously running. This pattern is similar to the previous ‘Consecutive Full GC’ pattern, with one sharp difference. In the ‘Consecutive Full GC’ pattern, application would recover from repeated Full GC runs and return back to normal functioning state, once traffic volume dies down. However, if the application runs into a memory leak, it wouldn’t recover, even if traffic dies. The only way to recover the application is to restart the application. If the application is in this state, you can use tools like yCrash, HeapHero, Eclipse MAT to diagnose memory leak. Here is a more detailed post on how to diagnose Memory leak.
Here is the real-world GC log analysis report, which depicts this ‘Memory Leak’ pattern.

Ways of requesting the JVM to run the Garbage Collector
Once we make an object eligible for garbage collection, it may not be destroyed immediately by the garbage collector. The object is destroyed only when the JVM runs the Garbage Collector program, and we cannot predict when that will happen.
We can, however, request the JVM to run the Garbage Collector. There are two ways to do it (a minimal sketch of both follows):
Using the System.gc() method: the System class contains a static method gc() for requesting the JVM to run the Garbage Collector.
Using the Runtime.getRuntime().gc() method: the Runtime class allows the application to interface with the JVM in which the application is running. Hence, by using its gc() method, we can request the JVM to run the Garbage Collector.
There is no guarantee that either of the above two methods will actually run the Garbage Collector.
The call System.gc() is effectively equivalent to the call Runtime.getRuntime().gc().
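For example (class name is illustrative; remember these are only requests, which the JVM may ignore):

public class GcRequestDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[10 * 1024 * 1024];
        buffer = null;                 // make the 10 MB array eligible for GC

        System.gc();                   // static method: requests (does not force) a GC run
        Runtime.getRuntime().gc();     // effectively equivalent to System.gc()
    }
}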
Finalization
Just before destroying an object, the Garbage Collector calls the finalize() method on the object to perform cleanup activities. Once the finalize() method completes, the Garbage Collector destroys that object.
The finalize() method is present in the Object class with the following prototype:
protected void finalize() throws Throwable
Based on our requirement, we can override the finalize() method to perform our cleanup activities, like closing a connection to the database. 

The finalize() method is called by the Garbage Collector, not by the JVM directly; however, the Garbage Collector is one of the modules of the JVM.
The Object class finalize() method has an empty implementation; thus, it is recommended to override the finalize() method to dispose of system resources or perform other cleanup.
The finalize() method is never invoked more than once for any object.
If an uncaught exception is thrown by the finalize() method, the exception is ignored, and the finalization of that object terminates.
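A minimal sketch of overriding finalize() for cleanup (the class and field names are illustrative; note that finalize() is deprecated since Java 9, so newer code should prefer try-with-resources or java.lang.ref.Cleaner):

public class DbClient {
    private Object connection = new Object();   // hypothetical resource handle, for illustration only

    @Override
    protected void finalize() throws Throwable {
        try {
            connection = null;          // cleanup activity, e.g. releasing a database connection
        } finally {
            super.finalize();           // always chain to the superclass implementation
        }
    }
}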
The advantages of Garbage Collection in Java are:
  • It makes java memory-efficient because the garbage collector removes the unreferenced objects from heap memory.
  • It is automatically done by the garbage collector(a part of JVM), so we don’t need extra effort.
Qus :- When does an object get destroyed?
Ans : Whenever an object no longer has any reference to it, the object becomes eligible to be destroyed (collected).
Qus : How do you know the GC is running?
Ans :- By using jstat (GC statistics), JVisualVM, JConsole, JProfiler, or any APM tool, we can monitor GC metrics.
Qus : Where do we need to increase the heap size?
Ans : in Tomcat, in the catalina/setenv script;
         in WebLogic, in the setDomainEnv.cmd (or .sh) file:
                Set WLS_MEM_ARGS_64BIT=-Xms512m -Xmx1024m
                Set WLS_MEM_ARGS_32BIT=-Xms512m -Xmx512m
Serial GC is suitable only for standalone and desktop applications.
If we use it for a distributed application, we will get severe performance issues.

Wednesday, March 23, 2022

AWR_Understand each field

Hard page faults occur when the page is not located in physical memory or in a memory-mapped file created by the process. The performance of applications will suffer when there is insufficient RAM and excessive hard page faults occur.

A soft page fault occurs when the page is resident elsewhere in memory.

AWR report is broken into multiple parts.

1)Instance information:-

This provides information about the instance name, instance number, snapshot IDs, the total time the report covers, and the database time during this elapsed time.

Elapsed time = end snapshot time - start snapshot time

Database time = work done by the database during this elapsed time (both CPU and I/O add to database time). If this is much smaller than the elapsed time, the database is mostly idle. Database time does not include time spent by the background processes.

2)Cache Sizes : This shows the size of each SGA region after AMM has changed them. This information can be compared to the original init.ora parameters at the end of the AWR report.

3)Load Profile: This important section shows key rates expressed per second and per transaction. It is very important for understanding how the instance is behaving. It has to be compared to a baseline report to understand the expected load on the machine and the delta during bad times.

4)Instance Efficiency Percentages (Target 100%): This section shows how close the vital ratios are to 100%, such as buffer cache hit, library cache hit, parse ratios, etc. These can be taken as indicators, but low values should not by themselves be a cause of worry, as the ratios could be low or high based on database activity and not due to a real performance problem. Hence these are not stand-alone statistics and should be read only for a high-level view.

5)Shared Pool Statistics: This summarizes changes to the shared pool during the snapshot period.

6)Top 5 Timed Events: This is the section most relevant for analysis. It shows what % of database time each wait event accounted for. Until 9i, this was the way to back-track the total database time for the report, as there was no database time column in 9i.

7)RAC Statistics: This part is seen only in the case of a cluster (RAC) instance. It provides important indications on the average time taken for block transfers, block receives, messages, etc., which can point to performance problems in the cluster instead of the database.

8)Wait Class: This depicts which wait class was the area of contention and where we need to focus: network, concurrency, cluster, I/O, application, configuration, etc.

9)Wait Events Statistics Section: This section shows a breakdown of the main wait events in the database including foreground and background database wait events as well as time model, operating system, service, and wait classes statistics.

10)Wait Events: This AWR report section provides more detailed wait event information for foreground user processes which includes Top 5 wait events and many other wait events that occurred during the snapshot interval.

11)Background Wait Events: This section is relevant to the background process wait events.

12)Time Model Statistics: Time model statistics report how database-processing time is spent. This section contains detailed timing information on particular components participating in database processing. It also gives information about background process timing, which is not included in database time.

13)Operating System Statistics: This section is important from OS server contention point of view. This section shows the main external resources including I/O, CPU, memory, and network usage.

14)Service Statistics: The service statistics section gives information about services and their load in terms of CPU seconds, I/O seconds, number of buffer reads, etc.

15)SQL Section: This section displays top SQL, ordered by important SQL execution metrics.

a)SQL Ordered by Elapsed Time: Includes SQL statements that took significant execution time during processing.

b)SQL Ordered by CPU Time: Includes SQL statements that consumed significant CPU time during its processing.

c)SQL Ordered by Gets: These SQLs performed a high number of logical reads while retrieving data.

d)SQL Ordered by Reads: These SQLs performed a high number of physical disk reads while retrieving data.

e)SQL Ordered by Parse Calls: These SQLs experienced a high number of reparsing operations.

f)SQL Ordered by Sharable Memory: Includes SQL statements cursors which consumed a large amount of SGA shared pool memory.

g)SQL Ordered by Version Count: These SQLs have a large number of versions in shared pool for some reason.

16)Instance Activity Stats: This section contains statistical information describing how the database operated during the snapshot period.

17)I/O Section: This section shows the all-important I/O activity. It provides the time it took to perform one I/O (Av Rd(ms)) and the I/Os per second (Av Rd/s). This should be compared to the baseline to see whether the rate of I/O has always been like this or there is a deviation now.

18)Advisory Section: This section show details of the advisories for the buffer, shared pool, PGA and Java pool.

19)Buffer Wait Statistics: This important section shows buffer cache waits statistics.

20)Enqueue Activity: This important section shows how enqueue operates in the database. Enqueues are special internal structures which provide concurrent access to various database resources.

21)Undo Segment Summary: This section gives a summary about how undo segments are used by the database.

Undo Segment Stats: This section shows detailed history information about undo segment activity.

22)Latch Activity: This section shows details about latch statistics. Latches are a lightweight serialization mechanism used to single-thread access to internal Oracle structures. A latch should be judged by its sleeps: the sleepiest latch is the one under contention, not the latch with the most requests. Hence run through the sleep-breakdown part of this section to find the latch under the highest contention.

23)Segment Section: This portion is important for guessing in which segment and which segment type the contention could be. Tally this with the Top 5 wait events.

Segments by Logical Reads: Includes top segments which experienced high number of logical reads.

Segments by Physical Reads: Includes top segments which experienced high number of disk physical reads.

Segments by Buffer Busy Waits: These segments have the largest number of buffer waits caused by their data blocks.

Segments by Row Lock Waits: Includes segments that had a large number of row locks on their data.

Segments by ITL Waits: Includes segments that had large contention for the Interested Transaction List (ITL). The contention for ITL can be reduced by increasing the INITRANS storage parameter of the table.

24)Dictionary Cache Stats: This section exposes details about how the data dictionary cache is operating.

25)Library Cache Activity: Includes library cache statistics, which are needed in case you see library cache waits in the Top 5 wait events. You might want to see whether reloads/invalidations are causing the contention or there is some other issue with the library cache.

26)SGA Memory Summary: This tells us the difference in the respective pools at the start and end of the report. This can be an indicator for setting a minimum value for each pool when sga_target is being used.

27)init.ora Parameters: This section shows the original init.ora parameters for the instance during the snapshot period.

There would be more Sections in case of RAC setups to provide details.

Thursday, March 10, 2022

Requirement Gathering (NFR)

Business Flows

Number of Users

Type of Testing (Baseline Test, Benchmark Test, Load Test, Fatigue or Stress Test, Scalability Test, Soak or Endurance Test, Flood or Volume Test)

Test Environment : (H/W, S/W, Technology, LAN/WAN)

SLA :(TXN Response Time, Hits/Sec, Throughput, CPU Usage);


1. Current project timeline to begin and close the testing activities.

2. Is the application functionality stable and is its functional testing completed?

3. Type of PT tool and type of testing to be performed.

4. What are the goals of the PT activity?

5. What will be the acceptance criteria for each PT cycle?

6. What type of application is it (web / client-server / mobile)?

7. Type of technology & DB.

8. What is the app server (Tomcat / WebLogic)?

9. Network bandwidth, LAN/WAN details.

10. Are there any known issues?

11. Type of protocol between client and server.

12. Average user session time; how many users visit the site in 24 hours?

13. How many transactions does each user perform per day in the application?




Response Time

End-to-End Response Time : it means the total time that a user waits for a response after submitting a request.

End-to End Response Time = GUI Response Time + N/W + Server Response Time
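For example (illustrative numbers only):
            End-to-End Response Time = 0.5 sec (GUI) + 0.3 sec (network) + 1.2 sec (server) = 2.0 sec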

Response Time being high means:

  • It might be a network problem
  • It might be a system issue
  • It might be an LG (load generator) issue
  • It might be the application code (memory leak, large objects, heavy functions, etc.)
  • It might be a slow DB query
  • It might be an application issue: e.g. if we submit one page and that page takes more time to load because it carries a lot of data
  • High physical disk queue length
  • Throughput is down
  • Memory leak
  • CPU utilization is high

Note :- If we increase pacing time and think time, the time per transaction increases and throughput decreases.
            If we decrease pacing time and think time, the time per transaction decreases and throughput increases.
Eg :- Pacing Time = 5 sec, Think Time = 5 sec, Response Time = 3 sec
            5 + 5 + 3 = 13 sec per transaction, which means
            60/13 ≈ 4.6 transactions/min ------> Throughput

         Pacing Time = 3 sec, Think Time = 3 sec, Response Time = 2 sec
            3 + 3 + 2 = 8 sec per transaction, which means
            60/8 = 7.5 transactions/min ------> Throughput
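The same arithmetic as a small sketch (method and variable names are illustrative):

public class ThroughputCalc {
    // transactions per minute for a single Vuser = 60 / (pacing + think + response)
    static double perUserTpm(double pacingSec, double thinkSec, double responseSec) {
        return 60.0 / (pacingSec + thinkSec + responseSec);
    }

    public static void main(String[] args) {
        System.out.println(perUserTpm(5, 5, 3));   // ~4.6 transactions/min
        System.out.println(perUserTpm(3, 3, 2));   // 7.5 transactions/min
    }
}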

Note :- For a transaction that is taking more response time, we need to analyze these graphs:
  1. Web Page Diagnostics graph
  2. Page Breakdown graph
High response time may cause fewer hits/sec.
Qus :- What if the response time is high for two transactions and low for the other transactions?
Ans : We need to analyze the server side for those two transactions, check the application side, and also check those two transactions manually in the application.

Response Time Graph merging with other Graphs :-

Response time being high might mean a network problem, an LG issue, a system issue, or an application issue.
eg :- Application issue: suppose we submit one page and that page takes more time to load because it carries a lot of data.

-> Four key graphs used during LoadRunner analysis to analyze response time:
1. Transaction Summary report
2. Avg Transaction Response Time graph
3. Transactions per Second graph
4. Transaction Response Time Under Load graph

1. Response Time graph merged with the Number of Vusers graph :-
    To check how the number of running Vusers affects the transaction response time,
and whether the response time improves due to a decrease in the Vuser load.

2. Merge the Response Time graph with the Running Vusers graph: here we can find exactly after which user load a transaction starts taking more response time; once you find that point, note which transaction is slow and what the user load is at that particular point in time.

3. Response Time graph merged with the Throughput graph :-
An increase in response time with constant throughput may be due to a network bandwidth issue.

If you see a decrease in throughput with an increase in response time, then investigate at the server end.

If response time is extremely poor, we need to take a thread dump for analysis.

4. Response Time graph merged with the Errors graph :-
-> Here we can identify the exact time when the first error occurred, as well as the time until which errors were logged.
-> We can also check whether the response time increases after the first error appeared.
-> An increase in error % during ramp-up indicates errors due to user load.
-> If errors are identified during the middle of the steady state, it indicates a queue pile-up at the server.


-> Whenever you identify high response time along with a memory leak, first carry out heap dump analysis, check the frequency of GC, and carry out GC analysis to identify the root cause.

-> If your response time is high and your threads go idle, the CPU usage may also be high.
 
Qus :- If I run the scenario for 1 hour with 10 iterations, how do I know the time each transaction took rather than the average response time?
Ans : Check the raw data from the Avg Transaction Response Time graph.
Look under the menu View -> View Raw Data and then choose the entire scenario time span. This will bring up a sidebar; click on it and it will display all the response time items.
It will also show a little floppy disk icon with which you can save an XLS file with all that data, including the transaction response times.

Qus :- How do we get the response time of each request inside a single transaction for any particular user action?
Ans :- Go to Runtime Settings -> Miscellaneous -> define each action as a transaction, so that we get a response time for each action's request.

Qus : Does LoadRunner capture failed transaction response times?
Ans :- No, LoadRunner captures only the passed transactions' response times.
Qus :- Why does the response time differ between LoadRunner and Dynatrace?
Ans :- LoadRunner captures only the passed transactions' response time, but
            Dynatrace captures both passed and failed transactions' response time.




Friday, March 4, 2022

High CPU Utilization and High CPU Saturation(Load)

There are 4 Core Elements of a System
  • CPU ( Central Processing Unit)
  • Memory
  • Core
  • Network Manager
CPU Load :- the number of processes which are being executed by the CPU or are waiting to be executed by the CPU.
CPU Utilization :- the computer's usage of processing resources, i.e. the amount of work handled by the CPU.
Various factors influence CPU utilization, such as processor speed, the number of cores, and application requirements.
High CPU utilization can result from various factors, including poorly optimized applications, insufficient hardware resources, or malware and other security threats.
Low CPU utilization, on the other hand, could indicate inefficient scheduling or unused hardware resources. Strategies for improving low CPU utilization include implementing load balancing techniques to distribute workloads evenly across available resources, consolidating underutilized resources to maximize efficiency, and upgrading software or hardware components to better align with system requirements.
CPU utilization = (Total time spent on non-idle tasks / Total time) * 100.
For example, if the total time is 10,000 milliseconds and the processor spends 8,000 milliseconds on non-idle tasks, the CPU utilization would be (8,000 / 10,000) x 100 = 80%. This result indicates that the processor is actively engaged in tasks for 80% of the measured time period, providing insight into system performance and potential areas for optimization.
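The same calculation as a small sketch (illustrative values):

public class CpuUtilization {
    // CPU utilization (%) = time spent on non-idle tasks / total time * 100
    static double utilization(double nonIdleMillis, double totalMillis) {
        return (nonIdleMillis / totalMillis) * 100.0;
    }

    public static void main(String[] args) {
        System.out.println(utilization(8000, 10000));   // 80.0
    }
}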

The CPU - Processor Queue Length alarm becomes active when the number of Windows threads waiting for CPU resources exceeds a threshold. Sustained high processor queue length is a good indicator that you have a CPU bottleneck.

If the % Processor Time value is constantly high, check the disk and network metrics first. If they are low, the processor might be under stress. To confirm this, check the average and current Processor Queue Length value. If these values are higher than the recommended, it clearly indicates a processor bottleneck. The final solution for this situation is adding more processors, as this will enable more requests to be executed simultaneously. If the Processor Queue Length value is low (the recommendations are given below), consider using more powerful processors

CPU Load: CPU Load, mathematically, is the amount of work that is performed by CPU as a percentage of total capacity. Each process waiting for CPU increments the load by 1 and a process that is served decrements the Load by 1. Load average is a measurement of how many tasks are waiting in a kernel run queue (not just CPU time but also disk activity) over a period of time.

CPU Utilization(usage): It is a measure of time a CPU is not idle. It can also be defined as a measure of how busy the CPU is right now.

[Processor] "% Processor time" :  if this counter is constantly high, say above 90%, then you'll need to use other counters (described below) in order to further determine the root cause of the CPU pressure.

[Processor] "% Privileged time" : a high percentage of privileged time, anything consistently above 25%, can usually point to a driver or hardware issue and should be investigated. 
 
[Processor] "Queue Length"
This is the number of threads that are ready to execute but waiting for a core to become available.  On single core machines a sustained value greater than 2-3 can mean that you have some CPU pressure.  Similarly, for a multicore machine divide the queue length by the number of cores and if that is continuously greater than 2-3 there might be CPU pressure.
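A small sketch of the per-core rule of thumb described above (the 2-3 per core threshold comes from the note itself and should be treated as a heuristic, not a hard limit):

public class QueuePressureCheck {
    // true if the sustained processor queue length per core suggests CPU pressure
    static boolean cpuPressure(double processorQueueLength, int cores) {
        return (processorQueueLength / cores) > 2.0;
    }

    public static void main(String[] args) {
        System.out.println(cpuPressure(10, 4));   // 2.5 per core -> true
        System.out.println(cpuPressure(4, 4));    // 1.0 per core -> false
    }
}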

High CPU and Low Throughput :-
For any load test, if you observe high CPU and low throughput, what would you conclude?
  1. Check whether the CPU spike is on a single core or across all cores.
  2. Check whether core-level utilization is above 75%.
  3. Generate a thread dump to understand thread status.
  4. A deadlock could be possible.
  5. Sometimes GC happening too rarely, or too frequently, can cause the CPU to spike.
  6. Check whether the spike lasts for a longer duration (say 10 minutes or more) or resolves automatically.
  7. The CPU spike could also be because of incorrect thread configuration in the app server or DB connection configuration in the database server.
  8. Check the heap dump to understand the memory management.
  9. It could also be because classes, methods, or objects are being consumed and not released properly; this could be a code-level or framework-level issue, a missing patch, or a side effect of security fixes.
  10. Sometimes long-running queries in the DB server, linked to code that opens connections without releasing them, can cause a CPU spike; a DB report can help you understand the issue.
  11. High CPU and low throughput have many possible interactions; if the load balancer is not properly configured and traffic is routed to a single server instead of being split across multiple servers, that can cause high CPU and memory issues as well.
A single question can be linked to many parts of the application or its subsystems; the interviewer wants to see your debugging skills.
================================
Troubleshooting CPU problems in production, especially in a cloud environment:
Our application might have millions of lines of code, and trying to identify the exact line of code that is causing the CPU to spike up might be the equivalent of finding a needle in a haystack.
To help readers better understand this troubleshooting technique, we built a sample application and deployed it into an AWS EC2 instance. Once this application was launched, it caused CPU consumption to spike up to 199.1%. Now let's walk through the steps that we followed while troubleshooting this problem. Basically, there are 3 simple steps:
  1. Identify threads that consume CPU
  2. Capture thread dumps
  3. Identify lines of code that is causing CPU to spike up
1. Identify threads that are causing CPU to spike
In the EC2 instance, multiple processes could be running. The first step is to identify the process that is causing the CPU to spike up. The best way to do this is to use the ‘top’ command that is present in *nix flavors of operating systems.

Issue command ‘top’ from the console

$ top
This command will display all the processes that are running in the EC2 instance sorted by high CPU consuming processes displayed at the top. When we issued the command in the EC2 instance we were seeing the below output:
Fig: ‘top’ command issued from an AWS EC2 instance
From the output, you can notice process #31294 to be consuming 199.1% of CPU. That is pretty high consumption. OK, now we have identified the process in the EC2 instance which is causing the CPU to spike up. The next step is to identify the threads within this process that are causing the CPU to spike up.

Issue command ‘top -H -p {pid}’ from the console. Example : $ top -H -p 31294
This command will display all the threads that are causing the CPU to spike up in this particular 31294 process. When we issued this command in the EC2 instance, we were seeing the below output:
Fig: ‘top -H -p {pid}’ command issued from an AWS EC2 instance 
From the output you can notice:

Thread Id 31306 consuming 69.3% of CPU
Thread Id 31307 consuming 65.6% of CPU
Thread Id 31308 consuming 64.0% of CPU
All remaining threads consume a negligible amount of CPU.
This is a good step forward, as we have identified the threads that are causing the CPU to spike. As the next step, we need to capture thread dumps so that we can identify the lines of code that are causing the CPU to spike up.

2. Capture thread dumps
A thread dump is a snapshot of all threads that are present in the application. Thread state, stack trace (i.e. code path that thread is executing), thread Id related information of each thread in the application is reported in the thread dump.

There are 8 different options to capture thread dumps. You can choose the option that is convenient for you. One of the simplest options to take a thread dump is to use the ‘jstack’ tool, which is packaged with the JDK. This tool can be found in the $JAVA_HOME/bin folder. Below is the command to capture a thread dump:

 jstack -l  {pid} > {file-path} 
where
pid: is the process Id of the application, whose thread dump should be captured

file-path: is the file path where thread dump will be written in to.

Example: jstack -l 31294 > /opt/tmp/threadDump.txt 
As per the example, thread dump of the process would be generated in /opt/tmp/threadDump.txt file.

3. Identify lines of code that is causing CPU to spike up
Next step is to analyze the thread dump to identify the lines of code that is causing the CPU to spike up. We would recommend analyzing thread dumps through fastThread, a free online thread dump analysis tool.

We then uploaded the captured thread dump to the fastThread tool. The tool generated a visual report with multiple sections. In the top right corner of the report there is a search box; there we entered the Ids of the threads which were consuming high CPU, i.e. the thread Ids that we identified in step #1: ‘31306, 31307, 31308’.

The fastThread tool displayed all these 3 threads' stack traces as shown below.

You can notice that all 3 threads are in RUNNABLE state and executing this line of code:

 com.buggyapp.cpuspike.Object1.execute(Object1.java:13) 
Apparently, the following is the application source code:

package com.buggyapp.cpuspike;

/**
 *
 * @author Test User
 */
public class Object1 {

    public static void execute() {
        // non-terminating loop: spins forever and never yields the CPU
        while (true) {
            // invoked an infinite number of times
            doSomething();
        }
    }

    public static void doSomething() {
        // does nothing; the busy loop above is what burns the CPU
    }
}
You can see that line #13 in Object1.java is ‘doSomething();’. The ‘doSomething()’ method does nothing, but it is invoked an infinite number of times because of the non-terminating while loop at line #11. If a thread loops an infinite number of times like this, the CPU will start to spike up. That is exactly what is happening in this sample program. If the non-terminating loop is fixed, this CPU spike problem will go away.
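As a hedged illustration (not part of the original sample), one way the non-terminating loop could be fixed is to give it an exit condition and stop spinning the CPU:

public class Object1 {
    private static volatile boolean running = true;

    public static void execute() throws InterruptedException {
        while (running) {            // the loop now has an exit condition
            doSomething();
            Thread.sleep(10);        // yields the CPU instead of spinning flat out
        }
    }

    public static void stop() {
        running = false;             // lets another thread end the loop
    }

    public static void doSomething() {
    }
}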

Conclusion
To summarize: first we use the ‘top’ tool to identify the thread Ids that are causing the CPU to spike up, then we capture thread dumps, and finally we analyze the thread dumps to identify the exact lines of code that are causing the CPU to spike up. Enjoy troubleshooting, happy hacking!





