Java Application Performance

JIT Compilation and CodeCache

The JVM interprets bytecode, and JIT compilation compiles frequently run code to native machine code. To find out which code blocks or methods are JIT-compiled, use the -XX:+PrintCompilation flag.

The first column is the number of milliseconds since the VM started. The second is the order in which the code block/method was compiled (some take longer to compile than others). The third column can contain the characters: n for a native method, s for a synchronized method, ! for a method with exception handling, and % meaning the code has been natively compiled and is running from the code cache via on-stack replacement (the most optimized state). The last number shows the compilation level.
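An illustrative PrintCompilation snapshot (not from a real run) showing these columns might look like:

```text
 112    1       3       java.lang.String::hashCode (55 bytes)
 115    2   s   1       java.util.Vector::size (5 bytes)
 130    3 % !   4       Main::longRunningLoop @ 2 (38 bytes)
```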

The C1 compiler produces native code at levels 1, 2 and 3 (each progressively more complex), and the C2 compiler produces level 4 and places the code in the code cache. The VM profiles the code and selects the appropriate level.

If you want to log the PrintCompilation output – for example when you do not have access to the console on a remote machine – use the following command:
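A sketch of such a command, assuming a main class called Main (-XX:+LogCompilation requires diagnostic options to be unlocked, and writes an XML log file in the working directory):

```shell
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation Main
```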

To get the size of the code cache, use the -XX:+PrintCodeCache JVM flag. In this example it reported a size of 120 MB.
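For example (Main is a placeholder class name; the code cache statistics are printed when the application exits):

```shell
java -XX:+PrintCodeCache Main
```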

On a 64-bit JVM in Java 8, the code cache can grow up to 240 MB. To change the code cache size, use:

InitialCodeCacheSize – when the application starts

ReservedCodeCacheSize – Max Size of CodeCache

CodeCacheExpansionSize – the amount by which the code cache grows when it needs more room
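A sketch combining the three flags (the values are illustrative, and Main is a placeholder):

```shell
java -XX:InitialCodeCacheSize=32m -XX:ReservedCodeCacheSize=240m -XX:CodeCacheExpansionSize=1m Main
```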

You can also use JConsole to check the code cache size at runtime, but keep in mind that attaching JConsole adds roughly 2 MB of code cache usage, needed for the JVM to communicate with JConsole.

Selecting JVM

On a 64-bit OS, you can run either a 32-bit or a 64-bit JVM.

You choose the client or server compiler at runtime using the “-client” or “-server” flag, and “-d64” for the 64-bit server VM. However, these flags do not always work as expected on different operating systems.

To turn off tiered compilation, use the -XX:-TieredCompilation flag, although there is rarely a good reason to do so.

To see how many threads are available for native compilation of the code, use jinfo as follows:
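For example, where <pid> is the process id of the running JVM:

```shell
jinfo -flag CICompilerCount <pid>
```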

To change the number of threads available for native compilation
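This is done with the -XX:CICompilerCount flag (the value here is illustrative, and Main is a placeholder):

```shell
java -XX:CICompilerCount=6 Main
```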

To control how many times a method must run before JVM decides to compile it, use
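This is the -XX:CompileThreshold flag (default 10000; Main is a placeholder):

```shell
java -XX:CompileThreshold=5000 Main
```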

The Stack and the Heap

Stack – only local primitives; Heap – objects, including Strings. Every thread in a Java application has its own stack, but all threads share the heap. A pointer (reference) to an object on the heap is stored on the stack.

For objects passed into methods, the REFERENCE to the object is passed BY VALUE. Primitives are also passed by value: a copy of the variable is made on the stack and that copy is passed to the method.

The final keyword – the reference cannot be re-assigned to point at a different object, but the properties of the object it points to can still be changed.
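A minimal sketch of this distinction:

```java
import java.util.ArrayList;
import java.util.List;

public class FinalDemo {
    public static void main(String[] args) {
        final List<String> names = new ArrayList<>();
        names.add("Alice");            // allowed: the object's state can change
        // names = new ArrayList<>();  // compile error: a final reference cannot be re-assigned
        System.out.println(names);     // prints [Alice]
    }
}
```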

Escaping References

When you return a collection from a getter method, that reference has escaped: the caller could call clear() on the collection, which was never the intent of returning the reference.

To fix this,

  1. Have the class implement the Iterable interface and provide an iterator() method. The caller can then loop through e.g. Customer records. It is still possible to delete Customer objects from the collection via the iterator, but harder.
  2. Return a copy of the collection in the getter, e.g. return new ArrayList<>(myList) – involves copying, which is expensive for large collections
  3. Return Collections.unmodifiableList(myList) (available in all modern versions) or List.copyOf(myList) (Java 10+) – the best solution

If you are returning a custom object (e.g. Customer), add a copy constructor so that the getter returns a copy. But the caller will not know whether they are modifying a copy or the original, which can be misleading. A better way – have Customer implement a ReadonlyCustomer interface without a setName() method and return ReadonlyCustomer from findCustomer(); the caller then gets a compile-time error on setName().
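A sketch of the copy-constructor and read-only interface approach (the class and method names are assumptions for illustration):

```java
public class EscapingRefDemo {

    interface ReadonlyCustomer {
        String getName();                 // note: no setName() on the interface
    }

    static class Customer implements ReadonlyCustomer {
        private String name;
        Customer(String name) { this.name = name; }
        Customer(Customer other) { this.name = other.name; }   // copy constructor
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    static ReadonlyCustomer findCustomer() {
        return new Customer("Alice");     // caller only sees the read-only view
    }

    public static void main(String[] args) {
        ReadonlyCustomer c = findCustomer();
        System.out.println(c.getName());  // prints Alice
        // c.setName("Bob");              // compile error: setName is not on ReadonlyCustomer
    }
}
```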

The Metaspace (java 8 and above)

Besides the stacks and the heap, the JVM maintains an area called the Metaspace. It holds class metadata (such as which methods have been natively compiled) and all static variables (primitives and object references).

Java 7 and below had a PermGen space, which was removed in Java 8. PermGen had a fixed maximum size and could run out of memory; the Metaspace grows automatically by default (bounded only by native memory, or by -XX:MaxMetaspaceSize if set), so it is far less likely to run out.

The String Pool

String literals are placed into a pool by the JVM, so two identical literals refer to the same object.

However, strings computed at runtime (e.g. the result of a calculation or concatenation with a variable) are not placed in the pool.

But we can use the intern() method to ask the JVM to place such a string in the pool.
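The three cases above can be sketched as follows (== compares references, not contents):

```java
public class StringPoolDemo {
    public static void main(String[] args) {
        String a = "hello";
        String b = "hello";
        System.out.println(a == b);           // true: both literals come from the pool

        String c = Integer.toString(76);      // computed at runtime, not pooled
        String d = "76";
        System.out.println(c == d);           // false: c is a fresh object on the heap

        System.out.println(c.intern() == d);  // true: intern() returns the pooled copy
    }
}
```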

The string pool is a HashMap in which strings are stored by their hash codes. To see the number of buckets (and other string table statistics), use the -XX:+PrintStringTableStatistics flag:

By default there are 65,536 buckets, and shortly after startup there are already around 1,732 strings in the pool – these are core Java library strings.
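For example (Main is a placeholder; the statistics are printed when the application exits):

```shell
java -XX:+PrintStringTableStatistics Main
```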

To set the size of the string pool, use -XX:StringTableSize=n, ideally a large prime number

Setting up the Heap Size

-XX:MaxHeapSize=600m or -Xmx600m => sets the max heap to 600 MB

-XX:InitialHeapSize=1g or -Xms1g => sets the initial heap to 1 GB
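For example, starting with a 1 GB heap that may grow to 4 GB (values illustrative, Main is a placeholder):

```shell
java -Xms1g -Xmx4g Main
```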

Monitoring the Heap

Java VisualVM – to monitor the heap over time and see garbage collection cycles

To see which objects are occupying the heap, use Memory Analyzer (MAT)

Generational Garbage Collection

Removing objects from the heap which are no longer reachable

General mechanism – mark and sweep. Instead of looking for all the objects to remove, the GC looks for all the objects to retain and rescues them.

Marking – program execution is stopped and all threads are paused; the GC checks every variable on the stacks and in the metaspace and follows its references; all reachable objects are marked as alive.

Sweeping – any object not marked during the marking phase can be freed, and the live objects are moved into a contiguous block of memory to avoid fragmentation.

Since the GC is not really looking for garbage but for live objects, the more garbage there is, the faster the collection – there is less to mark.

Most objects don’t live for long, but if an object survives one GC it is likely to live for a long time. The heap is organized into 2 sections – the young generation and the old generation. The young generation is smaller than the old and is further divided into 3 sections. Marking the young generation is quick because it is small, so the application freeze goes unnoticed. Surviving objects are moved to the old generation and new objects are added to the young. A GC of the young generation is known as a “minor collection”; a running app will have many minor collections and only a few major (old generation) collections.

Java 8 – the young generation is split into Eden and the S0 and S1 survivor spaces. When the app starts, all 3 spaces are empty. Newly created objects are placed in Eden; when it gets full, which happens quickly, the GC runs on the Eden space and moves any surviving objects to either S0 or S1. The next time, the GC looks at Eden plus one survivor space and moves the survivors to the other survivor space. An object that has been swapped between S0 and S1, say, 5 times is 5 generations old. After a certain threshold, the object is moved to the old generation.

To monitor when GC is taking place, use -verbose:gc
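For example (Main is a placeholder; each collection then prints a summary line to the console):

```shell
java -verbose:gc Main
```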

Tuning Garbage Collection

Young generation garbage collection is preferred over old generation garbage collection because its pauses are much shorter.

-XX:NewRatio=n

The meaning of this flag – how many times bigger should the old gen be compared to the young gen? If n = 4, the old gen is 4 times bigger than the young: with a 10 MB heap, the old gen gets 8 MB and the young gen 2 MB. To see the default value, find the process id first and then run “jinfo -flag NewRatio PID”; usually n = 2 by default. To increase the size of the young gen and reduce the old gen, n has to be lower; it must be a whole number, so the only choice is n = 1, dividing the heap equally between old and young.

-XX:SurvivorRatio=n

The meaning of this flag – the ratio of the Eden space to each of the survivor spaces S0 and S1. The default value (per jinfo) is 8, which means Eden:S0:S1 = 8:1:1, so each survivor space takes 1/10 of the young gen and Eden takes the remaining 8/10. If we reduce this value to 5, each survivor space grows to 1/7 of the young gen.

-XX:MaxTenuringThreshold=n

The meaning – how many times must an object survive a minor collection before it becomes part of the old generation? We want objects to stay in the young gen as long as possible so that short-lived objects are never tenured. The default is 15, which is also the maximum value for this flag.
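A sketch combining the three tuning flags discussed above (values are illustrative, Main is a placeholder):

```shell
java -XX:NewRatio=1 -XX:SurvivorRatio=5 -XX:MaxTenuringThreshold=10 Main
```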

Selecting a garbage collector

The default collector changed in Java 9. In Java 8 there are 3 types of collectors:

Serial – uses single thread for all garbage collection, use -XX:+UseSerialGC

Parallel – uses multiple threads for minor (young gen) collections; use -XX:+UseParallelGC. This is the default GC for Java 8 and below.

Mostly Concurrent – closest thing to real time GC, application paused during the mark phase but not during sweep phase. There are 2 types of concurrent GC :

Concurrent Mark Sweep (CMS) Collector – -XX:+UseConcMarkSweepGC (deprecated since Java 9)

G1 Collector – the default from Java 9; use -XX:+UseG1GC. G1 works very differently. The heap is split into around 2048 regions, and regions are allocated to the different parts of the heap (Eden, S0, S1 and Old). After each minor GC, the number of regions allocated to each part of the young generation is re-adjusted to what the JVM thinks is optimal – an S0 region might be reassigned to Eden, a previously unallocated region might become Eden, and so on. During a full GC, G1 looks for the old regions that contain mostly garbage and collects those first – that is why it is called G1, the Garbage First collector. Clearing just a few old regions is often enough instead of a full collection, so G1's performance should be better.

You should not have to tune G1 collector but here are the flags

-XX:ConcGCThreads=n – the number of threads available for smaller regional collections

-XX:InitiatingHeapOccupancyPercent=n – default 45%, G1 runs when the entire heap (old, Survivor and Eden combined) is 45% full

-XX:+UseStringDeduplication (only for the G1 collector) – allows the collector to reclaim space by deduplicating identical strings on the heap

Java Mission Control

Using a profiler – is the CPU usage too high? Is there heavy network or disk usage? To find out what is going on inside the JVM, use a profiler application. JProfiler and YourKit are commercial profilers; JMC (Java Mission Control) is open source.

MBean Server – the same data as a JMX console, shows live metrics

Flight Recorder – records historical performance data, e.g. the run-up to an application crash

Live Set + Fragmentation – an important dial – shows how full the heap is after GC; if it stays close to 100%, that indicates a memory leak

To use flight recorder, for Oracle JDK, -XX:+UnlockCommercialFeatures -XX:+FlightRecorder, for OpenJDK, only use -XX:+FlightRecorder

To start flight recording on cmd line :

-XX:StartFlightRecording=delay=2min,duration=60s,name=Test,filename=recording.jfr,settings=profile

Start recording 2 minutes after the app starts, record for 60 seconds, and save the recording to recording.jfr.

Assessing Performance

Using the time taken to execute a piece of code (a microbenchmark – a single method or code block) as a measure has complications:

Native compilation – comparing 2 versions of code when only 1 has been natively compiled gives false results

Garbage collection – takes place while our code is running

Assessing in isolation – the benchmarked code does not compete for resources with the rest of the project as it would in a real deployment

Different hardware – benchmarked on dev machine, different hardware on prod machine

System.currentTimeMillis() – to measure the elapsed time difference

Add warm-up loop calling the same method many times so native compilation takes place before you start micro benchmarking, then add -XX:+PrintCompilation flag to make sure method was native compiled in the warm-up time.

Use -XX:CompileThreshold=1000 so that method gets natively compiled faster in the warmup time, default is 10000
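A hand-rolled sketch of this warm-up pattern (sumTo is a stand-in workload; the loop counts are illustrative):

```java
public class MicrobenchmarkDemo {

    // Stand-in workload to benchmark
    private static long sumTo(long n) {
        long s = 0;
        for (long i = 1; i <= n; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        // Warm-up loop: call the method enough times to trigger native compilation
        for (int i = 0; i < 20_000; i++) sumTo(1_000);

        long start = System.currentTimeMillis();
        long result = sumTo(100_000_000L);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("result=" + result + " took " + elapsed + " ms");  // result=5000000050000000
    }
}
```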

Java Microbenchmark Harness (JMH) – sets up warm up time and analyzes perf in a more production like env, runs code 1000s of times to produce summary output

Add the @Benchmark annotation on top of the method you want to measure, run mvn clean install on the JMH source code, and run the generated benchmarks.jar runnable jar.

Adding “-bm avgt” gives average time to run the benchmarked code

Lists in Java

There are 8 different types of Lists in java

AttributeList – used only with MBean objects, not a generic list

RoleList, RoleUnresolvedList – used only with Role objects, not generic

CopyOnWriteArrayList – efficient when there are few mutations and mostly traversal. Used in multi-threaded applications where multiple threads access the same list with lots of iteration and reads but very few writes/additions/removals

ArrayList – backed by an array. The initial size is zero, but an internal backing array of 10 is allocated on the heap on the first add. When full, the capacity grows to current_size + (current_size >> 1), i.e. roughly 1.5×

Vector – present since Java 1.0 and kept for backwards compatibility; Vector is thread-safe, which comes at a performance cost

Stack – child of Vector, LIFO, use LinkedList instead

LinkedList – implements List and Deque (double ended queue) interfaces, has pointers to the prev and next nodes

Adding an item to the end of a list – an ArrayList might need resizing; a LinkedList never does, and Java maintains a reference to the last node, so it can go straight to the end

Adding an item to the start of a list – a LinkedList is quick, but an ArrayList is not – all existing items need to be moved to the right

Removing an item – ArrayList: all items after it need to be shifted to the left; LinkedList: just change the pointers on either side of the item, but first the item has to be found by walking the list. get() by index is faster in an ArrayList than in a LinkedList
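A small demo of the trade-offs above:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListDemo {
    public static void main(String[] args) {
        LinkedList<Integer> ll = new LinkedList<>(List.of(2, 3));
        ll.addFirst(1);                 // cheap: just rewires the head pointer
        System.out.println(ll);         // [1, 2, 3]

        ArrayList<Integer> al = new ArrayList<>(List.of(1, 2, 3));
        al.remove(0);                   // every later element shifts one slot left
        System.out.println(al);         // [2, 3]

        System.out.println(al.get(1));  // 3 - indexed access is fast on an ArrayList
    }
}
```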

Maps in Java

It takes the same amount of time to retrieve an item from a hashmap of 10 items or a hashmap of billion items.

When you create a HashMap, it starts with an initial bucket array of size 16.

The key is always converted to an integer using its hash code value.

bucket = hashcode % number of buckets (here % is modulo)

System.out.println("Little Women".hashCode()); // 675377748

675377748 % 16 = 4

So the object will be stored in bucket# 4.
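The arithmetic above can be checked directly (note: the real HashMap additionally mixes the hash bits and uses a bitwise AND rather than a plain modulo, but this simplified model gives the same idea):

```java
public class BucketDemo {
    public static void main(String[] args) {
        int hash = "Little Women".hashCode();
        System.out.println(hash);       // 675377748
        System.out.println(hash % 16);  // 4 -> stored in bucket 4
    }
}
```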

There can be many objects stored in the same bucket. The bucket contains a linked list of objects, and a new object is added to the end of that linked list (since Java 8, long chains are converted to balanced trees).

HashMaps grow according to a load factor, default 0.75 (3/4). Once the number of entries exceeds 3/4 of the number of buckets, the map is considered to be getting full and it doubles its size. When the map grows, all the items need to be rehashed and new bucket numbers calculated – a significant overhead.

Specify the initial capacity and load factor when creating the HashMap.
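For example (the capacity and load factor values are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class MapSizingDemo {
    public static void main(String[] args) {
        // 500 initial buckets; resize once the entry count exceeds 60% of capacity
        Map<String, Integer> stock = new HashMap<>(500, 0.6f);
        stock.put("widgets", 42);
        System.out.println(stock.get("widgets"));  // 42
    }
}
```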

Rules for hashCode() on custom objects – it should produce a good range of numbers so that objects get placed in different buckets, and equal objects must have equal hash codes.

When we iterate through a HashMap, we get the entries back in no particular order. A LinkedHashMap preserves the order in which items were added – it maintains an additional linked list threaded across the buckets.
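A quick demonstration of the ordering difference:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
    public static void main(String[] args) {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("first", 1);
        m.put("second", 2);
        m.put("third", 3);
        System.out.println(m.keySet());  // [first, second, third] - insertion order preserved
    }
}
```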

GraalVM

To use the GraalVM compiler with OpenJDK 11 (Linux only), add the following flags:

-XX:+UnlockExperimentalVMOptions

-XX:+EnableJVMCI

-XX:+UseJVMCICompiler