Tuesday, January 17, 2012

Java 7: How to write really fast Java code

When I first wrote this blog my intention was to introduce you to a class ThreadLocalRandom which is new in Java 7 to generate random numbers. I have analyzed the performance of ThreadLocalRandom in a series of micro-benchmarks to find out how it performs in a single threaded environment. The results were relatively surprising: although the code is very similar, ThreadLocalRandom is twice as fast as Math.random()! The results drew my interest and I decided to investigate this a little further. I have documented my anlysis process. It is an examplary introduction into analysis steps, technologies and some of the JVM diagnostic tools required to understand differences in the performance of small code segments. Some experience with the described toolset and technologies will enable you to write faster Java code for your specific Hotspot target environment.

OK, that's enough talk, let's get started! My machine is an ordinary Intel x86, Family 6, 3 GHz, 32-bit, dual core running Windows XP Professional.

Math.random() works on a static singleton instance of Random whilst ThreadLocalRandom -> current() -> nextDouble() works on a thread local instance of ThreadLocalRandom which is a subclass of Random. ThreadLocal introduces the overhead of variable look up on each call to the current()-method. Considering what I've just said, then it's really a little surprising that it's twice as fast as Math.random() in a single thread, isn't it? I didn't expect such a significant difference.


Again, I am using a tiny micro-benchmarking framework presented in one of Heinz blogs. The framework that Heinz developed takes care of several challenges in benchmarking Java programs on modern JVMs. These challenges include: warm-up, garbage collection, accuracy of Javas time API, verification of test accuracy and so forth.

Here are my runnable benchmark classes:

public class ThreadLocalRandomGenerator implements BenchmarkRunnable {

 private double r;
 
 @Override
 public void run() {
  r = r + ThreadLocalRandom.current().nextDouble();
 }

 public double getR() {
  return r;
 }

 @Override
 public Object getResult() {
  return r;
 }
  
}

public class MathRandomGenerator implements BenchmarkRunnable {

 private double r;

 @Override
 public void run() {
  r = r + Math.random();
 }

 public double getR() {
  return r;
 }

 @Override
 public Object getResult() {
  return r;
 }
}

Let's run the benchmark using Heinz' framework:


Notice: To make sure the JVM does not identify the code as "dead code" I return a field variable and print out the result of my benchmarking immediately. That's why my runnable classes implement an interface called RunnableBenchmark. I am running this benchmark three times. The first run is in default mode, with inlining and JIT optimization enabled:

Benchmark target: MathRandomGenerator
Mean execution count: 14773594,4
Standard deviation: 180484,9
To avoid dead code coptimization: 6.4005410634212025E7
Benchmark target: ThreadLocalRandomGenerator
Mean execution count: 29861911,6
Standard deviation: 723934,46
To avoid dead code coptimization: 1.0155096190946539E8

Then again without JIT optimization (VM option -Xint):

Benchmark target: MathRandomGenerator
Mean execution count: 963226,2
Standard deviation: 5009,28
To avoid dead code coptimization: 3296912.509302683
Benchmark target: ThreadLocalRandomGenerator
Mean execution count: 1093147,4
Standard deviation: 491,15
To avoid dead code coptimization: 3811259.7334526842

The last test is with JIT optimization, but with -XX:MaxInlineSize=0 which (almost) disables inlining:

Benchmark target: MathRandomGenerator
Mean execution count: 13789245
Standard deviation: 200390,59
To avoid dead code coptimization: 4.802723374491231E7
Benchmark target: ThreadLocalRandomGenerator
Mean execution count: 24009159,8
Standard deviation: 149222,7
To avoid dead code coptimization: 8.378231170741305E7

Let's interpret the results carefully: With full JVM JIT optimization the ThreadLocalRanom is twice as fast as Math.random(). Turning JIT optimization off shows that the two perform equally good (bad) then. Method inlining seems to make 30% of the performance difference. The other differences may be due to other otimization techniques.

One reason why the JIT compiler can tune ThreadLocalRandom more effectively is the improved implementation of ThreadLocalRandom.next().

public class Random implements java.io.Serializable {
...
    protected int next(int bits) {
        long oldseed, nextseed;
        AtomicLong seed = this.seed;
        do {
            oldseed = seed.get();
            nextseed = (oldseed * multiplier + addend) & mask;
        } while (!seed.compareAndSet(oldseed, nextseed));
        return (int)(nextseed >>> (48 - bits));
    }
...
}

public class ThreadLocalRandom extends Random {
...
    protected int next(int bits) {
        rnd = (rnd * multiplier + addend) & mask;
        return (int) (rnd >>> (48-bits));
    }
...
}

The first snippet shows Random.next() which is used intensively in the benchmark of Math.random(). Compared to ThreadLocalRandom.next() the method requires significantly more instructions, although both methods do the same thing. In the Random class the seed variable stores a global shared state to all threads, it changes with every call to the next()-method. Therefore AtomicLong is required to safely access and change the seed value in calls to nextDouble(). ThreadLocalRandom on the other hand is - well - thread local :-) The next()-method does not have to be thread safe and can use an ordinary long variable as seed value.

About method inlining and ThreadLocalRandom

One very effective JIT optimization is method inlining. In hot paths executed frequently the hotspot compiler decides to inline the code of called methods (child method) into the callers method (parent method). "Inlining has important benefits. It dramatically reduces the dynamic frequency of method invocations, which saves the time needed to perform those method invocations. But even more importantly, inlining produces much larger blocks of code for the optimizer to work on. This creates a situation that significantly increases the effectiveness of traditional compiler optimizations, overcoming a major obstacle to increased Java programming language performance."

Since OpenJDK 7 debug builds you can monitor method inlining by using diagnostic JVM options. Running the code with '-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining' will show the inlining efforts of the JIT compiler. Here are the relevant sections of the output for Math.random() benchmark:

@ 13   java.util.Random::nextDouble (24 bytes)
  @ 3   java.util.Random::next (47 bytes)   callee is too large
  @ 13   java.util.Random::next (47 bytes)   callee is too large

The JIT compiler cannot inline the Random.next() method that is called in Random.nextDouble(). This is the inlining output of ThreaLocalRandom.next():

@ 8   java.util.Random::nextDouble (24 bytes)
  @ 3   java.util.concurrent.ThreadLocalRandom::next (31 bytes)
  @ 13   java.util.concurrent.ThreadLocalRandom::next (31 bytes)

Due to the fact that the next()-method is shorter (31 bytes) it can be inlined. Because the next()-method is called intensively in both benchmarks this log suggests that method inlining may be one reason why ThreadLocalRandom performs significantly faster.

To verify that and to find out more it is required to deep dive into assembly code. With Java 7 JDKs it is possible to print out assembly code into the console. See here on how to enable -XX:+PrintAssembly VM Option. The option will print out the JIT optimized code, that means you can see the code the JVM actually executes. I have copied the relevant assembly code into the links below.

Assembly code of ThreadLocalRandomGenerator.run() here.
Assembly code of MathRandomGenerator.run() here.
Assembly code of Random.next() called by Math.random() here.

Assembly code is machine-specific and low level code, it's more complicated to read then bytecode. Let's try to verify that method inlining has a relevant effect on performance in my benchmarks and: are there other obvious differences how the JIT compiler treats ThreadLocalRandom and Math.random()? In ThreadLocalRandomGenerator.run() there is no procedure call to any of the subroutines like Random.nextDouble() or ThreatLocalRandom.next(). There is only one virtual (hence expensive) method call to ThreadLocal.get() visible (see line 35 in ThreadLocalRandomGenerator.run() assembly). All the other code is inlined into ThreadLocalRandomGenerator.run(). In the case of MathRandomGenerator.run() there are two virtual method calls to Random.next() (see block B4 line 204 ff. in the assembly code of MathRandomGenerator.run()). This fact confirms our suspicion that method inlining is one important root cause for the performance difference. Further more, due to synchronization hassle, there are considerably more (and some expensive!) assembly instructions required in Random.next() which is also counterproductive in terms of execution speed.

Understanding the overhead of the invokevirtual instruction

So why is (virtual) method invocation expensive and method inlining so effective? The pointer of invokevirtual instructions is not an offset of a concrete method in a class instance. The compiler does not know the internal layout of a class instance. Instead, it generates symbolic references to the methods of an instance, which are stored in the runtime constant pool. Those runtime constant pool items are resolved at run time to determine the actual method location. This dynamic (run-time) binding requires verification, preparation and resolution which can considerably effect performance. (see Invoking Methods and Linking in the JVM Spec for details)

That's all for now. The disclaimer: Of course, the list of topics you need to understand to solve performance riddles is endless. There is a lot more to understand then micro-benchmarking, JIT optimization, method inlining, java byte code, assemby language and so forth. Also, there are lot more root causes for performance differences then just virtual method calls or expensive thread synchronization instructions. However, I think the topics I have introduced are a good start into such deep diving stuff. Looking forward to critical and enjoyable comments!

Cheers,
Niklas

21 comments:

  1. Thanks for letting us know about ThreadLocalRandom and JVM options '-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining' but doesn't use of ThreadLocal can posses risk of memory leak and loss of randomness between multiple thread, chances of getting same number on two different thread is more but I think for many practical purpose its good solution. keep writing about new topics on Java7 very informative.

    Thanks
    Javin
    10 JVM options Java developer should know

    ReplyDelete
  2. This is a very well written article and examples. I learned something new, and added a couple of new tools to the tool box.

    Thanks for taking the time to write the article.

    ReplyDelete
    Replies
    1. Hi John, thx for the nice comment. Cheers, Niklas

      Delete
  3. You need to add -XX:BiasedLockingStartupDelay=0 for any microbenchmark that calls synchronized methods for realistic results. (I didn't see that when I briefly scanned the page, sorry if I missed it).

    Best,
    Ismael

    ReplyDelete
    Replies
    1. We could have a long argument on this comment :) its true for benchmarks focussing on the algorithm under laboratory conditions :)

      Delete
    2. I don't know what you mean by that statement. In a real word application, your CPU-intensive code is likely to run after the default startup delay for biased locking and hence why a micro-benchmark without it is misleading if it involves synchronized methods.

      This is the official position by HotSpot developers (past and present).

      Best,
      Ismael

      Delete
    3. After reading the Math.random code in more detail, I see that it now uses a CAS instead of synchronized. I don't think biased locking is used for that (although I am not 100% sure).

      Delete
    4. Benchmark target: ThreadLocalRandomGenerator
      Mean execution count: 31534175,8
      Standard deviation: 191027,16
      To avoid dead code optimization: 1.0468182661632672E8
      Benchmark target: MathRandomGenerator
      Mean execution count: 16274804,2
      Standard deviation: 9816,36
      To avoid dead code optimization: 5.68644084533832E7
      VM option 'BiasedLockingStartupDelay=0'

      Delete
    5. CAS at the Java level may appear atomic; its not on a xeon processor these days. I've the paper around here somewhere dealing cache migration and java complications to cache coherency; the latter hopefully mitigated somewhat in JDK 7, but I don't know how w/o individual thread memory pools.

      Delete
  4. Thanks for this informative article!!!

    ReplyDelete
  5. Nice article about some fundamental jvm args. Didn't know that there where dual core 386 available back in the 90ies. ;-)

    ReplyDelete
    Replies
    1. :-D I was thinking to do all that on a Commodore 64, but then may be the half of the community didn't know what I am talking about ;-)

      Delete
  6. Thanks for this very detailed differentiating explanation between the two and the options that'll no doubt help me solve many of the problems i am facing because of inlining.

    ReplyDelete
  7. I'm a bit disappointed with this article. I was expecting more tips. Unfortunately this article only mentions a optimization for something that isn't used that often.
    Since the article mentions the overhead associated with virtual methods, it could also point out how to minimize that overhead. Using static and final methods can result (sometimes) in performance improvements, and is something that has a wider range of use.

    ReplyDelete
    Replies
    1. Hi Joao, I was focussing on the procedure on how to deep dive into exactly that performance questions you mention. How can you see exactly what's happening? That knowledge is the precondition for being able to analyze and write fast code. The list of concrete performance tipps is very long. In this case however the synchronization associated with AtomicLong and virual method calls apperently results in a significant performance difference.
      Cheers, Niklas

      Delete
  8. Thanks to John for making it ez for me.
    I couldn't do better than to copy/paste words right out of his mouth:
    "This is a very well written article and examples. I learned something new."

    ReplyDelete
  9. Thanks your post very much~
    And the ThreadLocalRandom.current().setSeed(seed) is not supported now.
    The api doc says "Throws UnsupportedOperationException. Setting seeds in this generator is not supported. "

    ReplyDelete
  10. Nice article.. Interesting to read about fundas.. :)

    ReplyDelete
  11. (Old thread I know!)
    Based on the insights presented, it's clear we can do even better.

    In the single-threaded case, the instance of ThreadLocalRandom returned by ThreadLocalRandom.current() will be the same for each call to run(). Hence the call to ThreadLocalRandom.current() can be done once and stored in a final field, so avoiding that virtual method call every loop.

    /**
    * NOTE - This is NOT thread-safe. It is based on the same thread constructing an instance and then making all calls to run() on it
    */
    public class StoredThreadLocalRandomGenerator implements BenchmarkRunnable {
    private final ThreadLocalRandom current = ThreadLocalRandom.current();
    private double r;

    @Override
    public void run() {
    r = r + current.nextDouble();
    }

    public double getR() {
    return r;
    }

    @Override
    public Object getResult() {
    return r;
    }

    }

    I also tested a "RandomInstanceGenerator", which has...
    private final java.util.Random rnd = new java.util.Random();
    ...
    @Override
    public void run() {
    r = r + rnd.nextDouble();
    }
    ...


    Using 64-bit Java 1.7, StoredThreadLocalRandomGenerator is about 1.8 times the speed of ThreadLocalRandomGenerator and 3.8 times the speed of the original MathRandomGenerator. RandomInstanceGenerator has about the same performance as MathRandomGenerator.


    Benchmark target: MathRandomGenerator
    Mean execution count: 53333180.8
    Standard deviation: 864505.41
    To avoid dead code coptimization: 1.8675819429696763E8

    Benchmark target: ThreadLocalRandomGenerator
    Mean execution count: 111053832.6
    Standard deviation: 4261579.83
    To avoid dead code coptimization: 3.879649954345086E8

    Benchmark target: StoredThreadLocalRandomGenerator
    Mean execution count: 204889655
    Standard deviation: 5966488.25
    To avoid dead code coptimization: 7.170974417005749E8

    Benchmark target: RandomInstanceGenerator
    Mean execution count: 53358416.4
    Standard deviation: 782600.74
    To avoid dead code coptimization: 1.868748768427826E8

    ReplyDelete
  12. Thank you for sharing this very helpful article about Java programming. Developers of such are surely grateful of this post. Cheers!

    ReplyDelete