Last week we had a severe problem in a critical web module of our production environment. The problem was that we had to restart a server at a time where we had very high user and transaction volumes (we do ~500.000 transactions in that module per business day). The server shutdown deleted our XSL template cache. As a consequence many threads tried to recompile the templates concurrently after restart. This in turn introduced a (CPU) system overload issue. We could only restart the server by blocking the incoming requests at the web server level. In fact we only allowed some threads to bypass the web server and to enter the application server. After some warmup time we opened the web server and the system started to work at an acceptable CPU load.
We've looked at the application code that caused the issue and decided to implement an intelligent concurrency pattern that fulfills following requirements:
- the number of threads that perform a CPU intense task concurrently (like XSL template compilation) should be limited to a configurable size (=> avoid CPU overload on startup in user load peek times)
- cache the result of each CPU intense task so that it will only execute once (we had that already in our error prone solution)
- enable the system to determine whether the CPU intense task needs to execute again (=> if the application was redeployed the templates need to recompile)
I spare me the effort (and the painfullness) to post the old non-safe code. The bugs in that code were:
firstly - it did not limit the number of threads allowed to compile templates
secondly - it did not ensure that only one thread compiles a specific template at a given time (e.g. our startup page) - instead many threads tried to compile the same template concurrently - this one is critical!
That all being said, here comes the solution to such a "too-many-threads concurrency issue". Here's the code and I also added a class diagram 'cause I believe this is a common situation in web applications.
The following snippet shows our new
HTMLDocumentGenerator, its responsibility is to create and cache
ConcurrentTransformerinstances, one for each XSL document. The responsibility of
ConcurrentTransformeris to compile a XSL template and to store the result in its
HTMLDocumentGeneratorowns the cache (Line 3), defines a limit of threads that can compile XSL documents (Line 5) and declares a
Semaphorethat implements a "thread bouncer" (Line 6). The
createDocument-Method (Lines 10-20) creates and caches the new
ConcurrentTransformerinstances for each XSL document. Example:
/start/index.xslwill have its own
/start/main_menu.xslwill also have its own unique instance.
Notice that we use
ConcurrentHashmap.putIfAbsent()which allows cache lookup and cached object creation in a single step. This is equivalent to:
if (!map.containsKey(key)) return map.put(key, value); else return map.get(key);
This is a perfect approach in a multithreaded high volume application. You could do the same with owned locks or
synchronizedblocks, but it will hardly be so safe (and fast!) like the example above. You will understand my statement if you look at the implementation of
ConcurrentHashmap. It locks segments internally and not the whole table! This allows very high volumes of concurrent access without lock contention. In Line 18 we call the method
generateTemplatethat actually performs the CPU intense task (template compilation).
In Line 5 we declared a
Semaphorewhich acts as a bouncer for the CPU intense code sections. Using this class you can control the number of concurrent threads that perform a certain task concurrently. It's important to declare the
HTMLDocumentGeneratorclass, 'cause the thread limit applies to all cached
Now let's look at the
ConcurrentTransformerthat I declared as a protected inner class of
HTMLDocumentGenerator. Using an inner class
ConcurrentTransformerinstances have immediate access to member variables of
HTMLDocumentGenerator. There is one
ConcurrentTransformerfor each unique XSL template.
Let's go through
Line 9: check if thread needs to compile this XSL template. If
false, return the compiled template.
trueif for example the
null, which means the template wasn't compiled yet. The first thread that enters
mustCompile() = true'cause the
ConcurrentTransformerwas just created by that thread and the
Line 12: Block all subsequent threads 'cause only ONE thread can compile THIS XSL template at a given time.
Line 14: check again if thread needs to compile the XSL template. Sounds weird? Imagine a second thread that was blocked at Line 12 'cause the first call to
true. This thread does not need to compile again 'cause the other (faster/first) thread already compiled the template.
Line 17: check if thread exceeds the permitted number of threads allowed to do compile jobs. Because the
bouncerwas defined in the
HTMLDocumentGeneratoronly a limited number of threads will enter the next code block. This actually was the tricky part of the solution: Lock threads that want to compile a specific template but also ensure that the total number of active compilation tasks does not exceed a defined limit! Because the
bouncerapplies to all cached
ConcurrentTransformerinstances this is possible.
Line 19-25: Do the actual compilation work which is the CPU intense part that aused the system overload.
Line 27: Release a permit to allow other thread (in a different
ConcurrentTransformerinstance) to enter the compilation code.
Line 30: Release the lock so that other threads waiting for that XSL document can continue their processing. (they will return the template that was just compiled, see line 14)
Done! This solution will enforce that (1) only one thread will try to compile a specific XSL document and that (2) only a limitted amount of threads can do compilation work.
Here is the class diagram that shows the pattern-style structure of the solution:
Sleeper bugs and why systems work by coincidence
Some weeks ago I met Dr. Heinz Kabbutz. I attended his Java Specialist Master Course and he opened my eyes for problems like this. Multi-threading was his first lesson and he started by saying something like this: "Your web applications are not thread safe, they just work - by sheer coincidence!" Although its a provocative statement, he is not wrong by saying that ... The described concurrency bug was in our code for a _long_ time, say for 8 years? It just didn't show up. What changed? We migrated from WebSphere Application Server 6 to 7. And we decided to use a new XSL parser, optimized for z/OS. The old XSL parser performed good at compilation time, but the compilation result did not perform so well. The new XSL parser promised an optimized compilation result. But obviously it did take more CPU during compilation. Now, since the compilation was CPU intense with the new parser our consurrency bug turned out to be a big problem suddenly. I'd call that a "sleeper bug". It is there for years, and suddenly it sabotages your system. I believe there are many bugs like this in todays Java applications, and I believe it's not too provocative to say that you will also have some sleeper bugs in your systems. Your systems work - but by sheer coincidence!