Canal - Runtime Concurrency Analysis

Introduction

Canal is a Java runtime concurrency analysis tool. It instruments your Java code and records field access, synchronization, locks and publication from all threads, and then analyses whether there is any unsafe field access. In today's environment where everything is multi-threaded, Canal is invaluable. Installation is simple and unobrusive, requiring no change to your app.

Why analyse at runtime?

In decending order of performance:

Its easy
More likely to find bugs
Runtime is only time you know code is going to run in multiple threads
Doesn't require prior knowledge runtime threading environment

Doesn't require prior knowledge runtime threading environment

These days we've got lots of code, and lots of different runtime environments. There will be times when you don't know¹, you can't know², or you or forget the code you're writing will be executed in multiple threads. If you are unaware, you won't be writing multi-threaded unit tests, or applying compile time analysis tools. There wil be times that you want to introduce concurrency in an app for increased performance. In this case, you would have to go and write new multi-treaded unit tests for the whole app.

With Canal you don't have to worry about these things - you can write code without needing to know whether it will run in multiple threads (I'm not saying you should!), then when run in a production-like test environment, Canal will pick up non threadsafe accesses.

Runtime is only time you know code is going to run in multiple threads

Unit tests can not tell you whether you code is going to be accessed from multiple threads. Indeed, with today's complex frameworks and runtime environments, even if you check, you can miss where threads are introduced. Only at runtime in a production like environment do you really know what code is used from multiple threads.

More likely to find bugs

... than unit testing

There are some great books and articles about writing multi-threaded unit tests for you code, but it is very hard to get multi-threaded unit testing correct. Even then, the results are usually indetermonistic.

... than static code analysis

By its very nature, static code anaysis doesn't know if code is going to be run in multiple threads. Thus, it can only check for inconsistencies in the locking policy. If there is no locking, and the code is eventually used in multiple threads, static code analysis can't help.

Why not analyse at runtime?

I hear you say "Ok so all that sounds great, there must be downside."

Indeed - runtime analysis means analysis has to be done at runtime. This analysis needs resources to run, resources that would usually run your program. Recording field access, synchronisation, etc takes time which slows down our program and uses memory. This is true of any instrumentation techniques - unit test coverage tools, bean property change listeners etc. They add code to yours so there's more to run. More in the performance section.

Installation

Download the latest Canal distribution from Sourceforge. the distribution contains the Canal jar and the few dependencies Canal has.
Extract the zip to a directory. (You must keep dependencies in same dir as Canal jar, as Canal needs these on the bootstrap classpath).
Add Canal java agent to your application start up script:

Before:

java -classpath=<…> MainClass <args>

After:

java -javaagent:<path to>/Canal.jar -classpath=<…> MainClass <args>

optional

Change configuration of what packages are instrumented.
Move canal.xml configuration file: put it anywhere on the classpath.
You may well find you need to increase the max memory on you JVM using something like -Xmx1024M

dependencies

Mandatory

ASM (for instrumentation). Due to likely conflicts with ASM version used by other projects, ASM is now embedded in the Canal jar using jarjar (in the package com.iandjones.asm).
Spring utils package (for ant style path matching in config)
Apache Commons logging (for … um… logging)

Optional

Log4j (default logging implementation)

Requirements

Java 5, for the instrumentation API

Usage

I use Canal regularly. I run my application with Canal in my development and UAT/QA/test environments, and also in my automated integration tests that exercise large chunks of your system. I won't want to run it in production, and I won't want to run in UAT all the time, as at some point I'm going to want to performance test a prod like environment. I also use it in some unit tests for really crucial parts of the system where I know there will be multiple threads. In this case I don't want Canal to report threading bugs to the console, but to throw an exception if there are any problems. So I can use the following in my test:

CanalFactory.getInstance().getAnalyser().analyseConcurrentAccessAndThrow(); // Throws CanalException if Canal detects any threading bugs

Another usage consideration is how much to instrument. With the default Canal configuration file, Canal will instrument the whole JVM. This means it will look for bugs in your code, bugs where your code uses other code, and indeed bugs in other code. The latter is undesirable, but difficult to distinguish from the former. The other option is to just instrument your code. In which case Canal will only look for bugs within your code, not where your code uses other code (e.g. It would't find a bug where your code was incorrectly using an ArrayList, as the ArrayList class is not instrumented). However, this will be much faster and use less memory.

Performance

Canal takes the approach of trying to minimise the performance impact, but doing as little as possible at runtime - it just records the events that happen. The legwork of analysis is done by default at shutdown. The obvious consequence of this is that by default Canal uses up a fair amount of memory, in sizable program. Memory is cheap these days, but you can tweak the performance/memory trade off by changing the frequency with which Canal does its analysis.

In this very early stage in this project the main work has focused on getting it working, with reasonable performance. There are many performance optimisations that will be made in future releases

The example below runs 1.6 times slower on my machine with Canal enabled. However, this is not the best performance test because its so short (so variance in runtimes is large), and entirely cpu bound.

You may find that with the reduced performance and increased memory usage, that you have to run you application with a smaller data set than it would have in production.

Example

Lets take a trivial example to see Canal in action. Here's a simple unthreadsafe counter, and a main method that runs the counter ten times in two threads, and outputs the total

public class UnsafeCounter {
	private int i = 0;
	
	public void count() {
		++i;
	}
	
	public int get() {
		return i;
	}
	
	private final static UnsafeCounter counter = new UnsafeCounter();

	public static void main(String[] args) throws InterruptedException {
		Thread thread = new Thread(new CounterRunner());
		thread.start();
		new CounterRunner().run();
		thread.join();
		System.out.println("Should be 20: " + counter.get());
	}
	
	private static final class CounterRunner implements Runnable {
		@Override
		public void run() {
			for(int i=0; i < 10; ++i) {
				counter.count();
			}
		}
	}
}

Listing 1: UnsafeCounter.java

I ran this ten times, and the output was:

Should be 20: 20

So no hint at the possible impending doom. If I run this with Canal enabled, the output is:

2010-11-20 00:21:48,662 [main] INFO  com.iandjones.canal.instrumentation.CanalInstrumentation  - Starting Canal instrumentation agent
Should be 20: 20
2010-11-20 00:21:48,866 [Thread-0] INFO  com.iandjones.canal.implementation.analysis.CanalAnalyserImplementation  - Running Canal concurrency analysis
2010-11-20 00:21:48,871 [Thread-0] ERROR com.iandjones.canal.implementation.analysis.CanalAnalyserImplementation  - Field com.iandjones.canal.example.UnsafeCounter@26094370:i not always accessed with common lock: 
   thread id: 9:Thread-1
       locks held: []
   thread id: 1:main
       locks held: []

This says that when analysis was done at shut down, that the field "i" on the object "com.iandjones.canal.example.UnsafeCounter@26094370" was accessed from two threads without any locks held.

Lets make the class threadsafe.

	...
	public synchronized void count() {
		++i;
	}
	
	public synchronized int get() {
		return i;
	}
    ...

Listing 2: UnsafeCounter.java made threadsafe

Re-running with Canal enabled outputs (and debug turned on), I get:

2010-11-20 00:31:50,135 [Thread-0] DEBUG com.iandjones.canal.implementation.analysis.CanalAnalyserImplementation  - Field com.iandjones.canal.example.UnsafeCounter@20754125:i always accessed from 2 threads with common lock(s) [com.iandjones.canal.example.UnsafeCounter@20754125]

To find the bug in the above code without Canal, I have to increase the number of iterations 100,000 to get it to start failing on my machine, and even then it only fails half the time. To get it to reliably fail I have to increase the number of iterations to 10,000,000. So my test is roughly a million times slower than my 'production' code. The performance overhead of Canal doesn't seem so bad now :-). The other option is to rewrite my code to increase concurrent execution by increasing the complexity of the code and adding some thread communication to ensure both threads wait and start at the same time. In that case, I only need to increase the number of iterations 10,000 times. This is my reasoning behind my claim that Canal is easier and more likely to find bugs than normal tests.

FindBugs finds no bugs in the above class.

How it works

At runtime Canal instruments your Java code and records field access, synchronization, locks and publication from all threads, and then analyses whether there is any unsafe field access. It uses ASM and the Java 5 instrumentation API to transform your code. For exampe the count method in Listing 2 would be roughly the following after instrumentation:

 	public synchronized void count() {
		CanalFactory.getInstance().registerLock(this); // register synchronisation lock on object this.
		try {
			CanalFactory.getInstance().registerFieldAccess(this, "i"); // read field with name "i" on object this
			CanalFactory.getInstance().registerFieldAccess(this, "i"); // write field with name "i" on object this
			i=i+1;
		} finally {
			CanalFactory.getInstance().registerUnlock(this); // register synchronisation unlock on object this.
		}
	}

When Canal registers an event, it basically just records the details of the event onto a threadlocal queue for later analysis.

Configuration

Sample configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<canal>
 	<instrumentation>
 		<filter>
 			<include> <!-- Ant sytle path matchers for classes to include in Canal analysis -->
 				<path>**</path> <!-- by default instrument everything, but you may want to just instrument your code e.g. com// -->
 			</include>
 			<exclude> <!-- Ant sytle path matchers for classes to exclude from Canal analysis -->
 				<path>**/*Test*</path>
  				<!-- snip ... some mandatory exclusions in full file... -->
  			</exclude>
 		</filter>
 		<locks> <!-- snip ... Configuration for how Canal recognises Lock classes. Advanced users only. See full config file ... -->
 		</locks>
 	</instrumentation>
 	<reporting>
 		<analyseAtShutdown>true</analyseAtShutdown> <!-- Run Canal analysis at shutdown. For jobs that exit when complete, or for a short run of a usually long running app e.g. one test cycle. -->
 		<analysePeriodically> <!-- Run Canal analysis periodically during runtime. For for long running apps, e.g. services that accept requests and must be running all the time -->
 			<enabled>false</enabled>
 			<period>5</period>
 			<unit>MINUTES</unit>  <!-- valid values are from java.util.concurrent.TimeUnit --> 		
		</analysePeriodically>
 	</reporting>
</canal>

Current status

Pre alpha

Features not yet supported:

safe publishing, a key concurrency concept
happens before relationships
detection of non - atomic acitons
field access via reflection
read and write lock checking
reads of other objects in constructors
array access

Known issues:

ClassCircularityErrors when trying to get info about another class whilst instrumenting one

FAQ

Will Canal find all threading bugs in my application?
- No. We've said that you're more likely to find threading bugs with Canal. However, as Canal does runtime analysis, it can only analyse the codepaths that are executed. So if in your testing you don't excercise some piece of functionality when running Canal, you may find that in production when that functionality is used there is a threading bug.
- No. It doesn't detect all possible threading bugs, only the most common ones.

More to come

^[1] Whilst you shouldn't make code chages unless you're fully familiar with the whole codebase, and the environment in which it will run, in reality, we all move to new projects, or get new junior members on the team.

^[2] e.g. where you upgrade or change a library or frame, and the new version accesses your code from multiple threads.