7. 2. 2017

Playing with Java 8 collectors

I bet you wouldn't ever dare mutating key in HashMap. Such mutations lead to weird map behavior and hard-to-find bugs because of mismatch between actual hashcode and expected hash table bucket. In these blogpost I would like to talk about essentially same situation. Described problem has pretty bitten me in the ass and was not obvious at first sight due to nontrivial collector usage and improper usage of Guava BiMap.
As usually, I minimize problem scope and translate it to human-friendly domain.

Task

You are given set of activities and two sets of people representing work shifts. Each activity is assigned to just one person from first shift. At some time, all people from first shift have a rest and people from second shift start working. Each activity is reassigned to some person in second shift. The activity-to-person assignments are expressed as maps.

Your task is to list groups of activities which has been passed as complete set (i.e. not splitted nor merged) from one person to other.

For example, having Alice and Charlie in first shift, Bob, Dave and Ellen in second shift and following setup:

before\afterBobDaveEllen
Alicecleaning
cooking
Charliewashingironing

the result is set containing set [cleaning, cooking] since these activities were passed one-to-one from Alice to Bob, while Charlie splitted his work to two people.

Implementation note: generic types of data structures and collectors are complicated, often hard to read and sometimes enforcing usage of explicit type arguments. In order to avoid readers's confusion of map key and values, we will use just characters (java.lang.Character) A,B,C,D,E in following examples instead of full person names and reserve String type to modelling activities.

Incorrect implementation

The principle of solution is to "index" task assignment by person for both work shifts. Since I knew I would need to query both sides of association in practice, I chose to use Guava BiMap<Character,Set<String>>. Naive implementation would do following transformation of input Map<String,Character> into BiMap (get working example here):

BiMap<Character,Set<String>> index = activityToPersonMap.entrySet().stream()
        .collect(groupingBy(Map.Entry::getValue, HashBiMap::create, mapping(Map.Entry::getKey, toSet())));

The mapping(...) method creates the so called downstream collector - powerful mechanism of collector chaining, allowing to obtain Set<String> on the fly rather than Set<Map.Entry>. Also, since Guava BiMap is a Map we can provide factory method HashBiMap::create where supplier of a Map is expected.

After we got two BiMaps describing setup before and after, we just find sets of activities common to both BiMaps. Since the BiMap::values returns also Set, we just call

Sets.intersection(resultBefore.values(),resultAfter.values())

and we're done.

Really?

Let's run the example above:

Map<String,Character> before = ImmutableMap.of("cleaning",'A',"cooking",'A',"washing",'C',"ironing",'C');
Map<String,Character> after  = ImmutableMap.of("cleaning",'B',"cooking",'B',"washing",'D',"ironing",'E');
BiMap<Character,Set<String>> indexBefore = before.entrySet().stream()
        .collect(groupingBy(Map.Entry::getValue, HashBiMap::create, mapping(Map.Entry::getKey, toSet())));
BiMap<Character,Set<String>> indexAfter  = after.entrySet().stream()
        .collect(groupingBy(Map.Entry::getValue, HashBiMap::create, mapping(Map.Entry::getKey, toSet())));
System.out.println(Sets.intersection(resultBefore.values(), resultAfter.values()));

We expected [[cleaning, cooking]] but got []. Can you spot the bug?

Bug cause

BiMap keeps hashes also for inverse, value-to-key mappings. Hence for correct work of BiMap (unlike Map – where values can be safely mutated after putting into Map), values must not be mutated after they were put into Map. This limitation is crucial especially in this case where the value of BiMap is of Set<String> type.

In this wrong example, the toSet() method returns simple collector which creates Set and this set is added into newly created HashBiMap immediately after it was created. The value-to-key hash code is computed in early time for empty Set. Elements added later are physically stored into Set but hashcode is not recomputed and such a Set cannot be found – the HashBiMap is actually corrupted.

Correct implementation

Putting aside rewriting into plain old for-if imperative code (it should not be underpriced in practice, though I intentionally skip it for this article), the bugfix of functional implementation lies in deferring putting Set into BiMap to later time.

The accepted answer to very similar SO problem recommends using immutable sets. Let's try it:

BiMap<Character,ImmutableSet<String>> index = activityToPersonMap.entrySet().stream()
        .collect(groupingBy(Map.Entry::getValue, HashBiMap::create, mapping(Map.Entry::getKey, toImmutableSet())));

The ImmutableSet.toImmutableSet() method (you need Guava version 21+) returns collector which accumulates intermediate result into ImmutableSet.Builder. During stream processing, the builder is also stored into BiMap values though, so at first sight one could say this solution does not differ. However, (1) the builder does not override equals+hashCode, so its hashcode does not depend on its content, and (2) thanks to finisher associated with collector, the builder is converted to ImmutableSet at the end and replaced in BiMap.

We can prove correctness of such postponing even for mutable sets. The Collectors.collectingAndThen method wraps given collector and adds finisher after it. We wrap mapping collector and copy resulting value into new HashSet. The effect is same as in previous example.

BiMap<Character,Set<String>> index = activityToPersonMap.entrySet().stream()
        .collect(groupingBy(Map.Entry::getValue, HashBiMap::create, collectingAndThen(mapping(Map.Entry<String,Character>::getKey, toSet()), HashSet<String>::new)));

For completeness, creating Map first and then BiMap from Map would also be possible:

BiMap<Character,Set<String>> index = activityToPersonMap.entrySet().stream()
        .collect(collectingAndThen(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toSet())), HashBiMap::create));

You can try all examples in the gist.

Conclusion

The collector API is very powerful, although for more complex collector chains the timing of events during stream processing is not straightforward.
Collectors with finisher seem to be better comprehensible than simple collectors.
Also it's easy to encounter limits of Java compiler with respect to generic type inference.
After all, important cause of this bug was also in using BiMap constructor as Map supplier ignoring fact that BiMap violates Liskov substitution principle.