Home Preallocate Java's ArrayList And HashMap For Better Performance
Post
Cancel

Preallocate Java's ArrayList And HashMap For Better Performance

TL;DR Preallocation avoids unnecessary memory reallocations due to dynamic resizing, copying of the data, rehashing for maps and puts less pressure on GC1. Although we will only look at Java, the same behavior exists in c++, rust and many other languages as well.

ArrayList

There are two ways to create an ArrayList .

  • Method 1: Created with zero argument constructor. This uses a default capacity of size 102.

    1
    2
    3
    
      var arr = new ArrayList<String>(); 
    	
      arr.add(elem) // add something later
    
  • Method 2: Provide an initial capacity with int argument constructor.

    1
    2
    3
    4
    5
    
    // created with a capacity of 100. 
    // No dynamic resizing needs to happen till 101th element is added
    var arr = new ArrayList<String>(100); 
      
    arr.add(elem) // add something later
    

Let’s measure the performance under different scenarios for different input sizes:

N
(Size of the input)
Vector With No PreAllocation
(Method 1)
Vector With PreAllocation
(Method 2)
Improvement
100.037 μs0.037 μs0%
1000.403 μs0.350 μs15.14%
10004.448 μs2.903 μs53.22 %
1000052.361 μs29.089 μs80%
100000468.255 μs279.901 μs67.29%

Why Vector preallocation is efficient:

Whenever an ArrayList runs out of its internal capacity to hold additional elements, it needs to reallocate more space. Generally, most implementations double the existing size. So, a new array of larger size is created and existing elements are copied to this new array3. Later, whenever GC runs, the old array is garbage collected. At the time of writing, when 100000 elements are added to the arraylist with no preallocation, the array reallocation happens at the following sizes in openjdk 17 codebase:

10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 711404.

Note: The capacity growth policy is an implementation detail and can change across different jdk versions.

HashMap

Although we only talk about HashMap, the same argument applies to LinkedHashMap and other kinds of maps which take in slots/capacity as arguments.

HashMap can also be constructed with custom a loadFactor but we will stick with using the default loadFactor of0.75 for now . The capacity can be computed with formula:

\[\frac{count\space of \space expected \space elements}{loadFactor} = \frac{count\space of \space expected \space elements}{0.75} \approx 1.33 \times {count\space of \space expected \space elements}\]

The reason we allocate 1.33 times more than the expected elements count is because of hash collisions. In a perfect world, the number of buckets/slots we need to allocate will be equal to the number of elements to add. However, in reality, collisions will happen and we need to account for that.

Consider the following two methods of constructing a HashMap:

  • Method 1 - With No Preallocation : Created with 0 argument constructor. This constructs an instance where capacity is 16.

    1
    
     var map = new HashMap<String, Integer>();
    
  • Method 2 - With Preallocation: Created with required capacity (N)

    1
    2
    3
    
    // here N is the number of elements we intend to add to the HashMap
    var capacity = (int)Math.ceil(N/0.75);
    var map = new HashMap<String, Integer>(capacity)
    

Let’s measure the hashmap performance under different scenarios for different input sizes (N):

N
(Size of the input)
HashMap No PreAllocation
(Method 1)
HashMap With PreAllocation
(Method 2)
Improvement
100.245 μs0.233 μs4.89%
1002.624 μs1.699 μs35.25%
100030.244 μs20.187 μs33.25%
10000477.449 μs336.069 μs29.61%
10000011013.597 μs8875.078 μs19.42%

Why HashMap PreAllocation is Efficient

Preallocation of HashMap is efficient as this avoids unnecessary rehashing. At the time of writing, rehashing happens when map.size() > (capacity() * loadFactor). Rehashing involves the following process:

  • Allocate a new slot table of larger size. In openjdk 17, the size increases by approximately 1.55
  • Rip through each entry in the existing map, rehash the entry and find its new slot in the new array

When you try to add 100K elements, the hashmap is rehashed at the following sizes: 10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 71140. When HashMap is preallocated, all these rehashing is eliminated.

Drawbacks of Over Allocation

Hashmap/Vector have no way of knowing how many elements we intend to add and thus, generally are conservative in nature and initially, start with a small capacity. If we are very aggressive and set a very large capacity even though we never utilize the full capacity will result in memory wastage. In a GC language like Java, this would result in unnecessary GC pauses.If you are using an architecture like lambda or cloud functions where you are charged depending on how much memory you use, then your cloud bill should be higher if you waste space.

Unnecessary space in Vector can be claimed by invoking trimToSize method.

How to Determine the Capacity

Generally, all applications typically interact with external systems like databases, handle income rest or grpc requests, listen to incoming messages on an event bus, interface with external apis etc. The clue comes from carefully examining how we interact with such an external system.

  • Database: Consider a hypothetical example of retrieving the top 100 rated posts globally using a sql: select post_id from top_posts order by position desc limit 100 . We know that in almost all cases, we get 100 rows from the above sql query. So, we can initialize the container with a capacity of 100.

  • Creating an HashMap: Consider an external api that returns a list of User and you need to convert to a map of userId to User

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  class User {
    // not a complete class. Only relevant method is shown for illustration
    public int getId() {
      return id;
    }
  }
  
  // Usage
  List<User> users = getUsers(); 
  var capacity = (int)Math.ceil(users.size()/0.75);
  Map<Integer, User> userIdMap = users.stream()
      .collect(Collectors.toMap(user -> user.getId(), // key
                                Function.identity(), // value
                                (u1, u2) -> u1, // since we expect no duplicates, just pick any user Object for merge operation
                                () -> new HashMap<>(capacity)) // HashMap capacity is pre allocated
              );




  1. Garbage Collection (GC) pressure only applies to garbaged collected languages like Java, go etc but not c++, rust. 

  2. At the time of writing, the default size in open jdk 17 is 10 . In fact, a shared array of size 0 is used until the first addition. This avoids unnecessary space allocation if ArrayList is created and no elements were to be ever added. 

  3. In fact, this is the worst case scenario where addition of element doesn’t take constant time. 

  4. Java instead of doubling the capacity, increases the capacity by approprimately 1.5 times 

  5. Typical implementations double the size of existing capacity. 

This post is licensed under CC BY 4.0 by the author.