Find all pairs in array of integers whose sum is equal to a given number ?

Problem: Write a program to find all pairs of integers whose sum is equal to a given number. For example if input integer array is {1, 2, 3, 4, 5}, and the given sum is 5, the code should return two pairs {1, 4} and {2, 3}.

This is a common interview question related to array structure. A quick search on Google can return several pages discussing the problem, to list but a few [1,2, 3, 4]. Those pages only discuss the solutions at the conceptual level, without supporting comparing numbers, although they also provide the code. In this note I would re-implement all these approaches and compare their efficiency in practice.

Common approaches
There are four approaches which are often mentioned together: Brute-force, HashSet, sorting with search and sorting with binary search. Each has its own pros and cons. Interested readers can find the details in the list of sources below. Here is a summary of the storage complexity and computational complexity.

Approach Storage complexity Computational complexity
Brute-force constant O($N^2$)
HashSet O(N) O(N)
Sort + Linear Search constant O(NlogN)
Sort + Binary Search constant O(NlogN)

Evaluation Setup
- Programming Language: Java 8 (1.8.0_51)
- To measure elapsed time of running a method
TimeUnit.NANOSECONDS.toMillis(end_time - start_time))
where start_time, end_time is set to System.nanoTime() before and after the method correspondingly.
- Generate an integer array randomly at different size $\{10, 100, ...10^9\}$. Each element is an integer, ranging from [0, 10^3]. The given sum is randomly selected in range of $[0, 10^6]$. (Note: At the moment, I have yet considered cases of negative numbers.) Each method is run 100 times. The report time is taken as the average over 100 runs.
- Due to heavy-load in computation, the Brute-force approach is restricted to experiments with array size smaller or equal to $10^5$.
- The code uses the algorithms implemented in Java such as Arrays.sort() and Arrays.binarySearch().

Experimental Result


Array Size Brute force HashSet Sort + Linear Search Sort + Binary Search
10 0 0 0 0.07
$10^2$ 0 0 0 0
$10^3$ 0.06 0 0 0
$10^4$ 20.13 0 0.18 0.05
$10^5$ 2057.79 2.62 4.08 3.06
$10^6$ N/A 30.54 46.05 42.16
$10^7$ N/A 329.57 472.1 465.6
$10^8$ N/A 3420.79 4334.57 5187.29
$10^9$ N/A 46938.4 43404 58332.79
$1.5*10^9$ N/A 61834.1 64844.03 89533.32
$2*10^9$ N/A 75279.2 86555.01 118527.82

Note: the measure unit is milliseconds.

Discussion
-  If the given array has up to 1000 elements, difference in running time is negligible among the three approaches.

- With arrays of 10,000 elements, Brute-force is much slower than others. However, its running time is around 20 milliseconds, which is still within a blink.

- At the size of 100,000, the difference is getting obvious. Brute-force is 785x slower than HashSet, 504x slower than Sort+Linear and 672x slower than Sort+Binary. So far HashSet is still the winner in running time. Comparing to the result of 10,000-element arrays, the Brute-force is 102x slower than itself when the size is 10x difference, which is aligned with the complexity of O($N^2$). For this reason, I presume for arrays around 1M, it may take an hour to run, which is not worth trying since the approach already slower than the other methods. Hence, the Brute-force is not included in the 'large-scale' test sets, i.e. arrays with more than $10^6$ elements.

- It's more interesting to see the performance between HashSet method and other sort-based methods. In theory, HashSet with complexity of O(N) is faster than all the sort-based approaches with O(NlogN). In practice, the claim does not always hold. Sometime the constant in the Big-O notation may be significant. We can observe the phenomenon clearly by comparing the two methods: HashSet with Sort+Linear. The ratio between the twos' running time is declining with regards to increase in array size. With N=$10^9$, HashSet is slightly slower than In-place sort. With N=$1.5*10^9$, the performance between the two are equivalent.

- To explain the observation, we can look at the implementation of HashSet in Java, which I briefly cover below. The key idea here is with HashSet, it can also be time-consuming to handle cases of many unique keys (requiring to expand the key list) or many instances sharing same hash code (traversing the linked-list attached to a bucket). The actual complexity of HashSet is O(N*N/M) (some readers in [4] already mentioned that) where M is the number of keys created to hold the array. The idea case is M=N (where all the values are distributed evenly among the buckets) leading to O(1). The worst case is M=1, where all the values belong to the same bucket.
 
Further Discussion
There are different scenarios to take note: negative values ? duplicated value ? reversed pairs ? underflow and overflow (when the given values are double or float type) ?
- HashSet: fastest, but requires additional space of O(N). Another drawback of this approach is that it cannot handle duplicated pairs. For example, the given array is {-1, -1, -1, 3} and the given sum is 2. Since all -1 will be subsumed by the Hash table, the algorithm will return only one instances of {-1, 3}.
- Sort + binary search: Slower than HashSet and sort + linear search in practice but can deal with duplicated pairs.
- Sort + linear search: got issues with duplicated pairs. For example, {-1, 3, 3, 3} and sum=2.
- The experiment settings are not really complete. For example, I have not covered the cases with negative numbers, cases with arrays of identical elements, cases with arrays of distinct elements, etc.

The code will be updated later.

Sources:
1. http://javarevisited.blogspot.sg/2014/08/how-to-find-all-pairs-in-array-of-integers-whose-sum-equal-given-number-java.html (java)
2. http://www.cs.uml.edu/~jlu1/doc/codes/findSum.html (python)
3. https://discuss.codechef.com/questions/3292/algorithm-given-an-array-find-all-the-pairs-of-elements-whose-sum-is-k (c++)
4. http://www.ardendertat.com/2011/09/17/programming-interview-questions-1-array-pair-sum/
5. http://javaconceptoftheday.com/how-hashset-works-internally-in-java/
6. http://www.java2blog.com/2014/07/how-hashset-works-in-java.html
7. http://javahungry.blogspot.com/2013/08/how-sets-are-implemented-internally-in.html
8. https://docs.oracle.com/javase/tutorial/collections/implementations/set.html
9. http://javahungry.blogspot.com/2013/08/hashing-how-hash-map-works-in-java-or.html

Appendix A1. How HashSet works internally in Java ?
- It uses HashMap internally to store it's object. The values inserted to HashSet will be the keys of the HashMap object, the values are some dummy constant object named PRESENT. [5]
- How HashMap determine similarity between customized objects as keys ? [6]
+ Step 1: Compare the twos' hashcode (using hashcode() function)
+ Step 2: If hashcode is the same, check the equals() method.
[6] also suggests useful practice on overriding equals method. The key point is to make sure the consistence between equals() and hashcode(), but the reverse it not necessary.

Appendix A2. How HashMap works internally in Java ? [9]
Several keywords: hash functions, hash values and Bucket.
- Hash functions is the hashCode() function which returns an integer value. This method is located in the Object class.
- Hash value is the integer value returned by the hashCode() function
- Bucket stores (key, value) pairs. In HashMap, Bucket uses simple linked-list to store object in the form of (hash, key, value, bucketindex).
- When we call get(Key k) method, the following steps are:
     + Step 1: Check whether the key is null or not. There is only one null key in HashMap
     + Step 2a: If key is null, Null keys --> 0.
     + Step 2b: If key is not null, compute the hashValue of the key object by calling hashCode(), then re-hash that value by calling its own hash function hash(hashValue).
     + Step 3: find the location in the Bucket by using indexFor(int, int) method. The linked list in the appropriate bucket is iterated over to the end or the matched key.

Comments

Popular posts from this blog

Manifold Learning [unfinished]

Feature scaling