Python Language Extracting N largest or N smallest items from an iterable


To find several (more than one) of the largest or smallest values of an iterable, you can use the nlargest and nsmallest functions of the heapq module:

import heapq

# Get the 5 largest items from the range
heapq.nlargest(5, range(10))
# Output: [9, 8, 7, 6, 5]

heapq.nsmallest(5, range(10))
# Output: [0, 1, 2, 3, 4]

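When N is small compared with the size of the iterable, this is more efficient than sorting the whole iterable and then slicing from the end or beginning. Internally these functions use a binary heap (a priority queue data structure), so only about N items need to be kept at any one time. For N equal to 1, min and max are simpler and faster; when N approaches the length of the iterable, sorted is usually the better choice.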
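To make the heap idea concrete, here is a minimal sketch of a bounded-heap selection. The function name n_largest_sketch is hypothetical and the code is only an illustration of the technique, not heapq's actual source; it keeps at most n items in a min-heap and replaces the smallest one whenever a larger item arrives.

```python
import heapq
import random

def n_largest_sketch(n, iterable):
    """Illustrative bounded-heap selection (hypothetical helper, not heapq's source)."""
    if n <= 0:
        return []
    heap = []  # min-heap holding the n largest items seen so far
    for item in iterable:
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # item beats the smallest of the current top n, so swap it in
            heapq.heapreplace(heap, item)
    # Return the survivors from largest to smallest, like heapq.nlargest does
    return sorted(heap, reverse=True)

random.seed(0)
data = [random.randrange(1000) for _ in range(200)]
top5 = n_largest_sketch(5, data)

# Same result as the library function and as sort-then-slice
print(top5 == heapq.nlargest(5, data) == sorted(data, reverse=True)[:5])
```

Each push or replace costs O(log n) on a heap of at most n items, which is where the saving over a full sort comes from when n is small.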

Like min, max and sorted, these functions accept the optional key keyword argument, which must be a function that, given an element, returns its sort key.
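As a small illustration of the key argument, the sketch below picks the cheapest and most expensive entries from a list of (name, price) pairs. The products list and the price function are made-up sample names, not part of heapq.

```python
import heapq

# Hypothetical sample data: (name, price) pairs
products = [
    ("pen", 1.5),
    ("laptop", 900.0),
    ("book", 12.0),
    ("phone", 600.0),
    ("mug", 8.0),
]

def price(item):
    """Sort key: the second field of each pair."""
    return item[1]

cheapest = heapq.nsmallest(2, products, key=price)
priciest = heapq.nlargest(2, products, key=price)

print(cheapest)  # [('pen', 1.5), ('mug', 8.0)]
print(priciest)  # [('laptop', 900.0), ('phone', 600.0)]
```

The whole element is returned, not the key; the key is used only for comparison.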

Here is a program that extracts 1000 longest lines from a file:

import heapq
with open(filename) as f:
    longest_lines = heapq.nlargest(1000, f, key=len)

Here we open the file and pass the file handle f to nlargest. Iterating over the file yields each line as a separate string; nlargest then passes each line to the function len to determine its sort key. Given a string, len returns its length in characters, which here includes the trailing newline.

This only needs storage for the 1000 longest lines seen so far, which can be contrasted with

longest_lines = sorted(f, key=len, reverse=True)[:1000]

which will have to hold the entire file in memory.
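The two approaches can be compared on a small in-memory stand-in for a file. This is only a sketch: io.StringIO and the sample text are assumptions used so that the example runs without a real file, and 2 is used in place of 1000.

```python
import heapq
import io

# Hypothetical stand-in for an open text file
text = "a\nbbb\ncc\ndddd\n"
f = io.StringIO(text)

# Streaming selection: keeps only the 2 longest lines seen so far
top2 = heapq.nlargest(2, f, key=len)

# Sort-then-slice: reads every line into a list before slicing
f.seek(0)
top2_sorted = sorted(f, key=len, reverse=True)[:2]

print(top2)         # ['dddd\n', 'bbb\n']
print(top2_sorted)  # ['dddd\n', 'bbb\n']
```

Both calls return the lines from longest to shortest, with the trailing newlines still attached.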