To find some number (more than one) of largest or smallest values of an iterable, you can use the nlargest
and nsmallest
of the heapq
module:
import heapq
# get 5 largest items from the range
heapq.nlargest(5, range(10))
# Output: [9, 8, 7, 6, 5]
heapq.nsmallest(5, range(10))
# Output: [0, 1, 2, 3, 4]
This is much more efficient than sorting the whole iterable and then slicing from the end or beginning. Internally these functions use the binary heap priority queue data structure, which is very efficient for this use case.
Like min
, max
and sorted
, these functions accept the optional key
keyword argument, which must be a function that, given an element, returns its sort key.
Here is a program that extracts 1000 longest lines from a file:
import heapq
with open(filename) as f:
longest_lines = heapq.nlargest(1000, f, key=len)
Here we open the file, and pass the file handle f
to nlargest
. Iterating the file yields each line of the file as a separate string; nlargest
then passes each element (or line) is passed to the function len
to determine its sort key. len
, given a string, returns the length of the line in characters.
This only needs storage for a list of 1000 largest lines so far, which can be contrasted with
longest_lines = sorted(f, key=len)[1000:]
which will have to hold the entire file in memory.