Sets (set)#

Jupyter Info

Reminder, that on this site the Jupyter Notebooks are read-only and you can’t interact with them. Click the button above to launch an interactive version of this notebook.

  • With Binder, you get a temporary Jupyter Notebook website that opens with this notebook. Any code you write will be lost when you close the tab. Make sure to download the notebook so you can save it for later!

  • With Colab, it will open Google Colaboratory. You can save the notebook there to your Google Drive. If you don’t save to your Drive, any code you write will be lost when you close the tab. You can find the data files for this notebook below:

You will need to run all the cells of the notebook to see the output. You can do this with hitting Shift-Enter on each cell or clicking the “Run All” button above.

Recall from Lesson 4 that we wanted to count the number of unique words in a file using a list to keep track of all the unique words you’ve seen so far. You probably had a solution that looked like the following (we changed the function name a little).

def count_unique_words_list(file_name):
    unique_words = []
    with open(file_name) as f:
        for line in f.readlines():
            for word in line.split():
                if word not in unique_words:
                    unique_words.append(word)
    return len(unique_words)

We could then run the function to see its output. The three cells below try it for three different files of different sizes (from smallest to largest: Green Eggs and Ham, the Bee Movie script, and the text of Moby-Dick).

There is a special command at the top of each cell called a “magic command” that lets the Jupyter Notebook print out how long it takes the cell to run. You do not need to understand the magic commands.

You can see when running the cells that it shows you the amount of time that elapsed from the time you pressed run after the “Wall time:” in the output. You might notice if you run the same cell multiple times, you’ll get different times. The reason the run-time is not constant is the computer is actually running more than one program at once and is allocating a small amount of time to each; this causes a lot of randomness in the actual runtime of a program

%%time
count_unique_words_list('green-eggs.txt')
CPU times: user 654 µs, sys: 56 µs, total: 710 µs
Wall time: 909 µs
105
%%time
count_unique_words_list('bee-movie.txt')
CPU times: user 136 ms, sys: 0 ns, total: 136 ms
Wall time: 136 ms
4104
%%time
count_unique_words_list('mobydick.txt')
CPU times: user 8.57 s, sys: 0 ns, total: 8.57 s
Wall time: 8.57 s
32553

Wow! It takes more than ~7 seconds to count all the words in Moby-Dick! That’s a long time in comparison to something like the Bee Movie script that only took ~100 ms (one tenth of a second) to run.

What’s even more shocking is that Moby-Dick has a little more than 200,000 words while the Bee Movie has a little more than 15,000. So even though Moby-Dick only has about ~13x the number of words as the Bee Movie, it takes about ~70x as long to run in comparison!

We will talk a bit more later in the course about how to quantify the runtime of a program without having to resort to timing code (which is uncertain by nature) and why this program is slow. A spoiler for that discussion is the slow part of this program above is the line:

if word not in unique_words:

Since that requires looping over the whole list of unique words for each word in the file to see if this condition is True! We’ll understand exactly how this causes a slow-down later.

Set#

The problem here is trying to use a list to solve a task that it’s not well-designed for. lists have no notion of preventing duplicates, so this extra work we have to do to avoid duplicates in the list causes a huge slowdown.

Instead, let’s see another data structure called a set that is designed with the idea of finding a set of unique values in mind. A set is like a list in the sense that it is a sequence of values, but differs in two major ways:

  • The set does not allow duplicate values.

  • The set does not have a notion of indices. There is no “index 0”, in a set. The mindset you should have a set is an unordered “bag” of values. The set does have some ordering in which it stores the values, but you can’t access any particular value inside of it.

set is a type defined in Python that implements this behavior. The following cell shows how to create one (using set() to make an empty set). The set class has the following methods (assume stored in a variable called s):

  • s.add(x) which adds x to s (ignores duplicates)

  • s.remove(x) removes x from s

  • s.clear() removes all values from s

nums = set()  # Creates an empty set
print(nums)

nums.add(1)
nums.add(2)
nums.add(3)
nums.add(4)
nums.add(2)  # Is a duplicate, so it's ignored
nums.add(-1)

print(nums)
set()
{1, 2, 3, 4, -1}

Also helpful, sets also support the len function and allow you to use the in keyword to see if they contain a value.

print(len(nums))

if 5 in nums:
    print('Found it')
else:
    print('Did not find it')
5
Did not find it

However, like we said earlier the set does not have a notion of an index. That means if you try to run the following cell, you will get an error.

print(nums[0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-055bcd253040> in <module>
----> 1 print(nums[0])

TypeError: 'set' object is not subscriptable

So what’s the point of using a set over a list? It turns out the way it is implemented internally, it does an incredibly fast job at determining whether a particular element is inside of it so as to avoid adding duplicates. By losing the ability to “index” into the set, we gain speed since it can store the data in its own special way to optimize these “membership queries” (i.e. “is this element in this set?”).

To show this, let’s do the unique word example with basically all of the same code above, except:

  • Change the unique_words to a set instead of a list.

  • Remove the in check since the set just skips duplicates.

def count_unique_words_set(file_name):
    unique_words = set()
    with open(file_name) as f:
        for line in f.readlines():
            for word in line.split():
                unique_words.add(word)
    return len(unique_words)

And let’s go ahead and time these for the three files:

%%time
count_unique_words_set('green-eggs.txt')
CPU times: user 3.05 ms, sys: 21 µs, total: 3.07 ms
Wall time: 4.81 ms
105
%%time
count_unique_words_set('bee-movie.txt')
CPU times: user 0 ns, sys: 3.42 ms, total: 3.42 ms
Wall time: 4.66 ms
4104
%%time
count_unique_words_set('mobydick.txt')
CPU times: user 26.1 ms, sys: 0 ns, total: 26.1 ms
Wall time: 26.2 ms
32553

Wow! All three files can have their unique words be counted in less than ~100ms! Because we used a data structure that is all about storing unique values, we gained a huge speedup from just allowing it to do its thing.

Later in the quarter, we will learn the “magic” that allows sets to be so fast at finding duplicates.