You can get pretty far in programming without understanding Data Structures, but eventually, you are going to need to know them, understand how they work and when to use them.
What is a data structure?
A data structure as the name implies is a structure for holding data. When you are writing a program, you always have to store some form of data.
The simplest form of data could be the value of a variable, but once you need to store multiple values, how you store them begins to matter to the performance of your application.
Once you understand the different data structures and when to use them, you will have another tool in your toolbox that you can use to write better programs.
As the saying goes:
If all you have is a hammer, everything looks like a nail.
Treating everything like a nail is going to lead to poorly designed software with bad performance. Different data structures have different uses, so you want to make sure you are using the right one.
The simplest type of data structure is an array.
Arrays are simply multiple values stored in memory. The key thing to remember is that arrays in most languages are stored in contiguous blocks of memory.
What I mean by that, is they are stored all together in one block in memory. When you create an array, you have to specify how big the array will be, either with a number or by initializing it with the elements in the array.
This is because the computer needs to find a free block of memory large enough to store the array. If it can’t find a block large enough, it needs to see if any memory can be freed up somewhere. All of this memory management takes extra time, which can really make a difference if you are trying to build a performant application.
Some programming languages will give you methods to add and remove items from an array.
Even though the programming languages make it easy to do this, you have to remember that the computer will still need to find a free block of memory to store your whole array in and then free up the memory where your array used to be.
This can be computationally expensive if you are constantly adding or removing items from an array.
Arrays are great when you would like to iterate through all the items, either for mathematical operations or for looping through the values. As all the items in the array are located next to each in memory, it is less taxing on the computer.
If you need to constantly add and remove items, then you might be better off with a list.
Lists are a bit like arrays in that they store multiple values but instead of having a fixed size, you can add and remove items from them more easily.
Unlike arrays, you don’t need to allocate how much space you need when you first initialize the list.
Lists can be considered auto-arrays, as they automatically allocate the memory for you. They are still essentially arrays under the hood.
In most languages, lists are essentially over-allocated arrays. Instead of reallocating memory every time an item is added, they allocate more memory than they require, usually twice as much, which means less memory management is needed when you add items.
It is not just memory allocation that is different from lists. Lists tend to store pointers or references to the values you are storing.
In languages like Python, this means you can store different data types in a list, unlike an array that has to be the same datatype for all the items in the array.
Lists are great if you will be adding and removing items a lot in your code, however for sequential reading of items an array is always faster.
The fact that the list is storing a reference to your data, instead of the data itself, means the computer needs to do another read to get your data from a list vs an array. A list also takes up more memory than an array as it is storing the count, the pointers, and the data itself.
Queues, like lists, are still using arrays internally but work differently from lists.
Queues work on a first-in, first-out (FIFO) basis, just like a real-life queue.
New items are added to the tail end of the queue and removed from the head of the queue.
Unlike arrays and lists, you can’t read items from the middle of the queue, you always have to read them in the order they went in.
Queues are mostly used for scalability reasons. They allow you to process things in order, even if the computer can’t keep up with the number of requests you are getting.
For example, let’s say you have a service that turns a photo into a cartoon.
You might have a website with an API endpoint that allows users to upload a photo. This takes the photo and does a few image manipulations and filters to make it look like a cartoon, and then returns it to the user.
The website then goes viral and your API can’t cope with the number of requests it is getting. This is where queues come in. You can add the image to a queue and then have a worker pick up the image, do the manipulations and return it to the user.
If you start getting more requests than you expected, then you just spin up more workers.
Queues can be used in code if you need to process something in order in your application, but they are most commonly used at the infrastructure level.
So, we have covered various ways of storing multiple values in a data structure.
But what if you want to make sure that something is only added to a list once to avoid duplicates?
This is where a HashSet comes in. A HashSet performs a hash on the data and then stores it according to its hash. This prevents duplicates from occurring in your set.
The key difference between a HashSet and the other data structures we have looked at is that a HashSet doesn’t store items in the order you insert them. It stores them based on the hash of the data.
HashSet are perfect when you want to prevent duplicates of the items you are storing but aren’t good if you need items in a particular order.
Next up are dictionaries. Dictionaries are similar to hashsets in that you can’t store duplicate values in them.
However, they work on key-value pairs.
If you think of a real dictionary, you can look up a word and find its meaning. The word in the dictionary is essentially the key, and the meaning is the value.
So, dictionaries allow you to store a key and then use it to look up another value.
Key-Value storage like dictionaries has so many uses in programming. They are often used for caching values against a key. Which is essentially what Redis is doing.
They can also be used for things like in-memory lookup tables as well.
Dictionaries are a form of hashmap. The key that you are storing gets hashed, and the hash is then used to look up the value.
HashSets work on hashmaps too, they just hash the value itself instead of a key.
Like a HashSet, dictionaries aren’t stored in order of insertion. This generally isn’t an issue anyway, as you will be using the key to retrieve data from it.
If you do need an ordered dictionary, most languages offer this as an option too.
Next, we have Linked Lists. Linked Lists are another way of storing multiple values.
As the name suggests, each item in the list is linked to the next item and usually also linked back to the previous item.
Unlike normal lists, there is no way to access a particular item in the list by its index, instead, you need to traverse the whole list from start to finish.
Unlike arrays and lists, linked lists aren’t stored in memory as a contiguous block. Instead, each item in the list has a reference or pointer to the next and often previous item.
This way, new items can be added or deleted from the list without having to reallocate memory for the entire list. This makes insertions and deletions a lot quicker and more predictable.
You can even insert an item into the middle of a list quickly, as only the nodes on either side of the insertion need to be changed.
Linked lists are perfect if you don’t need random access to items, and you want to ensure a constant insertion or delete time, which is crucial for real-time applications where every millisecond counts.
Typically, linked lists can be used to implement things like queues or stacks where you will be accessing the items in the order they were inserted.
Another way of storing data in a structured way is by using trees. Trees are more like family trees than the green kind.
They allow you to store not only the value but also the relationship as well.
A tree will have one root node, which then has one or many child nodes. It is similar to a linked list, except a tree can have more than one relationship.
The number of children each node will have will depend on the type of tree you are implementing.
- General trees, where each node can have any number of children.
- Binary trees, where each node can only have up to 2 children.
There are others as well such as Binary Search Trees, Red-Black Trees, and AVL Trees.
Trees are useful in storing data that has a hierarchy associated with it.
For example, an HTML or XML document has nested elements that could be stored as a tree. The files and folders stored on your computer follow a tree structure, as does a company employee structure.
Trees are also used for database indexes, domain name servers, machine learning, and much more.
Most programming languages don’t come with a ready-to-use Tree data structure as there are so many ways to implement them, and it is different depending on your particular use case.
Graphs are very similar to trees except it lacks that top-down relationship you have with trees.
If a tree is like a family tree, a graph is more like your social network. Each node can have a relationship with multiple other nodes and the relationship goes both ways.
In fact, most social networks like Facebook and LinkedIn will use a graph to store all the relationships between their users.
If you find that each node in your data is related to other nodes, then you might need to look into graph storage for your application
Again like trees most programming languages don’t come with a graph implementation, this is something you will need to build yourself, typically by using a combination of arrays and linked lists.
These are the 8 main data structures that are used in programming. Most of the other data structures you will hear about are just a variation of these.
Next time you have to pick a data structure, think of these 8 and see which one fits the scenario the best.