I don't need to tell you why it's nice to have your code run fast. Here are some tricks I've found helpful, starting with the most bang for the buck.
1. Profile
You can't know how to make your code faster until you know where it is slow. Profiling tools let you see how much time your code is spending on which lines and operations. cProfile is a classic and you can't go wrong with it, but my personal favorite right now is py-spy. It doesn't require you to modify your code at all and, because it samples your program from the outside, it adds almost no overhead.
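Here's a minimal sketch of what that looks like with cProfile; slow_function is just a made-up stand-in for whatever your hot path happens to be.

```python
# Profile a made-up slow_function with cProfile and print the ten most
# expensive calls, sorted by cumulative time.
import cProfile
import pstats


def slow_function():
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total


cProfile.run("slow_function()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)
```

With py-spy there's nothing to add to your code at all: something like `py-spy top -- python my_script.py` samples a running script from the outside and shows you where the time goes.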
2. Elimination
The fastest code is the code that never runs. After working on code for a while it's amazing how much cruft accumulates. Deleting it is the best optimization trick. And the best part is you don't incur any technical debt — there's no new code to hide bugs and no new libraries to depend on.
3. Avoidance
Some code only needs to be run some of the time. The addition of if-else checks can help you avoid costly computations that aren't strictly necessary.
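A minimal sketch of the idea; every name here is hypothetical.

```python
# Avoidance: guard a costly step behind a cheap check so it only runs when
# it is actually needed. All of these functions are made up for illustration.
def looks_suspicious(record):
    # cheap test
    return record.get("amount", 0) > 10_000


def expensive_validation(record):
    # stand-in for something slow: a database lookup, an API call, etc.
    return all(isinstance(v, (int, float, str)) for v in record.values())


def process(record, strict=False):
    if strict and looks_suspicious(record):
        expensive_validation(record)
    return {**record, "processed": True}


process({"amount": 12}, strict=True)  # skips the expensive step entirely
```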
4. Caching
Your code may be computing the same thing several times. Store it and re-use it, rather than re-computing it.
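In Python, functools.lru_cache gives you this for free on any function whose arguments are hashable. A minimal sketch:

```python
# Caching with functools.lru_cache: a repeated call with the same argument
# returns the stored result instead of recomputing it.
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_square(x):
    total = 0
    for _ in range(1_000_000):
        total += x * x
    return total // 1_000_000


slow_square(12)  # computed the slow way, once
slow_square(12)  # returned instantly from the cache
```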
5. Pre-processing
This is similar to caching, but uses the disk. If you have to transform the data in a table every time you load it, you might be able to save a lot of time by making a separate table with the pre-transformed data. Then each time you use it, it's all ready to go.
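A minimal sketch with pandas, using made-up file names and a made-up transform (Parquet here assumes pyarrow or fastparquet is installed; a pickle or CSV works the same way). The expensive work happens once, and later runs just read the prepared file.

```python
# Pre-processing: transform once, save to disk, and load the prepared copy
# on every later run. File names and the transform are hypothetical.
import os
import pandas as pd

RAW = "orders.csv"
PREPARED = "orders_prepared.parquet"


def expensive_transform(df):
    return df.assign(total=df["price"] * df["quantity"])


if os.path.exists(PREPARED):
    df = pd.read_parquet(PREPARED)              # fast path on every later run
else:
    df = expensive_transform(pd.read_csv(RAW))
    df.to_parquet(PREPARED)                     # pay the cost once
```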
6. Vectorization
If you're working with many numbers in Python, put them in arrays and use NumPy. Avoid for loops whenever possible. You will be amazed how fast it goes.
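A minimal sketch of the difference:

```python
# Vectorization: the NumPy version computes the same sum of squares as the
# loop, but as a couple of array operations instead of a million iterations.
import numpy as np

values = list(range(1_000_000))

# Plain Python loop
total = 0
for v in values:
    total += v * v

# Vectorized with NumPy: same answer, typically orders of magnitude faster
arr = np.array(values, dtype=np.int64)
total_vec = int(np.sum(arr * arr))

assert total == total_vec
```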
7. Algorithms and Data Types
This is the level where the computer science courses pay off. (Or where you might want to look into reading some CS literature.) Choosing the right algorithm for what you are trying to do can sometimes reduce a longer-than-the-age-of-the-universe calculation down to five minutes. This will be entirely case-specific and will require you to know roughly how many bits are doing what at each point in your program. It's a fun direction to head in, and sometimes you can get caught up in it so completely that you forget about the problem you were originally trying to solve.
Closely related is choosing the right data type. Again, it will depend on exactly what you are doing, but having the right data type can totally change the game. While working on a variant of Dijkstra's algorithm, a switch from a list to a heap sped up the algorithm by five orders of magnitude (100,000X). It can pay to understand what's going on under the hood.
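To make that concrete, here is a minimal sketch of the same job done both ways: repeatedly pulling the smallest item out of a collection, which is what the priority queue in Dijkstra's algorithm spends its time doing.

```python
# Repeatedly extracting the minimum: O(n) per pull from an unsorted list,
# O(log n) per pull from a heap. The gap widens quickly as n grows.
import heapq
import random

items = [random.random() for _ in range(5_000)]

# List version: scan for the minimum and remove it, over and over
as_list = list(items)
while as_list:
    as_list.remove(min(as_list))

# Heap version: same result, far less work per pull
as_heap = list(items)
heapq.heapify(as_heap)
while as_heap:
    heapq.heappop(as_heap)
```

Even at this modest size the list version is noticeably slower, and the difference only grows with the size of the problem.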
8. Acceleration
After you've removed all the unnecessary code, vectorized, tweaked your algorithms, and made sure that nothing heavy runs more often than it needs to, and you still feel the need to speed up your code, you can move up to the next level. When you're working in Python, you can get big speedups by compiling the hot parts of your code down to machine code.
My favorite tool for doing this is Numba, which is shockingly easy to use. The biggest cost is that it can introduce some debugging challenges, but nothing that can't be navigated with persistence. I've seen it speed up code more than 50 times. If you have nested for loops with branching logic, Numba is your new best friend.
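A minimal sketch of the kind of function Numba handles well; the function itself is made up.

```python
# Numba's @njit compiles this nested loop with branching to machine code the
# first time it is called; later calls run at compiled speed.
import numpy as np
from numba import njit


@njit
def count_close_pairs(x, threshold):
    count = 0
    for i in range(x.shape[0]):
        for j in range(i + 1, x.shape[0]):
            if abs(x[i] - x[j]) < threshold:
                count += 1
    return count


x = np.random.rand(2_000)
count_close_pairs(x, 0.001)  # first call pays the compilation cost
count_close_pairs(x, 0.001)  # subsequent calls are fast
```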
9. Parallelization
If you still need to speed up your code, you are now in the realm of brute force. The first brute force trick is to split up your code and run it on lots of processors at the same time. It isn't any more efficient, but now you have an army instead of a lone soldier.
There are several tools for doing this. On a local machine, I prefer the multiprocessing package. I've also had excellent experiences with joblib. There are of course many others, and that doesn't even touch cloud tools that scale to add as many processors as you want.
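A minimal sketch with multiprocessing; simulate is a made-up stand-in for any CPU-heavy, independent task.

```python
# multiprocessing.Pool runs the same function on several worker processes at
# once. The __main__ guard is required on platforms that spawn new processes.
from multiprocessing import Pool


def simulate(seed):
    # stand-in for an independent, CPU-bound chunk of work
    total = 0
    for i in range(1_000_000):
        total += (seed * i) % 7
    return total


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(simulate, range(16))  # 16 tasks across 4 workers
```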
Take note: parallelization can open up a lot of doors for demons to enter. Communication and coordination become trickier, and debugging can become an absolute nightmare. It can be a tremendous boost to your compute power, but it will not be painless.
10. Graphics Processing Units
When you need blink-and-it's-gone speed, you skip software altogether and hard wire your computation directly into silicon. Forty years ago, the ancestors of the graphics processing unit (GPU) were specialized video cards created to render video at higher frame rates. Through the years they have evolved to do more sophisticated graphics computations, like 3D rotation and shading. Underlying all this is exceptionally fast matrix computation and a ridiculously large number of processors. This just so happens to match the needs of some artificial neural networks. If you can coerce your code to run on a GPU (or take advantage of a library that has already solved that problem) it can speed up your code by hundreds of times.
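As one hedged example (assuming an NVIDIA GPU with CUDA drivers and the CuPy library, which mirrors much of the NumPy API), a big matrix multiply can be pushed onto the GPU with very little code.

```python
# A sketch of GPU acceleration with CuPy: the arrays live in GPU memory and
# the matrix multiply runs there. Requires compatible CUDA hardware.
import cupy as cp

a = cp.random.rand(4_000, 4_000, dtype=cp.float32)
b = cp.random.rand(4_000, 4_000, dtype=cp.float32)

c = a @ b                            # executed on the GPU
cp.cuda.Stream.null.synchronize()    # wait for the kernel to finish
c_host = cp.asnumpy(c)               # copy the result back to the CPU
```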
This path has two dangers, one obvious and one less so. The obvious danger is that it can be hard work to get your code to run on entirely different hardware. The rules have changed. Your intuitions can be wrong. Debugging GPUs is a specialized discipline of its own.
The second is that you are placing a bet that the computations you need to do are what the GPU can do. There are lots of things that GPUs can't do, or at least can't do better than a CPU. Be very honest with yourself about whether you need what they have to offer. You might be missing out on mind-blowing algorithmic innovations just because the GPU API doesn't have a command for that. It's great to strap a jet engine to your bicycle, but you need to be really sure that your bicycle is pointed in the right direction before you ignite it.
Beware the Rabbit Hole
Like Alice chasing the White Rabbit, it's tempting to set our sights on code speed at the expense of everything else: complexity, brittleness, dependencies, memory. Worse, it's all too easy to chase small performance wins in parts of our code that are already fast, rather than focusing on the slowest ones. Keep your sights set on the big goal: ease of use. Buying a 5% speedup with a doubled memory footprint may not be worth it.
Donald Knuth's words on the topic are as true today as they were in the 70s.
We should forget about small efficiencies, say about 97 percent of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 percent.
To walk through a progression of these optimization steps in a deep neural network autoencoder, join me for Course 314 in the End-to-End Machine Learning School.