R or Python?

Torch or TensorFlow?

Spark or map-reduce?

When we're getting started on a project or in the field, the mountain of tools to choose from can be overwhelming. Sometimes it makes me feel small and bewildered, like Alice in Wonderland. Luckily, the Cheshire Cat cut to the heart of the problem:

“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where–” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“–so long as I get SOMEWHERE,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”
(Alice’s Adventures in Wonderland, Chapter 6)

If you don't have a goal in mind, then it doesn't matter which tool you use.

Choose a goal

The Cat recasts our problem. Instead of choosing a tool, we need to first choose a goal. It should be specific and concrete. When your goal is clear, you should be able to answer most of these questions:

What do you want to build?
What question will it answer?
How will it look?
Who will see it or use it?
What is most important to them?

Even if your purpose is as broad as "learn data science," structure your training as a set of small, well-defined projects. The best way to learn a skill is to use it to create things.

Weigh your options

As with machine learning algorithms, no tool is inherently better than any other. Each is suited to different circumstances. One may be faster, but the other has a simpler interface. One may be more flexible, but the other is easier to learn. Here's where the work you did choosing a specific goal really pays off. It will tell you what is most important. The trade-off between performance and ease of use is a common one. Fast and opaque or slow and intuitive? If your goal is to perform a set of calculations for a monthly report, speed may be less important than explainability of the approach and maintainability of the code. But if your goal is to do real-time algorithmic trading on the New York Stock Exchange, then extra speed will be well worth any amount of inscrutable code.
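To make the point concrete, here is a minimal sketch of how the same two tools can rank differently under the two goals described above. Everything in it is an assumption made up for illustration: the tool names, the 1-to-5 scores, and the weights. It is a toy weighted sum, not a recommended scoring method.

```python
# Hypothetical 1-5 scores for two made-up tools; higher is better.
TOOL_SCORES = {
    "fast_but_opaque": {"speed": 5, "explainability": 2, "maintainability": 2},
    "slow_but_intuitive": {"speed": 2, "explainability": 5, "maintainability": 4},
}

# Hypothetical weights expressing what each goal cares about most.
GOAL_WEIGHTS = {
    "monthly_report": {"speed": 1, "explainability": 3, "maintainability": 3},
    "real_time_trading": {"speed": 10, "explainability": 1, "maintainability": 1},
}


def weighted_score(scores, weights):
    """Sum each criterion's score multiplied by the goal's weight for it."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def best_tool(goal):
    """Return the name of the tool with the highest weighted score for a goal."""
    weights = GOAL_WEIGHTS[goal]
    return max(TOOL_SCORES, key=lambda tool: weighted_score(TOOL_SCORES[tool], weights))


for goal in GOAL_WEIGHTS:
    print(goal, "->", best_tool(goal))
```

The scores never change between the two runs. Only the weights do, and the weights come from the goal.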

If you don't yet know what the strengths and weaknesses of your options are, that's a fine place to invest some research time. Search the Internet. Read the ranting emails from your co-worker about the pain being inflicted on them by the latest version of the library you're considering. Chat with your hacker friends. Don't implicitly trust any one source or website. Collect a small data set and look for themes.


Six traps to avoid

There are many ways to drive your project into a muddy bog. The good news is that they are entirely avoidable.

1. Greed

"I want my streaming visualization to show petabytes of data in a dazzling real-time 3D clustering. By next Tuesday. Cheap." You can't have everything. Choosing will require giving up some things you want for others you want more.

2. Wishy-washiness

"I want to support open source, but I also think our customers would appreciate the responsiveness of Company X's product. I love the cleanness of these visuals, but maybe we should stick to a library that is simpler to learn." This is the twin brother of greed. Use the crisp project goals to make hard decisions and commit to them.

3. Scope creep

"That's what we decided six weeks ago, but the VP just said to make sure we have one-click donations." This is like wishy-washiness, a failure to commit, just spread out over time. This is one of the easiest to slip into, because it is sometimes mislabeled as "vision" or "leadership."

4. Blurry goals

"I want it to make money." That's a great career goal, but a terrible project goal. It doesn't tell you anything about which tools you need. Take a step back, build a business plan, distill it to a series of well-defined projects and try again.

5. Intimidation

"What we really need is the performance we can get from Spark, but no one on the team has ever used it." You can do it. Don't give in to fear.

6. Peer pressure

"Everyone on Twitter is talking about using SuperDuperNets. They must be awesome. I should use them in my recipe website to make it cool." Hype is seductive. "Everyone else is doing it" is a bad reason. Just ask your mom. Your recipe website will be cool if it gets users the recipe they want without pain.

Embrace the big picture: Nine landmarks to navigate by

In addition to avoiding traps, it's important to keep your eyes wide open to all the obvious and not-so-obvious costs of each tool. A common blind spot among machine learning aficionados is to consider compute speed and benchmark accuracy to the exclusion of everything else. There are actually lots of other things to keep in mind, and sometimes they're far more important than performance.

1. Price

If you work for a company, then the price of the tool matters. If you are building something on your own time, then the price really matters. A low price tag can easily become more attractive than high accuracy.

2. Time to learn

Your time is worth a lot. To your employer, it literally has a dollar amount attached. A "free" tool that takes three months to master is not free at all. Ease of use matters.

3. Educational value

Time cost can be discounted heavily if you want to use the tool for its own sake. Professional development or curiosity are great reasons to jump in and play with a new tool, and have a real value of their own. Tying your self-education to a project kills two birds with one stone.

4. Time to integrate

Often, what you build will have to interact with things that other people have built in order to be useful. Even a tool that is easy to use and quick to learn can be difficult to integrate with other tools. Testing, communication and managing dependencies can take an enormous amount of effort. Looking into this up front can help prevent unpleasant surprises. Much of the effort can be avoided by reusing tools that are already prevalent in your stack.

5. Maturity

The maturity of a tool can be closely related to integration time. More mature tools are also usually (but not always) more stable, supported on more platforms, better integrated with other tools, and more widely supported in the online tech community. Riding the bleeding edge is exciting, but it can be painful.

6. Scalability

If I'm building a website for a startup, I want to plan for my number of visitors to grow over time. Some tools have more growing pains than others. Do you expect your data or bandwidth requirements to double? Grow ten times? A million times? Do you expect the team working on your code to grow to five people? To five hundred? The tool that you choose may depend strongly on the answer. Be careful here though. More than one project has foundered because it bogged itself down with the overhead of highly scalable tools long before it needed them.

7. Legal status

Who owns the tool you want to use? Can they limit how you use it? Whether you sell it? Can they squeeze you for money once you have built your company on it? If you're building something for your personal use, this is not a huge issue, but if you're hoping to share it widely, you should read the fine print with a magnifying glass. "Open source" is a broad term that has lots of subcategories, but beware that it doesn't always mean "free" and it doesn't usually mean "use however you want."

8. Compute time

There are a surprising number of use cases where time isn't a concern at all. If you are interested in periodically updating a map of rent prices in Manhattan, you can afford to spend two weeks computing it. Rents just don't change that fast. Then again, if you are estimating how close a vacuuming robot is to the top of a staircase, you need to get that answer quickly to avoid catastrophe. Know your time demands well, so that you can give up speed you don't need in exchange for other strengths.

9. Accuracy

Higher accuracy is every data scientist's siren song. You can always get the accuracy just a bit higher, so why wouldn't you? This garden path can lead to death by perfection as every other need is starved for attention. As with compute time, know how much you need to get the job done and force yourself with gritted teeth to stop tinkering when you get there.

There are so many dimensions to consider, there's no way that anyone else can tell you that a language or library is good for your project until they understand it as deeply as you do.

Thanks to @ruchitgarg and @HeathWillCode for inspiring this post. They posed questions about Python 2 vs Python 3 and Torch vs TensorFlow. Although we can't declare one better than another, it is useful to understand what the trade-offs of each are. I'm still gathering my own data on deep learning frameworks, but in the meantime I rely on my colleague @anurive's rich comparison of Torch and TensorFlow. For Python versions, I can't say anything wiser than python.org's advice. Not surprisingly, they lead with:

Which version you ought to use is mostly dependent on what you want to get done.

Happy building!