From an interview with the Black Swans blog
How did you get started in data science?
I didn’t know I was getting started in data science. I was just trying to answer questions that happened to involve a lot of data. My educational background is in mechanical engineering and biomedical applications of robotics. I was trying to do things like, given a very irregular, noisy, and intermittent recording of blood pressure, infer someone’s heart rate. Given a series of images taken from a robot’s camera, infer the robot’s position. Given the arm movements of several dozen stroke patients, determine whether and how much those movements have gotten smoother. These problems required a lot of data interpretation, data cleaning, modeling, and statistical assessment. Finally, in 2013, when switching employers, I realized that my toolbox was being referred to as data science, and I applied for data scientist positions. I have been wearing that label ever since.
Which models/algorithms do you make use of most often when solving data-science problems? How do you decide between competing models?
I’ve found model selection to be a very personal exercise. The way you go about it reveals something about you, just as you can get insights into how someone thinks by watching them play chess. One way to think about the selection process is as having different levels of maturity.
At the beginning, we choose a model because we know it, or because we are familiar with it. It may not be the best, it may not even be appropriate, but it is the one that we used in our last project or that we published a bunch of conference papers about, and we are interested in it or committed to it.
The next step in the progression is performance-driven selection. An advanced modeler will be aware that some models handle small data sets better than others, and that some are much more forgiving when their underlying assumptions are violated. An experienced modeler will have a solid understanding of these trade-offs and will choose the option that best suits the problem they’re trying to solve.
After someone has seen their model implemented and used by others over the course of a few years, they start to gain an even broader appreciation for the trade-offs involved. At this level, a modeler selects models not just based on their technical performance, but also on their long-term costs and benefits. Some models are very easy to compute. Some occupy very little space in memory or on disk. Some models are very sensitive to changes in the phenomenon being modeled and some are not. Some require much more maintenance than others, or require that someone who is an expert in the method be on hand to make adjustments. Someone selecting a model with this system-level awareness will take all of these things into account.
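To make the trade-offs above concrete, here is a minimal sketch that compares two toy models on held-out error, fitting time, and the size of what has to be stored. Everything in it is an illustrative assumption rather than anything from the interview: the class names, the synthetic linear data, and the choice of a least-squares line versus a one-nearest-neighbor lookup are all hypothetical stand-ins.

```python
import pickle
import random
import time


class LinearModel:
    """Least-squares line y = slope * x + intercept, fitted in closed form."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        return self.slope * x + self.intercept


class NearestNeighborModel:
    """One-nearest-neighbor regressor: keeps every training point around."""

    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Return the target of the closest stored training input.
        return min(self.points, key=lambda p: abs(p[0] - x))[1]


def evaluate(model, train, test):
    """Fit on train, then report held-out error, fit time, and stored size."""
    start = time.perf_counter()
    model.fit(*train)
    fit_seconds = time.perf_counter() - start
    xs, ys = test
    mse = sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # Size of the fitted state when serialized, as a rough proxy for disk cost.
    stored_bytes = len(pickle.dumps(vars(model)))
    return {"mse": mse, "fit_seconds": fit_seconds, "bytes": stored_bytes}


def make_data(n, rng):
    """Synthetic data (an assumption): a noisy straight line."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)
report = {
    "linear": evaluate(LinearModel(), train, test),
    "nearest_neighbor": evaluate(NearestNeighborModel(), train, test),
}
for name, row in report.items():
    print(name, row)
```

On data like this the line should come out ahead on both error and stored size, because the data really is linear and the line only keeps two numbers. That says nothing about which model is better in general. With a different underlying phenomenon the ranking could flip, which is the reason for looking at several costs side by side.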
What advice would you give to somebody starting out in data science?
When I think about the advice that I would go back in time and give to myself as a beginner data scientist, several things come to mind. The first and biggest is to spend most of your time building things. Projects, analyses, tutorials, visualizations, code. The more applied my work has been, the more I have learned from it. There is definitely a place for studying theory, derivations, and philosophical deep dives. But for me these have just been the mortar. The bulk strength of my data science foundation, the granite blocks, has come from practice.
The next piece of advice is to not be afraid of digging into your tools and asking how they work. Being able to use the XGBoost library in practice is a fantastic skill. Having an intuition for how it works takes you to the next level. It lets you accomplish things with it that you wouldn’t otherwise be able to do. It lets you know where you should and where you shouldn’t use it. While we rightly respect our models and methods, there is nothing sacred about them. They each have their limitations and quirks. Understanding these will help you progress from proficiency to mastery.
The third piece of advice I would give myself is to actively resist imposter syndrome. There are more facets to data science and more analysis tools than any one individual can possibly become an expert in. There will always be things you don’t know. You’ll hear names of algorithms tossed out casually on Twitter that you have never heard of before. This is OK. Not only is it normal, it is universal.
What are the greatest risks presented by Big Data and AI?
It is difficult to predict what changes a proliferation of data and machine learning methods will bring. Historically, we humans are really bad at predicting the future. So I won’t try. However, I do find it instructive to look at previous examples of technological changes. Every big development, whether it is the automobile, space flight, or vacuum cleaners, has been painted as a threat to humanity as we know it. So far, this has not proven to be the case. My favorite example of this is the alarm raised by those who feared that reading books would so fully occupy the minds of young people that they would be socially stunted and degrade society. This is a reliable human response to any significant change, real or perceived. However, all of these developments have in fact resulted in big changes. In some cases, these changes have upended people’s lives. Some people have lost money; some people have made money. Some people have lost jobs; some people have found new areas of employment. Some regions of the country and the world have benefited more than others. If history is any guide, we can expect that the changes we’ve seen will continue to unfold in ways we haven’t quite predicted. It probably won’t be the end of the world, but it is surely worth our time to keep an eye on what is changing and in which direction.