This post was reviewed by: Valeria Gomes, Luis Moneda, Cristiano Breuel, Hector Lira, Scott McKuen
Interviews are always a hot topic in any field because of how intense they can be. The interview is usually a short process aimed at collecting as much information as possible about a candidate in order to make a hiring decision.
However, a handful of questions, answers, and exercises across a few interviews can never capture all of a candidate’s strengths and weaknesses.
Although Nubank is a relatively young company (founded in 2013), we have been using Data Science from day one. This means we have interviewed a lot of people for several different positions, such as Data Scientists, Machine Learning Engineers and Data Science Managers.
So, if you want to know some tips to take into account when participating in interviews for Data Science positions, this is the right article to read!
1. Understand the problem first
In several situations you may be asked to say how you would solve a particular problem using Machine Learning.
This is usually the case in interviews for most applied Data Scientist and Machine Learning Engineer roles, where you are required to solve actual business problems (as opposed to just researching techniques that may never see the light of day).
Understanding the problem is usually more important than the solution itself. Once the problem is clearly defined, the solution is often obvious.
In several cases, especially during a synthetic scenario within an interview, leveraging data and modeling will be the way to go, but you should make that call yourself.
The takeaway here is to make sure you ask questions to ensure you understand the problem and the tradeoffs involved before jumping into solutions.
2. Recognize when not to use Data Science/Machine Learning solutions
Here are some examples of when such solutions might not be appropriate:
- We don’t understand the problem well enough yet (a good, thorough EDA may be all that’s needed for now)
- The business needs a quick fix for an urgent problem right now (modeling would take too long)
- There is no data to work with (a manually built heuristic is probably better until enough data is collected)
- The cost of a DS/ML solution is higher than the value of solving the problem (bad economics)
Caring more about the business outcomes than how those outcomes get delivered is key. Data science is not an end in itself.
3. Understand how and where Machine Learning-enabled applications differ from regular software
Machine Learning-enabled software is still software. Production Machine Learning systems still need to be connected to the underlying IT infrastructure in some way.
Like any software, it has to be version-controlled and tested thoroughly via unit and integration tests; common code should be factored out into shared libraries across the team. Complexity should be kept under control so the code remains easy to reason about and refactor later on.
In short, everything you learned in Software Engineering 101 also applies to ML code.
Main Differences between Machine Learning Systems and Regular Software
The first difference is that they can fail silently. Unlike regular software, an ML system may produce garbage even when there is no explicit error message: when an important feature breaks due to upstream data problems, the system will keep scoring events, unaware that it is being fed bad data.
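As a sketch of a guard against this failure mode (the feature names and the threshold here are made up, not from any real system), a scoring service can validate input null rates before calling the model:

```python
# Minimal sketch (feature names and threshold are hypothetical): refuse to
# score when an input feature looks broken, instead of failing silently.
def check_null_rates(rows, features, max_null_rate=0.05):
    """Raise if any feature's null rate exceeds the threshold."""
    for feat in features:
        null_rate = sum(1 for r in rows if r.get(feat) is None) / len(rows)
        if null_rate > max_null_rate:
            raise ValueError(
                f"feature {feat!r}: null rate {null_rate:.0%} "
                f"exceeds {max_null_rate:.0%}"
            )
```

A check like this turns a silent data problem into a loud, debuggable failure.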
The constant need to keep track of data is the second main difference. If you want to reproduce the system behavior, you need not only the exact code used, but also the data it was trained with.
4. Understand the tradeoffs between model performance and complexity
The complexity of Machine Learning solutions should be proportional to the value they bring to the business. More complex models (higher-capacity classifiers, nonlinear methods, ensembles) usually do yield better performance, and the same goes for more elaborate feature preprocessing and engineering.
However, the more complex a Machine Learning system is, the more costly and difficult it will be to deploy that model to production. In addition, maintenance and monitoring work will also be heavier and more time-consuming. Ask yourself the questions below when you’re thinking about these tradeoffs:
- Is replacing a simple boosted-tree model with a deep neural net worthwhile for a small increase in AUC?
- Should you replace a very simple linear regression model with a more complex one in this scenario?
- Should we incorporate textual and image features to the current tabular model?
- Should we use an ensemble combining the results of multiple models instead of a single one?
The answer to such questions is usually “It depends on the business setting”. Your job as a Data Scientist is to help stakeholders understand where to draw that line and where we start seeing diminishing returns.
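One way to ground that discussion is to quantify the gap. AUC can be computed directly from ranked scores; the two score lists below are illustrative, not from any real model:

```python
# Illustrative sketch: AUC computed as the probability that a randomly
# chosen positive outranks a randomly chosen negative (Mann-Whitney form).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]
simple_scores = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]   # made-up "simple model" scores
complex_scores = [0.1, 0.3, 0.4, 0.5, 0.7, 0.9]  # made-up "complex model" scores
print(auc(simple_scores, labels))   # ~0.89
print(auc(complex_scores, labels))  # 1.0
```

Whether a gap like that justifies the extra serving and maintenance cost is exactly the business call the section above describes.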
5. Consider how models drive business outcomes
In all but the most entry-level positions, you will be expected to understand how Machine Learning models help businesses make money, i.e. how they reduce costs, increase revenue, or improve customer satisfaction.
You have to contextualize Machine Learning models as part of the business flow they’re in. Ask yourself these questions:
Data Scientist (DS)
- “How would you measure the business impact of this model?”
- “If I told you that non-technical people would be using the outcomes of this model, how would it change your choice of model?”
Machine Learning Engineer (MLE)
- “How would a client service use the predictions produced by this model?”
- “With scenario X in mind, what are the tradeoffs between running this model as a real time service versus running it as a daily batch job?”
6. Don’t be afraid to say “I don’t know”
It’s almost certain that you’ll be asked about things you haven’t had experience with, so don’t be afraid to be honest in these situations. You aren’t expected to know everything, but to be able to learn whatever you need as part of your day-to-day job.
Pretending to know about something you don’t is one of the very worst things you can do at an interview. It’s easy for an interviewer to check if you know what you are talking about. And, if it becomes clear you are pretending to know something you don’t, it will be a huge red flag.
A good alternative to exaggerating your level of knowledge is to make educated guesses (making it clear it’s a guess). It’s a great opportunity to show you’re quick to adapt to new concepts!
“I don’t know exactly what you mean by X, but based on the context and my previous experience, I would guess it’s something along the lines of… Is that correct?”
7. Expect questions about topics you claim experience in
It is usually in your best interest to show that you have experience in certain areas related to Data Science/Machine Learning, especially when it connects to the position you are applying for. If you are interviewing at a Computer Vision company, it is definitely worthwhile to mention that you worked with convolutional neural nets and image processing in a previous job.
However, this will invite questions on those topics, so make sure you can answer them. Always double-check that your CV and profile only include points you would be comfortable being questioned on.
8. Consider the Machine Learning lifecycle from ideation to operation
It’s important to know how a Machine Learning-enabled system evolves from an idea to an actual system in production.
Before having a working model, you need to understand the problem to be solved and make sure DS/ML is the way to go. This is where you’ll probably speak to several stakeholders, such as Product Managers, Business Analysts and Software Engineers.
Make sure all the players involved understand the constraints and how the model will be used.
During data analysis and modeling, many premises are put to the test and, again, there will be lots of communication to validate assumptions and get people up to speed as the project progresses. Here is where you usually think about the policy layer (i.e. turning model scores into decisions).
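The policy layer can be as simple as mapping score ranges to actions. The thresholds and action names below are entirely hypothetical:

```python
# Hypothetical policy layer: the model emits a risk score in [0, 1],
# and the business decides which score ranges map to which actions.
def decide(risk_score):
    if risk_score < 0.2:
        return "approve"
    if risk_score < 0.7:
        return "manual_review"
    return "decline"
```

Keeping this mapping outside the model lets the business tune thresholds without retraining anything.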
The first deployment is a critical step in the project, where many things can go wrong. Here is where the model gets integrated into whatever infrastructure it will be a part of. There will very likely be some fixing and adjusting before everything is ready to actually be used in production.
After the model has been deployed, monitoring routines to make sure that the model inputs and outputs are behaving as expected will be a must. Lastly, you’ll need to consider if/when to retrain the model and whether the model itself will affect future training sets.
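One common monitoring check (a sketch; the bin count and alert threshold are conventions, not requirements) is the Population Stability Index, which flags drift between the score distribution seen at training time and the one observed in production:

```python
import math

def psi(expected, actual, n_bins=10):
    """Population Stability Index between two samples of model scores."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # a small epsilon keeps empty bins from breaking the log below
        return [(c + 1e-6) / (len(sample) + n_bins * 1e-6) for c in counts]

    p, q = fractions(expected), fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A PSI above roughly 0.25 is often treated as a significant shift worth investigating.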
All information here is to be taken as a rough guideline only. While we have tried to make the text as widely applicable as possible, what works for us at Nubank may not necessarily be suitable for every situation.