Some interview questions for data scientists

A while ago, a friend asked me for some interview questions to test candidates for data scientist jobs. I think they are worth sharing. Later on, if I get some free time, I may post detailed solutions and discussions.

1-D Facebook (difficulty: medium, problem solving)

We live in a 3-D world, with x, y and z coordinates. In a 1-D world, there is only the x coordinate, and people can only see left and right. There is a startup social network company, which we can call ‘’, and it wants to build a ‘find your friends’ program for 1-D. In the 1-D world, people have no size; each person is just a dot (no need to worry about diets 🙂 ). The company has the location info of 1M people, represented as an unsorted 1-D float array of length 1M. Now, given the person at array index N, please find the two closest friends (one on the left and one on the right). The position array can be very long (but it fits in memory) and is unsorted, so sorting and then searching on each query is OK, but not yet the optimal solution. Preprocessing is OK, and there is no space limit.

Reference: this problem comes from real-life cases. For example, it is the final step of Amazon’s collaborative filtering for ‘people who bought this also bought’, after all the probability combinations have been calculated (this problem uses distance, while Amazon uses division). Another example is a spelling corrector, which needs to find the closest-spelled words in a big dictionary, where distance is defined by edit distance. A real spelling corrector is much trickier in its data structures; here I have simply reduced edit distance to position difference. PS: Google’s spelling correction uses Bayesian models instead of edit distance.
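One possible direction for the 1-D problem, as a minimal sketch rather than a full solution: since preprocessing and extra space are allowed, sort the indices once and remember each person's rank, so that every later query is a constant-time lookup. The function names `build_neighbor_index` and `closest_friends` are made up for this illustration.

```python
def build_neighbor_index(positions):
    """Preprocess once in O(n log n): sort indices by coordinate, store each rank."""
    order = sorted(range(len(positions)), key=positions.__getitem__)
    rank = [0] * len(positions)
    for r, idx in enumerate(order):
        rank[idx] = r
    return order, rank


def closest_friends(order, rank, n):
    """Return (left, right) array indices of person n's nearest neighbours.

    Each query is O(1). None is returned on a side where nobody lives.
    """
    r = rank[n]
    left = order[r - 1] if r > 0 else None
    right = order[r + 1] if r + 1 < len(order) else None
    return left, right


# Hypothetical toy data: five people instead of 1M.
positions = [3.5, -1.0, 7.25, 0.5, 3.0]
order, rank = build_neighbor_index(positions)
print(closest_friends(order, rank, 0))
```

The trade-off is O(n) extra memory for the two index arrays, which the problem statement explicitly permits.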

2-D Facebook (difficulty: hard, problem solving + algorithm)

Since you have solved 1-D Facebook, a 2-D world Facebook company, ‘’, wants to hire you to build the same program for 2-D people. Each person has a position (x, y) and can see left/right and up/down, and the program should find the 4 closest friends. These do not have to be an up-friend, a down-friend and so on; any direction is OK.

Reference: this is a real Facebook problem (not an interview question, and with no time limit), called ‘small world’, from Facebook Puzzle (the webpage has since been taken down by Facebook).
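As a starting point for the 2-D problem, here is a brute-force baseline that a candidate could be asked to improve on. It scans everyone for each query, so it is O(n) per person and is not the intended "hard" answer; a spatial index such as a k-d tree or a grid of buckets would be the natural next step. The function name `k_nearest` and the sample points are assumptions for illustration.

```python
import heapq
import math


def k_nearest(points, n, k=4):
    """Return the indices of the k people closest to person n (brute force)."""
    px, py = points[n]
    others = (i for i in range(len(points)) if i != n)
    # nsmallest keeps only k candidates at a time and returns them nearest first.
    return heapq.nsmallest(
        k, others, key=lambda i: math.hypot(points[i][0] - px, points[i][1] - py)
    )


# Hypothetical toy data.
points = [(0, 0), (1, 0), (0, 2), (3, 3), (-4, 0), (10, 10)]
print(k_nearest(points, 0))
```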

DVD Auto-classification (difficulty: medium, problem solving + machine learning)

A DVD rental company wants to beat Netflix, so it wants to build a smart machine learning algorithm to classify DVD movies automatically rather than labeling every movie by hand. Fortunately, it only stocks romantic movies and action movies, so things are easier than at Netflix. The company has observed that in romantic movies people kiss a lot, and in action movies people fight a lot, although romantic movies can have fights too. Can you use this information to build a classifier that can tell whether a new movie is action or romantic?

Reference: from the book ‘Machine Learning in Action’. It also comes from a real-life problem in my current project, which I simplified to numerical features.
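One way to approach the DVD problem is a k-nearest-neighbours classifier over two numerical features, the number of fight scenes and the number of kiss scenes. The sketch below assumes a handful of hand-labeled movies already exist; the counts, the function name `classify` and the choice k=3 are all invented for illustration.

```python
import math
from collections import Counter


def classify(movie, labeled, k=3):
    """Label a movie by majority vote among its k nearest labeled movies.

    movie is a (fights, kisses) pair; labeled is a list of ((fights, kisses), label).
    """
    nearest = sorted(labeled, key=lambda item: math.dist(movie, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (fights, kisses) counts per movie.
labeled = [
    ((3, 104), "romantic"), ((2, 100), "romantic"), ((1, 81), "romantic"),
    ((101, 10), "action"), ((99, 5), "action"), ((98, 2), "action"),
]
print(classify((18, 90), labeled))
```

The query movie has some fights but far more kisses, so its three nearest neighbours are all romantic. This is how the classifier copes with romantic movies that contain fights.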

Super long feature similarity (difficulty: medium-hard, programming+machine learning)

Some machine learning models have produced a list of features for two soft drinks, for example the sugar content. How would you measure the similarity of these two drinks using machine learning? (The interviewee should answer cosine similarity, dot product, or some other similarity or distance function for comparing two feature vectors.)

Let’s take cosine similarity as an example. In the real situation, the machine learning models produce millions of features, and a drink may be missing many of them; in other words, the feature vectors are very sparse. So when we compare two drinks with sparse features, where one drink can have many features that the other does not, do we really need to multiply every pair of features when so many of the values are zero?
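A minimal sketch of one answer: store each drink as a dictionary holding only its non-zero features, and compute the dot product over the keys the two drinks share. The function name `sparse_cosine` and the feature names are hypothetical.

```python
import math


def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict, look up in the larger one
    # Only features present in both vectors contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


# Hypothetical drinks with mostly missing features.
cola = {"sugar": 3.0, "caffeine": 4.0}
soda = {"sugar": 3.0, "sodium": 4.0}
print(sparse_cosine(cola, soda))
```

The cost is proportional to the number of stored features rather than to the millions of possible ones, because missing features are never touched.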

Calculate the square root of an integer N (difficulty: medium-hard, numerical methods + programming)

This question can have some variations:

  • (easy) How do you tell whether an integer N is a perfect square (N=k*k where k is an integer)?
  • (medium) Given a very large integer N, find the number m where m*m<=N and (m+1)*(m+1)>N.
  • (hard, needs hint) How do you determine whether a number is a Fibonacci number? The interviewer should give the hint: N is a Fibonacci number if and only if 5*N**2+4 or 5*N**2-4 is a perfect square, so simply test whether either of these two values is a perfect square.
  • (medium-hard) How do you determine whether a number is the sum of two perfect squares?

What if N is very large, so that one cannot build a table of square numbers?
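The variations above can be sketched with one building block, an integer square root computed by Newton's iteration, which needs no table and works on arbitrarily large Python integers. The function names are invented for this illustration, and counting 0 as a perfect square in the last function is an assumption. In Python 3.8+ the standard library's `math.isqrt` does the same job as `isqrt_newton`.

```python
def isqrt_newton(n):
    """Return the largest m with m*m <= n, using integer Newton iteration."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = 1 << ((n.bit_length() + 1) // 2)  # power of two that is >= sqrt(n)
    while True:
        y = (x + n // x) // 2
        if y >= x:  # the sequence has stopped decreasing: x is the floor root
            return x
        x = y


def is_perfect_square(n):
    return n >= 0 and isqrt_newton(n) ** 2 == n


def is_fibonacci(n):
    """n >= 0 is Fibonacci iff 5*n*n + 4 or 5*n*n - 4 is a perfect square."""
    return n >= 0 and (
        is_perfect_square(5 * n * n + 4) or is_perfect_square(5 * n * n - 4)
    )


def is_sum_of_two_squares(n):
    """Try every a with 2*a*a <= n and check whether n - a*a is a square."""
    a = 0
    while 2 * a * a <= n:
        if is_perfect_square(n - a * a):
            return True
        a += 1
    return False


print(isqrt_newton(10**40 - 1) == 10**20 - 1)
print([k for k in range(30) if is_fibonacci(k)])
```

The sum-of-two-squares check takes about sqrt(N) steps, so for truly huge N an approach based on the prime factorization of N would be needed instead.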

Essay-copying (difficulty: medium-hard, NLP, machine learning, modeling)

In the final exam at a university, the professor received 200 essays from students, about 1000 words each. Unfortunately, he found that some students had copied from other people’s essays. But these students were smart: they did not copy an entire essay, they may have changed words in some sentences, and they may have copied from 2-3 other people (but surely not from all the other students; there is not enough time 🙂 ). Please build a machine learning system to help the professor find these bad students.

Reference: clustering with natural language processing is very important in real life. This is one example.
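A simple baseline for the essay problem, before any learned model, is to break each essay into overlapping word triples ("shingles") and flag pairs of essays that share an unusually large fraction of them. This is only a sketch: the function names, the shingle length of 3 and the 0.3 threshold are assumptions, and heavy rewording would need something stronger, such as TF-IDF vectors or sentence embeddings followed by clustering.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word sequences that occur in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def suspicious_pairs(essays, k=3, threshold=0.3):
    """Return (name_a, name_b, overlap) for essay pairs sharing many shingles.

    overlap is the shared shingle count divided by the smaller essay's count,
    so copying only part of an essay still produces a visible score.
    """
    sets = {name: shingles(text, k) for name, text in essays.items()}
    flagged = []
    for a, b in combinations(sets, 2):
        smaller = min(len(sets[a]), len(sets[b]))
        if smaller == 0:
            continue
        overlap = len(sets[a] & sets[b]) / smaller
        if overlap >= threshold:
            flagged.append((a, b, overlap))
    return sorted(flagged, key=lambda item: -item[2])


# Hypothetical miniature "essays".
essays = {
    "alice": "the quick brown fox jumps over the lazy dog near the river bank",
    "bob": "a quick brown fox jumps over the lazy dog by the old barn",
    "carol": "statistics is the study of collecting and analysing numerical data",
}
print(suspicious_pairs(essays))
```

With 200 essays there are only 19,900 pairs, so comparing all of them directly is cheap. Changing a single word breaks at most three shingles, which is why lightly edited copies still score high.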