Some interview questions for data scientists

Last time, one of my friends asked me for some interview questions to test the candidates of data scientist jobs. I think it is good to share the questions. Later on, I may (if I got some free time) post some detailed solutions and discussions on them.

1-D Facebook (difficulty: medium, problem solving)

We are living in a 3-D world, with x, y and z coordinates. In an 1D world, there is only the x coordinate, and people can only see left and right. There is a startup social network company, we can call it ‘1-D-facebook.com’ , and it wants to build ‘find your friends’ program for 1-D. In 1-D world, people has no size but a dot (no worry about diet 🙂 ) It has a list of 1M people’s location info, represented as an 1-D float array of 1M length, unsorted . Now, giving a people at array position index N, please find the closest two (left and right) friends.This position array can be very long (but fit in the memory), and unsorted, so sorting and search can be OK, but not yet the optimal solution. preprocess is OK, no space limit.

Reference: this problem was from some real-life cases, for example, it is the final step of amazon’s collaborative filter about ‘people bought this also bought’ after calculated all probability combinations (here this problem uses distance, and amazon using division). Another example is the spelling corrector which needs to find the closest-spelling words from a big dictionary where the distance is defined by edit-distance. A real spelling corrector is much more tricky on the data structure. Here I just simplified the edit-distance to position difference. PS: google’s spelling correction using bayesian models instead of edit-distance.

2-D Facebook (difficulty: hard, problem solving + algorithm)

Since you have solved 1-D Facebook, a 2-D world Facebook company, ‘flat-land.com’, wants to hire you and make the same program for 2-D people, where each people, with position x and y, can see left/right and up-down, and then find 4 closest friends. Surely, it does not have to be up-friend, down-friend etc, any direction is OK.

Reference: this is a real Facebook problem (not interview, no time limit), called ‘small world’ from Facebook Puzzle (this webpage is taken down by Facebook).

DVD Auto-classification (difficulty: medium, problem solving + machine learning)

A DVD-rent company wants to beat Netflix, so they want to build a smart machine learning algorithm to auto-classify DVD movies rather than manually labeling all movies. Fortunately, they only host romantic movies and action movies, so things are easier than those in Netflix. They observed one thing that, in romantic movies, people kiss a lot; in action movies, people fight a lot, but romantic movies can have fights too. Can you use this information to build a classifier which can tell if a new movie is action or romantic?

Reference: From the book ‘Machine Learning in action’. It is also from a real-life problem from my current project, but I simplified it to numerical features.

Super long feature similarity (difficulty: medium-hard, programming+machine learning)

Some machine learning models produced list of features for two soft drinks, for example, value of the content of sugar etc. One wants to compare the similarity of these two drinks using machine learning, how? (Interviewee should answer cosine similarity or dot-product or some other distance functions to compare two feature vectors).

Let’s take cosine similarity for example. Now, the real situation is that, there are millions of features from machine learning models, and some drinks may miss many features, in the other words, the feature is very sparse. So, if we want to compare two drinks with sparse features, where one drink can have many features that the other drink does not have. Do we really need to multiple each feature for these many zero values?

Calculate square-root of integer N. (difficulty: medium-hard. Numerical methods and programming)

This question can have some variations:

  • (easy) How to tell if an integer N is a perfect square number (N=k*k where k is an integer).
  • (medium) Given a very large integer N, and the number m where m*m<=N and (m+1)*(m+1)>N.
  • (hard, needs hint) How to determine a number is a Fibonacci number? The hint should be given by the interviewer: a Fibonacci number can be represented either in 5*N**2+4 or 5*N**2 -4, so simply to test if this number plus/minus 4 divided by 5 is a perfect square number.
  • (medium-hard) How to determine a number is summation of two perfect square numbers?

What if N is very large, and one can not build a table of square numbers?

Essay-copying (difficulty: medium-hard, NLP, machine learning, modeling)

In the final test of the university, the professor received 200 essays from students, about 1000 words each. Badly, he found some students were copying other people’s essays. But these students were smart: they did not copy the entire essay, maybe change words in some sentences, may copy from 2-3 other persons (but surely, they do not copy from all the other 200 students, no enough time 🙂 ). Please build a machine learning system to help professor find these bad students.

Reference: clustering using nature language processing is very important in the real life. This is an example.

 

Advertisements

Check if a number is a Fibonacci number: speed comparison for two Quora answers in python

It is interesting to read this Quora thread
What is the most efficient algorithm to check if a number is a Fibonacci Number? The top two answers are both very good and efficient.

Background, what is Fibonacci number? wikipedia

Anders gave a nice matrix exponentiation algorithm in Haskell while John posted his Java solution using square numbers. There was an argument which one was faster, so I made this comparison using python.

First of all, John needed a faster algorithm to determine if a number is a perfect square number. He used a loop from 2 to sqrt(N), and it was surely not an efficient way.

A better way to do it is the newton’s method of integer square number (wikipedia), in short, it equals using newton’s method to solve x*x-n = 0. In python, it can be easily solved by this (original code from this link):

def isqrt(x):
    if x < 0:
        raise ValueError('square root not defined for negative numbers')
    n = int(x)
    if n == 0:
        return 0
    a, b = divmod(n.bit_length(), 2)
    x = 2**(a+b)
    while True:
        y = (x + n//x)//2
        if y >= x:
            return x
        x = y

And then, John’s method is basically testing if 5*N*N+4 or 5*N*N-4 is a perfect square number. If the answer is yes for either one, this number is a Fibonacci number.

Actually the square root algorithm can have more optimizations using the 64 bit magic number  0x5fe6eb50c7b537a9 (the DOOM trick), please check wikipedia for more interesting details. To be platform independent, here I just used the original newton’s method.

Secondly, Anders code was in Haskell, so I rewrote them into Python for a fair comparison.

def fibPlus((a,b),(c,d)):
  bd = b*d
	return (bd-(b-a)*(d-c), a*c+bd)

def unFib((a,b),n):
	if n<a:
		return (0,0,1)
	else:
		(k,c,d) = unFib(fibPlus((a,b),(a,b)),n)
		(e,f) = fibPlus((a,b),(c,d))
		if n<e: return (2*k, c, d)
		else:
			return (2*k+1,e,f)

def isfib(n):
	(k,a,b) = unFib((1,1),n)
	return n==a

The full source code can be found on my github https://github.com/phunterlau/checkFibonacci

To test these two algorithm, I downloaded the first 500 Fibonacci numbers from http://planetmath.org/listoffibonaccinumbers and ran 100 times for each algorithm on this list of number. The result is interesting: python optimization makes difference. Unit is in second.

If run in python 2.7, John’s method won for 10% time:

python is_fib.py Anders method: 1.52931690216 John method: 1.36000704765

If run in pypy 1.9, Anders method is highly optimized for 2x speed:

pypy is_fib.py Anders method: 0.799499988556 John method: 2.0126721859

To conclude, both of these two algorithms are very good.

One another question to follow this:

Given a number N, if N is not a Fibonacci number, print out the largest Fibonacci smaller than N.

Hint: John’s method.

[challenge accepted] palindromic prime number of 5 digits

Thanks to the first comment in the last post which directed me to a math problem (http://contestcoding.wordpress.com/2013/03/22/palindromic-primes/) . From a scientist point of view, the answer itself, 98689, is just simple, but the way to answer it is kind of interesting. Let’s have a look at it step by step how a scientist solved it.

1. How to generate a list of palindromic number?

‘A palindromic number is a number that reads the same backwards as forwards (1991 for example).’ One can iterate in all numbers and check if it is palindromic:

str(num)==str(num)[::-1]

, or, a smarter way to generate them [1]:

from itertools import product

def palindromeNum(n):
   return [n*'%s'%tuple(list(i)+list(i[n*(n-1)/2%(n-1)-1::-1])) for i in product(*([range(1,10)]+[range(0,10)]*((n+1)/2-1)))]

2. Now we have a list of palindromic number, how do we determine it is a prime or not?

Simple. The is_prime() function from textbooks should work with no problem. One can loop from 2 to the square root of N and see if it can divide N. Let me introduce you a better way, Miller Rabin primality test, a pro’s way.

wikipedia: http://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test

In general, it is a faster and more efficient way to test if a number is a prime. Some other methods can be AKS primality test (wikipedia) and so on. It is good to know them 🙂

There is a nice implementation of Miller Rabin in python here, one can have a look at this beautiful algorithm.

So, we can build a loop from the end of the list of palindromic number that we generated at step 1, and test each of them using Miller Rabin primality test, we can easily find the number is 98689.

3. Some additional thoughts

  • The problem asked about 5 digits palindromic number. Also, these kind of numbers exist for 3 digits, 7 digits, 9 digits and so on, but never exist for 4 digits, 6 digits. Can you guess why? hint: some thing about number 11.
  • There is a nice website http://oeis.org/A114835 for the set of palindrome prime numbers. There are some other interesting sets of numbers.
  • Do you know the largest 11 digits palindromic number is 99999199999, 13 digits 9999987899999?

Source code:

Actually the function palindromeNum() can be optimized by generator and yield, which does real time calculation rather than unnecessarily generating the whole list.

from itertools import product
import miller_rabin

def palindromeNum(n):
 return [n*'%s'%tuple(list(i)+list(i[n*(n-1)/2%(n-1)-1::-1])) for i in product(*([range(1,10)]+[range(0,10)]*((n+1)/2-1)))]

for i in palindromeNum(5)[::-1]:
 if miller_rabin.miller_rabin(int(i)):
 print i
 break

A scientist’s ‘attitude’

Source code and data file can be found at https://github.com/phunterlau/attitude info me before use it, please.

You might have heard of this piece of junk about ‘attitude’ for many times, now a real scientist is showing you the secret behind this.

First of all, it is like this:

A small truth to make our Life 100% successful.. ……..

If A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Is equal to 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26

Then H+A+R+D+W+O+ R+K = 8+1+18+4+23+ 15+18+11 = 98%

K+N+O+W+L+E+ D+G+E = 11+14+15+23+ 12+5+4+7+ 5 = 96%

L+O+V+E=12+15+ 22+5=54%

L+U+C+K = 12+21+3+11 = 47%

(None of them makes 100%)
………… ……… ……… .
Then what makes 100%
Is it Money? ….. No!!!!!
Leadership? …… NO!!!!

Every problem has a solution, only if we perhaps change our “ATTITUDE”.
It is OUR ATTITUDE towards Life and Work that makes
OUR Life 100% Successful..

A+T+T+I+T+U+ D+E = 1+20+20+9+20+ 21+4+5 = 100%

Well, it might be true. But, is ‘attitude’ the only word which can make 100%? Let’s have a look.

What you need is python and a list of English words. I have done some google search and found this nice site of a list of 109582 English words: http://www-01.sil.org/linguistics/wordlists/english/ .

Open this list with python, and type in this piece of code:

print ‘,’.join([x.strip() for x in words_f.readlines() if sum([ord(i)-96 for i in x.strip()])==100])

We can easily get 1297 English words which can make 100%. Beside ‘attitude’, we also have, for example, :

alienation,inefficient,infernos……

This a scientist’s ‘attitude’: never trust 100%. We have something in the life much more important than just 100%, just like ‘love’=54%.