Deep learning for Hackers with MXNet (3): Instant neural art style transfer

This blog is originally posted in Chinese from my zhihu category.

So long from last blog, and thanks for coming back. Here I want to present a MXNet example of instant neural art style transfer, from which you can build your own Prisma app.

Do you know MXNet now can be installed via pip?

pip search mxnet
mxnet-cu75 (0.9.3a3)  - MXNet is an ultra-scalable deep learning framework. This version uses CUDA-7.5.
mxnet (0.9.3a3)       - MXNet is an ultra-scalable deep learning framework. This version uses openblas.
mxnet-cu80 (0.9.3a3)  - MXNet is an ultra-scalable deep learning framework. This version uses CUDA-8.0.

Let’s go

After installing MXNet, please do

git clone

which includes three different implementations of fast neural art style transfer. Big thanks to the author Zhao Wei. In this blog, I am going to talk about Perceptual Losses by Justin Johnson et al described in this paper. After git clone, please go to neural_style/perceptual/ and execute the following script:

import make_image
maker = make_image.Maker('models/s4', (512, 512))
maker.generate('output.jpg', 'niba.jpg')

where output.jpg is the output and niba.jpg is picture of the cutest deep learning cat Niba. Within a blink, we can see the output like this:



Beside this art style, multiple other pretrained neural art models are mentioned in README page under neural_style/perceptual/, please download them via the link mentioned in the page. These pretrained models should produce the art work and combine with ImageTrick:

montage output*.jpg -geometry +7+7+7 merge.jpg



Please note: some machines may encounter the following error

terminate called after throwing an instance of 'dmlc::Error' what():  [21:25:23] src/engine/./threaded_engine.h:306: [21:25:23] src/operator/./convolution-inl.h:299: Check failed: (param_.workspace) >= (required_size)

The reason behind this is from the workspace size of the convolution layers, where the default workspace might be too small for some large images. Please edit by adding workspace=4092 to each mx.symbol.Convolution function.

Hope you have some fun with your own Prisma app ūüôā


Neural art transfer has been a hot topic in deep learning, and it starts from this paper A Neural Algorithm of Artistic Style. As we have discussed in the last blog, this idea leverages the power of convolutional network where the high level features can describe so called style of an image, if apply this high level feature to a new image, one can transfer the art style and generate new art work. In the original paper, gram matrix is used for this magic. To understand gram matrix magic, one can take look at my friend’s paper Demystifying Neural Style Transfer for further understanding. There are many blogs and papers trying to understand why neural art transfer works, and this paper is probably the only correct one.

Back to the original neural art transfer: the original version calculates the per-pixel loss from the content image to the style image, and introduces a very large gram matrix, meanwhile, it has to run a logistic regression for tuning the weight of each layer. This method needs much computing time due to couple of heavy load from per-pixel loss, gram matrix plus the LR. In the market, there are several faster implementation, where Perceptual Losses method is one of the fastest ones. Perceptual Losses introduces pretrained loss networks from ImageNet, and re-uses the content loss and style loss to calculate perceptual loss, however, it doesn’t update the loss network, which saves much computing time. It works like this: when give the input image (e.g. Niba) to the transform network, it calculates the loss from the pretrained loss network, and gets back to transform network to minimize the loss, so transform network can learn the loss network style from minimizing the loss.

Perceptual loss network needs a set of pretrained network where each network for a style. One can follow under the same repo for creating new styles.


Why I paused updating this blog for a long time and resume?

Because I was carefully thinking about teaching MXNet and deep learning in a different way, much different from many other blogs or medium posts where each tutorial starts with theory or math or whatever fundamental knowledge, needs at least 30 minutes reading time, professionals don’t like the repeated fundamental knowledge part since they already know it, but new readers can’t understand what to do.

I believe the only way readers can remember the knowledge is by JUST DO IT!. From last year, I opened my category on and started publishing two minutes demo of deep learning in Chinese. It turned out very welcomed: my 2000+ followers had much fun trying these demos, they really learned after doing it and reading the theory part. If miss some math knowledge, I showed them where to learn. So, I am thinking that, why don’t I translate it back to English, and share with more readers. I will keep posting more blogs like this, hope you like them.

And, as always, have you clicked the star and fork buttons on MXNet repo ?

Setup Amazon AWS GPU instance with MXnet

Niba Meisje met de parel

(every blog should have a cat, right? This cat is “Niba”, and she is Pogo’s younger sister. The neural art style is “Girl with a Pearl Earring“. More about Neural Art with MXnet, please refer to my last blog.)

With popular requests, I wrote this blog for starting an Amazon AWS GPU instance and install MXnet for kaggle competitions, like Second Annual Data Science Bowl. Installing CUDA on AWS is kind of tricky: one needs to update kernels and solve some conflicts. I want to have special thanks to Caffe EC2 installation guide,-CUDA-7,-cuDNN)

Have you clicked “star” on MXnet’s github repo? If not yet, do it now:¬†

Create AWS GPU instance

For creating an AWS instance, please follow AWS spot instance can be an inexpensive solution for competing on Kaggle, and one can request a g2.2xlarge (single GPU) or a g2.8xlarge (4x GPUs) instance from the AWS console. The price when I wrote this blog is about 0.08$/h for g2.2xlarge and 0.34$/h for g2.8xlarge. Once approved, one can start an official Ubuntu 14.04 instance and start installing MXnet. One can save this AMI image for future reference.

Install dependencies


sudo apt-get update && sudo apt-get upgrade
sudo apt-get install -y build-essential git libcurl4-openssl-dev libatlas-base-dev libopencv-dev python-numpy unzip


Update Linux kernel on AWS

sudo apt-get install linux-image-extra-virtual

“Important: While installing the linux-image-extra-virtual, you may be prompted “What would you like to do about menu.lst?” I selected “keep the local version currently installed” ” as on Caffe reference.

Disable nouveau

nouveau conflicts with NVIDIA’s kernel module on AWS. One needs to edit /etc/modprobe.d/blacklist-nouveau.conf and disable nouveau.

sudo vi /etc/modprobe.d/blacklist-nouveau.conf

    blacklist nouveau
    blacklist lbm-nouveau
    options nouveau modeset=0
    alias nouveau off
    alias lbm-nouveau off

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
sudo reboot

Wait before finish rebooting, usually 1 min, and login back to the instance to continue installation:

sudo apt-get install -y linux-source linux-headers-`uname -r`


Install CUDA

sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install -y cuda

Important: Please reboot the instance for loading the driver

sudo reboot

If everything is fine, nvidia-smi should look like this (4 GPU instance) for example:

Screen Shot 2016-01-15 at 11.58.02 PM

Optional: cuDNN

One can apply for the developer program here When approved, download cuDNN for Linux (either v4 RC or v3 is fine), upload the cuDNN package from the local computer to the instance, and install cuDNN:

tar -zxf cudnn-7.0-linux-x64-v4.0-rc.tgz #or cudnn-7.0-linux-x64-v3.0-prod.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/


Install MXnet

git clone --recursive
cd mxnet; cp make/ .

#add CUDA options
echo "USE_CUDA=1" >>
echo "USE_CUDA_PATH=/usr/local/cuda" >>
#if you have cuDNN, uncomment the following line
#echo "USE_CUDNN=1" >>
echo "USE_BLAS=atlas" >>
echo "USE_DIST_KVSTORE = 1" >>
echo "USE_S3=1" >>
make -j8


Add some link lib path

echo "export LD_LIBRARY_PATH=/home/ubuntu/mxnet/lib/:/usr/local/cuda-7.5/targets/x86_64-linux/lib/" >>> ~/.bashrc

Install python package

One can either use the system’s python or Miniconda/Anaconda python as mentioned in my previous blog. If use system’s python, do:

sudo apt-get install -y python-pip
cd python
python install --user

Test it

python example/image-classification/ --network lenet --gpus 0

One can also give --gpus 0,1,2,3 for using all 4 GPUs, if runs on a g2.8xlarge (4x GPUs) instance. Enjoy Kaggle competitions with MXnet!

Some trouble shooting

One may see some annoying message when starting training with GPU

libdc1394 error: Failed to initialize libdc1394

It is OpenCV problem on AWS. One can simply disable it by:

sudo ln /dev/null /dev/raw1394


Deep learning for hackers with MXnet (2): Neural art

Special thanks to Eric Xie for fixing the MXnet cuDNN problem. MXnet can fully utilize cuDNN for speeding up neural art.

Have you clicked “star” and “fork” on MXnet github repo? If not yet, do it now!

Update: want an almost real time Neural Art? Let’s go to MXNet-GAN model: it is a¬†Generative Adversarial Network pretrained model, please refer to¬† for details.

Neural artÔľöpaint your cat like Van Gogh

Neural art is a deep learning algorithm which can learn the style from famous artwork and apply to a new image. For example, given a cat picture and a Van Gogh artwork, we can paint the cat in Van Gogh style, like this (Van Gogh Self-portrait in 1889 wikipedia):


Neural art comes from this paper ‚ÄúA Neural Algorithm of Artistic Style‚ÄĚ by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge The basic idea is leveraging the power of Convolution Network (CNN) which can learn high level abstract features from the artwork and simulating the art style for generating a new image. These generative network is a popular research topic in deep learning. For example, you might know Google’s Inception which generates a goat-shape cloud image by given a goat image and a cloud image. Facebook also has their similar Deep Generative Image Models¬†,¬†inspired by this paper Generative Adversarial Networks where one of the authors, Bing Xu, is also the main author of the great MXnet.

Neural art algorithm has many implementations and applications, for example these two Lua/Torch7 ones LINK1 LINK2. The paper’s gitxiv also includes many interesting applications, for example, generating a neural art gif animation. All of these implemtations uses the VGG model of image classification as mentioned in the paper. With popular demand on github, MXnet has just published its fast and memory efficient implementation. Let’s have fun with MXnet!

Have you clicked the “star” and “fork” on MXnet github repo? If not yet, do it now!

MXnet’s Neural art example

The neural art example is under mxnet/example/neural-style/. Since this example needs much computing work, GPU is highly recommended, please refer to my previous blog “Deep learning for hackers with MXnet (1) GPU installation and MNIST” for detailed installation guide. For sure, one can use CPU anyway since mxnet supports seamless CPU/GPU switch, just a reminder, it may take about 40-50 minutes for generating an image with CPU.

Optional: MXnet can speed up with cuDNN! cuDNN v3 and v4 both work with mxnet, where v4 is 2-3 seconds faster than v3 on my GTX 960 4GB. Please go to and apply for nVidia Developer program. If approved, one can install CuDNN with CUDA as simple as this (Reference: Install Caffe on EC2 from scratch (Ubuntu, CUDA 7, cuDNN))) :

tar -zxf cudnn-7.0-linux-x64-v3.0-prod.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/cudnn.h /usr/local/cuda/include/

And please turn on USE_CUDNN = 0 in make/ and re-compile MXnet for CuDNN support, if not previously compiled. Please also update the python installation if necessary.

For readers who don’t have an GPU-ready MXnet, the market has these free or paid services and apps for trying Neural Art. Since neural art needs a lot of computing, all these paid or free services need to upload the images to servers, and wait for a long time for finishing processing, usually from hours (if lucky) to weeks:

  1. Deepart Free submission and the average waiting time is a week ūüė¶ if one wants a faster processing, one can consider donation and speed up to…. 24 hours wait.
  2. Pikazo App: It is a paid app (2.99$) as, and you still need to wait in line for a long time.
  3. AI Painter: It is a service from instapainting, free of charge, long waiting time too.
  4. DeepForger:¬†(Thanks to¬†alexjc¬†from reddit) a twitter account that people can submit images and get Neural Art results within hours. “It’s a new algorithm based on StyleNet that does context-sensitive style, and the implementation scales to HD.”

For my dear readers who are lucky to have a GPU-ready MXnet, let do it with MXnet! I am going to use an image from my sister’s cat “pogo” and show every single detail of generating an art image, from end to end.

Steps and parameters

MXnet needs a VGG model.¬†We need to download it for¬†the first time running using MXnet version of this VGG model takes about several MB where the Lua version of the same model costs about 1 GB. After having the model ready, let’s put the style image and the content image into the input folder. For example, I give mxnet the cat image as content, and Van Gogh’s painting as style:

python --content-image input/pogo.jpg --style-image input/vangogh.jpg

After 1-2 minutes, we can see the output in the output folder like this:


Let’s try painting the cat “Pogo” in a modern art style. By replacing Van Gogh with ‘Blue Horse’ Modern Equine Art Contemporary Horse Daily Oil Painting by Texas Artist Laurie Pace ( , pogo is painted like this:

python --content-image input/pogo.jpg --style-image input/blue_horse.jpg


Isn’t it cool?

In the python script, there are some fine tune parameters for better results, and each of them is explained as following:

  • --model The model name. In this example, we only have VGG model from image classification, so please let it as it is. In future, MXnet may provide multiple other models, like Google Inception, since they share the same framework.
  • --content-image Path to the content image, a.k.a the cat image.
  • --style-image Path to the style image, a.k.a the Van Gogh image.
  • --stop-eps The model use eps for evaluating the difference between the content and the style. One can see the eps value converging during the training. The less eps, the more similar style. stop-eps is the threshold for stopping the training. Usually a smaller stop-eps gives stronger style, but it needs longer training time. The default 0.005 value is good, and one can change to 0.004 for better results.
  • --content-weight --style-weight The weight of content and style. By default, it is 10:1. If one thinks the style is too strong, for example, the painting feels strange and harsh, please reduce it to 20:1 or 30:1.
  • --max-num-epochs The max number of epochs, by default it is 1000 epochs. Usually MXnet can converge to a good eps value around 200 epochs, and we can leave this parameter alone.
  • --max-long-edge The max length of the longer edge. MXnet adjust the content image to this size and keeps the aspect ratio. The runtime is almost proportional to the number of pixels (aread) of the image, because the convolution network input size is defined by the number of pixels, and each convolution is on each image block. In short, 700px image may double the memory cost and runtime to that in 500px image. In the following benchmark, one can see that, a 512 px image needs about 1.4GB memory, which is good for a 2014 Macbook Pro or other 2GB CUDA devices; a 850-900 px image is good for 4GB memory CUDA card; if one wants a 1080p HD image, one may need to get a 12GB memory Titan X. Meanwhile, the computing time is related to the number of CUDA cores: the more cores, the faster. I think my dear readers now understand why these free Neural Art services/apps needs hours to weeks of waiting time.
  • --lr logistic regression learning ratio eta for SGD. Mxnet uses SGD for finding a image which has similar content to cat and similar style to Van Gogh. As in other machine learning projects, larger eta converges faster, but it jumps around the minimum. The default value is 0.1, and 0.2 or 0.3 work too.
  • --gpu GPU ID. By default it is 0 for using the first GPU. For people who have multiple GPUs, please specify which ones would be used.--gpu -1 means using CPU-only mxnet, which takes 40-50 minutes per image.
  • --output Filename and path for the output.
  • --save-epochs If save the tempory results. By default, it saves output for each 50 epochs.
  • -remove-noise Gaussian radius for reducing image noise. Neural art starts with white noise images for converging to the neural art from content + style, so it artificially introduces some unnecessary noise. Mxnet can simply smooth this noise. The default value is 0.2, and one can change it to 0.15 for less blur.


Out of memory

Since the runtime memory cost is proportional to the size of the image. If --max-long-edge was set too large, MXnet may give this out of memory error:

terminate called after throwing an instance of 'dmlc::Error' what():  [18:23:33] src/engine/./threaded_engine.h:295: [18:23:33] src/storage/./gpu_device_storage.h:39: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: out of memory

To solve it, one needs to have smaller --max-long-edge: for 512px image, MXnet needs 1.4GB memory; for 850px image, MXnets needs 3.7GB. Please notices these two items:

  1. GTX 970 memory issue: GTX 970 can only support up to 3.5GB memory, otherwise it goes crazy. It is an known problem from nVidia, please refer to this link for more details.
  2. The system GUI costs some memory too. In ubuntu, one can press ctrl+alt+f1 for shutting down the system graphic interface, and save some about 40MB memory.

Out of workspace

If the image size is larger than 600 to 700 pixels, the default workspace parameter in may not be enough, and MXnet may give this error:

terminate called after throwing an instance of 'dmlc::Error' what():  [18:22:39] src/engine/./threaded_engine.h:295: [18:22:39] src/operator/./convolution-inl.h:256: Check failed: (param_.workspace) >= (required_size)
Minimum workspace size: 1386112000 Bytes
Given: 1073741824 Bytes

The reason is MXnet needs a buffer space which is defined in as workspace for each CNN layer. Please replace all workspace=1024 in with workspace=2048.


In this benchmark, we choose this Lua (Torch 7) implementation and compare it with MXnet for learning Van Gogh style and painting pogo the cat. The hardware include a single GTX 960 4GB, a 4-core AMD CPU and 16GB memory.


Memory Runtime
MXnet (w/o cuDNN) 1440MB 117s
MXnet (w/ cuDNN) 1209MB 89s
Lua Torch 7 2809MB 225s

MXnet has efficient memory usage, and it costs only half of the memory as that in the Lua/Torch7 version.


Lua/Torch 7 is not able to run with 850px image because of no enough memory, while MXnet costs 3.7GB memory and finishes in 350 seconds.

Memory Runtime
MXnet (w/o cuDNN) 3670MB 350s
MXnet (w/ cuDNN) 2986MB 320s
Lua Torch 7 Out of memory Out of memory


MXnet magic to squeeze memory (12.21.2015 update)

With some invaluable discussion from reddit, and special thanks to¬†alexjc¬†(the author of DeepForger) and¬†jcjohnss¬†(the author of Lua Neural-artstyle), I have this updated benchmark with MXnet’s new magic¬†MXNET_BACKWARD_DO_MIRROR to squeeze memory (github issue). Please update to the latest MXnet github and re-compile. To add this magic, one can simply do:

MXNET_BACKWARD_DO_MIRROR=1 python --content-image input/pogo.jpg --style-image input/vangogh.jpg


Memory Runtime
MXnet (w/o cuDNN) 1440MB 117s
MXnet (w/ cuDNN) 1209MB 89s
MXnet (w/o cuDNN + Mirror) 1116MB 92s


Memory Runtime
MXnet (w/o cuDNN) 3670MB 350s
MXnet (w/ cuDNN) 2986MB 320s
MXnet (w/ cuDNN+Mirror) 2727MB 332s

The mirror magic slows down a little bit and gains memory saving. With this Mirror magic, a 4GB GPU can process up to 1024px image with 3855MB memory!

Some comments about improving the memory efficiency: currently in the market, the Lasagne version (with Theano) is the most memory efficient Neural Art generator (github link, thanks to alexjc) which can process 1440px images with a 4GB GPU. antinucleon, the author of MXnet, has mentioned that, gram matrix uses imperative mode while symbolic mode should save more memory by reusing it. I will update the benchmark when the symbolic version is available.

In short, MXnet can save more memory than that in the Lua version, and has some speed up with CuDNN. Considering the price difference between a Titan X (1000$) and a GTX 960 4GB (220$), MXnet is also eco-friendly.

A note about the speed comparision: Lua version uses L-BFGS for the optimal parameter search while MXnet uses SGD, which is faster but needs a little bit tune-ups for best results. To be honest, the comparision above doesn’t mean MXnet is always 2x faster.
For readers who want to know MXnet’s secret of efficient memory usage, please refer to MXnet’s design document where all dark magic happens. The link is

Till now, my dear readers can play with Neural art in MXnet. Please share your creative artwork on twitter or instagram with #mxnet and I will check out your great art!

How machine learns the artwork style?

Quantize the “style”

“Style” itself doesn’t have a clear definition, it might be “pattern” or “texture” or “method of painting” or something else. People believe it can be described by some higher order statistical variables. However, different art styles have different representations, and for a general approach of “learning the style”, it becomes very difficulty to extract these higher order variables and apply to some new images.

Fortunately, Convolution Network (CNN) has proved its power of extracting high-level abstract features in the image classification, for example, computers can tell if a cat is in the image by using CNN. For more details, please refer to Yann Lecun’s deep learning tutorial. The power of “extracting high-level abstract features” is used in Neural Art: after couple of layers of convolution operations, the image has lost its pixel-level feature, and only keeps its high-level style. In the following figure from the paper, the author has defined a 5-layer CNN, where the staring night by Van Gogh keeps some content details in the 1st, 2nd and 3rd layer, and becomes “something looks like staring night” in the 4th and 5th layer:


And the author has reached the “Aha!” moment: if we put a Van Gogh image and one more other image to the same CNN network, some clever adjustment may make the second image closer to Van Gogh, but keeps some content in the first 3 layers. It is the way to simulate Van Gogh painting style! Moreover, there is a VGG model for image classification in the market for it!

Learn style and generate a new image

Now the problem becomes an optimization problem: I want the generated picture looks like my cat (the content feature should be kept for the first 3 layers), and I want Van Gogh style (the style feature for the 4th and 5th layer), thus the solution is an intermediate result which has a similar content representation to the cat, and a similar style representation to Van Gogh. In the paper, the author uses a white noise image for generating a new image closer to the content using SGD, and the other white nose image for being closer to the style. The author has defined a magical gram matrix for describing the texture and has used this matrix to defind the loss function which is a weighted mixture of these two white noise image. Mxnet uses SGD for converging it into a image which meets both of the content and style requirement.

For exmaple, in these 200+ steps of painting pogo the cat, the generated image changes like this:


where we can see, in the first 50 epoches, the generated image looks like a simple texture overlap in between the content and the style; with more epoches, the program gradually learns the color, the pattern etc, and becomes stable around 150th epoches, and finally paints pogo the cat in Van Gogh style.

Further reading: other methods of simulating art style

Neural art is not the only method of simulating artwork style and generating new images. There are many other computer vision and graph research papers, for example:


Neural art is a nice demo for convolution network, and people can generate artwork from their own images. Let’s have fun with MXnet neural art. Please share your creative artwork on twitter or instagram and add hashtag #mxnet.

A reminder about the style: if the content image is a portrait, please find a portrait artwork for learning the style instead of a landscape one. It is the same with landscape images, always landscape to landscape. Because the landscape artwork uses different paiting techniques and it doesn’t look good on portrait images.

In the next blog, I will have detailed introduction to the convolution network for image classification, a.k.a, the dog vs the cat.

Deep learning for hackers with MXnet (1) GPU installation and MNIST


I am going to have a series of blogs about implementing deep learning models and algorithms with MXnet. The topic list covers MNIST, LSTM/RNN, image recognition, neural artstyle image generation etc. Everything here is about programing deep learning (a.k.a. deep learning for hackers), instead of theoritical tutorials, so basic knowledge of machine learning and neural network is a prerequisite. I assume readers know how neural network works and what backpropagation is. If difficulties, please review Andew Ng’s coursera class Week 4:¬†

Surely, this blog doesn’t cover¬†everything about deep learning. It is very important to understand the fundamental deep learning knowledge. For readers who want to know in-depth theoritical deep learning knowledge, please read some good tutorials, for example,

MXnet: lightweight, distributed, portable deep learning toolkit

MXnet is a deep learning toolkit written in C++11, and it comes with DMLC (Distributed (Deep) Machine Learning Common You might have known MXnet’s famous DMLC-sibling xgboost, a parallel gradient boosting decision tree which dominates most Kaggle competitions and is generally used in many projects.

MXnet is very lightweight, dynamic, portable, easy to distribute, memory efficient, and one of the coolest features is, it can run on portable devices (e.g. image recognition on your Android phone ) MXnet also has clear design plus clean C++11 code, let go star and fork it on github:

Recently MXnet has received much attention in multiple conferences and blogs for its unique features of speed and efficient memory usage. Professionals are comparing MXnet with Caffe, Torch7 and Google’s TensorFlow. These benchmarks show that MXnet is a new rising star. Go check this recent tweet from Quora’s Xavier Amatriain:

Install MXnet with GPU

MXnet natively supports multiple platforms (Linux, Mac OS X and Windows) and multiple languages (C++, Java, Python, R and Julia, plus a recent support on javascript, no joking MXnet.js). In this tutorial, we use Ubuntu 14.04 LTS and Python for example. Just a reminder that, since we use CUDA for GPU computing and CUDA hasn’t yet support ubuntu 15.10 or newer (with gcc 5.2), let’s stay with 14.04 LTS, or, at latest 15.04.

The installation can be done on physical machines with nVidia CUDA GPUs or cloud instance, for example AWS GPU instance g2.2xlarge or g2.8xlarge. The following steps mostly come from the official installation guide, with some CUDA modification.

Please note: for installing CUDA on AWS from scratch, some additional steps are needed for updating linux-image-extra-virtual and disabling nouveau, for more details, please refer to Caffe’s guide:,-CUDA-7,-cuDNN)

Install dependency

MXnet only needs minimal dependency: gcc, BLAS, and OpenCV (optional), that is it. One can install git just in case it hasn’t been installed.

sudo apt-get update
sudo apt-get install -y build-essential git libblas-dev libopencv-dev

Clone mxnet

git clone --recursive

Just another reminder that --recursive is needed: MXnet depends on DMLC common packages mshadow, ps-lite and dmlc-core, where --recursive can clone all necessary ones. Please don’t compile now, and we need to install CUDA firstly.

Install CUDA

CUDA installation here is universal for other deep learning packages. Please go to for selecting the CUDA installation for the corresponding system. For example, installing CUDA for Ubuntu 14.04 should looks like this, and deb(network) is suggested for fastest downloading from the closest Ubuntu source.


Or, here it is the command-line-only solutionÔľö

sudo dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
sudo apt-get update
sudo apt-get install cuda

If everything goes well, please check the video card status by nvidia-smi, and it should look like thisÔľö


CPU info may vary, and I am using a GTX 960 4GB (approximately 200$ now). MXnet has very efficient memory usage, and 4GB is good for most of the problems. If your video card has only 2GB, MXnet is fine with it with some small parameter tunes too.

Optional: CuDNN. Mxnet supports cuDNN too. cuDNN is nVidia deep learning toolkit which optimizes operations like convolution, poolings etc, for better speed and memory usage. Usually it can speed up MXnet by 40% to 50%. If interested, please go apply for the developer program here, and install cuDNN by the official instruction when approved,

Compile MXnet with CUDA support

MXnet needs to turn on CUDA support in the configuration. Please find from mxnet/make/, copy to mxnet/, and edit these three lines:

USE_CUDA_PATH = /usr/local/cuda
USE_BLAS = blas

where the second line is for CUDA installation path. The path usually is /usr/local/cuda or /usr/local/cuda-7.5. If readers prefer other BLAS implementations. e.g. OpenBlas or Atlas, please change USE_BLAS to openblas or atlas and add the blas path to ADD_LDFLAGS and ADD_CFLAGS.

We can compile MXnet with CUDA (-j4 for multi-thread compiling):

make -j4

One more reminder that, if one has non-CUDA video cards, for example Intel Iris or AMD R9, or there is not video card, please change USE_CUDA to 0. MXnet is dynamic for switching between CPU and GPU: instead of GPU version, one can compile multi-theading CPU version by setting USE_OPENMP = 1 or leave it to 0 so BLAS can take care of multi-threading, either way is fine with MXnet.

Install Python support

MXnet natively supports Python, one can simply do:

cd python; python install

Python 2.7 is suggested while Python 3.4 is also supported. One might need setuptools and numpy if not yet installed. I personally suggest Python from Anaconda or Miniconda

(answer some installation questions)
conda install numpy

Let’s run MNIST, a handwritten digit recognizer

Now we have a GPU-ready MXnet, let’s have the first deep learning example: MNIST. MNIST is a handwritten digit dataset with 60,000 training samples and 10,000 testing samples, where each sample is a 28X28 greyscale picture of digits, and the goal of MNIST is training a smart¬†machine learning model for recognizing hand-writing digits, for example, it recognizes the zip code that people write¬†on envelops and helps our post masters distribute the mails.

Let’s run the example:

cd mxnet/example/image-classification/
python will download MNIST dataset for the first time, please be patient while downloading. The sample code will print out the loss and precision for each step, and the final result should be approximately 97%. by default uses CPU only. MXnet has easy switch between CPU and GPU. Since we have GPU, let’s turn it on by:

python --gpus 0

That is it. --gpus 0 means using the first GPU. If one has multiple GPUs, for example 4 GPUs, one can set --gpus 0,1,2,3 for using all of them. While running with GPU, the nvidia-smi should look like thisÔľö


where one can see python is using GPU. Since MNIST is not a heavy task, with MXnet efficient GPU meomory usage, GPU usage is about 30-40% while memory usage at 67MB„Äā

Trouble shooting

When run with GPU for the first time, readers may encounter¬†something like thisÔľö

ImportError: cannot open shared object file: No such file 

It is because of the PATH of CUDA dynamic link lib, one can add this to ./bashrcÔľö

export LD_LIBRARY_PATH=/usr/local/cuda-7.5/targets/x86_64-linux/lib/:$LD_LIBRARY_PATH

Or compile it to MXnet by adding in

ADD_LDFLAGS = -I/usr/local/cuda-7.5/targets/x86_64-linux/lib/
ADD_CFLAGS =-I/usr/local/cuda-7.5/targets/x86_64-linux/lib/

MNIST code secrete revealed: design a simple MLP

In, there is an function get_mlp(). It implements a multilayer perceptron (MLP). In MXnet, a MLP needs some structure definition, like this in the code:

data = mx.symbol.Variable('data')
fc1 = mx.symbol.FullyConnected(data = data, name='fc1', num_hidden=128)
act1 = mx.symbol.Activation(data = fc1, name='relu1', act_type="relu")
fc2 = mx.symbol.FullyConnected(data = act1, name = 'fc2', num_hidden = 64)
act2 = mx.symbol.Activation(data = fc2, name='relu2', act_type="relu")
fc3 = mx.symbol.FullyConnected(data = act2, name='fc3', num_hidden=10)
mlp = mx.symbol.Softmax(data = fc3, name = 'mlp')

Let’s understand what is going on for this neural network. Samples in MNIST look like these:

  • Each same (a digit) is a 28X28 pixel grey scale image, which can be represented as a vector of 28X28=784 float value where each value is the grey scale of the pixel.
  • In MLP, each layer needs a layer structure. For example in the first layer, fc1 is a full connected layer mx.symbol.FullyConnected which takes input from data. This first layer has 128 nodes, defined as num_hidden.
  • Each layer also need an activation function Activation to connect to the next layer, in other words, transferring values from the current layer to the next layer. In this example, the Activation function for connecting layer fc1 and fc2 is ReLu, which is short for rectified linear unit or Rectifier, a function as f(x)=max(0,x). ReLU is a very commonly used activation function in deep learning, mostly because it is easy to calculate and easy to converge in gradient decent. In addition, we choose ReLU for the MNIST problem because MNIST has sparse feature where most values are 0. For more information about ReLU, please check wikipedia ) and other deep learning books or tutorials.
  • The second layer fc2 is similar to the first layer fc1: it takes the output from fc1 as input, and output to the third layer fc3.
  • fc3 works similar to the previous layer, and the only difference is that, it is an output layer, which has 10 nodes where each node outputs the probability of being one of the 10 digits, thus num_hidden=10.

With this network structure, MXnet also needs the stucture of input. Since each sample is 28X28 grey scale, MXnet takes the grey scale value vector of 28X28=784 elements, and give a python iterator get_iterator() for feeding data to the network defined above. The detailed code is in the example which is very clean, so I don’t copy-n-paste here.

The final step is running the model. If readers know scikit-learn, MXnet’s python looks very familiar, right?, net, get_iterator)

Congratulations! We can implement a MLP! It is the first step of deep learning, not hard, right?

It is Q&A time now. Some of my dear readers may ask, “Do I need to design my MNIST recognizer or some other deep learning network exactly like what you did here?” “hey, I see another function get_lenet() in the code, what is that?”

The answer to these two questions can be, most real life problems on deep learning are about designing the networks, and, no, you don’t have to design your network exactly like what I did here. Designing the neural network is art, and each problem needs a different network, and a single problem may have multiple solutions. In the MNIST example code, get_lenet() implements Yann Lecun’s convolution network LeNet for digit recognition, where each layer needs Convolution Activation and Pooling where the kernel size and filter are needed, instead of FullyConnected and ReLU. FYI: the detailed explanation of super cool convolution network (ConvNet) can be found at Yann Lecun’s tutorial: . Another good reference can be “Understanding ConvNet” by Colah. I may later on write a blog post for explaining ConvNet, since convnet is my personal favorite.

I have a fun homework for my dear readers. Let’s tune up some network parameters like the number of nodes and the activation function in get_mlp(), and see how it helps the precision and accuracy. One can also try changing num_epoch (number of iterations of learning from the data) and learning_rate (the speed of gradient decent, a.k.a the rate of learning) for better speed and precision. Please leave your comment for your network structure and precision score. Kaggle also has a MNIST competition, one can also go compete with mxnet MNIST models, please mention it is MXnet model. The portal:


Thanks for reading the¬†first blog of “Deep learning for hackers with MXnet”. The following blogs will include some examples in MXnet, which may include RNN/LSTM for generating a Shakespeare script (well, looks like Shakespeare), generative models of simulating Van Gogh for painting a cat, etc. Some of these models are available now on MXnet github Is MXnet cool? Go star and fork it on github

Yet another (not) winning solution: Kaggle Flavours of Physics for finding ŌĄ ‚Üí 3őľ

TL,DR: this blog describes feature engineering and models without implicitly/explicitly using tau invariant mass.

This blog is for describing the “Hi from CMS” team solution of the Kaggle Flavours of Physics competition.¬†It has the public score of 0.988420 and the private score of 0.989161 which has ranked at 29th. I thought I didn’t need¬†to publish this ‘not winning’ solution, but I found all¬†recent published top solutions heavily depended on invariant mass reconstruction in simulation (first place, second place), so I decided to post¬†some discussion here about more ‘physical sounds’ features and models, plus some other work.

The github link to all code is available here. The¬†simplified version was also put on Kaggle script here¬†during the competition (thanks for many forks). All code were¬†credited to the team members of¬†“Hi from CMS”: phunter, dlp78, Michael Broughton and littleboat.

Unlike the last competition of Higgs search where a single xgboost classifier was used from one year ago, this solution explored more on: physical sounds feature engineering, a specific neural network model, as well as the model ensemble method for Random Forest, xgboost, UGradient and neural network, with a (fancy) automatic model selection under the agreement test constraint.

1. Understanding the constraints: KS test, CvM test, invariant mass and SPDhits

This competition had two tests: agreement test and correlation test, which required the classifier be some robust against not-well simulated variable like SPDhits and independence to the tau invariant mass. The general ideas was not using SPDhits or tau invariant mass, but for different reasons.

SPDhits: not well simulated

SPDhits was the number of hits in the SPD calorimeter, and this variable was not well simulated in the training data because of the limited understanding of LHCb SPD calorimeter and LHC bunch intensity. In the real life analysis, SPDhits variable needed additional calibration for simulation vs collision data, and the SPDhits distribution in the control samples looked very different from that in the simulation. For passing KS test, SPDhits feature was not suggested.

A tricky but not-used feature engineering for ‘weakening SPDhits‘ was¬†binning SPDhits: a vector of feature of SPDhits on if SPDhits<10, <20, <30, … , <600 etc. Some experiment showed bins up to ¬†<150 could effectively help on up to +0.0001 AUC score without hurting the KS score, but it expanded¬†the feature spaces too much and caused some confusions to the classifier, so I decided not to use these features, although¬†it had some physics meanings that LHCb analysis with SPDhits were usually binned with SPDhits<100, SPDhits<200 etc for different track multiplicity.

In the later model selection for the final submission, a combination of models with and without SPDhits were used for an extra +0.0005 AUC score boost (discussed later in this blog). In my opinion, this ensemble was fine for physical sound because SPDhits calibration should be considered as systematic errors.

Invariant mass: it is not a good idea for physical sounds.

Tau invariant mass could be (easily) reconstructed using energy conservation and basic kinematics provided in the dataset, and may winning solutions were using it, which worried me. The detailed discussion on tau invariant mass was here (Kaggle forum link).

My thought on tau invariant mass and correlation test was that: it was not a good idea of reconstructing tau invariant mass from simulation and having this correlation test. In the real life analysis, a robust classifier for search for particles should not depend on the parent particle mass much:

  1. if this particle mass was unknown like the Higgs boson, there was an look-elsewhere effect where the invariant mass could bias the analysis;
  2. if this particle mass was known like the tau particle, the simulation with detector was not perfect thus the mass reconstruction was too good to be true.

The real life M_3body was reconstructed using likelihood methods, but repeating this method for this competition was impossible with the provided dataset. Using reconstructed tau invariant mass in the classifier was strongly biased to the simulation, and feature engineering with the tau invariant mass in simulation would probably have artificial dependence on the mass and might make the classifier useless in the real life analysis.

Surely, models with and without invariant mass could be ensembled for balancing CvM score, but I was afraid it was against the original idea of physical sounds, because the bias of invariant mass was physical, instead of systematic errors.

Using invariant mass feature was not a good¬†idea¬†in my opinion, so I had the following¬†feature engineering without invariant mass. All these following feature engineering had¬†very small CvM score¬†which meant they had little dependency on the tau invariant mass, and the model didn’t create artificial dependency on mass.

2. Physical sounds feature engineering: what are good and what are bad.

Kinematic features.

It is a 3-body decay where Ds and tau has similar invariant mass, thus tau is almost at rest in the Ds rest reference frame, so¬†the 3 children particles are almost symmetric in the Ds rest frame, thus individual particle’s kinematic feature may not be important, but some separations can be seen by pair-wise features. However, unfortunately we didn’t have phi and charge info for each particle, thus pair-wise mass reconstruction was not available, nor background veto by mu-mu mass ūüė¶

Since tau is boosted and¬†Ds and tau has similar invariant mass, tau’s IP and open angle (dira) should be small as expected, and tau has its lifetime thus it could fly for some distance. In the HEP analysis, usually IP significance was used, and fortunately it could be calculated as `flight_dist_sig` by dividing FlightDistance¬†by FlightDistanceError.

The distance from PV to the secondary vertex could be approximately calculated too by IP/(1-dira) where dira was the cosine value of tau open angle, and (1-dira) was the approximation when the open angle was very small (phi = sin(phi) = tan(phi) if phi is very small). A trick was that, dira was almost 1, thus a very small float number was added, and the feature was inv-log transformed.


Some features from Stepan Obraztsov (another CMS physicist, yeah!) were also included in my public script. They were:

  1. NEW_FD_SUMP which was FlightDistance divided by the summation of daughter particle momenta. It is considered as the total flight time in the lab reference frame.
  2. NEW5_lt which was Tau lifetime times the averaged daughter particle IP. It could be interesting to understand the effect of this feature.
  3. p_track_Chi2Dof_MAX which was the maximum of track chi2 quality. It is as in the selection features about the track quality.

Track/Vertex selection features.

The primary reason of using selection features was for distinguishing from the Ds-> eta mu nu background: eta could fly for a short distance and decay to 2 muons while tau-> 3 muons decay was immediate, which made the reconstructed Ds 3 muon vertex not as good as tau-> 3 muons. Thus, the track quality and vertex reconstruction quality (VertexChi2) should be used.

CDF, DCA¬†features were¬†other¬†selection features from some TMVA analyses, but xgboost and other models didn’t nicely picked them up and the feature importance for these individual features were minimal, which meant they needed to be dropped (aka feature selection), otherwise they would confuse xgboost-like decision tree classifiers. After dropping them out, xgboost improved¬†significantly.

However, there was some hope: this LHCb thesis about lepton flavour violation search (LINK) provided some good ideas about using these selection features, and a good one from this thesis was, max and min values of these selection features could help classification. It was understandable that, if one of these 3 tracks had bad or good quality, the pair-wise and 3 body reconstruction could be heavily effected.

An additional list of good selection features are (approximately +0.0003 AUC) :

  1. Min value of IsoBDT
  2. Max value of DCA
  3. Min value of isolation from a to f (which are pair-wise track isolation of the 3 tracks)
  4. Max value of track chi2 quality.
  5. Square summation of track chi2 quality: surprisedly good feature, maybe the tau vertex selection needed to consider multiple track quality instead of the max.

Pair-wise track IP feature

Although there was no phi nor charge information of each daughter track for pair-wise parent particle reconstruction, I had two features which improved the score and had good feature importance: The ratio of Tau IP vs p0p2 pair IP and Tau IP vs p1p2 pair IP.

It was¬†interesting¬†¬†of understanding this: the signal process was¬†tau -> Z’ mu and Z’->2 muon immediately, while the background was Ds->eta mu nu where a small missing energy nu was present, thus the pair-wise IP from the signal was surely¬†large by comparing with tau IP, but the missing energy from background confused¬†this comparison. The in-depth reason might need more information about the process. In the experiment, these two ratio values gave good separation and good feature importance.

select_p1p2_ip_ratio select_p0p2_ip_ratio

Other experimented but not used features:

  1. IP error of each particle.
  2. Pz of each particle plus combinations.
  3. Pair-wise open angle in eta.

Feature selection using trees

Decision tree classifiers didn’t¬†like confusing features, e.g. some features with¬†very small separation for the positive and negative, so feature selection was needed. I only had time for a¬†very simple feature selection by giving all these features to a random forest classifier for optimizing AUC, and dropped features with low feature importance. This tree-based selection gave consistent results as from human mind: individual momentum feature and individual quality selection features were dropped. It was¬†also interesting that, isolationb and isolationc were more less¬†important than the other 4 pair-wise track isolation variables. Keeping only¬†the other 4 track isolation variables helped AUC, the reason behind this may need some in-depth analysis.

3. The neural network model

More details to be updated here from Michael. Michael had a two-layer neural network using all these features, and the code was self-explained, please check the code.

The final model includes this neural network model with SPDhits. The model itself with SPDhits had very low¬†CvM score but disaster KS score (almost 0.3 and surely didn’t pass the KS test <0.09), but ensemble this model with other conservative models with a very low weight gave +0.0005 AUC score without failure in KS test.

4. “Weakening” decision tree models under the constraints

With good features, models have easier jobs. UGrad, Random Forest classifiers have good performance with these feature plus GridSearch for parameters, and their KS scores were around 0.06 which could easily pass. Since no invariant mass features were used, CvM score was round 0.0008 which had no problem either.

Xgboost is an aggressive and strong classifer, where it might give bad KS test score even without SPDhits features, e.g. 0.12, so weakening this model was needed for passing KS test. There were two tricks in xgboost parameters: large eta value for controlling convergence, and large colsample_bytree value plus small subsample by adding randomness, as mentioned in the xgboost document. The good combination was about:

  • eta = 0.2
  • colsample_bytree=13
  • subsample = 0.7

5. CV-based ensemble and automatic model selection

Littleboat had a cool randomized automatic model selection by the CV score. The idea was that, each model had some variance, thus the best ensemble could be summing up similar models with small parameter changes, e.g. different seeds, slightly different eta etc. The CV score was used for the ensemble: most of the time, CV scores were more reliable than the public leaderboard scores.

Each model had two versions: high score (with SPDhits) and low score (all other features, no SPDhits). The same parameters are used for both two versions of each model, and surely the high score one doesn’t pass KS test for most time.¬†After generating a list of model candidates, the script randomly pickup some models with good CV scores, searched for best weight to average them and evaluated the final CV score, ¬†as well as the KS score and CvM score.¬†The ensemble method was weighted average, and the weight was brute-force searched after selecting models for passing KS test.¬†If the result could pass the tests, it was saved as an candidate for submission.

After these selection, a manual process of weighted averaging with the neural network (with SPDhits) model was used. The weight was around 0.04 for the neural network model, which helped the final score by +0.0005 AUC. The final two submissions were weighted average of  Xgboost and UGrad, plus a small weight of neural network with SPDhits.

6. Lesson learned

  1. Kaggle competition really needs time. Thanks to my wife for her support.
  2. Team-up is great: I know some amazing data scientists from this competition and have learned much from them.
  3. Ensemble is needed, but it is not everything: feature engineering firstly, CV for each model, and ensemble them.
  4. Data leakage: every kaggle competition has this risk, and dealing with it is a good lesson for both admins and competitors. Exploiting it for winning? blaming it? having fun without using it? Any of them is OK, no one is perfect and no one can hide from statistics.
  5. When in doubt, use xgboost! Do you know xgboost can be installed by “pip install xgboost”. Submit an issue if you can’t install it.

7. One more thing for HEP: UGrad and xgboost

As learned from other high rank solution from the forum, UGrad itself was a strong classifier¬†with¬†BinFlatnessLossFunction,¬†and a simple ensemble of UGrad models could easily go 0.99+. But UGrad used¬†single thread and was very slow, so I didn’t¬†experiment more about¬†it.¬†I have submitted a wish-list issue on xgboost github repo¬†for implementing a xgboost loss function which can take¬†BinFlatnessLossFunction as in UGrad and using multi-thread xgboost for speed up, so HEP people can be have both BinFlatnessLossFunction and a fast gradient boosting classifier. It is do-able as Tianqi (author of xgboost) has described in the issue, and hope I (or someone) can find a time working on it. I hope it can help HEP research.

Advertisement time: DMLC is a great toolbox for machine learning. xgboost is part of it, and DMLC has a recent new tool for deep learning MXnet: it has nice design, it can implement LSTM in 25 lines of python code, and it can train ImageNet on 4 GTX 980 card in 8.5 days, go star it!

Winning solution of Kaggle Higgs competition: what a single model can do?

This blog is for describing the winning solution of the Kaggle Higgs competition. It has the public score of 3.75+ and the private score of 3.73+ which has ranked at 26th. This solution uses a single classifier with some feature work from basic high-school physics plus a few advanced but calculable physical features.

Github link to the source code.

1. The model

I choose XGBoost which is a parallel implementation of gradient boosting tree. It is a great piece of work: parallel, fast, efficient and tune-able parameters.

The parameter tuning-up is simple. I know GBM can have a good learning result by providing a small shrinkage eta value. According to a simply brute-force grid search for the parameters and the cross-validation score, I choose :

  • eta = 0.01 (small shrinkage)
  • max_depth = 9 (default max_depth=6 which is not deep enough)
  • sub_sample = 0.9 (giving some randomness for preventing overfitting)
  • num_rounds = 3000 (because of slow shrinkage, many trees are needed)

and leave other parameters by default. This parameter works good, however, since I only have limited knowledge of GBM, this model is not very optimal for:

  • shrinkage too slow and training time too long, so a training takes about 30 minutes and cross-validation takes longer time. On the forum, some faster parameters are published which can limit the training time to less than 5 min.
  • sub_sample parameter gives some randomness, especially when complicated non-linear features are involved, and the submission AMS score is not stable and not 100% reproducible. In the feature reduction section, I have a short discussion.

The cross validation is done as usually. A reminder about the AMS is that, from the AMS equation, one can simplify it to

{\rm AMS} = \frac{S}{\sqrt{B}}

which means, if we evenly partition the training set for K-fold cross validation where the S and B have applied the factor of (K-1)/K, the AMS score from CV is artificially lowered by approximately \sqrt{(K-1)/K}. The same with estimating the test AMS: the weight should be scaled by the number of test samples.

XGboost also supports customized loss function. People have some discussions in this thread and I have some tries. I haven’t applied this customized function for my submission.

2. The features

The key for this solution is the feature engineering. With the basic physics features, I can reach the public leaderboard score around 3.71, and the other advanced physics features push it to public 3.75 and private 3.73.

2.1 General idea for feature engineering

Since the signal (positive) samples are Higgs to tau-tau events where two taus coming from the same Higgs boson, while the background (negative) samples are tau-tau-look-like events where particles have no correlation with the Higgs boson, I have the general idea that, in the signal, these particles should have their kinematic correlations with Higgs, while the background particles don’t. This kinematic correlation can be represented as:

  • Simply as the open angles (see part 2.2)
  • Complicated as CAKE features (see this link in the Higgs forum, this feature is not developed by me nor being used in the submission/solution)

2.2 Basic physics features

Because of the general idea of finding the “correlation” in the kinematic system, the correlation between each pair of particles can be useful. The possible features are:

  • The open angles in the transverse (phi) plane and the longitudinal plan (eta angle).¬†The reminder is that, the phi angle difference must be mapped into ¬Īpi, and the absolute values of these angles work better for xgboost.
  • Some careful study on the open angle can find that, tau-MET open angles in phi direction is somehow useless. It is understandable that in tau-l-nu case the identified tau angle has no much useful correlation with nu angle.
  • Cartesian of each momentum value (px, py, pz): it works with xgboost.
  • The momentum correlation in the longitudinal direction (Z direction), for example, jet momentum in Z direction vs tau-lepton momentum in Z direction is important. This momentum in Z direction can be calculated using the pt (transverse momentum) and the eta angle.
  • The longitudinal eta open angle for the leading jet and the subleading jet: the physics reason is from the jet production mechanism from tau, but it is easy to be noticed when plotting PRI_jet_leading_eta –¬†PRI_jet_subleading_eta without physics knowledge.
  • The transverse momentum ratio of tau-lep, jet-jet, MET to the total transverse momentum.
  • The min and max PT: this idea comes from the traditional cut-based analysis where different physics channel, e.g. 2-jet VBF vs 1 jet jet suppression to lepton, there are minimal and maximal PT cut. In this approach, I give them as features instead of splitting the model, and xgboost picks them up in a nice way.

Overlapping the feature distribution for the signal and the background is a common technique for visualizing features in the experimental high energy physics, for example, the jet-lepton eta open angle distribution for the positive and negative samples can be visualized as following:

Jet-lepton open angle distributions

2.3 Advanced physics features

If one reads the CMS/ATLAS Higgs to tau-tau analysis paper, one can have some advanced features for eliminating particular background. For example, this CMS talk has covered the fundamental points of Higgs to tau-tau search.

In Higgs to tau tau search, one of the most important background is the Z particle where Z can decay into lepton pairs (l-l, tau-tau) and mimic the Higgs signal. Moreover, considering the known Higgs mass is 126 GeV which is close to Z mass 91 GeV, the tau-tau kinematics can be very similar in Z and Higgs.

The common technique for eliminating this background is reconstructing Z invariant mass peak which is around 91 GeV. The provided ‘PRI’ features only have tau momentum and lepton momentum, from which we can’t have precise reconstruction of Z particle invariant, however, this invariant mass idea gives some hint that, the pseudo transverse invariant mass can be a good feature where the transverse invariant mass distribution¬†can be :

Tau-lepton pseduo invariant mass

QCD and W+jet background are also important where lepton/tau can be mis-identified as jet, however, these two have much lower cross-section so the feature work on these two background are not very important. Tau-jet pseudo invariant mass features are useful too.

Some other not important features are:

  • For tau-tau channel, the di-tau momentum summation can be roughly estimated by tau-lep-jet combination although it is far from truth.
  • For 2-jets VBF channel, the jet-jet invariant mass should be high for signal.

Some comments about SVFIT and CAKE-A:

  • I know some teams (e.g. Lubos’ team) are¬†using SVFIT, which is the CMS counter-partner of ALTAS’ MMC method (the DER_invariant_MMC feature). SVFIT has more variables and more considerable features than MMC, so SVFIT can do better Higgs invariant mass construction. Deriving SVFIT feature using the current provided features is very hard, so I haven’t done it.
  • CAKE-A feature is basically the likelihood if a mass peak belongs to Z mass or Higgs mass. CAKE team claims that, it is a very complicated feature and some positive reports exist in the leaderboard, so ATLAS should investigate this feature for their Higgs search¬†model.

2.4 Feature selection

Gradient boosting tree method is generally sensitive to confusing samples. In this particular problem of Higgs to tau-tau, many background events can have very similar features to the signal ones, thus, a good understanding of features and feature selection can help reducing confusions for better model building.

The feature selection uses¬†some logic and/or domain knowledge, and I haven’t applied any automatic feature selection techniques, e.g. PCA. The reason is that, most auto feature selection methods are designed for capturing the max errors while this competition is for maximizing the AMS score, so I am afraid some feature selection can accidentally drop some important features.

Because of the proton-proton collision in the longitudinal direction, the transverse direction is symmetric, thus the raw phi values of these particles are mostly useless and should be removed for less confusions for the boosting process. It is also the reason why ATLAS and CMS detectors are cylinders.

The tau and lep raw eta values don’t help much too. The reason behind it¬†is the Higgs to tau tau production mechanism, but one can easily see it from the symmetric distributions of these variables.

Jets raw eta values are important. The physics reason behind it is from the jet production mechanism where the jets from the signal should be more centralized, but one can easily find it by overlapping the distributions of PRI_jet_(sub)leading_eta for the signal and the background without physics knowledge.

2.5 Discussions

More advanced features¬†or not? It is a question.¬†I only keep physically meaningful features for the final submission. I have experimented some other tricky features, e.g. the weighted transverse invariant mass with PT ratio, and some of them help scoring on the public LB. However, it doesn’t show significant improvement in my CV score.¬†To be safe, I spend the last 3 days before the deadline¬†removing these ‘tricky’ features, and keeping only the basic physics feature (linear combinations) as well as¬†these pseudo invariant mass features, in order not to overfit the public LB. After checking the private LB scores, I find some of them can help, but only a little. @Raising Spirit¬†on the forum has posted a feature which is¬†DER_mass_MMC*DER_pt_ratio_lep_tau/DER_sum_pt and Lubos has a nice comment if it is good idea or not.

CAKE features effect. I have used 3¬†test submissions for testing CAKE-A and CAKE-B feature. With CAKE A+B, both of my public and¬†private LB submission score drops around 0.01; with CAKE-A, my model score has almost no change (reduce 0.0001); with CAKE-B, my model score improves 0.01. I think it is because CAKE-A feature may have strong correlations with my current feature, while CAKE-B¬†is essentially¬†the MT2 variable in physics which can help for the hadronic (jet) final state. I haven’t include these submissions in my final scoring ones, but thanks to CAKE team for providing these features.

3. Conclusion and lessons learned

What I haven’t¬†used:

  • loss function¬†using AMS score: In these two posts (post 1 and post 2), they proposed the AMS loss function. XGboost has a good interface for these customized loss function, but I just didn’t have chance to tune up the parameters.
  • Tricky non-linear non-physical features.
  • Vector PT summations of tau-lep, lep-tau and other particle pairs, and their¬†open angles with other particles. They are physically meaningful, but my model doesn’t pick them up ūüė¶
  • Categorizing¬†the¬†jet multiplicity (PRI_jet_num). Usually this categorizing technique works better since it increase the feature dimension for better separation, but not for this time, maybe because of my model parameters.
  • Split models by PRI_jet_num. In the common Higgs analysis, the analysis is divided into different num-of-jets categories, e.g. 0-jet, 1-jet, 2-jets, because each partition can have different physical meanings in the Higgs production. XGboost has caught¬†this partition nicely with features, and it handles the missing values in a nice way.

Lesson learned

  • Automate the process: script for filing CV, script for the full workflow of training + testing, script for parameter scan, library of adding features so CV and training can have consistent features. It can save very much time.
  • Discuss with the¬†team members and check the forum.
  • Renting a cloud is easier than buying a machine.
  • I show learn ensembling classifiers for better score in future.
  • Spend some time: studying the features and the model needs some¬†time. I paused my submission for 2 months for preparing for my O1-A visa application (I got no luck in this year’s H1B lottery, so I had to write ‘a lot of a lot’ for this visa instead) and only fully resumed it about 2 weeks before deadline when my visa was approved, so my VM instance has run like crazy for feature work, model tuning and CV since then while I sleep or during daily work. Fortunately, these work (plus electricity bills to Google) has good payback on the rank.

4. Acknowledgment

I want to thank @Tianqi Chen (the author of xgboost), @yr, @Bing Xu, @LuboŇ° Motl for their nice work and the discussions with them. I also want to thank to Google Cloud Engine for providing 500$ free credit for¬†using their cloud computer, so I don’t have to buy my own computer but¬†just rent a 16-core VM.

5. Suggestions to CMS, ATLAS, ROOT and Kaggle

To CMS and ATLAS:¬†In the physics¬†analyses, ML was called ‘multi-variate analysis’ (MVA, so ROOT’s ML package is called TMVA) where features were ‘variates’. This method of selecting variates (features) was from the traditional cut-based analysis where each variate must be kind of strong physically meaningful so they were explainable, for example, in CMS’s tau-tau analysis, we had¬†2-jet VBF channel where tau-tau required jet-jet momentum was greater than 110 GeV etc. Using this cut-based feature selection idea in the MVA technique gave some limits of feature work where features were carefully selected by physicists using their experience and knowledge, which was good but very limited. In this competition, I came up some intuitive ‘magic’ features using my physics intuition, and the model + feature selection techniques in ML helped removing non-sense ones as well as finding new good features. So, my suggestion to CMS and ATLAS on the feature work is that, we should introduce more ML techniques for helping feature/variate engineering work and use machine’s power for finding more good features.

To ROOT TMVA: XGboost is a great idea for parallelizing the GBM learning. ROOT’s current TMVA is using single thread which is slow, and ROOT should have some similar idea of xgboost into the next version of TMVA.

To Kaggle: it might be a good idea of having some sponsored computing credits from some cloud computing providers and giving them to the competitors, e.g. Google Cloud and Amazon AWS. It can remove the obstacles of computing resources for competitors, and also a good advertisement for these cloud computing providers.

6. Our background

My teammate @dlp78 and I are both data scientists and physicists. We used to work on the CMS experiment. He worked on Higgs -> ZZ and Higgs-> bb discovery channel, and I worked on Higgs->gamma gamma discovery channel and some supersymmetry/exotics particle search. My PhD dissertation is search for long-lived particle decay into photons, in which the method is inspired by the tau discovery method, and my advisor is one of the scientists who discovered the tau particle (well, she also discovered many others, for example: Psi, D0, Tau, jets, Higgs). I have my Linkedin page linked on my Kaggle profile page, and I would like to link to you great Kaggle competitors.

The best photography spots in San Francisco: what can data tell you?

In yesterday morning time, I was reading Danny Dong‘s wedding photos and amazed by his captures in the Golden Gate Bridge, City Hall in San Francisco. And I got this question: what the best photography spots are in San Francisco, besides these two places? If we know this, photographers do not waste time of searching for a good spot. To answer this question, I needed some data.

Fortunately, in SF data, I found this data set of Film Locations in San Francisco. There were about 200 movies filmed in San Francisco, from 1950s to present. In my mind, if professionals took photos at some spots, these spots were great photography spots.

Just the addresses themselves were not enough, I wanted to put them on to the map. Google map has the Geo service which can translate one address into a latitude/longitude location, for example:

Metro (1997) filmed at Bush Street at Jones, can be translated into “Bush Street & Jones Street, San Francisco, CA 94109, USA” and the location is 37.7895596 -122.4137276

With python, this is easy. I installed geopy (in pip), and used this simple piece of code:

from geopy import geocoders
g = geocoders.GoogleV3()


location+=', San Francisco, CA'
#I changed the source code of geopy and it returned multiple locations if ambiguity.
place, (lat, lng) = g.geocode(location)[0]

We can use google fusion table to draw these locations on the map. It looks like this:

Using the filming location from 200+ famous movies filmed in SF, we can extract the best photography locations from these movie professionals.

Using the filming location from 200+ famous movies filmed in SF, we can extract the best photography locations from these movie professionals.

One can click this link for the interactive map and go to the location (with streetview!):

However, SF is so beautiful that too many locations to read. A heat map works better. I used myheatmap and generated this hot spot map for photography locations. It looks like that the pier area is super hot: quite consistent with my opinion.

Using the filming location from 200+ famous movies filmed in SF, we can extract the best photography locations from these movie professionals.

Using the filming location from 200+ famous movies filmed in SF, we can extract the best photography locations from these movie professionals.

We can click this link and find out the details¬†¬†(one may need to change the ‘decay’ to a smaller value on the right top of the interactive map).

In short, we now know what the best photography spots are, from data. Enjoy photography in San Francisco.

A data scientist’s way to solve the 8809=6 problem

Last time, one of my colleagues posted this question. It is related to feature extraction of machine learning.

The question starts simple:

8809=6 7111=0 2172=0 6666=4 1111=0 3213=0 7662=2 9313=1 0000=4 2222=0 3333=0 5555=0 8193=3 8096=5 7777=0 9999=4 7756=1 6855=3 9881=5 5531=0
So, 2581=?

The answer itself was easy: number of circles. For these kind of numeric puzzles, I think there should be a general solution that solves these puzzles automatically.

By checking the pattern of the equation list, one can guess that, the answer can be the summation of these patterns, so one can list the digits by its frequency, and this problem becomes a linear regression for a 10-dimension (x0-x9) space.

For example, 8809=6 gives 100000021 = 6 by listing the frequency of each digits, where 1,2,1 are parameters for ‘0’, ‘8’ and ‘9’, which means the weight summation of each digits is 6. The regression function can be dot product for a linear function and tells us why it is 6 for 8+8+0+9.

Thus, one can solve it by linear regression. The regression function can be:

def __residual(params, xdata, ydata):#guess the function is cosine dot
    return (ydata -, params))

And one can use numpy’s least square for the linear regression:

leastsq(__residual, x0, args=(xdata, ydata))

The full source code in python can be found at

Some discussions:

1. About number ‘4’

The output:

(array([ 1.00000000e+00, 4.75790411e-24, 4.75709632e-24, 4.75588463e-24, 1.00000000e+00, 4.75951970e-24, 1.00000000e+00, 4.75790411e-24, 2.00000000e+00, 1.00000000e+00]), 2)

where each value in the array means the correlations between each digit and its’ weight. One can discover that, ‘0’,’4′,’6′,’9′ have weight of 1, and ‘8’ has the weight of 2.

However, there was a bug in the code: the initial value:

x0=numpy.array([1,1,1,1,1,1,1,1,1,1])#initial guess

The initial guess was weight 1 for each digit. After checking with the input, one can find ‘4’ does not exist, so one should take off ‘4’ as weight 1.

2. Why so complicated?

Why it is associated with the topic of ‘feature extraction’ in machine learning?

If one is smart enough, one can tell that, the pattern is actually the number of circles in each digit, so ‘8’ has two circles and its weight is 2.

However, no one can be smart at all time. Some patterns are hidden. The procedure in the example is kind of latent feature extraction: one does not have to know the pattern is about the number of circles in each digit, one can still get that ‘8’ has weight of 2 by using this automatic code.

Some interview questions for data scientists

Last time, one of my friends asked me for some interview questions to test the candidates of data scientist jobs. I think it is good to share the questions. Later on, I may (if I got some free time) post some detailed solutions and discussions on them.

1-D Facebook (difficulty: medium, problem solving)

We are living in a 3-D world, with x, y and z coordinates. In an 1D world, there is only the x coordinate, and people can only see left and right. There is a startup social network company, we can call it ‚Äė‚Äô , and it wants to build ‚Äėfind your friends‚Äô program for 1-D. In 1-D world, people has no size but a dot (no worry about diet ūüôā ) It has a list of 1M people‚Äôs location info, represented as an 1-D float array of 1M length, unsorted . Now, giving a people at array position index N, please find the closest two (left and right) friends.This position array can be very long (but fit in the memory), and unsorted, so sorting and search can be OK, but not yet the optimal solution. preprocess is OK, no space limit.

Reference: this problem was from some real-life cases, for example, it is the final step of amazon‚Äôs collaborative filter about ‚Äėpeople bought this also bought‚Äô after calculated all probability combinations (here this problem uses distance, and amazon using division). Another example is the spelling corrector which needs to find the closest-spelling words from a big dictionary where the distance is defined by edit-distance. A real spelling corrector is much more tricky on the data structure. Here I just simplified the edit-distance to position difference. PS: google‚Äôs spelling correction using bayesian models instead of edit-distance.

2-D Facebook (difficulty: hard, problem solving + algorithm)

Since you have solved 1-D Facebook, a 2-D world Facebook company, ‚Äė‚Äô, wants to hire you and make the same program for 2-D people, where each people, with position x and y, can see left/right and up-down, and then find 4 closest friends. Surely, it does not have to be up-friend, down-friend etc, any direction is OK.

Reference: this is a real Facebook problem (not interview, no time limit), called ‚Äėsmall world‚Äô from Facebook Puzzle (this webpage is taken down by Facebook).

DVD Auto-classification (difficulty: medium, problem solving + machine learning)

A DVD-rent company wants to beat Netflix, so they want to build a smart machine learning algorithm to auto-classify DVD movies rather than manually labeling all movies. Fortunately, they only host romantic movies and action movies, so things are easier than those in Netflix. They observed one thing that, in romantic movies, people kiss a lot; in action movies, people fight a lot, but romantic movies can have fights too. Can you use this information to build a classifier which can tell if a new movie is action or romantic?

Reference: From the book ‘Machine Learning in action’.¬†It is also from a real-life problem from my current project, but I simplified it to numerical features.

Super long feature similarity (difficulty: medium-hard, programming+machine learning)

Some machine learning models produced list of features for two soft drinks, for example, value of the content of sugar etc. One wants to compare the similarity of these two drinks using machine learning, how? (Interviewee should answer cosine similarity or dot-product or some other distance functions to compare two feature vectors).

Let’s take cosine similarity for example. Now, the real situation is that, there are millions of features from machine learning models, and some drinks may miss many features, in the other words, the feature is very sparse. So, if we want to compare two drinks with sparse features, where one drink can have many features that the other drink does not have. Do we really need to multiple each feature for these many zero values?

Calculate square-root of integer N. (difficulty: medium-hard. Numerical methods and programming)

This question can have some variations:

  • (easy) How to tell if an integer N is a perfect square number (N=k*k where k is an integer).
  • (medium) Given a very large integer N, and the number m where m*m<=N and (m+1)*(m+1)>N.
  • (hard, needs hint) How to determine a number is a Fibonacci number? The hint should be given by the interviewer: a Fibonacci number can be represented either in 5*N**2+4 or 5*N**2 -4, so simply to test if this number plus/minus 4 divided by 5 is a perfect square number.
  • (medium-hard) How to determine a number is summation of two perfect square numbers?

What if N is very large, and one can not build a table of square numbers?

Essay-copying (difficulty: medium-hard, NLP, machine learning, modeling)

In the final test of the university, the professor received 200 essays from students, about 1000 words each. Badly, he found some students were copying other people‚Äôs essays. But these students were smart: they did not copy the entire essay, maybe change words in some sentences, may copy from 2-3 other persons (but surely, they do not copy from all the other 200 students, no enough time ūüôā ). Please build a machine learning system to help professor find these bad students.

Reference: clustering using nature language processing is very important in the real life. This is an example.


Check if a number is a Fibonacci number: speed comparison for two Quora answers in python

It is interesting to read this Quora thread
What is the most efficient algorithm to check if a number is a Fibonacci Number? The top two answers are both very good and efficient.

Background, what is Fibonacci number? wikipedia

Anders gave a nice matrix exponentiation algorithm in Haskell while John posted his Java solution using square numbers. There was an argument which one was faster, so I made this comparison using python.

First of all, John needed a faster algorithm to determine if a number is a perfect square number. He used a loop from 2 to sqrt(N), and it was surely not an efficient way.

A better way to do it is the newton’s method of integer square number (wikipedia), in short, it equals using newton’s method to solve x*x-n = 0. In python, it can be easily solved by this (original code from this link):

def isqrt(x):
    if x < 0:
        raise ValueError('square root not defined for negative numbers')
    n = int(x)
    if n == 0:
        return 0
    a, b = divmod(n.bit_length(), 2)
    x = 2**(a+b)
    while True:
        y = (x + n//x)//2
        if y >= x:
            return x
        x = y

And then, John’s method is basically testing if 5*N*N+4 or 5*N*N-4 is a perfect square number. If the answer is yes for either one, this number is a Fibonacci number.

Actually the square root algorithm can have more optimizations using the 64 bit magic number ¬†0x5fe6eb50c7b537a9¬†(the DOOM trick), please check wikipedia for more interesting details. To be platform independent, here I just used the original newton’s method.

Secondly, Anders code was in Haskell, so I rewrote them into Python for a fair comparison.

def fibPlus((a,b),(c,d)):
  bd = b*d
	return (bd-(b-a)*(d-c), a*c+bd)

def unFib((a,b),n):
	if n<a:
		return (0,0,1)
		(k,c,d) = unFib(fibPlus((a,b),(a,b)),n)
		(e,f) = fibPlus((a,b),(c,d))
		if n<e: return (2*k, c, d)
			return (2*k+1,e,f)

def isfib(n):
	(k,a,b) = unFib((1,1),n)
	return n==a

The full source code can be found on my github

To test these two algorithm, I downloaded the first 500 Fibonacci numbers from and ran 100 times for each algorithm on this list of number. The result is interesting: python optimization makes difference. Unit is in second.

If run in python 2.7, John’s method won for 10% time:

python Anders method: 1.52931690216 John method: 1.36000704765

If run in pypy 1.9, Anders method is highly optimized for 2x speed:

pypy Anders method: 0.799499988556 John method: 2.0126721859

To conclude, both of these two algorithms are very good.

One another question to follow this:

Given a number N, if N is not a Fibonacci number, print out the largest Fibonacci smaller than N.

Hint: John’s method.