Yet another (not) winning solution: Kaggle Flavours of Physics for finding τ → 3μ

TL,DR: this blog describes feature engineering and models without implicitly/explicitly using tau invariant mass.

This blog is for describing the “Hi from CMS” team solution of the Kaggle Flavours of Physics competition. It has the public score of 0.988420 and the private score of 0.989161 which has ranked at 29th. I thought I didn’t need to publish this ‘not winning’ solution, but I found all recent published top solutions heavily depended on invariant mass reconstruction in simulation (first place, second place), so I decided to post some discussion here about more ‘physical sounds’ features and models, plus some other work.

The github link to all code is available here. The simplified version was also put on Kaggle script here during the competition (thanks for many forks). All code were credited to the team members of “Hi from CMS”: phunter, dlp78, Michael Broughton and littleboat.

Unlike the last competition of Higgs search where a single xgboost classifier was used from one year ago, this solution explored more on: physical sounds feature engineering, a specific neural network model, as well as the model ensemble method for Random Forest, xgboost, UGradient and neural network, with a (fancy) automatic model selection under the agreement test constraint.

1. Understanding the constraints: KS test, CvM test, invariant mass and SPDhits

This competition had two tests: agreement test and correlation test, which required the classifier be some robust against not-well simulated variable like SPDhits and independence to the tau invariant mass. The general ideas was not using SPDhits or tau invariant mass, but for different reasons.

SPDhits: not well simulated

SPDhits was the number of hits in the SPD calorimeter, and this variable was not well simulated in the training data because of the limited understanding of LHCb SPD calorimeter and LHC bunch intensity. In the real life analysis, SPDhits variable needed additional calibration for simulation vs collision data, and the SPDhits distribution in the control samples looked very different from that in the simulation. For passing KS test, SPDhits feature was not suggested.

A tricky but not-used feature engineering for ‘weakening SPDhits‘ was binning SPDhits: a vector of feature of SPDhits on if SPDhits<10, <20, <30, … , <600 etc. Some experiment showed bins up to  <150 could effectively help on up to +0.0001 AUC score without hurting the KS score, but it expanded the feature spaces too much and caused some confusions to the classifier, so I decided not to use these features, although it had some physics meanings that LHCb analysis with SPDhits were usually binned with SPDhits<100, SPDhits<200 etc for different track multiplicity.

In the later model selection for the final submission, a combination of models with and without SPDhits were used for an extra +0.0005 AUC score boost (discussed later in this blog). In my opinion, this ensemble was fine for physical sound because SPDhits calibration should be considered as systematic errors.

Invariant mass: it is not a good idea for physical sounds.

Tau invariant mass could be (easily) reconstructed using energy conservation and basic kinematics provided in the dataset, and may winning solutions were using it, which worried me. The detailed discussion on tau invariant mass was here (Kaggle forum link).

My thought on tau invariant mass and correlation test was that: it was not a good idea of reconstructing tau invariant mass from simulation and having this correlation test. In the real life analysis, a robust classifier for search for particles should not depend on the parent particle mass much:

  1. if this particle mass was unknown like the Higgs boson, there was an look-elsewhere effect where the invariant mass could bias the analysis;
  2. if this particle mass was known like the tau particle, the simulation with detector was not perfect thus the mass reconstruction was too good to be true.

The real life M_3body was reconstructed using likelihood methods, but repeating this method for this competition was impossible with the provided dataset. Using reconstructed tau invariant mass in the classifier was strongly biased to the simulation, and feature engineering with the tau invariant mass in simulation would probably have artificial dependence on the mass and might make the classifier useless in the real life analysis.

Surely, models with and without invariant mass could be ensembled for balancing CvM score, but I was afraid it was against the original idea of physical sounds, because the bias of invariant mass was physical, instead of systematic errors.

Using invariant mass feature was not a good idea in my opinion, so I had the following feature engineering without invariant mass. All these following feature engineering had very small CvM score which meant they had little dependency on the tau invariant mass, and the model didn’t create artificial dependency on mass.

2. Physical sounds feature engineering: what are good and what are bad.

Kinematic features.

It is a 3-body decay where Ds and tau has similar invariant mass, thus tau is almost at rest in the Ds rest reference frame, so the 3 children particles are almost symmetric in the Ds rest frame, thus individual particle’s kinematic feature may not be important, but some separations can be seen by pair-wise features. However, unfortunately we didn’t have phi and charge info for each particle, thus pair-wise mass reconstruction was not available, nor background veto by mu-mu mass 😦

Since tau is boosted and Ds and tau has similar invariant mass, tau’s IP and open angle (dira) should be small as expected, and tau has its lifetime thus it could fly for some distance. In the HEP analysis, usually IP significance was used, and fortunately it could be calculated as `flight_dist_sig` by dividing FlightDistance by FlightDistanceError.

The distance from PV to the secondary vertex could be approximately calculated too by IP/(1-dira) where dira was the cosine value of tau open angle, and (1-dira) was the approximation when the open angle was very small (phi = sin(phi) = tan(phi) if phi is very small). A trick was that, dira was almost 1, thus a very small float number was added, and the feature was inv-log transformed.


Some features from Stepan Obraztsov (another CMS physicist, yeah!) were also included in my public script. They were:

  1. NEW_FD_SUMP which was FlightDistance divided by the summation of daughter particle momenta. It is considered as the total flight time in the lab reference frame.
  2. NEW5_lt which was Tau lifetime times the averaged daughter particle IP. It could be interesting to understand the effect of this feature.
  3. p_track_Chi2Dof_MAX which was the maximum of track chi2 quality. It is as in the selection features about the track quality.

Track/Vertex selection features.

The primary reason of using selection features was for distinguishing from the Ds-> eta mu nu background: eta could fly for a short distance and decay to 2 muons while tau-> 3 muons decay was immediate, which made the reconstructed Ds 3 muon vertex not as good as tau-> 3 muons. Thus, the track quality and vertex reconstruction quality (VertexChi2) should be used.

CDF, DCA features were other selection features from some TMVA analyses, but xgboost and other models didn’t nicely picked them up and the feature importance for these individual features were minimal, which meant they needed to be dropped (aka feature selection), otherwise they would confuse xgboost-like decision tree classifiers. After dropping them out, xgboost improved significantly.

However, there was some hope: this LHCb thesis about lepton flavour violation search (LINK) provided some good ideas about using these selection features, and a good one from this thesis was, max and min values of these selection features could help classification. It was understandable that, if one of these 3 tracks had bad or good quality, the pair-wise and 3 body reconstruction could be heavily effected.

An additional list of good selection features are (approximately +0.0003 AUC) :

  1. Min value of IsoBDT
  2. Max value of DCA
  3. Min value of isolation from a to f (which are pair-wise track isolation of the 3 tracks)
  4. Max value of track chi2 quality.
  5. Square summation of track chi2 quality: surprisedly good feature, maybe the tau vertex selection needed to consider multiple track quality instead of the max.

Pair-wise track IP feature

Although there was no phi nor charge information of each daughter track for pair-wise parent particle reconstruction, I had two features which improved the score and had good feature importance: The ratio of Tau IP vs p0p2 pair IP and Tau IP vs p1p2 pair IP.

It was interesting  of understanding this: the signal process was tau -> Z’ mu and Z’->2 muon immediately, while the background was Ds->eta mu nu where a small missing energy nu was present, thus the pair-wise IP from the signal was surely large by comparing with tau IP, but the missing energy from background confused this comparison. The in-depth reason might need more information about the process. In the experiment, these two ratio values gave good separation and good feature importance.

select_p1p2_ip_ratio select_p0p2_ip_ratio

Other experimented but not used features:

  1. IP error of each particle.
  2. Pz of each particle plus combinations.
  3. Pair-wise open angle in eta.

Feature selection using trees

Decision tree classifiers didn’t like confusing features, e.g. some features with very small separation for the positive and negative, so feature selection was needed. I only had time for a very simple feature selection by giving all these features to a random forest classifier for optimizing AUC, and dropped features with low feature importance. This tree-based selection gave consistent results as from human mind: individual momentum feature and individual quality selection features were dropped. It was also interesting that, isolationb and isolationc were more less important than the other 4 pair-wise track isolation variables. Keeping only the other 4 track isolation variables helped AUC, the reason behind this may need some in-depth analysis.

3. The neural network model

More details to be updated here from Michael. Michael had a two-layer neural network using all these features, and the code was self-explained, please check the code.

The final model includes this neural network model with SPDhits. The model itself with SPDhits had very low CvM score but disaster KS score (almost 0.3 and surely didn’t pass the KS test <0.09), but ensemble this model with other conservative models with a very low weight gave +0.0005 AUC score without failure in KS test.

4. “Weakening” decision tree models under the constraints

With good features, models have easier jobs. UGrad, Random Forest classifiers have good performance with these feature plus GridSearch for parameters, and their KS scores were around 0.06 which could easily pass. Since no invariant mass features were used, CvM score was round 0.0008 which had no problem either.

Xgboost is an aggressive and strong classifer, where it might give bad KS test score even without SPDhits features, e.g. 0.12, so weakening this model was needed for passing KS test. There were two tricks in xgboost parameters: large eta value for controlling convergence, and large colsample_bytree value plus small subsample by adding randomness, as mentioned in the xgboost document. The good combination was about:

  • eta = 0.2
  • colsample_bytree=13
  • subsample = 0.7

5. CV-based ensemble and automatic model selection

Littleboat had a cool randomized automatic model selection by the CV score. The idea was that, each model had some variance, thus the best ensemble could be summing up similar models with small parameter changes, e.g. different seeds, slightly different eta etc. The CV score was used for the ensemble: most of the time, CV scores were more reliable than the public leaderboard scores.

Each model had two versions: high score (with SPDhits) and low score (all other features, no SPDhits). The same parameters are used for both two versions of each model, and surely the high score one doesn’t pass KS test for most time. After generating a list of model candidates, the script randomly pickup some models with good CV scores, searched for best weight to average them and evaluated the final CV score,  as well as the KS score and CvM score. The ensemble method was weighted average, and the weight was brute-force searched after selecting models for passing KS test. If the result could pass the tests, it was saved as an candidate for submission.

After these selection, a manual process of weighted averaging with the neural network (with SPDhits) model was used. The weight was around 0.04 for the neural network model, which helped the final score by +0.0005 AUC. The final two submissions were weighted average of  Xgboost and UGrad, plus a small weight of neural network with SPDhits.

6. Lesson learned

  1. Kaggle competition really needs time. Thanks to my wife for her support.
  2. Team-up is great: I know some amazing data scientists from this competition and have learned much from them.
  3. Ensemble is needed, but it is not everything: feature engineering firstly, CV for each model, and ensemble them.
  4. Data leakage: every kaggle competition has this risk, and dealing with it is a good lesson for both admins and competitors. Exploiting it for winning? blaming it? having fun without using it? Any of them is OK, no one is perfect and no one can hide from statistics.
  5. When in doubt, use xgboost! Do you know xgboost can be installed by “pip install xgboost”. Submit an issue if you can’t install it.

7. One more thing for HEP: UGrad and xgboost

As learned from other high rank solution from the forum, UGrad itself was a strong classifier with BinFlatnessLossFunction, and a simple ensemble of UGrad models could easily go 0.99+. But UGrad used single thread and was very slow, so I didn’t experiment more about it. I have submitted a wish-list issue on xgboost github repo for implementing a xgboost loss function which can take BinFlatnessLossFunction as in UGrad and using multi-thread xgboost for speed up, so HEP people can be have both BinFlatnessLossFunction and a fast gradient boosting classifier. It is do-able as Tianqi (author of xgboost) has described in the issue, and hope I (or someone) can find a time working on it. I hope it can help HEP research.

Advertisement time: DMLC is a great toolbox for machine learning. xgboost is part of it, and DMLC has a recent new tool for deep learning MXnet: it has nice design, it can implement LSTM in 25 lines of python code, and it can train ImageNet on 4 GTX 980 card in 8.5 days, go star it!