This blog is for describing the winning solution of the Kaggle Higgs competition. It has the public score of 3.75+ and the private score of 3.73+ which has ranked at 26th. This solution uses a single classifier with some feature work from basic high-school physics plus a few advanced but calculable physical features.
Github link to the source code.
1. The model
I choose XGBoost which is a parallel implementation of gradient boosting tree. It is a great piece of work: parallel, fast, efficient and tune-able parameters.
The parameter tuning-up is simple. I know GBM can have a good learning result by providing a small shrinkage eta value. According to a simply brute-force grid search for the parameters and the cross-validation score, I choose :
- eta = 0.01 (small shrinkage)
- max_depth = 9 (default max_depth=6 which is not deep enough)
- sub_sample = 0.9 (giving some randomness for preventing overfitting)
- num_rounds = 3000 (because of slow shrinkage, many trees are needed)
and leave other parameters by default. This parameter works good, however, since I only have limited knowledge of GBM, this model is not very optimal for:
- shrinkage too slow and training time too long, so a training takes about 30 minutes and cross-validation takes longer time. On the forum, some faster parameters are published which can limit the training time to less than 5 min.
- sub_sample parameter gives some randomness, especially when complicated non-linear features are involved, and the submission AMS score is not stable and not 100% reproducible. In the feature reduction section, I have a short discussion.
The cross validation is done as usually. A reminder about the AMS is that, from the AMS equation, one can simplify it to
which means, if we evenly partition the training set for K-fold cross validation where the S and B have applied the factor of , the AMS score from CV is artificially lowered by approximately . The same with estimating the test AMS: the weight should be scaled by the number of test samples.
XGboost also supports customized loss function. People have some discussions in this thread and I have some tries. I haven’t applied this customized function for my submission.
2. The features
The key for this solution is the feature engineering. With the basic physics features, I can reach the public leaderboard score around 3.71, and the other advanced physics features push it to public 3.75 and private 3.73.
2.1 General idea for feature engineering
Since the signal (positive) samples are Higgs to tau-tau events where two taus coming from the same Higgs boson, while the background (negative) samples are tau-tau-look-like events where particles have no correlation with the Higgs boson, I have the general idea that, in the signal, these particles should have their kinematic correlations with Higgs, while the background particles don’t. This kinematic correlation can be represented as:
- Simply as the open angles (see part 2.2)
- Complicated as CAKE features (see this link in the Higgs forum, this feature is not developed by me nor being used in the submission/solution)
2.2 Basic physics features
Because of the general idea of finding the “correlation” in the kinematic system, the correlation between each pair of particles can be useful. The possible features are:
- The open angles in the transverse (phi) plane and the longitudinal plan (eta angle). The reminder is that, the phi angle difference must be mapped into ±pi, and the absolute values of these angles work better for xgboost.
- Some careful study on the open angle can find that, tau-MET open angles in phi direction is somehow useless. It is understandable that in tau-l-nu case the identified tau angle has no much useful correlation with nu angle.
- Cartesian of each momentum value (px, py, pz): it works with xgboost.
- The momentum correlation in the longitudinal direction (Z direction), for example, jet momentum in Z direction vs tau-lepton momentum in Z direction is important. This momentum in Z direction can be calculated using the pt (transverse momentum) and the eta angle.
- The longitudinal eta open angle for the leading jet and the subleading jet: the physics reason is from the jet production mechanism from tau, but it is easy to be noticed when plotting PRI_jet_leading_eta – PRI_jet_subleading_eta without physics knowledge.
- The transverse momentum ratio of tau-lep, jet-jet, MET to the total transverse momentum.
- The min and max PT: this idea comes from the traditional cut-based analysis where different physics channel, e.g. 2-jet VBF vs 1 jet jet suppression to lepton, there are minimal and maximal PT cut. In this approach, I give them as features instead of splitting the model, and xgboost picks them up in a nice way.
Overlapping the feature distribution for the signal and the background is a common technique for visualizing features in the experimental high energy physics, for example, the jet-lepton eta open angle distribution for the positive and negative samples can be visualized as following:
2.3 Advanced physics features
If one reads the CMS/ATLAS Higgs to tau-tau analysis paper, one can have some advanced features for eliminating particular background. For example, this CMS talk has covered the fundamental points of Higgs to tau-tau search.
In Higgs to tau tau search, one of the most important background is the Z particle where Z can decay into lepton pairs (l-l, tau-tau) and mimic the Higgs signal. Moreover, considering the known Higgs mass is 126 GeV which is close to Z mass 91 GeV, the tau-tau kinematics can be very similar in Z and Higgs.
The common technique for eliminating this background is reconstructing Z invariant mass peak which is around 91 GeV. The provided ‘PRI’ features only have tau momentum and lepton momentum, from which we can’t have precise reconstruction of Z particle invariant, however, this invariant mass idea gives some hint that, the pseudo transverse invariant mass can be a good feature where the transverse invariant mass distribution can be :
QCD and W+jet background are also important where lepton/tau can be mis-identified as jet, however, these two have much lower cross-section so the feature work on these two background are not very important. Tau-jet pseudo invariant mass features are useful too.
Some other not important features are:
- For tau-tau channel, the di-tau momentum summation can be roughly estimated by tau-lep-jet combination although it is far from truth.
- For 2-jets VBF channel, the jet-jet invariant mass should be high for signal.
Some comments about SVFIT and CAKE-A:
- I know some teams (e.g. Lubos’ team) are using SVFIT, which is the CMS counter-partner of ALTAS’ MMC method (the DER_invariant_MMC feature). SVFIT has more variables and more considerable features than MMC, so SVFIT can do better Higgs invariant mass construction. Deriving SVFIT feature using the current provided features is very hard, so I haven’t done it.
- CAKE-A feature is basically the likelihood if a mass peak belongs to Z mass or Higgs mass. CAKE team claims that, it is a very complicated feature and some positive reports exist in the leaderboard, so ATLAS should investigate this feature for their Higgs search model.
2.4 Feature selection
Gradient boosting tree method is generally sensitive to confusing samples. In this particular problem of Higgs to tau-tau, many background events can have very similar features to the signal ones, thus, a good understanding of features and feature selection can help reducing confusions for better model building.
The feature selection uses some logic and/or domain knowledge, and I haven’t applied any automatic feature selection techniques, e.g. PCA. The reason is that, most auto feature selection methods are designed for capturing the max errors while this competition is for maximizing the AMS score, so I am afraid some feature selection can accidentally drop some important features.
Because of the proton-proton collision in the longitudinal direction, the transverse direction is symmetric, thus the raw phi values of these particles are mostly useless and should be removed for less confusions for the boosting process. It is also the reason why ATLAS and CMS detectors are cylinders.
- Some one on the forum asks why the Phi distribution shows some periodic ‘waves’ and questions if it is a useful feature. It is actually from ATLAS periodical calorimeter structure.
The tau and lep raw eta values don’t help much too. The reason behind it is the Higgs to tau tau production mechanism, but one can easily see it from the symmetric distributions of these variables.
Jets raw eta values are important. The physics reason behind it is from the jet production mechanism where the jets from the signal should be more centralized, but one can easily find it by overlapping the distributions of PRI_jet_(sub)leading_eta for the signal and the background without physics knowledge.
More advanced features or not? It is a question. I only keep physically meaningful features for the final submission. I have experimented some other tricky features, e.g. the weighted transverse invariant mass with PT ratio, and some of them help scoring on the public LB. However, it doesn’t show significant improvement in my CV score. To be safe, I spend the last 3 days before the deadline removing these ‘tricky’ features, and keeping only the basic physics feature (linear combinations) as well as these pseudo invariant mass features, in order not to overfit the public LB. After checking the private LB scores, I find some of them can help, but only a little. @Raising Spirit on the forum has posted a feature which is DER_mass_MMC*DER_pt_ratio_lep_tau/DER_sum_pt and Lubos has a nice comment if it is good idea or not.
CAKE features effect. I have used 3 test submissions for testing CAKE-A and CAKE-B feature. With CAKE A+B, both of my public and private LB submission score drops around 0.01; with CAKE-A, my model score has almost no change (reduce 0.0001); with CAKE-B, my model score improves 0.01. I think it is because CAKE-A feature may have strong correlations with my current feature, while CAKE-B is essentially the MT2 variable in physics which can help for the hadronic (jet) final state. I haven’t include these submissions in my final scoring ones, but thanks to CAKE team for providing these features.
3. Conclusion and lessons learned
What I haven’t used:
- loss function using AMS score: In these two posts (post 1 and post 2), they proposed the AMS loss function. XGboost has a good interface for these customized loss function, but I just didn’t have chance to tune up the parameters.
- Tricky non-linear non-physical features.
- Vector PT summations of tau-lep, lep-tau and other particle pairs, and their open angles with other particles. They are physically meaningful, but my model doesn’t pick them up 😦
- Categorizing the jet multiplicity (PRI_jet_num). Usually this categorizing technique works better since it increase the feature dimension for better separation, but not for this time, maybe because of my model parameters.
- Split models by PRI_jet_num. In the common Higgs analysis, the analysis is divided into different num-of-jets categories, e.g. 0-jet, 1-jet, 2-jets, because each partition can have different physical meanings in the Higgs production. XGboost has caught this partition nicely with features, and it handles the missing values in a nice way.
- Automate the process: script for filing CV, script for the full workflow of training + testing, script for parameter scan, library of adding features so CV and training can have consistent features. It can save very much time.
- Discuss with the team members and check the forum.
- Renting a cloud is easier than buying a machine.
- I show learn ensembling classifiers for better score in future.
- Spend some time: studying the features and the model needs some time. I paused my submission for 2 months for preparing for my O1-A visa application (I got no luck in this year’s H1B lottery, so I had to write ‘a lot of a lot’ for this visa instead) and only fully resumed it about 2 weeks before deadline when my visa was approved, so my VM instance has run like crazy for feature work, model tuning and CV since then while I sleep or during daily work. Fortunately, these work (plus electricity bills to Google) has good payback on the rank.
I want to thank @Tianqi Chen (the author of xgboost), @yr, @Bing Xu, @Luboš Motl for their nice work and the discussions with them. I also want to thank to Google Cloud Engine for providing 500$ free credit for using their cloud computer, so I don’t have to buy my own computer but just rent a 16-core VM.
5. Suggestions to CMS, ATLAS, ROOT and Kaggle
To CMS and ATLAS: In the physics analyses, ML was called ‘multi-variate analysis’ (MVA, so ROOT’s ML package is called TMVA) where features were ‘variates’. This method of selecting variates (features) was from the traditional cut-based analysis where each variate must be kind of strong physically meaningful so they were explainable, for example, in CMS’s tau-tau analysis, we had 2-jet VBF channel where tau-tau required jet-jet momentum was greater than 110 GeV etc. Using this cut-based feature selection idea in the MVA technique gave some limits of feature work where features were carefully selected by physicists using their experience and knowledge, which was good but very limited. In this competition, I came up some intuitive ‘magic’ features using my physics intuition, and the model + feature selection techniques in ML helped removing non-sense ones as well as finding new good features. So, my suggestion to CMS and ATLAS on the feature work is that, we should introduce more ML techniques for helping feature/variate engineering work and use machine’s power for finding more good features.
To ROOT TMVA: XGboost is a great idea for parallelizing the GBM learning. ROOT’s current TMVA is using single thread which is slow, and ROOT should have some similar idea of xgboost into the next version of TMVA.
To Kaggle: it might be a good idea of having some sponsored computing credits from some cloud computing providers and giving them to the competitors, e.g. Google Cloud and Amazon AWS. It can remove the obstacles of computing resources for competitors, and also a good advertisement for these cloud computing providers.
6. Our background
My teammate @dlp78 and I are both data scientists and physicists. We used to work on the CMS experiment. He worked on Higgs -> ZZ and Higgs-> bb discovery channel, and I worked on Higgs->gamma gamma discovery channel and some supersymmetry/exotics particle search. My PhD dissertation is search for long-lived particle decay into photons, in which the method is inspired by the tau discovery method, and my advisor is one of the scientists who discovered the tau particle (well, she also discovered many others, for example: Psi, D0, Tau, jets, Higgs). I have my Linkedin page linked on my Kaggle profile page, and I would like to link to you great Kaggle competitors.