Using Random Forest to identify planetary systems with an Earth Twin
Observational surveys generally face more targets than can be observed in their granted observing time. Ranking targets by order of merit is therefore important for improving the outcome of such programmes. Using numerical simulations of planetary system formation as a basis, we trained a Random Forest Classifier (RFC) to predict which systems are likely to harbour a planet similar to the Earth. We considered seven features: the architecture of the planetary system (as defined in a previous paper); the period, mass and radius of the innermost observable planet; the number of detected planets; the mass of the star; and the number of giant planets observed in the system. We showed that our RFC performs very well in terms of accuracy and recall.
Our RFC is made up of 500 decision trees, which reduces the variance through ensemble learning while keeping the training time reasonable. Each decision tree is trained on a minimum sample of 100 instances in order to increase the diversity between the trees while still allowing the classification to generalise. Trees trained on fewer instances tend to learn details of the training data and overfit. Finally, the maximum depth of each tree is limited to five in order to limit the complexity of the model, forcing it to capture only the most important relationships in the data.
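As an illustration, the configuration described above could be set up with scikit-learn roughly as follows. This is a minimal sketch, not the code used in the paper: the data are a synthetic stand-in generated with `make_classification`, and mapping the "minimum sample of 100 instances" to `min_samples_leaf=100` is an assumption (`min_samples_split` is another plausible reading). The last lines show how the predicted probability could be used to rank targets by order of merit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the simulated planetary systems:
# seven features and a binary label (1 = "with Earth-like planet").
X, y = make_classification(
    n_samples=5000, n_features=7, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=500,      # 500 decision trees
    max_depth=5,           # maximum depth of each tree limited to five
    min_samples_leaf=100,  # ASSUMPTION: one reading of "minimum sample of 100 instances"
    random_state=0,
)
clf.fit(X_train, y_train)

# Accuracy and recall on held-out systems.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")

# Rank held-out systems from most to least likely to host an Earth-like planet.
proba = clf.predict_proba(X_test)[:, 1]
ranking = proba.argsort()[::-1]
```

Here `ranking[0]` is the index of the system the classifier considers the most promising target, which is the kind of ordering an observing programme with limited time could follow.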
The figure below, taken from this paper, depicts the importance of each feature in a SHAP (SHapley Additive exPlanations) values diagram. The diagram shows how each feature contributes to the model’s prediction for each individual instance. The y-axis lists the features from most influential (top) to least influential (bottom), while the x-axis shows the SHAP value of each feature for each dataset instance. Negative SHAP values indicate a stronger contribution to the decision ‘without Earth-like Planet’, while positive values indicate a stronger contribution to the decision ‘with Earth-like Planet’. Additionally, the colour of each point represents the feature value itself, with higher values in red and lower values in blue.
Bee swarm plot of the seven features considered. The x-axis represents the SHAP value of the feature for each instance, and the y-axis represents the seven features considered ranked from the most important (top) to the least (bottom). The colour of the dots represents the value of the feature itself, red being high values and blue being low values. Adapted from Davoult et al. (2025), Astronomy and Astrophysics, in press, DOI: 10.1051/0004-6361/202452434
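To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values for a single instance of a small, hypothetical scoring function. It is not the paper's classifier and does not use the `shap` library: `toy_model`, `instance` and `background` are illustrative assumptions, and "absent" features are simply set to a single background reference value. The sign convention matches the plot: a positive value pushes the score up, a negative value pushes it down.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, background):
    """Exact Shapley values of `model` at `instance`, relative to `background`."""
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance's value; the others take the background's.
        mixed = [instance[i] if i in subset else background[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy score with three features and one interaction term.
def toy_model(x):
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 2.0, 4.0]
background = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, background)
print(phi)  # approximately [3.0, -2.0, 1.0]
```

The contributions add up to the difference between the score of the instance and the score of the background, which is the additive property that gives SHAP its name. The interaction term is shared equally between the two features involved.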