Enhancing nitrate and strontium concentration prediction in groundwater by using new data mining algorithm

Bui, D. T.; Khosravi, K.; Karimi, M.; Busico, G.; Khozani, Z. S.; Nguyen, H.; Mastrocicco, M.; Tedesco, D.; Cuoco, E.; Kazakis, N.

doi:10.1016/j.scitotenv.2020.136836

Groundwater resources constitute the main source of clean fresh water for domestic use and are essential for food production in the agricultural sector. Groundwater plays a vital role in water supply in the Campanian Plain in Italy, and hence the future sustainability of the resource is essential for the region. In the current paper, novel data mining algorithms, including the Gaussian Process (GP), were applied to a large groundwater quality database to predict nitrate (a contaminant) and strontium (potentially increasing in the future) concentrations in groundwater. The results were compared with the M5P, random forest (RF) and random tree (RT) algorithms as benchmarks to test the robustness of the modeling process. The dataset includes 246 groundwater quality samples originating from different wells, both municipal and agricultural. For the modeling process, it was divided into two subgroups using the 10-fold cross-validation technique: 173 samples for model building (training dataset) and 73 samples for model validation (testing dataset). Different water quality variables, including T, pH, EC, HCO₃⁻, F⁻, Cl⁻, SO₄²⁻, Na⁺, K⁺, Mg²⁺, and Ca²⁺, were used as inputs to the models. In the first stage, different input combinations were constructed based on the correlation coefficient, and the optimal combination was then chosen for the modeling phase. Different quantitative criteria, alongside a visual comparison approach, were used to evaluate the modeling capability. The results revealed that, to obtain reliable predictions, variables with low correlation should also be considered as inputs to the models, together with those variables showing high correlation coefficients. According to the model evaluation criteria, the GP algorithm outperforms all the other models in predicting both nitrate and strontium concentrations, followed by RF, M5P and RT, respectively.
The results also revealed that the model's structure, together with the accuracy and structure of the data, can have a relevant impact on the model's results.
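
The input-selection stage described in the abstract (building input combinations from the correlation coefficient) can be illustrated with a minimal Python sketch. The abstract does not specify the exact procedure, so the approach here is an assumption: variables are ranked by the absolute Pearson correlation with the target, and nested combinations are formed by adding one variable at a time. The function names `pearson` and `nested_input_combinations` and the toy values are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / (sd_x * sd_y)


def nested_input_combinations(features, target):
    """Rank variables by |r| with the target, then build nested input sets.

    Assumed scheme: combination k holds the k most correlated variables,
    so the last combination also includes the weakly correlated ones.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


# Toy values for illustration only (not measurements from the paper).
toy_features = {
    "EC": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pH": [7.1, 6.9, 7.3, 7.0, 7.2],
    "Cl": [2.0, 1.0, 4.0, 3.0, 5.0],
}
toy_target = [1.1, 2.0, 2.9, 4.2, 5.0]

combos = nested_input_combinations(toy_features, toy_target)
for combo in combos:
    print(combo)
```

Each resulting combination would then be passed to a model (GP, RF, M5P or RT) and scored on the testing dataset. Carrying the nesting through to the full variable set reflects the abstract's finding that low-correlation variables should not be discarded outright.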