Analyzing the formation of high tropospheric ozone during the 2015 heatwaves in Ljubljana with data mining

Johanna A. Robinson (1,2), johanna.robinson@ijs.si
Sašo Džeroski (3,1), saso.dzeroski@ijs.si
David Kocman (2), david.kocman@ijs.si
Milena Horvat (2), milena.horvat@ijs.si

ABSTRACT

This paper presents a data mining approach to analyzing how environmental parameters, measured by electrochemical gas sensors, influence the formation of ozone. Ozone is a hazardous gaseous air pollutant whose formation is accelerated at high temperatures. The study period includes the four heatwaves experienced in the summer of 2015 in Ljubljana, Slovenia. A temperature threshold above which ozone levels could exceed the limit was sought with the help of decision trees. Such a threshold can be used to warn the public about exceeding the allowed ozone levels. Across the different types of decision trees, we found the same temperature threshold of 34.7 °C, above which the highest ozone levels were predicted; these, however, did not exceed the limit value.

1. INTRODUCTION

While naturally occurring stratospheric ozone (O3) protects us from the harmful ultraviolet (UV) radiation from the Sun, tropospheric, i.e., ground-level, ozone is detrimental to human health. Ozone is formed from other pollutants through a complex set of photochemical reactions that need UV light. Common precursor gases are, e.g., oxides of nitrogen (NOx), carbon monoxide (CO) and volatile organic compounds (VOCs) [9].

Ozone is a major public concern because of its adverse impacts on human health. It causes a variety of health problems, ranging from triggering asthma attacks through decreased lung function to even death. It is especially harmful for children, the elderly, and people with allergies and lung diseases [14,4].

Ozone formation increases during warm sunny weather [6]. Mortality is higher during heatwaves, with some cases attributed to high ozone levels [13]. Slovenia experienced four heatwaves during the summer of 2015.
The Slovenian Environment Agency (ARSO) [10,11,12] reported the periods of the heatwaves as follows: (1) 2-14.6.2015, (2) 1-8.7.2015, (3) 11-26.7.2015, (4) 4-15.8.2015. During such episodes, it is recommended to keep physical exercise to a minimum.

In order to protect the public, the World Health Organization [14] has created guidelines for pollutants, which have also been adopted internationally, e.g., by the European Union [1,3]. The newest guideline for ozone is set to 100 µg/m3 for the daily maximum 8-hour mean, while the older value of 120 µg/m3 for the 8-hour mean is still valid in the European Union. No level of any air pollutant is entirely safe, and health effects might still occur in some individuals, even at pollutant concentrations below the limit values. The guidelines are set to provide adequate protection of public health, while also taking into consideration other living beings and the possible detrimental effect on historic heritage, and bearing in mind the economic and technical feasibility of following the guidelines [1].

2. AIMS AND HYPOTHESIS

The four heatwaves experienced in Slovenia during the summer of 2015 provided us with a basis for studying the photochemical processes of ozone formation, since ozone formation is accelerated at higher temperatures. We studied how environmental parameters affect and relate to the formation of ozone by using data mining tools.

The limit value for ozone is set in Directive 2002/3/EC [2] to 120 µg/m3 for the 8-hourly maximum, which is allowed to be exceeded at most 25 times a year. The hourly average threshold level above which the public needs to be informed immediately is set to 180 µg/m3, whereas the alert threshold is 240 µg/m3. Since higher ozone levels occur at higher temperatures, we wanted to find a temperature threshold above which ozone values would exceed the hourly limit values.
Such a threshold could help to issue warnings to the general public before a high ozone episode sets in. Usually, the warnings are given together with a heatwave warning, but incidents could also happen on individual days or in places that are not covered by the national monitoring programme. That is why finding an indicative threshold for a parameter such as temperature, which every household can easily read at home, would give the general public a good starting point for knowing when to avoid spending excess time outside and to limit exercise during hours with high ozone levels. This could be especially important for individuals who belong to the susceptible groups.

3. DATA AND METHODS

The data for analyzing ozone formation were obtained from a network of low-cost electrochemical gas sensor units deployed in Ljubljana since the winter of 2013/2014. The sensors belong to the CITI-SENSE project (www.citi-sense.eu), which is currently testing the performance and applicability of low-cost electrochemical gas sensors. The period chosen for the data analysis runs from the first heatwave to the fourth and last heatwave, i.e., from 1.6.2015 to 19.8.2015, as illustrated in Figure 1. The numerical data were downloaded from an online server as a CSV file and then pre-processed to meet the requirements of the chosen open-source data mining package, WEKA [5].

Figure 1. Measured temperature and ozone in Ljubljana between 1.6.2015 and 19.8.2015.

3.1 Pre-processing of the Data

During pre-processing, the data were screened for their suitability for analysis. Hourly averages were calculated as a new attribute, in order to be comparable with the limit values in European legislation, which are based on hourly averaged data. Also, due to technical limitations of electrochemical gas sensors, all data below the limit of detection (LOD) of 9.99 µg/m3 were excluded.
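
The aggregation and screening steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pre-processing script: the function names, the made-up readings, and the choice to apply the LOD filter to the hourly means (rather than to the raw 15-minute values) are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

LOD_UG_M3 = 9.99  # limit of detection stated in the text (µg/m3)


def hourly_means(readings):
    """Average (timestamp, value) readings within each clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def drop_below_lod(hourly, lod=LOD_UG_M3):
    """Keep only hourly means at or above the limit of detection.

    Assumption: the filter is applied after averaging; the paper does not
    state the order of the two steps.
    """
    return {hour: value for hour, value in hourly.items() if value >= lod}


# Made-up 15-minute ozone readings (µg/m3), for illustration only.
readings = [
    (datetime(2015, 7, 1, 13, 0), 80.0),
    (datetime(2015, 7, 1, 13, 15), 90.0),
    (datetime(2015, 7, 1, 13, 30), 100.0),
    (datetime(2015, 7, 1, 13, 45), 110.0),
    (datetime(2015, 7, 1, 14, 0), 4.0),
    (datetime(2015, 7, 1, 14, 15), 6.0),
]
screened = drop_below_lod(hourly_means(readings))
```

In this toy input the 13:00 hour averages to 95.0 µg/m3 and is kept, while the 14:00 hour averages to 5.0 µg/m3 and is dropped as below the LOD.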
The final number of data points, each corresponding to one hour of measurements, was 756. The data used for the analysis and their specifications are summarized in Table 1. All values used were numeric.

Table 1. The data: Basic statistics.

Parameter/Attribute              Missing  Min    Max    Mean
Ozone, O3 (µg/m3)                0%       10.1   236    60.9
Temperature (°C)                 0%       9.30   41.2   26.6
Humidity (%)                     0%       26.8   91.1   55.3
Air pressure (mb)                0%       971    990    980
Nitrogen monoxide, NO (µg/m3)    83%      0.12   27.0   6.32
Nitrogen dioxide, NO2 (µg/m3)    57%      0.20   166    38.7
Nitrogen oxides, NOx (µg/m3)     86%      0.01   165    27.0
Carbon monoxide, CO (µg/m3)      9.0%     0.24   376    80.8

3.2 Choosing a Data Mining Method

An initial test to determine whether the ozone data depend linearly on temperature was run using LeastMedSq, WEKA's least median of squares linear regression. According to the results, we believe that the relation is not linear. To see which other parameters/attributes play a role, we ranked the attributes with the ReliefFAttributeEval attribute evaluator within WEKA [5]. It implements the instance-based RReliefF method [8] for estimating the relevance of attributes/features and for feature ranking. We used 10-fold cross-validation to calculate the relevance scores of the features and their variance. The parameters were ranked as shown in Table 2.

NO (forming most of NOx) and CO are the two common precursor gases forming ozone in the troposphere, and this explains their high ranking. NO2, on the other hand, commonly counteracts this, thus reducing the amount of ozone. This probably explains why it ranks low, together with the fact that many of its data values are missing (57%).

Table 2. Ranking the attributes with RReliefF in terms of their relevance for predicting the ozone concentration.

Rank  Parameter
1.    NO
2.    NOx
3.    CO
4.    Humidity
5.    Temperature
6.    Air pressure
7.    NO2

Two types of decision trees were considered: (1) model trees and (2) regression trees.
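
The two tree types differ only in what their leaves return: a regression tree leaf predicts a constant, whereas a model tree leaf predicts with a linear model fitted to the instances that reach it. The toy Python sketch below illustrates this difference with a single split on temperature. The split point, the leaf constants and the linear coefficients are hypothetical values chosen for illustration; they are not taken from the trees learned in this study.

```python
SPLIT_TEMP_C = 30.0  # hypothetical split point, not a result of this study


def regression_tree_predict(temp_c):
    """Regression tree: every leaf returns a constant ozone value (µg/m3)."""
    if temp_c <= SPLIT_TEMP_C:
        return 55.0   # hypothetical leaf constant
    return 120.0      # hypothetical leaf constant


def model_tree_predict(temp_c):
    """Model tree: every leaf evaluates its own linear model (µg/m3)."""
    if temp_c <= SPLIT_TEMP_C:
        return 10.0 + 2.0 * temp_c    # hypothetical linear model
    return -50.0 + 5.0 * temp_c       # hypothetical linear model


cool_day = (regression_tree_predict(25.0), model_tree_predict(25.0))
hot_day = (regression_tree_predict(35.0), model_tree_predict(35.0))
```

Because a model tree leaf can follow a trend within its region, a model tree typically needs fewer leaves than a regression tree to reach a similar fit.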
For the purpose of building such trees, we used the M5P algorithm, a Java reimplementation of the M5 algorithm [7] within the WEKA [5] data mining software package. The availability and distribution of the environmental attributes (other than ozone) vary. The distribution of the values of all attributes is illustrated in Figure 2.

Figure 2. The distribution of the attribute values in the data.

4. RESULTS AND DISCUSSION

The initial runs with WEKA, using all the available parameters, are described below. The default settings of M5P were used unless otherwise stated. Both a regression tree and a model tree were created. The initial regression tree was more complex, with 20 rules, compared to the model tree with only 5. M5P selected humidity as the most important parameter for prediction, placing it at the root of both decision trees. This position of humidity was expected, since electrochemical gas sensors depend strongly on humidity: the sensors function by absorbing or evaporating moisture in the electrolyte inside them.

The highest predicted ozone levels were 139 µg/m3, 122 µg/m3 and 108 µg/m3; they did not exceed the limit value set by EU regulations (180 µg/m3). According to the regression tree, these high ozone levels are reached if the humidity is less than or equal to 50.5%, the temperature is above 34.7 °C and carbon monoxide is less than or equal to 80.9 µg/m3. The two highest predicted ozone levels are reached if the air pressure is less than or equal to 983 mb: if the humidity is less than or equal to 39.8%, the predicted ozone is 139 µg/m3, and if it is higher than 39.8%, the predicted ozone is 122 µg/m3. The third highest ozone level (108 µg/m3) is reached if the air pressure is higher than 983 mb. The model tree follows the same rules, but has fewer nodes.

As the initial regression tree was rather large, we pruned it further by setting the minimum number of instances per leaf to 80 (initially 4).
This results in a tree with 14 nodes, which follows the same rules to reach the same three highest predicted ozone levels, now in nodes 6, 7 and 8 (Figure 3). Even after pruning the tree to include more instances in the leaves, the correlation coefficient is still high at 0.801.

Since a large number of data points were missing for all oxides of nitrogen (Table 1), we excluded them from the next runs. They also did not play a role in the decision trees in reaching the high ozone values. Considering that carbon monoxide is one of the precursor gases and that it appears in the decision trees, we decided to keep it in the next runs even though 9% of its values were missing. The rules leading to the prediction of high ozone values stay the same. For example, the rule for the highest predicted ozone level in the second pruned regression tree is:

    IF Hum <= 50.5 AND Temp > 34.7 AND CO <= 80.9 AND AP <= 983 AND Hum <= 39.75 THEN O3 = 139

When carbon monoxide, which was missing some of its data points, was also excluded from the run, we obtained a correlation coefficient slightly lower than 0.8 for both the regression tree and the model tree.

As the original research question was to find a temperature threshold that would determine high ozone values, we inspected the visualized regression trees and model trees and found a threshold temperature of 34.7 °C, above which higher ozone values are predicted. The threshold stayed the same in all the decision trees, i.e., in the model trees as well as in the regression trees. In addition, the pruned regression tree produced an extra threshold at 37.7 °C, with the highest predicted ozone level at node 5 (133 µg/m3). Table 3 summarizes all three runs, showing the parameters used as well as the size and accuracy of the obtained decision trees.

Table 3. Summary of the three runs with M5P, all using temperature (Temp), humidity (Hum) and air pressure (AP). MT, RT, and PT = model, regression and pruned regression tree.

Parameters used                    Tree  Corr. coeff.  Num. of rules
Temp, Hum, AP, CO, NO, NO2, NOx    MT    0.846         5
Temp, Hum, AP, CO, NO, NO2, NOx    RT    0.805         20
Temp, Hum, AP, CO, NO, NO2, NOx    PT    0.801         14
Temp, Hum, AP, CO                  MT    0.844         8
Temp, Hum, AP, CO                  RT    0.810         21
Temp, Hum, AP, CO                  PT    0.806         13
Temp, Hum, AP only                 MT    0.797         7
Temp, Hum, AP only                 RT    0.775         11
Temp, Hum, AP only                 PT    0.776         9

5. CONCLUSIONS

Data from an experimental project using low-cost electrochemical gas sensors were used to determine the influence of environmental parameters on ozone formation. As ozone is formed in photochemical reactions, we predicted that temperature would play the largest role in determining the ozone levels. The summer of 2015 provided a good database for studying this relation, since Slovenia experienced four heatwaves during the period between 1.6.2015 and 19.8.2015. The early runs of linear regression did not produce a simple rule that would have been easy to use for warning the population of high ozone levels. This directed us to look into the other environmental parameters, e.g., the precursor gases, to find how ozone is formed.

Since the limit values of ozone are given on an hourly basis, the original raw data (15-minute averages) were aggregated at the hourly level to be comparable with the national and international information and alert thresholds. In order to see which parameters play the largest role, an instance-based feature ranking algorithm was run. It produced a ranking where the highest ranks were taken by the precursor gases, followed by humidity, and only then temperature.

The M5P program was used to analyze the data by creating regression trees as well as model trees, helping us to find threshold values which determine high ozone concentrations. Initial runs were made with all the available environmental parameters (humidity, temperature, air pressure, NO, NO2, NOx and CO). The high importance of humidity is partially due to the technical features of how electrochemical gas sensors work.
Many elements and processes in the atmosphere determine the formation of gases, e.g., the presence of precursor gases as well as the intensity of UV light. In order to keep a representative set of environmental parameters, we eliminated the ones which, even though they play a role in the atmosphere, were missing a substantial number of values (up to 86%) due to poor sensor performance in the chosen time period.

The trees predicted ozone levels with correlation coefficients between 0.775 and 0.846. Regardless of which decision tree was used, we found a temperature threshold of 34.7 °C. When the temperature is above 34.7 °C, higher ozone levels can occur in the troposphere. However, the models did not yield a temperature threshold that would directly result in ozone levels above the limit value. We also found that the predicted ozone values were relatively low, considering that the background level for ozone is 70 µg/m3 [14] and the mean ozone value in our dataset was 61 µg/m3.

Information that can be extracted from parameters available to all households, e.g., temperature, could be a good start for creating a smartphone application that warns people when ozone is expected to reach a certain level. The application could take data from a home thermometer placed outdoors, and the threshold value for notification could be modified by the user, enabling it to trigger even at lower levels of ozone, depending on the sensitivity of the user to ozone.

6. ACKNOWLEDGMENTS

The CITI-SENSE project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 308524. This work has also received funding from the Slovenian Research Agency through programme P1-0143.

7. REFERENCES

[1] EC. Council Directive 96/62/EC of 27 September 1996 on ambient air quality assessment and management. Retrieved from: http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:31996L0062.
[2] EC. Directive 2002/3/EC of the European Parliament and of the Council of 12 February 2002 relating to ozone in ambient air. OJ L 67, 3 March 2002. Retrieved from: http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:32002L0003.
[3] EC. Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner air for Europe. Retrieved from: http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32008L0050.
[4] European Environment Agency (EEA). 2012. Air quality in Europe - 2012 report. EEA Report No 4/2012, Copenhagen.
[5] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11, 1: 10-18.
[6] Lee, J.D., Lewis, A.C., Monks, P.S., et al. 2006. Ozone photochemistry and elevated isoprene during the UK heatwave of August 2003. Atmos. Environ. 40, (Dec. 2006), 7598-7613. DOI=10.1016/j.atmosenv.2006.06.057.
[7] Quinlan, J.R. 1992. Learning with Continuous Classes. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. World Scientific, Singapore, 343-348.
[8] Robnik-Šikonja, M., and Kononenko, I. 2003. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 53, (Oct. 2003), 23-69.
[9] Salvato, J.A., Nemerow, N., and Agardy, F.J. 2003. Environmental engineering. Fifth edition. John Wiley & Sons: New Jersey.
[10] Slovenian Environment Agency - ARSOa. 2015. Vreme in podnebje - zanimivosti. Available at www.arso.gov.si/vreme/zanimivosti/
[11] Slovenian Environment Agency - ARSOb. 2015. Agrometeorologija - aktualni prispevki. Available at www.arso.gov.si/vreme/agrometeorologija/aktualno.html
[12] Slovenian Environment Agency - ARSOc. 2015. Arhiv novic za zadnjih 6 mesecev. Available at www.arso.gov.si/o%20agenciji/novice/arhiv.html
[13] Williams, S., Nitschke, M., Weinstein, P., Pisaniello, D.L., Parton, K.A., and Bi, P. 2012. The impact of summer temperatures and heatwaves on mortality and morbidity in Perth, Australia 1994-2008. Environ. Int. 40 (Apr 2012), 33-38. DOI=10.1016/j.envint.2011.11.011.
[14] World Health Organization (WHO). 2006. Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide. Global update 2005. Summary of risk assessment. WHO, Geneva.

(1) Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana
(2) Jožef Stefan Institute, Department of Environmental Sciences, Jamova 39, 1000 Ljubljana
(3) Jožef Stefan Institute, Department of Knowledge Technologies, Jamova 39, 1000 Ljubljana