Science: Improving Tornado Prediction Using Data Mining
    Wednesday July 18, 2012 2:45pm - 3:15pm @ Toledo 5th Floor

    Abstract: Responsible for over four hundred deaths in the United States in 2011, tornadoes are a major source of preventable deaths and property damage. One major cause of the high number of tornado fatalities each year is the high False Alarm Ratio (FAR). FAR is a forecasting metric that describes the fraction of issued tornado warnings for which no tornado occurs.

    While tornado forecasting has improved dramatically over the past decades, the FAR has consistently remained between seventy and eighty percent. This indicates that as many as eighty percent of all tornado warnings are false alarms. Consequently, the public has gradually become desensitized to tornado warnings, and thus does not seek shelter when it is appropriate [1]. If the number of fatalities caused by tornadoes is to be reduced, the FAR must first be lowered.
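
    As a concrete illustration of the metric, the sketch below computes FAR from verification counts. The function name and the counts are hypothetical and chosen only to show the arithmetic; they are not data from this study.

```python
def false_alarm_ratio(hits: int, false_alarms: int) -> float:
    """Fraction of issued warnings that were not followed by a tornado.

    hits:         warnings that were followed by a tornado
    false_alarms: warnings that were not followed by a tornado
    """
    issued = hits + false_alarms
    if issued == 0:
        raise ValueError("no warnings were issued")
    return false_alarms / issued


# Hypothetical counts, for illustration only: 100 warnings, 25 verified.
far = false_alarm_ratio(hits=25, false_alarms=75)
print(f"FAR = {far:.2f}")
```

    Note that missed tornadoes (events with no warning) do not enter this ratio; FAR only describes how often an issued warning turns out to be false.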

    The reasons for the high FAR are complex and manifold. Arguably the most pragmatic reason is simply caution. Faced with the decision of whether or not to issue a warning, a meteorologist can err in one of two ways. The first error, a type I error, occurs when the meteorologist predicts a tornado, but one does not occur. The second error, a type II error, occurs when a meteorologist does not predict a tornado, but one does occur. Because the former is merely inconvenient and the latter is potentially fatal, meteorologists tend to err on the side of caution, and typically issue a tornado warning for any storm that shows signs of being tornadic.
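
    One way to see why this asymmetry pushes forecasters toward warning is a simple expected-cost comparison. The sketch below is an assumption made for illustration: the decision rule, the function name, and the cost values are hypothetical and are not described in the abstract.

```python
def should_warn(p_tornado: float, cost_false_alarm: float, cost_miss: float) -> bool:
    """Warn when staying silent has the higher expected cost.

    p_tornado:        estimated probability that the storm produces a tornado
    cost_false_alarm: cost of a type I error (warning issued, no tornado)
    cost_miss:        cost of a type II error (no warning, tornado occurs)
    """
    expected_cost_if_warning = (1.0 - p_tornado) * cost_false_alarm
    expected_cost_if_silent = p_tornado * cost_miss
    return expected_cost_if_silent > expected_cost_if_warning


# Hypothetical costs: a miss is treated as 100 times worse than a false alarm.
print(should_warn(0.05, cost_false_alarm=1.0, cost_miss=100.0))  # asymmetric costs
print(should_warn(0.05, cost_false_alarm=1.0, cost_miss=1.0))    # symmetric costs
```

    Under these assumed costs, even a five percent chance of a tornado justifies a warning, whereas with symmetric costs it would not. A cautious policy of this kind necessarily produces many false alarms.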

    In addition to caution, a critical cause of the high FAR is a limited understanding of tornadogenesis, the process by which tornadoes form. Fortunately, tornadoes are a relatively infrequent phenomenon. However, tornado scarcity, coupled with the limitations of current radar technology, has resulted in a limited amount of real-world data.

    Because of the extraordinary complexity of tornadogenesis and the limited amount of real-world data, tornadogenesis has stubbornly resisted complete understanding for centuries. However, for the first time in history, technology has matured enough to begin solving this ancient problem.

    The scarcity of real-world data can be addressed by running highly sophisticated and computationally intensive mathematical models on Kraken, a high-performance computer provided by XSEDE, to numerically simulate supercells. By using simulated data in lieu of real-world data, datasets of arbitrary resolution and scale can be generated as needed [2].

    Unfortunately, while the use of a simulator mitigates the issue of data scarcity, it also creates a new one. If the simulations are of sufficiently high resolution to be useful, the size of the dataset becomes astronomical. At present our dataset consists of approximately fifty individual simulations, each approximately one terabyte in size, for a total of roughly fifty terabytes of data. Because the dataset is too large for any individual to analyze and understand, new techniques capable of autonomously analyzing the dataset had to be developed.

    Specifically, Spatiotemporal Relational Probability Trees (SRPTs) were developed. SRPTs are an augmentation of a classic data mining algorithm, the Probability Tree (PT). Probability Trees were chosen for their strong predictive ability, efficiency, and ability to scale to large datasets [3]. Additionally, unlike many data mining algorithms, a Probability Tree is human-readable. Consequently, after the PT has been grown using the simulated dataset, further insights can be drawn from its structure by domain scientists.
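
    To make the human-readable property concrete, here is a minimal sketch of an ordinary (non-spatiotemporal) probability tree. It is not the SRPT implementation: the `Node` structure, the feature names `updraft_speed` and `low_level_rotation`, and every threshold and probability are hypothetical values invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # A leaf sets `probability`; an internal node sets the other fields.
    probability: Optional[float] = None
    feature: Optional[str] = None
    threshold: Optional[float] = None
    yes: Optional["Node"] = None  # branch taken when value > threshold
    no: Optional["Node"] = None   # branch taken otherwise


def predict(node: Node, storm: dict) -> float:
    """Walk from the root to a leaf and return that leaf's probability."""
    while node.probability is None:
        node = node.yes if storm[node.feature] > node.threshold else node.no
    return node.probability


def describe(node: Node, indent: str = "") -> list:
    """Render the tree as nested if/else text that a person can read."""
    if node.probability is not None:
        return [f"{indent}P(tornado) = {node.probability:.2f}"]
    lines = [f"{indent}if {node.feature} > {node.threshold}:"]
    lines += describe(node.yes, indent + "    ")
    lines += [f"{indent}else:"]
    lines += describe(node.no, indent + "    ")
    return lines


# Hand-built toy tree; all numbers are made up.
tree = Node(
    feature="updraft_speed", threshold=30.0,
    yes=Node(
        feature="low_level_rotation", threshold=0.01,
        yes=Node(probability=0.6),
        no=Node(probability=0.2),
    ),
    no=Node(probability=0.05),
)

print("\n".join(describe(tree)))
print(predict(tree, {"updraft_speed": 35.0, "low_level_rotation": 0.02}))
```

    Because every prediction is the result of a short chain of explicit tests, a domain scientist can read the printed tree directly and judge whether the learned conditions are physically plausible.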

    SRPTs differ from PTs in several ways. The most important difference is that SRPTs are capable of creating spatial, temporal, and relational distinctions within the tree. This gives SRPTs the ability to reason about spatiotemporal and relational data. Because the natural world is inherently spatiotemporal and relational, and many scientific datasets share these properties, this greatly enhances the strength and applicability of SRPTs.

    However, while SRPTs are powerful predictors, they do suffer from one major weakness: overfitting. Overfitting occurs when the SRPTs cease to discover generalizable, meaningful patterns within the data, and instead begin fitting to minutiae and noise within the dataset. Overfitting in SRPTs can largely be mitigated by limiting the depth to which the tree grows. However, this also limits the predictive ability of an individual SRPT.

    To address these issues, Spatiotemporal Relational Random Forests (SRRFs) were developed. Much like a traditional random forest, an SRRF is an ensemble of SRPTs. By growing hundreds of individual SRPTs from bootstrap samples of the dataset, then combining each individual tree’s prediction in an intelligent way, SRRFs are capable of discovering far more complex patterns within the data [4]. Additionally, because each tree within the SRRF is limited in size, overfitting is far less of an issue.
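
    The general idea of an ensemble of depth-limited trees grown from bootstrap samples can be sketched as follows. This is a simplified stand-in on plain tabular data, not the SRRF algorithm: it has no spatial, temporal, or relational distinctions, the Gini split criterion and the toy dataset are assumptions, and averaging the trees' probabilities is only one possible way of combining predictions, since the abstract does not specify the combination rule.

```python
import random


def leaf_probability(rows):
    """Fraction of positive labels among (features, label) rows."""
    return sum(label for _, label in rows) / len(rows)


def gini(rows):
    p = leaf_probability(rows)
    return 2.0 * p * (1.0 - p)


def grow_tree(rows, max_depth):
    """Grow a probability tree; max_depth caps growth to limit overfitting."""
    p = leaf_probability(rows)
    if max_depth == 0 or p in (0.0, 1.0):
        return p  # a leaf is just a probability
    best = None
    for f in range(len(rows[0][0])):
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] <= threshold]
            right = [r for r in rows if r[0][f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        return p  # no valid split exists
    _, f, threshold, left, right = best
    return (f, threshold,
            grow_tree(left, max_depth - 1),
            grow_tree(right, max_depth - 1))


def tree_predict(tree, x):
    while isinstance(tree, tuple):  # internal nodes are tuples
        f, threshold, left, right = tree
        tree = left if x[f] <= threshold else right
    return tree


def grow_forest(rows, n_trees, max_depth, rng):
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) rows with replacement.
        sample = [rng.choice(rows) for _ in rows]
        forest.append(grow_tree(sample, max_depth))
    return forest


def forest_predict(forest, x):
    """Combine the trees by averaging their predicted probabilities."""
    return sum(tree_predict(t, x) for t in forest) / len(forest)


# Toy data: the label is 1 only when both synthetic features exceed 0.5.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(100)]
rows = [(x, 1 if x[0] > 0.5 and x[1] > 0.5 else 0) for x in points]

forest = grow_forest(rows, n_trees=15, max_depth=3, rng=rng)
print(forest_predict(forest, (0.9, 0.9)))
print(forest_predict(forest, (0.1, 0.1)))
```

    Each tree sees a slightly different resampling of the data and is kept shallow, so no single tree can memorize noise, while the averaged prediction can still capture structure that one shallow tree would miss.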

    By training SRRFs on the simulated tornado dataset, we hope to discover salient patterns and conditions necessary for the formation of tornadoes. By discovering and understanding these patterns, traditional tornado forecasting may be improved dramatically. Specifically, these patterns may aid in differentiating storms which will produce a tornado from those which will not. This ability would have the immediate effect of reducing the False Alarm Ratio, and ultimately aid in restoring the public’s confidence in tornado warnings. By doing so, we may ultimately reduce the number of preventable fatalities caused by tornadoes.


    1. Rosendahl, D. H., 2008: Identifying precursors to strong low-level rotation with numerically simulated supercell thunderstorms: A data mining approach. Master's thesis, University of Oklahoma, School of Meteorology.

    2. Xue, Ming, Kevin Droegemeier, and V. Wong. "The Advanced Regional Prediction System (ARPS) - A Multiscale Nonhydrostatic Atmospheric Simulation and Prediction Model. Part 1: Model Dynamics and Verification." Meteorology and Atmospheric Physics 75 (2000): 161-193. Print.

    3. McGovern, Amy and Hiers, Nathan and Collier, Matthew and Gagne II, David J. and Brown, Rodger A. (2008). Spatiotemporal Relational Probability Trees. Proceedings of the 2008 IEEE International Conference on Data Mining, Pages 935-940. Pisa, Italy. 15-19 December 2008.

    4. Supinie, Timothy and McGovern, Amy and Williams, John and Abernethy, Jennifer. Spatiotemporal Relational Random Forests. Proceedings of the 2009 IEEE International Conference on Data Mining (ICDM) workshop on Spatiotemporal Data Mining, electronically published.



    Type: Science Track
    Session Titles: Statistical Methods/Weather
