Outage Duration Prediction and Cause Analysis

Outage Duration Prediction and Cause Analysis

Created by Anvita Gollu | GitHub Repo | EECS 398

Introduction

Research Motivation:

The central research question is: “What characteristics are associated with each category of cause?”. Identifying trends within the data could lead to predictive models for future outages, helping utility companies anticipate outages to prioritize maintaenance and design mitigantion strategies. Overall, this would reduce downtime and improve the reliabilty for customers.

Dataset Overview

The dataset contains 1534 rows with each row representing a specific outage event witnessed by different states in the continental U.S. during January 2000 – July 2016.

Column Name Description
CAUSE.CATEGORY Category of the event that caused the power outage
CAUSE.CATEGORY.DETAIL More specific description of the outage cause within the category
OUTAGE.DURATION Duration of the power outage in minutes
U.S._STATE US state where the outage occurred
CLIMATE.REGION US climate region classification based on National Centers for Environmental Information
YEAR The year the outage took place
POPULATION Population size, which may provide urbanization details or affect outage complexity and response time
MONTH The numeric month of the outage, which contributes to seasonal features
OUTAGE.START.DATE Date the outage started
OUTAGE.START.TIME Time of day the outage started
OUTAGE.RESTORATION.DATE Date when power was fully restored
OUTAGE.RESTORATION.TIME Time of day when power was restored

Exploratory Data Analysis


The Distribution of Outage Categories bar chart shows that “Severe Weather” is the most common cause category, accounting for 763 outages. This is more than half of all the recorded events. “Intentional Attack” is the second most common cause with 418 outages, highlighting that human-driven disruptions such as vandalism (cause of 335 outages) is a significant concern. Other categories such as equipment failure, fuel supply issues, and public appeal are relatively rare.

The second plot Top 5 Outage Causes reinforces that human factors and natural hazards contribute to power outages. The top weather related causes such as “Thunderstorm” (178), “Winter Storm” (101), “Hurricanes” (74), and “Heavy Wind” (61) occur during different seasons. This suggests that weather-related power disruptions are diverse and seasonally distributed.


Average Outage Duration by Cause (in Hours)
Cause Category Outage Duration (Hours)
fuel supply emergency224.73
severe weather64.73
equipment failure30.28
public appeal24.47
system operability disruption12.15
intentional attack7.17
islanding3.34

To explore the relationship between causes and outage duration, I created a box plot and pivot table. This reveals that severe weather outages are not only the most common but also have a wide range of durations where most of them fall within a reasonable window but some extend beyond 500 hours, indicating occasional extreme events. On the other hand, although fuel supply outages is rare, they have a long median duration and a few significant outliers exending 1000+ hours. Intentional attack and islanding have short and clustured durations, suggesting these are often resolved quickly. The pivot table complements the box plot by summarizing the mean outage duration per cause; fuel supply emergency is the highest with an average of 224.73 hours, followed by severe weather at 64.73 hours, and equipment failure at 30.28 hours.


Severe Weather Event Count by Climate Region
Climate Region Severe Weather Event Count
Northeast 175
Central 133
Southeast 116
South 106
East North Central 104
West 67
Northwest 25
Southwest 10
West North Central 4

The pivot table shows how frequently severe weather is reported by the ause of power outages across different U.S. climate regions. The top 3 are Northeast (175), Central (133), and Southeast (116). This aligns with the knows weather patterns in these areas as being particularly prone to storms, hurricanes, and winter weather. This distribution enforce the idea that geography plays a major role in the likelihood of outages due to weather.


Data Cleaning

Combining Start Date and Start Time

I combined the OUTAGE.START.DATE and OUTAGE.START.TIME columns into a single OUTAGE.START timestamp using pd.to_datetime(). Similarly, I also created a OUTAGE.RESTORATION timestamp from OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME columns. Before doing this, I dropped any rows that were missing in these columns to avoid generating invalid timestamps.

Significance: This transformation is important because it allows for more precise time-based calculations, such as calculating outage durations, identifying trends by hour or seasons, and incorpating time-based features into predictive models. Combining these featires makes time-based operations more consistent and readable.

Imputation

To ensure integrity of the model, I removed all rows that contained missing values in the featues used for the prediction model (CAUSE.CATEGORY, CLIMATE.REGION, YEAR, OUTAGE.START, POPULATION, MONTH, and U.S._STATE), as well as the response variable OUTAGE.DURATION.

For the remaining columns not used in modeling, I used a more standard imputation technique. I only chose to use this on the remaining columns becuase using it on the features for the predictive model might introduce bias or not reflect real patterns in the data, especially since the missingness isn’t random.

  • Numerical columns were imputed using the median value of each column. Compared to using mean, median is more robust to outliers and maintains central tendency, making it suitable for skewed and distrubutions with noise.
  • Categorical columns were imputed using mode. This maintains logical consistency, ensuring the imputed value reflect the most common real-world obseved value.

Framing a Prediction Problem

The prediction problem is to forecast the duration of a power outage, specifically the OUTAGE.DURATION column, which represents the length of time a power outage lasts(measured in minutes). This is a regression problem because we are predicting a continuous numerical outcome that represents the difference in minutes.

Why Outage Duration?

The OUTAGE.DURATION variable is chosen because understanding the length of outages can help utilities better plan for anticipating resource needs, improving maintainance schedules, and estimating potential impact on customers. Overall, it will help improve service reliability.

Evaluation Metric

I will use Mean Absolute Error (MAE) to evaluate performance, because it’s easy to interpret in real-world terms (minutes) and is robust to outliers. This metric treats all errors equally, regardless of whether they are above or below the predict.

I am chosing Mean Absolute Error over Mean Squared error because it is less sensitive to exteme outliers, which is important since a few unusually long outages could skew MSE. We care more about average accuracy over punishing rare outliers here. Additionally, the units for MSE would be minutes sqaured, making it harder to interpret.

Justification for Features at Prediction Time

All features used in training are ones that would be available at the time the outage is reported.

  • Known before the outage occurs: ‘YEAR’, ‘MONTH’, ‘POPULATION’, ‘U.S._STATE’, ‘CLIMATE.REGION’, ‘CLIMATE.CATEGORY’
  • Known immediately at the start of the outage: ‘CAUSE.CATEGORY’, ‘OUTAGE.START’

I intentionally excluded any features that would only be known after the outage is resolved (such as restoration time, total demand lost, or post-outage metrics) because including them would lead to data leakage and artificially inflate performance of the model. This ensure the model reflects a realistic prediction scenario and could be used to estimate outage duration in real-time as soon as an outage begins.


Baseline Model

For the baseline model, I used linear regression on two features: ‘YEAR’ and ‘U.S._State’. These features were chosen because they are both readily available at the time of predication and they serve as basic glimpses into the temporal and geographic context.

  • Quantitative feature (1):
    • YEAR: A numerical variable indicating the year the outage occurred.
  • Nominal feature (1):
    • U.S._STATE: A categorical variable representing the U.S. state where the outage took place. This was encoded using OneHotEncoder within a ColumnTransformer to make it usable in the linear model.

A pipeline was used to combine preprocessing and regression in a single step, ensuring a clean and repeatable transformation.

Performance Evaluation

The baseline model achieved a Mean Absolute Error (MAE) of 2943.65 minutes on the test set. This translates to an average error of about 49 hours, indicating the model provides a rough estimate of the outage duration. While the model is not highly accurate, it establishes a reasonable baseline. The purpose of this model is to provide a benchmark for evaluating more sophisticated models. I would not consider this model “good,” but it is appropriate as a starting point.


Final Model

New Engineered Features

  • ‘OUTAGE.HOUR’ – Extracted from ‘OUTAGE.START’, capturing the hour an outage began. Outages at night may face slower restoration times due to lower staffing or delayed detection.
  • ‘LOG.POPULATION’ – A log-transformed version of ‘POPULATION’. This transformation reduces skew caused by large population values and helps the model interpret the relationship more linearly.
  • ‘SEASON’ – Derived from the ‘MONTH’ column. Seasonal patterns (like hurricanes in summer or ice storms in winter) can influence the likelihood and severity of outages.

All three features were added using a FunctionTransformer in a Pipeline to ensure clean integration and reproducibility.

Features used for the model

Feature Name Type Reason for Inclusion
CAUSE.CATEGORY Nominal Represents primary reason for the outage which directly impacts duration
CLIMATE.REGION Nominal Captures geographic weather trends across regions
SEASON Nominal Derived from month; helps identify seasonal patterns
U.S._STATE Nominal Accounts for geographic and infrastructure differences between states
CLIMATE.CATEGORY Nominal Broader climate patterns (warm/normal/cold) that affect risk
YEAR Quantitative Captures long-term trends or changes in infrastructure and policy
OUTAGE.HOUR Quantitative Reflects time of day patterns
LOG.POPULATION Quantitative Log-transformed to reduce skew; population reflects urbanization which may affect duration

Model Comparison and Hyperparameter Tuning

I tested three regression models:

  • Linear Regression
  • Ridge Regression, with a grid search over alpha values to control regularization
  • Random Forest Regressor, with a grid search over n_estimators, max_depth, and min_samples_leaf

GridSearchCV with a 5-fold cross-validation was used to find the best hyperparameters for Ridge and Random Forest. All models were trained and evaluated on the same X_train and X_test split to ensure a fair comparison with the baseline model.

The Random Forest Regressor achieved the lowest Mean Absolute Error of 2469.26 minutes. This tree-based ensemble model can capture non-linear patterns and interactions that linear models may miss. I tuned three key hyperparameters:

  • n_estimators: number of trees in the forest ([50, 100])
  • max_depth: maximum depth of each tree ([10, 15])
  • min_samples_leaf: minimum samples in each leaf ([1, 5])

Using GridSearchCV with 5-fold cross-validation, the optimal configuration was:

  • n_estimators = 100
  • max_depth = 15-
  • min_samples_leaf = 5

Performance Comparison

  • Baseline MAE: 2943.65 minutes
  • Random Forest MAE: 2469.26 minutes

The Random Forest model outperformed the baseline linear regression model with an approximate 16.12% reduction in prediction error(nearly 7.9 hours of improvement on average). This improvement is meaningful because it demonstrates that leveraging more informative and thoughtfully engineered features—such as seasonality, population size, and time-of-day—can better capture the complexity of power outage durations. The model’s lower MAE indicates more reliable and actionable predictions, which are critical for utilities seeking to optimize response strategies and improve service reliability.