
Predicting Commercial Building Energy Costs: What the Data Shows

This was a university research project conducted at the University of Arizona in November 2025. It ended up becoming the foundation for building Forsa.

Predicting energy expenditures in commercial buildings

Introduction/Background:

Predicting total energy consumption is often one of the most challenging tasks in commercial real estate. Buyers of large commercial properties often lack information on the energy costs the building will incur.

One of the biggest issues for commercial real estate stakeholders is uncertainty around energy and utility costs, which is even more pronounced in large-scale buildings. City officials are another important group: they need to know how much energy a proposed development will demand and what impact it will have on the grid.

This econometric model aims to give a more accurate estimate of energy costs for a prospective commercial building. It can be used to predict a building's energy consumption and, by extension, its operational energy costs.

Proposed Models & Hypotheses:

Empirical model

ln(Energy Expenditure) = β0 + β1 ln(Sqft.) + β2 Age + β3 Region + β4 City + β5 Use + β6 Cert + β7 Renov + β8 (Age × Cert1) + ε

Variable definitions

  • Sqft. is a numeric variable that measures the total square footage of the building. It is logged because raw square footage is not an apples-to-apples comparison across buildings of very different sizes; in logs, β1 reads as the percentage change in expenditure for a 1% change in size.
  • Age is a numeric variable that measures the age of the building.
  • Region is a categorical variable that captures what region of the United States the building is in.
  • City is a dummy variable which captures whether or not the building is located in a city.
  • Use is a categorical variable which captures the primary function of the building—retail, office, industrial, multifamily, etc. This variable was originally named PBA (Principal Building Activity).
  • Cert is a variable capturing green building or energy certification status for the property (as coded in the dataset).
  • Renov is a binary variable that captures whether a building has completed a renovation since 2000.
  • Age × Cert1 is an interaction variable that attempts to capture whether green energy certifications offset the effect of building age.
  • ε captures the unexplained noise and random variation within the model and is assumed to have zero conditional mean.
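To make the categorical terms concrete: R expands a factor such as Use into one 0/1 column per level except a baseline level, which is why regression output for a model like this lists individual use categories but no row for the baseline. A minimal sketch on a made-up five-row data frame (the `toy` name and all values are illustrative assumptions, not CBECS data):

```r
# Toy data: column names mirror the model, values are invented.
toy <- data.frame(
  sqft = c(5000, 20000, 120000, 8000, 60000),
  age  = c(12, 45, 30, 60, 5),
  use  = factor(c(1, 2, 4, 2, 1)),   # principal building activity code
  cert = c(0, 1, 0, 0, 1)            # 1 = holds a certification
)

# model.matrix() builds the design matrix that lm() would use.
# The factor 'use' becomes dummy columns use2 and use4; level 1 is the baseline.
X <- model.matrix(~ log(sqft) + age + use + cert + age:cert, data = toy)
print(colnames(X))
```

Each use coefficient is then read relative to the baseline category rather than as an absolute level.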

The hypothesis is that square footage, age, and use will be the three most important factors in predicting total energy expenditures for a commercial building, with each having a positive effect that drives total energy expenditures up.

Data:

The dataset used in this project is the 2018 Commercial Buildings Energy Consumption Survey (CBECS). It was compiled by the U.S. Energy Information Administration (EIA), an agency within the U.S. Department of Energy. The survey estimates that there were 5.9 million commercial buildings, which together accounted for about $141 billion of energy expenditure. CBECS is a random sample survey in which every commercial building has a known chance of being selected. Information is collected in person and through a web survey.

Two key procedures were used in data cleaning and interpretation. The first was renaming variables to be more interpretable. This was done by reading the codebook provided with CBECS, a spreadsheet that lists each variable key with a description of its meaning, and then renaming the variables in R (e.g. PBA, Principal Building Activity, was renamed to Use). The second was removing records with blank or 0 values, as they could have had a big impact on the model.
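The two cleaning steps can be sketched in base R as follows. PBA and MFEXP are codes that appear in this project; SQFT is assumed here to be the code for square footage, the lowercase names are illustrative choices, and every row is invented.

```r
# Hypothetical raw extract with CBECS-style column codes (rows are invented).
raw <- data.frame(
  PBA   = c(2, 14, 2, 5, 16, 6),
  SQFT  = c(12000, 0, 45000, NA, 8000, 150000),
  MFEXP = c(31000, 5200, NA, 9100, 0, 410000)
)

# Step 1: rename coded columns to readable names.
names(raw)[names(raw) == "PBA"]   <- "use"
names(raw)[names(raw) == "SQFT"]  <- "sqft"
names(raw)[names(raw) == "MFEXP"] <- "energy_exp"

# Step 2: keep only rows where size and expenditure are present and positive,
# since log(0) is undefined and blanks cannot enter the regression.
keep <- !is.na(raw$sqft) & !is.na(raw$energy_exp) &
  raw$sqft > 0 & raw$energy_exp > 0
clean <- raw[keep, ]

# Logged versions of the two continuous variables.
clean$ln_sqft       <- log(clean$sqft)
clean$ln_energy_exp <- log(clean$energy_exp)
print(clean)
```

Filtering on blanks and zeros can remove a large share of a sample, so it is worth comparing the row counts before and after.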

Another step was generating a histogram to get a visual idea of the data:

Graph 1: Histogram of commercial building energy expenditures. The vertical axis shows the number of buildings and the horizontal axis shows annual energy expenditure in dollars, from 0 to 10 million. The distribution is strongly right-skewed, with the tallest bar at the lowest spending bin and a long tail toward higher expenditures.
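A right-skewed shape like this is one common reason to model the log of expenditure rather than the raw dollars. The sketch below uses simulated log-normal draws as a stand-in (an assumption for illustration, not the survey data) and compares the raw and logged distributions.

```r
set.seed(1)
# Simulated stand-in for expenditures: log-normal draws, not survey data.
exp_sim <- rlnorm(5000, meanlog = 11, sdlog = 1.5)

# plot = FALSE returns the bin counts without drawing; drop it to see the plots.
raw_bins <- hist(exp_sim, breaks = 30, plot = FALSE)
log_bins <- hist(log(exp_sim), breaks = 30, plot = FALSE)

# In a right-skewed sample the mean sits well above the median;
# after logging, the two are close together.
print(c(mean = mean(exp_sim), median = median(exp_sim)))
print(c(mean = mean(log(exp_sim)), median = median(log(exp_sim))))

# Index of the tallest bin on the raw scale.
print(which.max(raw_bins$counts))
```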

Estimation Method:

OLS (Ordinary Least Squares), which minimizes the sum of squared residuals, was used to estimate the model. Because the dependent variable is logged, coefficients can be read as approximate percentage changes, and the coefficient on ln(Sqft.) as an elasticity. Additionally, hold-out validation was used: the data were split into a training set and a test set.
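As a minimal sketch of this workflow, the example below fits a log-linear OLS model on simulated buildings and scores it on a held-out 20%. The data, the 80/20 split, and the "true" coefficients are all assumptions chosen for the demo; the project's actual split is not described in detail.

```r
set.seed(42)
n <- 400
# Simulated buildings; the elasticity of 0.95 is an assumption for the demo,
# not an estimate from CBECS.
sim <- data.frame(
  ln_sqft = log(runif(n, 2000, 500000)),
  age     = sample(1:80, n, replace = TRUE),
  city    = rbinom(n, 1, 0.5)
)
sim$ln_energy_exp <- -0.5 + 0.95 * sim$ln_sqft - 0.001 * sim$age +
  0.15 * sim$city + rnorm(n, sd = 0.5)

# Hold-out split: 80% of rows to fit, 20% held back for testing.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# OLS fit on the training rows only.
fit <- lm(ln_energy_exp ~ ln_sqft + age + city, data = train)

# Out-of-sample error on the log scale.
pred <- predict(fit, newdata = test)
rmse_test <- sqrt(mean((test$ln_energy_exp - pred)^2))

print(round(coef(fit), 3))
print(rmse_test)
```

A test-set error close to the training residual standard error suggests the model generalizes; a much larger one suggests overfitting.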

Hypothesis Test

Two one-sample t-tests were run on energy expenditure (MFEXP), with 95% and 99% confidence intervals.

The 95% test, with 6,356 degrees of freedom, yielded a t-statistic of 34.768, a p-value below 2.2e-16, and an interval of 292,065 to 326,968 around a sample mean of 309,517. Because the p-value is below 0.05, we reject the null hypothesis that the true mean is 0, and we estimate with 95% confidence that mean energy expenditure lies between $292,065 and $326,968.

Figure 1: 95% hypothesis test

One Sample t-test

data:  EnData$MFEXP
t = 34.768, df = 6356, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 292065.2 326968.8
sample estimates:
mean of x
 309517

The 99% test, with 6,356 degrees of freedom, yielded a t-statistic of 34.765 and a p-value below 2.2e-16 against a null mean of 20. The interval runs from 286,578 to 332,455 around the same sample mean of 309,517. We again reject the null hypothesis and estimate with 99% confidence that the true mean lies within this interval.

Figure 2: 99% hypothesis test

One Sample t-test

data:  EnData$MFEXP
t = 34.765, df = 6356, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 20
99 percent confidence interval:
 286578.9 332455.1
sample estimates:
mean of x
 309517
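Both intervals can be reproduced from the printed summary numbers alone, which is a useful sanity check. The sketch below backs the standard error out of the first test's t-statistic (valid because that test's null mean is 0) and rebuilds each interval; small differences from the printed values come from rounding in the output.

```r
# Values copied from the t-test output above.
x_bar  <- 309517     # sample mean of MFEXP
t_stat <- 34.768     # t statistic against a null mean of 0
dof    <- 6356       # degrees of freedom

# Since t = (x_bar - 0) / se, the standard error can be backed out.
se <- x_bar / t_stat

ci <- function(level) {
  crit <- qt(1 - (1 - level) / 2, dof)   # two-sided critical value
  c(lower = x_bar - crit * se, upper = x_bar + crit * se)
}

print(round(ci(0.95)))
print(round(ci(0.99)))
```

The 99% interval is wider than the 95% interval because a higher confidence level requires a larger critical value.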

Estimation Results:

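The regression output has several notable findings. A 1% increase in square footage is associated with a 0.96% increase in energy expenditure, which is almost proportional. Contrary to the hypothesis, older buildings spend marginally less on energy: each additional year of age is associated with about 0.12% lower expenditure, and the effect is significant only at the 10% level (p = 0.067). Buildings in a city spend about 17% more on energy by the usual log approximation (coefficient 0.174, or roughly 19% once exponentiated). Consistent with the hypothesis, expenditure varies drastically by use. RENOV shows NA because R dropped it for singularity, so the model does not estimate a renovation effect.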
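The percentage readings follow from the log specification. A short sketch of the arithmetic, using coefficients copied from Figure 3 (the variable names are illustrative):

```r
# Coefficients copied from the regression output in Figure 3.
b_ln_sqft <- 0.9564285
b_city    <- 0.1737853
b_age     <- -0.0011626

# Log-log term: effect of a 1% rise in square footage, and of doubling it.
pct_sqft_1   <- 100 * (1.01^b_ln_sqft - 1)
pct_sqft_dbl <- 100 * (2^b_ln_sqft - 1)

# Log-level terms: the exact percentage effect is 100 * (exp(b) - 1).
pct_city <- 100 * (exp(b_city) - 1)
pct_age  <- 100 * (exp(b_age) - 1)    # per additional year of age

print(round(c(sqft_1pct = pct_sqft_1, sqft_double = pct_sqft_dbl,
              city = pct_city, age_per_year = pct_age), 2))
```

For small coefficients such as age, 100 times the coefficient and the exact formula nearly coincide; for larger ones such as city, the exact figure is noticeably higher.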

Figure 3: Regression output

Call:

lm(formula = ln_energy_exp ~ ln_sqft + age + region_south + city + use_category + cert1 + cert2 + RENOV + age:cert1, data = EnData_clean)

Residuals:

    Min      1Q  Median      3Q      Max 
-4.9189 -0.3252  0.0220  0.3493  2.2950 

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.6319143  0.1241971  -5.088 3.81e-07 ***
ln_sqft         0.9564285  0.0077999 122.621  < 2e-16 ***
age            -0.0011626  0.0006338  -1.834   0.0667 .
region_south   -0.1156758  0.0203703  -5.679 1.47e-08 ***
city            0.1737853  0.0195935   8.870  < 2e-16 ***
use_category2   1.3389687  0.0899239  14.890  < 2e-16 ***
use_category4   2.1353494  0.1222149  17.472  < 2e-16 ***
use_category5   0.5102079  0.0948431   5.379 7.96e-08 ***
use_category6   2.5653323  0.1277892  20.075  < 2e-16 ***
use_category7   1.3121556  0.1184008  11.082  < 2e-16 ***
use_category8   1.5677211  0.1028560  15.242  < 2e-16 ***
use_category11  1.9459635  0.2340417   8.315  < 2e-16 ***
use_category12  0.6261621  0.0998493   6.271 4.02e-10 ***
use_category13  1.3225944  0.0946188  13.978  < 2e-16 ***
use_category14  1.0172300  0.0912476  11.148  < 2e-16 ***
use_category15  2.5727412  0.1057482  24.329  < 2e-16 ***
use_category16  2.0693986  0.0966339  21.415  < 2e-16 ***
use_category17  1.6607350  0.1136208  14.616  < 2e-16 ***
use_category18  1.3934357  0.0942091  14.791  < 2e-16 ***
use_category23  1.7126281  0.1051474  16.288  < 2e-16 ***
use_category24  0.8140500  0.1389420   5.859 5.09e-09 ***
use_category25  1.3827386  0.0966065  14.313  < 2e-16 ***
use_category26  1.0816734  0.1018339  10.622  < 2e-16 ***
use_category91  2.1320920  0.1409322  15.128  < 2e-16 ***
cert1           0.3061775  0.0449298   6.815 1.11e-11 ***
cert2           0.0332065  0.0232333   1.429   0.1530
RENOV                  NA         NA      NA       NA
age:cert1      -0.0012474  0.0008528  -1.463   0.1436

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
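To show how these coefficients turn into a dollar estimate, the sketch below scores one hypothetical building. The building's characteristics are invented for illustration, and only the terms the example needs are copied from the output above.

```r
# Coefficients copied from Figure 3 (only the terms this example needs).
b <- c(intercept = -0.6319143, ln_sqft = 0.9564285, age = -0.0011626,
       region_south = -0.1156758, city = 0.1737853, use_category2 = 1.3389687)

# Hypothetical building: 50,000 sq ft, 30 years old, outside the South,
# in a city, principal activity coded 2, no certification.
x <- c(intercept = 1, ln_sqft = log(50000), age = 30,
       region_south = 0, city = 1, use_category2 = 1)

# Linear predictor on the log scale.
ln_pred <- sum(b * x[names(b)])

# exp() of a log-scale prediction approximates a typical (median) level;
# it tends to understate the mean unless a retransformation correction is added.
pred_dollars <- exp(ln_pred)

print(round(ln_pred, 3))
print(round(pred_dollars))
```

Because the model is fit in logs, the residual standard error of 0.5731 implies a wide band around any single-building estimate, so a point prediction like this is best read as a rough central figure.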

Model diagnostics

The adjusted R-squared is 0.908, so the model explains about 91% of the variation in log energy expenditure, which is impressive. The F-statistic of 1,344 (p < 2.2e-16) means the model as a whole is highly significant. The residual standard error is 0.5731 on 3,515 degrees of freedom. A second, reduced model was also estimated; it dropped four terms, including city, cert1, and age.

Figure 4: Model comparison output metrics

                      Reduced model                  Original model
Observations          3,542                          3,542
R2                    0.903                          0.909
Adjusted R2           0.903                          0.908
Residual std. error   0.589 (df = 3519)              0.573 (df = 3515)
F statistic           1,495.672*** (df = 22; 3519)   1,343.693*** (df = 26; 3515)

* p < 0.1; ** p < 0.05; *** p < 0.01

The first column is the reduced model and the second is the original. The original model had a higher adjusted R2 and a lower residual standard error, and was therefore selected.
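The adjusted R2 and F statistic can be approximately recovered from R2, the number of observations, and the number of slope terms, using the standard formulas. Because the R2 values in Figure 4 are rounded to three decimals, the results match the reported figures only approximately.

```r
# Standard formulas; k is the number of slope terms (numerator df of the F test).
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)
f_stat <- function(r2, n, k) (r2 / k) / ((1 - r2) / (n - k - 1))

n <- 3542   # observations in both models (Figure 4)

adj <- c(reduced  = adj_r2(0.903, n, 22),
         original = adj_r2(0.909, n, 26))
f   <- c(reduced  = f_stat(0.903, n, 22),
         original = f_stat(0.909, n, 26))

print(round(adj, 3))
print(round(f))
```

The adjustment matters little here because the sample is large relative to the number of terms, so the penalty for the four extra terms is small.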

Hold-out validation and Holt exponential smoothing

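Hold-out validation was performed using Holt exponential smoothing. The optimized model (forecast_cmp) performed better than the user-specified model (forecast_user), with lower ME, RMSE, and MAE in the test set. The user model's errors rise sharply from the training set to the test set (RMSE 2.09 versus 6.13), which indicates overfitting.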
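For readers unfamiliar with the technique, here is a hand-rolled sketch of Holt's linear method with a hold-out split. It is not the code used in this project: the series is simulated, the "user" parameters are arbitrary, and the "optimized" parameters come from a simple grid search on training error rather than from a forecasting package's optimizer.

```r
# Holt's linear method: level and trend are each updated by exponential smoothing.
holt_fit <- function(y, alpha, beta) {
  level <- y[1]
  trend <- y[2] - y[1]
  fitted <- numeric(length(y))
  fitted[1] <- y[1]
  for (t in 2:length(y)) {
    fitted[t] <- level + trend                      # one-step-ahead forecast
    new_level <- alpha * y[t] + (1 - alpha) * (level + trend)
    trend     <- beta * (new_level - level) + (1 - beta) * trend
    level     <- new_level
  }
  list(fitted = fitted, level = level, trend = trend)
}

# h-step-ahead forecasts extend the last level along the last trend.
holt_forecast <- function(fit, h) fit$level + seq_len(h) * fit$trend

set.seed(7)
y <- 10 + 0.05 * (1:120) + rnorm(120)    # simulated series, not the study data
train <- y[1:100]                        # used to fit
valid <- y[101:120]                      # held out for testing

# Training sum of squared one-step errors for a given (alpha, beta).
sse <- function(par) sum((train - holt_fit(train, par[[1]], par[[2]])$fitted)^2)

user_par <- c(alpha = 0.8, beta = 0.5)   # hand-picked ("user") parameters
grid     <- expand.grid(alpha = seq(0.05, 0.95, by = 0.05),
                        beta  = seq(0.05, 0.95, by = 0.05))
grid_sse <- apply(grid, 1, sse)
opt_par  <- unlist(grid[which.min(grid_sse), ])   # "optimized" parameters

# Hold-out RMSE for a given parameter pair.
rmse <- function(par) {
  fc <- holt_forecast(holt_fit(train, par[[1]], par[[2]]), length(valid))
  sqrt(mean((valid - fc)^2))
}
print(c(user = rmse(user_par), optimized = rmse(opt_par)))
```

The comparison that matters is on the held-out points: parameters that track the training data closely can still forecast poorly.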

Figure 5: Cross-validation error metrics

accuracy(forecast_user, actual_valid)
                       ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
Training set -0.002564464 2.087811 1.688920 -3.03182 15.69394 0.7806671 -0.09365676
Test set      5.814841792 6.126311 5.814842 48.53442 48.53442 2.6877863          NA

accuracy(forecast_cmp, actual_valid)
                       ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
Training set  0.001501282 1.875656 1.523166 -3.00339 14.26933 0.7040510 -0.02505734
Test set      0.008431686 1.920387 1.586933 -3.06598 14.82865 0.7335257          NA
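The main columns follow conventional definitions, sketched below with the error taken as actual minus forecast (the convention assumed here). The numbers are invented so that every forecast is too low, which reproduces the pattern in the user model's test row, where ME equals MAE and MPE equals MAPE.

```r
# Error measures as conventionally defined for forecast accuracy.
err_metrics <- function(actual, forecast) {
  e <- actual - forecast
  c(ME   = mean(e),                        # mean error (bias)
    RMSE = sqrt(mean(e^2)),                # root mean squared error
    MAE  = mean(abs(e)),                   # mean absolute error
    MPE  = 100 * mean(e / actual),         # mean percentage error
    MAPE = 100 * mean(abs(e / actual)))    # mean absolute percentage error
}

# Invented numbers where every forecast undershoots the actual value.
actual   <- c(12, 14, 11, 13)
forecast <- c(7, 8, 6, 7)
m <- err_metrics(actual, forecast)
print(round(m, 3))
```

When ME and MAE coincide, all errors share the same sign, so the forecasts are biased in one direction rather than merely noisy.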

Conclusion/Implications:

This model and research provide an accurate basis for assessing building energy consumption. The framework can be used by a range of stakeholders who need to evaluate or predict energy use from observable building characteristics.

Still, there are three major limitations to keep in view. First, CBECS 2018 is a single snapshot; the nature of building energy performance has shifted since then, and many assets have become more efficient through operations, retrofits, and code cycles, so the coefficients are anchored in a past market regime. Second, several inputs are deliberately coarse: the renovation variable treats all renovations alike, the city indicator compresses heterogeneous urban contexts, and the certification coding does not capture every major green program a building might hold. Third, endogeneity and omitted-variable bias can distort inference—factors outside the survey (management quality, metering depth, unobserved capital plans) can influence both the regressors and the error term—so results should be paired with domain judgment and refreshed data when stakes are high.

References:

  1. BrainBox AI. “Mastering Building Energy Efficiency: EUI and Energy Consumption.” Brainboxai.com, Brainbox AI, 17 July 2024, https://brainboxai.com/en/articles/mastering-building-energy-efficiency-eui-and-energy-consumption.
  2. Bourdeau, Mathieu, et al. “Modeling and Forecasting Building Energy Consumption: A Review of Data-Driven Techniques.” Sustainable Cities and Society, vol. 48, July 2019, p. 101533, https://doi.org/10.1016/j.scs.2019.101533. Accessed 5 Feb. 2022.
  3. “Building Energy Use.” U.S. General Services Administration, 2023, https://www.gsa.gov/governmentwide-initiatives/federal-highperformance-buildings/highperformance-building-clearinghouse/energy/building-energy-use.
  4. EIA. “Energy Information Administration (EIA)—Commercial Buildings Energy Consumption Survey (CBECS).” Eia.gov, 2016, https://www.eia.gov/consumption/commercial/.
  5. “How Much Energy Is Consumed in U.S. Residential and Commercial Buildings?—FAQ—U.S. Energy Information Administration (EIA).” Eia.gov, 2016, https://www.eia.gov/tools/faqs/faq.php?id=86&t=1.