# Multivariate Bayesian Regression Model for Predicting Disposed Ash Composition at U.S. Coal Fired Power Stations

## Authors
Jin, Zehao; Huang, Junkai; Hower, James C.; Hsu-Kim, Heileen

## Files in this dataset
- readme.txt
- readme.md
- CoalSupplyDataset.csv
- bayesian.pkl

## Reference Publication
Jin, Z.; Huang, J.; Hower, J.C.; Hsu-Kim, H. (2025). Predictive Assessment of the Chemical Composition of Coal Ash in Reserve at U.S. Disposal Sites. Environmental Science & Technology.

## Description
This dataset contains the code and data files needed to implement the Multivariate Bayesian Regression model described in Jin et al. (2025). The model predicts the historical chemical composition of disposed coal ash at U.S. coal-fired power plants as a function of annualized coal purchase data.

The integrated coal supply data file (CoalSupplyDataset.csv) is a compilation of monthly fuel purchase records for the period 1973-2022 at major U.S. power stations. These records were obtained from the U.S. Energy Information Administration (EIA). For each coal purchase record, the CSV file also contains the coal region of the mine as defined by the U.S. Geological Survey. Data entry errors and data gaps in the EIA records were corrected as described in Jin et al. (2025), and the CSV file represents the integrated coal supply data after those corrections.
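Because the column names of the CSV file are not listed in this readme, a reasonable first step is to load the file and inspect its structure. The following is a minimal sketch using pandas; the helper name `describe_supply_file` is hypothetical and is not part of the dataset.

```python
import pandas as pd


def describe_supply_file(path="CoalSupplyDataset.csv", n_preview=5):
    """Load the coal supply CSV and report its size, column names, and first rows."""
    # low_memory=False reads the file in one pass so column dtypes are inferred consistently.
    df = pd.read_csv(path, low_memory=False)
    return {
        "n_rows": len(df),
        "columns": list(df.columns),
        "preview": df.head(n_preview),
    }
```

Calling `describe_supply_file()` from the folder that contains the CSV returns the row count, the column names, and a short preview table.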

The model structure and fitted parameters are stored in pickle file format (bayesian.pkl). The model was developed with the coal supply data and coal ash composition data, which were divided into training and testing subsets using a Stratified Shuffle Split. The model was built using Python and the PyMC library.

## Setup and Usage

### 1. Python Environment Requirements

To ensure compatibility, the following Python environment is recommended:

- Python 3.10.x
- PyMC 5.10.3
- cloudpickle 3.0.0
- pandas 1.5.3
- scipy 1.12.0
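
As a quick check, the installed package versions can be compared with the recommended ones. This is a sketch using only the Python standard library; the helper name `check_environment` is hypothetical, and the package names are assumed to match their PyPI distribution names.

```python
from importlib import metadata

# Versions recommended in the list above.
RECOMMENDED = {
    "pymc": "5.10.3",
    "cloudpickle": "3.0.0",
    "pandas": "1.5.3",
    "scipy": "1.12.0",
}


def check_environment(recommended=RECOMMENDED):
    """Return {package: (installed_version_or_None, recommended_version)}."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


for name, (installed, wanted) in check_environment().items():
    status = "OK" if installed == wanted else "differs"
    print(f"{name}: installed={installed}, recommended={wanted} ({status})")
```

A version mismatch does not necessarily prevent the model from loading, but pickled objects are often sensitive to library versions, so matching the list is the safer choice.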

## Data Preparation

### 2. Input Data Structure

The model requires two primary datasets:

- **Coal Supply Data (X):** A 2D matrix of shape `(n_samples, n_regions)` representing the mass proportions of coal ash from various regions. The regions should be ordered as follows: 
  - `['CENTRAL APPALACHIAN', 'EASTERN', 'NORTHERN APPALACHIAN', 'POWDER RIVER', 'GREEN RIVER', 'UINTA']`.

- **Element Concentration Data (y):** A 2D matrix of shape `(n_samples, n_elements)` indicating the percentage concentration of elements in their oxidized forms within the coal ash. The elements should be ordered as:
  - `['SiO2', 'Al2O3', 'Fe2O3', 'CaO', 'MgO', 'Na2O', 'K2O', 'TiO2']`.
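
One way to assemble `X` in the required column order is sketched below. It assumes the purchase records have already been reduced to a long-format table with one row per sample, region, and mass; the column names `plant`, `year`, `region`, and `mass` and the helper `build_X` are hypothetical. The sketch also assumes that only the six listed regions contribute to each row and renormalizes over them, which may differ from the procedure in Jin et al. (2025).

```python
import numpy as np
import pandas as pd

# Region order required by the model (from the list above).
REGIONS = ['CENTRAL APPALACHIAN', 'EASTERN', 'NORTHERN APPALACHIAN',
           'POWDER RIVER', 'GREEN RIVER', 'UINTA']


def build_X(records, sample_cols=("plant", "year"), region_col="region", mass_col="mass"):
    """Pivot long-format records into an (n_samples, n_regions) table of mass proportions."""
    totals = records.pivot_table(index=list(sample_cols), columns=region_col,
                                 values=mass_col, aggfunc="sum", fill_value=0.0)
    # Enforce the documented region order; regions absent from the records become 0,
    # and regions outside the list are dropped (an assumption of this sketch).
    totals = totals.reindex(columns=REGIONS, fill_value=0.0)
    # Drop samples with no mass from the listed regions to avoid division by zero.
    totals = totals[totals.sum(axis=1) > 0]
    # Normalize each row so the proportions sum to 1.
    return totals.div(totals.sum(axis=1), axis=0)


# Hypothetical example records, for illustration only.
records = pd.DataFrame({
    "plant": ["A", "A", "A", "B"],
    "year": [2000, 2000, 2000, 2001],
    "region": ["POWDER RIVER", "UINTA", "POWDER RIVER", "EASTERN"],
    "mass": [10.0, 20.0, 30.0, 5.0],
})
X_df = build_X(records)
X = X_df.to_numpy()  # array of shape (n_samples, n_regions)
```

Keeping the result as a DataFrame until the last step preserves the sample labels, which makes it easier to match predictions back to plants and years.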

## Model Usage

### 3.1 Model Initialization

To utilize the pre-trained model, follow these steps to load the model and trace data:

```
import pickle

import cloudpickle  # not called directly; imported so objects serialized with cloudpickle can be restored
import pymc as pm

# Load the pre-trained model and trace
pickle_file = 'bayesian.pkl'
with open(pickle_file, 'rb') as f:
    model_dict = pickle.load(f)

model = model_dict['model']
trace = model_dict['trace']
```

### 3.2 Making Predictions

With the model and trace loaded, you can generate predictions for new data:

```
# X and y are arrays laid out as described in "Input Data Structure"
with model:
    pm.set_data({'X': X, 'y': y})
    ppc = pm.sample_posterior_predictive(trace)

# Extract the prediction mean and standard deviation for each coal ash sample
mean = ppc.posterior_predictive['y_obs'].mean(dim=['chain', 'draw']).astype(float).to_numpy()
std = ppc.posterior_predictive['y_obs'].std(dim=['chain', 'draw']).astype(float).to_numpy()
```

The outputs `mean` and `std` are arrays with the same shape as `y`. They hold the posterior predictive mean and standard deviation of each oxide concentration for each coal ash sample, which can then be used to estimate the composition of ash held at disposal sites.
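
Beyond the mean and standard deviation, the posterior predictive draws can also be summarized as credible intervals. The sketch below works on a plain NumPy array and assumes the draws have shape `(chain, draw, n_samples, n_elements)`, which is consistent with averaging over `chain` and `draw` above. The helper name `summarize_draws` and the default 95% interval are illustrative choices, not part of the published model.

```python
import numpy as np


def summarize_draws(draws, lower=2.5, upper=97.5):
    """Summarize draws of shape (chain, draw, n_samples, n_elements) per sample and element."""
    draws = np.asarray(draws, dtype=float)
    # Merge the chain and draw axes into a single axis of posterior predictive samples.
    flat = draws.reshape(-1, *draws.shape[2:])
    return {
        "mean": flat.mean(axis=0),
        "std": flat.std(axis=0),
        "lower": np.percentile(flat, lower, axis=0),  # lower bound of the interval
        "upper": np.percentile(flat, upper, axis=0),  # upper bound of the interval
    }


# Synthetic draws, for illustration only: 2 chains, 500 draws, 3 samples, 8 oxides.
rng = np.random.default_rng(0)
demo_draws = rng.normal(loc=50.0, scale=5.0, size=(2, 500, 3, 8))
summary = summarize_draws(demo_draws)
```

With the real model, the input would be `ppc.posterior_predictive['y_obs'].to_numpy()`.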