# Multivariate Bayesian Regression Model for Predicting Disposed Ash Composition at U.S. Coal Fired Power Stations

## Authors

Jin, Zehao; Huang, Junkai; Hower, James C.; Hsu-Kim, Heileen

## Files in this dataset

- readme.txt
- readme.md
- CoalSupplyDataset.csv
- bayesian.pkl

## Reference Publication

Jin, Z.; Huang, J.; Hower, J.C.; Hsu-Kim, H. (2025). Predictive Assessment of the Chemical Composition of Coal Ash in Reserve at U.S. Disposal Sites. Environmental Science & Technology.

## Description

This dataset contains the code and data files needed to implement the Multivariate Bayesian Regression model described in Jin et al. (2025). The model predicts the historical chemical composition of disposed coal ash at U.S. coal fired power plants as a function of annualized coal purchase data.

The integrated coal supply data file (CoalSupplyDataset.csv) is a compilation of monthly fuel purchase records for the period 1973-2022 at major U.S. power stations. These records were obtained from the U.S. Energy Information Administration (EIA). For each coal purchase record, the CSV file also gives the coal region of the mine as defined by the U.S. Geological Survey. Data entry errors and data gaps in the EIA records were corrected as described in Jin et al. (2025); the CSV file contains the integrated coal supply data after these corrections.

The model structure and fitted parameters are stored in pickle format (bayesian.pkl). The model was developed with the coal supply data and coal ash composition data, which were divided into training and testing subsets using a Stratified Shuffle Split. The model was built using Python and the PyMC library.

## Setup and Usage

### 1. Python Environment Requirements

To ensure compatibility, the following Python environment is recommended:

- Python 3.10.x
- PyMC 5.10.3
- cloudpickle 3.0.0
- pandas 1.5.3
- scipy 1.12.0

## Data Preparation

### 2. Input Data Structure

The model requires two primary datasets:

- **Coal Supply Data (X):** A 2D matrix of shape `(n_samples, n_regions)` representing the mass proportions of coal ash from various regions. The regions should be ordered as follows:
  - `['CENTRAL APPALACHIAN', 'EASTERN', 'NORTHERN APPALACHIAN', 'POWDER RIVER', 'GREEN RIVER', 'UINTA']`.
- **Element Concentration Data (y):** A 2D matrix of shape `(n_samples, n_elements)` giving the percentage concentration of each element, expressed as its oxide, in the coal ash. The elements should be ordered as:
  - `['SiO2', 'Al2O3', 'Fe2O3', 'CaO', 'MgO', 'Na2O', 'K2O', 'TiO2']`.

## Model Usage

### 3.1 Model Initialization

To use the pre-trained model, load the model and trace from the pickle file:

```
import pickle
import cloudpickle  # must be installed so the pickled model can be restored
import pymc as pm

# Load the pre-trained model and trace
pickle_file = 'bayesian.pkl'
with open(pickle_file, 'rb') as f:
    model_dict = pickle.load(f)

model = model_dict['model']
trace = model_dict['trace']
```

### 3.2 Making Predictions

With the model and trace loaded, you can make predictions on new data:

```
# X and y are your own arrays, shaped and ordered as described in
# "Input Data Structure"
with model:
    pm.set_data({'X': X, 'y': y})
    ppc = pm.sample_posterior_predictive(trace)

# Extract the prediction mean and standard deviation for each coal ash sample
y_pred = ppc.posterior_predictive['y_obs']
mean = y_pred.mean(dim=['chain', 'draw']).astype(float).to_numpy()
std = y_pred.std(dim=['chain', 'draw']).astype(float).to_numpy()
```

The output gives the mean and standard deviation of the predicted element concentrations for each coal ash sample, which can be used to assess the composition of disposed coal ash at disposal sites.
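
As an illustration of the input format described under "Input Data Structure", the following is a minimal sketch of one way to assemble the `X` matrix from purchase records with pandas. The column names (`plant_id`, `year`, `region`, `mass`), the helper `build_region_proportions`, and the toy records are hypothetical and are not taken from CoalSupplyDataset.csv; check the actual CSV headers before adapting it. Dropping records from regions outside the six listed and renormalizing is also an assumption of this sketch, not necessarily the procedure used in Jin et al. (2025).

```python
import numpy as np
import pandas as pd

# Region order required by the model (see "Input Data Structure").
REGIONS = ['CENTRAL APPALACHIAN', 'EASTERN', 'NORTHERN APPALACHIAN',
           'POWDER RIVER', 'GREEN RIVER', 'UINTA']


def build_region_proportions(records, group_cols, region_col, mass_col):
    """Return per-group mass proportions with columns ordered as REGIONS.

    Hypothetical helper: all column names are supplied by the caller.
    """
    # Assumption: records from regions outside REGIONS are ignored.
    known = records[records[region_col].isin(REGIONS)]
    # Sum mass per group and region; absent combinations become 0.
    totals = known.pivot_table(index=group_cols, columns=region_col,
                               values=mass_col, aggfunc='sum', fill_value=0.0)
    # Enforce the required column order, adding all-zero columns if needed.
    totals = totals.reindex(columns=REGIONS, fill_value=0.0)
    # Drop groups with no mass so the division below is well defined.
    totals = totals[totals.sum(axis=1) > 0]
    # Normalize each row so the proportions sum to 1.
    return totals.div(totals.sum(axis=1), axis=0)


# Toy records with hypothetical column names and values.
toy = pd.DataFrame({
    'plant_id': [1, 1, 1, 2],
    'year': [2000, 2000, 2000, 2000],
    'region': ['POWDER RIVER', 'UINTA', 'POWDER RIVER', 'EASTERN'],
    'mass': [60.0, 25.0, 15.0, 10.0],
})

props = build_region_proportions(toy, ['plant_id', 'year'], 'region', 'mass')
X = props.to_numpy(dtype=float)  # shape (n_samples, n_regions)
```

Each row of `X` then corresponds to one plant-year, and the row index of `props` records which one, so predictions can be matched back to their source records.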