Introduction

Over the past decade or so, two technologies (i.e., horizontal drilling coupled with large-scale, multistage hydraulic fracturing) have made it possible to extract hydrocarbons trapped in self-sourced shale plays1,2. The development that knowns as shale revolution has moved to US world leadership on oil and gas production. A workflow for de-risking productivity of an unconventional shale play is in part based on our examination of the S1 X 100/TOC ratio, which represent potentially producible fluid oil and gas saturations. Thus, the S1 is one of the most important parameters of producible fluid oil and gas saturation of shales together with total organic carbon content (TOC). Although minimum 100 mg oil/g TOC threshold were offered to the producible fluid oil and gas saturations of shales3,4,5,6, recent studies have considered that the threshold ranges from more than 100 mg oil/g TOC at the onset of oil expulsion to less than 40 mg oil/g TOC at the end of expulsion depend primarily on maturity7,8,9. The S1 is referred to as the available “free”10 or, more correctly, “volatile” fluid hydrocarbons volumes7,9 retained within the source rocks. The S1 volumes can be obtained by Rock-Eval pyrolysis analysis. However, it is time consuming and not practical to obtain samples from all intervals of all wells in any shale play. Organic matter richness, type, maturity and, reservoir quality and adsorption capacity control S1 volumes of shales. Reservoir quality and adsorption capacity of shales are mainly controlled by age, depth and mineralogy together also with organic matter richness and maturity7,8,9,11,12,13,14,15,16,17,18. On the one hand, oil and gas occurrences which are important factors of S1 volumes are related with time–temperature index which is controlled by burial history of the shales. On the other hand, age and depth affect reservoir quality of shales because old and deeply buried shales have principally less porosity and permeability. Mineral contents of the shales cause to increase or decrease of S1 volumes depend on their reservoir qualities and adsorption capacities. S1 volumes can be effected by contamination and fracture zones14,15,17,18. It is important to note that S1 volumes is more vulnerable to loss of light oil due to evaporation, sample handling and, preparation before analysis. Although loss of S1 is often estimated to be 35%19, the correction of S1 for “evaporative loss” is an important step to restore the present-day values to original values. Oil API gravity17,20, which is partially controlled by maturity7,8,9, is a major control on evaporative loss from C15+ (lower boiling range point components). In this study, correction of S1 is not applied to the training and validation dataset, but it is applied to the predictions.

Although there are many machine learning studies21,22,23,24,25 to predict TOC values, only two machine learning studies23,26 has predicted S1 volumes, so far. In addition, the two predictions are low accuracy (R2 = 0.78 and 0.91) because they used small number of well logs and S1 datasets taken from Goldwyer Shale of the Canning Basin of Australia and the Shahejie Shale of the Bohai Bay Basin of Chine. While the study of Johnson et al.23 used Levenberg–Marquardt training algorithm with Matlab programing environment with the established relationships between the 6 well logs and laboratory measured 96 S1 data, the study of Wang et al.26 used both back propagation artificial neural network (BPANN) and convolutional neural network (CNN) model with the established relationships between the one well log and laboratory measured 125 S1 data. Although Wang et al.26 study has effective tools such as both BPANN and CNN, its prediction is low accuracy probably because any log data presents maturity which is very important to S1 volumes prediction7,9 and one well log and laboratory measured 125 S1 data is very low. Aim of this study is to determine a deep learning model for S1 volumes prediction based on data of world-class shales and to apply for the Bakken and Marcellus shales using Python 3 programing environment with Tensorflow and Keras open-source libraries.

Training and validation dataset

The dataset consists of 331 instances and 8 variables from the 19 wells of the well-known shales of the world (Fig. 1, Table 1, Datashare 1). The variables that contain S1, age, depth, TOC%, Requ%, quartz%, clay%, carbonate% are selected because organic matter richness, type and maturity control generated oil and gas volumes and, age, depth and mineralogy control reservoir quality and adsorption capacities of shales that impact oil and gas accumulations. Note that all shale samples of the dataset are selected from Type II organic matter that capable to produce oil and oil cracking secondary gas depending on maturity. Thus, type has not included to the variables and this study is related with only II organic matters model and predictions.

Figure 1
figure 1

Location map of the wells of the training and validation dataset (1–19) and the S1 volumes prediction dataset (A–I). 1–5 = Niobrara Shale, 6–11 = Eagle Ford Shale, 12 = Barnett Shale, 13 = Haynesville Shale, 14 = Woodford Shale, 15 = Vaca Muerta Shale, 16–19 = Dadaş Shale, A, B, C, D = Bakken Shale, E, F, GHI = Marcellus Shale. This figure was generated with modification of public EIA maps (www.eia.gov) using CoralDRAW software.

Table 1 Training and validation dataset summary.

The variables are world-class productive shales of the Niobrara (1, 2, 3, 4, 5 wells27), Eagle Ford (B, C, D, X wells28 and K1, K2 wells29), Barnett (Mesquite well14), Haynesville (SS-2 well30), Woodford (H2 well31), Vaca Muerta (1010 well32) and, Dadaş (Doğan 1, Soğuktepe 1, K. Migo 2 wells8 and Calıktepe 2 well33). It is important to note that the dataset variables and instances are standard E&P activities of the petroleum sector.

The TOC and S1 values of the dataset were obtained from Leco analyzer or Rock-Eval pyrolysis. Maturity calculated from Rock-Eval pyrolysis maximum temperature (Tmax) as equivalent vitrinite reflectance (Requ). Mineralogy of the dataset was measured with whole rock XRD.

DNN model, training and validation

The dataset has been uploaded to the Jupyter Notebook that supports Python 3 programing environment. The DL workflow for S1 volumes prediction is given in the Fig. 2. The dataset shape contains 8 variables and 331 instances from 19 wells of the well-known shales. It is important to note that pretreatment methods have been applied to the dataset. First, instances that TOC values lower than 1% has been deleted from dataset due to absence any fluid oil and gas volumes potential. Second, the dataset has been increased to 1324 instances with data augmentation that is giving opportunity to generating new data points from existing data due to overfitting or simple learning solutions (Fig. 1, Table 1, Datashare 1). The input data is normalized to eliminate bias toward variables having larger values34. The standard normalized method has been selected in this study because lower loss index and more accuracy was obtained by the method during training and validation runs. The method known as Z-score and an element (x) is calculated using an arithmetic mean (μ) and standard deviation (σ), according to the following Eq. (1):

$$ {\text{z}} = \left( {{\text{x}}{-}\upmu } \right)/\upsigma . $$
(1)
Figure 2
figure 2

A proposed workflow showing the steps followed in this study to predict the S1 volumes of shales.

A multilayer perceptron method is selected in this study. The method represents most basic deep neural network (DNN) which is composed of a series of fully connected layers. The method has the advantage of learning non-linear models, provides quick predictions after training and the same accuracy ratio with even with smaller samples.

A DNN model which contains 5 dense layers and, 1024, 512, 256, 128 and 128 neurons have been used for shale S1 volumes prediction because lower loss index and more accuracy was obtained by the dense layers and neurons during training and validation runs. The dataset has been spilled as 80% training and 20% validation. The training and validation instances use to find the best representation model of the dataset. This comparison is done by plotting calculated versus measured S1 volumes of the testing instances, and the quality of the model is measured by a best-fit linear line between the two S1 volumes. For a perfect fit of the data, a slope of one is required. The R2 is also provided so that how the differences of one variable can be explained by the other variable can be analyzed or the quality of the fit can be assessed.

The training objective is to find a Deep Neural Network (DNN) function with a minimum value of loss index. Mean Squared Error method was used to calculate loss index. Error minimization has been done by fine tuning of increasing the complexity of the neural network by adding more neurons and layers due to obtain low final training and validation losses as well as a high coefficient of determination (R2). The lines connecting the nodes are called activation functions, which are used to transform the activation level of a neuron into an output35. For a neuron j in this fully connected neural networks, its output yj is calculated by Eq. (2)

$$ {\text{yj}} = {\text{F}}\left( {\sum {\text{i}}\;{\text{wij}}\;{\text{xi}} + {\text{bj}}} \right) $$
(2)

where xi denotes the output of neuron i at the previous layer; wij and bj are the weight and bias, respectively; and F represents the activation function. Summarization of all neurons at a layer can thus be written as Eq. (3)

$$ {\text{y}} = {\text{F}}\left( {{\text{wx}} + {\text{b}}} \right). $$
(3)

During training, weights and biases were adjusted to minimize a loss function. The model was optimized by an Adam optimizer36 with a loss function of MSE. F was set as a rectified linear unit (ReLU) activation function.

Prediction model and its applications

Plots of correlations show that carbonate content, TOC, present depth and maturity cause increase in S1 volumes, whereas age, quartz content and clay content cause decrease in S1 volumes (Fig. 3). Best Mean Squared Error history was obtained using 1024, 512, 256, 128 and 128 neurons and 5 dense layers (Table 2, Fig. 4A). Linear correlation coefficient using the DNN parameters has predicted as high as R2 = 0.97 (Fig. 4B). The DL model has been applied to S1 volumes prediction of the Bakken and Marcellus shales.

Figure 3
figure 3

S1 correlation plots of the inputs and target of the training and validation dataset.

Table 2 The deep neural network (DNN) that includes 7 inputs, 1 target, and 1024, 512, 256, 128, 128 neurons with 5 layers.
Figure 4
figure 4

(A) The training and validation loss plot with 200 epochs, (B) Linear correlation coefficient plot of the real and predicted S1. R2 is as high as 0.97.

Prediction of fluid oil volumes of the Bakken shale

The Bakken Formation is an unconventional play which locates North America (Fig. 1). This formation is divided into three members. The Upper Bakken, the Middle Bakken, and the Lower Bakken. The Upper and Lower Bakken members mainly consists of dark-gray to brownish-black to black, fissile, slightly calcareous, organic-rich shale, which was deposited in an offshore marine environment. The DL model has been applied to the 4 wells37,38 (A, B, C, D wells in the Fig. 1). The Upper and Lower Bakken shales are represented by beginning of late oil windows with average of 0.92% Requ in the wells. TOC contents range from 6.58 to 16.69 wt.% with an average of 11.44 wt.%. Quartz, clay and carbonate contents are 45.75%, 42.09% and 1.10%, respectively. Predicted S1 volumes of the Bakken Shales vary from 1.3 to 7.6 mg oil/g TOC with an average of 3.64 mg oil/g TOC based on the DL model. In addition, average of evaporation loss corrected S1 volumes with 40 API gravity is 4.86 mg oil/g TOC. Thus, the average oil saturation index (OSI) of the Upper and Lower Bakken Shales has been calculated as 42.83 mg oil/g TOC (Table 3 and Datashare 1). The data show that the shales have lower reservoir quality and lower oil production rate because the OSI is very close to threshold accepting as 40 mg oil/g TOC7,8,9, or lower than accepting as 100 mg oil/g TOC6. The finding well-matches with former studies6,39 on the Bakken Shales.

Table 3 Prediction dataset of the Bakken and Marcellus shales and, predicted S1 volumes, evaporation loss corrected S1 volumes and OSI based on the DL model.

Prediction of fluid oil and gas volumes of the Marcellus shale

The Middle Devonian Marcellus Formation is a distal marine shale within a foreland succession deposited in the Appalachian Basin (Fig. 1). The Marcellus Shale has attracted great attention as an important gas-producing shale of US. The DL model has been applied to the 5 wells40,41 of the Marcellus Shale (E, F, G, H, I wells in the Fig. 1). The shale is represented by late oil (well 1), condensate-wet gas (well 3 and 4) and dry gas (well 5) windows because Requ of the wells are 1%, 1.36–1.38% and 2.67–2.68%, respectively. TOC contents range from 1.94 to 13.6 wt.% with an average of 5.12 wt.%. Quartz, clay and carbonate contents are 34.15%, 50.22% and 6.14, respectively. Predicted S1 volumes of the Marcellus Shale vary from 0.69 to 5.36 mg oil/g TOC with an average 2.48 mg oil/g TOC based on the DL model. In addition, average of evaporation loss corrected S1 volume with 45–50 API gravities is 7.72 mg oil/g TOC. Thus, the average oil saturation index (OSI) of the Marcellus Shale has been calculated as 169.87 mg oil/g TOC (Table 3 and Datashare 1). The data show that the Marcellus Shale has higher reservoir quality and higher oil and gas production rate because the OSI is higher than threshold both accepting as 40 mg oil/g TOC7,8,9 and accepting as 100 mg oil/g TOC6. The finding also well-matches with former studies42,43 on the Marcellus Shale.

Conclusions

The S1 or fluid oil and gas volumes retained within the shales can be obtained by Rock-Eval pyrolysis analysis. Organic matter richness, type, maturity, age, depth, and mineralogy control S1 volumes. Pyrolysis analysis is time consuming and S1 volumes are more vulnerable to loss of light oil. Thus, this study has determined a deep learning model for S1 volumes prediction and is applied to the Bakken and Marcellus shales using Python 3 programing language, Tensorflow and Keras libraries. The dataset contains 331 instances and 8 variables from the 19 wells of the world-class organic-rich shales of the Niobrara, Eagle Ford, Barnett, Haynesville, Woodford, Vaca Muerta and Dadaş. Best Mean Squared Error loss history to S1 volumes prediction has obtained using 1024, 512, 256, 128 and 128 neurons and 5 layers. Linear correlation coefficient using the DNN parameters was predicted as high as R2 = 0.97. The correlation shows that S1 volumes of the organic rich shale formations could quickly be predicted from the DL model obtained by dataset of standard E&P activities of the petroleum sector. The DL model has successfully been applied to S1 volumes prediction of the Marcellus and Bakken shales. Average predicted S1 volumes of the Bakken and Marcellus shales based on the DL model are 3.64 mg oil/g TOC and 2.48 mg oil/g TOC, respectively. Average evaporation loss corrected S1 values are 4.86 mg oil/g TOC in the Bakken Shales and 7.72 mg HC/g TOC in the Marcellus Shale. The predictions show that the Bakken and Marcellus shales have lower to higher reservoir quality and, oil and gas production rate that are well matched with former studies, respectively.

Methods

Supervised deep learning

Supervised Deep Learning (DL) method has been used to predict of the S1 volumes of the shales based on actual training and testing data of laboratory analysis of shale samples. The DL method is the most important technique for machine learning and artificial intelligence that uses multilayered neural networks to extract high-order features44,45. Deep Neural Network (DNN) is mathematical models based on the neural structure of intelligent organisms, specifically the human brain46,47,48. Their main characteristics are learning, generalization and abstraction capacities, which they obtain through the search of relationships, automatic construction of models and corrections based on experience, in order to reduce their own errors. Multilayer perceptron is the most commonly used neural network and it is structured in a way that the number of input features is the same as the number of neurons in the input layer. The output layer has the same number of neurons as the classes of interest. Hidden layers in between and the neurons in each one of these layers may vary according to the problem characteristics49,50,51,52.

Shale analyses

TOC and S1 volumes of the dataset were obtained from Leco analyzer and/or Rock-Eval 6 pyrolysis. Rock-Eval data analyses were carried out by Institut Français du Petrole standards. The results of the analysis were used to calculate the S1 (mg HC/g rock) liberated at temperatures under 300 °C. The samples were analyzed using the standard procedures10.

Equivalent vitrinite reflectance (Requ%) has been used as a thermal stress indicator in this study. Thus, the Eq. (4)53 was applied to the samples.

$$ {\text{Requ}}\% = 0.0{165} \cdot {\text{Tmax -}}\;{65},{143} $$
(4)

Quartz, clay and carbonate contents of the dataset were carried out with whole rock XRD analysis. Note that quartz, clay and carbonate contents of the Eagle Ford B, C, D, X wells are average values of K1 and K2 wells. Following XRD analyses, qualitative and quantitative determinations of mineral compositions have been calculated by Jade 7.0 Software, Inorganic Crystal Structure Database of the International Center for Diffraction Data, and the Reference Intensity Ratios of the Easy Quant Program.

Age and depth of the shales have been taken from given references in the dataset sections.

Evaporation loss correction for S1

Evaporative loss from C15+ was calculated by below Eqs. (5, 6)17

$$ \% {\text{C15}}^{ + } {\text{loss}} = {\text{oil}}\;{\text{API}}\;{\text{gravity}}{-}\left( {{2}0,{799}/0.{412}} \right) $$
(5)
$$ {\text{S1cf}} = {1}/\left( {{1} - \% {\text{C15}}^{ + } {\text{loss}}} \right) $$
(6)

where S1 correlation factor is represented as S1cf.

API gravity for correction of S1 evaporate loss has been estimated with maturity-API gravity chart13,17,54.