Exploratory Data Analysis
Today we pick up where we left off in the last blog, where we imported log data for a particular well into a pandas dataframe. From this data we will now explore the dataset with some quick EDA and visualizations.
# Import required libraries & packages
from azure.storage.blob import BlockBlobService
import pandas as pd
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
from collections import defaultdict
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# List the account name and sas_token *** I have removed the SAS token so that the user registers properly through Equinor's data portal
azure_storage_account_name = 'dataplatformblvolve'
sas_token = 'sv=2018-03-28&sr=c&sig=cEL7YkHB5tZoUAr2m4teRmLohaARUHWjaVYpLxqUHzs%3D&se=2020-03-03T16%3A22%3A51Z&sp=rl'
# Create a service and use the SAS
sas_blob_service = BlockBlobService(
    account_name=azure_storage_account_name,
    sas_token=sas_token,
)
filename = 'WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1'
blob = sas_blob_service.list_blobs('pub', filename)
for b in blob:
    print(b.name)
WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1/00001.xml
WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1/00002.xml
WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1/00003.xml
def units(df):
    '''
    Convert columns from metric to US oilfield units based on the unit
    suffix in the column name (e.g. 'DMEA - m')
    Input: df - pandas dataframe with 'mnemonic - unit' column names
    Output: df - pandas dataframe with converted values and unit-free column names
    '''
    corr_columns = []
    for col in df.columns:
        corr_columns.append(col.split(' - ')[0])
        try:
            df[col] = df[col].astype(float)
        except (ValueError, TypeError):
            # Non-numeric channel (e.g. timestamps) - leave as-is
            continue
        # Analyze column units and transform if appropriate
        # Future work will use a dictionary rather than a series of if statements
        unit = col.split(' - ')[1]
        if unit in ('m', 'm/h', 'm/s2'):
            df[col] = df[col] * 3.28084        # metres -> feet
        if unit == 'degC':
            df[col] = (df[col] * 9/5) + 32     # Celsius -> Fahrenheit
        if unit == 'm3':
            df[col] = df[col] * 6.289814       # cubic metres -> barrels
        if unit == 'kkgf':
            df[col] = df[col] * 2.2            # kkgf -> klbf
        if unit == 'kPa':
            df[col] = df[col] * 0.145038       # kilopascals -> psi
        if unit == 'L/min':
            df[col] = df[col] * 0.264172875    # litres/min -> gal/min
        if unit == 'g/cm3':
            df[col] = df[col] * 8.3454         # g/cm3 -> lb/gal (ppg)
        if unit == 'kN.m':
            df[col] = df[col] * 0.737562       # kN.m -> kft.lbf
    df.columns = corr_columns
    return df
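As the comment above notes, a future refactor could replace the chain of if statements with a lookup dictionary. A minimal sketch of that idea, reusing the same conversion factors (degC still needs special handling because it is not a pure multiplication):
# Sketch only: multiplicative conversion factors keyed by unit string
CONVERSIONS = {
    'm': 3.28084, 'm/h': 3.28084, 'm/s2': 3.28084,  # metres -> feet
    'm3': 6.289814,        # cubic metres -> barrels
    'kkgf': 2.2,           # kkgf -> klbf
    'kPa': 0.145038,       # kPa -> psi
    'L/min': 0.264172875,  # litres/min -> gal/min
    'g/cm3': 8.3454,       # g/cm3 -> lb/gal (ppg)
    'kN.m': 0.737562,      # kN.m -> kft.lbf
}

def units_dict(df):
    '''Same behaviour as units(), driven by the CONVERSIONS lookup.'''
    corr_columns = []
    for col in df.columns:
        corr_columns.append(col.split(' - ')[0])
        try:
            df[col] = df[col].astype(float)
        except (ValueError, TypeError):
            continue
        unit = col.split(' - ')[1]
        if unit == 'degC':                 # Celsius -> Fahrenheit (scale + offset)
            df[col] = df[col] * 9/5 + 32
        elif unit in CONVERSIONS:
            df[col] = df[col] * CONVERSIONS[unit]
    df.columns = corr_columns
    return df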
filename = 'WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1'
blob = sas_blob_service.list_blobs('pub', filename)
master_df = pd.DataFrame()
for x, file in enumerate(blob):
    # Download each WITSML log file and parse it as XML
    blob_text = sas_blob_service.get_blob_to_text('pub', file.name)
    soup = BeautifulSoup(blob_text.content, 'xml')
    log_names = soup.find_all('mnemonicList')
    unit_names = soup.find_all('unitList')
    # Build column headers as 'mnemonic - unit'
    header = [i + ' - ' + j for i, j in zip(log_names[0].string.split(","), unit_names[0].string.split(","))]
    data = soup.find_all('data')
    df = pd.DataFrame(columns=header,
                      data=[row.string.split(',') for row in data])
    df = df.replace('', np.NaN)
    df = units(df)
    if x == 0:
        master_df = df
    else:
        master_df = pd.concat([master_df, df], sort=False)
print('Total Length of DataFrame: ' + str(len(master_df)))
master_df.head()
Total Length of DataFrame: 30000
 | DEPTH | G_IC5 | BDDI | INCL_CONT_RT | TDH | TVA | SHKRSK_RT | G_NC5 | MFOP | NRPM_RT | ... | AJAM_MWD | SHKPK_RT | STICK_RT | G_C3 | ECD_MWD | G_IC4 | DRILLTM_RT | G_C1 | G_NC4 | CRS1
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 601.201 | NaN | NaN | NaN |
1 | 300.688986 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 301.299222 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 301.610902 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 302.240823 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 58 columns
A quick walkthrough of the code we just ran, to re-familiarize ourselves with last week's work: we connected to the blob service hosted on Azure Storage. Using that connection we downloaded three different XML (WITSML) log files located under the directory WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1. This folder structure tells us that we are in the Norway-StatoilHydro-15_$47$_9-F-14 well and grabbing a series of log files. We then loop through the three log files, read in the data and append them to the master_df pandas dataframe.
In last week's blog we looked at the data structure, missing data, etc. of one of these three files. Now that we have a 'complete' log we can do a similar analysis:
plt.figure(figsize = (15,5))
plt.bar(range(len(master_df.isna().sum())), list(master_df.isna().sum().values / len(master_df)), align='center')
plt.xticks(range(len(master_df.isna().sum())), master_df.columns.to_list(), rotation='vertical')
plt.title('Proportion NA Data by Column')
plt.show()
Unfortunately - similar to the previous analysis - most channels contain missing data. As with last week's analysis, we could leave it like this, we could look for other data sources, or we could see whether the time based data can be re-indexed to depth and whether that leaves us with less missing data. This latter approach comes with a few caveats:
- The dataset should be filtered to include only 'on-bottom' data. This can be achieved by subtracting bit depth from hole depth and selecting only data where the result is < 0.5 ft (good to have a little buffer). A minimal sketch of this check appears below.
- Some data may not be appropriate to write to bit depth (for example, MWD gamma ray is measured back at the MWD tool, which may be 20-100 ft behind the bit). In this case it is pertinent to read in the BHA file for that particular run, look up the sensor offsets and subtract each offset from the bit depth so that the sensor data is written to the correct depth in a depth based file.
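For reference, a minimal sketch of the on-bottom check, assuming the hole depth and bit depth channels are named DMEA and DBTM (as they are in the time based log we load below; we apply a variant of this with a 1 ft buffer later):
# Keep only rows where the bit is within 0.5 ft of bottom, i.e. on-bottom data
on_bottom = master_df[(master_df['DMEA'] - master_df['DBTM']) < 0.5]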
Our other option is to use interpolation, as we did in the previous week's example. Since we already covered interpolation last week, we will instead look at transforming a time based dataset into a depth based one. Located under the folder WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/2 are time based log files. We will import the time based data in the same way as the depth based data and save it to a master dataframe:
filename = 'WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/2/1/1'
blob = sas_blob_service.list_blobs('pub', filename)
master_df = pd.DataFrame()
for x, file in enumerate(blob):
    # Download each WITSML log file and parse it as XML
    blob_text = sas_blob_service.get_blob_to_text('pub', file.name)
    soup = BeautifulSoup(blob_text.content, 'xml')
    log_names = soup.find_all('mnemonicList')
    unit_names = soup.find_all('unitList')
    # Build column headers as 'mnemonic - unit'
    header = [i + ' - ' + j for i, j in zip(log_names[0].string.split(","), unit_names[0].string.split(","))]
    data = soup.find_all('data')
    df = pd.DataFrame(columns=header,
                      data=[row.string.split(',') for row in data])
    df = df.replace('', np.NaN)
    df = units(df)
    if x == 0:
        master_df = df
    else:
        master_df = pd.concat([master_df, df], sort=False)
print('Total Length of DataFrame: ' + str(len(master_df)))
master_df.head()
Total Length of DataFrame: 309431
 | TIME | TVT | CFIA | GRR | TDH | TVDE | DCHM | STIS | TQA | ROP | ... | SPM3 | BITRUN | PD_GRAV_BHC | PDINCL | TV04 | TV06 | DVER | SWOB | MDOA | SDEP_CONT_RT
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2008-05-09T14:57:41.040Z | 4549.611268 | 0.0 | NaN | 64.274001 | NaN | NaN | NaN | 0.014751 | 348.687671 | ... | 0.0 | NaN | NaN | NaN | 472.365021 | 324.743103 | 3551.574981 | 0.0 | 5.42451 | NaN |
1 | 2008-05-09T14:57:45.014Z | 4546.277483 | 0.0 | NaN | 64.274001 | NaN | 3553.215401 | NaN | 0.014751 | 348.687671 | ... | 0.0 | NaN | NaN | NaN | 472.365021 | 324.805991 | 3551.574981 | 0.0 | 5.42451 | NaN |
2 | 2008-05-09T14:57:50.026Z | 4543.761787 | 0.0 | NaN | 64.274001 | NaN | NaN | NaN | 0.014751 | 348.687671 | ... | 0.0 | NaN | NaN | NaN | 472.553709 | 324.805991 | 3551.574981 | 0.0 | 5.42451 | NaN |
3 | 2008-05-09T14:57:54.000Z | 4542.000455 | 0.0 | NaN | 64.274001 | NaN | NaN | NaN | 0.014751 | 348.687671 | ... | 0.0 | NaN | NaN | NaN | 472.553709 | 324.805991 | 3551.574981 | 0.0 | 5.42451 | NaN |
4 | 2008-05-09T14:57:59.011Z | 4540.490961 | 0.0 | NaN | 64.274001 | NaN | NaN | NaN | 0.014751 | 348.687671 | ... | 0.0 | NaN | NaN | NaN | 472.553709 | 324.805991 | 3551.574981 | 0.0 | 5.42451 | NaN |
5 rows × 97 columns
Here, time is the first channel of the dataset (note that when we imported the data we didn't specify an index, so pandas assigned an integer index). Below we will filter the dataset to on-bottom data only and then set the index to the depth channel. We could instead specify a time index, either when defining the dataframe or afterwards - for example:
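A sketch only (not applied here, since we are about to re-index to depth):
# Parse the TIME strings and use them as a DatetimeIndex
time_indexed = master_df.copy()
time_indexed['TIME'] = pd.to_datetime(time_indexed['TIME'])
time_indexed = time_indexed.set_index('TIME')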
master_df = master_df[master_df['DMEA'] - master_df['DBTM'] < 1]
master_df['DMEA'] = master_df['DMEA'].round(2)
depth_df = master_df.groupby(['DMEA']).mean()
depth_df.head()
DMEA | TVT | CFIA | GRR | TDH | TVDE | DCHM | STIS | TQA | ROP | SHKRSK_RT | ... | SPM3 | BITRUN | PD_GRAV_BHC | PDINCL | TV04 | TV06 | DVER | SWOB | MDOA | SDEP_CONT_RT
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3443.01 | 4811.393296 | 0.0 | NaN | 63.644000 | NaN | 3443.012097 | NaN | 2.979750 | 348.687671 | NaN | ... | 69.0 | NaN | NaN | NaN | 470.635338 | 322.541667 | 3441.371677 | 4.767173 | 5.42451 | NaN |
3443.04 | 4792.523701 | 0.0 | NaN | 63.635000 | NaN | 3443.044537 | NaN | 3.593402 | 348.687671 | NaN | ... | 69.0 | NaN | NaN | NaN | 470.610173 | 322.566824 | 3441.371677 | 5.209118 | 5.42451 | NaN |
3443.08 | 4768.009358 | 0.0 | NaN | 63.625999 | NaN | 3443.077378 | NaN | 3.033224 | 348.687671 | NaN | ... | 69.0 | NaN | NaN | NaN | 470.540982 | 322.730365 | 3441.371677 | 5.810343 | 5.42451 | NaN |
3443.11 | 4744.582833 | 0.0 | NaN | 63.625999 | NaN | NaN | NaN | 3.291001 | 348.687671 | NaN | ... | 69.0 | NaN | NaN | NaN | 470.264226 | 322.529087 | 3441.371677 | 6.348753 | 5.42451 | NaN |
3443.14 | 4729.405548 | 0.0 | NaN | 63.612499 | NaN | 3443.143059 | NaN | 3.186268 | 348.687671 | NaN | ... | 69.0 | NaN | NaN | NaN | 470.415207 | 322.698909 | 3441.371677 | 6.971290 | 5.42451 | NaN |
5 rows × 94 columns
Above we first filtered to remove off-bottom data, then rounded the hole depth channel to the nearest hundredth of a foot, then used groupby and mean to transform the index. Pandas also has set_index() functionality. That will give the data a depth index, but will not merge rows at the same hole depth: if you have 10 datapoints that all correspond to 100.01 ft (100.005, 100.011, ...) you keep all of them, which may or may not be the preferred behaviour. Above we used groupby and mean to average the data instead. The data has now been converted to depth - but, as previously mentioned, sensor data that isn't written at bit depth still needs to be shifted. This can be difficult; there are a few ways to accomplish it (below we apply the second approach to the PD_GRAV_BHC channel):
- If the index (hole depth) contains evenly spaced rows we can shift the data using .shift() on the column (a minimal sketch of this case follows the list).
- The data above isn't evenly spaced, which makes it more complicated. To accommodate the sensor data we separate the column into a new dataframe, drop it from the original dataframe, then merge the two back together.
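For the evenly spaced case, a minimal sketch (assuming a constant 0.5 ft index spacing and the 8.5 ft PD sensor offset used below; not applied here):
# Sketch only: with constant 0.5 ft spacing, an 8.5 ft offset corresponds to 17 rows.
# shift(-17) moves each reading up to the shallower depth where it was actually measured.
shifted_pd_grav = depth_df['PD_GRAV_BHC'].shift(-17)
Now the unevenly spaced case, applied to PD_GRAV_BHC: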
df_pd = depth_df['PD_GRAV_BHC']
df_pd = df_pd.reset_index()
df_pd['DMEA'] = df_pd['DMEA'] - 8.5
df_pd = df_pd.set_index('DMEA')
df_pd.head()
DMEA | PD_GRAV_BHC
---|---
3434.51 | NaN |
3434.54 | NaN |
3434.58 | NaN |
3434.61 | NaN |
3434.64 | NaN |
depth_df = depth_df.drop(['PD_GRAV_BHC'], axis = 1)
depth_df = pd.concat([depth_df, df_pd], axis = 1, sort = False)
depth_df.head()
DMEA | TVT | CFIA | GRR | TDH | TVDE | DCHM | STIS | TQA | ROP | SHKRSK_RT | ... | SPM3 | BITRUN | PDINCL | TV04 | TV06 | DVER | SWOB | MDOA | SDEP_CONT_RT | PD_GRAV_BHC
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3434.51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3434.54 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3434.58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3434.61 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3434.64 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 94 columns
In the first cell we separate out the PD Grav variable, reset the index so that DMEA is its own column, then subtract the sensor offset distance taken from the BHA file. We then set DMEA back as the index so that we can use the concat function.
In the second cell, we drop the PD Grav variable from the original depth dataset, then use concat to combine the two on the sensor-adjusted depth.
We can now visualize to see if our time to depth conversion helped improve NA values:
plt.figure(figsize = (15,5))
plt.bar(range(len(depth_df.isna().sum())), list(depth_df.isna().sum().values / len(depth_df)), align='center')
plt.xticks(range(len(depth_df.isna().sum())), depth_df.columns.to_list(), rotation='vertical')
plt.title('Proportion NA Data by Column')
plt.show()
What we can see is that we have more channels compared to the original depth dataset - and the proportion of missing data, while still large, is significantly reduced compared to the depth dataset. If we coarsen the depth sample rate to the nearest 0.5 ft we will likely reduce the proportion of missing values further:
depth_df = depth_df.reset_index()
depth_df['DMEA'] = round(depth_df['DMEA'] * 2) / 2
depth_df = depth_df.groupby('DMEA').mean()
depth_df.head()
DMEA | TVT | CFIA | GRR | TDH | TVDE | DCHM | STIS | TQA | ROP | SHKRSK_RT | ... | SPM3 | BITRUN | PDINCL | TV04 | TV06 | DVER | SWOB | MDOA | SDEP_CONT_RT | PD_GRAV_BHC
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3434.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3435.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3435.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3436.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3436.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 444.959 |
5 rows × 94 columns
plt.figure(figsize = (15,5))
plt.bar(range(len(depth_df.isna().sum())), list(depth_df.isna().sum().values / len(depth_df)), align='center')
plt.xticks(range(len(depth_df.isna().sum())), depth_df.columns.to_list(), rotation='vertical')
plt.title('Proportion NA Data by Column')
plt.show()
We can see that we have now substantially reduced the missing values in our dataset - which is very good. It may be questioned whether averaging to the nearest half foot is appropriate - this is dictated by the dataset. In general many channels have a resolution at or coarser than 0.5 ft, and combined with depth uncertainty it can be argued that averaging to the nearest half foot is OK. For wireline logs such as a resistivity image log, however, averaging to the nearest half foot may destroy some of the tiny fracture features contained within the log and would not be appropriate.
With this data we can now inspect relationships between channels:
import seaborn as sns
plt.figure(figsize = (10,10))
sns.scatterplot('PD_GRAV_BHC', 'ROP5', data = depth_df)
<matplotlib.axes._subplots.AxesSubplot at 0x193ee58b860>
import seaborn as sns
plt.figure(figsize = (10,10))
sns.scatterplot('Bit_RPM', 'ROP5', data = depth_df)
<matplotlib.axes._subplots.AxesSubplot at 0x193803bb6d8>
Further EDA can be carried out using standard Python EDA plots and analysis. The primary point stressed with this dataset is the importance of understanding the data. This is often the most laborious and painful part of analysis, frequently consuming an excessive amount of time - but it is also the most crucial. ML algorithms often struggle with missing data - understanding how to handle missing data appropriately and accurately is essential to deriving insightful meaning from an ML model and/or business intelligence.
We saw that time based data transformed to depth based may provide a more complete dataset, and we saw how to complete this process accurately. In oil and gas data, missing or incorrect values are commonplace and should be expected. Taking the time to understand the dataset, understand its limitations and develop ways to improve it is essential. Doing so sets up the ability to then use the data accurately and derive meaning from it.
This EDA is by no means complete - many channels still need to be sensor corrected after converting from time to depth. It is also pertinent to remove outlier data:
- In the plot of PD gamma vs ROP we see a straight line near 440 API. Most likely this is some kind of default value, since it is an outlier in the gamma dataset yet occurs frequently. More domain knowledge is needed to confirm this (a one-line sketch of masking these values follows this list).
- Many times when setting or pulling slips, hookload can increase or decrease abruptly, causing a spike in the data. Especially when converting from time based to depth based, these spikes, which don't reflect the drilling process, need to be removed. In other cases - such as detecting overpull and potential tight hole - these spikes are exactly the signal we care about.
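As a reference, masking the suspected default gamma values could look roughly like this (the 400 API cutoff is an eyeballed guess from the scatter plot, not a documented value):
# Mask the ~440 API run, treating it as a default/error value rather than real gamma
depth_df.loc[depth_df['PD_GRAV_BHC'] > 400, 'PD_GRAV_BHC'] = np.nan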
Understanding the domain of the problem you're trying to solve, and the EDA appropriate to that problem, is essential. This serves as a brief primer on the EDA of an oil and gas dataset.