Exploratory Data Analysis

by Aaron Olson - Mon 03 February 2020
Tags: #WITSML #Pandas #Python #EDA

Today we will pick up where we left off in the last blog, where we imported log data into a pandas DataFrame for a particular well. From this data we will now explore the dataset via quick EDA and visualizations.

# Import required libraries & packages
from azure.storage.blob import BlockBlobService
import pandas as pd
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
from collections import defaultdict
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# List the account name and sas_token *** I have removed the SAS token in order to allow the user to register properly through Equinor's data portal
azure_storage_account_name = 'dataplatformblvolve'
sas_token = 'sv=2018-03-28&sr=c&sig=cEL7YkHB5tZoUAr2m4teRmLohaARUHWjaVYpLxqUHzs%3D&se=2020-03-03T16%3A22%3A51Z&sp=rl'

# Create a service and use the SAS 
sas_blob_service = BlockBlobService( 
    account_name=azure_storage_account_name, 
    sas_token=sas_token,
)

filename = 'WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1'

blob = sas_blob_service.list_blobs('pub', filename)
for b in blob:
    print(b.name)
WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1/00001.xml
WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1/00002.xml
WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1/00003.xml
def units(df):
    '''
    The intent of this function is to convert units from metric to US

    Input: df - pandas dataframe
    Output: df - pandas dataframe with columns transformed based on current units

    '''
    corr_columns = []
    for col in df.columns:
        corr_columns.append(col.split(' - ')[0])
        try:
            df[col] = df[col].astype(float)
        except (ValueError, TypeError):
            # Non-numeric channels (e.g. timestamps) are left untouched
            continue
        # Analyze column units and transform if appropriate
        # Future work will use dictionary rather than series of if statements
        if (col.split(' - ')[1] == 'm') | (col.split(' - ')[1] == 'm/h') | (col.split(' - ')[1] == 'm/s2'):
            df[col] = df[col].astype(float) * 3.28084
        if col.split(' - ')[1] == 'degC':
            df[col] = (df[col].astype(float) * 9/5) + 32
        if col.split(' - ')[1] == 'm3':
            df[col] = df[col].astype(float) * 6.289814
        if col.split(' - ')[1] == 'kkgf':
            df[col] = df[col].astype(float) * 2.2
        if col.split(' - ')[1] == 'kPa':
            df[col] = df[col].astype(float) * 0.145038
        if col.split(' - ')[1] == 'L/min':
            df[col] = df[col].astype(float) * 0.264172875
        if col.split(' - ')[1] == 'g/cm3':
            df[col] = df[col].astype(float) * 8.3454
        if col.split(' - ')[1] == 'kN.m':
            df[col] = df[col].astype(float) * 0.737562

    df.columns = corr_columns
    return df
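The comment in units() mentions replacing the chain of if statements with a dictionary lookup. Here is a minimal sketch of what that refactor could look like (units_dict and UNIT_FACTORS are illustrative names, the factors mirror the ones above, and degC still needs special handling because it is an offset rather than a pure scale factor):

# Hypothetical dictionary-driven version of units() - a sketch, not a drop-in replacement
UNIT_FACTORS = {
    'm': 3.28084,         # metres -> feet
    'm/h': 3.28084,       # metres/hour -> feet/hour
    'm/s2': 3.28084,      # metres/second^2 -> feet/second^2
    'm3': 6.289814,       # cubic metres -> barrels
    'kkgf': 2.2,          # 1000 kgf -> klbf (approx.)
    'kPa': 0.145038,      # kilopascals -> psi
    'L/min': 0.264172875, # litres/min -> gallons/min
    'g/cm3': 8.3454,      # g/cm3 -> lb/gal
    'kN.m': 0.737562,     # kilonewton-metres -> kilofoot-pounds
}

def units_dict(df):
    '''Convert metric columns to US units using a lookup table.'''
    corr_columns = []
    for col in df.columns:
        name, _, unit = col.partition(' - ')
        corr_columns.append(name)
        try:
            df[col] = df[col].astype(float)
        except (ValueError, TypeError):
            continue
        if unit == 'degC':
            df[col] = df[col] * 9 / 5 + 32  # temperature is an offset conversion
        elif unit in UNIT_FACTORS:
            df[col] = df[col] * UNIT_FACTORS[unit]
    df.columns = corr_columns
    return df
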
filename = 'WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1'
blob = sas_blob_service.list_blobs('pub', filename)
master_df = pd.DataFrame()
for x, file in enumerate(blob):
    # Download each log file as text (avoid reassigning the generator we are iterating over)
    blob_text = sas_blob_service.get_blob_to_text('pub', file.name)
    soup = BeautifulSoup(blob_text.content, 'xml')
    log_names = soup.find_all('mnemonicList')
    unit_names = soup.find_all('unitList')
    header = [i + ' - ' + j for i, j in zip(log_names[0].string.split(","), unit_names[0].string.split(","))] 
    data = soup.find_all('data')
    df = pd.DataFrame(columns=header, 
                      data=[row.string.split(',') for row in data])
    df = df.replace('', np.NaN)
    df = units(df)
    if x == 0:
        master_df = df
    else:
        master_df = pd.concat([master_df, df], sort = False)
print('Total Length of DataFrame: ' + str(len(master_df)))
master_df.head()
Total Length of DataFrame: 30000
DEPTH G_IC5 BDDI INCL_CONT_RT TDH TVA SHKRSK_RT G_NC5 MFOP NRPM_RT ... AJAM_MWD SHKPK_RT STICK_RT G_C3 ECD_MWD G_IC4 DRILLTM_RT G_C1 G_NC4 CRS1
0 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 601.201 NaN NaN NaN
1 300.688986 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 301.299222 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 301.610902 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 302.240823 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 58 columns

A quick walkthrough of the code we just ran, to re-familiarize ourselves with last week's work: we connected to the blob service hosted on Azure storage. Using that connection we downloaded three different XML (WITSML) log files located under the directory WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/1/1/1. This folder structure tells us that we are in the Norway-StatoilHydro-15_$47$_9-F-14 well and grabbing a series of log files. We then loop through the three log files, read in the data and append them to the master_df pandas DataFrame.

In last week's blog we looked at the data structure, missing data, etc. for one of these three files. Now that we have a 'completed' log we can do a similar analysis:

plt.figure(figsize = (15,5))
plt.bar(range(len(master_df.isna().sum())), list(master_df.isna().sum().values / len(master_df)), align='center')
plt.xticks(range(len(master_df.isna().sum())), master_df.columns.to_list(), rotation='vertical')
plt.title('Proportion NA Data by Column')
plt.show()

Unfortunately - similar to the previous analysis - most channels contain missing data. As last week, we could leave it like this, we could look for other data sources, or we could see whether there is time based data that can be re-indexed to depth with less missing data. This latter approach comes with a few caveats (a small sketch of both steps follows this list):

- The dataset should be filtered to only include 'on-bottom' data. This can be achieved by subtracting bit depth from hole depth and only selecting data where the result of that subtraction is < 0.5 ft (good to have a little buffer).
- Some data may not be appropriate to write to bit depth (for example MWD gamma ray is measured back at the MWD tool, which may be 20-100 ft behind the bit). In this case it is pertinent to read in the BHA file for that particular run, look up the sensor offsets and subtract each offset from the bit depth so the measurement is written to the correct sensor depth in a depth based file.
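As a rough sketch of those two steps (time_df is a placeholder for the time based frame imported below, which carries the DMEA (hole depth), DBTM (bit depth) and GRR (gamma) mnemonics; the offset value is hypothetical and should come from the BHA file):

# Sketch: keep only on-bottom rows, then shift a sensor channel back by its offset
on_bottom = time_df[time_df['DMEA'] - time_df['DBTM'] < 0.5]

sensor_offsets = {'GRR': 45.0}  # hypothetical gamma sensor offset behind the bit, in ft
gamma = on_bottom[['DMEA', 'GRR']].copy()
gamma['DMEA'] = gamma['DMEA'] - sensor_offsets['GRR']  # depth at which the sensor actually measured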

Our other option is to use interpolation like we did in the previous week's example. Since we already covered interpolation last week, we will look at transforming a time based dataset into a depth based one. Located under the folder WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/2 are time based log files. We will import the time based data in the same way as the depth based data and save it to a master DataFrame:

filename = 'WITSML Realtime drilling data/Norway-StatoilHydro-15_$47$_9-F-14/1/log/2/1/1'
blob = sas_blob_service.list_blobs('pub', filename)
master_df = pd.DataFrame()
for x, file in enumerate(blob):
    # Download each log file as text (avoid reassigning the generator we are iterating over)
    blob_text = sas_blob_service.get_blob_to_text('pub', file.name)
    soup = BeautifulSoup(blob_text.content, 'xml')
    log_names = soup.find_all('mnemonicList')
    unit_names = soup.find_all('unitList')
    header = [i + ' - ' + j for i, j in zip(log_names[0].string.split(","), unit_names[0].string.split(","))] 
    data = soup.find_all('data')
    df = pd.DataFrame(columns=header, 
                      data=[row.string.split(',') for row in data])
    df = df.replace('', np.NaN)
    df = units(df)
    if x == 0:
        master_df = df
    else:
        master_df = pd.concat([master_df, df], sort = False)
print('Total Length of DataFrame: ' + str(len(master_df)))
master_df.head()
Total Length of DataFrame: 309431
TIME TVT CFIA GRR TDH TVDE DCHM STIS TQA ROP ... SPM3 BITRUN PD_GRAV_BHC PDINCL TV04 TV06 DVER SWOB MDOA SDEP_CONT_RT
0 2008-05-09T14:57:41.040Z 4549.611268 0.0 NaN 64.274001 NaN NaN NaN 0.014751 348.687671 ... 0.0 NaN NaN NaN 472.365021 324.743103 3551.574981 0.0 5.42451 NaN
1 2008-05-09T14:57:45.014Z 4546.277483 0.0 NaN 64.274001 NaN 3553.215401 NaN 0.014751 348.687671 ... 0.0 NaN NaN NaN 472.365021 324.805991 3551.574981 0.0 5.42451 NaN
2 2008-05-09T14:57:50.026Z 4543.761787 0.0 NaN 64.274001 NaN NaN NaN 0.014751 348.687671 ... 0.0 NaN NaN NaN 472.553709 324.805991 3551.574981 0.0 5.42451 NaN
3 2008-05-09T14:57:54.000Z 4542.000455 0.0 NaN 64.274001 NaN NaN NaN 0.014751 348.687671 ... 0.0 NaN NaN NaN 472.553709 324.805991 3551.574981 0.0 5.42451 NaN
4 2008-05-09T14:57:59.011Z 4540.490961 0.0 NaN 64.274001 NaN NaN NaN 0.014751 348.687671 ... 0.0 NaN NaN NaN 472.553709 324.805991 3551.574981 0.0 5.42451 NaN

5 rows × 97 columns

Here time is the natural index of the dataset (note that when we imported the data we didn't specify an index, so pandas assigned an integer index). We could specify the index either when defining the DataFrame or afterwards (a quick illustration of the latter follows). Below we will filter the dataset for only on-bottom data and then specify the index as the depth channel.
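For reference, setting the time channel as the index after import would look like the sketch below (illustration only; we keep the default integer index for the depth conversion that follows):

# Illustration only: parse the TIME channel and set it as the index
time_indexed = master_df.copy()
time_indexed['TIME'] = pd.to_datetime(time_indexed['TIME'])
time_indexed = time_indexed.set_index('TIME')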

# Keep only on-bottom rows (hole depth minus bit depth less than 1 ft)
master_df = master_df[master_df['DMEA'] - master_df['DBTM'] < 1]
# Round hole depth to the nearest 0.01 ft, then average rows that share a depth
master_df['DMEA'] = master_df['DMEA'].round(2)
depth_df = master_df.groupby(['DMEA']).mean()
depth_df.head()
TVT CFIA GRR TDH TVDE DCHM STIS TQA ROP SHKRSK_RT ... SPM3 BITRUN PD_GRAV_BHC PDINCL TV04 TV06 DVER SWOB MDOA SDEP_CONT_RT
DMEA
3443.01 4811.393296 0.0 NaN 63.644000 NaN 3443.012097 NaN 2.979750 348.687671 NaN ... 69.0 NaN NaN NaN 470.635338 322.541667 3441.371677 4.767173 5.42451 NaN
3443.04 4792.523701 0.0 NaN 63.635000 NaN 3443.044537 NaN 3.593402 348.687671 NaN ... 69.0 NaN NaN NaN 470.610173 322.566824 3441.371677 5.209118 5.42451 NaN
3443.08 4768.009358 0.0 NaN 63.625999 NaN 3443.077378 NaN 3.033224 348.687671 NaN ... 69.0 NaN NaN NaN 470.540982 322.730365 3441.371677 5.810343 5.42451 NaN
3443.11 4744.582833 0.0 NaN 63.625999 NaN NaN NaN 3.291001 348.687671 NaN ... 69.0 NaN NaN NaN 470.264226 322.529087 3441.371677 6.348753 5.42451 NaN
3443.14 4729.405548 0.0 NaN 63.612499 NaN 3443.143059 NaN 3.186268 348.687671 NaN ... 69.0 NaN NaN NaN 470.415207 322.698909 3441.371677 6.971290 5.42451 NaN

5 rows × 94 columns

Above we first filtered out off-bottom data, then rounded the hole depth channel to the nearest hundredth of a foot, then used groupby and mean to transform the index. Pandas also has a set_index() method; it will give the data a depth index, but it will not merge rows that share the same hole depth. If you have 10 datapoints that all round to 100.01 ft (100.008, 100.012, ...) you will keep all 10 rows - which may or may not be the preferred behaviour. Above we used groupby and mean to average them instead. The data has now been converted to depth, but as previously mentioned, sensor data that isn't written at bit depth still needs to be corrected. This can be difficult, and there are a few ways to accomplish it:

- If the index (hole depth) contains evenly spaced rows, we can shift the data with .shift() on the column.
- The data above isn't evenly spaced, which makes things more complicated. In that case the sensor column can be separated into its own DataFrame, dropped from the original DataFrame, depth-shifted, and then merged back in. Below we show how this can be accomplished for the PD_GRAV_BHC channel:

df_pd = depth_df['PD_GRAV_BHC']
df_pd = df_pd.reset_index()          # make DMEA a regular column so it can be adjusted
df_pd['DMEA'] = df_pd['DMEA'] - 8.5  # subtract the sensor offset (from the BHA file), in ft
df_pd = df_pd.set_index('DMEA')      # set the corrected depth back as the index
df_pd.head()
PD_GRAV_BHC
DMEA
3434.51 NaN
3434.54 NaN
3434.58 NaN
3434.61 NaN
3434.64 NaN
depth_df = depth_df.drop(['PD_GRAV_BHC'], axis = 1)
depth_df = pd.concat([depth_df, df_pd], axis = 1, sort = False)
depth_df.head()
TVT CFIA GRR TDH TVDE DCHM STIS TQA ROP SHKRSK_RT ... SPM3 BITRUN PDINCL TV04 TV06 DVER SWOB MDOA SDEP_CONT_RT PD_GRAV_BHC
DMEA
3434.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3434.54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3434.58 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3434.61 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3434.64 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 94 columns

In the first cell we separate out the PD Grav variable, reset the index so that DMEA is its own column, then subtract the sensor offset distance taken from the BHA file. We then set DMEA back as the index so that we can use the concat function.

In the second cell, we drop the PD Grav variable from the original depth dataset, then use concat to combine the two frames on the sensor-adjusted depth.
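For comparison, if the depth index had been evenly spaced (say a constant 0.5 ft increment), the same sensor offset could have been applied with a simple row shift; a minimal sketch, assuming an 8.5 ft offset and 0.5 ft spacing:

# Sketch: with a uniform 0.5 ft depth index, an 8.5 ft sensor offset is 17 rows
shifted_pd_grav = depth_df['PD_GRAV_BHC'].shift(-int(8.5 / 0.5))  # move values 17 rows shallower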

We can now visualize to see if our time to depth conversion helped improve NA values:

plt.figure(figsize = (15,5))
plt.bar(range(len(depth_df.isna().sum())), list(depth_df.isna().sum().values / len(depth_df)), align='center')
plt.xticks(range(len(depth_df.isna().sum())), depth_df.columns.to_list(), rotation='vertical')
plt.title('Proportion NA Data by Column')
plt.show()

What we can see is that we have more channels than in the original depth dataset - and the proportion of missing data, while still large, is significantly reduced compared to the depth based log. If we coarsen the depth sample rate to the nearest 0.5 ft we will likely improve the missing data proportions as well:

depth_df = depth_df.reset_index()
depth_df['DMEA'] = round(depth_df['DMEA'] * 2) / 2  # round depth to the nearest 0.5 ft
depth_df = depth_df.groupby('DMEA').mean()
depth_df.head()
TVT CFIA GRR TDH TVDE DCHM STIS TQA ROP SHKRSK_RT ... SPM3 BITRUN PDINCL TV04 TV06 DVER SWOB MDOA SDEP_CONT_RT PD_GRAV_BHC
DMEA
3434.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3435.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3435.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3436.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3436.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 444.959

5 rows × 94 columns
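The half-foot rounding above can be wrapped in a small helper if you want to experiment with other depth increments; a minimal sketch (resample_to_depth and the increment values are illustrative, not part of the original workflow):

def resample_to_depth(df, depth_col='DMEA', increment=0.5):
    '''Round the depth channel to the nearest increment and average rows that share a depth.'''
    out = df.reset_index() if depth_col not in df.columns else df.copy()
    out[depth_col] = (out[depth_col] / increment).round() * increment
    return out.groupby(depth_col).mean()

# Example: a coarser 1 ft resample of the depth-indexed frame
# depth_1ft = resample_to_depth(depth_df, increment=1.0)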

plt.figure(figsize = (15,5))
plt.bar(range(len(depth_df.isna().sum())), list(depth_df.isna().sum().values / len(depth_df)), align='center')
plt.xticks(range(len(depth_df.isna().sum())), depth_df.columns.to_list(), rotation='vertical')
plt.title('Proportion NA Data by Column')
plt.show()

We can see that we have now substantially reduced the missing values in our dataset - which is very good. It may be questioned whether averaging to the nearest half foot is appropriate - that is dictated by the dataset. In general many channels have a resolution at or coarser than 0.5 ft, and combined with depth uncertainty it can be argued that averaging to the nearest half foot is OK. If running wireline logs such as a resistivity image log, averaging to the nearest half foot may destroy some of the fine fracture features contained within the log and would not be appropriate.

With this data we can now inspect relationships between channels:

import seaborn as sns
plt.figure(figsize = (10,10))
sns.scatterplot(x='PD_GRAV_BHC', y='ROP5', data=depth_df)
<matplotlib.axes._subplots.AxesSubplot at 0x193ee58b860>

import seaborn as sns
plt.figure(figsize = (10,10))
sns.scatterplot(x='Bit_RPM', y='ROP5', data=depth_df)
<matplotlib.axes._subplots.AxesSubplot at 0x193803bb6d8>

Further EDA can also be carried out using standard Python EDA plots and analysis. The primary point stressed in this post is the importance of understanding the data. This is often the most laborious and painful part of analysis, frequently consuming an outsized amount of time - but it is also the most crucial. ML algorithms often struggle with missing data - understanding how to handle missing data appropriately and accurately is essential to deriving insight from an ML model and/or business intelligence.
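As one concrete example of that further EDA, a correlation heatmap over a handful of channels can quickly surface relationships worth investigating (the channel list below is an assumption; swap in whichever mnemonics are well populated in your frame):

# Correlation heatmap over a few channels (channel names assumed from the frames above)
channels = ['ROP', 'SWOB', 'TQA', 'PD_GRAV_BHC']
corr = depth_df[channels].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Between Selected Channels')
plt.show()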

We saw that time based data transformed to depth based may provide a more complete dataset, and we saw how to complete this process accurately. In oil and gas data, missing or incorrect data is very commonplace and should be expected. Taking the time to understand the dataset, understand its limitations and develop ways to improve the data it contains is essential. Doing so sets the stage for accurately using the data and deriving meaning from it.

This EDA is by no means complete - many channels still need to be sensor corrected after converting from time to depth. It is also pertinent to remove outlier data:

- In the plot of PD gamma vs ROP we see a straight line near 440 API. Most likely this is some kind of default value, as it is generally an outlier in the gamma dataset but occurs frequently. Having the domain knowledge to recognize this is important (a rough sketch of one way to handle it follows this list).
- Many times when setting or pulling slips, hookload can increase or decrease abruptly, causing a spike in the data. Especially when converting from time based to depth based, these spikes, which don't reflect the drilling process, need to be removed. In other contexts - such as detecting overpull and potential tight hole - these spikes can be the essential signal.
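A rough sketch of one way to handle the gamma default value (the 440 API cutoff is an assumption read off the scatter plot and should be confirmed against the tool vendor's documentation):

# Replace suspected default/sentinel gamma readings with NaN rather than dropping the rows
suspect_default = depth_df['PD_GRAV_BHC'] >= 440  # assumed cutoff from the scatter plot
depth_df.loc[suspect_default, 'PD_GRAV_BHC'] = np.nan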

Understanding the domain of the problem you're trying to solve, and the EDA needed to solve it, is essential. This serves as a brief primer on EDA for an oil and gas dataset.

Comments