## Introduction

A caliper log records the borehole size. When a logging tool is pulled up the well, a caplier arm opens or closes as it encounters zones of washout, mudcake, or zones of stable hole condition. If there is abundant washout or mudcake, many logging tools (bulk density, neutron, etc.) do not collect valid petrophysical data. Often a petrophysicist will use some cutoff (2.5 inches of washout, maybe) to determine whether they can use the data over that interval in their analysis or if they should instead throw it out.

When a well is drilled, it is very common for there to be multiple “trips”, where a smaller bit size is used on each successive entry. As a result a caliper curve merged between logging runs will be stepped, as shown in this caliper vs. depth plot:

In order to assess whether our writeline data are valid or not, we need to compare the caliper to the bit size. The problem is that commonly bit size information is not available or is buried in rasterized records. When it is recorded in well metadata, it may not be precise (for any number of reasons, for example drilling depth and wireline depth being slightly different due to variable stretch on the drill pipe versus the wireline tool), or it wasn’t recorded properly by the sleep-deprived logging engineer, or it may be incomplete (“I see bit size for the shallow section, but what about the deeper part of the hole?”).

In this post, I document one method of estimating bit size directly from the caliper curve using a chagepoint detection algorithm and some `pandas`

magic.

## Set up

First we use `lasio`

, a good library by Kent Inverarity for working with LAS files as python objects, to create a `las`

object. `lasio`

also has a method for creating a `pandas`

DataFrame from the LAS file `~A`

data section.

```
import lasio
import pandas as pd
las = lasio.read('example_caliper.LAS')
df = las.df().reset_index()
```

Note that while traditionally LAS files are tied to depth, I will be working with the DataFrame index. The x-axes in the following plots can be thought of as log depth; only, divide by 2 to get the *actual* depth, since the depth step is 0.5 ft (one sample every half foot).

It is not uncommon to have null values in any logging curve, especially at casing points. In the figure above you can see the array has null values where the bit size shifts (namely at approximately indices 7000 and 21800). For this exercise to work we can’t have null values in our array, so we need to replace the nulls with numbers.

In this case we’ll select the `CALI`

column from the dataframe, interpolate using the `pad`

setting, and create a 1-D `numpy`

array.

```
import matplotlib.pyplot as plt
plt.plot(df['CALI'])
```

```
signal = df['CALI'].interpolate('pad').values
plt.plot(signal)
```

Here is the entire caliper curve from surface to the bottom of the well, with the interpolated values instead of nulls:

## Changepoint detection

Glancing at the figure above, you can tell that an approximately 12-inch bit was used from surface to about 7000 depth index (since the sample rate on this log is 0.5 ft, that is about 3500 ft depth). There the bit was changed to ~9 inches then again changed to ~6 inches at ~22,000 depth index.

To the human eye, it’s pretty obvious where there are bit size changes. But if we needed to do this for hundreds or thousands of wells, that is an extremely tedious task to estimate bit sizes and the change points manually. This is where changepoint detection comes in (thanks to Jesse Pisel for the suggestion to try this technique).

There’s a nice library called `ruptures`

that does changepoint detection. It has a familiar API to anyone with experience with scikit-learn which makes it quick to get up and running.

This is not meant to be a deep dive into the `ruptures`

library. Instead, it’s just a proof-of-concept that this library can be effectively used to detect the major shifts in the caliper log that represent changes in drill bit size. In this case I’m using the `BottomUp`

detection method and the `l2`

model, but there are several others built in to the library– you can read about the other detection methods and the other models that can be applied from the rupture documentation.

In this case I know the number of breakpoints and set the model to predict the changepoints at two places (i.e. `n_bkps=2`

).

```
import ruptures as rpt
# instantiate model, fit it to our data, and predict the breakpoint locations
model = "l2" # "l1", "rbf", "linear", "normal", "ar"
algo = rpt.BottomUp(model=model, jump=10).fit(signal)
my_bkps = algo.predict(n_bkps=2)
# show results
rpt.show.display(signal, my_bkps, my_bkps, figsize=(10, 3))
```

That worked pretty well! It took about 10 seconds to run with `jump=10`

. `ruptures`

has a built in method for creating a matplotlib plot with the breakpoints drawn and distinct regions of the curve shaded.

What if we don’t know the number of breakpoints? `ruptures`

can deal with this as well. You can feed a penalty parameter to the predictor to tell the algorithm when to stop searching for more breakpoints. After a few trial and errors, I settled on a sigma of 20 for the automated breakpoint enumeration, which gave me the same number of breakpoints as above.

```
import numpy as np
sigma = 20 # acceptable standard deviation for this exercise
dim = signal.ndim
n = len(signal)
my_bkps = algo.predict(pen=np.log(n)*dim*sigma**2)
```

## Create a bit size curve

We’ve found the break points thanks to `ruptures`

. Next we want to create a bit size array to add to our LAS file. One of the outputs from the model prediction is an array with the breakpoints listed:

```
my_bkps
```

```
[7040, 21840, 26864]
```

We can create a list of bins for use in creating the bit size array from scratch:

```
indices = [0] + my_bkps
bins = list(zip(indices, indices[1:]))
bins
```

```
[(0, 7040), (7040, 21840), (21840, 26864)]
```

The amazing `pandas`

library has a neat class called `IntervalIndex`

. We’ll instantiate one of these using the `bins`

we just came up with.

```
ii = pd.IntervalIndex.from_tuples(bins, closed='left')
```

Since we’ve been working in index space rather than depth space, we’ll use the `cut()`

method on the DataFrame index to create a new column called `BIN`

. By assigning the `IntervalIndex`

object we just created to the `bins`

kwarg, `cut()`

will evaluate each row’s index and assign it to a bin based on the intervals defined in the `IntervalIndex`

:

```
df['BIN'] = pd.cut(df.index, bins=ii)
```

Now every row of the dataframe belongs to a group which corresponds with the intervals we came up with using the `ruptures`

library.

Remember, the caliper is the true borehole diameter as measured by the caliper arm on the logging tool and the bit size is the ideal borehole diameter. The idea is that if we compute a low percentile value of the caliper, we will be pretty close to the bit size:

```
df.groupby('BIN')['CALI'].describe(percentiles=[0.02])['2%']
```

```
BIN
[0, 7040) 12.348
[7040, 21840) 8.793
[21840, 26864) 6.141
Name: 2%, dtype: float64
```

So this describes the approximate bit sizes over the intervals: bit sizes used in this well appear to be about 12.3, 8.8 and 6.1 inches.

Let’s take those values and put them in a numpy array:

```
bits = df.groupby('BIN')['CALI'].describe(percentiles=[0.05, 0.02])['2%'].values
```

Drill bits typically come in standard sizes, and are usually relatively consistent across a region (though they may vary from basin to basin or continent to continent, onshore to offshore, etc.).

Having looked at dozens of well log headers in this basin, I have a pretty good idea of the bit sizes encountered. So here we’ll create a little `dict`

with some common bit sizes.

```
bit_sizes = {
6.125: 6.125,
6.5: 6.5,
7.875: 7.875,
8.5: 8.5,
8.75: 8.75,
12.25: 12.25,
}
```

Now we look up which bit size each of the apparent bit sizes is closest to using that dictionary of standard bit sizes. To do so we pick the key where the magitude of the difference between the apparent bit size and the key in minimal. This is not very efficient but there aren’t many bit sizes nor many apparent bit sizes, so it is pretty fast despite the computational complexity.

```
estimated_bits = []
for i in bits:
bit_ = bit_sizes[min(bit_sizes.keys(), key=lambda k: abs(k-i))]
estimated_bits.append(bit_)
categories = dict(zip(df.BIN.unique(), estimated_bits))
categories
```

```
{Interval(0, 7040, closed='left'): 12.25,
Interval(7040, 21840, closed='left'): 8.75,
Interval(21840, 26864, closed='left'): 6.125}
```

There may be a better way to do dict lookup to go to a new column in `pandas`

, but this `apply()`

method, even though it’s not a vectorized operation, is fast enough for my purposes. The result is a column that records the bit size.

```
df['BITSIZE'] = df.apply(lambda x: categories.get(x.BIN), axis=1)
```

So, now we have a bit size curve. Here’s what it looks like when plotted alongside the caliper log.

```
plt.plot(df['CALI'])
plt.plot(df['BITSIZE'])
```

## Update the LAS object and write to a LAS file

To finish, let’s create a new DataFrame with just the columns we want to send back into the python `las`

object. We can do so by using the `set_data_from_df()`

method.

```
df2 = df[['DEPT', 'CALI', 'BITSIZE']].set_index('DEPT', drop=True)
las.set_data_from_df(df2)
```

Write the `las`

object to an LAS file and we’re done!

```
with open('bit_size_added.LAS', 'w') as f:
las.write(f)
```