This article shows you how to acquire historical NFT data and build a floor price prediction model for the Bored Ape Yacht Club (BAYC) collection in Python. The model can also be used for other NFT collections by simply changing the NFT contract address.

This is the first article of its kind, and it is minted as NFTs. Holders of the NFT will gain access to my next article, on an NFT token price prediction model.

*Note*:

There are generally two types of price prediction that buyers are interested in:

- **Collection Floor Price Prediction:** To predict the lowest price of the entire collection on any given day.
- **Token Price Prediction:** To predict the price of a specific token within one NFT collection on any given day, e.g. a token in the BAYC collection, based on its traits and features and its historic sale prices.

The following model will only cover the floor price prediction for the collection. The token price prediction is not covered here and will be explained in the next article.

The Python code blocks below load the required packages and define the functions needed to download the data from the data provider Covalent. To use Covalent’s API, you need to request an API key from the site below. Other NFT data providers, such as Flipside Crypto and Dune, can also be used to download historic NFT prices and sale volumes.

https://www.covalenthq.com/platform/#/auth/register/

```
###################
## Load packages ##
###################
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot
import datetime
# ARIMA lives in statsmodels.tsa.arima.model in current statsmodels releases
# (the legacy statsmodels.tsa.arima_model version has been retired)
from statsmodels.tsa.arima.model import ARIMA
import pmdarima as pm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import accuracy_score, r2_score, precision_score
from PIL import Image
from io import BytesIO
plt.rcParams.update({'figure.figsize':(9,7), 'figure.dpi':120})

## Covalent API key (placeholder: replace with your own key)
auth_key = 'YOUR_COVALENT_API_KEY'

####################
## Load functions ##
####################
def fetch_collection_hist(address):
    """Download the daily market history of an NFT collection as a DataFrame."""
    api_url = 'https://api.covalenthq.com'
    endpoint = f'/v1/1/nft_market/collection/{address}/'
    url = api_url + endpoint
    r = requests.get(url, auth=(auth_key, ''))
    input_data = r.json()
    input_data = input_data['data']['items']
    out = pd.DataFrame.from_records(input_data)
    return out

def fetch_token_id(address):
    """Download the token IDs of an NFT collection (raw JSON)."""
    api_url = 'https://api.covalenthq.com'
    endpoint = f'/v1/1/tokens/{address}/nft_token_ids/'
    url = api_url + endpoint
    r = requests.get(url, auth=(auth_key, ''))
    input_data = r.json()
    return input_data

def fetch_token_tx(address, token_id):
    """Download the transactions of a single token (raw JSON)."""
    api_url = 'https://api.covalenthq.com'
    endpoint = f'/v1/1/tokens/{address}/nft_transactions/{token_id}/'
    url = api_url + endpoint
    r = requests.get(url, auth=(auth_key, ''))
    input_data = r.json()
    return input_data

def fetch_token_meta(address, token_id):
    """Download the metadata of a single token: one row per trait, plus image URL."""
    api_url = 'https://api.covalenthq.com'
    endpoint = f'/v1/1/tokens/{address}/nft_metadata/{token_id}/'
    url = api_url + endpoint
    r = requests.get(url, auth=(auth_key, ''))
    input_data = r.json()
    input_data = input_data['data']['items'][0]['nft_data']
    out = pd.DataFrame.from_records(input_data)
    out_2 = pd.DataFrame.from_records(out['external_data'])
    out_table = pd.DataFrame({'token_id': out['token_id'],
                              'token_balance': out['token_balance'],
                              'image_url': out_2['image'],
                              'traits': out_2['attributes']})
    trait = pd.DataFrame.from_records(out_table['traits'][0])
    # Repeat the token ID and image URL on every trait row
    trait['token_id'] = out_table['token_id'].tolist() * len(trait)
    trait['image_url'] = out_table['image_url'].tolist() * len(trait)
    return trait

def stationary_test(x):
    """Run the Augmented Dickey-Fuller test and report the result at the 5% level."""
    out = adfuller(x.dropna())
    print('ADF Statistic: %f' % out[0])
    print('p-value: %f' % out[1])
    if out[1] <= 0.05:
        print('Time-series is stationary at 5% significance level.')
    else:
        print('Time-series is non-stationary at 5% significance level. Find the order of difference!')

# Accuracy metrics
def accuracy(y_hat, y):
    """Compare predictions y_hat with actual values y."""
    mape = np.mean(np.abs(y_hat - y)/np.abs(y))
    rmse = np.mean((y_hat - y)**2)**0.5
    corr = np.corrcoef(y_hat, y)[0,1]
    return {'Mean absolute percentage error': mape,
            'Root mean squared error': rmse,
            'corr': corr}
```
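The functions above expect a global `auth_key`. As a minimal sketch, the key could be read from an environment variable instead of being typed into the notebook. The variable name `COVALENT_API_KEY` and the helper names below are assumptions made for illustration, not part of Covalent's tooling; the URL layout simply mirrors the endpoint used in `fetch_collection_hist`, with the `1` in the path assumed to be the chain ID.

```python
import os

API_URL = "https://api.covalenthq.com"  # same base URL as in the functions above


def load_auth_key(env_var="COVALENT_API_KEY"):
    """Return the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


def collection_url(address, chain_id=1):
    """Build the collection market-history URL used by fetch_collection_hist."""
    return f"{API_URL}/v1/{chain_id}/nft_market/collection/{address.lower()}/"


# The BAYC contract address used in this article
bayc_url = collection_url("0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d")
print(bayc_url)
```

With this in place, `auth_key = load_auth_key()` would replace a hard-coded key.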

The historic NFT data can be downloaded from the Covalent API by entering the NFT contract address in the code block below. The Bored Ape Yacht Club (BAYC) address is used here as an example.

```
################################################
## ༼ つ ◕_◕ ༽つ Input NFT contract address below:
################################################
## (BAYC as an example here)
nft_address = '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'
```

```
###################################
## Read in historic floor prices ##
###################################
## Read in floor price
data = fetch_collection_hist(nft_address)
data['opening_date'] = pd.to_datetime(data['opening_date'])
data = data.sort_values(by='opening_date', ascending=True)
data.set_index('opening_date', inplace=True)
## Plot floor price
name = ", ".join(np.unique(data['collection_name']).tolist())  # collection name(s) as plain text
plt.plot(data["floor_price_quote_7d"])
plt.title("Historical floor price " + name)
plt.xlabel("time")
plt.ylabel("price")
plt.show()
```

The historic daily floor price of BAYC from April 30th 2021 to January 15th 2022 is shown below.

You can also input a specific token ID from the collection in the code block below to check out the NFT traits and see the image.

```
######################################
## ༼ つ ◕_◕ ༽つ Input a token ID below:
######################################
tokenId = '123'
## Display the NFT & its traits
token_meta = fetch_token_meta(nft_address, tokenId)
# Print one line per trait instead of a whole pandas Series
for _, row in token_meta.iterrows():
    print(f"The {row['trait_type']} is: {row['value']}")
url = token_meta['image_url'][0]
response = requests.get(url)
Image.open(BytesIO(response.content))
```

Token ID 123 is used as an example. The traits and image of BAYC 123 are shown below:

The dependent variable to predict is the floor price of a given NFT collection. Since the past sale price is usually a good indicator of the future price, this can be interpreted as the prediction of a time-dependent event:

```
Y(c | t_n) = f(w, x, Y(c | t_0, ..., t_{n-1}))
```

where Y(c | t_n) is the floor price of collection c at time t_n, w denotes the model parameters, and x represents the time-dependent independent variables; Y(c | t_0, ..., t_{n-1}) are the past prices that can be used to model the current price.
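To make the formula concrete, the sketch below turns a price series into (past prices, next price) pairs, which is the shape of data any function f would be fitted on. The helper name `make_lagged_pairs`, the window length and the toy prices are illustrative assumptions, not values from the BAYC data.

```python
def make_lagged_pairs(prices, n_lags):
    """Pair each price Y(t_n) with the n_lags prices that precede it."""
    pairs = []
    for n in range(n_lags, len(prices)):
        history = prices[n - n_lags:n]  # the n_lags most recent past prices
        target = prices[n]              # Y(t_n), the value to predict
        pairs.append((history, target))
    return pairs


toy_prices = [10.0, 11.0, 12.5, 12.0, 13.0]  # made-up floor prices
for history, target in make_lagged_pairs(toy_prices, n_lags=2):
    print(history, "->", target)
```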

The prices of a lot of NFTs have sky-rocketed since last year, so there is surely a trend in the price. Because most time-series models require the data to be stationary (no trend), Y generally needs to be transformed into a difference to remove the trend:

```
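dY(c | t_n) = f(w, x, dY(c | t_0, ..., t_{n-1}))
```

where dY(c | t_n) is the change in the price of collection (or token) c from time t_{n-1} to t_n, for example the first difference or the percentage change between two consecutive days.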
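As a small worked example of this transformation, the sketch below computes both the plain first difference and the percentage change of a made-up price series in plain Python. On a pandas Series the same results come from `Series.diff()` and `Series.pct_change()`.

```python
def first_difference(prices):
    """Return Y(t) - Y(t-1) for every t after the first observation."""
    return [curr - prev for prev, curr in zip(prices, prices[1:])]


def pct_change(prices):
    """Return (Y(t) - Y(t-1)) / Y(t-1) for every t after the first observation."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


toy_prices = [10.0, 12.0, 15.0, 18.0]  # made-up, steadily rising floor prices
print(first_difference(toy_prices))  # [2.0, 3.0, 3.0]
print(pct_change(toy_prices))        # [0.2, 0.25, 0.2]
```

The level series keeps climbing, while both transformed series hover around a stable value, which is what the stationarity requirement is after.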

Some of the most commonly used time-series models are:

Because the ARIMA model incorporates the differenced terms dY(c|t), the lags of the difference dY(c|t-1) and the lagged error terms e(c|t), the last option, ARIMA, is used as the demonstration model to predict the floor price in the sections below.

ARIMA models are, in theory, the most general class of models for forecasting a time series which can be made “stationary” by differencing (see more details here). The model form ARIMA(p,d,q) comes with three components:

- p is the number of autoregressive terms,
- d is the number of nonseasonal differences needed for stationarity, and
- q is the number of lagged forecast errors in the prediction equation.
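Written out in standard textbook notation (the symbols below are not taken from the original notebook), an ARIMA(p,d,q) model for the d-times differenced series has the following form. Here c is a constant, the φ are the autoregressive coefficients, the θ are the moving-average coefficients, ε is the forecast error and B is the backshift operator. Sign conventions for the θ terms differ between references.

```latex
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p}
         + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
         + \varepsilon_t,
\qquad y'_t = (1 - B)^d \, Y_t
```

For example, ARIMA(1,1,0) without a constant reduces to Y_t - Y_{t-1} = φ_1 (Y_{t-1} - Y_{t-2}) + ε_t.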

To decide which values of p, d and q to use in the ARIMA model, it is useful to look at the floor price series itself and its first (or second, or higher) difference, along with the autocorrelation function (ACF) and partial autocorrelation function (PACF) of each.

The plots below show that the original daily floor price time-series is trending (non-stationary). Once the first difference is taken, the series becomes stationary, as shown in the ACF and PACF plots. The second difference is over-differenced: its first lag overshoots far into negative territory in the ACF and PACF.

```
########################################
## Plot floor price & its differences ##
########################################
## Original floor price
fig, axes = plt.subplots(3, 3, sharex=False, figsize=(25,15))
diff0 = data['floor_price_quote_7d']
axes[0,0].plot(diff0); axes[0, 0].set_title('Original 7d floor price')
plot_acf(diff0.dropna(), ax=axes[0, 1])
axes[0,1].axis(xmin=-2, xmax=28)
plot_pacf(diff0.dropna(), ax=axes[0, 2])
axes[0,2].axis(xmin=-2, xmax=28)
## 1st diff
diff1 = data['floor_price_quote_7d'].diff()
axes[1,0].plot(diff1); axes[1, 0].set_title('1st difference of 7d floor price')
plot_acf(diff1.dropna(), ax=axes[1, 1])
axes[1,1].axis(xmin=-2, xmax=28)
plot_pacf(diff1.dropna(), ax=axes[1, 2])
axes[1,2].axis(xmin=-2, xmax=28)
## 2nd diff
## lag goes too far negative => over-differenced!
diff2 = data['floor_price_quote_7d'].diff().diff()
axes[2,0].plot(diff2); axes[2, 0].set_title('2nd difference of 7d floor price')
plot_acf(diff2.dropna(), ax=axes[2, 1])
plot_pacf(diff2.dropna(), ax=axes[2, 2])
plt.show()
```
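To show what these panels measure, here is a minimal sketch of the usual sample autocorrelation estimate, applied to a synthetic random walk rather than the BAYC data. The function name `autocorr` and the simulated series are assumptions for illustration. It reproduces the pattern described above: the level series is strongly autocorrelated, the first difference is not, and the second difference swings negative at lag 1.

```python
import random


def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var


random.seed(0)
# Synthetic random walk standing in for a trending price series
walk, level = [], 0.0
for _ in range(500):
    level += random.gauss(0, 1)
    walk.append(level)

diff1 = [b - a for a, b in zip(walk, walk[1:])]    # first difference
diff2 = [b - a for a, b in zip(diff1, diff1[1:])]  # second difference

print("lag-1 ACF, level:    ", round(autocorr(walk, 1), 3))   # close to 1
print("lag-1 ACF, 1st diff: ", round(autocorr(diff1, 1), 3))  # close to 0
print("lag-1 ACF, 2nd diff: ", round(autocorr(diff2, 1), 3))  # clearly negative
```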

The Augmented Dickey–Fuller (ADF) stationarity tests below confirm this statistically: the p-value becomes small after first differencing. This suggests that first differencing (d=1) alone is enough to make the trending data stationary.

```
## Stationarity Tests
stationary_test(diff0) ## non-stationary
stationary_test(diff1) ## stationary
stationary_test(diff2)
```

To get a better test of how the ARIMA model performs, the dataset is split chronologically into the first 80% for training and the last 20% for testing.

```
############################
## Training testing split ##
############################
train_size = round(len(data)*0.8)
train = data.floor_price_quote_7d[:train_size]
test = data.floor_price_quote_7d[train_size:]
```
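As a quick arithmetic check of this split, the sketch below applies the same 80% rule to the date range quoted in this article (April 30th 2021 to January 15th 2022). It assumes exactly one observation per day with no gaps, which may not hold for the downloaded data.

```python
from datetime import date, timedelta

start, end = date(2021, 4, 30), date(2022, 1, 15)  # date range quoted in the article
n_days = (end - start).days + 1                     # daily observations, inclusive
train_size = round(n_days * 0.8)                    # same rule as the split above

train_end = start + timedelta(days=train_size - 1)
test_start = train_end + timedelta(days=1)
print(n_days, train_size, n_days - train_size)  # 261 209 52
print(train_end, test_start)                    # 2021-11-24 2021-11-25
```

Because the split is by position rather than random, every test day lies after every training day, so the model never sees future prices during training.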

Since the ARIMA model has three different parameters p, d and q, the number of possible parameter combinations can get very large. Fortunately, the 'pmdarima' package provides an automated ARIMA that uses a stepwise approach to search multiple combinations of the p, d, q parameters and chooses the best model.

```
################################
## Finding optimal parameters ##
################################
model = pm.auto_arima(train, start_p=1, start_q=1,
                      test='adf',       # use the ADF test to find the optimal 'd'
                      max_p=5, max_q=5,
                      # m=1,
                      d=None,
                      seasonal=False,
                      trace=True,
                      suppress_warnings=True,
                      stepwise=True,
                      random_state=123)
print(model.summary())
```
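To get a feel for how large the search space is, the sketch below simply enumerates every (p, d, q) order within the limits used above. The cap of 2 on d is an assumption for this illustration. A stepwise search only fits a small subset of these candidates instead of all of them.

```python
from itertools import product

max_p, max_q = 5, 5  # same limits as in the auto_arima call above
max_d = 2            # assumed cap on the order of differencing

grid = list(product(range(max_p + 1), range(max_d + 1), range(max_q + 1)))
print(len(grid))  # 108 candidate (p, d, q) orders
print(grid[:3])   # [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
```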

According to the automated search, the best model is ARIMA(1,1,0)(0,0,0), which is a model on the first difference with one autoregressive lag of that difference. There is no seasonality or constant term (intercept) in the model.
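To illustrate how a model of this form produces forecasts, the sketch below applies the ARIMA(1,1,0) recursion by hand, with future errors set to zero. The coefficient `phi = 0.5` and the two starting prices are made-up values, not the fitted BAYC model.

```python
def forecast_arima_110(history, phi, n_periods):
    """Recursive point forecasts for ARIMA(1,1,0) without a constant term.

    Model: (y_t - y_{t-1}) = phi * (y_{t-1} - y_{t-2}) + e_t, future e_t set to 0.
    """
    last = history[-1]
    last_diff = history[-1] - history[-2]
    forecasts = []
    for _ in range(n_periods):
        last_diff = phi * last_diff  # expected next change
        last = last + last_diff      # expected next level
        forecasts.append(last)
    return forecasts


print(forecast_arima_110([100.0, 104.0], phi=0.5, n_periods=4))
# [106.0, 107.0, 107.5, 107.75]
```

Each forecast step adds a shrinking fraction of the last observed change, so with |phi| < 1 a long-horizon forecast flattens out towards a constant level.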

Once the final model is run, it's essential to also check the residuals to see if all the assumptions of ARIMA models are met. Four components of the residuals are checked:

Standardised residual: The errors fluctuate around a mean of zero and have a roughly constant variance.

Density: The histogram and the kernel density estimate (KDE) of the residuals suggest a mean of zero but a narrower and more peaked shape than a normal distribution.

QQ Plot: The QQ-plot of the residuals against a normal distribution shows that not all the dots lie on the red line. The deviations towards the negative values imply the distribution is skewed, the same conclusion as from the density plot.

Correlogram: The correlogram (or ACF) plot shows most of the points are within the confidence interval, so the residual errors are largely not autocorrelated. However, there is one point that pushes far to the negative side, which indicates there might be some pattern in the residuals that is not explained by the model. Adding more predictors (explanatory variables) might help improve the model.

```
## Check residuals
model.plot_diagnostics(figsize=(14,10))
plt.show()
```
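The visual impressions from the density and QQ plots can also be summarised with two numbers. The sketch below computes sample skewness and excess kurtosis on simulated residuals (not the residuals of the fitted model). The function name and the simulated data are assumptions for illustration.

```python
import random
import statistics


def skewness_and_excess_kurtosis(residuals):
    """Sample skewness and excess kurtosis (both roughly 0 for normal data)."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


random.seed(1)
normal_like = [random.gauss(0, 1) for _ in range(5000)]
# Mostly small errors plus a few large negative shocks: peaked and left-skewed
peaked = ([random.gauss(0, 0.3) for _ in range(4900)]
          + [random.gauss(-3, 1) for _ in range(100)])

print(skewness_and_excess_kurtosis(normal_like))  # both values near 0
print(skewness_and_excess_kurtosis(peaked))       # negative skew, large kurtosis
```

A clearly negative skewness corresponds to the QQ-plot deviating on the negative side, and a large positive excess kurtosis to a density that is more peaked than the normal curve.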

Using the optimal model ARIMA(1,1,0) trained from historic data from April 30th 2021 to November 24th 2021, you can predict the floor price from November 25th 2021 to January 15th 2022.

```
################
## Prediction ##
################
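pred, ci = model.predict(n_periods = len(test), return_conf_int=True)
idx = np.arange(len(train), len(train) + len(test))
# np.asarray keeps only the values, in case the installed pmdarima version
# returns a date-indexed Series that would otherwise misalign with idx
test_pred = pd.Series(np.asarray(pred), index = idx)
ci = np.asarray(ci)
lb = pd.Series(ci[:, 0], index = idx)
ub = pd.Series(ci[:, 1], index = idx)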
# Plot
plot_train = train.reset_index()
plot_test = pd.DataFrame(test).set_index(idx)
plot_pred = pd.DataFrame(test_pred).set_index(idx)
plt.figure(figsize=(12,5), dpi=100)
plt.plot(plot_train.floor_price_quote_7d, label = "training")
plt.plot(plot_test.floor_price_quote_7d, label = "testing")
plt.plot(plot_pred, label = "prediction")
plt.fill_between(lb.index, lb, ub, color='k', alpha = 0.1)
plt.legend(loc='upper left', fontsize = 12)
plt.title("Forecast vs. actual floor price")
plt.show()
```

The model performance metrics compare the predicted and the actual floor price during the testing period from November 25th 2021 to January 15th 2022. The mean absolute percentage error of 6.2% means that, on average, the forecast deviates from the actual floor price by about 6.2% over the 52-day test period, i.e. the model is roughly 93.8% accurate in this loose sense.

```
## Performance metrics
actual = pd.DataFrame(test).reset_index()
actual.columns = ['date','actual']
predict = pd.DataFrame(test_pred).reset_index()
predict.columns = ['date','predict']
accuracy(predict['predict'], actual['actual'])
```
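To make the two error metrics concrete, here is a small worked example in plain Python on made-up numbers (not the BAYC results). It mirrors the formulas in the `accuracy` function defined earlier.

```python
def mape(y_hat, y):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    return sum(abs(p - a) / abs(a) for p, a in zip(y_hat, y)) / len(y)


def rmse(y_hat, y):
    """Root mean squared error, in the same unit as the prices."""
    return (sum((p - a) ** 2 for p, a in zip(y_hat, y)) / len(y)) ** 0.5


actual_toy = [100.0, 200.0, 400.0, 800.0]     # made-up prices
predicted_toy = [125.0, 150.0, 400.0, 800.0]  # made-up forecasts
print(mape(predicted_toy, actual_toy))  # (0.25 + 0.25 + 0 + 0) / 4 = 0.125
print(rmse(predicted_toy, actual_toy))  # sqrt((625 + 2500) / 4), about 27.95
```

MAPE is scale-free, which makes it easy to compare across collections, while RMSE is expressed in price units and penalises large misses more heavily.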

This notebook shows an example of how to use the Covalent API to download historic daily NFT floor prices and build a simple ARIMA time-series model. Although the model performance is not bad, there are a few limitations and possible improvements:

- The residuals from the ARIMA (1,1,0) model are not exactly normally distributed, indicating there might be room to improve the model by adding some more exogenous variables such as sale volumes or even traits of the collection.
- Only a short history of daily floor prices is used to predict BAYC in the example. There probably haven't been enough cycles (ups and downs in trend) in the history for the model to learn to predict well into the future.
- ARIMA is a generic and simple time-series model. There could be other models and also other exogenous variables that can be used to build a better model.
