Clustering the Ethereum Address Space
What is the Ethereum blockchain?¶
The Ethereum blockchain is a platform for decentralized applications called smart contracts. Smart contracts are automatic, self-executing agreements that operate without the need for a third party. They are used as the backbone of Decentralized Finance (DeFi) or to represent physical objects (real estate) or digital objects (digital art) as tokens, which we now call NFTs (non-fungible tokens).
The computations required to execute smart contracts are paid for in ether, the native currency of the Ethereum ecosystem. Ether is stored in cryptographically secured accounts called addresses.
Ethereum users may be anonymous, but their addresses are unique, persistent identifiers that leave publicly available traces. It is therefore possible to gather public data on each address's transactional behavior: every transaction executed on Ethereum leaves a trace that we can analyze.
Different types of users use Ethereum for a variety of reasons, and most of these users are ordinary people. But if we select the addresses with the most ether, we find some specific categories of users, such as miners, exchanges, and whales (investors holding very large amounts of ether).
Most of the holders in the categories mentioned above have several Ethereum wallets, which makes their behaviour more difficult to analyze. Spreading funds across many wallets is often a deliberate way of being harder to track.
What do we mean by on-chain data?¶
On-chain data is all data that is natively stored on the blockchain. This includes:
- Details of every mined block on the chain (timestamp, gas price, miner's address, block size, etc.).
- Details of every transaction (sender and receiver addresses, the amount transferred, etc.).
- Details on the number and nature of tokens held by each address.
Why is such a clustering interesting?¶
There are many reasons why one might want to track these accounts, including:
- Predicting price movements based on on-chain data.
- Having a view on the composition of the holders of a specific token.
- Auditing suspicious transactions for fiscal or criminal purposes.
- Understanding the general network activity.
- Making better trading strategies.
- Preventing money laundering (Anti-Money Laundering compliance).
How can we classify the Ethereum address space in a meaningful way?¶
This project attempts to create meaningful categories of users by dividing the Ethereum address space into clusters. It focuses on the behaviour of Ethereum addresses and clusters them based on available transactional data. The source code is available on GitHub.
Here, I attempt to answer this question with unsupervised learning, to see what unexpected patterns might be in the data. But first, let's build a dataset with relevant features.
Bring Your Own Data: Build the Dataset¶
The database was constructed using the BYOD (Bring your Own Data) principle: I worked with data from multiple sources.
Google BigQuery hosts a dataset of all transactional data on Ethereum.
To extract patterns from the addresses, I first defined the traits on which to compare them. Using SQL, I queried 28 features for each address. These features include statistics on the amount of ether each address holds, how often it transacts, who it transacts with, the number of transfers to unique addresses, the number of unique tokens held, and so on.
As there are currently more than 100 million existing Ethereum addresses, filtering the addresses is essential, especially since the categories we find the most relevant (Miners, Exchanges, Whales,...) tend to have a high ether balance. I filtered the data by selecting only the 10,000 addresses with the highest ether balance.
Extract relevant addresses and features from BigQuery¶
eth_dataset.head()
A crucial feature for identifying Miners' addresses: the number of mined blocks¶
As good as the BigQuery dataset is, it does not provide any information on whether an address has received block mining rewards. This is obviously a key feature for identifying a Miner's address, and it would allow us to label these addresses as "Miners".
Etherscan.io provides a free API Service to access Ethereum transactional data, including the number of mined blocks per address. Querying the API allowed me to label 51 addresses from our dataset as "Miners".
Add labels to the data¶
I found a dataset with crowdsourced labels to Ethereum addresses on Kaggle and added labels from addresses present in my dataset.
More labels of Ethereum addresses can be found on etherscan.io. Unfortunately, scraping was not possible, so I manually added relevant labels to my dataset: 53 Exchange-owned addresses and 51 Miner-owned addresses. Together, these 104 labeled addresses represent only about 1% of my dataset, which is not enough to get meaningful results from a supervised learning algorithm.
Pre-process the data¶
I cleaned the data, then scaled it and reduced its dimensionality. Sklearn's pipeline functionality is used to preprocess the features by chaining a power transform, standard scaling, and a Principal Component Analysis (PCA) transformation. PCA is useful here to reduce the dimensionality, given the number of features.
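The three preprocessing steps named above can be sketched as a single scikit-learn pipeline. This is a minimal illustration rather than the project's actual code: the synthetic `features` array stands in for the real 28-feature table, and the choice to keep 95% of the variance in PCA is an assumption made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-in for the real feature table (assumption): 500 addresses
# with 28 heavy-tailed features, mimicking skewed on-chain statistics.
rng = np.random.default_rng(0)
features = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 28))

preprocess = Pipeline([
    # Reduce skewness; standardization is left to the next step.
    ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
    # Zero mean and unit variance for every feature.
    ("scale", StandardScaler()),
    # Keep enough components to explain 95% of the variance (assumed threshold).
    ("pca", PCA(n_components=0.95)),
])

reduced = preprocess.fit_transform(features)
print(reduced.shape)
```

Chaining the steps in one `Pipeline` means the same fitted transformations can later be applied to new addresses with `preprocess.transform(...)`.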
Cluster the data using K-Means¶
I trained a K-Means algorithm to see if there are natural clusters within Ethereum addresses. To learn about the K-Means algorithm, you can read my explanation of the algorithm here.
First, I had to determine a relevant number of clusters. We know that there should be a cluster for Exchange-owned addresses and a cluster for Miner-owned addresses, but probably also clusters for other Ethereum user archetypes not captured in our labeled data: Whales (investors holding very large amounts of ether), normal users, ICO wallets, DeFi liquidity pools, and so on. Each archetype should show specific transactional behavior linked to its particular identity. How many clusters should we then have to maximize the clusters' interpretability?
Determine the number of clusters with the Silhouette Method¶
The Silhouette method assesses the quality of a clustering by measuring how well each instance lies within its cluster. A high silhouette score indicates a good clustering. We selected 8 clusters here as our optimal number of clusters.
For a technical explanation of how the silhouette method works, you can refer to this excellent blog post.
plot_silhouette_scores(data, 4, 12)
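As a rough sketch of what such a scan over cluster counts might compute, the snippet below fits K-Means for each candidate number of clusters and records the mean silhouette score. The helper name `silhouette_by_k` and the synthetic `make_blobs` data are assumptions for illustration; the project's own `plot_silhouette_scores` may work differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed address features (assumption).
data, _ = make_blobs(n_samples=600, centers=5, cluster_std=0.6, random_state=0)


def silhouette_by_k(data, k_min, k_max):
    """Hypothetical helper: mean silhouette score for each k in [k_min, k_max]."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores


scores = silhouette_by_k(data, 2, 8)
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

The score ranges from -1 to 1. In practice the highest-scoring value is a starting point, and it can be weighed against how interpretable the resulting clusters are.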
K-Means clusters and their projection in 2 dimensions with t-SNE¶
The t-SNE algorithm places similar points close together in a low-dimensional space and handles non-linear structure in the data very well.
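A 2-dimensional t-SNE projection could be computed along these lines with scikit-learn. The input array `reduced` is synthetic, and the perplexity of 30 and PCA initialization are assumed settings, not values taken from the project.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the preprocessed features (assumption):
# two shifted groups of 150 points in 10 dimensions.
rng = np.random.default_rng(0)
reduced = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 10)),
    rng.normal(loc=6.0, scale=1.0, size=(150, 10)),
])

# Project to 2 dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
tsne_results = tsne.fit_transform(reduced)
print(tsne_results.shape)
```

t-SNE mainly preserves local neighbourhoods, so the projection is best read for which points sit together rather than for the exact distances between far-apart groups.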
plot_tsne(clust.labels_, tsne_results)
We find the addresses here to be relatively well-differentiated. This suggests that there could be clear differences between the different types of users.
But what do the clusters look like when we plot only the addresses for which we have labels (Miners and Exchanges), labels that our clustering algorithm knows nothing about?
Here is the same figure with only our 104 labeled addresses highlighted:
plot_tsne_with_labels(tsne_results, dataset, dflabel, categs, colors)
And here is how they compare when plotted together:
plot_tsne(cl_labels.labels_, tsne_results_labels)
We see that they are well-separated, almost linearly. Now, how is our labeled data divided among our 8 clusters?
To that end, I assigned each address to its cluster and found the following distribution (label density is the percentage of addresses in the cluster that carry the given label):
Exchange
- Cluster number 0 has 1729 addresses, including 1 addresses labeled as Exchange (label density: 0.057836899942163095).
- Cluster number 1 has 147 addresses, including 44 addresses labeled as Exchange (label density: 29.931972789115648).
- Cluster number 2 has 1689 addresses, including 0 addresses labeled as Exchange (label density: 0.0).
- Cluster number 3 has 973 addresses, including 0 addresses labeled as Exchange (label density: 0.0).
- Cluster number 4 has 438 addresses, including 1 addresses labeled as Exchange (label density: 0.228310502283105).
- Cluster number 5 has 21 addresses, including 4 addresses labeled as Exchange (label density: 19.047619047619047).
- Cluster number 6 has 843 addresses, including 3 addresses labeled as Exchange (label density: 0.3558718861209964).
- Cluster number 7 has 217 addresses, including 0 addresses labeled as Exchange (label density: 0.0).
Mining
- Cluster number 0 has 1729 addresses, including 7 addresses labeled as Mining (label density: 0.4048582995951417).
- Cluster number 1 has 147 addresses, including 9 addresses labeled as Mining (label density: 6.122448979591836).
- Cluster number 2 has 1689 addresses, including 9 addresses labeled as Mining (label density: 0.5328596802841918).
- Cluster number 3 has 973 addresses, including 0 addresses labeled as Mining (label density: 0.0).
- Cluster number 4 has 438 addresses, including 23 addresses labeled as Mining (label density: 5.251141552511415).
- Cluster number 5 has 21 addresses, including 2 addresses labeled as Mining (label density: 9.523809523809524).
- Cluster number 6 has 843 addresses, including 1 addresses labeled as Mining (label density: 0.11862396204033215).
- Cluster number 7 has 217 addresses, including 0 addresses labeled as Mining (label density: 0.0).
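The label density reported above is a percentage: for example, 44 Exchange labels among the 147 addresses of cluster 1 gives 100 × 44 / 147 ≈ 29.93. A minimal sketch of that computation follows, using a hypothetical helper `label_density` and toy data rather than the real dataset.

```python
from collections import Counter


def label_density(cluster_ids, labels, target):
    """Hypothetical helper: map cluster -> (size, labeled count, percent labeled)."""
    sizes = Counter(cluster_ids)
    hits = Counter(c for c, lab in zip(cluster_ids, labels) if lab == target)
    return {
        c: (n, hits[c], 100.0 * hits[c] / n)
        for c, n in sorted(sizes.items())
    }


# Toy data (assumption): one cluster id and one optional label per address.
cluster_ids = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
labels = [None, "Exchange", None, None, "Exchange", "Mining",
          None, None, "Mining", None]

density = label_density(cluster_ids, labels, "Exchange")
for cluster, (size, n_labeled, pct) in density.items():
    print(f"Cluster {cluster}: {size} addresses, "
          f"{n_labeled} labeled as Exchange ({pct:.2f}%)")
```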
Ideally, we should find addresses with the same label in the same cluster. How can we use our (limited) knowledge of some of the addresses' categories to get clusters with a presumably better predictive power?
Semi-supervised Learning - Cluster data points with the same labels together¶
As is typical in semi-supervised learning, only a small fraction of the dataset is labeled and the remaining examples are unlabeled. How can we leverage the few labeled examples? If an expert were available to label a few addresses, we could use active learning and build a supervised model by labeling the addresses that contribute the most to model quality.
Instead, I used labeled data to perform a round of re-clustering, taking account this time of the separation between Miners and Exchanges that we know is relevant.
This is done by applying a modified version of the K-Means algorithm that uses constraints on which addresses must or cannot be clustered together.
The constraints are expressed as two lists:
- A "must-link" list, with the combinations of Ethereum addresses having to be clustered together (Miner-Miner, Exchange-Exchange).
- A "cannot-link" list, with the combinations of Ethereum addresses that cannot be clustered together (Miner-Exchange).
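The two lists described above can be built from the labeled addresses roughly as follows. This is a sketch under assumptions: the helper `build_constraints` is hypothetical, and representing each constraint as a pair of row indices is an illustrative choice that may differ from the input format the constrained K-Means implementation expects.

```python
from itertools import combinations, product


def build_constraints(labels):
    """Hypothetical helper.

    labels maps a row index to its known label (labeled addresses only).
    Returns (must_link, cannot_link) as lists of index pairs.
    """
    by_label = {}
    for idx, label in labels.items():
        by_label.setdefault(label, []).append(idx)

    # Same label: every pair must share a cluster.
    must_link = []
    for members in by_label.values():
        must_link.extend(combinations(sorted(members), 2))

    # Different labels: every cross pair must be kept apart.
    cannot_link = []
    for label_a, label_b in combinations(sorted(by_label), 2):
        cannot_link.extend(product(by_label[label_a], by_label[label_b]))

    return must_link, cannot_link


# Toy example (assumption): five labeled rows out of a larger dataset.
labels = {3: "Miner", 8: "Miner", 15: "Exchange", 21: "Exchange", 42: "Miner"}
must_link, cannot_link = build_constraints(labels)
print(must_link)
print(cannot_link)
```

Note that the number of pairs grows quadratically with the number of labeled addresses: if every pair were listed, 53 Exchanges and 51 Miners would give 1378 + 1275 must-link pairs and 2703 cannot-link pairs.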
To that end, I used the Python implementation of the Constrained K-Means algorithm by Behrouz Babaki, found on GitHub and based on Sklearn's implementation of K-Means.
After re-clustering, we can visualize the new clusters by applying t-SNE again:
plot_tsne(clusters, tsne_results_cop)
Interpreting the Results¶
In a future post, I will focus on interpreting the results so as to draw conclusions about the behaviour of the different users, based for example on the corresponding cluster centroids.
Future Exploratory Paths¶
With a dataset of only 104 labelled addresses, we were able to map some of the most prominent Ethereum address groups. This emphasises the need for more labeled data, as it will allow for the development of a more comprehensive picture of user types.
Expanding on this work would allow a more nuanced view of Ethereum blockchain data. Here are some particularly interesting areas:
- Distinguishing bots' addresses from humans' addresses.
- Replicating the analysis for ERC-20 (fungible) and ERC-721 (non-fungible) tokens.
- Replicating the analysis while filtering for a specific token.
Blockchain analytics is in its infancy and there is much to do in the discovery of useful information about the different protagonists of the blockchain space.