Clustering the Ethereum Address Space

What is the Ethereum blockchain?

The Ethereum blockchain is a platform for decentralized applications built on smart contracts. Smart contracts are self-executing agreements that run without the need for a third party. They form the backbone of Decentralized Finance (DeFi) and are also used to represent physical objects (real estate) or digital objects (digital art): tokens of this kind are now called NFTs (non-fungible tokens).

The computations required to execute smart contracts are paid for in ether, the native currency of the Ethereum ecosystem. Ether is stored in cryptographically secured accounts called addresses.

Ethereum users may be anonymous, but their addresses are unique, persistent identifiers that leave publicly available traces. Every transaction executed on Ethereum is recorded, so we can gather and analyze public data on each address's transactional behavior.

Different types of users use Ethereum for a variety of reasons, and most of them are ordinary people. But if we select the addresses holding the most ether, we find some specific categories of users:

Image and icons from Freepik, Ddara, turkkub from www.flaticon.com

Most holders in the above-mentioned categories have several Ethereum wallets, which makes their behaviour harder to analyze. Spreading funds across many wallets is often a deliberate way of being harder to track.

What do we mean by on-chain data?

On-chain data is all data that is natively stored on the blockchain. This includes:

  • Details of every mined block on the chain (timestamp, gas price, miner's address, block size, etc.).
  • Details of every transaction (sender and receiver addresses, the amount transferred, etc.).
  • Details on the number and nature of tokens held by each address.

Why is such a clustering interesting?

There are many reasons why one might want to track these accounts:

  • Predicting price movements based on on-chain data.
  • Having a view on the composition of the holders of a specific token.
  • Auditing suspicious transactions for fiscal or criminal purposes.
  • Understanding the general network activity.
  • Making better trading strategies.
  • Supporting Anti-Money Laundering (AML) efforts.

How can we classify the Ethereum address space in a meaningful way?

This project attempts to create meaningful categories of users by dividing the Ethereum address space into clusters. It focuses on the behaviour of Ethereum addresses and clusters them based on available transactional data. The source code is available on GitHub.

I attempt here to answer this question with unsupervised learning, to see what unexpected patterns might be in the data. But first, let's build a dataset with relevant features.

Bring Your Own Data: Build the Dataset

The dataset was constructed using the BYOD (Bring Your Own Data) principle: I worked with data from multiple sources.

Google BigQuery hosts a dataset of all transactional data on Ethereum.

To extract patterns from the addresses, I first defined the traits on which to compare them. Using SQL, I queried 28 features for each address. These features include statistics on the amount of ether each address holds, how often it transacts, who it transacts with, the number of transfers to unique addresses, the number of unique tokens held, and so on.

As there are currently more than 100 million Ethereum addresses, filtering them is essential, especially since the categories we find most relevant (Miners, Exchanges, Whales, etc.) tend to have a high ether balance. I filtered the data by selecting only the 10,000 addresses with the highest ether balance.
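As an illustration of this filtering step, here is a minimal sketch of a query that selects the richest addresses. It is not the query used in this project: the helper name `build_top_balances_query` is hypothetical, and the table `bigquery-public-data.crypto_ethereum.balances` with its `address` and `eth_balance` columns is an assumption about the public BigQuery dataset that should be checked against its current schema.

```python
def build_top_balances_query(n_addresses: int = 10_000) -> str:
    """Build SQL that selects the n addresses with the highest ether balance.

    The table and column names are assumptions about the public dataset.
    """
    if n_addresses <= 0:
        raise ValueError("n_addresses must be positive")
    return f"""
        SELECT address, eth_balance
        FROM `bigquery-public-data.crypto_ethereum.balances`
        ORDER BY eth_balance DESC
        LIMIT {int(n_addresses)}
    """


query = build_top_balances_query(10_000)

# Running the query needs the google-cloud-bigquery package and GCP credentials:
# from google.cloud import bigquery
# top_addresses = bigquery.Client().query(query).to_dataframe()
```

The remaining features would then be computed only for these addresses, which keeps the amount of scanned data manageable.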

Extract relevant addresses and features from BigQuery

In [59]:
eth_dataset.head(10)
Out[59]:
ethereum_address ether_balance unique_tokens unique_transfers mined_blocks outgoing_txns incoming_txns total_eth_sent avg_eth_sent total_usd_sent ... monthly_usd_sent monthly_eth_recd monthly_usd_recd contracts_created contract_txns_sent incoming_avg_time_btwn_txns incoming_std_time_btwn_txns outgoing_avg_time_btwn_txns outgoing_std_time_btwn_txns num_tokens_used
0 0x0d0707963952f2fba59dd06f2b425ace40b492fe 443.735122 587 245377 0 479582 492072 2437018.042751326 5.081546102 1.066724e+09 ... 1.333405e+08 314641.354177898 1.375943e+08 0 0 35.793619 8.419819e+02 3.675822e+01 9.115655e+01 281
1 0x6cc5f688a315f3dc28a7781717a9a798a59fda7b 1031.186386 865 472190 0 392467 312401 5418637.39363095 13.806606399 2.131017e+09 ... 2.131017e+08 556867.74055041 2.250757e+08 0 0 77.365013 6.066634e+03 5.300636e+01 4.698340e+02 341
2 0x564286362092d8e7936f0549571a803b203aaced 23892.712593 502 109404 0 615240 678 5823039.74567252 9.464663783 3.502765e+09 ... 3.184332e+08 537952.716734639 3.228961e+08 0 0 39191.246677 6.094457e+04 4.332710e+01 1.286006e+03 261
3 0x0016eccecffc25b94050187017eb59fa05c029aa 126.407467 54 6180 0 2998 481 4479.533394411 1.494173914 1.205682e+06 ... 1.722403e+05 745.774408214 2.516832e+05 0 0 32761.102083 1.169466e+05 5.212212e+03 1.908218e+04 40
4 0xbe708d227f6dfa0b8f2698bf543b949dfe4e28fb 269.029806 202 1462 0 10164 243 20771.819829851 2.043665863 4.969486e+06 ... 6.211858e+05 38.314152287 1.370181e+04 0 0 77338.574380 1.781298e+05 1.845206e+03 6.596558e+03 166
5 0x9b77ab003d44b9b9cb47fa6a00276a23c05b49a5 2089.859796 54 3 0 5 108 60.85 12.17 1.007131e+04 ... 3.357105e+02 68.014756625 2.426593e+04 0 0 687237.691589 2.316083e+06 7.132346e+06 8.472997e+06 32
6 0x0681d8db095565fe8a346fa0277bffde9c0edbbf 11252.425439 561 105544 0 647655 727 6159822.80837053 9.510963103 3.714534e+09 ... 3.376849e+08 570567.985719681 3.425117e+08 0 0 36676.988981 5.331568e+04 4.115239e+01 1.296313e+03 310
7 0x1062a747393198f70f71ec65a582423dba7e5ab3 326.124385 404 443465 0 3909 1974 41965.969 10.73573011 1.262283e+07 ... 6.643596e+05 2218.526085996 6.614552e+05 0 0 22581.909782 3.823953e+05 1.169006e+04 1.592798e+05 215
8 0xeee28d484628d41a82d01e21d12e2e78d69920da 347.253282 357 294350 0 53893 12809 1767690.47107525 32.800001319 4.959678e+08 ... 2.610357e+07 97844.079789312 2.732701e+07 0 0 3342.928873 9.834598e+04 8.485321e+02 2.212886e+04 225
9 0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98 16018.851241 939 330486 0 5488707 235407 25691539.593409785 4.680799976 9.921043e+09 ... 2.480261e+08 75253.575290014 5.078835e+05 13 1326134 434.158199 2.684584e+04 1.872970e+01 4.067285e+02 503

10 rows × 29 columns

A crucial feature for identifying Miners' addresses: the number of mined blocks

As good as the BigQuery dataset is, it does not provide any information on whether an address has received block mining rewards. This is obviously a key feature for identifying a Miner's address, and it would allow us to label these addresses as "Miners".

Etherscan.io provides a free API Service to access Ethereum transactional data, including the number of mined blocks per address. Querying the API allowed me to label 51 addresses from our dataset as "Miners".

Add labels to the data

I found a dataset of crowdsourced labels for Ethereum addresses on Kaggle and added the labels for the addresses present in my dataset.

More labels for Ethereum addresses can be found on etherscan.io. Unfortunately, scraping was not possible, so I manually added relevant labels to my dataset: 53 Exchange-owned addresses and 51 Miner-owned addresses. Together, these 104 labeled addresses represent only about 1% of my dataset, which is not enough to train a supervised learning algorithm and obtain meaningful results.

Pre-process the data

I cleaned the data, then pre-processed the features with Sklearn's pipeline functionality, chaining a power transform, standard scaling, and a PCA (Principal Component Analysis) transformation. PCA is useful here to reduce the dimensionality, given the number of features.
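The three pre-processing steps can be sketched with scikit-learn as follows. This is a minimal example on synthetic data, not the project's actual pipeline: the Yeo-Johnson method, the 95% explained-variance threshold, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-in for the 28 heavily skewed address features.
rng = np.random.default_rng(0)
features = rng.lognormal(mean=0.0, sigma=1.5, size=(500, 28))

preprocess = Pipeline([
    # Reduce skewness; scaling is left to the next step.
    ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
    # Zero mean and unit variance for every feature.
    ("scale", StandardScaler()),
    # Keep enough components to explain 95% of the variance (assumed threshold).
    ("pca", PCA(n_components=0.95)),
])

data = preprocess.fit_transform(features)
print(data.shape)
```

The power transform matters because balances and transaction counts span several orders of magnitude, and K-Means relies on Euclidean distances that a few extreme values would otherwise dominate.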

Cluster the data using K-Means

I trained a K-Means algorithm to see if there are natural clusters within Ethereum addresses. To learn about the K-Means algorithm, you can read my explanation of the algorithm here.

First, I had to determine a relevant number of clusters. We know that there should be a cluster for Exchange-owned addresses and a cluster for Miner-owned addresses, but probably also clusters for other archetypes of Ethereum users not captured in our labeled data: Whales (investors holding very large bags of ether), normal users, ICO wallets, DeFi liquidity pools, etc. Each of them should show specific transactional behavior linked to its particular identity. How many clusters should we then have to maximize the clusters' interpretability?

Determine the number of clusters with the Silhouette Method

The Silhouette method assesses the quality of a clustering by measuring how well each instance sits within its cluster. A high silhouette score indicates a good clustering. Here we selected 8 as the optimal number of clusters.
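For a single point, the silhouette is (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster. The score of a clustering is the average over all points and lies between -1 and 1. The sketch below compares candidate values of k on synthetic data; it is an illustration only, and the data, the range of k, and the variable names are assumptions rather than the project's code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the pre-processed address features.
X, _ = make_blobs(n_samples=600, centers=5, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all points for this value of k.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Picking the k with the highest average score is the simplest rule; in practice it is worth looking at the whole curve, since several values of k can score almost equally well.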

For a technical explanation of how the silhouette method works, you can refer to this excellent blog post.

In [4]:
plot_silhouette_scores(data, 4, 12)

K-Means clusters and their projection in 2 dimensions with t-SNE

The t-SNE algorithm projects high-dimensional data into two dimensions while keeping similar points close together, and it handles non-linear structure in the data very well.
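To make the projection step concrete, here is a minimal sketch using scikit-learn's TSNE on two synthetic groups of points. The perplexity value, the PCA initialisation, and the synthetic data are assumptions for illustration and not necessarily the settings behind the figures in this post.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic groups in 10 dimensions, standing in for address features.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
group_b = rng.normal(loc=6.0, scale=1.0, size=(100, 10))
points = np.vstack([group_a, group_b])

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(points)
print(embedding.shape)
```

Each row of `embedding` gives the 2D coordinates of one point, which can then be coloured by its cluster label. Distances between far-apart groups in a t-SNE plot should not be over-interpreted, since the algorithm mainly preserves local neighbourhoods.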

In [7]:
plot_tsne(clust.labels_, tsne_results)

We find the addresses here to be relatively well-differentiated. This suggests that there could be clear differences between the different types of users.

But what do the clusters look like when we plot only the addresses for which we have labels (Miners and Exchanges), labels that our clustering algorithm knows nothing about?

Here is the same figure with only our 104 labeled addresses highlighted:

In [12]:
plot_tsne_with_labels(tsne_results, dataset, dflabel, categs, colors)

And here is how they compare when plotted together:

In [18]:
plot_tsne(cl_labels.labels_, tsne_results_labels)

We see that they are well-separated, almost linearly. Now, how is our labeled data divided among our 8 clusters?

To that end, I assigned to each address its cluster and found the following distribution:

Exchange

  • Cluster number 0 has 1729 addresses, including 1 address labeled as Exchange (label density: 0.06%).
  • Cluster number 1 has 147 addresses, including 44 addresses labeled as Exchange (label density: 29.93%).
  • Cluster number 2 has 1689 addresses, including 0 addresses labeled as Exchange (label density: 0%).
  • Cluster number 3 has 973 addresses, including 0 addresses labeled as Exchange (label density: 0%).
  • Cluster number 4 has 438 addresses, including 1 address labeled as Exchange (label density: 0.23%).
  • Cluster number 5 has 21 addresses, including 4 addresses labeled as Exchange (label density: 19.05%).
  • Cluster number 6 has 843 addresses, including 3 addresses labeled as Exchange (label density: 0.36%).
  • Cluster number 7 has 217 addresses, including 0 addresses labeled as Exchange (label density: 0%).

Mining

  • Cluster number 0 has 1729 addresses, including 7 addresses labeled as Mining (label density: 0.40%).
  • Cluster number 1 has 147 addresses, including 9 addresses labeled as Mining (label density: 6.12%).
  • Cluster number 2 has 1689 addresses, including 9 addresses labeled as Mining (label density: 0.53%).
  • Cluster number 3 has 973 addresses, including 0 addresses labeled as Mining (label density: 0%).
  • Cluster number 4 has 438 addresses, including 23 addresses labeled as Mining (label density: 5.25%).
  • Cluster number 5 has 21 addresses, including 2 addresses labeled as Mining (label density: 9.52%).
  • Cluster number 6 has 843 addresses, including 1 address labeled as Mining (label density: 0.12%).
  • Cluster number 7 has 217 addresses, including 0 addresses labeled as Mining (label density: 0%).
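The label density above is the percentage of a cluster's addresses that carry a given label; for example, 44 Exchange addresses out of 147 in cluster 1 gives about 29.93%. A minimal sketch of this computation with pandas could look like the following, where the function name `label_density`, the column names, and the toy data are hypothetical.

```python
import pandas as pd


def label_density(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Per cluster: number of addresses, number with `label`, and percentage."""
    grouped = df.groupby("cluster")["label"]
    summary = pd.DataFrame({
        "addresses": grouped.size(),
        "labeled": grouped.apply(lambda s: int((s == label).sum())),
    })
    summary["density_pct"] = 100 * summary["labeled"] / summary["addresses"]
    return summary


# Toy data: nine addresses in two clusters, a few of them labeled.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1, 1, 1, 1],
    "label": [None, "Exchange", None, None,
              "Exchange", "Exchange", "Mining", None, None],
})
summary = label_density(toy, "Exchange")
print(summary)
```

In the toy data, cluster 0 holds one Exchange address out of four (25%) and cluster 1 holds two out of five (40%).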

Ideally, we should find addresses with the same label in the same cluster. How can we use our (limited) knowledge of some of the addresses' categories to get clusters with a presumably better predictive power?

Semi-supervised Learning - Cluster data points with the same labels together

We are in a semi-supervised learning situation: a small fraction of the dataset is labeled and most of the remaining examples are not. How can we leverage the few labeled examples? If an expert were available to label a few addresses, we could use active learning and build a supervised model by labeling the addresses that contribute the most to model quality.

Instead, I used the labeled data to perform a round of re-clustering, this time taking into account the separation between Miners and Exchanges that we know is relevant.

This is done by applying a modified version of the K-Means algorithm that uses constraints on which addresses must be clustered together and which cannot.

The constraints are expressed as two lists:

  • A "must-link" list, with the combinations of Ethereum addresses having to be clustered together (Miner-Miner, Exchange-Exchange).
  • A "cannot-link" list, with the combinations of Ethereum addresses that cannot be clustered together (Miner-Exchange).
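Building the two lists amounts to enumerating pairs of labeled addresses. The sketch below shows one way to do it with itertools, using row indices to identify addresses; the function name `build_constraints` and the toy indices are hypothetical, and the exact format expected by a given constrained K-Means implementation may differ.

```python
from itertools import combinations, product


def build_constraints(miners, exchanges):
    """Return (must_link, cannot_link) lists of index pairs."""
    # Same label: every pair within a group must share a cluster.
    must_link = list(combinations(miners, 2)) + list(combinations(exchanges, 2))
    # Different labels: no Miner may share a cluster with an Exchange.
    cannot_link = list(product(miners, exchanges))
    return must_link, cannot_link


# Toy example: row indices of labeled addresses in the feature matrix.
miners = [0, 3]
exchanges = [1, 4, 7]
must_link, cannot_link = build_constraints(miners, exchanges)
print(len(must_link), len(cannot_link))
```

With 51 Miners and 53 Exchanges, the same construction would give 1,275 + 1,378 = 2,653 must-link pairs and 2,703 cannot-link pairs.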

To that end, I used the Python implementation of the Constrained K-Means algorithm by Behrouz Babaki, found on GitHub and based on Sklearn's implementation of K-Means.

After re-clustering, we can visualize the new clusters by applying t-SNE again:

In [23]:
plot_tsne(clusters, tsne_results_cop)

Interpreting the Results

In a future post, I will focus on interpreting the results in order to draw conclusions about the behaviour of the different users, based for example on the corresponding cluster centroids.

Future Exploratory Paths

With a dataset of only 104 labeled addresses, we were able to map some of the most prominent Ethereum address groups. This emphasises the need for more labeled data, which would allow a more comprehensive picture of user types to be developed.

Expanding on this work would allow a more nuanced view of Ethereum blockchain data. Here are some particularly interesting areas:

  • Distinguishing bots' addresses from humans' addresses.
  • Replicating the analysis for ERC-20 (fungible) and ERC-721 (non-fungible) tokens.
  • Replicating the analysis while filtering for a specific token.

Blockchain analytics is in its infancy and there is much to do in the discovery of useful information about the different protagonists of the blockchain space.