Data Science Portfolio

A repository of the projects I worked on or am currently working on. Click on the projects' titles to see the full analysis and code.

You can also check out the code on GitHub.

---

Stand-alone Projects

GPT-2 EU Text Generator

  • OpenAI's GPT-2 is a transformer-based language model that can write impressively coherent and passionate essays. Using gpt-2-simple, I fine-tuned the model on all of the European Union's Directives, Regulations and Decisions to generate new EU legislative acts.
  • I then put the model in a Docker container and deployed it with Google Cloud Run.
  • The generated texts are surprisingly coherent and make some quirky use of legalese.
  • You can read more about the process and the results in my article on my blog.
  • You can generate some text here.

Baudelaire Poem Generator


Clustering the Ethereum Address Space

  • On the Ethereum blockchain, addresses are unique identifiers that leave traces as publicly available transactional data.
  • I built a dataset of Ethereum addresses with 28 relevant features drawn from multiple sources: a Google BigQuery dataset, the etherscan.io public API, labels from a Kaggle dataset, and manually added labels.
  • I attempt to create meaningful categories of users (Miners, Exchanges, ...) using the K-Means clustering algorithm.
  • Because a small percentage of the addresses in the dataset are labeled, I then re-cluster the data with a constrained version of the K-Means algorithm that leverages these labels.
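The write-up above does not say which constrained variant was used, so the following is only a minimal sketch of one common option: labeled points seed the centroids and stay pinned to their cluster, while unlabeled points are assigned to the nearest centroid. The function name `constrained_kmeans`, the toy 2-D points and the assumption that every cluster has at least one labeled point are all illustrative, not taken from the project.

```python
def mean(rows):
    """Column-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_kmeans(points, labels, k, iterations=100):
    """K-Means where labeled points never leave their labeled cluster.

    labels[i] is a cluster index in range(k), or None if point i is unlabeled.
    Assumption: every cluster has at least one labeled point.
    """
    # Seed each centroid with the mean of the points labeled for that cluster.
    centroids = [
        mean([p for p, l in zip(points, labels) if l == c]) for c in range(k)
    ]
    assignments = list(labels)
    for _ in range(iterations):
        # Labeled points keep their label; the rest join the nearest centroid.
        new_assignments = [
            l if l is not None
            else min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p, l in zip(points, labels)
        ]
        if new_assignments == assignments:
            break  # no assignment changed, so the centroids are stable
        assignments = new_assignments
        # Move each centroid to the mean of all points currently assigned to it.
        centroids = [
            mean([p for p, a in zip(points, assignments) if a == c])
            for c in range(k)
        ]
    return centroids, assignments


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (6, 6)]
    # Only three points are labeled; (6, 6) is pinned to cluster 0.
    labels = [0, None, None, 1, None, None, 0]
    centroids, assignments = constrained_kmeans(points, labels, k=2)
    print(assignments)  # [0, 0, 0, 1, 1, 1, 0]
```

A side effect of seeding from the labels is that cluster indices keep their meaning (cluster 0 stays the category of the points labeled 0), which plain K-Means does not guarantee.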

---

Micro Projects

Sentiment Analysis on HK Protests Tweets

  • I scraped 45,000 tweets with Tweepy and preprocessed them.
  • I then built a word cloud, ran sentiment analysis with NLTK and carried out an exploratory analysis of the data.

Clustering with K-means

  • A visual introduction to the K-Means algorithm.
  • For a more visually pleasing experience, you can find my article here.
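For readers who prefer code to pictures, here is a minimal sketch of the standard K-Means loop (Lloyd's algorithm) in plain Python. It is not the notebook's code: the function name `kmeans`, the random initialisation from data points and the toy data are assumptions made for illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct data points (illustrative choice).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assignments


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, assignments = kmeans(data, k=2)
    print(centroids)
    print(assignments)
```

The cluster numbering depends on the random initialisation, so only the grouping of the points is meaningful, not which group is called 0 or 1.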

Pulsars Detection with HTRU2 Dataset

  • The HTRU2 Pulsars dataset contains data about pulsars. I first treat it as a binary classification problem, using it as an opportunity to try different classification algorithms and compare their performance.
  • I then use the dataset for unsupervised learning, applying a clustering method (K-Means) with PCA as a precursor step.
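To illustrate what "PCA as a precursor step" means, here is a minimal sketch that projects data onto its first principal component before any clustering takes place. It is not the notebook's code: the function name `first_principal_component`, the use of power iteration instead of a library PCA and the toy data are all assumptions made for the example.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    Uses power iteration on the sample covariance matrix. Assumptions: the
    data is not constant, and the start vector of ones is not orthogonal to
    the leading eigenvector (otherwise the norm below would be zero).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered row onto the component to get 1-D scores.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


if __name__ == "__main__":
    rows = [(0, 0), (1, 2), (2, 4), (3, 6)]  # points on the line y = 2x
    direction, scores = first_principal_component(rows)
    print(direction)
    print(scores)
```

The resulting scores (or the first few components in a fuller PCA) would then be passed to K-Means in place of the raw features, so that clustering runs in a lower-dimensional space.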

Animals from the QuickDraw Dataset

  • The QuickDraw dataset contains 50 million drawings collected by Google.
  • I select 12 animal categories from the dataset and train a CNN on them.

Facial Keypoints Detection (Kaggle)

  • The data comes from a Kaggle competition.
  • I train a CNN to recognize 15 keypoints on faces.

House Prices (Kaggle)

  • This Kaggle competition is a regression problem: we predict the price of houses based on more than 80 features (and many missing values). This gives us interesting possibilities for feature transformation and data visualization.