Exploratory Data Analysis on Fantom and UNICEF grant applications

This article was written for the Open Data Foundation; it presents a data analysis of the grant applications received by Fantom and UNICEF over a certain period. We found that some algorithms can be used to automatically determine the eligibility of grant applications.

In this post, we will explore hidden patterns behind grant applications. Specifically, we will use both vanilla and advanced techniques, including data collection/mining, clustering, and semi-supervised learning, to determine the eligibility of grant applications automatically. Let's dive in!

Data Pre-processing#

Formatting#

The Open Data Foundation (ODF) has provided two sets of grant applications for us to explore: the Fantom grant applications and the UNICEF grant applications. As the two sets have different fields (e.g., the Fantom grant applications have a previous_funding field whereas the UNICEF grant applications do not), we first convert them into a common format. Specifically, we only need the title, description, website, github_user, and project_github fields for the later analysis.

fantom_grants = fantom_grants[["title", "description", "website", "github_user", "project_github"]]
unicef_grants = unicef_grants[["title", "description", "website", "github_user", "project_github"]]

Relevance Detection#

When applying for a grant, the title and the description are crucial for reviewers to understand the project and its potential value. Consequently, the title and description must be clear and provide enough information about the project. In this subsection, we will show how we detect the relevance between the provided title and description, which can be used to filter out nonsense or spam applications, using machine learning and pre-trained large-scale Natural Language Processing (NLP) models.

Observing that project descriptions can be very long, which is not ideal for the later classification, we first use a summarizer to condense very long descriptions into relatively short ones. Here, we use the bart-large-cnn model trained by Facebook. bart-large-cnn is based on BART, a denoising auto-encoder for pre-training sequence-to-sequence models built on the Transformer architecture, and it is fine-tuned on the CNN/DailyMail news dataset, on which it achieves strong summarization performance.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def maybe_summarize(description):
    # summarize the description if it is longer than 100 words to filter meaningless sentences
    if len(description.split()) > 100:
        # the model's input is limited, so take at most the first 512 words
        if len(description.split()) > 512:
            description = ' '.join(description.split()[:512])
        description = summarizer(description, max_length=100, min_length=0, do_sample=False)[0]['summary_text']
    return description

Then, we will use another model to determine the relevance between the title and the description. Considering that assessing the quality of a project description in a grant application can be viewed as assessing the quality of a response in a vanilla dialog, we will use the response-quality-classifier-large model trained by tinkoff-ai. To convert our task into a response quality assessment task, we construct a query from the project title and the project description:

[CLS]What is your project, {PROJECT_TITLE}, about?
[RESPONSE_TOKEN]{PROJECT_DESCRIPTION}

Thus, by feeding the above query into the model, the model will determine how relevant the PROJECT_DESCRIPTION is to the question. The code for the assessment is as follows:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rel_tokenizer = AutoTokenizer.from_pretrained("tinkoff-ai/response-quality-classifier-large")
rel_model = AutoModelForSequenceClassification.from_pretrained("tinkoff-ai/response-quality-classifier-large")

query = f"""[CLS]What is your project, {title}, about?
[RESPONSE_TOKEN]{description}"""
inputs = rel_tokenizer(query, max_length=128, add_special_tokens=False, truncation=True, return_tensors='pt')
with torch.inference_mode():
    logits = rel_model(**inputs).logits
    probas = torch.sigmoid(logits)[0].cpu().detach().numpy()
# the model outputs two scores; the first one is the relevance score
relevance, _ = probas

In our implementation, applications with relevance < 0.1 are desk-rejected without any further consideration.
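As a minimal sketch, assuming the computed scores are collected into a desc_relevance column of the grants dataframe (matching the field name used later in this analysis), the rule reduces to a simple flag:

# hypothetical sketch: flag applications whose description relevance
# falls below the desk-rejection threshold of 0.1
grants["desk_rejected"] = grants["desc_relevance"] < 0.1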

Website Checking#

We use WHOIS to check whether the website's domain is actually registered, and to query the registration information of the website.

import whois

def get_website_whois_info(urls):
    """
    query the whois info of given urls

    :param urls: the urls of the websites
    :return: the whois info of the websites
    """
    results = []

    for url in urls:
        try:
            whois_data = whois.whois(url)
            results.append(whois_data)
        except whois.parser.PywhoisError:
            results.append(None)

    return results

After getting the WHOIS information of the provided website, we check whether the domain is (as sketched below):

  1. already expired;
  2. expiring within 90 days;
  3. expiring within 1 year.
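A minimal sketch of these checks, assuming the record returned by python-whois exposes an expiration_date attribute (it may hold a single datetime or a list, depending on the registrar):

import datetime

def get_expiry_flags(whois_data):
    # some registrars return a list of expiration dates; take the first one
    expiry = whois_data.expiration_date
    if isinstance(expiry, list):
        expiry = expiry[0]
    now = datetime.datetime.now()
    return {
        'website_expired': expiry is not None and expiry < now,
        'website_expired_in_90_days': expiry is not None and expiry < now + datetime.timedelta(days=90),
        'website_expired_in_1_year': expiry is not None and expiry < now + datetime.timedelta(days=365),
    }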

Moreover, noticing that some projects use external links as their websites (e.g., github.io, twitter.com, youtube.com, notion.so, etc.), we use a simple pattern-matching classifier to determine whether the provided website is external.
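A minimal version of this classifier, assuming the set of external platforms listed above, could look like the following:

# hypothetical sketch: domains of well-known third-party platforms
EXTERNAL_DOMAINS = ('github.io', 'twitter.com', 'youtube.com', 'notion.so')

def is_external_url(url):
    # a website is considered external if it is hosted on a third-party platform
    return any(domain in url for domain in EXTERNAL_DOMAINS)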

GitHub Checking#

For an individual user, we check their contributions over the last year. This metric reflects their activity in the open-source community.

import json

from bs4 import BeautifulSoup
import requests

GITHUB_URL = 'https://github.com/'

def get_github_user_contributions(usernames):
    """
    Get a github user's public contributions of the last year.

    :param usernames: A string or sequence of github usernames.
    """
    contributions = {'users': [], 'total': 0}

    if isinstance(usernames, str):
        usernames = [usernames]

    for username in usernames:
        # if the username is a url starting with 'https://', extract the username
        if username.startswith('https://') or username.startswith('http://'):
            username = username.split('/')[3]

        response = requests.get('{0}{1}'.format(GITHUB_URL, username))

        if not response.ok:
            contributions['users'].append({username: dict(total=0)})
            continue

        # the yearly contribution count is rendered in the first <h2> of the
        # contributions section, e.g. "1,234 contributions in the last year"
        bs = BeautifulSoup(response.content, "html.parser")
        total = bs.find('div', {'class': 'js-yearly-contributions'}).findNext('h2')
        count = int(total.text.split()[0].replace(',', ''))
        contributions['users'].append({username: dict(total=count)})
        contributions['total'] += count

    return json.dumps(contributions, indent=4)
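The function accepts both bare usernames and full profile URLs, matching how applicants fill in the github_user field. For example (the username below is purely illustrative):

# illustrative usage; 'octocat' is a placeholder username
print(get_github_user_contributions(['octocat', 'https://github.com/octocat']))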

For an organization, we check the total number of commits across all public repositories under the organization over the last year. This metric reflects the organization's activity in the open-source community.

import datetime
from github import Github

github = Github()

def get_github_org_contributions(orgs):
    """
    Get a github organization's public contributions of the last year.

    :param orgs: A string or sequence of github organizations.
    """
    contributions = {'orgs': [], 'total': 0}

    if isinstance(orgs, str):
        orgs = [orgs]

    for org in orgs:
        all_repos = github.get_organization(org).get_repos()
        total_commits = 0
        for repo in all_repos:
            commits = repo.get_commits(since=datetime.datetime.now() - datetime.timedelta(days=365))
            total_commits += commits.totalCount
        contributions['orgs'].append({org: dict(total=total_commits)})
        contributions['total'] += total_commits

    return json.dumps(contributions, indent=4)
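Note that the unauthenticated Github() client used above is subject to strict API rate limits. For a dataset of this size, you will likely want to authenticate with a personal access token (the token string below is a placeholder):

# authenticate with a personal access token to raise the API rate limit
github = Github('<YOUR_PERSONAL_ACCESS_TOKEN>')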

Clustering#

After data pre-processing, there are 7 fields left for the later analysis:

		"github_user_contributions",
    "project_github_contributions",
    "website_expired",
    "website_expired_in_90_days",
    "website_expired_in_1_year",
    "external_url",
    "desc_relevance"

That is, our dataset currently has 7 dimensions, which is hard to visualize as we live in a 3-dimensional world. Consequently, before clustering, let's reduce the dimensionality of our dataset. We first normalize the dataset using MinMaxScaler():

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
grants_scaled = scaler.fit_transform(grants)

Then, to explore possible patterns, let's use t-SNE to reduce the dimensionality, as it is currently one of the state-of-the-art methods for visualizing high-dimensional data.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=True)
grants_reduced = tsne.fit_transform(grants_scaled)

# visualize
sns.scatterplot(x=grants_reduced[:, 0], y=grants_reduced[:, 1])
plt.show()

Features of different grant applications

As can be seen from the figure, the grants are divided into different groups. Let's then use the DBSCAN algorithm to cluster them. Here we use DBSCAN instead of K-Means because DBSCAN is a density-based clustering algorithm that can detect outliers, and it does not require specifying the number of clusters k in advance.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# DBSCAN clustering
db = DBSCAN(eps=1.0, min_samples=5).fit(grants_reduced)
labels = db.labels_

# visualize
sns.scatterplot(x=grants_reduced[:, 0], y=grants_reduced[:, 1],
                hue=labels, palette=sns.color_palette("hls", len(set(labels))))
plt.show()

Clustered grant applications

As demonstrated by the figure, each grant has been assigned a label indicating its cluster. Note that the -1 label means DBSCAN considers the grant an outlier. The clustering result validates the effectiveness of our processed dataset: there are hidden patterns waiting to be discovered.

Define “Eligible”#

Now that we know the relations between different applications (as the clustering shows), we want to assign labels to the clusters so that we can classify an application by determining which cluster it belongs to. In fact, it is enough if our classifier can tell us whether an application is eligible or not, so the label can be as simple as a boolean. To achieve this, we first need to define what "eligible" means. Consequently, in this section, we will manually collect a tiny set of eligible projects (about 10), and later we will use semi-supervised learning to automatically classify the grant applications in our dataset.

Ten well-known projects, including Uniswap, AAVE, Curve, Gnosis Safe, etc., are manually collected as positive data. The dataset is provided below for reproducibility:

positive_applications.csv

After pre-processing the positive dataset in exactly the same way as the Fantom and UNICEF datasets, we stack the positive samples on top of the unlabeled ones and re-run the dimensionality reduction, as sketched below.
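A minimal sketch of this step, assuming positive_grants holds the 10 pre-processed positive samples with the same 7 fields as grants:

import numpy as np

# stack the 10 positive samples on top of the unlabeled grants so that
# X[:10] recovers the positives after dimensionality reduction
combined = np.vstack([positive_grants, grants])
combined_scaled = scaler.fit_transform(combined)
X = TSNE(n_components=2, verbose=True).fit_transform(combined_scaled)

The visualization of all data samples is shown in the following figure. Here, label 0 stands for unlabeled data, and label 1 stands for positive data.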

Positive and unlabeled data

As can be seen, the positive data are clearly closer to some groups than to others, suggesting they can help the machine learning model understand what kind of application should be eligible and what should not.

Learning From Positive and Unlabeled Data (PU-Learning)#

In this section, we will train a simple classifier to classify the unlabeled data using the few positive samples. There are already many algorithms for learning from plenty of unlabeled data and a few positive samples. However, as we already have a pretty decent clustering after dimensionality reduction, we can build our own simple classifier using label propagation and majority voting. Specifically, we first calculate the L2 distance between every positive sample and the centroid of every cluster. Then, every positive sample votes for the cluster closest to it. Finally, we take the top-k voted clusters as eligible and label the unlabeled data accordingly. The implementation is as follows:

import numpy as np

# X holds the dimensionality-reduced samples, with the 10 positives stacked
# first, and labels holds the DBSCAN cluster assignments

# calculate the centroid of each cluster, skipping outliers (-1)
centers = {}
for label in set(labels):
    if label == -1:
        continue
    centers[label] = np.mean(X[labels == label], axis=0)

# every positive sample votes for its nearest cluster
votes = {}
positives = X[:10]
for positive in positives:
    dists = {}
    for label, point in centers.items():
        dist = np.linalg.norm(positive - point)
        dists[label] = dist
    min_label = min(dists, key=dists.get)
    if min_label not in votes:
        votes[min_label] = 1
    else:
        votes[min_label] += 1

# top-k votes
eligible_clusters = sorted(votes, key=votes.get, reverse=True)[:4]
new_labels = [int(label in eligible_clusters) for label in labels]

The estimation result is visualized in the following figure, where label 0 means the positive data we collected manually, label 1 means estimated negative applications from the unlabeled data, and label 2 means estimated positive applications from the unlabeled data.

Final clustered results

The estimated negative applications are shown as follows. As can be seen, projects 114, 116, and 117 are clearly test applications, and our algorithm successfully classifies them as negative, validating its effectiveness. Moreover, after manually inspecting the other estimated negative projects, we find most of them have low GitHub activity and unofficial/external websites, suggesting that grant reviewers may need to pay extra attention to them.

id                                       title  ... desc_relevance
2                             Just Brew It DAO  ...       0.762122
9                         The Sterling project  ...       0.640220
23           Validator Node Encouragement Fund  ...       0.668328
29                                       Mowse  ...       0.910632
30                           Crypto Policy DAO  ...       0.814382
31                               Racing Snails  ...       0.313622
48                   A Fantoman & Fantomonstre  ...       0.470715
49                                 Grey Market  ...       0.890565
57                              ALL IN FINANCE  ...       0.584226
58                               Planet Keeper  ...       0.714447
64                               Depeg Finance  ...       0.612105
69                               Fantom Nobles  ...       0.407439
74                               Fantom Italia  ...       0.701125
100                                  inDemniFi  ...       0.657283
101                      JPGs Against Humanity  ...       0.873490
111  Pixframe Studios - Transforming Education  ...       0.888190
114                     Daniele's Test Project  ...       0.060001
116                                        NaN  ...       0.000000
117                                       Test  ...       0.230208