Name Screening sandbox

At the core of this project lies a fundamental principle: to complement, rather than supplant, the tried-and-true name-matching methods known to specialists in the industry. Rather than reinventing the wheel, the approach enhances these established techniques with machine learning. The model captures the author's investigative insights directly in the matching process, which makes it possible to search more broadly while leaving the pruning of the results to the model.

One of the primary goals of the project is to introduce tuning specialists to machine learning techniques and offer insight into their practical application in the field.

Algorithms utilised in the project:

Initials Matching: This algorithm is specifically designed to handle cases where one of the strings contains initials rather than full names. It effectively compares individual letters from one string with their potential counterparts (words) in the other, enhancing the matching process by accounting for partial similarities.
"Incorrect Word Borders" Matching: This method addresses cases where a word may be split by multiple separators, such as spaces or punctuation marks. By identifying and correcting for these excessive word borders, the matching algorithm ensures more accurate and comprehensive results, particularly in unstructured or noisy datasets.
Phonetic Algorithm: This algorithm (Double Metaphone) accounts for variations in pronunciation and spelling. Under the umbrella of the project, the author explores creating a variation of Metaphone 3, tailored to Slavic word formation rules, to further improve accuracy and adaptability to specific linguistic contexts.
Name Variants Matching: Drawing on an open-source dataset of possible variations of people's names, this matching technique considers the different forms and aliases associated with an individual name. By covering a broad range of name variants, the algorithm can identify matches across diverse datasets and naming conventions.
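
As an illustration of the first two ideas, here is a minimal sketch using only the standard library. It is not the project's actual implementation: the function names (tokenize, initials_match, borders_match) and the matching rules are assumptions made for this example. Initials are treated as single-letter tokens that may stand for any word starting with that letter, and "incorrect word borders" are handled by comparing the names with every separator removed.

```python
import re


def tokenize(name: str) -> list:
    """Lower-case the name and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", name.casefold())


def initials_match(abbreviated: str, full: str) -> bool:
    """Hypothetical rule: every token of `abbreviated` must pair with a
    distinct token of `full`, either as an equal word or as its initial."""
    tokens = tokenize(abbreviated)
    if not tokens:
        return False
    remaining = tokenize(full)
    # Longest tokens first, so full words claim their exact counterparts
    # before single-letter initials start consuming candidates.
    for token in sorted(tokens, key=len, reverse=True):
        for candidate in remaining:
            is_initial = len(token) == 1 and candidate.startswith(token)
            if candidate == token or is_initial:
                remaining.remove(candidate)
                break
        else:
            # No unused counterpart was found for this token.
            return False
    return True


def borders_match(first: str, second: str) -> bool:
    """Hypothetical rule: names match if they are equal once all
    separators (spaces, hyphens, punctuation) are removed."""
    collapsed = "".join(tokenize(first))
    return bool(collapsed) and collapsed == "".join(tokenize(second))


if __name__ == "__main__":
    print(initials_match("J. R. R. Tolkien", "John Ronald Reuel Tolkien"))
    print(borders_match("Abd al-Rahman", "Abdal Rahman"))
```

In a setup like the one described here, such boolean (or scored) outcomes would more likely serve as features for the model than as final match decisions.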

The question of which model suits the task best came up at several stages of the project. As the number of features grew and the logic became more complex, some of the models became obsolete. As of this moment (nowhere near the end of this endeavour), the author has settled on a Histogram-based Gradient Boosting Classification Tree. Its main perks for the challenge at hand are:

Ability to use categorical features alongside numerical ones.
Support for missing values.
A decision tree at its root (the author hopes the reader acknowledges the pun). The resulting models are moderately easy to interpret.

The model is trained using a portion of names from the OFAC SDN List.

The project is written in Python. Its current form would have been vastly different without several key libraries:

scikit-learn (sklearn): A fundamental toolset for machine learning in Python, providing a wide range of algorithms and utilities.
rapidfuzz: A highly efficient library for fuzzy string matching, used to engineer the basic features for the model.
pandas: Essential for data manipulation and preprocessing tasks in machine learning workflows, offering intuitive and powerful data structures.
numpy: Integral for array operations and numerical computing in machine learning, providing essential functionality for data processing and model training.

These libraries were instrumental in shaping the project.

Soon

Search algorithm

Logic

Articles