Re-identifying banned commenters on social media with machine learning
Researchers at Johns Hopkins University have developed a deep metric learning approach to identify online commenters who may have previously had accounts suspended, or who may be using multiple accounts to astroturf or otherwise manipulate good-faith online communities such as Reddit and Twitter.
The approach, presented in a new paper led by NLP researcher Aleem Khan, does not require input data to be annotated either automatically or manually, and improves on the results of previous attempts, even when only small samples of text are available, and where the text was not present in the dataset at training time.
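By way of illustration, re-identification under a metric-learning setup of this kind can be reduced to a nearest-neighbour search over account embeddings. The sketch below is not the authors' pipeline: it assumes a hypothetical 512-dimensional embedding produced by a trained encoder, and the vectors and account names are placeholders.

```python
# Minimal sketch: re-identification as nearest-neighbour search over account
# embeddings (dimensions, vectors and account names are placeholders).
import numpy as np

def cosine_sim(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of gallery vectors."""
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return gallery @ query

# One embedding per known (e.g. previously banned) account, produced by the
# trained encoder from that account's comment history.
gallery = np.random.randn(1_000, 512)
gallery_ids = [f"banned_user_{i}" for i in range(1_000)]

# Embedding of a new account's (possibly short) comment history.
query = np.random.randn(512)

scores = cosine_sim(query, gallery)
best = int(np.argmax(scores))
print(f"Closest known account: {gallery_ids[best]} (cosine similarity {scores[best]:.3f})")
```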
The system offers a simple data augmentation scheme, with embeddings of varying sizes trained on a large-volume dataset containing over 300 million comments spanning one million different user accounts.
The framework, based on Reddit usage data, takes into account text content, sub-Reddit placement, and posting time. The three factors are combined through various embedding methods, including one-dimensional convolutions and linear projections, and are assisted by an attention mechanism and a max pooling layer.
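The paper does not publish reference code, but a toy PyTorch module along these lines illustrates how the three factors might be embedded, convolved, max-pooled and attention-pooled into a single account vector. All dimensions, vocabulary sizes and the exact wiring below are assumptions, not the authors' architecture.

```python
# Illustrative sketch: each comment is encoded from its subword tokens, its
# subreddit and its posting time, and an attention layer pools the per-comment
# vectors into one account embedding. Sizes are made-up placeholders.
import torch
import torch.nn as nn

class CommentEncoder(nn.Module):
    def __init__(self, vocab=32000, n_subreddits=50000, n_time_buckets=168, dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)   # 1-D convolution over subwords
        self.sub_emb = nn.Embedding(n_subreddits, dim)              # sub-Reddit placement
        self.time_emb = nn.Embedding(n_time_buckets, dim)           # bucketed posting time
        self.proj = nn.Linear(3 * dim, dim)                         # linear projection of the three factors
        self.attn = nn.Linear(dim, 1)                               # attention over comments

    def forward(self, tokens, subreddit, time_bucket):
        # tokens: (comments, subwords); subreddit, time_bucket: (comments,)
        x = self.tok_emb(tokens).transpose(1, 2)         # (comments, dim, subwords)
        x = torch.relu(self.conv(x)).max(dim=2).values   # max pooling over subword positions
        feats = torch.cat([x, self.sub_emb(subreddit), self.time_emb(time_bucket)], dim=-1)
        per_comment = torch.relu(self.proj(feats))       # (comments, dim)
        weights = torch.softmax(self.attn(per_comment), dim=0)
        return (weights * per_comment).sum(dim=0)        # single account embedding

encoder = CommentEncoder()
account_vec = encoder(torch.randint(0, 32000, (10, 20)),   # 10 comments, 20 subwords each
                      torch.randint(0, 50000, (10,)),
                      torch.randint(0, 168, (10,)))
print(account_vec.shape)  # torch.Size([256])
```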
Although the system focuses on the text domain, the researchers say the approach could be translated to video or image analysis, since the underlying algorithm operates on high-level frequency occurrences regardless of the varying input lengths of the training data points.
Avoiding “topic drift”
One trap that research of this nature can fall into, and which the authors expressly addressed in the design of the system, is an excessive emphasis on the recurrence of particular topics or themes across posts from different accounts.
While a user may indeed write repetitively or iteratively within a particular stream of thought, the topic is likely to evolve and “drift” over time, devaluing its use as a key to identity. The authors characterize this potential trap as “being right for the wrong reasons” – a pitfall previously studied at Johns Hopkins.
Training methodology
The system uses mixed precision training, an innovation introduced in 2018 by Baidu and NVIDIA, which cuts memory requirements roughly in half by using half-precision floats: 16-bit floating point values instead of 32-bit values. The model was trained on two V100 GPUs, with an average training time of 72 hours.
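For readers unfamiliar with the technique, the snippet below sketches mixed precision training using PyTorch's automatic mixed precision utilities. The paper does not specify its tooling, so the framework choice, model and data here are stand-ins.

```python
# Hedged sketch of mixed-precision training with PyTorch AMP
# (model, data and hyperparameters are placeholders; requires a CUDA GPU).
import torch

model = torch.nn.Linear(512, 256).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid FP16 underflow

for step in range(100):
    x = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 256, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in 16-bit where safe
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # gradients computed on the scaled loss
    scaler.step(opt)
    scaler.update()
```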
The scheme uses a pared-down text encoding, with convolutional encoders limited to windows of two to four subwords. Although phrases of this nature average out at a maximum of five subwords, the researchers found that this economy had little impact on classification performance, and that increasing the window to a maximum of five subwords in fact degraded classification accuracy.
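A TextCNN-style encoder whose convolution windows span two to four subwords, as described, might look roughly like the sketch below; the embedding size and vocabulary are assumptions.

```python
# Sketch of a TextCNN-style encoder with convolution windows of 2-4 subwords
# (hyperparameters are assumptions, not the paper's values).
import torch
import torch.nn as nn

class SubwordCNN(nn.Module):
    def __init__(self, vocab=32000, dim=256, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(nn.Conv1d(dim, dim, k) for k in kernel_sizes)

    def forward(self, tokens):                        # tokens: (batch, subwords)
        x = self.emb(tokens).transpose(1, 2)          # (batch, dim, subwords)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=-1)              # (batch, dim * len(kernel_sizes))

enc = SubwordCNN()
print(enc(torch.randint(0, 32000, (8, 20))).shape)    # torch.Size([8, 768])
```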
The dataset
The researchers derived a 300-million-post dataset, dubbed the Million User Dataset (MUD), from the 2020 Pushshift Reddit Corpus.
The dataset includes all posts from Reddit authors who published 100-1000 posts between July 2015 and June 2016. Sampling over time in this manner provides adequate history length for study and reduces the impact of sporadic spam messages that fall outside the scope of the research objectives.
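A filter of this kind is straightforward to reproduce against a local Pushshift dump. The sketch below assumes the standard Pushshift comment fields `author` and `created_utc`, and a hypothetical input file name.

```python
# Illustrative filter for building a MUD-style subset (file name is hypothetical;
# column names follow the standard Pushshift comment schema).
import pandas as pd

comments = pd.read_json("pushshift_comments.jsonl", lines=True)

start = pd.Timestamp("2015-07-01").timestamp()
end = pd.Timestamp("2016-07-01").timestamp()
window = comments[(comments["created_utc"] >= start) & (comments["created_utc"] < end)]

counts = window.groupby("author").size()
keep = counts[(counts >= 100) & (counts <= 1000)].index      # authors with 100-1000 posts
mud_subset = window[window["author"].isin(keep)]
print(f"{mud_subset['author'].nunique()} authors, {len(mud_subset)} comments retained")
```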
Results
The image below shows a cumulative improvement in results as classification accuracy is tested at hourly intervals during training. After six hours, the system surpasses the baselines achieved by previous related initiatives.
In an ablation study, the researchers found that removing the sub-Reddit feature from the workflow had surprisingly little impact on classification accuracy, suggesting that the system generalizes very effectively, with robust features.
Posting frequency as a re-identification signature
It also indicates that the framework is highly transferable to other commenting or posting systems where only the text content and the date/time of publication are available – and, in essence, that the temporal frequency of posting is in itself a valuable collateral indicator of the actual text content.
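As a purely illustrative example of how posting times alone could act as a signature, an hour-of-week activity histogram per account can be compared with cosine similarity. This is not the paper's actual feature construction; the timestamps below are random placeholders.

```python
# Sketch: posting-time signatures as normalised hour-of-week histograms,
# compared with cosine similarity (illustrative only).
import numpy as np

def hour_of_week_profile(timestamps_utc: np.ndarray) -> np.ndarray:
    """Normalised 168-bin histogram (7 days x 24 hours) of posting activity."""
    hours = ((timestamps_utc // 3600) % (24 * 7)).astype(int)
    hist = np.bincount(hours, minlength=168).astype(float)
    return hist / max(hist.sum(), 1.0)

def profile_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Placeholder timestamps for two accounts suspected to belong to the same person.
acct_a = np.random.randint(1_435_000_000, 1_467_000_000, size=500)
acct_b = np.random.randint(1_435_000_000, 1_467_000_000, size=300)
print(profile_similarity(hour_of_week_profile(acct_a), hour_of_week_profile(acct_b)))
```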
The researchers note that attempting the same estimation within the content of a single sub-Reddit poses a greater challenge, since the sub-Reddit itself serves as a topic proxy, and an additional scheme would arguably be needed to fulfill this role.
The study was nonetheless able to obtain promising results under these restrictions, with the sole caveat that the system works best at high volumes, and may have greater difficulty re-identifying users when the volume of messages is low.
Developing the work
Unlike many supervised learning initiatives, the features of the Hopkins re-identification scheme are discrete and robust enough that system performance improves notably as the volume of data increases.
The researchers express interest in developing the system by adopting a more granular approach to the analysis of posting times, since the often predictable schedules of rote spammers (automated or not) would likely be identified by such an approach, which could either more effectively remove bot content from a study primarily targeting offending users, or help to identify automated content.