Authorship attribution of comments in Portuguese extracted from Reddit

Authors

DOI:

https://doi.org/10.5335/rbca.v15i2.14045

Keywords:

Authorship Attribution, Natural Language Processing, Machine Learning, Social Networks, Text Mining

Abstract

Online interaction environments such as social networks carry large volumes of text that implicitly encode each user's writing style. Given the constant, intense flow of information through such systems, techniques are needed to decide which of two candidate authors wrote a given text, for example to prevent banned users from returning to a platform. This paper evaluates several approaches to authorship attribution based on natural language processing and machine learning, using comments in Portuguese extracted from the Reddit social network. It aims to update the authorship attribution literature for Portuguese, given the scarcity of recent work in this language. Several viable methods for binary authorship attribution are presented and assessed for feasibility according to their statistical significance. Two independent models within the same confidence interval reached an F1-score of 0.88 and an AUC of 0.94, one extracting textual features from BERTimbau embeddings and the other from word-level TF-IDF.
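To make the word-level TF-IDF idea concrete, the sketch below builds TF-IDF vectors for comments from two candidate authors and attributes a new comment to the author whose average vector is closest by cosine similarity. It is only an illustration under stated assumptions: the nearest-centroid classifier is a stand-in and not the model evaluated in the paper, the toy comments and the names `author_a` and `author_b` are invented, tokenization is plain whitespace splitting, and the BERTimbau embedding variant is not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokens; punctuation stays attached to words.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    # Sparse, L2-normalized TF-IDF vector; terms unseen in training are dropped.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def centroid(vectors):
    # Component-wise mean of a list of sparse vectors.
    total = Counter()
    for vec in vectors:
        for t, v in vec.items():
            total[t] += v
    return {t: v / len(vectors) for t, v in total.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(comment, idf, centroids):
    # Binary attribution: pick the candidate whose centroid is most similar.
    vec = tfidf(comment, idf)
    return max(centroids, key=lambda author: cosine(vec, centroids[author]))


# Invented toy comments (assumption): two candidates with distinct styles.
train = {
    "author_a": [
        "olha, eu acho que isso não faz sentido nenhum",
        "eu acho que o jogo foi muito ruim, olha",
    ],
    "author_b": [
        "todavia, cumpre observar que o argumento é válido",
        "cumpre observar, todavia, que o resultado é incerto",
    ],
}

all_docs = [doc for docs in train.values() for doc in docs]
idf = fit_idf(all_docs)
centroids = {
    author: centroid([tfidf(doc, idf) for doc in docs])
    for author, docs in train.items()
}

predicted = attribute("eu acho que isso foi ruim", idf, centroids)
print(predicted)
```

With real data, the centroid step would typically be replaced by a trained classifier, and metrics such as F1-score and AUC would be computed on held-out comments rather than on a single example.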

Author Biography

  • Luciano Antonio Digiampietri, Dr., Universidade de São Paulo

    Luciano Antonio Digiampietri is an associate professor at USP. He holds a bachelor's degree in Computer Science from the Universidade Estadual de Campinas (2002) and a doctorate in Computer Science from the Universidade Estadual de Campinas (2007). He is a research professor in the Information Systems bachelor's program (since 2008) and in the Graduate Program in Information Systems (since 2010) at the Escola de Artes, Ciências e Humanidades of the Universidade de São Paulo (EACH-USP). He has experience in Computer Science, with emphasis on Computational Biology, Databases, Artificial Intelligence, and Scientific Process Management, working mainly on the following topics: scientific workflows, bioinformatics, data provenance, automatic service composition, experiment traceability, e-government, and algorithms.

Published

2023-07-27

Issue

Section

Original Paper

How to Cite

[1]
2023. Authorship attribution of comments in Portuguese extracted from Reddit. Brazilian Journal of Applied Computing. 15, 2 (Jul. 2023), 1–10. DOI:https://doi.org/10.5335/rbca.v15i2.14045.