publications
These are my publications.
2026
- Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  2026
Content-based image retrieval (CBIR) systems enable users to search images based on visual content instead of relying on metadata. The text domain has benefited from vector search of representations created with unsupervised methods such as BERT. However, modern self-supervised learning (SSL) methods for vision are rarely reported in the CBIR literature, which instead relies on supervised models or multi-modal methods that align text and vision. We evaluate how the representations learned by modern SSL methods for vision perform under typical retrieval stacks that leverage vector databases and nearest neighbor search. Our evaluation reveals that the latent space geometry impacts approximate nearest neighbor (ANN) indexing. Specifically, the highly anisotropic, highly skewed representations produced by several modern SSL methods degrade the performance of partition-based and hashing-based search, even when their linear probe or k-NN accuracy is not affected. In contrast, representations with higher isotropy and local purity better satisfy the distance-based assumptions of ANN indexes, leading to improved semantic retrieval performance.
@misc{rodríguezbetancourt2026geometricanalysisselfsupervisedvision,
  title = {Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.IR},
  url = {https://arxiv.org/abs/2604.24469},
}
- Self-Supervised Representation Learning via Hyperspherical Density Shaping
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  2026
Modern self-supervised representation learning methods often rely on empirical heuristics that are not theoretically grounded. In this study, we propose HyDeS, a theoretically grounded method based on multi-view mutual information maximization within a hyperspherical space using Shannon differential entropy with a non-parametric von Mises-Fisher density estimator. We show that HyDeS biases the trained model towards focusing on foreground features of the images and performs well on segmentation tasks such as PASCAL VOC, while it lags in fine-grained classification. We provide a detailed analysis of the induced latent space geometry and learning dynamics, which can be used for designing other theoretically grounded self-supervised learning methods.
@misc{rodríguezbetancourt2026selfsupervisedrepresentationlearninghyperspherical,
  title = {Self-Supervised Representation Learning via Hyperspherical Density Shaping},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
  url = {https://arxiv.org/abs/2604.24498},
}
- Hypersolid: Emergent Vision Representations via Short-Range Repulsion
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  2026
A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.
@misc{rodríguezbetancourt2026hypersolidemergentvisionrepresentations,
  title = {Hypersolid: Emergent Vision Representations via Short-Range Repulsion},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
  url = {https://arxiv.org/abs/2601.21255},
}
- Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  2026
In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood. In this work, we explore the role of self-distillation within learning dynamics. Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.
@misc{rodríguezbetancourt2026randomlyinitializednetworkslearn,
  title = {Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  year = {2026},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2604.18390},
}
2024
- From cart to truck: meaning shift through words in English in the last two centuries
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  2024
This onomasiological study uses diachronic word embeddings to explore how different words represented the same concepts over time, using historical word data from 1800 to 2000. We identify shifts in the energy, transport, entertainment, and computing domains, revealing connections between language and societal changes. Our approach consisted of training diachronic word embeddings using word2vec with skip-gram and aligning them using orthogonal Procrustes. We discuss possible difficulties linked to the relationships the method identifies. Moreover, we look at the ethical aspects of interpreting results, highlighting the need for expert insights to understand the method’s significance.
@misc{betancourt2024carttruckmeaningshift,
  title = {From cart to truck: meaning shift through words in English in the last two centuries},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  year = {2024},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2408.16209},
}
- Teaching SQL New Tricks: Efficient Vector Indexing with Trigrams
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  JAIIO, Jornadas Argentinas de Informática, Sep 2024
With the growing use of vector embeddings in areas like natural language processing and recommendation systems, the need for effective storage and retrieval methods is increasingly important. However, deploying specialized databases for vector indexing can be challenging due to resource limitations or operational constraints. This paper introduces a novel approach that utilizes existing trigram indexes within SQL databases to efficiently manage vector embeddings. By adapting traditional relational databases to handle high-dimensional data, organizations can use their existing infrastructure without the need to invest in new database systems. This method reduces management complexity and costs associated with maintaining separate systems for vector data. We outline the process of converting vector embeddings for trigram indexing and evaluate the performance and recall through empirical analysis. This paper aims to offer a practical solution for researchers and practitioners seeking to integrate advanced vector-based queries into their current database systems, thereby enhancing the functionality and accessibility of vector embeddings in mainstream applications.
@article{Rodríguez-Betancourt_Casasola-Murillo_2024,
  title = {Teaching SQL New Tricks: Efficient Vector Indexing with Trigrams},
  volume = {10},
  url = {https://revistas.unlp.edu.ar/JAIIO/article/view/17913},
  number = {1},
  journal = {JAIIO, Jornadas Argentinas de Informática},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  year = {2024},
  month = sep,
  pages = {150--157},
}
2023
- Exploring the Limits of Large Language Models for Word Definition Generation: A Comparative Analysis
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  In 2023 XLIX Latin American Computer Conference (CLEI), 2023
In this paper, we explore the ability of large language models (LLMs) to generate word definitions for newly invented words in Spanish through the task of Unknown Definition Modeling. The main goal of our study is to determine the extent to which LLMs can abstract meaning from context and compare the performance of different models for this task. To conduct our analysis, we created a dataset of 20 made-up words, usage examples, and their definitions in Spanish. We then evaluated several LLMs, including OpenAI GPT-3.5-turbo, OpenAI GPT-3, and Google Flan-T5, using automatic evaluation based on cosine similarity of sentence embeddings and qualitative human evaluation on a 4-point Likert scale. Our findings indicate that larger models tend to generate better definitions than smaller models, with the performance of the models generally aligning with their size. This study contributes to our understanding of LLMs’ strengths and weaknesses in generating definitions for unknown words, and offers valuable insights for future research and applications in natural language processing.
@inproceedings{10346136,
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  booktitle = {2023 XLIX Latin American Computer Conference (CLEI)},
  title = {Exploring the Limits of Large Language Models for Word Definition Generation: A Comparative Analysis},
  year = {2023},
  volume = {},
  number = {},
  pages = {1--7},
  keywords = {Training;Analytical models;Waste materials;Computational modeling;Hate speech;Natural language processing;Internet;Natural languages;Linguistics;Natural language processing;Terminology;Dictionaries},
  doi = {10.1109/CLEI60451.2023.10346136}
}
- Analysis of the Semantic Shift in Diachronic Word Embeddings for Spanish Before and After COVID-19
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  CLEI Electronic Journal, Sep 2023
Words can shift their meaning across time. This study presents the results of an exploratory analysis of semantic shift in Spanish vocabulary using diachronic word embeddings. The diachronic data consists of a 2018 Spanish corpus, from before the COVID-19 outbreak, and a second corpus with documents from 2021. This paper addresses the construction of the diachronic Spanish word embeddings model, as well as the results obtained by the analysis using an unsupervised distance vector technique. The results allowed us to identify the topics with the most semantic shift between those periods.
@article{betancourt2023cleiej,
  title = {Analysis of the Semantic Shift in Diachronic Word Embeddings for Spanish Before and After COVID-19},
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  journal = {CLEI Electronic Journal},
  volume = {26},
  number = {2},
  year = {2023},
  month = sep,
  doi = {10.19153/cleiej.26.2.4},
  url = {https://doi.org/10.19153/cleiej.26.2.4},
  keywords = {Linguistics, Natural Language Processing, Natural Languages, Pragmatics}
}
2022
- Analysis of Semantic Shift Before and After COVID-19 in Spanish Diachronic Word Embeddings
  Esteban Rodríguez-Betancourt and Edgar Casasola-Murillo
  In 2022 XLVIII Latin American Computer Conference (CLEI), 2022
Words can shift their meaning across time. This case study presents the results of an exploratory analysis of semantic shift in Spanish vocabulary using diachronic word embeddings. The diachronic data consists of a 2018 Spanish corpus, from before the COVID-19 outbreak, and a second corpus with documents from 2021. We focused on the semantic shift of three topics: COVID-19, masks, and vaccines. This paper addresses the construction of the diachronic Spanish word embeddings model, as well as the results obtained by the analysis using an unsupervised distance vector technique. The results allowed us to identify shifts related to the increase in COVID-19 content.
@inproceedings{9959896,
  author = {Rodríguez-Betancourt, Esteban and Casasola-Murillo, Edgar},
  booktitle = {2022 XLVIII Latin American Computer Conference (CLEI)},
  title = {Analysis of Semantic Shift Before and After COVID-19 in Spanish Diachronic Word Embeddings},
  year = {2022},
  volume = {},
  number = {},
  pages = {1--9},
  keywords = {COVID-19;Vocabulary;Analytical models;Computational modeling;Semantics;Vaccines;Linguistics;natural language processing;natural languages;pragmatics},
  doi = {10.1109/CLEI56649.2022.9959896}
}
2019
- Deep Neural Network Comparison for Spanish Tweets Polarity Classification
  Esteban Rodríguez-Betancourt, Pablo Sauma-Chacón, and Edgar Casasola-Murillo
  In 2019 XLV Latin American Computing Conference (CLEI), 2019
Two deep neural network models were compared on the task of polarity classification of Spanish text retrieved from social networks. For each model, accuracy, precision, recall, and F1 were calculated over a particular corpus. The effect of adding Gaussian noise to the inputs on the classifier results was also evaluated.
@inproceedings{9073947,
  author = {Rodríguez-Betancourt, Esteban and Sauma-Chacón, Pablo and Casasola-Murillo, Edgar},
  booktitle = {2019 XLV Latin American Computing Conference (CLEI)},
  title = {Deep Neural Network Comparison for Spanish Tweets Polarity Classification},
  year = {2019},
  volume = {},
  number = {},
  pages = {1--6},
  keywords = {Computational modeling;Biological neural networks;Machine learning;Google;Convolutional neural networks;Task analysis;Social network services;Sentiment Analysis;Polarity;Neural Networks},
  doi = {10.1109/CLEI47609.2019.235083}
}