In the SLP department at Columbia University, I worked with the radicalization team to understand why radical (extreme) videos, particularly those spreading misinformation, are persuasive and attract followers. I focused on right-wing groups such as QAnon and formed the following hypotheses for why QAnon gained popularity: (1) sensational rhetoric and hate-mongering speech, both exhibited in QAnon conspiracy theories, drive more engagement from social media users and thus bring such posts to prominence; and (2) QAnon aligned itself strongly with certain groups, and “tacit” endorsements from figures like Trump and powerful supporters like Marjorie Taylor Greene, who won the Republican nomination in Georgia’s 14th congressional district, helped the theory gain popularity. This research was distinct from existing work, which uses only lexical and social-context analysis to study the spread of radical ideas.
Data Collection: I scraped most of the data from BitChute, a video hosting service known for far-right conspiracy content, for anti-QAnon videos, and from mainstream sources like DailyMotion and YouTube for pro-QAnon content.
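The scraping step can be sketched with Python's standard library. This is a minimal, hypothetical example: it assumes video pages are linked with a `/video/` URL scheme, which stands in for whatever structure the real sites use.

```python
from html.parser import HTMLParser


class VideoLinkParser(HTMLParser):
    """Collect hrefs that look like video pages (hypothetical '/video/' scheme)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/video/" in href:
                self.links.append(href)


def extract_video_links(html: str) -> list:
    """Return all video-page links found in an HTML listing page."""
    parser = VideoLinkParser()
    parser.feed(html)
    return parser.links
```

For example, feeding it a listing page containing `<a href="/video/abc123/">clip</a>` would return `["/video/abc123/"]`; a real crawler would then fetch each link and download the media.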
Data Labeling: We labeled persuasiveness based on likes and responses. Additionally, we agreed on a subjective ranking based on the classical rhetorical appeals: pathos, ethos, and logos.
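An engagement-based label of this kind can be computed mechanically. The weighting below is purely illustrative, not the scheme actually used:

```python
def engagement_score(likes: int, responses: int, views: int) -> float:
    """Hypothetical persuasiveness proxy: interactions per view,
    with responses weighted more heavily than likes since replying
    takes more effort than liking."""
    if views == 0:
        return 0.0
    return (likes + 2 * responses) / views
```

A video with 10 likes and 5 responses out of 100 views would score 0.2; thresholding or ranking these scores yields the engagement-based labels.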
Feature extraction: Using Parselmouth and MFCCs, I extracted acoustic and prosodic features from the above data. Initially, I came up with roughly 230 features, including n-grams and LIWC features. I also implemented a script that identifies files with background sound (e.g., music) and trims the dataset to include only files without it, improving the quality of my classifiers.
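One simple way to flag background sound, sketched here with only the standard library (the actual script may have used a different heuristic): speech recordings contain pauses between phrases, so a file whose audio frames are almost never quiet likely has continuous music underneath.

```python
import math


def frame_energies(samples, frame_size=400):
    """RMS energy of each non-overlapping frame of raw audio samples."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def has_background_sound(samples, frame_size=400, silence_ratio=0.2):
    """Heuristic flag: if fewer than `silence_ratio` of frames are
    near-silent (below 10% of peak frame energy), the pauses normally
    present in speech are being filled, suggesting background sound."""
    energies = frame_energies(samples, frame_size)
    if not energies:
        return False
    threshold = 0.1 * max(energies)
    quiet = sum(1 for e in energies if e < threshold)
    return quiet / len(energies) < silence_ratio
```

Files flagged this way would be dropped, leaving a cleaner speech-only dataset for the prosodic features.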
Classifiers:
At early stages, I experimented with SVM, decision tree, and logistic regression classifiers to classify videos as pro- or anti-QAnon. My goal at this stage was simply to identify the most important features. For example, the decision tree showed that the number of upvotes and mentions of “qanon” were most useful. These models could become more insightful or extensible when applied to a larger dataset.
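The feature-ranking idea behind reading off a decision tree can be sketched directly: rank features by information gain, the same criterion a tree uses to choose its top splits. The feature names below are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy reduction from splitting the labels on one feature."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def rank_features(rows, labels, feature_names):
    """Rank features by information gain, most informative first.
    `rows` is a list of dicts mapping feature name -> value."""
    gains = {
        name: information_gain([r[name] for r in rows], labels)
        for name in feature_names
    }
    return sorted(gains, key=gains.get, reverse=True)
```

A feature that perfectly separates pro from anti videos (e.g., a hypothetical `mentions_qanon` flag) gets gain 1.0 and ranks first, mirroring what the tree surfaced.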
I then designed an emotion classifier and ran it on the trimmed dataset. Initial findings suggested that pro-QAnon videos were generally classified as “happy,” while anti-QAnon ones were classified as “sad.”
On behalf of the Data Science Institute, I worked as an NLP researcher with the Earth Institute on developing an accurate spatial time-series database of flood damage in Bangladesh using news sources. The database is now being used for comparison with crowdsourced farmer reports of “bad years” of flooding, satellite flood observations, streamflow data, and rainfall data. We were ultimately able to reach an accuracy of 93.8% for flood event categorization. This novel approach to validating satellite data could improve data collection in developing countries, where data for smaller regions are sometimes limited.
Scraped news sources to create a training dataset of English news articles, and onboarded FIST researchers to code the location, date, and damage reported in online media and LexisNexis (an online media database) searches in order to identify flood-related articles.
Developed machine learning models that classify whether a tweet is flood-related, with the best results from decision tree classifiers. Experimented with these models and deep neural networks (BERT) to classify whether an article is flood-related, with promising early results from BERT. With the help of the ML models, we found that article headlines, leading paragraphs, and abstracts were the most useful fields.
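A text classifier of this kind can be sketched as a small multinomial naive Bayes baseline, the sort of model one would try before the tree and BERT variants. The training examples here are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counter in word_counts.values() for w in counter}
    return word_counts, label_counts, vocab


def classify(text, word_counts, label_counts, vocab):
    """Return the label with the highest smoothed log-probability."""
    total_docs = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in text.lower().split():
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best, best_score = label, score
    return best
```

Trained on a handful of flood and non-flood snippets, it assigns new text to whichever class shares more vocabulary; the real models replaced this with richer features and learned decision boundaries.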
Proposing a design for a cross-lingual learning framework to effectively train classifiers with limited training data. Since many of the more prominent and relevant media sources are in Bangla, this would enable our models to reach much higher accuracy.
My objective was to use computer vision and artificial intelligence techniques to identify pills and extract information from the labels on pill bottles. This project was applied to an app I co-founded that sorts and organizes an individual’s medical files. Given the data, it also creates reminders to take medicines on time, along with a description of each medicine and how it should be taken. However, a major flaw in the app’s design was that users had to manually enter data from a prescription into a form. Hence, we decided to write a program that extracts information from a prescription (typically found on pill bottles) in a useful format. Additionally, after interviewing some of our target users to understand how we could improve their medical lives, we decided to design an algorithm to identify pills based on their shapes, colors, and imprints.
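Once the label text has been OCR'd, the extraction step reduces to pattern matching. A minimal sketch, assuming simplified label formats (real prescription labels vary far more, and the patterns here are illustrative):

```python
import re


def parse_label(ocr_text: str) -> dict:
    """Pull dosage and frequency from OCR'd pill-bottle text.
    These two patterns are hypothetical examples; a production parser
    would handle many more phrasings and fields."""
    dosage = re.search(r"(\d+(?:\.\d+)?)\s*(mg|mcg|ml)\b", ocr_text, re.I)
    freq = re.search(r"(\d+)\s*times?\s*(?:a|per)\s*day", ocr_text, re.I)
    return {
        "dosage": f"{dosage.group(1)} {dosage.group(2).lower()}" if dosage else None,
        "times_per_day": int(freq.group(1)) if freq else None,
    }
```

For a label reading "AMOXICILLIN 500 mg Take 3 times a day," this yields a dosage of "500 mg" and a frequency of 3, which the app can feed directly into its reminder scheduler.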
Employed k-means clustering and template matching for shape and color recognition, and integrated Google Vision for imprint recognition.
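The color-recognition step can be illustrated with a plain k-means over RGB values: cluster a pill image's pixels and take the centroid of the largest cluster as the dominant color. This is a simplified stand-in (the real pipeline worked on pixels extracted with OpenCV):

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on RGB triples; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep old centroid if a cluster empties out
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids


def dominant_color(pixels, k=2):
    """Centroid of the most-populated cluster ~ the pill's dominant color."""
    centroids = kmeans(pixels, k)
    counts = [0] * k
    for p in pixels:
        idx = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        counts[idx] += 1
    return centroids[counts.index(max(counts))]
```

Matching the resulting color (and the template-matched shape) against a pill database then narrows the candidate identifications before imprint recognition.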
Engineered a novel prescription scanner that extracts medical information from pill bottles using OpenCV.