Understanding Author Gender Identification Through Biased Data Sets
Research Paper in Applied Machine Learning
In this paper, I examine how social media sites account for shifts in gender-related language when they parse and process text that is not stereotypically gendered. The fundamental question I pose is whether gender can be predicted from non-biased features and, if so, which linguistic features predict gender well without relying on traditionally gendered ones.
Based on my findings, the question then becomes whether these features are universal, and whether they can still predict gender as colloquialisms in text change. My methods include surveying current colloquialisms used by queer communities, to support my argument that the textual landscape of the Internet is ever-changing and does not reflect a gender binary. I use the AdaBoost algorithm because of its predictive accuracy, simplicity, and range of successful applications, and adapt it for gender identification. Experiments on gender-biased subsets of a large text corpus (the Blog Authorship Corpus) pinpoint the features that are most influential as gender discriminators.
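To make the role of AdaBoost concrete, the following is a minimal sketch of discrete AdaBoost over one-feature threshold stumps, with a simple way to read off which features the ensemble leans on. It is an illustration under stated assumptions, not the configuration used in the experiments: the feature names, the toy values, the number of rounds, and the influence measure (each feature's share of the total stump weight) are all hypothetical, and the two label values stand in for the corpus annotation.

```python
import math

def train_adaboost(X, y, n_rounds=5):
    """Discrete AdaBoost with one-feature threshold stumps; labels are +1 or -1."""
    n, n_features = len(X), len(X[0])
    weights = [1.0 / n] * n
    ensemble = []  # each entry: (alpha, feature index, threshold, polarity)
    for _ in range(n_rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(n_features):
            for t in sorted({row[j] for row in X}):
                for polarity in (1, -1):
                    preds = [polarity if row[j] > t else -polarity for row in X]
                    err = sum(w for w, p, label in zip(weights, preds, y) if p != label)
                    if best is None or err < best[0]:
                        best = (err, j, t, polarity, preds)
        err, j, t, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, polarity))
        # Up-weight misclassified examples, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * label * p)
                   for w, p, label in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if row[j] > t else -pol) for alpha, j, t, pol in ensemble)
    return 1 if score >= 0 else -1

def feature_influence(ensemble, n_features):
    """Share of total stump weight assigned to each feature (an assumed measure)."""
    totals = [0.0] * n_features
    for alpha, j, _, _ in ensemble:
        totals[j] += alpha
    s = sum(totals)
    return [v / s for v in totals]

# Hypothetical feature names and invented toy values, used only to exercise the code.
FEATURES = ["avg_sentence_length", "function_word_rate", "punctuation_rate"]
X = [
    [12.0, 0.10, 0.01],
    [15.0, 0.12, 0.00],
    [11.0, 0.11, 0.02],
    [14.0, 0.05, 0.01],
    [13.0, 0.04, 0.02],
    [16.0, 0.06, 0.00],
]
y = [1, 1, 1, -1, -1, -1]  # stand-ins for the two label values in the annotation

ensemble = train_adaboost(X, y, n_rounds=5)
influence = feature_influence(ensemble, len(FEATURES))
for name, share in zip(FEATURES, influence):
    print(f"{name}: {share:.2f}")
```

Ranking features by their share of stump weight is one plausible way to identify influential discriminators; on a real corpus the same loop would run over extracted linguistic features rather than hand-written numbers.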