About.

I am a PhD student in Computer Science at Columbia University, advised by Zhou Yu and Smaranda Muresan. My research primarily revolves around Natural Language Processing (NLP).

I'm broadly interested in the area where NLP meets Computational Social Science (CSS): in designing and leveraging computational methods to understand social aspects of language; in studying social phenomena—often at large scale—through what people say and how they say it; and in incorporating such insights into creating more social and more equitable language technologies.

My research is supported by an NSF Graduate Research Fellowship. I've previously interned at Amazon AWS AI, Microsoft Research, NASA Earth Sciences, Azure IoT, and at the University of Michigan (NSF REU). At Michigan, where I received my undergraduate degree, I had the fortune of working with (the amazing) David Jurgens, Joyce Chai, and Arunesh Sinha.




Publications.

I (mostly) publish under "Sky CH-Wang"; citations should refer to "S. CH-Wang". *Denotes equal contribution. +Denotes venues where the work has also appeared, beyond the main venue.


Affective Idiosyncratic Responses to Music
Sky CH-Wang, Evan Li, Oliver Li, Smaranda Muresan, Zhou Yu
Empirical Methods in Natural Language Processing (EMNLP) 2022
+ New Directions in Analyzing Text as Data (TADA) 2022
[PDF] [Code] [Twitter Thread]

Affective responses to music are highly personal. Despite consensus that idiosyncratic factors play a key role in listener affective responses, precisely measuring the marginal effects of these variables has proved challenging. To address this gap, we develop computational methods to measure affective responses to music from over 403M listener comments on a Chinese social music platform. Building on studies from music psychology in systematic and quasi-causal analyses, we test for musical, lyrical, contextual, demographic, and mental health effects that drive listener affective responses. Finally, motivated by the social phenomenon known as 网抑云 (wǎng-yì-yún), we identify driving factors of platform user self-disclosures, the social support they receive, and notable differences in discloser user activity.


Towards a Linguistics Free of "Native Speakerhood"
Annie Birkeland, Adeli Block, Justin Craft, Yourdanis Sedarous, Sky Wang, Gou Wu, Savithry Namboodiripad
In Preparation
+ Annual Meeting of the Linguistic Society of America (LSA) 2022
[Presentation]

"Native speaker," as a term and concept, has long been critiqued across fields such as linguistic anthropology, second language acquisition, and English language teaching (e.g. Holliday, 2006; Rothman & Treffers-Daller, 2014). However, within less peripheralized subdisciplines, such as psycholinguistics and generative approaches to grammar, this term is often used uncritically, with the behavior and knowledge of “native speakers” being centered as the main object of inquiry.

We trace the history of “native speaker,” showing how this term naturalizes and essentializes language and identity through a process of rhematization (Gal and Irvine 2019) whereby indexes (such as critical period, identity, age of acquisition, etc.) are interpreted as icons of “native speakerhood.” As a consequence, many studies in linguistics use the term without explicitly defining it; this vagueness systematically excludes disabled and racialized/ethnicized individuals and communities by reifying hegemonic language practices and contexts of learning/use.

We suggest “native speaker” is best treated as a linguistic and semiotic ideology, not an idealized category, and we argue that this ideological construct cannot be the central object of study in a Linguistics that is aiming to push past its oppressive origins. Rather than substituting the term with discrete labels that have similar colonial ideological underpinnings (e.g. “L1, L2, LX”, Dewaele, 2018; “semi-speaker”, Boltokova, 2017), we propose that all researchers use specifics to describe their participants’ language experience, as contextually relevant (Cheng et al., under review).

We propose that moving away from this term and the ideas surrounding it will (a) lead to more specified hypotheses about how language experience affects linguistic knowledge/behavior, (b) more accurately reflect the diverse linguistic repertoires held by all language users, (c) force our theoretical approaches to center non-normative language use (cf. Crip Linguistics, Henner 2021), and (d) be one step towards a more inclusive discipline.


Using Sociolinguistic Variables to Reveal Changing Attitudes Towards Sexuality and Gender
Sky CH-Wang, David Jurgens
Empirical Methods in Natural Language Processing (EMNLP) 2021
+ New Ways of Analyzing Variation (NWAV) 2022
[PDF] [Poster] [Code]

@inproceedings{ch-wang-jurgens-2021-using,
    title = "Using Sociolinguistic Variables to Reveal Changing Attitudes Towards Sexuality and Gender",
    author = "CH-Wang, Sky  and
      Jurgens, David",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.782",
    pages = "9918--9938",
    abstract = "Individuals signal aspects of their identity and beliefs through linguistic choices. Studying these choices in aggregate allows us to examine large-scale attitude shifts within a population. Here, we develop computational methods to study word choice within a sociolinguistic lexical variable{---}alternate words used to express the same concept{---}in order to test for change in the United States towards sexuality and gender. We examine two variables: i) referents to significant others, such as the word {``}partner{''} and ii) referents to an indefinite person, both of which could optionally be marked with gender. The linguistic choices in each variable allow us to study increased rates of acceptances of gay marriage and gender equality, respectively. In longitudinal analyses across Twitter and Reddit over 87M messages, we demonstrate that attitudes are changing but that these changes are driven by specific demographics within the United States. Further, in a quasi-causal analysis, we show that passages of Marriage Equality Acts in different states are drivers of linguistic change.",
}
                
Individuals signal aspects of their identity and beliefs through linguistic choices. Studying these choices in aggregate allows us to examine large-scale attitude shifts within a population. Here, we develop computational methods to study word choice within a sociolinguistic lexical variable—alternate words used to express the same concept—in order to test for change in the United States towards sexuality and gender. We examine two variables: i) referents to significant others, such as the word "partner" and ii) referents to an indefinite person, both of which could optionally be marked with gender. The linguistic choices in each variable allow us to study increased rates of acceptances of gay marriage and gender equality, respectively. In longitudinal analyses across Twitter and Reddit over 87M messages, we demonstrate that attitudes are changing but that these changes are driven by specific demographics within the United States. Further, in a quasi-causal analysis, we show that passages of Marriage Equality Acts in different states are drivers of linguistic change.


MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks
Cristian-Paul Bara*, Sky CH-Wang*, Joyce Chai
Empirical Methods in Natural Language Processing (EMNLP) 2021
[Outstanding Paper Award] [PDF] [Code]

@inproceedings{bara-etal-2021-mindcraft,
    title = "{M}ind{C}raft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks",
    author = "Bara, Cristian-Paul  and
      CH-Wang, Sky  and
      Chai, Joyce",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.85",
    pages = "1112--1125",
    abstract = "An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners{'} beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.",
}
                
An ideal integration of autonomous agents in a human world implies that they are able to collaborate on human terms. In particular, theory of mind plays an important role in maintaining common ground during human collaboration and communication. To enable theory of mind modeling in situated interactions, we introduce a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners' beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication. As a first step towards our goal of developing embodied AI agents able to infer belief states of collaborative partners in situ, we build and present results on computational models for several theory of mind tasks.


Building Action Sets in a Deep Reinforcement Learner
Yongzhao Wang, Arunesh Sinha, Sky CH-Wang, Michael Wellman
International Conference on Machine Learning and Applications (ICMLA) 2021
[PDF]

@inproceedings{wang2021drl,
  title={Building Action Sets in a Deep Reinforcement Learner},
  author={Wang, Yongzhao and Sinha, Arunesh and CH-Wang, Sky and Wellman, Michael},
  booktitle={2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)},
  year={2021},
  organization={IEEE}
}
                
In many policy-learning applications, the agent may execute a set of actions at each decision stage. Choosing among an exponential number of alternatives poses a computational challenge, and even representing actions naturally expressed as sets can be a tricky design problem. Building upon prior approaches that employ deep neural networks and iterative construction of action sets, we introduce a reward-shaping approach to apportion reward to each atomic action based on its marginal contribution within an action set, thereby providing useful feedback for learning to build these sets. We demonstrate our method in two environments where action spaces are combinatorial. Experiments reveal that our method significantly accelerates and stabilizes policy learning with combinatorial actions.


A Learning and Masking Approach to Secure Learning
Linh Nguyen, Sky Wang, Arunesh Sinha
Decision and Game Theory for Security (GameSec) 2018
[PDF]

@InProceedings{nguyen2018adv,
author="Nguyen, Linh
and Wang, Sky
and Sinha, Arunesh",
editor="Bushnell, Linda
and Poovendran, Radha
and Ba{\c{s}}ar, Tamer",
title="A Learning and Masking Approach to Secure Learning",
booktitle="Decision and Game Theory for Security",
year="2018",
publisher="Springer International Publishing",
address="Cham",
pages="453--464",
abstract="Deep Neural Networks (DNNs) have been shown to be vulnerable against adversarial examples, which are data points cleverly constructed to fool the classifier. In this paper, we introduce a new perspective on the problem. We do so by first defining robustness of a classifier to adversarial exploitation. Further, we categorize attacks in literature into high and low perturbation attacks. Next, we show that the defense problem can be posed as a learning problem itself and find that this approach is effective against high perturbation attacks. For low perturbation attacks, we present a classifier boundary masking method that uses noise to randomly shift the classifier boundary at runtime. We also show that both our learning and masking based defense can work simultaneously to protect against multiple attacks. We demonstrate the efficacy of our techniques by experimenting with the MNIST and CIFAR-10 datasets.",
isbn="978-3-030-01554-1"
}
                
Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples, which are data points cleverly constructed to fool the classifier. Such attacks can be devastating in practice, especially as DNNs are being applied to an ever-increasing number of critical tasks like image recognition in autonomous driving. In this paper, we introduce a new perspective on the problem. We do so by first defining robustness of a classifier to adversarial exploitation. Next, we show that the problem of adversarial example generation can be posed as a learning problem. We also categorize attacks in the literature into high and low perturbation attacks; well-known attacks like the fast-gradient sign method (FGSM) and our attack produce higher perturbation adversarial examples while the more potent but computationally inefficient Carlini-Wagner (CW) attack is low perturbation. Next, we show that the dual approach of the attack learning problem can be used as a defensive technique that is effective against high perturbation attacks. Finally, we show that a classifier masking method achieved by adding noise to a neural network's logit output protects against low distortion attacks such as the CW attack. We also show that both our learning and masking defense can work simultaneously to protect against multiple attacks. We demonstrate the efficacy of our techniques by experimenting with the MNIST and CIFAR-10 datasets.


Enriching the Twitter Stream: Increasing Data Mining Yield and Quality Using Machine Learning
William Teng, Arif Albayrak, John Corcoran, Sky Wang, Daniel Maksumov, Carlee Loeser, Long Pham
American Geophysical Union Fall Meeting (AGU) 2018
[PDF]

@INPROCEEDINGS{albayrak2018twitter,
       author = {{Albayrak}, A. and {Teng}, W.~L. and {Corcoran}, J. and {Wang}, S.~C. and {Maksumov}, D. and {Loeser}, C. and {Pham}, L.},
        title = "{Enriching the Twitter stream: increasing data mining yield and quality using machine learning}",
     keywords = {4325 Megacities and urban environment, NATURAL HAZARDS, 4335 Disaster management, NATURAL HAZARDS, 4341 Early warning systems, NATURAL HAZARDS, 4352 Interaction between science and disaster management authorities, NATURAL HAZARDS},
    booktitle = {AGU Fall Meeting Abstracts},
         year = 2018,
       volume = {2018},
        month = dec,
          eid = {NH43B-1043},
        pages = {NH43B-1043},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2018AGUFMNH43B1043A},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
                
Social media data streams are important sources of real-time and historical global information for science applications. At the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC), we are exploring the Twitter data stream for its potential in augmenting the validation program of NASA Earth science missions, specifically the Global Precipitation Measurement (GPM) mission. We have implemented a tweet processing infrastructure that outputs classified precipitation tweets. Inputs are "passive" tweets, along with a smaller number of tweets from "active" participants, i.e., those knowingly contributing to our effort. The "active" tweets, presumably of higher quality, enrich the Twitter stream. "Active" sources include data scraped from other social media (e.g., public Facebook posts) and data from existing crowdsourcing programs (e.g., mPING reports). In addition, there is likely relevant precipitation information in images and documents that are the end points of links often included in tweets. Information derived from these "active" sources could then be tweeted into the Twitter stream, thus enriching its quality. The objective of our current work is to mine these tweet-linked images and documents, using neural networks, to increase the information content and quality related to precipitation. For images, we classified them as either precipitation-related or not. For training and validation, we used images obtained via the Google custom search API. We created two models: (1) by training a simple Convolutional Neural Network and (2) by using transfer learning principles to adapt a pre-trained object recognition model. For documents, both those linked to tweets and the tweet contents, we trained Hierarchical Attention Networks to determine precipitation occurrence, type, and intensity. For training and validation, we used a keyword-filtered tweet data set labelled with ground truth data from Dark Sky (an API to retrieve weather-related labels) and the National Severe Storms Laboratory's Multi-Radar/Multi-Sensor (MRMS) system. Our results demonstrated the efficacy of our machine learning approaches for enriching the Twitter stream, to derive information potentially useful for validation of Earth science satellite data.


Site periodically updated. Last update: October 2022.