When COVID-19 hit, many behavioral scientists had a way to keep their research running: Move it online. The pandemic boosted an already growing trend of studies conducted via online platforms, among the most popular of which is Amazon’s Mechanical Turk (MTurk). The service charges “requesters” a commission to crowdsource tasks—such as completing a survey or solving a puzzle—to remote workers paid for each one. Web of Science data show publications mentioning MTurk as a keyword went from 34 in 2010 to a peak of 989 in 2021.

But as MTurk’s popularity surged, so, too, did the number of researchers sounding the alarm about problems with the data it generates. Now, a new study reports that responses to 96% of the item pairs in a mock survey on MTurk made no sense, with participants agreeing equally with contradictory statements such as “I like order” and “I crave chaos.”

The unreviewed study, posted to the preprint server PsyArXiv last month, is a “clear, easy demonstration” of well-known problems with MTurk, says Michael Chmielewski, a personality psychologist at Southern Methodist University who was not involved with the paper. Although researchers have been calling attention to these problems for years, Chmielewski says in his work as a journal editor and reviewer he still encounters many papers that use MTurk without adequately guarding against low-quality data.

The study should “hopefully move the needle a little bit” on helping more researchers realize that high-quality online studies are a lot more complex than simply posting a survey on MTurk, adds Dave Hauser, a personality and social psychologist at Queen’s University.

Launched in 2005, MTurk started out as a reasonably reliable source of research data. It and other platforms offer an “incredible opportunity,” says University College London cognitive psychologist Jennifer Rodd. They allow for studies with larger samples of participants and more diversity than the “tiny sliver of humanity” a researcher can access directly.

But Chmielewski says he and other researchers began to grumble about a perceived decline in data quality around 2018. He repeated a study on parent personality and child development that he and co-authors had previously run on MTurk in 2016 and 2017 and found a much higher proportion of participants in the 2018 study failed quality checks, such as inconsistently reporting their children’s ages or birthdays. Chmielewski suspects early adopters of MTurk were more conscientious, and the platform later drew in more participants seeking quick income.

Another change is the growing capability and accessibility of artificial intelligence chatbots, which participants can now use to cut the time and mental effort of responding. Robert West, a computer scientist at the Swiss Federal Institute of Technology Lausanne, found that approximately one-third of participants in his team’s 2023 study of MTurk had used the outputs of a large language model (LLM) in their responses.

In the new study, Union College psychologist Cameron Kay asked 400 participants on MTurk to respond to 27 pairs of “semantic antonyms” such as “I do not sleep well” and “I sleep soundly.” A participant answering carefully should “agree” or “strongly agree” with one of these statements and “disagree” or “strongly disagree” with the other, leading to a negative correlation between the statements in each pair. But Kay found that for 26 of the 27 pairs, responses were positively correlated—participants gave similar responses to both, suggesting they weren’t paying attention or answering honestly. Kay’s conclusion: Results from MTurk “cannot be trusted.”
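
The antonym-pair check can be illustrated with a minimal Python sketch. It is not Kay's actual analysis code: the 1-to-5 agreement scale, the function names, and the six-person samples are all invented here for illustration. It computes a Pearson correlation for each pair and the share of pairs that come out positive.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of item scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when an item has no variance")
    return cov / sqrt(var_x * var_y)


def share_positive(pairs):
    """Fraction of antonym pairs whose two items are positively correlated."""
    correlations = [pearson(a, b) for a, b in pairs]
    return sum(r > 0 for r in correlations) / len(correlations)


# Invented scores, 1 = strongly disagree ... 5 = strongly agree.
# Careful sample: agreeing with "I do not sleep well" means
# disagreeing with "I sleep soundly".
careful_pair = ([5, 4, 1, 2, 5, 1], [1, 2, 5, 4, 1, 5])
# Careless sample: similar answers to both statements.
careless_pair = ([4, 4, 5, 3, 4, 5], [4, 5, 5, 3, 4, 4])

print(pearson(*careful_pair) < 0)    # negative, as expected of careful answers
print(pearson(*careless_pair) > 0)   # positive, the pattern the study flags
print(round(26 / 27 * 100))          # 26 of 27 pairs is about 96%
```

On real data, each pair would hold one score per participant for the two opposing statements, and `share_positive` would give the proportion of pairs that fail the check.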

Kay then applied exclusion criteria commonly used in research: removing data from anyone who took less than 2 seconds per question, gave the same response to many items in a row, or responded incorrectly to questions checking that participants were reading carefully, such as “Choose ‘strongly disagree’ for this item.” He was left with only 53% of his original data—and 67% of the opposing pairs still had nonsensical, positively correlated responses.
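
The three exclusion rules described above can be sketched in Python as follows. The article gives only the 2-second threshold, so everything else is an assumption made for illustration: the `Response` record and its field names, the use of average time per question, and the cutoff of eight identical answers in a row standing in for "many."

```python
from dataclasses import dataclass
from typing import List

MIN_SECONDS_PER_ITEM = 2.0  # threshold reported in the article
MAX_IDENTICAL_RUN = 8       # hypothetical: the article says only "many items in a row"


@dataclass
class Response:
    answers: List[int]            # scale answers in presentation order
    seconds_taken: float          # total time spent on the survey
    passed_attention_checks: bool  # e.g. chose "strongly disagree" when told to


def longest_run(answers):
    """Length of the longest streak of identical consecutive answers."""
    longest = current = 0
    previous = None
    for answer in answers:
        current = current + 1 if answer == previous else 1
        previous = answer
        longest = max(longest, current)
    return longest


def keep(response):
    """Return True if the response survives all three exclusion rules."""
    if not response.answers:
        return False
    seconds_per_item = response.seconds_taken / len(response.answers)
    too_fast = seconds_per_item < MIN_SECONDS_PER_ITEM
    straightlined = longest_run(response.answers) >= MAX_IDENTICAL_RUN
    return not (too_fast or straightlined) and response.passed_attention_checks


varied = [1, 5, 2, 4, 3, 5, 1, 2, 4, 3]
sample = [
    Response(varied, 60.0, True),     # careful participant
    Response(varied, 10.0, True),     # 1 second per question
    Response([3] * 10, 60.0, True),   # same answer throughout
    Response(varied, 60.0, False),    # failed an attention check
]
kept = [r for r in sample if keep(r)]
print(len(kept) / len(sample))  # share of the sample retained
```

As the study's results suggest, filters like these remove only the most obvious careless responses; a participant who answers slowly and varies answers at random would pass all three.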

Hauser cautions these results should not be used to tar all online studies. “Out-of-the-box MTurk doesn’t really do much vetting, which is why it’s kind of a Wild West,” he says. He notes that alternative platforms exist that filter out participants who routinely give poor data—a more effective form of quality control than researchers screening poor responses one by one.

A platform called the MTurk Toolkit, a complement to MTurk that excludes low-quality MTurk participants, yielded significantly higher quality data than MTurk alone, Hauser found in a 2022 study. Its maker, the online study company CloudResearch, also has a separate pool of participants it recruits itself, called CloudResearch Connect. In his demonstration of the failings of MTurk, Kay ran the same study on CloudResearch Connect and got negatively correlated responses for 78% of the antonym pairs—suggesting much higher data quality, even though he paid participants the same rate.

The online platform Prolific also puts participants through a stringent vetting process, and West has used it to explore ways to keep survey respondents from turning to LLMs, which they do at approximately the same rate as on MTurk. Disabling the copy-paste function helped—and, surprisingly, so did simply asking participants not to use LLMs; together, these tactics cut LLM use by about half. But nothing is foolproof, West says.

“We were so excited 15 years ago, that we could now just press a button and all of a sudden we have this flash workforce that shows up,” West says. “But the past years have taught us that we need to be careful with that.”

More: https://www.science.org/content/article/psychology-study-participants-recruited-online-may-provide-nonsensical-answers