Reddit Migration Trajectories – Big Data and Artificial Intelligence in Migration Research

24.09.2025, in Practices

With the advent of big data, new approaches in migration studies have proliferated. However, they remain predominantly quantitative, focusing heavily on geospatial data and often missing the human stories and complex experiences behind each migration journey. Some researchers take a more narrative and qualitative approach, exploring the stories people construct about their journeys, although this kind of data remains limited. Data extraction with Large Language Models (LLMs) makes it possible to combine these two approaches by leveraging their respective strengths.

Reddit is a unique platform where people discuss almost everything in dedicated communities, including moving from one place to another. This makes it very useful for researchers who want to understand why and how people decide to move between countries: they can learn from these real-time conversations without interfering with them.

For a long time, users have been discussing their wishes or plans to move in Reddit communities such as “IWantOut” and country-specific ones such as “AskSwitzerland.” These conversations provide a wealth of material and valuable insight into why people want to leave their home countries.

The Challenge of Unstructured Data

But here is the issue: these posts are free-flowing and lack any set format; in other words, they are unstructured. No two users start their message the same way or give the information in the same order. A human reader can easily pick out all the relevant information, but for a conventional (pre-AI) computer program, this is nearly impossible.

While this is not a problem for qualitative approaches that analyze a small number of posts in depth, most researchers cannot exploit the full amount of available data: its sheer size makes going through all the posts prohibitively time-consuming.

Quantitative methods can automate the analysis of conversations at this scale, but they cannot extract much from the text itself without manual labelling, which is equally time-consuming. As a result, quantitative approaches usually ignore the text and focus on simpler measurable variables, or rely on natural language processing (NLP), a family of computer techniques that scan text to extract information such as common topics or overall sentiment.

This is where Artificial Intelligence models known as “Large Language Models” (LLMs) come in handy. They can make sense of natural language automatically, much as humans do. They can therefore extract people’s personal reasons and experiences of moving countries from any conversation and return the result in a structured form usable by statistical models.

Using a programming language and an “Application Programming Interface” (API), an access point to a software service that lets a program pull data automatically, it is possible to scan thousands or even millions of posts. Every LLM, proprietary or open, likewise offers a way to communicate with it programmatically.
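As an illustration, here is a minimal sketch in Python using the PRAW library, a common wrapper around Reddit’s API. The credentials and search keywords are placeholders, not the ones used in this project:

```python
import praw

# Credentials come from a (placeholder) application registered with Reddit;
# see Reddit's API documentation for how to obtain them.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="migration-research-sketch by u/your_username",
)

# Search a community for mobility-related keywords and keep the raw text.
posts = []
for submission in reddit.subreddit("AskSwitzerland").search(
    "moving OR relocating OR immigrating", limit=500
):
    posts.append(
        {"id": submission.id, "title": submission.title, "text": submission.selftext}
    )

print(f"Collected {len(posts)} posts")
```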

Using such a model can save hours of manually reading and labelling posts. However, these models are not perfect; they may sometimes be inconsistent, unpredictable, or just wrong. Unless we manually check everything, there is no sure way of knowing how often an AI model gets things wrong.
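A common safeguard, not specific to this project, is to hand-label a random sample of posts and measure how often the model agrees with the human coder. A minimal sketch, assuming two parallel lists of labels for the same posts:

```python
def agreement_rate(llm_labels, human_labels):
    """Share of posts where the LLM's label matches the human coder's label."""
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical example: 'migration_reason' labels for five hand-checked posts.
llm = ["work", "study", "family", "work", "lifestyle"]
human = ["work", "study", "work", "work", "lifestyle"]
print(f"Agreement: {agreement_rate(llm, human):.0%}")  # -> Agreement: 80%
```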

The Case of Reddit Users Moving to Switzerland

Here is an example of how this approach can combine quantitative and qualitative methods to trace the decision process of people moving to Switzerland. Free datasets of Reddit posts can be found online, or posts can be collected using Reddit’s API or third-party APIs.

After gathering all the posts containing specific keywords related to mobility to Switzerland in the “AskSwitzerland” community, we use an LLM to extract structured information from the unstructured content, such as the user’s age, sex, and level of education, their reason for migrating, their migration intention, their concerns, and so on.
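A minimal sketch of such an extraction step, here using the openai Python package (any provider’s chat API would do; the prompt wording and model name are illustrative, not the ones used in this project):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

EXTRACTION_PROMPT = """From the Reddit post below, extract the following fields
as JSON: age, sex, education, migration_reason, migration_intention, concerns.
Use null for any field the post does not mention.

Post: {post}"""

def extract_fields(post_text: str) -> dict:
    """Ask the model to turn one free-text post into a structured record."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(post=post_text)}],
        response_format={"type": "json_object"},  # request machine-readable output
    )
    return json.loads(response.choices[0].message.content)
```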

We then ask the model to return the result in a structured format that can be treated statistically (for instance, as a table). This is done through tailored instructions to the AI model. Using the statistical results, we then select interesting cases of users to follow, looking more closely at how they discuss their migration intentions and experiences over time.
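Once each post yields such a record, the records can be stacked into a table for statistical analysis. A sketch using pandas, reusing the hypothetical posts list and extract_fields function from the earlier snippets:

```python
import pandas as pd

# Turn each post into a structured record, keeping the post id for reference.
records = [extract_fields(p["text"]) | {"post_id": p["id"]} for p in posts]
df = pd.DataFrame(records)

# Simple descriptive statistics on the extracted fields.
print(df["migration_reason"].value_counts())
print(df["sex"].value_counts(dropna=False))  # missing values are informative too
```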

Thus, using an AI model within a mixed-methods design tailored to social media data allows us to tackle both the depth limitation of quantitative methods and the scale limitation of qualitative methods while bridging the two.

First Results: Strong Bias and Incomplete Data

The population of this Reddit community, that is, the people who participate in it, is mostly from the United States and the United Kingdom. They are highly educated, moving for job opportunities, and mostly working in IT and healthcare. This reveals a bias in the population: the people in this group do not represent everyone who migrates to Switzerland; they come mostly from a few countries and work in specific fields.

Furthermore, the most frequently cited concerns regarding Switzerland were work and residence permits, the cost of living, and finding accommodation.

However, it is important to highlight the sparseness of the data: not all information was available in each post. For instance, while reasons for migrating and concerns were often stated, age and sex were frequently omitted by users. This missing information can partly be filled in by collecting users’ posts over time.

Limitations and Issues

At the moment, despite their usefulness, LLMs bring several challenges, mainly related to reliability and privacy. These models can make many mistakes, which limits our confidence in their results. On the privacy side, Reddit encourages researchers to use its public data, but other social media platforms are not as open.

Moreover, many AI companies still use data sent to them to train their models, which raises privacy concerns. Fortunately, there are alternatives that offer better data protection, such as running open-source models on your own computer or using privacy-focused server providers.
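For instance, given sufficient hardware, the extraction can run entirely on one’s own machine with the Hugging Face transformers library, so the posts never leave the computer. A minimal sketch, with an illustrative open-weight model name:

```python
from transformers import pipeline

# Load an open-weight instruction model locally; the model name is illustrative,
# and any open instruction-tuned model of suitable size would work.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = (
    "From the Reddit post below, extract age, sex, and migration_reason as JSON.\n"
    "Post: Hi! I'm a 29-year-old nurse thinking about moving to Switzerland for work."
)
result = generator(prompt, max_new_tokens=200)
print(result[0]["generated_text"])
```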

Using AI models for data extraction is a game-changer for researchers working with social media. However, we still need to be aware of the limitations of both current AI models and the data itself.

Vestin Hategekimana is a Doctoral researcher at the nccr – on the move and the University of Geneva. He is part of the project « The Longitudinal Impact of Crises on Economic, Social, and Mobility-Related Outcomes: The Role of Gender, Skills, and Migration Status ».