How do you clean a very untidy data set of Freedom House country ratings, saved in an Excel sheet that violates many of the principles of data organization in spreadsheets described in this paper by Karl Broman and Kara Woo, but that is otherwise an invaluable source of data on freedom in the world? Data source: https://freedomhouse.org/content/freedom-world-data-and-resources The full code used in this post is available here. I would do this: Read in the file,
The climate protests in March 2019 mobilized over a million people around the globe. A team of social scientists from universities across Europe organized a survey of the #FridaysForFuture strike events on March 15 in 13 cities in nine countries. The report can be found here. A new wave of climate protests (and surveys) is planned for the end of September. Naturally, most participants at these protests are acutely aware of the environmental threats and motivated to take action.
This post was written during a research visit at the Department of Computer Science at Aalto University, Finland, supported by the Helsinki Institute for Information Technology. Perspective is an API that uses machine learning models to predict the impact of a comment on the conversation. One of the models predicts the extent to which the comment might be perceived as toxic. A toxic comment is defined as “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.
The International Social Survey Programme offers a wealth of data, with thematic modules repeated around every 10 years, and a solid and relatively stable block of socio-demographics. The data can be downloaded from the GESIS data archive either in separate files per year or with data bundled by topic (e.g., the Social Inequality dataset contains data from rounds 1987, 1992, 1999, and 2009). There is no integrated codebook indicating the availability of variables in different rounds, so someone interested in longitudinal analyses would need to download all files, open them and look for the variables of interest.
Introduction Illustration: Trust in institutions Step 1: Preparation and coding of technical variables Step 2: Selection of source variables for harmonization Step 3: Mapping source values to target values Step 4: Harmonization Results: Availability of trust items Comparability of sample aggregates Appendices: Code examples Appendix 1: Data preparation Appendix 2: Codebook from labelled data in R Appendix 3: Values crosswalk Appendix 4: Harmonization Introduction Ex-post (or retrospective) data harmonization refers to procedures applied to already collected data to improve the comparability and inferential equivalence of measures collected by different studies.
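The value-mapping step listed above (source values to target values) can be sketched as a small crosswalk. This is a minimal illustration in Python, not the code from the post: the survey names, the response scales, and the choice of a linear stretch onto a 0-10 target scale are all assumptions made for the example.

```python
def linear_crosswalk(source_values, target_min=0.0, target_max=10.0, reverse=False):
    """Map ordered source codes onto an evenly spaced target scale.

    reverse=True is for source scales where a low code means high trust.
    """
    ordered = sorted(source_values, reverse=reverse)
    span = target_max - target_min
    last = len(ordered) - 1
    return {value: target_min + i * span / last for i, value in enumerate(ordered)}


def harmonize(records, crosswalks):
    """Add a 'target' value to each record; unmapped codes become None."""
    harmonized = []
    for rec in records:
        mapping = crosswalks[rec["survey"]]
        harmonized.append({**rec, "target": mapping.get(rec["source"])})
    return harmonized


# Hypothetical surveys: A uses 1 = "a great deal" ... 4 = "none at all",
# B already uses a 0-10 scale. Code 9 in survey A stands for a missing value.
crosswalks = {
    "survey_A": linear_crosswalk([1, 2, 3, 4], reverse=True),
    "survey_B": linear_crosswalk(range(0, 11)),
}
records = [
    {"survey": "survey_A", "source": 1},
    {"survey": "survey_A", "source": 9},
    {"survey": "survey_B", "source": 7},
]
result = harmonize(records, crosswalks)
```

Keeping the crosswalk as data, separate from the function that applies it, means the mapping itself can be inspected and documented alongside the harmonized variable.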
Working with categorical data, such as from surveys, requires a codebook. After spending some time unsuccessfully looking for a function that would create a nice, searchable codebook from labelled data in R, I decided to write my own. What I want to achieve is a simple table with variable names, labels, and frequencies of labelled values like the one below, to search for specific keywords in the value labels and to see distributions of various variables.
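The kind of table described above (variable names, labels, and frequencies of labelled values, searchable by keyword) can be sketched as follows. The post does this in R; the Python version below is only an illustration, and the function names, the toy records, and the label dictionaries are all hypothetical.

```python
from collections import Counter


def build_codebook(records, variable_labels, value_labels):
    """Return one row per (variable, value) with labels and frequencies."""
    rows = []
    for var, var_label in variable_labels.items():
        counts = Counter(rec.get(var) for rec in records)
        for value, n in sorted(counts.items(), key=lambda kv: str(kv[0])):
            rows.append({
                "variable": var,
                "label": var_label,
                "value": value,
                "value_label": value_labels.get(var, {}).get(value, ""),
                "n": n,
            })
    return rows


def search_codebook(codebook, keyword):
    """Keep rows whose variable label or value label contains the keyword."""
    keyword = keyword.lower()
    return [
        row for row in codebook
        if keyword in row["label"].lower() or keyword in row["value_label"].lower()
    ]


# Toy data, invented for this example.
records = [
    {"trust_parl": 1, "sex": 2},
    {"trust_parl": 3, "sex": 1},
    {"trust_parl": 1, "sex": 2},
]
variable_labels = {"trust_parl": "Trust in parliament", "sex": "Respondent sex"}
value_labels = {
    "trust_parl": {1: "No trust at all", 3: "Some trust"},
    "sex": {1: "Male", 2: "Female"},
}
codebook = build_codebook(records, variable_labels, value_labels)
hits = search_codebook(codebook, "female")
```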
So You Want to Write a Fugue? Glenn Gould So you want to write a fugue? You’ve got the urge to write a fugue You’ve got the nerve to write a fugue So go ahead and write a fugue that we can sing Pay no heed to what we’ve told you Give no mind to what we’ve told you Just forget all that we’ve told you And the theory that you’ve read
Winner-loser trust gap across countries Winner-loser trust gap in Poland Trust differences across parties in Poland Voting for a party that ends up losing the election is known to be associated with lower satisfaction with democracy and trust in the parliament (cf. Martini and Quaranta 2019). How does Poland compare to other European countries? How has the winner-loser trust gap changed in Poland over time, and how have trust levels among supporters of current and former ruling parties changed in periods when they were not in government?
Data Packages Varieties of Democracy (V-Dem): Dedicated package Polyarchy: Semicolon delimited CSV file -> rio Freedom House: Excel file with by-year sheets Polity IV: SPSS file -> rio Democracy Barometer: Excel file with header in top rows -> rio The Standardized World Income Inequality Database (SWIID): Plain CSV file -> rio World Bank’s World Development Indicators: Dedicated package Merging all datasets Writing to file Shortly after writing this post on importing datasets in different formats (CSV, XLS, XLSX, SAV) to R, I got the following comment:
Data Packages Varieties of Democracy (V-Dem): Dedicated package Polyarchy: Semicolon delimited CSV file Freedom House: Excel file with by-year sheets Polity IV: SPSS file Democracy Barometer: Excel file with header in top rows The Standardized World Income Inequality Database (SWIID): Plain CSV file World Bank’s World Development Indicators: Dedicated package Merging all datasets Country graphs Variable graphs Writing to file with Viktoriia Muliavka Social and political scientists often need to put together datasets of country-level political, economic, and demographic variables with data from different sources.
How 2015 voters voted in 2007 and 2011 How 2007 voters voted in 2011 and 2015 About POLPAN Where did the current governing party get their votes from? Did supporters of the previous ruling party switch preferences or did they abstain from voting altogether? Cross-sectional datasets, such as one-off election polls, do not typically provide data to answer these questions. Panel studies, such as the Polish Panel Survey (POLPAN), do.
Determining meritocratic allocation Calculating the distance to meritocracy Distance to meritocracy by country Meritocracy is a principle according to which rewards are based on merit, as well as an ideal situation resulting from the operation of this principle. In their 1985 Social Forces paper titled “How Far to Meritocracy? Empirical Tests of a Controversial Thesis”, Tadeusz Krauze and Kazimierz M. Słomczyński proposed an algorithm to construct a theoretical joint distribution of education and income, given their marginal distributions, that would satisfy the conditions of meritocratic allocation.
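To make the idea of building a joint distribution from two marginals concrete, here is a minimal sketch of one rank-preserving allocation: fill the table from the lowest categories upward, so that nobody with more education ends up in a lower income category than somebody with less (the "northwest-corner" rule). Whether this coincides with the algorithm in the 1985 paper is an assumption; the function name and the counts are invented for the example.

```python
def rank_preserving_allocation(education_counts, income_counts):
    """Build a joint table of counts from two ordered marginals.

    Both lists run from the lowest to the highest category and must
    have the same total. Rows are education, columns are income.
    """
    if sum(education_counts) != sum(income_counts):
        raise ValueError("marginal totals must be equal")
    table = [[0] * len(income_counts) for _ in education_counts]
    edu_left = list(education_counts)
    inc_left = list(income_counts)
    i = j = 0
    while i < len(edu_left) and j < len(inc_left):
        # Place as many people as both remaining margins allow.
        moved = min(edu_left[i], inc_left[j])
        table[i][j] += moved
        edu_left[i] -= moved
        inc_left[j] -= moved
        # Move on from whichever category has been exhausted.
        if edu_left[i] == 0:
            i += 1
        elif inc_left[j] == 0:
            j += 1
    return table


# Hypothetical marginals for 100 people, lowest category first.
education = [30, 50, 20]
income = [40, 40, 20]
allocation = rank_preserving_allocation(education, income)
```

The resulting table reproduces both marginals exactly, which is the property any theoretical joint distribution of this kind has to satisfy.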
What comes first? Wikipedia, Google, News Interest in technology Cross-correlations News coverage versus Wikipedia page views with Maria Khachatryan, Filip Kowalski, Jakub Siwiec, and Paweł Zawadzki The Hackathon Next Generation Internet Data Sprint was organized by the Digital Economy Lab of the University of Warsaw on November 9 and 10, 2018. The goal of the hackathon was to explore datasets on Wikipedia page views and edits, Reddit posts, media mentions, and others, to generate insights about the use of the internet and new technologies.
BigSurv18 and the Green City Hackathon Team number 5 Data Bike use Altitude of Bicing stations Location of mechanical and electric bike stations Empty stations by station altitude Next steps with Saleha Habibullah, Sakinat Folorunso, and Vera Paul BigSurv18 and the Green City Hackathon One of the accompanying events of the BigSurv18: Big Data Meets Survey Science conference in Barcelona last week was the Green City Hackathon.
Political participation in Poland Latent class analysis Three types of participants: the Disengaged, Activists, and Protesters Region maps I recently came across Jennifer Oser’s 2017 article in Social Indicators Research about “political tool kits”, i.e. profiles (or patterns) of participation in different political activities. Her general argument is that research on citizen participation would benefit from analyses of such participation patterns instead of (or at least in addition to) just looking at determinants of participation in single activities.
Sample correlations Sample correlations by gender Sample correlations by age Sample correlations by education Contrast Conclusion One of the reasons for the harmonization of personal income in addition to household income was to check if the two correlate highly enough to use household income as a substitute for personal income in analyses where economic status is a control variable. This would be great, because household income variables are available in 1177 surveys out of 1721 analyzed in the Survey Data Recycling dataset (SDR) version 1, while personal income only in 453 surveys.
Data Number of response options Item non-response Distributions Harmonized target variables Next steps with Przemek Powałko Individual economic status is a necessary element of almost all sociological analyses, including studies of political attitudes and behavior. To supplement the already harmonized variables in the Survey Data Recycling dataset (SDR) version 1 and for the purposes of my research on the effects of education on political engagement, Przemek and I harmonized two additional variables: personal income and household income.
Political participation in the ESS Country levels of political participation Inequality of political participation Democracy indicators Economic inequality Matrix scatter plots How to measure political inequality? The Varieties of Democracy project (V-Dem) has a set of political equality indicators that capture the extent to which political power is distributed according to wealth and income, membership in a particular social group, gender or sexual orientation (cf. V-Dem Codebook v.
Cross-national survey projects conduct surveys on representative samples of adult populations. How do the distributions of respondents’ age vary across surveys carried out in the same country in different years and different projects? Like in a couple of previous posts (here, here and here) I use data from the Survey Data Recycling dataset (SDR) version 1, which includes selected harmonized variables from 22 cross-national survey projects. SDR only includes surveys that claim to have samples representative for adult populations.
Data Differences within country-years Differences by groups Gender Age Urban/rural residence Education Sampling scheme The growth in cross-national survey projects in recent decades has led to situations in which two or more surveys are carried out in the same country and the same year but in different projects, and contain overlapping sets of survey questions. Assuming that the surveys are based on representative samples - a claim that major cross-national survey projects typically make - it could be expected that estimates from surveys carried out in the same country and year are reasonably close.
Setup tidy TED talks Applause, LOL Sentiment This year I spent two weeks of the summer attending the Summer Institute for Computational Social Science Partner Site (SICSS) in Tvärminne and Helsinki, Finland, organized by Matti Nelimarkka from Aalto University and the University of Helsinki, assisted by two TAs: Juho Pääkkönen and Pihla Toivanen from the University of Helsinki. I highly recommend it to anyone with a background in the social sciences and an interest in computer and data sciences, or the other way around!
Educational attainment data OECD data SDR data Cleaning and merging SDR and OECD data Results The curious case of ISSP Switzerland Conclusion Appendix with Przemek Powałko General population surveys with representative samples should have a similar education structure as shown by data from administrative sources, especially if survey weights are used. In this post we compare sample aggregates from 15 cross-national survey projects (including the European Social Survey, the World Values Survey and the European Values Study, and others) from the Survey Data Recycling database with educational attainment statistics from the OECD.
Instructions References In the previous post I wrote about downloading and exploring the Survey Data Recycling (SDR), version 1 dataset, which consists of selected harmonized variables from 22 survey projects, 1966-2013. The SDR project will develop a website for browsing, subsetting, downloading, and visualizing data from the SDR project. This website is currently under construction. Meanwhile, I made a Shiny app with basic functionalities of the future on-line browsing and subsetting tool (also serves as its mock-up): https://mkolczynska.
Introduction Downloading the SDR data Exploring SDR: availability of variables by project Exploring SDR: availability of variables with different formulations Identifying surveys containing selected variables Subsetting the Master File Country coverage plot Combining data from different survey projects creates new opportunities for research, alas, at the cost of increased volume (obviously) and complexity of the data. The Survey Data Recycling project created a dataset with data from 22 international survey projects.
Getting data from Twitter Tweets over time Text analysis Tweets by ISA Research Committee The International Sociological Association 19th World Congress of Sociology in Toronto (15-21 July) has received quite a bit of Twitter coverage. Waiting to board the flight back to Warsaw, I wanted to take a look at these Twitter data and apply my newly acquired skills in text analysis (thanks to the Summer Institute for Computational Social Science, SICSS, Partner Site in Tvärminne and Helsinki, Finland).