with Maria Khachatryan, Filip Kowalski, Jakub Siwiec, and Paweł Zawadzki
The Hackathon Next Generation Internet Data Sprint was organized by the Digital Economy Lab of the University of Warsaw on November 9 and 10, 2018. The goal of the hackathon was to explore datasets on Wikipedia page views and edits, Reddit posts, media mentions, and others, to generate insights about the use of the internet and new technologies.
We decided to focus on public interest in selected keywords related to the internet, to see how this interest is reflected in different types of data. The remainder of the post summarizes the ca. 9 hours of our work.
What comes first?
Wikipedia, Google, News
What comes first: the interest of the general public in a given topic, that of specialists, or coverage in the news? How do events trigger Google searches, news coverage, and Wikipedia views? To explore these questions, we used three types of data:
* Wikipedia page views and edits,
* Google searches,
* coverage in the news media and academic pre-print repositories (SSRN and arXiv).
All data are monthly.
Interest in technology
We chose four keywords (wireless, linux, bitcoin, and cyberattack) to illustrate the different patterns in interest as measured by Google searches, Wikipedia edits and pageviews, and occurrences in the news and academic pre-prints.
The graph for the keyword ‘linux’ shows high variability in Wiki edits, while Google hits, Wiki page views, and news occurrences were smooth, except for one month in the case of news. For ‘wireless’, the Google hits, Wiki edits, and news curves move sideways, while Wiki page views show a downward trend.
On the graphs for ‘cyberattack’ and ‘bitcoin’, we can see a sharp rise in Wiki page views occurring together with a rise in news occurrences and Google hits. Interestingly, the same does not hold for Wiki edits. In fact, at the end of 2017 the rise in Wiki page views, news occurrences, and Google hits coincided with a sharp one-month fall in Wiki edits.
We have started to analyze the temporal correlation of public and media interest in specific topics quantitatively with cross-correlations between the time series. The graph below shows the cross-correlation for Google hits and news occurrences for the keyword “cyberattack”. It seems that Google hits and news occurrences are correlated contemporaneously, or Google hits precede news occurrences by about one month (positive correlation for the lag of -1).
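To make the lag convention concrete, here is a minimal sketch of a cross-correlation between two monthly series. It is an illustration, not the code we ran at the hackathon: the function names and the sample numbers are made up, and each lag is a plain Pearson correlation on the overlapping months, which can differ slightly from the estimator used by statistical packages. A lag of k pairs the first series at month t + k with the second series at month t, so a peak at a negative lag means the first series tends to move earlier.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cross_correlation(x, y, max_lag):
    """Return {lag: corr(x[t + lag], y[t])} for lags -max_lag..max_lag.

    Only the months where both shifted series overlap are used.
    """
    n = len(x)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            xs, ys = x[:n + lag], y[-lag:]
        else:
            xs, ys = x[lag:], y[:n - lag]
        result[lag] = pearson(xs, ys)
    return result


# Hypothetical monthly values, NOT the hackathon data:
# news_counts repeats google_hits with a one-month delay.
google_hits = [1, 3, 2, 5, 4, 6, 3, 7]
news_counts = [0, 1, 3, 2, 5, 4, 6, 3]

cc = cross_correlation(google_hits, news_counts, max_lag=2)
best_lag = max(cc, key=cc.get)
print(best_lag, round(cc[best_lag], 3))
```

Because the toy news series is built as a one-month-delayed copy of the Google series, the strongest correlation falls at lag -1, the same sign convention as in the paragraph above.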
News coverage versus Wikipedia page views
The animated scatterplot shows the changes in the interest in tech-related keywords. The x axis shows an index of keyword occurrences in the news and academic pre-prints. Wikipedia page views are on the y axis. The size of the bubble indicates monthly Wikipedia edits. All three measures are constructed such that they represent the three dimensions of interest in the keywords as they accumulate over time.
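One simple way to build a measure that accumulates over time is a running total of the monthly counts. The sketch below is only an assumption about what such a measure could look like, not a description of how the plotted indices were computed: the function name is hypothetical, and scaling the final value to 100 is an arbitrary choice made for readability.

```python
from itertools import accumulate


def cumulative_index(monthly_counts):
    """Running total of monthly counts, scaled so the last month equals 100.

    The scaling is an illustrative assumption, not taken from the post.
    """
    totals = list(accumulate(monthly_counts))
    final = totals[-1]
    return [100 * t / final for t in totals]


# Hypothetical monthly counts for one keyword.
index = cumulative_index([1, 1, 2])
print(index)
```

A measure built this way never decreases, so each frame of an animation based on it shows the interest gathered up to that month rather than the interest in that month alone.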
As we can see, these three dimensions of interest are related, but not perfectly. Typically, as Wikipedia page views increase, so does the index of news occurrences. However, the growth in interest is not always balanced: while interest in ‘Bitcoin’ grew more strongly in terms of Wikipedia page views than in terms of news occurrences, the pattern is the opposite in the case of ‘5g’ or ‘wireless’. This might again have to do with the extent to which certain keywords, such as ‘Bitcoin’, capture the imagination of the public more than other, more technical keywords, such as ‘wireless’.
If we look closely, we can also see that some words first move closer to one axis than to the other. This requires further investigation to see, for example, whether certain profiles of interest are characteristic of certain types of keywords. This kind of analysis may bring us closer to answering the question: what comes first?