In today’s world of technological advances, keeping track of the news published around the world is becoming increasingly challenging. Due to the vast amount of information distributed by news outlets (e.g., the BBC), on social media (such as Facebook and Twitter), and in classical media, putting current news events into perspective and understanding the background and evolution of a complex topic can quickly result in information overload. In addition, topics often emerge quickly to form so-called hypes and vanish just as quickly.
To combat this issue, I am researching ways to gather background information on a story and help readers understand the story at hand. One example is our approach presented at JCDL 2015, where inherent temporal information as well as the content of an article is used to create a network structure of news articles.
To apply our theoretical models to real-world data, we have been collecting news articles from German and English news outlets for the past two years. In this post, I would like to use the opportunity of the new year to take a look at the German news landscape in 2015.
In 2015, we collected news articles from 9 major German news websites in the topic areas of politics and the economy. To crawl for new stories and their metadata, we first check the RSS feed of each news outlet. If a new story is found, we download the raw HTML code of the article and extract the content from it. The HTML code contains more than the actual content, such as ads or code related to the layout of the website. To distinguish between content and irrelevant HTML code, we manually created a rule-set for each news outlet that is constantly monitored and adjusted as needed. While it would be possible to apply machine learning instead, the precision of manual rules is hard to beat.
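The crawl step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not our actual crawler: the outlet name, the `RULES` table, and the function names are hypothetical, and each rule is simplified to "the article body lives in one element with a given tag and CSS class". Downloading the HTML is left out so the sketch stays self-contained.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def new_links(rss_xml, seen):
    """Return links of RSS items we have not seen yet and remember them."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Hypothetical hand-written rules: outlet -> (tag, CSS class) of the article body.
RULES = {"example-outlet": ("div", "article-body")}


class RuleExtractor(HTMLParser):
    """Collects text found inside the element selected by a (tag, class) rule."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside the content element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we know when to stop.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_content(html, outlet):
    """Apply the outlet's rule and return the article text."""
    tag, css_class = RULES[outlet]
    parser = RuleExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

In practice, a rule-set for one outlet tends to need several such selectors plus exclusions (for example, for embedded teasers), which is why the rules have to be monitored and adjusted over time.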
Data set statistics
Overall, we collected 142,563 German news articles in 2015. Let’s look at the distribution of news articles per month:
The relatively low number of news articles in January and February is due to the fact that we added 5 of the 9 German news outlets only at the end of February. In August, we can observe the annual ‚silly season‘ in summer, when the number of articles is significantly lower. The same holds for December 2015, where the output is substantially lower, presumably due to national holidays and the Christmas season.
Next, let’s take a look at trending topics. But first, we need to define what we actually mean by topic and how to assign news articles to topics. The next section is slightly technical, so you might want to jump to the results 😉
We start analysing the news articles by applying standard methods of Natural Language Processing (NLP). As we are dealing with quite a lot of data, we employ the scalable framework Apache UIMA (Unstructured Information Management Architecture) to create a so-called pipeline. UIMA allows us to pass a document through the various steps of a pipeline and thus accumulate more and more knowledge about the document, similar to an assembly line. The extracted information includes part-of-speech tags, named entities (e.g., which persons and companies occur in the article), and much more. UIMA is widely used in the NLP community and forms the backbone of the famous IBM Watson system, for instance.
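To make the assembly-line idea more concrete, here is a toy sketch in Python. It is not UIMA's API (UIMA itself is a Java framework), and the step functions and the tiny `KNOWN_ENTITIES` lookup table are made up for illustration. The point is only that each step receives the document, adds its own annotations, and hands the document on to the next step.

```python
import re

# Hypothetical mini gazetteer standing in for a real named-entity recogniser.
KNOWN_ENTITIES = {"Merkel": "PERSON", "Siemens": "ORGANIZATION"}


def tokenize(doc):
    """Step 1: split the raw text into word tokens."""
    doc["tokens"] = re.findall(r"\w+", doc["text"])
    return doc


def find_entities(doc):
    """Step 2: look up tokens in the gazetteer (relies on step 1)."""
    doc["entities"] = [
        (token, KNOWN_ENTITIES[token])
        for token in doc["tokens"]
        if token in KNOWN_ENTITIES
    ]
    return doc


def run_pipeline(text, steps):
    """Pass one document through all steps, accumulating annotations."""
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


doc = run_pipeline("Merkel besucht Siemens in Berlin", [tokenize, find_entities])
print(doc["entities"])
```

Later steps can build on the annotations of earlier ones, which is what makes the order of the steps matter.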
Defining topics is actually quite challenging because topics are an abstract concept that cannot be defined universally. For one person and in one use case, ‚politics‘ would be an appropriate topic, whereas in other cases, ‚the speech of the German chancellor on October 21st‘ would be a more useful topic. One commonly used approach to extract topics from document collections is ‚topic modeling‘. Without going into details, there are two reasons why I rejected topic models: first, they don’t provide a nice description of a topic but leave the interpretation up to the user. A topic is basically a list of words that often occur together (e.g., ‚war‘, ‚weapons‘, ‚December‘). Looking at the words, everyone can come up with their own definition of a topic, but there is no definite truth. Second, they incorporate randomness. Thus, two topic models derived from the same data will probably be different.
Instead of using topic models, we thus use a fixed set of topics and rely on the list of topics provided by the FAZ. Each news article published by FAZ online is assigned to a topic. The list of topics is very extensive and constantly updated. Moreover, the topics are descriptive and well-defined (e.g., ‚Flüchtlingspolitik‘, en. ‚refugee policy‘).
To assign topics to news articles from other news outlets, we perform three steps:
- Obtain all news articles for a topic from FAZ. As an example, let’s take the topic ‚Flüchtlingspolitik‘.
- Learn the association between important keywords in news articles and the topics. For instance, ‚Obergrenze‘ often occurs in news articles of the topic ‚Flüchtlingspolitik‘. To model the strength of an association, we use a logistic regression classifier that combines multiple metrics, such as language model scores or statistical concordance.
- Apply the model learned in the previous step to new articles. Thus, if there is a news article mentioning ‚Flüchtlinge‘ (en. ‚refugees‘), we would assign the topic ‚Flüchtlingspolitik‘ (en. ‚refugee policy‘) to it.
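The second and third steps can be illustrated with a small sketch. It is a simplified stand-in rather than our actual implementation: it assumes every (keyword, topic) pair is described by two made-up metric values in the range 0 to 1 (think of a language model score and a co-occurrence score), trains a plain logistic regression by gradient descent on a handful of invented examples, and assigns a topic when a keyword in the article has an association strength above a threshold. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def association_strength(w, b, x):
    """Probability that a keyword with metric vector x belongs to the topic."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def assign_topics(tokens, pair_features, w, b, threshold=0.5):
    """Assign every topic that has a strongly associated keyword in the article."""
    topics = set()
    for (keyword, topic), x in pair_features.items():
        if keyword in tokens and association_strength(w, b, x) >= threshold:
            topics.add(topic)
    return topics


# Invented training data: metric vectors for (keyword, topic) pairs and
# whether the keyword is truly characteristic of the topic (1) or not (0).
X = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)

# Invented metric values for two keywords with respect to one topic.
pair_features = {
    ("Obergrenze", "Flüchtlingspolitik"): [0.85, 0.8],
    ("Montag", "Flüchtlingspolitik"): [0.15, 0.2],
}
print(assign_topics({"Obergrenze", "Montag"}, pair_features, w, b))
```

The appeal of combining several metrics in one classifier is that no single score has to be reliable on its own; the weights decide how much each metric contributes.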
Evolution of trending topics
Below, you can see the evolution of the top 30 topics in 2015 over the course of the year. The x-axis shows the weeks of each month and the y-axis shows the distribution of topics among the news articles published at the respective time. This is a stacked graph representation, meaning that each coloured, shaded area represents the percentage of news articles with a specific topic. Thus, the larger the coloured area, the more prominent a topic is at a specific point in time. Clicking on the preview image will open a more interactive version of the chart.
I just want to emphasise two aspects that can be derived from the chart: we can see that the topic ‚Fluechtlinge‘ (blue, en. ‚refugees‘) is the most-discussed topic and increased continuously over the year, reaching its peak in September. But there are also topics that only peak at certain times, such as the topic ‚streik‘ (grass-green, en. ‚walkout‘), which is, of course, only mentioned when a walkout occurs. This happened around May, for instance, when the employees of the German Federal Railways went on strike.
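For readers curious about the numbers behind such a stacked graph, the following sketch computes the share of each topic per week. It is a simplified illustration: it assumes each article comes as a hypothetical `(week, topic)` pair with exactly one topic, and the function name is my own.

```python
from collections import Counter, defaultdict


def weekly_topic_shares(articles):
    """Map each week to the fraction of its articles that fall into each topic."""
    counts = defaultdict(Counter)
    for week, topic in articles:
        counts[week][topic] += 1
    shares = {}
    for week, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[week] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Invented example: week 1 has four articles, week 2 has one.
articles = [
    (1, "Fluechtlinge"), (1, "Fluechtlinge"), (1, "streik"), (1, "streik"),
    (2, "Fluechtlinge"),
]
print(weekly_topic_shares(articles))
```

Because the shares within one week always sum to 1, the stacked areas fill the full height of the chart, and a growing area means a topic is gaining ground relative to the others rather than in absolute numbers.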