This chapter presents blueprints for the statistical analysis of text. It gets you started quickly and introduces basic concepts that you will need to know in subsequent chapters. We will start by analyzing categorical metadata and then focus on word frequency analysis and visualization. After studying this chapter, you will have basic knowledge about text processing and analysis. You will know how to tokenize text, filter stop words, and analyze textual content with frequency diagrams and word clouds. We will also introduce TF-IDF weighting as an important concept that will be picked up later in the book for text vectorization. The blueprints in this chapter focus on quick results and follow the KISS principle: “Keep it simple, stupid!” Thus, we primarily use Pandas as our library of choice for data analysis, in combination with regular expressions and Python core functionality. Chapter 4 will discuss advanced linguistic methods for data preparation.

Exploratory data analysis is the process of systematically examining data on an aggregated level. Typical methods include summary statistics for numerical features as well as frequency counts for categorical features. Histograms and box plots will illustrate the distribution of values, and time-series plots will show their evolution.

A dataset consisting of text documents such as news, tweets, emails, or service calls is called a corpus in natural language processing. The statistical exploration of such a corpus has different facets. Some analyses focus on metadata attributes, while others deal with the textual content. Figure 1-1 shows typical attributes of a text corpus, some of which are included in the data source, while others can be calculated or derived. The document metadata comprise multiple descriptive attributes, which are useful for aggregation and filtering. Time-like attributes are essential to understanding the evolution of the corpus. If available, author-related attributes allow you to analyze groups of authors and to benchmark these groups against one another.

Analyzing political text, be it news, programs of political parties, or parliamentary debates, can give interesting insights into national and international topics. Often, text from many years is publicly available, so that an insight into the zeitgeist can be gained. For that, we will work with the UN General Debate dataset. The corpus consists of 7,507 speeches held at the annual sessions of the United Nations General Assembly from 1970 to 2016. Let’s jump into the role of a political analyst who wants to get a feeling for the analytical potential of such a dataset.
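The quick, Pandas-based exploration described above can be sketched in a few lines. The DataFrame below is a tiny toy stand-in for the real corpus, and the column names (`year`, `country`, `text`) as well as the miniature stop-word list are illustrative assumptions, not the dataset’s actual schema:

```python
import re
from collections import Counter

import pandas as pd

# Toy stand-in for the UN General Debate corpus; the real dataset
# contains 7,507 speeches, and its column names may differ.
df = pd.DataFrame({
    "year": [1970, 1970, 2016],
    "country": ["USA", "FRA", "USA"],
    "text": [
        "The assembly convenes to discuss peace and security.",
        "We speak of development and of peace.",
        "Climate change and peace dominate the debate.",
    ],
})

# Frequency counts for a categorical metadata attribute
print(df["country"].value_counts())

# Naive regex tokenization, lowercasing, and stop-word filtering
# (a tiny, hand-picked stop-word list just for this sketch)
stopwords = {"the", "and", "of", "to", "we", "in", "a"}
tokens = re.findall(r"[a-z]{2,}", " ".join(df["text"]).lower())
freq = Counter(t for t in tokens if t not in stopwords)

# Most frequent remaining words across all speeches
print(freq.most_common(3))  # "peace" appears in every toy speech
```

On the real corpus, the same `value_counts` and `Counter` calls scale directly; later sections refine the naive regex tokenizer and stop-word handling.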