VOOZH about

URL: https://towardsdatascience.com/mining-modelling-character-networks-part-i-e37e4878c467/

⇱ Mining & Modelling Character Networks - Part I | Towards Data Science


Mining & Modelling Character Networks – Part I

Discussing Research Papers in Graph Theory

14 min read

Thoughts and Theory

Literature Review of Papers in Graph Theory Discussing Mining Characters from Text and Modelling Interaction Networks

👁 Image taken from Splash by Nastya Dulhiier
Image taken from Splash by Nastya Dulhiier

This is a two part article, the first part will cover the discussion of relevant research papers covering the topic of mining and modelling character networks from various bodies of text. The second part following this will go over the python implementation of mining characters, identifying character interactions and then creating interaction networks between these characters.

Mining & Modelling Character Networks – Part II

The following is the outline of this article.

Table of Contents

  • Introduction
  • Literature Review
  • Design & Methodology
  • Data & Network Creation
  • Modelling
  • Results
  • Discussions & Remarks
  • Resources

Introduction

There are many applications of mathematics in the real world, a well defined subsection of mathematics known as Graph Theory is the study of graphs. Graphs are a mathematical structure to model the pairwise relations between nodes through edges, it has incredible applications to define and understand the relationships between nodes and their connections [1]. One of many applications of graph theory lies in understanding character networks. These character networks are generated from novels, stories, movies, videos, and other forms of media. Depending on how the network is structured, one can choose the types of relationships pairs of nodes hold. These relationships can be characterized through conversation, gender, co-occurrences, themes, story arcs, conflicts, arguments, friends, enemies, race, etc. [1]. These networks can bring new viewpoints based on the structure, namely, we can strip and fortify existing themes and attributes displayed in the body of text based on observations and mathematical analysis of the network.

Results of the analysis can vary based on the network structure, in "Mining and Modelling Character Networks" by Bonato et al. they created a weighted character network based on the proximity of characters within the text within a bound of 15 words. If two or more characters were mentioned in the text within a 15 word proximity of each other, then a weighted edge would be created associated to the combination from the set of characters.

Literature Review

A lot of work has been done on the analysis of character networks. In "Extraction and Analysis of Fictional Character Networks: A Survey", Labatut and Bost have outlined and consolidated all the academic literature in the topic of character networks. From identification of characters, detecting character interactions, extracting graphs, to analysis and applications [2]. Labatut and Bost outline various ways of identifying characters and detecting interactions between these identified characters. The character identification methodology ranges from manual identification to using named entity recognition (NER) tools which locates and classifies named entities mentioned in a body of text based on predefined categories. When dealing with visual forms of media like videos, comics, manga and movies, the text outlines using face detection through computer vision. Labatut and Bost indicated 6 different approaches for identifying interaction between characters [2], namely,

  1. Co-occurrence of characters
  2. Direct verbal interactions between characters
  3. Explicit mentioning between characters
  4. Non verbal interactions (fighting, kissing, killing, etc.)
  5. Affiliation between characters (coworkers, family, friends, etc.)
  6. Hybrid approaches of 1–5

The graph extraction process encompasses the creation of the network based on the characters and interactions identified for the body of media previously. Labatut and Bost claim that many authors simply use individual characters as nodes, however some use a combination of characters / organizations as nodes in their network creation process. The edges are created based on the interactions the author uses between their specified nodes. Many authors choose to create their networks as weighted or multi-edge while a few create their graphs as directed. The analysis of these networks outlined in Labatut and Bost cover a variety of centrality measures, graph statistics, community structures, clustering etc.

In "Linguistic analysis of differences in portrayal of movie characters" by Ramakrishna et al. tries to examine differences in characters through psycholinguistic and network analysis [4]. Gender, age, and other metadata is what defines the differences between characters, this metadata is extracted through movie dialogue between characters. From the 945 screenplays at their disposal, they created a parser which identifies relevant metadata associated with the screenplay. They’ve identified gender, race and age associated with characters, and movie metadata including movie release date, writers, directors, etc. They outlined running various experiments namely, psycholinguistic normatives and linguistic inquiry and word counts (LIWC) on their created network. Psycholinguistic normatives provide a measure of various emotional and psychological constructs of the speaker, such as arousal, valence, concreteness, intelligibility, etc. and are computed entirely from language usage [4]. LIWC is a text processing application that processes raw text and outputs a percentage of words from the text that belong to linguistic, affective, perceptual and other dimensions. [4]. In the end they found that there was a clear distinction between the gender, age and race hierarchy in film and movie sets. In particular, they noticed that if there is female staff in the production team they have a more balanced gender ratio throughout the entire cast and crew, however if the production team is predominantly male then the cast and crew is also predominantly male. They also noticed that mixed race characters have a more sexual vocabulary in the film. The sophistication of the dialogue associated with older characters was notably higher than that of younger characters.

Design & Methodology

"Mining and Modelling Character Networks" by Bonato et al. the focal point of this paper, highlights experiments to create synthetic networks which closely captures the attributes and similarities of the character network.

Data & Network Creation

Bonato conducts in depth analysis of the networks created from the following bodies of text, Twilight, The Stand, and Harry Potter and the Goblet of Fire alongside 800 character networks from films taken from http://moviegalaxies.com/. They’ve built a parser which scans through the input text and identifies unique character names and their associated aliases. Upon identifying the character names, another parser goes through the body of text once again to identify co-occurrences of characters within a 15 word span. These co-occurrences are the representative edges of the network. Instead of keeping it as a multi edged graph, Bonato et al. decided to create a weighted network as seen in figure 1.

👁 Figure 1 : Creation of the network from text (Image provided by the author)
Figure 1 : Creation of the network from text (Image provided by the author)

Network statistics were then calculated on top of the created network to infer more insights on the character dynamics for the body of text. Bonato et al. ran Louvain’s algorithm on the network for identifying communities of characters. The following centrality measures were also ran on the network and some of these results were later fed into the machine learning models:

  1. Page Rank : This algorithm counts the frequency of quality interactions with a character to generate an estimate of importance for the character. The expectation is that characters with larger importance in the network will receive a larger page rank score.
  2. Eigenvector : This centrality measure is indicative of the level of influence a certain character has in the network, larger eigenvector centrality scores reflect a higher character importance. The underlying assumption is that nodes which have high importance will be highly connected with other nodes who also have high importance.
  3. Betweenness : This measure identifies a character’s influence over a network based on the character’s ability to "bridge" together other characters. It does so by identifying and scoring the shortest paths between all characters in the network. Characters which frequently fall in the identified shortest paths will have a larger betweenness, this is because this algorithm thinks this character is a good "bridge" to connect other characters.
  4. Degree : This measure reflects the number of edges connecting pairs of nodes. Nodes with high connectivity will have a large degree of centrality.
  5. Closeness : This measure identifies the closeness of a character with other characters based on the proximity of characters in the network. It can be inferred as the ability a character has to spread information to other characters.

Modelling

The aim for the modelling component is to identify a random graph model which parallels the observed properties of the character network [3]. Bonato et al. derives the k-profile and the eigenvalue distribution associated with the network. The k-profile of a Graph is the frequency of each graph on k nodes which appear in the induced subgraph of G [3]. The eigenvalue distribution is a histogram of the eigenvalues of the normalized Laplacian matrix through equally spaced bins [3]. There were 4 different random graph models used to synthesize the original character network.

  1. Binomial Random Graph : The graph is constructed by randomly connecting nodes. Each edge is connected with a probability p independent from all other edges. They set p = |E| / nC2 to match the average degree of the original character network [3].
  2. Configuration Model : They created an approximated network which has the same degree distribution as the original character network. This is done by selecting a graph uniformly from the set of graphs which exactly match the target degree distribution [3].
  3. Preferential Attachment : Given a graph G, at each time step t, a new node n and m edges are added connecting this new node n to existing nodes from time step t-i. They specified the edges be chosen by the following formula 2/n + 2m = 2|E| / n [3].
  4. Chung-Lu : The graphs created by this model are parameterized by the character networks degree distribution [3]. It creates edges with probability p proportional to the products expected degrees wi [3]. The edge probability can be represented by the following formula : pij = 1/C wi wj

To evaluate which of these stochastic models best reflects the original character networks, Bonato et al. decides to generate 100 random graphs from each of the models. They then attempt to determine the best random graph model of the 4 mentioned above by training a variety of classifiers on each of the random graphs generated to see if the classifier can correctly predict the class label for the original character network [3]. There were 3 different ML models used for this experiment, namely :

  1. Random Forest : An ensemble learning algorithm which uses the results of multiple decision trees to make an overall decision on what class is going to be labelled for this data point. The final class will be predicted based on a majority vote of the aggregates of each decision tree generated.
  2. SVM : The intuition behind SVM for classification is to generate a hyperplane which accurately separates the data points into their appropriate training classes. If the data was mapped in a 3 dimensional space, a 2 dimensional hyperplane which maximizes the margin between the hyperplane and all other data points. Similarly, if the data was mapped in a 2 dimensional space, a 1 dimensional hyperplane would be used to separate the data into their appropriate classes.
  3. Ada Boost : The intuition behind AdaBoost is to convert weak learners (decision trees) into strong learners by using many weak learners in conjunction with each other, it is an ensemble type model. Essentially, the mistakes of previous iterations are used to make accurate predictions at current iterations.

Results

This methodology was not only able to accurately predict the importance of characters but also the communities which were formed in the 3 books the analysis was conducted on. The characters ranked with the highest centrality measure (like Page Rank) did indeed have a high role to play on the body of text; this was conclusive across all the centrality measures. The corresponding communities created by Louvain’s algorithm accurately depicted communities outlined in the body of text. For example, in Twilight, there are 3 notable communities formed, the first community corresponds to the vampires, the second to the high school kids, and the third to the friends of Charlie. Similar conclusions can be made when viewing the corresponding graphs for other novels, see [3] for more info and visual representation of the community networks across the 3 bodies of texts.

Cross validation, a metric commonly used in machine learning to see how well the results of a machine learning model generalizes, was used on the classifiers in this experiment. Each classifier achieved a very high cross validation score as well as a large F1 score. The performance of the models can be seen in table 1. Different accuracy measures were used corresponding to each model, for Ada boost, the final decision function was used, for random forest, the soft decision probability was used and for SVM the distance separating the hyperplanes was used [3]. Clearly, based on the results shown in the table below, the Chung-Lu model outperforms all other models in all categories across each book.

👁 Table 1 : Taken from Bonato et al. [3] which outlines the performance of all the ML models when using graph profiles & eigenvalue histograms as features
Table 1 : Taken from Bonato et al. [3] which outlines the performance of all the ML models when using graph profiles & eigenvalue histograms as features

Similarly, in Figure 2, we see that the Chung-Lu model is the best performing model across all the movie galaxies data when using the graph profiles and eigenvalue histograms as feature [3].

👁 Figure 2 : Taken from Bonato et al. [3] which outlines the strong performance of the Chung-Lu model over all others when investigating the movie galaxies data set
Figure 2 : Taken from Bonato et al. [3] which outlines the strong performance of the Chung-Lu model over all others when investigating the movie galaxies data set

Discussion & Remarks

Bonato et al. conducted network analysis on weighted character networks from 3 large novels and 800 films. The networks were created based on co-occurrences of characters in the text. Upon generating the network, various statistical analyses were conducted including Louvaine’s algorithm for community detection, and centrality measures (page rank, eigenvector, betweenness, closeness, degree) of the characters. The analysis successfully extracted accurate literary conclusions and identified the influential and non influential characters in the network [3]. Through the use of various ML algorithms like SVM, random forest, ada boost, Bonato et al. compared the stochastic models with respect to other simulated data from complex networks. They considered the Chung-Lu model, configuration model, PA model, and binomial random graphs [3]. The results indicated clearly that the Chung-Lu model was the best performing model which accurately depicts the original character network.

There are various areas where the methodology introduced by Bonato et al. can be improved. Firstly, the performance of these models are most likely indicative of the network size. Since the original bodies of text (Twilight, The Stand, Harry Potter and the Goblet of Fire and the 800 character networks) were large to begin with, the corresponding character network built corresponding to that body of text will be large. The same approach and results might not be applicable for shorter stories, videos and small films. Secondly, there can be improvements made when identifying characters and character interaction autonomously. The methods introduced in "Extraction and Analysis of Fictional Character Networks: A Survey", Labatut and Bost for character identification through NER and a hybrid approach to identifying character interactions would yield compelling arguments aiding the currently proposed methodology and results for a more universal dataset.

The limitation in the current approach to network creation is in the way that character interaction is defined. They used an arbitrary number 15 which indicates the cooccurrences between characters in a body of text, this number was not derived, nor experimented with to see how results would differ if it was increased or decreased by some value k. Furthermore, based on the current character interaction definition, If one individual references a group / organization of characters, they automatically have a high weighted edge associated with each member in the group / organization. This would have a heavy impact on the analysis conducted with the centrality scores, community detection, etc.

In addition to the discussion points mentioned above, for future experiments associated with this topic, creating and exploring graphs which are created temporally over some time step t would yield interesting results to see which model best fits the data. It would allow us to see the change in the network as the story progresses, the associated network statistics would be very interesting as we can see how the character dynamics evolve over time. This allows the introduction of various time series analysis and predictive modelling which is very intriguing.

Resources

[1] Chen, Andy. "Modeling Movie Character Networks with …" Snap.stanford.edu, Stanford, Dec. 2017, https://snap.stanford.edu/class/cs224w-2017/projects/cs224w-32-final.pdf.

[2] Vincent Labatut and Xavier Bost. 2019. "Extraction and Analysis of Fictional Character Networks: A Survey" . ACM Computing Surveys 52(5):89. https://doi.org/10.1145/3344548

[3] Bonato, A., D’Angelo, D. R., Elenberg, E. R., Gleich, D. F., & Hou, Y. (2016). "Mining and modeling character networks." In Algorithms and Models for the Web Graph: 13th International Workshop, WAW 2016, Montreal, QC, Canada, December 14–15, 2016, Proceedings 13 (pp. 100–114). Springer International Publishing

[4] Ramakrishna, Anil, et al. "Linguistic Analysis of Differences in Portrayal of Movie …" 2017, https://aclanthology.org/P17-1153.pdf.


If you liked this article, here are some others I’ve written which you might be interested in :

Monte Carlo Method Explained

Bayesian A/B Testing Explained

Word2Vec Explained

Dimensionality Reduction Explained


Written By

Vatsal

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles