You Can’t Sit With Us!: Understanding South African Political Twitter Communities Through Graph Analysis
I regularly engage on Twitter, and as human beings we naturally tend to stereotype or put people in boxes based on our interactions with them or their interactions with others, and even to form groups, cliques or communities.
During the latter half of 2021, I frequently saw tweets of a political persuasion (which makes sense, since the municipal elections in South Africa were coming up!): conversations and debates between Democratic Alliance, ANC, EFF and GOOD party supporters. So I thought it would be interesting to scrape Twitter data from various users in South African political Twitter and see what insights I could gain. In particular, can we use algorithms and graph theory to describe which community (in this case, party affiliation) a user seems to belong to, based on their tweeting behaviour?
There are methods to classify the nodes of a graph into distinct clusters or groups based on the innate structure of the graph itself. The Louvain method is one such method, and the one we will be using. It will allow us to detect communities, and subsequently to colour the graph by community. The premise behind the Louvain method’s utility is that “birds of a feather, flock together”.
“Birds of a feather, flock together”
My objective is to explain and deliver an understanding of community detection in a general sense, and how it can be used to get a broad understanding of a social ecosystem.
You will notice that I reference Mean Girls (2004) periodically in this post, as it’s surprisingly applicable to this subject, particularly because the different cliques of Mean Girls are, in some ways, not entirely unlike politically oriented groups on Twitter.
Modelling Social Networks Using Graphs
For our purposes we will be using what is known as an undirected graph to describe the structure of communication between a large number of users on Twitter.
An undirected graph is a data structure from graph theory which consists of nodes and bidirectional edges. An edge represents a relationship between two nodes, so the number of relationships a node has with other nodes is referred to as the degree of the node. More specifically, the degree of node A is the number of edges it has. See below.
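As a quick illustrative sketch (the node names are hypothetical), here is a tiny undirected graph built with NetworkX, the library used later in this post:

```python
# A tiny undirected graph: node "A" has relationships with "B", "C" and "D".
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

# The degree of node "A" is the number of edges it has.
print(G.degree("A"))  # 3
```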
Modularity is a measure of how much denser the edges within groups of nodes are, compared with what you would expect if the edges were distributed at random. For a graph, this value lies in [-0.5, 1]: values near 1 represent strongly modular clustering, while values at or below 0 represent no modular clustering at all.
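For a concrete toy example, NetworkX can compute the modularity of a given partition; here two five-node cliques joined by a single edge form an obviously clustered graph:

```python
# Modularity of a hand-picked partition on a clearly clustered graph.
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.barbell_graph(5, 0)  # two 5-node cliques joined by one edge
partition = [set(range(5)), set(range(5, 10))]

Q = modularity(G, partition)
print(round(Q, 2))  # ≈ 0.45: well short of 1, but strongly modular
```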
(The cafeteria in Mean Girls as a graph, has a modularity that is very close to 1)
The Louvain algorithm is a method for optimising modularity, in other words maximising the edge density within communities while minimising the edge density between communities. It first assigns each node i to its own community. It then evaluates the potential gain in modularity if i were removed from its community and placed in the community of a neighbouring node j; if there is a gain, i moves to the new community, otherwise it stays. This is repeated until no further gain is possible, after which each community is aggregated into a single node and the whole process repeats on the smaller graph. (Janice Ian is the Louvain Algorithm)
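Sketched with NetworkX’s built-in implementation (available as `louvain_communities` from networkx 2.8 onward), the algorithm recovers the two cliques of a barbell graph:

```python
# Louvain community detection on a graph with two obvious communities.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.barbell_graph(6, 0)  # two 6-node cliques joined by one edge
communities = louvain_communities(G, seed=42)

print(len(communities))  # 2: each clique becomes its own community
```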
Eigenvector centrality is a measure of the relative influence of a node within its network. It considers not only how many connections a node has, but also how influential those connections are. It can be calculated using various techniques, one of which is the Power Method; I’ll go into further detail about this measure in a subsequent post. (Regina George has a very high eigenvector centrality, since she has numerous connections with other highly influential people)
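A toy sketch (the character names are just labels for a hypothetical graph): in NetworkX, eigenvector centrality is a single function call, and the best-connected node in the clique comes out on top:

```python
# Eigenvector centrality on a small hypothetical social graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Regina", "Gretchen"), ("Regina", "Karen"), ("Gretchen", "Karen"),
    ("Regina", "Cady"), ("Cady", "Janis"), ("Janis", "Damian"),
])

centrality = nx.eigenvector_centrality(G)
print(max(centrality, key=centrality.get))  # Regina
```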
Quick background to the political landscape of South Africa
The following are some of the political parties, and political organisations that exist in South Africa:
ANC - centre-left, social democratic party
EFF - broadly Marxist-Leninist party
DA - centrist party that follows liberalism
GOOD - centre-left, social democratic party
FF+ - right-wing party focused on Afrikaner minority interests
AfriForum - lobby group focused mainly on Afrikaner interests
ASD - Alliance of Social Democrats
Capitalist Party (ZACP) - right-wing libertarian party
My code is uploaded to Github.
3.1 Twitter Scraping
I used snscrape for scraping Twitter. It has fewer features than the Twitter API; however, unlike the Twitter API, it imposes no tweet limit.
I also created a retrieveTweets() method to make it more convenient to utilise snscrape.
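The actual retrieveTweets() is in the linked repo; a possible shape for such a helper might look like the sketch below. The buildQuery helper and the limit parameter are my own illustrative additions, not the repo’s exact code; the column names match those used in the cleaning step later on.

```python
# Sketch of a retrieveTweets()-style helper around snscrape.

def buildQuery(username, since, until):
    # snscrape accepts standard Twitter advanced-search syntax.
    return f"from:{username} since:{since} until:{until}"

def retrieveTweets(username, since, until, limit=500):
    # Imported lazily so buildQuery can be used without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    rows = []
    scraper = sntwitter.TwitterSearchScraper(buildQuery(username, since, until))
    for i, tweet in enumerate(scraper.get_items()):
        if i >= limit:
            break
        reply_to = tweet.inReplyToUser.username if tweet.inReplyToUser else None
        rows.append({"Username": tweet.user.username, "inReplyToUser": reply_to})
    return rows
```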
The method I’ve chosen uses a breadth-first-style collection of users and their tweets. It can be summarised as follows:
- Select a starting node (User A)
- Scrape a list of tweets in a certain time period from the currently selected user and store in UserDFTweetList.
- Retrieve the top 12 most frequently replied to users by the currently selected user and store in topUsersRepliedTo.
- Append the currently selected user to the userHistory list, to keep a record of whom we’ve scraped, so that we do not duplicate the process.
- We iterate through topUsersRepliedTo and scrape the tweets for each user in this list, first confirming that the user is not already contained in userHistory.
- Steps 2–5 are repeated for n+1 iterations.
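The steps above can be sketched as follows, with a stubbed scraper in place of the real snscrape calls (the reply data and the collectUsers helper are illustrative; variable names mirror the text):

```python
from collections import Counter

# Stub: who each user replied to (repetition = frequency).
# In the real pipeline these lists come from the scraped tweets.
FAKE_REPLIES = {
    "UserA": ["UserB", "UserB", "UserC"],
    "UserB": ["UserA", "UserD"],
    "UserC": ["UserA"],
}

def collectUsers(start, iterations=2, top_n=12):
    userHistory = []       # record of whom we've already scraped
    frontier = [start]
    for _ in range(iterations):
        nextFrontier = []
        for user in frontier:
            if user in userHistory:
                continue   # skip anyone we've processed before
            replies = FAKE_REPLIES.get(user, [])  # stand-in for scraping
            userHistory.append(user)
            # The top-N most frequently replied-to users seed the next round.
            topUsersRepliedTo = [u for u, _ in Counter(replies).most_common(top_n)]
            nextFrontier.extend(topUsersRepliedTo)
        frontier = nextFrontier
    return userHistory

print(collectUsers("UserA"))  # ['UserA', 'UserB', 'UserC']
```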
The choice of a starting node/user will obviously affect the choice of subsequent users. However, while performing the scraping and subsequent graph creation, I noticed this effect becomes noticeably smaller the more iterations one uses, since users on Twitter already tend to speak most frequently to those in their own community, and we also limit collection to the top 12 replied-to users.
After the process is finished, I generally save the resulting dataframe as a .csv, just in case an issue crops up, so I don’t have to scrape again.
3.2 Dataset cleaning
Firstly, I have to modify the dataframe of tweets a bit.
- I remove all columns other than “Username” and “inReplyToUser”.
- I remove all rows which have null values in the “inReplyToUser” column. These removed tweets are the stand-alone tweets that don’t involve anyone else, leaving us with reply and quote tweets only.
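In pandas, these two steps amount to the following (the dataframe below is a toy stand-in with made-up values):

```python
import pandas as pd

# Toy stand-in for the scraped tweets dataframe.
df = pd.DataFrame({
    "Username": ["UserA", "UserA", "UserB"],
    "inReplyToUser": ["UserB", None, "UserA"],
    "Content": ["reply", "stand-alone tweet", "reply"],
})

# Keep only the two columns we need, then drop stand-alone tweets
# (rows with no reply target).
df = df[["Username", "inReplyToUser"]].dropna(subset=["inReplyToUser"])
print(len(df))  # 2
```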
3.3 Graph Object Creation
I’m going to iterate through these rows in order to create the nodes and edges. Instead of iterating through a dataframe, which is computationally more expensive, I iterate through a list of lists (listOfReplies), which I create using the .tolist() method.
I use NetworkX to create the graph object. I create graph G, then iterate through listOfReplies: if the node User1 does not already exist in G, I create a node labelled with User1’s Twitter username; I then create an edge to User2, first creating a node for User2 in the same way if it doesn’t exist yet. If, while iterating through the list, another tuple containing User1 and User2 appears, the “weight” of the edge is increased by a specified value, in this case 1. (Visually, the weight of an edge affects its thickness)
This process continues, and once it has iterated through the entire list you will have a NetworkX graph. You save this in .graphml or .gexf format, and then you can open it in Gephi.
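A sketch of that edge-building loop in NetworkX (listOfReplies here is a hard-coded stand-in for the real list of [Username, inReplyToUser] pairs):

```python
import networkx as nx

listOfReplies = [["UserA", "UserB"], ["UserA", "UserB"], ["UserB", "UserC"]]

G = nx.Graph()
for user1, user2 in listOfReplies:
    if G.has_edge(user1, user2):
        G[user1][user2]["weight"] += 1      # repeat interaction: bump the weight
    else:
        G.add_edge(user1, user2, weight=1)  # add_edge also creates missing nodes

print(G["UserA"]["UserB"]["weight"])  # 2
# nx.write_graphml(G, "twitter_graph.graphml")  # openable in Gephi
```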
Now, in terms of the community detection: you can either calculate modularity beforehand by running the Louvain algorithm via the Louvain library in your Python environment, or you can use Gephi to calculate it directly from the .graphml file. Either way works.
3.4 Community Detection and Visualisation
Using Gephi, I calculated the modularity of the graph; subsequently, each node gets put into a modularity class (a community), represented by an integer value assigned to the node. After running we get:
As you can observe, in this case there is a relatively high modularity, which means that there are communities which are quite distinct from each other.
I also calculated the eigenvector centrality for each node, since I will be using this value to determine the size of my nodes. The reason I use this, as opposed to the degree centrality of the node, is that eigenvector centrality also takes into account the influence of the other nodes that a specific node is connected to, and not just the degree of the node. I found this more visually intuitive and accurate when viewing who is prominent in a social network.
Google’s PageRank algorithm is a variant of Eigenvector Centrality. I will be doing another subsequent blog post further explaining Eigenvector Centrality, and how to calculate it using the Power Method Algorithm, which is a numerical method used to approximate eigenvalues of matrices.
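As a preview, power iteration itself is only a few lines of NumPy: repeatedly multiplying a vector by the adjacency matrix and renormalising converges to the dominant eigenvector, whose entries give the eigenvector centralities (the adjacency matrix below is a made-up four-node example):

```python
import numpy as np

# Adjacency matrix of a small undirected graph; node 0 is the best connected.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

x = np.ones(A.shape[0])
for _ in range(100):
    x = A @ x                  # multiply by the adjacency matrix
    x = x / np.linalg.norm(x)  # renormalise to prevent overflow

print(x.argmax())  # 0: the dominant eigenvector peaks at the best-connected node
```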
These measures are appended to each node’s row in the “data table” after they’ve been calculated.
Each node can have a colour applied to it based on various measures and values. In this case, I used the values pertaining to the modularity class.
I then change the relative size of the nodes, using eigenvector centrality as the measure, in a way similar to how I applied the node colouring.
In addition to this graph colouring, we have to deal with the layout of the graph, since at this point it probably looks like this:
There are many algorithms we can use for this, depending on what sort of layout you desire. For my purposes, I used the OpenOrd algorithm, since it does a decent job of arranging the nodes in clusters which mostly coincide with the modularity classes. OpenOrd uses simulated annealing as part of its algorithm to achieve this, which you might remember if you took AI subjects at university.
After applying OpenOrd, it would be prudent to also apply the “No Overlap” algorithm to remove or reduce any overlap that nodes might have with each other, which is what I did.
Once the above is done, I can go to “preview settings” and adjust various visual features of the graph, e.g. label sizes, rescale edge thickness, node size/appearance etc. Then I can save it to .png.
4. Results and Analysis
I give two examples of results from two different datasets that I used. As you can see, the detail differs between the two, depending on how many nodes and edges there are. However, more nodes and edges require longer scraping times and larger collections of data. One can decide on the resolution/detail one would like based on one’s goals.
This is the main graph I created to see if it’s possible to allocate people according to communities.
Based on my own knowledge of political party members, as well as being familiar with the various Twitter users that are both represented in the graph and that I also have engaged with, this iteration seems to have mainly captured the more prominent members of the “White Body Politic” which is a subset of the larger South African Political landscape. The factors that produced this can be seen under the heading “limitations”.
The White Body Politic varies in its ideologies, ranging from leftists to far-right conservatives, libertarians, anarchists/communists and the alt-right.
The communities in this graph are roughly the following:
- Blue are Democratic Alliance councilors, DA supporters, and centrists
- Red appears to coincide with the Capitalist Party, right-wing politics, AfriForum and Afrikaner nationalists, and also, oddly enough, conspiracy theory accounts. POTUS (Donald Trump) is contained in the red as well.
- Pink appears mostly to be left-wing: social democrats, communists and anarchist-leaning people.
- Orange appears to be a bit oriented towards the GOOD Party, since some of their councilors are present in that class.
Groups 1, 2, 3 and 4 largely form the “White Body Politic” community. Groups 1 and 2 are largely left to centre-left individuals, whereas groups 3 and 4 tend to be centre to right-wing individuals. Groups 5–11 are predominantly BIPOC communities. Below is a rough summary of some of the groups:
Groups 1 and 2: Social democrats, anarchists and communists etc. Prominent journalists and media.
Groups 3 and 4: DA , FF+ and ZACP supporters. Conspiracy theorists and anti-vaxxers.
Group 5: LGBTQIA+ accounts feature prominently. Social media influencers (larger nodes).
Groups 7 and 8: the EFF and radical economic transformation feature here frequently.
If one looks closely at the frequency of edges, one notices that not much communication happens between groups 1–4 and groups 5–11, regardless of the sentiment of that communication. Thus, communities are also heavily divided along racial demographics.
Descriptive vs. Prescriptive
This graph is descriptive in nature, not prescriptive. In other words, it has no predictive power; it only describes which community a user belongs to, based on who they communicated with and how often during the specified time interval. This brings us to our next limitation.
Time intervals used
Furthermore, and connected to the above, my graph covered only roughly a one-month period of tweets. This is likely to affect the Louvain algorithm: someone who actually belongs in the blue could be erroneously allocated to the red if, for example, they had spent that period debating a difference of political opinion. I confirmed that a small minority of people were misallocated in this way. You can make provision for this limitation by extending the time interval for the scraping of the tweets; however, this also dramatically extends the time needed to scrape.
Choice of starting node
The choice of a starting node can have a big impact on who appears in the graph, as well as their relative influence or popularity. Thus, if your starting node is in a modularity class which is part of a larger group of potential classes/communities which are so deeply divided or different that they barely communicate at all, then it’s possible that this entire class/community could be left out. In my experience on South African political Twitter, a lot of the engagement is unfortunately separated on racial lines, which is also representative of the divisions in the country.
Semantics and sentiment
This project did not take into account the semantics or sentiment of the tweets. One consequence, as mentioned above, is that a person could be allocated to a community which they actually oppose, because during that time period they debated frequently with its members over a topic they disagree on.
Using graph analysis and community detection can help discover the layout of the Twitter landscape for a particular section of Twitter. Furthermore, it can illuminate the nature of the communication between communities.
The starting node, and the number of iterations for which one continues the breadth-first search, determine the scope of the graph. As stated, there are some limitations or caveats that one must take into account when attempting to retrieve insights from the resulting graphs.
As highlighted above, regardless of political orientation, communities are invariably divided along racial lines. At a more zoomed-in scale, communities are divided by various political orientations.
Notwithstanding, there does appear to be some truth to the adage “Birds of a feather, flock together”.
The code for this can be found here on Github.