Using Topic Graphs to Understand an Audience

Ed Stenson | 19th April 2016

The latest release of PYLON for Facebook Topic Data offers two new targets that return data to build network graphs:

chart1

A network graph consists of two elements: nodes and edges. In this case the nodes are Facebook topics and each edge is a connection between any pair of topics that appear together in a single interaction. For instance, if Facebook infers that the author of a post is talking about both "BMW" and "skiing" the PYLON interaction for that post will contain:

  • the topic "BMW" in the "Cars" category
  • the topic "skiing" in the "Interest" category

In a visualization of this data you would expect to see the "BMW" and "skiing" nodes joined by an edge.

It is useful to understand which parts of Facebook's topic graph your audience is engaging with. Network graphs like this are powerful tools for exploring the clusters of interest in a recording but, until now, they have been rather query intensive in PYLON. First you had to analyze the top topics in your index. Then, using each of the topics as a query filter, you made a further analysis call for each of those topics to find all the other topics that appeared with them in stories. A network graph of 200 nodes required 201 API calls. The latest release of PYLON allows you to gather all that data in just one call.

Using the new targets

The fb.topic_graph target returns the most frequently co-occurring topics in stories. The fb.parent.topic_graph target returns the most frequently co-occurring topics in stories being engaged with. It is a straightforward task to go on to generate a network graph visualization.

These are analysis-only targets, which means that you never use them in filtering. To request topic graph data, just include one of these targets in a call to /pylon/analyze. Here's an example using the Python client library:

# Import the DataSift Python client library
from datasift import Client

# Start and end times for analysis (Unix timestamps)
start = 1452729600
end = 1453939200

client = Client("DataSift username", "Identity API key")

analyze_parameters = {
  "analysis_type": "freqDist",
  "parameters": {
    "target": "fb.topic_graph",
    "threshold": 200
  }
}

data = client.pylon.analyze('index id', analyze_parameters, start=start, end=end)

This returns a JSON array containing information for the edges. Each edge looks like this:

{  
    "key": "Cars|BMW|Interest|skiing",  
    "unique_authors": 2800,  
    "interactions": 3000  
}

Notice the way the four values are separated by pipes. This object shows that 2,800 unique authors created 3,000 posts to which Facebook assigned both the BMW topic (in the Cars category) and the skiing topic (in the Interest category).
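As a quick sketch (assuming, as in the example above, that category and topic names never themselves contain a pipe), the key can be split back into its two category/topic pairs:

```python
def split_edge_key(key):
    """Split an edge key such as "Cars|BMW|Interest|skiing"
    into its two (category, topic) pairs.

    Assumes category and topic names never contain a pipe.
    """
    category1, topic1, category2, topic2 = key.split("|")
    return (category1, topic1), (category2, topic2)

split_edge_key("Cars|BMW|Interest|skiing")
# -> (("Cars", "BMW"), ("Interest", "skiing"))
```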

The call returns up to 200 edges. It may return fewer if PYLON's redaction rules are triggered, because an edge has to appear in at least 100 posts from unique authors before it appears in the array returned by these targets.

There are some key things to note here. The analysis_type parameter is freqDist, which specifies that the endpoint will return frequency distribution data; you cannot set the analysis_type to timeSeries for these targets. When the analysis_type is freqDist, you must also supply a threshold. The target imposes a limit of 200 edges, so if you set the threshold to 200 (or any higher number) you will receive as many edges as PYLON can find, up to 200. If you want to produce a simpler network graph showing, for example, only the 50 most significant edges, set the threshold to 50.

Visualizing network graphs

I produced the visualizations for this blog using Google's Fusion Tables, which is part of Google Labs. To add it to Google Drive you simply need to install the app. Fusion allows you to import CSV files. I stored the JSON object returned by my call to /pylon/analyze in a variable called data and then ran a simple loop to convert the edges to something close to a comma-separated form.

for x in data['analysis']['results']:
    print(x['key'], ",", x['interactions'], ",", x['unique_authors'])

Here are the first few lines of the raw output:

Athlete|Kobe Bryant|Sports League|NBA , 3000 , 2800
Athlete|Kobe Bryant|Professional Sports Team|LA Lakers , 2800 , 2700
Professional Sports Team|LA Lakers|Sports League|NBA , 1300 , 1200
Athlete|Kobe Bryant|Professional Sports Team|Utah Jazz , 1100 , 1000
Athlete|Kobe Bryant|Sport|Basketball , 800 , 800
Professional Sports Team|Utah Jazz|Sports League|NBA , 700 , 700
Athlete|Kobe Bryant|Professional Sports Team|Golden State Warriors , 600 , 600
Professional Sports Team|LA Lakers|Professional Sports Team|Utah Jazz , 500 , 500
Athlete|Stephen Curry|Professional Sports Team|Golden State Warriors , 500 , 500
Athlete|Kobe Bryant|Interest|Black mamba , 400 , 400
City|Los Angeles, California|Professional Sports Team|Utah Jazz , 100 , 100
Athlete|Kobe Bryant|Musician\/Band|The Game , 100 , 100
...

Data science never comes without a cleanup phase:

  • I added a header, which makes life easier after the import to Fusion.
  • "Los Angeles, California" will be interpreted in CSV form as two values rather than one, so I changed it to "Los Angeles/California".
  • The /pylon/analyze endpoint returns four pipe-separated values but we want to separate the first category/topic pair from the second category/topic pair and have them appear in Fusion in separate columns. To do this, I replaced the second pipe in each row with a comma.
  • The data contains some escaped forward slashes. I removed the backslash in each case because it won't be necessary in Fusion and the data looks just that little bit tidier without the backslashes.
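The cleanup steps above can also be scripted. Here is a sketch; the replacement rules are the ones just described, generalized slightly (commas inside topic names become slashes, and any literal escaped forward slashes are unescaped):

```python
import csv
import sys

def clean_edge(edge):
    """Convert one edge object into a (Node1, Node2, Interactions, Uniques) row."""
    key = edge["key"]
    key = key.replace("\\/", "/")   # drop any literal escaped forward slashes
    key = key.replace(", ", "/")    # commas inside topic names would break the CSV
    category1, topic1, category2, topic2 = key.split("|")
    return ("|".join([category1, topic1]),
            "|".join([category2, topic2]),
            edge["interactions"],
            edge["unique_authors"])

def write_csv(edges, out=sys.stdout):
    """Write the edges as CSV with a header row, ready for import."""
    writer = csv.writer(out)
    writer.writerow(["Node1", "Node2", "Interactions", "Uniques"])
    for edge in edges:
        writer.writerow(clean_edge(edge))
```

Calling write_csv(data['analysis']['results']) then produces output in the form shown below.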

The end result, ready for import to Fusion, looks like this:

Node1, Node2, Interactions, Uniques
Athlete|Kobe Bryant, Sports League|NBA, 3000, 2800
Athlete|Kobe Bryant, Professional Sports Team|LA Lakers, 2800, 2700
Professional Sports Team|LA Lakers, Sports League|NBA, 1300, 1200
Athlete|Kobe Bryant, Professional Sports Team|Utah Jazz, 1100, 1000
Athlete|Kobe Bryant, Sport|Basketball, 800, 800
Professional Sports Team|Utah Jazz, Sports League|NBA, 700, 700
Athlete|Kobe Bryant, Professional Sports Team|Golden State Warriors, 600, 600
Professional Sports Team|LA Lakers, Professional Sports Team|Utah Jazz, 500, 500
Athlete|Stephen Curry, Professional Sports Team|Golden State Warriors, 500, 500
Athlete|Kobe Bryant, Interest|Black mamba, 400, 400
City|Los Angeles/California, Professional Sports Team|Utah Jazz, 100, 100
Athlete|Kobe Bryant, Musician/Band|The Game, 100, 100
...

Importing the data to Fusion is very simple. To generate a visualization, just select Add chart:

screen1

The result looked like this:

chart2

The edges appear bolder when there are more connections found, and Fusion allows you to select either the number of interactions or the number of unique authors in your visualization.

It also allows you to control how many nodes appear in your visualization. My tip is to set the threshold to 200 in your call to /pylon/analyze to collect as many nodes and edges as you can and then restrict the number of nodes in your visualization tool later if you want to see a simpler topic graph.
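If your visualization tool lacks that control, you can also trim the edge list client-side before export. A minimal sketch, sorting on either of the two measures the edge objects carry:

```python
def top_edges(edges, n=50, measure="interactions"):
    """Return the n edges with the highest counts for the given measure
    ("interactions" or "unique_authors")."""
    return sorted(edges, key=lambda e: e[measure], reverse=True)[:n]
```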

Removing noise

Our Data Science team recently recorded an index filtering on "diesel". The goal was to study the Diesel brand, but you might be surprised at the diversity of the results. Facebook posts about "diesel" might be related to:

  • the clothing brand
  • the automotive sector
  • the energy sector and Wall Street results
  • legal action related to the emissions scandal
  • a game engine running on the Xbox 360, PlayStation, Windows and Linux
  • dogs
  • Vin Diesel

If your focus is on the Diesel brand you might not initially stop to think that many people name their pets Diesel. The dataset that you'd record here will, inevitably, be noisy. While you will pick up the signal you want, you will also pick up interference. In this type of situation a topic graph is likely to show clusters of topics. It might even feature clusters that are connected to other clusters only by their link to the "diesel" node itself.

A topic graph helps you to see where the noise is and, to a great extent, understand how much noise there is compared to the signal you are looking for. It gives you a chance to rewrite your interaction filter and start a new recording if you want to, or to create a query filter to exclude as much of the noise as possible.
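One way the latter might look, sketched below: pass a CSDL query filter alongside the analysis parameters when calling /pylon/analyze. Both the CSDL expression and the filter argument here are illustrative assumptions, not tested syntax.

```python
# Illustrative CSDL query filter to exclude one obvious noise cluster.
# NOTE: this expression is an assumption, not verified CSDL syntax.
noise_filter = 'NOT fb.topics.name == "Vin Diesel"'

analyze_parameters = {
    "analysis_type": "freqDist",
    "parameters": {
        "target": "fb.topic_graph",
        "threshold": 200
    }
}

# The analyze call would then look like this (assuming the client library
# accepts a filter keyword alongside start and end):
# data = client.pylon.analyze('index id', analyze_parameters,
#                             filter=noise_filter, start=start, end=end)
```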

Summary

Generating a topic graph to accompany a study you're making can add a great deal of value to the exercise. It shows you where you are in Facebook's growing topic graph. With the two new topic graph targets it's now simple and inexpensive to collect the raw data for a topic graph, and tools such as Fusion Tables and Gephi automate the process of visualization.

References

Google Fusion Tables

Download Gephi

Guide to Discovering Topics in Facebook


Previous post: Studying large audiences with PYLON for Facebook topic data

Next post: Improvements to the Developer Site