Classifying Data

Through classification you can create richer, unique analysis results. In this guide you'll learn how to tag and score data to fit your use case.

What is data classification in PYLON?

If you think about an example post a user might create:

I love my new phone, it's got an amazing screen!

From this small amount of text, we as humans can understand:

  • Emotion - they are excited
  • Position in purchase cycle - they are a new owner
  • Topics - they are talking about a phone and its screen

There is huge value in the message, but it's not easily accessible to you in your analysis as it is buried inside unstructured text, written in informal language.

Classification is the process of extracting this value into a structured form so it is available for analysis.

Using classification you can add extra value and meaning to data as it is recorded into your index. By adding classification to your data you will give yourself far more options when it comes to analyzing the data.

How does classification work?

Classification in PYLON works as follows:

  1. You create an interaction filter that records data you want to analyze.
  2. You add classification rules to your interaction filter.
  3. You start a recording.
  4. Any interactions that match your filter are run through your classification rules before being stored in your index. (Every rule is run against each interaction.)
  5. Your classified interactions are recorded into your index.
  6. You can make use of your classifications within your analysis queries.

There are two types of classification rules you can use:

  • Tagging rules add a tag (or label) to an interaction if it meets the criteria you specify. If an interaction matches multiple tag rules all matching tags will be assigned.
  • Scoring rules give a score to an interaction. Each rule has a value. The final score given to an interaction is the sum of all the matching rule values.

Tagging interactions

To understand tagging we'll add classification rules to the example filter from the Recording Data guide.

The interaction filter is currently as follows (to keep things simple we'll just record stories for now):

fb.author.country == "United States"
AND ( fb.content contains_any "BMW, Honda, Ford"
    OR fb.topics.name in "Cars" )

To add tagging rules to the filter we first wrap the filter conditions in a return statement, then declare the tagging rules above it. In this example we've added rules that simply identify the brand being discussed.

tag "BMW" { fb.content contains_any "BMW" }
tag "Honda" { fb.content contains_any "Honda" }
tag "Ford" { fb.content contains_any "Ford" }
return {
    fb.author.country == "United States"
    AND ( fb.content contains_any "BMW, Honda, Ford"
        OR fb.topics.name in "Cars" )
}

Notice each tag rule has two parts:

  • tag "[label]" - The tag to assign to the interaction if there is a match.
  • { CSDL conditions } - The tag is assigned if the interaction matches the CSDL conditions.

Based on these rules, if someone posted the following content in their story:

Loving the power of my new BMW!

The interaction would be tagged with "BMW".

Note that if someone posted the following content in their story:

Loving the power of my new BMW! Much better than my old Ford Focus.

The interaction would be tagged with both "BMW" and "Ford". If multiple rules are matched then all appropriate tags are applied.
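This "every matching rule applies its tag" behaviour can be sketched with a few lines of Python. This is a keyword-matching simulation for illustration only, not how the PYLON engine itself works:

```python
# Illustrative simulation of tagging semantics: every rule is checked
# against the interaction and every matching rule's tag is applied.
def apply_tags(content, rules):
    """rules maps a tag label to a list of keywords that trigger it."""
    text = content.lower()
    return [tag for tag, keywords in rules.items()
            if any(keyword.lower() in text for keyword in keywords)]

rules = {"BMW": ["BMW"], "Honda": ["Honda"], "Ford": ["Ford"]}
apply_tags("Loving the power of my new BMW! Much better than my old Ford Focus.", rules)
# -> ['BMW', 'Ford']
```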

These example rules look for keywords in the story content, but you can make your CSDL as complex as you like combining any number of conditions and using the full range of targets and operators available.

Using the same steps as set out in the Recording Data guide you can compile this interaction filter and start a recording.

To check that data is being tagged you can submit an analysis query to your index specifying the interaction.tags target:

datasift.pylon.analyze([id for your recording], {
    "analysis_type": "freqDist",
    "parameters":
    {
        "target": "interaction.tags",
        "threshold": 3
    }
})

This query will give you a frequency distribution of your recorded interactions broken down by the tags assigned.
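If you're working with the response in Python, the results can be read along these lines. The response structure shown here is an assumption based on typical PYLON freqDist responses (a results list with key, interactions and unique_authors fields); check the API reference for the exact shape:

```python
# Sketch: summarizing a freqDist response. The example_response structure
# is an assumed shape, not captured API output.
def summarize_freqdist(response):
    """Return (key, interaction count) pairs, highest count first."""
    results = response["analysis"]["results"]
    return sorted(((r["key"], r["interactions"]) for r in results),
                  key=lambda pair: pair[1], reverse=True)

example_response = {
    "analysis": {
        "analysis_type": "freqDist",
        "results": [
            {"key": "Ford", "interactions": 120, "unique_authors": 95},
            {"key": "BMW", "interactions": 340, "unique_authors": 280},
        ],
    }
}
summarize_freqdist(example_response)
# -> [('BMW', 340), ('Ford', 120)]
```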


Using namespaces

It's not uncommon to have hundreds of tagging rules. You can use namespaces to help you give structure to your tags. For example:

tag.automotive.brand "BMW" { fb.content contains_any "BMW" }
tag.automotive.brand "Honda" { fb.content contains_any "Honda" }
tag.automotive.brand "Ford" { fb.content contains_any "Ford" }
tag.automotive.feature "Style" { fb.content contains_any "paint,matte,trim,alloys,alloy wheels,alloy wheel,cool,sexy,beautiful,stunning" }
tag.automotive.feature "Performance" { fb.content contains_any "power,handles,handling,fast,fastest,quick,quickest,engine size,bhp,top speed,mph,acceleration,performance" }

Notice the syntax has changed to add the namespace. The syntax here is:

tag.[top-level namespace].[2nd-level namespace] "[tag name]" { [CSDL] }

In fact you can specify as many levels of namespaces as you'd like.

Based on these rules, if someone posted the following content in their story:

Loving the power of my new BMW!

The interaction would be tagged with "automotive.brand.BMW" and "automotive.feature.Performance".
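To see how the '.'-separated tags nest, here is a small sketch that rebuilds the namespace hierarchy from fully-qualified tag strings (an illustration only, not part of the PYLON API):

```python
# Sketch: build a nested dict from namespaced tag strings to show
# how each '.' introduces another level of the tag tree.
def build_tag_tree(tags):
    tree = {}
    for tag in tags:
        node = tree
        for part in tag.split("."):
            node = node.setdefault(part, {})
    return tree

build_tag_tree(["automotive.brand.BMW", "automotive.feature.Performance"])
# -> {'automotive': {'brand': {'BMW': {}}, 'feature': {'Performance': {}}}}
```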

To check that data is being tagged you can submit an analysis query to your index specifying the interaction.tag_tree target:

datasift.pylon.analyze([id for your recording], {
    "analysis_type": "freqDist",
    "parameters":
    {
        "target": "interaction.tag_tree.automotive.brand",
        "threshold": 3
    }
})

Notice here that in the query we append the namespace we want to analyze. You can only analyze one namespace per analysis query.

Valid characters for tags and namespaces

When you define tags and namespaces you can use the following valid characters:

  • latin letters (a-z), excluding accented characters
  • hyphens
  • digits (not valid for namespaces)
  • spaces (not valid for namespaces)
  • underscores

You cannot use the period character '.' in your tag names, this is a reserved character used for structuring namespaces.
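The character rules above can be expressed as regular expressions. This is just an illustration of the documented rules, not an official validator:

```python
import re

# Tag names: latin letters, digits, spaces, underscores and hyphens.
TAG_NAME = re.compile(r"^[A-Za-z0-9 _-]+$")
# Namespaces: as above, but digits and spaces are not allowed.
NAMESPACE = re.compile(r"^[A-Za-z_-]+$")

def is_valid_tag_name(name):
    return bool(TAG_NAME.match(name))

def is_valid_namespace(name):
    return bool(NAMESPACE.match(name))

is_valid_tag_name("alloy wheels")   # -> True
is_valid_tag_name("brand.BMW")      # -> False ('.' is reserved for namespaces)
is_valid_namespace("automotive")    # -> True
is_valid_namespace("brand 2")       # -> False (no digits or spaces)
```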

If you use characters which are not covered by the rules above your CSDL definition may still compile. However, the tags you have added to your data will not be accessible in your analysis queries.

Scoring interactions

With tags you are assigning distinct labels to interactions, whereas with scoring you are giving interactions a score on a continuous scale.

The syntax for a scoring rule is:

tag.[top-level namespace].[2nd-level namespace].field [score] { [CSDL] }

Scores can be positive or negative. Every scoring rule is evaluated against each interaction, and the values of all matching rules are summed to give the interaction's final score.
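The summing behaviour can be sketched in Python (hypothetical keyword rules for illustration; real scoring rules are written in CSDL):

```python
# Illustrative simulation of scoring semantics: every rule is evaluated
# and the values of all matching rules (positive or negative) are summed.
def score_interaction(content, rules):
    """rules is a list of (keywords, value) pairs."""
    text = content.lower()
    return sum(value for keywords, value in rules
               if any(keyword in text for keyword in keywords))

rules = [
    (["love", "amazing", "great"], 2),   # positive signals
    (["hate", "terrible"], -3),          # negative signals
    (["new"], 1),
]
score_interaction("I love my amazing new phone", rules)
# -> 3 (the first rule matches once for +2, plus +1 for "new")
```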

Scoring is the mechanism that allows you to apply machine learning. Talk to your account manager to learn more.

Note that an interaction filter can include at most 10,000 tagging and scoring rules in total, including any imported using the stream keyword.

Using classified data in analysis

With classification rules added to your interaction filter, your index will contain interactions with classification metadata added. You can make use of this extra data when you submit analysis queries.

Usage as query filters

When you submit an analysis query you can use classification metadata to segment your data set in new ways.

interaction.tags

For example if you've applied the simple automotive brand tags, you could filter to discussions around certain brands:

from datasift import Client
datasift = Client("your username", "identity API key")

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'target': 'fb.author.gender',
        'threshold': 3
    },
    'filter': 'interaction.tags IN "BMW,Ford"'
}

print(datasift.pylon.analyze('recording id', analyze_parameters))

Adding this filter to your analysis query first filters data in your index to only interactions with these chosen tags applied, then performs the analysis on this subset of data.
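Conceptually the server-side behaviour is equivalent to this client-side sketch (using hypothetical in-memory interactions; the real filtering happens inside your index):

```python
from collections import Counter

# Sketch: restrict interactions to those carrying any of the chosen tags,
# then compute a frequency distribution over the target field.
def filtered_freq_dist(interactions, target, tag_filter):
    subset = [i for i in interactions if tag_filter & set(i["tags"])]
    return Counter(i[target] for i in subset)

interactions = [
    {"gender": "male", "tags": ["BMW"]},
    {"gender": "female", "tags": ["Ford"]},
    {"gender": "female", "tags": ["Honda"]},
]
filtered_freq_dist(interactions, "gender", {"BMW", "Ford"})
# -> Counter({'male': 1, 'female': 1})
```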

interaction.tag_tree

Or if you've applied the example with namespaced tags:

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'target': 'fb.author.gender',
        'threshold': 3
    },
    'filter': 'interaction.tag_tree.automotive.brand IN "BMW,Ford"'
}

print(datasift.pylon.analyze('recording id', analyze_parameters))

Notice in this example the syntax includes the namespace of the tags we are filtering on, in this case 'automotive.brand'.

interaction.ml.categories

If you've applied machine-learned scoring to your data you can segment by the classes you have assigned.

When you create a machine-learned classifier, it gives each interaction a score for each class you are trying to identify. For each interaction the class with the highest score is selected, and this is the class the interaction is considered to belong to.
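This "highest score wins" selection can be sketched as follows (hypothetical class names and scores; real scores come from your classifier):

```python
# Sketch: pick the class with the highest score for one interaction.
def assigned_class(class_scores):
    return max(class_scores, key=class_scores.get)

assigned_class({"marketing": 0.2, "spam": 0.1, "other": 0.7})
# -> 'other'
```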

For instance if you apply the Marketing SPAM classifier from the library you could filter with:

interaction.ml.categories == "other"

In this case the analysis query will filter to only interactions where 'other' is the highest scoring class before performing the analysis.

Frequency distribution analysis

When you request a frequency distribution you can use classification targets for the analysis.

interaction.tags

Use this target to analyze non-namespaced tags. Analyzing this target gives you counts against each tag you've declared.

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'target': 'interaction.tags',
        'threshold': 5
    }
}

print(datasift.pylon.analyze('recording id', analyze_parameters))

interaction.tag_tree

Use this target to analyze namespaced tags. Analyzing this target gives you counts against each tag within the namespace you declare.

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'target': 'interaction.tag_tree.automotive.brand',
        'threshold': 3
    }
}

print(datasift.pylon.analyze('recording id', analyze_parameters))

Note that for namespaced tags your query needs to address a level in your tagging taxonomy that contains leaf tags which can be counted.

interaction.ml.categories

Use this target to analyze scores given to interactions when applying machine learning.

When you create a machine-learned classifier, it gives each interaction a score for each class you are trying to identify. When you perform analysis with this target, the class with the highest score is selected for each interaction, and interactions are counted against that class.

For instance if you apply the Marketing SPAM classifier from the library you could submit the following query to analyze the classes assigned to your data:

analyze_parameters = {
    'analysis_type': 'freqDist',
    'parameters':
    {
        'target': 'interaction.ml.categories',
        'threshold': 5
    }
}

print(datasift.pylon.analyze('recording id', analyze_parameters))

Reusing classifiers

As you create more classification rules you'll no doubt create CSDL you will want to reuse across your projects and customers. With CSDL you can save a classifier definition then import it into another definition, using the tags keyword.

Returning to our first example here is the CSDL we created for our tags:

tag "BMW" { fb.content contains_any "BMW" }
tag "Honda" { fb.content contains_any "Honda" }
tag "Ford" { fb.content contains_any "Ford" }

You can compile this classifier on its own using the pylon/compile endpoint. This gives you a hash for the definition that you can then reference from an interaction filter. For example, we could refactor our filter as:

tags "[hash for compiled tags]"
return {
    fb.author.country == "United States"
    AND ( fb.content contains_any "BMW, Honda, Ford"
        OR fb.topics.name in "Cars" )
}

Next steps...

To learn more about classifying data take a look at these resources: