FAQ

Getting Started

Screen6 deduplicates audiences by discovering patterns in the event logs that clients provide to us. This FAQ summarizes what is required to get started with Screen6’s Cross-Device ID Solutions.

What data to select in order to start deduplication with Screen6


As the amount of deduplication depends on the data that is provided to us, it’s important to select the right data, keeping in mind the following: density, time range, device mix, user ID types and sampling.

  • Density in data
    The data you send over should be dense. The more frequently we encounter User Identifiers (UIDs), the sooner we can deduplicate them.
  • Time range
    The data should cover a long enough time range. Our pattern matching algorithm works by spotting patterns over time. Therefore we need at least two weeks of data, but depending on density three or four weeks may be required.
  • Device mix
    Including mobile traffic will increase the level of deduplication. We will spot more patterns and there will be more devices that we can deduplicate. Of course, we are also able to connect IDs within a single device (intra-device).
  • User ID types
    The data should contain all available UID types. If you have access to Device IDs, IDFAs (Identifier for Advertisers), Android IDs or any other type of UID, then these should all be included in the data. These non-cookie UIDs provide a higher degree of persistency, which helps the deduplication.
  • Sampling
    Data size may be reduced by providing the data for a certain country or region. Any other type of data sampling will affect the reach of deduplication and is therefore not recommended.

Screen6 requires a few parameters to be present in the provided logs.

Required parameters


Please note that we can accept multiple UID columns and multiple corresponding UID Type columns.

Parameter and description:

  • Time
    Date+Time or Timestamp in sec or msec of the event.
    Examples:
    1453133038
    2016-01-06 20:34:25 UTC
  • IP
    Full IP address of the user. A hash of the IP will also work. IPs should not be truncated.
  • UID[]
    User ID (UID). If multiple types may appear in the same event, then give each UID its own column. Common types of UIDs are:
      • 3rd party cookie
      • 1st party cookie (see note at UID type)
      • didmd5
      • didsha1
      • dpidmd5
      • dpidsha1
      • macsha1
      • macmd5
      • ifa (sometimes called idfa)
  • UIDType[]
    If one UID column contains multiple types of UIDs then the type should be specified in a separate column. Note that we prefer to have a separate column for each UID type. If the UID is a 1st party cookie then the domain (not URL) on which the cookie was set should be listed here.
  • UserAgent
    The full User Agent as received in the event. Note that if the file is comma separated, then user agents that contain commas should be wrapped in quotes (").
    Example:
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36

Requirements and formatting specifications for log files


Format


Screen6 supports tab separated files (TSV), comma separated files (CSV) as well as other custom types of value separation. Tab separation is preferred over commas because it doesn’t require wrapping values that contain commas (e.g. User Agents) in quotes.

Files should be encoded in UTF-8 or ISO-8859-1.
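As an illustration of why tab separation avoids quoting, here is a minimal Python sketch that writes events as TSV with the standard csv module. The function name write_events_tsv and the column names (borrowed from Sample 1 further down) are assumptions made for this example, not a format Screen6 prescribes.

```python
import csv
import io

def write_events_tsv(rows, handle):
    """Write event rows as tab-separated values (hypothetical helper).

    With the default QUOTE_MINIMAL setting, the csv module quotes a field
    only when it contains the delimiter, a quote character or a line break,
    so User Agents containing commas stay unquoted in a TSV file.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Timestamp", "IP", "CookieID", "IDFA", "UserAgent"])
    writer.writerows(rows)

# In practice the handle would come from
# open(path, "w", encoding="utf-8", newline=""); a buffer keeps the demo local.
buffer = io.StringIO()
write_events_tsv(
    [[1453133038, "1.2.3.4", "123456789", "",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko)"]],
    buffer,
)
tsv_text = buffer.getvalue()
```

The empty IDFA value is written as an empty string that is still delimited by tabs, so column positions are preserved.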

Empty values


If you have empty values in your data (for example events where a DeviceID is missing) then Screen6 needs to know how these empty values are indicated. Common ways of specifying empty values are:

  • empty string (the empty string should still be separated by the separation character!).
  • null
  • \N
  • 0
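To make the role of an agreed empty-value marker concrete, the following Python sketch maps the markers listed above to None while keeping column positions intact. The names EMPTY_MARKERS, normalize_empty and parse_line are hypothetical, and treating "0" as empty is only appropriate for columns where that convention has been agreed.

```python
# Markers taken from the list above; "\\N" is a literal backslash followed by N.
EMPTY_MARKERS = {"", "null", "\\N", "0"}

def normalize_empty(value, markers=EMPTY_MARKERS):
    """Return None when the field holds one of the agreed empty markers."""
    return None if value.strip() in markers else value

def parse_line(line, separator="\t"):
    """Split one log line and map empty markers to None.

    Empty strings remain separated by the separator, so the number of
    columns stays the same for every line.
    """
    return [normalize_empty(field) for field in line.rstrip("\n").split(separator)]

example = parse_line("1412845993\t1.2.3.4\t\t\\N\tMozilla/5/...\n")
```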

Size, compression and naming


The log files you provide to us should comply with the following requirements:

  • File size: File size should not exceed 2GB. Also avoid many small files (below 100MB) as this will slow down ingestion.
  • Compression: gzip for compression or no compression
  • Naming: filenames should include a date stamp in YYYYMMDD format. For example: client_20150122_part123.log.gz
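The size and naming requirements can be checked before upload. Below is a rough Python sketch under a few assumptions: the helper names are hypothetical, the limits are interpreted in binary units (MiB/GiB), and the date stamp is taken to be any standalone run of eight digits that parses as a calendar date.

```python
import re
from datetime import datetime

MIN_BYTES = 100 * 1024**2  # 100MB lower bound (binary units are an assumption)
MAX_BYTES = 2 * 1024**3    # 2GB upper bound (binary units are an assumption)
DATE_STAMP = re.compile(r"(?<!\d)(\d{8})(?!\d)")

def has_valid_date_stamp(filename):
    """True if the name contains an 8-digit run that parses as YYYYMMDD."""
    for candidate in DATE_STAMP.findall(filename):
        try:
            datetime.strptime(candidate, "%Y%m%d")
            return True
        except ValueError:
            continue
    return False

def check_file(filename, size_bytes):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    if not has_valid_date_stamp(filename):
        problems.append("missing YYYYMMDD date stamp")
    if size_bytes > MAX_BYTES:
        problems.append("larger than 2GB")
    elif size_bytes < MIN_BYTES:
        problems.append("smaller than 100MB")
    return problems

ok = check_file("client_20150122_part123.log.gz", 500 * 1024**2)
bad = check_file("client_part123.log.gz", 10)
```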

Examples of supported log files


Sample 1


Each UID type has its own column

Timestamp   IP       CookieID   IDFA       UserAgent
1412845993  1.2.3.4  123456789  987654321  Mozilla/5/...

Sample 2 (using 1st party cookies)


If your data contains 1st party cookies (cookies set on the domain of the website), then the domain on which the cookie was set should be listed in a UID Type column. You can mix 1st and 3rd party cookies by setting a default domain in case of 3rd party cookies (e.g. client.com).

Timestamp                  IP       CookieID  UIDType      UserAgent
2014-10-03T06:58:45+00:00  1.2.3.4  12345678  cnn.com      Mozilla/5/...
2014-10-03T06:59:12+00:00  5.6.7.8  34567890  weather.com  Mozilla/5/...
2014-10-03T06:59:52+00:00  3.4.5.6  23456789  client.com   Mozilla/5/...

Sample 3 (supported but not preferred)


In the next example there are separate columns for CookieIDs and DeviceIDs. The type of DeviceID is further specified in the DeviceIDtype column.

Preferred way of logging multiple UID types
It’s always preferred to give each UID Type its own column. However if this is complicated due to technical reasons then UID type columns may be used.
Timestamp                  IP       CookieID   DeviceID  DeviceIDtype  UserAgent
2014-10-03T06:58:45+00:00  1.2.3.4  123456789  abcdef    didmd5        Mozilla/5/...
2014-10-03T06:59:22+00:00  5.6.7.8  987654321  ghijklm   ifa           Mozilla/5/...

  1. Did I include all parameters:
    • Date+time or Timestamp
    • IP or hashed IP (not truncated)
    • All available types of User IDs, either in separate columns or with a column denoting the User ID Type
    • User Agent
  2. Are the IP addresses complete, not truncated? Truncation is sometimes done by replacing an octet with a zero.
  3. Are the User Agents complete, not truncated?
  4. Did I wrap the User Agents in quotes where they contain a separation character?
  5. Are my file sizes always between 100MB and 2GB?
  6. Are my files either gzip compressed or not compressed at all?
  7. Did I include the YYYYMMDD date stamp in the file name, similar to client_20150122_part123.log.gz?
  8. Are the empty values clearly indicated?

There are various ways in which the data can be transferred to us. You can choose whichever method is most convenient for you.

Amazon S3


If your logs are stored on Amazon S3 you can give us access to your bucket. We will provide you with our AWS ID (IAM) that you can use to set up the access policy. Alternatively you can provide us with an S3 Access Key and Secret Key.

Here’s a sample of an S3 Bucket Policy that will give us full access.

Sample of an S3 Bucket Policy

{
  "Version": "2008-10-17",
  "Id": "Policy1412080228241",
  "Statement": [
    {
      "Sid": "Stmt1412080223234",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::516727802551:root"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::bucketname",
        "arn:aws:s3:::bucketname/*"
      ]
    }
  ]
}

Google Cloud Storage


If your logs are stored on Google Cloud Storage you can give us (read-only) access to your bucket. We will provide you with a Project ID that you can add to the Bucket Permissions and Object Default Permissions.

Pull from SFTP


Alternatively you can also give us access to an SFTP endpoint on your end from which we will pull the data.

Result data is available in two types of mapping tables: horizontal and vertical. They contain the same data, but the structure is different. You can indicate which format you prefer.

The mapping tables use Match IDs. These are IDs that tie together clusters of User IDs that belong to the same person. Match IDs are described in more detail later in this FAQ.

Optionally, the mapping tables may also include Household IDs. These identify clusters of User IDs that belong to the same household. Household IDs, and how they appear in the result data, are also described later in this FAQ.

Incremental (daily) updates
If Screen6 processes your data on a daily basis, you can ask us to send only the updates to the mapping table compared with yesterday's results.

Horizontal


This is a file where every line starts with a Match ID followed by its related User IDs. Note that the User IDs may be separated by comma even if the file format uses tab separation. Both separation characters can be customised on request.

Example of a horizontal MatchID mapping table

12345678  34302050,93436308,15420117
87654321  53410606,59794190

Vertical


This is a file where every line starts with a Match ID followed by one of its related User IDs.

Example of a vertical MatchID mapping table

12345678  34302050
12345678  93436308
12345678  15420117
87654321  53410606
87654321  59794190
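Since both layouts carry the same data, one can be derived from the other. The sketch below converts horizontal lines into vertical (Match ID, User ID) pairs. It assumes a tab between the Match ID and the User ID list and commas between User IDs; the function name is hypothetical and both separators are parameters because they can be customised.

```python
def horizontal_to_vertical(lines, column_sep="\t", id_sep=","):
    """Yield (match_id, user_id) pairs from horizontal mapping lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match_id, user_ids = line.split(column_sep, 1)
        for user_id in user_ids.split(id_sep):
            yield match_id, user_id.strip()

horizontal = [
    "12345678\t34302050,93436308,15420117",
    "87654321\t53410606,59794190",
]
vertical = list(horizontal_to_vertical(horizontal))
```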

User ID Types


If your source data contains multiple types of User IDs (for example: cookieIDs and DeviceIDs) then you can ask us to include just one or all types of IDs in the result data. Optionally the IDs may be prefixed with the type name. For example:

Example of prefixed User IDs

12345678  cookie:34302050,cookie:93436308,ifa:15420117
87654321  cookie:53410606,ifa:59794190
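When reading prefixed result data, the type and the ID need to be split apart again. A small Python sketch follows, assuming a colon as prefix separator and a comma between IDs as in the example above; the function names are made up for illustration.

```python
def split_prefixed_id(token, prefix_sep=":"):
    """Split 'type:id' into (type, id); type is None when no prefix is present."""
    uid_type, sep, uid = token.partition(prefix_sep)
    return (uid_type, uid) if sep else (None, token)

def group_by_type(user_ids, id_sep=",", prefix_sep=":"):
    """Group the User IDs of one mapping line by their type prefix."""
    groups = {}
    for token in user_ids.split(id_sep):
        uid_type, uid = split_prefixed_id(token, prefix_sep)
        groups.setdefault(uid_type, []).append(uid)
    return groups

grouped = group_by_type("cookie:34302050,cookie:93436308,ifa:15420117")
```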

Value separation


By default Screen6 will separate values with commas and, optionally, prefix User IDs with their type followed by a colon. You can ask us to use any other separation and prefixing characters.

A Match ID is a cross-device identifier that sticks with a person over time.

Once Screen6 has identified a cluster of User IDs that belong to the same person, that cluster is assigned a Match ID. Over time, the User IDs within the cluster may change; new ones come in, others disappear. However, the Match ID for this cluster will remain the same.

The User IDs that get linked to a Match ID can come from a mix of any type: cookies, IDFAs, MAC addresses, Smart TV IDs, etc.

Persistency: Match ID vs User ID decay

User IDs, cookies in particular, tend to disappear over time; sometimes slowly and sometimes immediately. These User IDs will get linked to Match IDs no matter their own level of persistency. This makes Match IDs much more persistent than User IDs.

Persons and Households

Devices that are shared amongst family members and that are not predominantly used by one of the members will not be attributed to a Match ID, however they will be attributed to a Household ID. Match IDs relate to persons, not households.

In an ideal case a person has only one Match ID that links all of the devices and IDs of this person.
However, there are cases where our algorithms have linked, for example, the IDFA of someone's phone to the cookies of that phone in one cluster, and the IDFA of the same person's tablet to a desktop cookie in a separate cluster. In such a case a person will have two Match IDs. These may still get linked at a later point in time.

Effects of probabilistic matching

It does not happen often, but in some cases our algorithms, which are probabilistic by nature, may decide that two Match IDs belong to the same person or that a User ID connected to a Match ID actually belongs to another Match ID. This results in movements in the graph.

When it’s decided that two Match IDs belong to one person, the Match ID that has the fewest User IDs linked is discarded and its User IDs are moved to the other Match ID.

When it’s decided that a User ID actually belongs to another Match ID it is simply moved over to that Match ID.
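The two kinds of graph movement described above can be sketched in Python as follows. This is only an illustration of the stated rules, not Screen6's implementation: the graph is modelled as a plain dict from Match ID to a set of User IDs, the function names are hypothetical, and the tie-breaking rule for clusters of equal size is an assumption.

```python
def merge_match_ids(graph, match_a, match_b):
    """Merge two Match IDs; the one with fewer User IDs is discarded.

    graph maps Match ID -> set of User IDs and is modified in place.
    Returns the surviving Match ID. On equal sizes match_a survives
    (an assumption; the source does not specify tie-breaking).
    """
    if len(graph[match_a]) >= len(graph[match_b]):
        keep, drop = match_a, match_b
    else:
        keep, drop = match_b, match_a
    graph[keep] |= graph.pop(drop)
    return keep

def move_user_id(graph, user_id, source, target):
    """Move a single User ID from one Match ID to another."""
    graph[source].discard(user_id)
    graph.setdefault(target, set()).add(user_id)

graph = {"A": {"u1", "u2", "u3"}, "B": {"u4"}}
survivor = merge_match_ids(graph, "B", "A")
```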

A Household ID is a cross-device identifier that sticks with a household over time.

Once Screen6 has identified a cluster of User IDs that belong to the same household, that cluster is assigned a Household ID. Over time, the User IDs within the cluster may change; new ones come in, others disappear. However, the Household ID for this cluster will remain the same.

The User IDs that get linked to a Household ID can come from a mix of any type: cookies, IDFAs, MAC addresses, Smart TV IDs, etc.

A Household ID is not an IP address


Households are often simply identified by an IP address. However, the IP address assigned to a household may change; depending on the ISP, this may happen anywhere from once a year to daily. Household IDs remain the same for a household even if the IP address of that household changes.

Household IDs and Match IDs


A Screen6 device graph may contain both Household IDs and Match IDs. Match IDs, and their corresponding User IDs, may get linked to a Household ID. User IDs that have not been assigned to a Match ID may also get linked to a Household ID. These IDs will mostly come from family shared devices such as TVs, game consoles and tablets.

Here’s an example of a single household with two Match IDs and five User IDs, one of which has not been linked to a Match ID. For that User ID, the Match ID is the same as the Household ID.

Household ID  Match ID  User ID
123           ABC       ID1
123           ABC       ID2
123           DEF       ID3
123           DEF       ID4
123           123       ID5

Many, but not all of your User IDs will get linked in a Screen6 device graph for two reasons:

  1. Sometimes a User ID is the only ID of a person that is available in the source event level data, so there is simply no other User ID available to link it to.
  2. Our algorithms are not always able to find the behavioral patterns to confirm that two User IDs belong to the same person.

These factors define the coverage or match-rate of the device graph. The exact coverage figure may be defined in various ways:

User ID or Event based coverage


User ID based coverage is the percentage of all User IDs in the source data (more on that later) that have been matched to another User ID. In other words, it is the percentage of User IDs that have been assigned to a Match ID.

Event based coverage is the percentage of events in the source data (impressions, bids, etc.) that carry a User ID that is linked to at least one other User ID.

As some User IDs may have been seen only once over a week and some others many times, these two percentages can be quite different.

If multiple types of User IDs are in the data (such as cookies and IDFAs), then the coverage percentages can be computed for each type separately.
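The difference between the two coverage figures is easiest to see with a small calculation. The Python sketch below is an illustration under simple assumptions: the source data is reduced to one User ID per event, and match_of (a hypothetical name) contains only User IDs that have been assigned a Match ID.

```python
def coverage(events, match_of):
    """Return (user_id_coverage, event_coverage) as fractions between 0 and 1.

    events: iterable of User IDs, one entry per event in the source data.
    match_of: dict mapping matched User IDs to their Match ID.
    """
    events = list(events)
    if not events:
        return 0.0, 0.0
    unique_ids = set(events)
    uid_cov = sum(1 for uid in unique_ids if uid in match_of) / len(unique_ids)
    event_cov = sum(1 for uid in events if uid in match_of) / len(events)
    return uid_cov, event_cov

# "a" is seen four times and is matched; "b" and "c" are seen once and are not.
uid_cov, event_cov = coverage(["a", "a", "a", "a", "b", "c"], {"a": "M1", "d": "M1"})
```

Here one of three User IDs is matched, but four of six events carry a matched User ID, so the event based figure is twice the User ID based one.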

Look-back or real-time coverage


The User ID or event based stats are computed as a percentage of User ID/events in the source data. The source data is the raw data that clients send to us every day. For statistics we look at two ‘windows’ within this source data:

  1. Look-back means the last day of raw data on which the device graph of that day was computed.
  2. Real-time means the next day of raw data, after the day on which the graph was computed.

Look-back coverage should be used for cross-device analytics scenarios such as conversion attribution and campaign reach/frequency.

Real-time coverage should be used for cross-device targeting. The real-time coverage percentages are usually slightly lower because new User IDs will be encountered that have not been seen before and as such, have not yet made it into the device graph.

Precision


Screen6 uses probabilistic algorithms to match User IDs that belong to the same person. These matches should be as accurate as possible. When deterministic verification data, such as login IDs or email addresses, is available, the client and Screen6 are able to compute the accuracy of the verifiable part of the graph. This figure is called precision.

How we compute precision


We compute precision and recall using deterministic data that clients provide to us. As the device graphs that we produce are private to the client, we are not able to use deterministic data collected elsewhere for verification.

PAIRS AND CLUSTER BASED VERIFICATION

We use two methods for computing precision: pairs based and cluster based.

The pairs based method looks at individual pairs of user IDs. A pair is labelled correct when both user IDs have the same deterministic ID; a pair is incorrect when the two user IDs have a different deterministic ID.

Precision = correctPairs / (correctPairs + incorrectPairs)

The cluster based method looks at all the user IDs belonging to a single cluster or Match ID. A cluster has been matched correctly when there’s only one Deterministic ID found for the cluster, and if that Deterministic ID is linked to at least two of its User IDs.

Precision = correctClusters / (correctClusters + incorrectClusters)
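Both formulas can be illustrated with a short Python sketch. It is one reading of the definitions above, not Screen6's code: function and variable names are hypothetical, only User IDs that have a Deterministic ID are considered, and clusters with fewer than two verifiable User IDs are skipped as unverifiable (an assumption).

```python
from itertools import combinations

def pairs_precision(clusters, det_of):
    """clusters: Match ID -> list of User IDs; det_of: User ID -> Deterministic ID."""
    correct = incorrect = 0
    for user_ids in clusters.values():
        verifiable = [u for u in user_ids if u in det_of]
        for a, b in combinations(verifiable, 2):
            if det_of[a] == det_of[b]:
                correct += 1
            else:
                incorrect += 1
    total = correct + incorrect
    return correct / total if total else None

def cluster_precision(clusters, det_of):
    """A cluster is correct when exactly one Deterministic ID covers >= 2 of its User IDs."""
    correct = incorrect = 0
    for user_ids in clusters.values():
        det_ids = [det_of[u] for u in user_ids if u in det_of]
        if len(det_ids) < 2:
            continue  # assumption: not verifiable, so not counted either way
        if len(set(det_ids)) == 1:
            correct += 1
        else:
            incorrect += 1
    total = correct + incorrect
    return correct / total if total else None

clusters = {"M1": ["u1", "u2", "u3"], "M2": ["u4", "u5"], "M3": ["u6", "u7"]}
det_of = {"u1": "d1", "u2": "d1", "u3": "d2", "u4": "d3", "u5": "d3", "u6": "d4"}
```

In this toy data M1 yields one correct and two incorrect pairs, M2 yields one correct pair and M3 cannot be verified, so both methods give a precision of 0.5.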

CLEANING VERIFICATION DATA

Before calculating precision, the verification data is cleaned:

  • User IDs, and their corresponding Deterministic IDs, that are not present in the source event data that was sent to Screen6 are discarded during verification.
  • User IDs should not get linked to more than one Deterministic ID. In such a case the related Deterministic IDs are removed from the verification data.
  • A Deterministic ID should link at least two User IDs. Deterministic IDs that link to only one User ID can only point to an incorrect match and can never confirm a correct one. As such, they would bias the precision calculation downward.
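The three cleaning rules above can be sketched as a small Python function. The function name, the input shape (pairs of User ID and Deterministic ID) and the order in which the rules are applied are assumptions made for this illustration.

```python
from collections import defaultdict

def clean_verification(links, source_user_ids):
    """Return a dict User ID -> Deterministic ID after cleaning.

    links: iterable of (user_id, deterministic_id) pairs.
    source_user_ids: set of User IDs present in the source event data.
    """
    # Rule 1: drop User IDs that never appeared in the source event data.
    links = {(u, d) for u, d in links if u in source_user_ids}

    # Rule 2: drop Deterministic IDs tied to a User ID that has more than one.
    dets_per_user = defaultdict(set)
    for u, d in links:
        dets_per_user[u].add(d)
    ambiguous = set()
    for dets in dets_per_user.values():
        if len(dets) > 1:
            ambiguous |= dets
    links = {(u, d) for u, d in links if d not in ambiguous}

    # Rule 3: keep only Deterministic IDs that link at least two User IDs.
    users_per_det = defaultdict(set)
    for u, d in links:
        users_per_det[d].add(u)
    return {u: d for u, d in links if len(users_per_det[d]) >= 2}

raw_links = [("u1", "d1"), ("u2", "d1"), ("u3", "d2"), ("u3", "d3"),
             ("u4", "d3"), ("u5", "d4"), ("u9", "d1")]
cleaned = clean_verification(raw_links, {"u1", "u2", "u3", "u4", "u5"})
```

In this example u9 is not in the source data, u3 makes d2 and d3 ambiguous, and d4 links only one User ID, so only the two links to d1 survive.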

Inaccuracies in deterministic data


When computing precision, the provided deterministic data is considered to be a truth set. However, it is important to point out that most of the time these data sets are not flawless themselves. For example, someone may log in on site 1 as johndoe@gmail.com and on site 2 as johndoe@mycompany.com. When Screen6 matches the corresponding User IDs the match is done correctly, but when computing precision the match will be deemed incorrect.

Recall and coverage


Recall indicates what percentage of matches in a deterministic verification data set has also been matched correctly in the device graph. Note that the same cleaning of the verification data should be applied as mentioned above.
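One possible pairs based reading of recall is sketched below in Python: of all pairs of User IDs that share a Deterministic ID in the cleaned verification data, it counts the fraction that also share a Match ID in the device graph. The function name and the pairs based interpretation are assumptions; the text above does not prescribe a specific formula.

```python
from collections import defaultdict
from itertools import combinations

def pairs_recall(det_of, match_of):
    """det_of: User ID -> Deterministic ID (cleaned); match_of: User ID -> Match ID."""
    users_per_det = defaultdict(list)
    for user_id, det_id in det_of.items():
        users_per_det[det_id].append(user_id)
    found = total = 0
    for users in users_per_det.values():
        for a, b in combinations(sorted(users), 2):
            total += 1
            # A pair is recalled only if both User IDs share the same Match ID.
            if a in match_of and match_of.get(a) == match_of.get(b):
                found += 1
    return found / total if total else None

verified = {"u1": "d1", "u2": "d1", "u3": "d1", "u4": "d2", "u5": "d2"}
graph_matches = {"u1": "M1", "u2": "M1", "u4": "M2"}
recall = pairs_recall(verified, graph_matches)
```

The verification data contains four pairs (three for d1, one for d2) and the graph reproduces only the pair u1 and u2, giving a recall of 0.25.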

Coverage doesn’t relate to verification data. Coverage is the share of User IDs or impressions that can be linked to a Match ID. Coverage is described in more detail earlier in this FAQ.