
How Criteo Manages The Traceability Of Its Data

Criteo has set up a data lineage system around its Hadoop cluster. What techniques does it rely on? And what can you do with a data traceability (data lineage) management system? For example, automate the grouping of data quality problems or the repair of datasets affected by incidents.

Criteo is exploring these two avenues now that it has implemented its system. The system works at several levels of granularity: tables (a level that, in practice, also covers assets not backed by tables, such as Power BI reports), partitions, and columns.

This traceability supports, among other use cases, impact analysis, root cause analysis, compliance (audits, PII tracking), and metadata enrichment. It is exposed, on the one hand, through a series of datasets and, on the other, via Datadoc, a web app that includes catalog and observability functionality. The process is part of a broader pipeline for collecting and analyzing usage data across the Criteo data platform. Datadoc also exposes elements relating to the queries, tasks, and applications that result in transformations. To support its data lineage approach, Criteo uses several techniques, including the following (a minimal illustration follows the list):

  • Manually filling in the source-destination relationships (generally done by owners or stewards)
  • Searching for specific patterns in asset metadata
  • Using the execution logs of the data platform services (logs-as-source)
  • Exploiting the source code that specifies the transformations (source-as-code)
  • Integrating traceability capabilities into specific systems
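
To make this concrete, here is a minimal sketch of what a lineage record combining these ideas could look like. The schema, names, and fields below are illustrative assumptions, not Criteo's actual model; they simply show the table/partition/column granularity mentioned earlier and the tagging of each relationship with the technique that produced it.

```python
# Illustrative only: a minimal lineage-edge record, not Criteo's actual schema.
# It captures the granularity levels described above (table, partition, column)
# and records which technique produced each edge, which later makes
# deduplication by source quality possible.
from dataclasses import dataclass
from enum import Enum


class LineageSource(Enum):
    MANUAL = "manual"                # filled in by owners or stewards
    METADATA_PATTERN = "metadata"    # pattern found in asset metadata
    LOGS = "logs-as-source"          # execution logs of platform services
    SOURCE_CODE = "source-as-code"   # parsed from transformation code
    INTEGRATED = "integrated"        # emitted natively by the system itself


@dataclass(frozen=True)
class LineageEdge:
    src_asset: str                    # e.g. "hive://db.raw_clicks"
    dst_asset: str                    # e.g. "hive://db.clicks_hourly"
    src_partition: str | None = None  # optional partition-level detail
    dst_partition: str | None = None
    src_column: str | None = None     # optional column-level detail
    dst_column: str | None = None
    source: LineageSource = LineageSource.LOGS


edge = LineageEdge(
    src_asset="hive://db.raw_clicks",
    dst_asset="hive://db.clicks_hourly",
    dst_partition="day=2024-01-01/hour=00",
    source=LineageSource.LOGS,
)
```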

Multi-Layer Monitoring Within The Hadoop Cluster

At Criteo, most offline processing takes place on the data lake, a Hadoop cluster of 3,000 machines storing 180 PB. The workloads are mainly Spark jobs and Hive queries, all orchestrated by two in-house tools (Cuttle and BigDataflow).

Most of the raw data ingested comes from Kafka, via a centralized system that transmits it in batches. Once the transformations are completed, users consume the data with Presto/Trino, or it is exported to Vertica; some of these clusters power a Tableau deployment. Since most of the cluster’s inputs and outputs rely on centralized systems, traceability information can be exposed through their APIs. In some cases, the transformation code can be reviewed and validated as part of the CI.
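
As an illustration of the code review/CI idea, here is a hedged sketch of a validation step that compares the tables referenced in a SQL transformation with the inputs and outputs declared for the job. The regex-based extraction and function names are assumptions; a real implementation would rely on the engine's own parser.

```python
# A hedged sketch of a CI-style check in the "source-as-code" spirit: extract the
# tables a SQL transformation reads and writes, then compare them with the
# inputs/outputs declared alongside the job. The regexes are deliberately
# simplistic; a production check would use a real SQL parser.
import re

READ_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
WRITE_RE = re.compile(r"\bINSERT\s+(?:INTO|OVERWRITE)\s+(?:TABLE\s+)?([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> tuple[set, set]:
    """Return (reads, writes): the table names found in a SQL string."""
    writes = set(WRITE_RE.findall(sql))
    reads = set(READ_RE.findall(sql)) - writes
    return reads, writes


def validate(sql: str, declared_inputs: set, declared_outputs: set) -> list:
    """Return a list of discrepancies; a non-empty list would fail the CI step."""
    reads, writes = extract_lineage(sql)
    issues = []
    if reads - declared_inputs:
        issues.append(f"undeclared inputs: {sorted(reads - declared_inputs)}")
    if writes - declared_outputs:
        issues.append(f"undeclared outputs: {sorted(writes - declared_outputs)}")
    return issues


sql = "INSERT OVERWRITE TABLE db.clicks_hourly SELECT * FROM db.raw_clicks"
print(validate(sql, {"db.raw_clicks"}, {"db.clicks_hourly"}))  # -> []
```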

The most complex part is tracking the transformations that take place in the cluster. Several layers of data lineage are used for this purpose. On SQL engines like Hive and Presto/Trino, the parser makes it possible to expose this information: Criteo has configured hooks that store the query execution context in a Kafka topic, which then feeds the global pipeline. Kafka is also used for the transformations orchestrated by BigDataflow. For everything else, Criteo relies on Garmadon, an event collection service that tracks interactions with the underlying file system.
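
Criteo's hooks run inside the SQL engines themselves; the Python sketch below only illustrates the kind of query execution context such a hook might publish to a Kafka topic. The topic name and event fields are assumptions.

```python
# Hedged illustration of a hook-style lineage event published to Kafka.
# Topic name, field names and values are illustrative assumptions.
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_query_context(query_id: str, user: str, statement: str,
                          inputs: list, outputs: list) -> None:
    """Publish the execution context of a finished query to the lineage topic."""
    event = {
        "query_id": query_id,
        "user": user,
        "statement": statement,
        "inputs": inputs,      # tables read by the query
        "outputs": outputs,    # tables written by the query
        "timestamp": int(time.time()),
    }
    producer.send("lineage-query-events", event)  # consumed by the global pipeline

publish_query_context(
    query_id="hive_20240101_0001",
    user="analytics",
    statement="INSERT OVERWRITE TABLE db.clicks_hourly SELECT ... FROM db.raw_clicks",
    inputs=["db.raw_clicks"],
    outputs=["db.clicks_hourly"],
)
producer.flush()
```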

Enrichment And Deduplication

The data Garmadon produces only shows relationships between applications and raw file paths, so it requires semantic enrichment. Two tasks are performed to achieve this:

  • Merging the applications involved in the same logical transformation. This step is essentially based on pattern detection techniques; Garmadon also allows tags to be injected into applications in order to link executions to the logical units declared in the orchestrators.
  • Associating the raw paths with the semantics already available in the Hive metastore (see the sketch after this list)
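
Here is a hedged sketch of that second task: mapping a raw HDFS path observed in file-system events back to a Hive table using the storage locations registered in the metastore. The hard-coded location map stands in for a real metastore lookup.

```python
# Hedged sketch of path-to-semantics enrichment: resolve raw paths to Hive
# tables by longest-prefix match against table storage locations. The location
# map below is a stand-in for an actual Hive metastore lookup.
TABLE_LOCATIONS = {
    "/user/hive/warehouse/db.db/raw_clicks": "db.raw_clicks",
    "/user/hive/warehouse/db.db/clicks_hourly": "db.clicks_hourly",
}

def resolve_path(path: str) -> str | None:
    """Return the Hive table owning `path`, or None if no location matches."""
    best = None
    for location, table in TABLE_LOCATIONS.items():
        if path == location or path.startswith(location + "/"):
            if best is None or len(location) > len(best[0]):
                best = (location, table)
    return best[1] if best else None

# A raw partition path observed in file-system events maps back to its table.
print(resolve_path("/user/hive/warehouse/db.db/clicks_hourly/day=2024-01-01/part-0000"))
# -> "db.clicks_hourly"
```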

Once the traceability sources are assembled, deduplication takes place: the highest-quality source is kept, for example data coming from the hooks rather than from Garmadon. Further transformations can then be carried out to expose the data in forms better suited to specific use cases.
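A minimal sketch of that deduplication step follows, assuming a simple quality ranking of lineage sources (the ranking values are illustrative):

```python
# Hedged sketch of deduplication: when several techniques report the same
# source -> destination relationship, keep only the edge from the
# highest-quality source (e.g. engine hooks over Garmadon file-system events).
SOURCE_QUALITY = {"hook": 3, "orchestrator": 2, "garmadon": 1}

def deduplicate(edges: list[dict]) -> list[dict]:
    """Keep one edge per (src, dst) pair, preferring the best-ranked source."""
    best: dict[tuple, dict] = {}
    for edge in edges:
        key = (edge["src"], edge["dst"])
        current = best.get(key)
        if current is None or SOURCE_QUALITY[edge["source"]] > SOURCE_QUALITY[current["source"]]:
            best[key] = edge
    return list(best.values())

edges = [
    {"src": "db.raw_clicks", "dst": "db.clicks_hourly", "source": "garmadon"},
    {"src": "db.raw_clicks", "dst": "db.clicks_hourly", "source": "hook"},
]
print(deduplicate(edges))  # only the "hook" edge survives
```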

Data lineage is used, in particular, in an internal search engine to influence the ranking of results: the more transitive dependencies an asset has, the more important it is likely to be.
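
The sketch below illustrates one way such a signal could be computed, reading "transitive dependencies" as the assets reachable downstream of a given asset in the lineage graph; the graph and the boost formula are assumptions.

```python
# Hedged sketch of the ranking signal: count an asset's transitive dependents
# (everything reachable downstream in the lineage graph) and use that count
# as a boost in search ranking.
from collections import deque

# downstream[asset] = assets that directly consume it
downstream = {
    "db.raw_clicks": ["db.clicks_hourly"],
    "db.clicks_hourly": ["db.clicks_daily", "reports.dashboard"],
    "db.clicks_daily": [],
    "reports.dashboard": [],
}

def transitive_dependents(asset: str) -> int:
    """Breadth-first count of all assets that depend on `asset`, directly or not."""
    seen, queue = set(), deque(downstream.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(downstream.get(node, []))
    return len(seen)

def ranking_boost(asset: str) -> float:
    # More transitive dependents -> probably more important -> higher boost.
    return 1.0 + 0.1 * transitive_dependents(asset)

print(transitive_dependents("db.raw_clicks"), ranking_boost("db.raw_clicks"))  # 3 1.3
```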

Datadoc can also raise performance alerts based on SLO information extracted from dataset definitions, and provide users with information about the root cause.
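
The following sketch illustrates the idea: compare a dataset's landing time with its declared SLO and, when the SLO is missed, walk the lineage upstream to point at a probable root cause. The field names and values are made up for the example.

```python
# Hedged sketch of SLO-based alerting with lineage-driven root cause hints.
# All data below (graph, SLOs, landing times) is illustrative.
upstream = {  # upstream[asset] = assets it is built from
    "reports.dashboard": ["db.clicks_daily"],
    "db.clicks_daily": ["db.clicks_hourly"],
    "db.clicks_hourly": [],
}
slo_hours = {"reports.dashboard": 6, "db.clicks_daily": 4, "db.clicks_hourly": 2}
landing_hours = {"reports.dashboard": 9, "db.clicks_daily": 7, "db.clicks_hourly": 5}

def is_late(asset: str) -> bool:
    return landing_hours[asset] > slo_hours[asset]

def root_cause(asset: str) -> str:
    """Follow late upstream datasets as far as possible; the last one is the suspect."""
    for parent in upstream.get(asset, []):
        if is_late(parent):
            return root_cause(parent)
    return asset

for asset in slo_hours:
    if is_late(asset):
        print(f"SLO missed for {asset}; probable root cause: {root_cause(asset)}")
# e.g. "SLO missed for reports.dashboard; probable root cause: db.clicks_hourly"
```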
