Powerful Data Quality Framework for Big Data

Checkita is an open-source framework built in Scala that leverages Apache Spark for distributed computing, enabling comprehensive quality checks on large datasets.

Data Quality at Scale

Connect to multiple data sources, calculate metrics, perform quality checks, and distribute results through multiple channels.

Spark Powered
Metrics Library
Multiple Sources
Key Features

Everything You Need for Data Quality

Checkita formalizes and simplifies the process of connecting to data sources, calculating metrics, performing quality checks, and distributing results.

Distributed Computation

Leverages Apache Spark as the core engine for processing large datasets efficiently.

Multiple Data Sources

Support for HDFS, S3, Hive, JDBC, and Kafka, plus various file formats (text, ORC, Parquet, Avro).
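For instance, heterogeneous sources can sit side by side in a single configuration. The snippet below is an illustrative sketch in the same style as the example configuration further down this page; the specific keys (`url`, `table`, `brokers`, `topic`) are assumptions for illustration, not the framework's verified schema.

```hocon
sources {
  // A JDBC table (key names are illustrative)
  orders_db {
    type  = "jdbc"
    url   = "jdbc:postgresql://db-host:5432/shop"
    table = "orders"
  }

  // A Kafka topic (key names are illustrative)
  clicks_stream {
    type    = "kafka"
    brokers = "broker1:9092,broker2:9092"
    topic   = "clickstream"
  }

  // ORC files on HDFS
  events_orc {
    type = "orc"
    path = "hdfs:///data/events"
  }
}
```

Once defined, every source can be referenced by name from metrics and checks, regardless of where the data physically lives.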

Built-in Metrics & Checks

Extensive library of pre-built metrics and quality checks for immediate use.

SQL Query Support

Create derived "virtual sources" using SQL queries for flexible data manipulation.
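A virtual source behaves like any other source but is defined by a query over sources you have already declared. The sketch below assumes a `virtualSources` section and a `query` key; treat both as hypothetical placeholders rather than the documented schema.

```hocon
virtualSources {
  // Derived source: a SQL query over the previously defined "orders_db" source
  recent_orders {
    type  = "sql"
    query = "SELECT * FROM orders_db WHERE order_date >= '2024-01-01'"
  }
}
```

Metrics and checks can then target `recent_orders` exactly as they would a physical source.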

Results Storage & Notifications

Store results in a dedicated database with multiple notification channels.
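Conceptually, this splits into a results store and one or more notification targets. The fragment below is a sketch under that assumption; the section names (`storage`, `targets`) and keys are illustrative, not taken from the framework's reference documentation.

```hocon
storage {
  // Database where metric and check results are persisted (illustrative)
  type = "jdbc"
  url  = "jdbc:postgresql://dq-host:5432/dq_results"
}

targets {
  // Notification channels for check results (illustrative)
  email {
    recipients = ["data-team@example.com"]
  }
}
```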

Batch & Streaming Support

Process both batch and streaming data with the same framework.
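In practice this means metric and check definitions can be reused while only the source definition changes. A hypothetical streaming variant might look like the sketch below; the `streams` section and `window` key are assumptions for illustration.

```hocon
streams {
  // Streaming source: metrics are computed per processing window (illustrative)
  clicks_stream {
    type   = "kafka"
    topic  = "clickstream"
    window = "10m"   // window size for metric calculation
  }
}
```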

How It Works

Simple Configuration, Powerful Results

Checkita uses simple HOCON configuration files to define your data quality checks.

1

Connect

Define your data sources in a simple configuration file.

2

Configure

Specify metrics and quality checks to be performed.

3

Execute

Run Checkita to process data and generate quality reports.

Example Configuration

checkita {
  sources {
    // Parquet dataset to validate
    my_source {
      type = "parquet"
      path = "/data/my_dataset"
    }
  }

  metrics {
    // Total number of rows in the source
    row_count {
      source = "my_source"
      type = "RowCount"
    }

    // Number of nulls in the user_id column
    null_count {
      source = "my_source"
      type = "NullCount"
      column = "user_id"
    }
  }

  checks {
    // Fails if the source has 1000 rows or fewer
    check_row_count {
      metric = "row_count"
      type = "GreaterThan"
      value = 1000
    }
  }
}

Use Cases

Who Benefits from Checkita?

Checkita is designed for organizations that need to ensure data quality at scale.

Data Engineers

Automate data quality checks in ETL pipelines and data processing workflows.

Data Scientists

Ensure the quality of input data for machine learning models and analytics.

Data Analysts

Validate data before creating reports and dashboards for business users.

Data Governance Teams

Implement data quality standards and monitor compliance across the organization.

Enterprise Architects

Build robust data architectures with built-in quality controls.

DevOps Teams

Integrate data quality checks into CI/CD pipelines for data applications.

Ready to Ensure Data Quality at Scale?

Start using Checkita today and transform how you monitor and maintain data quality across your organization.

Open Source

Checkita is completely open source. Contribute to the project, report issues, or request features on GitHub.