
What is Pandera in Python with Examples

Pandera is an open-source application programming interface (API) in Python: a flexible and expressive API for data validation, used to build coherent and robust data pipelines.


Data wrangling is an important process in the cycle of a data science project, in which the data is validated against the requirements of the project. Data validation is the process of verifying the data gathered for analysis or predictions, and it relies on various statistical and logical techniques. Applying such techniques by hand to every kind of information gathered can be a tedious task; this is where Pandera comes in. Pandera is an open-source application programming interface (API) in Python: a flexible and expressive API for expressing assumptions about data and falsifying them at runtime, so that coherent and robust data pipelines can be built. In this article, we will discuss why data validation is needed and how to perform it with Pandera.

Imagine you are running a worldwide survey about sneakers, and after receiving the data you find that 25% of the contact numbers and email IDs are fake or incorrectly formatted, a major setback for the sales team. This is why one cannot blindly rely on the data gathered. Let's start with a glance at data validation and its importance.

Need for Data Validation 

Data validation is part of a larger workflow that involves processing raw data to produce statistical products like a model, visualization, or report. The validation of data is crucial to prevent the silent passing of insidious classes of data integrity errors, which are otherwise difficult to detect without explicit assertions at runtime. These errors result in misleading visualizations, incorrect statistical inferences, and unexpected outputs in machine learning models. 

The need for explicit data validation increases when the end products serve as a basis for business decisions, support scientific findings, or generate predictions about real people or things. Data validation can be a tedious job for large datasets, since writing bug-free validation code for every feature in the dataset is challenging. Pandera smooths out this job considerably. Let's learn to validate data with its help.

If you are looking for a complete repository of Python libraries used in data science, check it out here.

The Pandera API

Pandera is a Python-based API for data validation. The central objects in Pandera are the DataFrameSchema, Column, and Check. Using these objects together, users can construct schema contracts: logically grouped sets of validation rules that run against pandas DataFrames at runtime. The API supports the following operations:

  • Defining schemas and validating them across different dataframe types, including pandas, dask, modin, and koalas.
  • Checking the data types and properties of columns/features in a pd.DataFrame or values in a pd.Series.
  • Performing complex statistical validation such as hypothesis testing.
  • Integrating validations with existing data analysis/processing pipelines via function decorators.
  • Defining schema models with the class-based API using pydantic-style syntax, and validating dataframes with the typing syntax.
  • Synthesizing data from schema objects for property-based testing with pandas data structures.
  • Lazily validating dataframes so that all validation rules are executed before an error is raised (see the sketch after this list).
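
To illustrate the last point, here is a minimal sketch of lazy validation; the toy schema, column names, and bounds are invented for this example. With lazy=True, Pandera collects every failure into a single SchemaErrors exception instead of stopping at the first failing check.

import pandas as pd
import pandera as pa

# Toy schema: both columns carry a check that some rows will violate.
schema = pa.DataFrameSchema(
    {
        "qty": pa.Column(int, pa.Check.ge(0)),
        "label": pa.Column(str, pa.Check.isin(["a", "b"])),
    }
)
toy_df = pd.DataFrame({"qty": [1, -2, 3], "label": ["a", "x", "b"]})

try:
  schema.validate(toy_df, lazy=True)  # collect all failures before raising
except pa.errors.SchemaErrors as e:
  print(e.failure_cases)  # one row per failing value, across all checks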

A high-level architecture of Pandera can be pictured as the data flow below. In the simplest case, raw data passes through a data processor, which carries out operations to retrieve, transform, or classify information. The data is then checked by a schema validator; if the validation checks pass, the data flows through to the next stage of the analysis pipeline, otherwise an error is raised.
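
A minimal sketch of that flow, with an invented processor function and schema contract:

import pandas as pd
import pandera as pa

# Data processor: derives a new column from the raw data.
def process(raw: pd.DataFrame) -> pd.DataFrame:
    out = raw.copy()
    out["total"] = out["qty"] * out["unit_price"]
    return out

# Schema validator: the contract the processed data must satisfy.
contract = pa.DataFrameSchema({"total": pa.Column(float, pa.Check.ge(0))})

raw = pd.DataFrame({"qty": [1, 2], "unit_price": [9.99, 4.50]})
validated = contract.validate(process(raw))  # raises SchemaError if violated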

Hands-on implementations with Pandera

Let’s start implementing data validation with Pandera API on a dataframe.

Installing Pandera:

!pip install pandera

Creating a dataframe:

import pandas as pd

df = pd.DataFrame({
    "product_id": [114, 256, 385, 407, 150, 285, 364, 319],
    "brand": ["Nike", "Puma", "Adidas", "Converse", "Nike", "Converse", "Nike", "Adidas"],
    "date_of_release": pd.to_datetime(["1982", "2021", "2017-08-17", "2020", "2020-06-09", "2015-07-28", "2022-03-03", "2022-03-05"]),
    "product_name": ["Airforce 1 Low", "Ralph Sampson 70", "Yeezy 700", "Chuck 70", "Nike ISPA Flow", "Chuck Taylor All Star", "Supreme x Nike SB Dunk High", "adidas Yeezy Boost 700 V2"],
    "price": [10000, 8387.17, 18299.28, 9149.64, 15249.40, 4574.82, 38123.50, 22874.10]
})
df

This is a dataset of sneakers with the product name, product ID, the brand that manufactures them, and their prices in INR.

Creating multiple validations for the DataFrame and its columns using the Check class:

import pandera as pa
from pandera import Column, Check
from pandera.errors import SchemaError

# Reference lists assumed for this example; Puma and Adidas are deliberately
# left out of brand_name so that the brand check fails below.
brand_name = ["Nike", "Converse"]
product_name = df["product_name"].tolist()

schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(brand_name)),
        "product_name": Column(str, Check.isin(product_name)),
        "price": Column(float, Check.greater_than(10000.00)),
    }
)
try:
  schema(df)
except SchemaError as e:
  print("Failed check:", e.check)
  print("\n dataframe:\n", e.data)
  print("\nFailure cases:\n", e.failure_cases)

Here,

  • pa.DataFrameSchema() is used to apply multiple validations to the entire DataFrame,
  • "column_name": Column(dtype, Check.isin(list)) applies a validation to a particular column, and
  • schema(df), which is equivalent to schema.validate(df), runs the schema against the dataset.

The check fails because Puma and Adidas are not included in the list used to validate the data. This is a simple validation, but when a more complicated validation is needed, this format can get messy. For those situations, Pandera provides a schema model, in which the schema is defined as a class that can be reused at multiple levels. Let's see how a schema model is created.

from pandera.typing import Series

class Schema(pa.SchemaModel):
    brand: Series[str] = pa.Field(isin=brand_name)
    product_name: Series[str] = pa.Field(isin=product_name)
    price: Series[float] = pa.Field(gt=10000.00)

    @pa.check("price")
    def price_sum_lt_200000(cls, price: Series[float]) -> bool:
        # Custom aggregate check: the total of all prices must stay below the cap
        # (the 200000.00 cap is an assumed value for illustration).
        return price.sum() < 200000.00

try:
  Schema.validate(df)
except SchemaError as e:
  print("Failed check:", e.check)
  print("\n dataframe:\n", e.data)
  print("\nFailure cases:\n", e.failure_cases)

The output is the same: the schema model expresses the same column checks as the DataFrameSchema above, and the brand check still fails first. So far we have created checks for individual columns and for the dataframe as a whole. Let's see how to use this to validate inputs.

Validating Input: 

Applying a schema to the raw input of a function ensures that the data fulfils all the conditions required by the user before the function body runs.

from pandera import check_input
 
name=['Nike', 'Puma', 'Adidas', 'Converse']
input_schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(name)),
        "price": Column(float, Check.greater_than(4000.00)),
    }
)
 
@check_input(input_schema)
def get_total_price(df: pd.DataFrame):
    return df[['price']]
 
get_total_price(df)

Here, a new schema is created that checks the brand name against the allowed list and requires the price to be greater than 4000.00.

All the rows are returned because every price is greater than 4000.00 and every brand matches the schema.
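
To see the decorator reject bad input, here is a short sketch with an invented bad_df: a brand outside the allowed list and a price below the 4000.00 threshold make check_input raise a SchemaError before get_total_price runs.

# Hypothetical bad input: "Reebok" is not in the allowed brand list and
# 3999.00 is below the price threshold, so validation fails at the function boundary.
bad_df = pd.DataFrame({"brand": ["Reebok"], "price": [3999.00]})

try:
  get_total_price(bad_df)
except SchemaError as e:
  print("Failed check:", e.check)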

Validating output:

Validating the data returned by a function, by applying a schema to its output.

from pandera import check_output
 
name=['Nike', 'Puma', 'Adidas', 'Converse']
output_schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(name)),
        "price": Column(float, Check.greater_than(5000.00)),
    }
)
 
@check_output(output_schema)
def get_total_price(df: pd.DataFrame):
    return df[['brand','product_name','price']]
 
try:
  get_total_price(df)
except SchemaError as e:
  print("Failed check:", e.check)
  print("\n dataframe:\n", e.data)
  print("\nFailure cases:\n", e.failure_cases)

Here the schema is applied to the output of the get_total_price function, validating that every price is greater than 5000.00.

There is a failure case: the price value 4574.82 is lower than 5000.00.

Check Both Inputs and Outputs:

Both the input and the output can be checked at once using the check_io decorator.

from pandera import check_io
input_schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(name)),
        "price": Column(float, Check.greater_than(4000.00)),
    }
)
output_schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(name)),
        "price": Column(float, Check.greater_than(5000.00)),
    }
)
 
@check_io(df=input_schema, out=output_schema)
def get_total_price(df: pd.DataFrame):
    return df[['brand','product_name','price']]
 
try:
  get_total_price(df)
except SchemaError as e:
  print("Failed check:", e.check)
  print("\n dataframe:\n", e.data)
  print("\nFailure cases:\n", e.failure_cases)

Using the check_io decorator, both the input and the output schemas are assigned for validation.

As expected, the failure case is shown again, because one price value is less than 5000.00.

By default, Pandera raises an error if there are missing values in the dataframe. If null values need to be accepted, use the nullable parameter of the Column.

input_schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(name), nullable=True),
    }
)
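
A quick sketch of the effect, using an invented one-column frame: with nullable=True the missing brand passes the schema, whereas the default nullable=False would reject it.

# Hypothetical frame with a missing brand; it passes only because nullable=True.
null_df = pd.DataFrame({"brand": ["Nike", None]})
input_schema.validate(null_df)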

Similarly, if duplicate values in a column need to be rejected, use the allow_duplicates attribute:

input_schema = pa.DataFrameSchema(
    {
        "brand": Column(str, Check.isin(name), allow_duplicates=False),
    }
)
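
Note that on the sneakers dataset this check would actually fail, since brands such as Nike appear more than once. A small sketch with an invented two-row frame shows duplicates being flagged:

# Hypothetical frame with a repeated brand; allow_duplicates=False flags the duplicate.
dup_df = pd.DataFrame({"brand": ["Nike", "Nike"]})

try:
  input_schema.validate(dup_df)
except SchemaError as e:
  print("Failure cases:\n", e.failure_cases)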

If the datatype of a feature needs to be converted during validation, use the coerce parameter:

schema = pa.DataFrameSchema({"price": Column(int, coerce=True)})
validated = schema.validate(df)
validated.dtypes

The data type of the price column changed from float64 to int64.
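
Returning to the survey example from the introduction, a regular-expression check is one way to catch badly formatted contact details. A minimal sketch, assuming a deliberately simple email pattern and an invented contacts frame:

contacts_schema = pa.DataFrameSchema(
    {
        # The regex is a simplified email pattern, for illustration only.
        "email": Column(str, Check.str_matches(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    }
)
contacts = pd.DataFrame({"email": ["a@example.com", "not-an-email"]})

try:
  contacts_schema.validate(contacts)
except SchemaError as e:
  print("Failure cases:\n", e.failure_cases)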

Final words

The pandera package is a way of expressing assumptions about data and falsifying those assumptions at run time. It takes much of the tedium out of the data wrangling process, so that downstream statistical artefacts can be built on data one can actually rely on.
