Pydata-visualizer User Guide

This guide provides a step-by-step introduction to using the Pydata-visualizer library for data analysis and profiling.

Table of Contents

  1. Installation

  2. Basic Usage

  3. Understanding the Report

  4. Advanced Configuration

  5. Working with Large Datasets

  6. Common Use Cases

  7. Customization Options

1. Installation

Using pip

pip install pydata-visualizer

Verifying Installation

You can verify the installation was successful by importing the library:

from data_visualizer.profiler import AnalysisReport
print("Pydata-visualizer is installed successfully!")

2. Basic Usage

Step 1: Import the library

import pandas as pd
from data_visualizer.profiler import AnalysisReport

Step 2: Load your data

# Load data from a CSV file
df = pd.read_csv("your_dataset.csv")

# Or from any other source supported by pandas
# df = pd.read_excel("your_data.xlsx")
# df = pd.read_sql_query("SELECT * FROM your_table", connection)

Step 3: Create an analysis report

# Initialize the report
report = AnalysisReport(df)

Step 4: Generate the HTML report

# Create HTML report
report.to_html("output_report.html")

Step 5: Open the HTML report

Open the generated HTML file in your web browser to view the complete data profile.

3. Understanding the Report

The HTML report contains several sections:

Overview Section

  • Dataset Overview: Shows basic dataset information

    • Number of rows

    • Number of columns

    • Number of duplicate rows

    • Missing values count and percentage

Variables Section

For each column in your dataset:

  • Data type: Detected data type

  • Missing values: Count and percentage

  • Type-specific statistics:

    • For numeric: min, max, mean, median, std, quartiles, skewness, kurtosis, outlier detection

    • For categorical: unique values, most frequent values, cardinality, top N value counts

    • For string/text: unique values, cardinality, top N value counts, word frequency analysis

    • For boolean: value counts and proportions

  • Visualizations:

    • Distribution plots for numeric data with outlier highlighting

    • Bar charts for categorical data

    • Word clouds for text data (when text_analysis is enabled)

  • Alerts: Warnings about potential data issues (missing values, outliers, skewness)

Sample Data

  • Shows the first and last 10 rows of your dataset

Correlations

  • Pearson correlation: For linear relationships between numerical variables

  • Spearman correlation: For monotonic relationships

  • Cramér’s V: For relationships between categorical variables

  • Heatmaps: Visual representation of all correlation matrices

4. Advanced Configuration

You can customize the analysis with the Settings class:

from data_visualizer.profiler import AnalysisReport, Settings

# Create custom settings
settings = Settings(
    minimal=False,              # Full analysis with all features
    top_n_values=5,             # Show top 5 values in categorical columns
    skewness_threshold=2.0,     # Alert threshold for skewness
    outlier_method='iqr',       # Outlier detection method: 'iqr' or 'zscore'
    outlier_threshold=1.5,      # IQR multiplier for outlier detection
    duplicate_threshold=5.0,    # Alert if duplicates exceed 5% of dataset
    text_analysis=True          # Enable word frequency and word cloud for text
)

# Apply settings to report
report = AnalysisReport(df, settings=settings)

# Generate the report
report.to_html("custom_report.html")

Settings Options

  • minimal (bool): If True, performs minimal analysis (faster, skips visualizations and type-specific analysis)

  • top_n_values (int): Number of top values to show for categorical variables (default: 10)

  • skewness_threshold (float): Threshold for flagging skewed distributions (default: 1.0)

  • outlier_method (str): Method for outlier detection - ‘iqr’ (Interquartile Range) or ‘zscore’ (default: ‘iqr’)

  • outlier_threshold (float): IQR multiplier for outlier detection (default: 1.5, use 3.0 for extreme outliers only)

  • duplicate_threshold (float): Percentage of duplicate rows to trigger an alert (default: 5.0)

  • text_analysis (bool): Enable word frequency analysis and word cloud generation for text columns (default: True)

5. Working with Large Datasets

For large datasets, consider these approaches:

Use minimal analysis

settings = Settings(minimal=True)
report = AnalysisReport(large_df, settings=settings)

Sample your data

# Sample 10,000 rows randomly
sampled_df = large_df.sample(10000, random_state=42)
report = AnalysisReport(sampled_df)

Analyze specific columns only

# Select only specific columns for analysis
subset_df = large_df[['important_column_1', 'important_column_2', 'important_column_3']]
report = AnalysisReport(subset_df)

6. Common Use Cases

Exploratory Data Analysis (EDA)

import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("new_dataset.csv")

# Generate comprehensive EDA report
report = AnalysisReport(df)
report.to_html("eda_report.html")

Data Quality Assessment

import pandas as pd
from data_visualizer.profiler import AnalysisReport, Settings

# Load your dataset
df = pd.read_csv("dataset_to_check.csv")

# Set stricter thresholds for data quality
settings = Settings(skewness_threshold=1.5)

# Generate data quality report
report = AnalysisReport(df, settings=settings)
report.to_html("quality_report.html")

Correlation Discovery

import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("features.csv")

# Generate report with focus on correlations
report = AnalysisReport(df)
results = report.analyse()

# Access correlation matrices programmatically
pearson_corr = results['Correlations_JSON']['pearson']
spearman_corr = results['Correlations_JSON']['spearman']

# Find strongly correlated features (absolute correlation > 0.7)
import numpy as np
strong_correlations = [(col1, col2) for col1 in pearson_corr 
                       for col2 in pearson_corr if col1 != col2 
                       and abs(pearson_corr[col1][col2]) > 0.7]

print("Strongly correlated features:", strong_correlations)

# Generate complete report
report.to_html("correlations_report.html")

7. Customization Options

Accessing Analysis Results Programmatically

You can access and manipulate the analysis results directly:

import pandas as pd
from data_visualizer.profiler import AnalysisReport

# Load your dataset
df = pd.read_csv("your_data.csv")

# Run analysis
report = AnalysisReport(df)
results = report.analyse()

# Access specific components
overview = results['overview']
column_stats = results['variables']

# Print summary information
print(f"Dataset has {overview['num_Row']} rows and {overview['num_Columns']} columns")
print(f"Missing values: {overview['missing_percentage']:.2f}%")

# Check specific column statistics
if 'age' in column_stats:
    age_stats = column_stats['age']
    print(f"Age statistics: Min={age_stats.get('min')}, Max={age_stats.get('max')}, Mean={age_stats.get('mean')}")

Customizing Report Output

You can generate the HTML report to a specific location:

# Generate report to specific path
report.to_html("/path/to/reports/analysis_report.html")

Integrating with Other Tools

import pandas as pd
from data_visualizer.profiler import AnalysisReport
import webbrowser
import os

# Load your dataset
df = pd.read_csv("your_data.csv")

# Create and generate report
report_path = "analysis_report.html"
report = AnalysisReport(df)
report.to_html(report_path)

# Automatically open the report in the default browser
webbrowser.open('file://' + os.path.abspath(report_path))

Conclusion

This guide covered the essential aspects of using Pydata-visualizer for data analysis and profiling. For more detailed information, refer to the full documentation or explore the source code.