Python Dataclasses: Classes with Data in Python

July 14, 2023

Python Dataclasses: What are they?

Python dataclasses are, in short, classes designed primarily to hold data. But what does that actually mean in practice? What's the point? What are they used for?

Let's look at an initial scenario. Imagine you've been assigned the task of writing an inventory management system to better track your company's physical assets. The first thing you do is model what an asset actually is. This could look something like:

class Asset:
    def __init__(self, product_id, name, category, stock_quantity, weight=None):
        self.product_id = product_id
        self.name = name
        self.category = category
        self.stock_quantity = stock_quantity
        self.weight = weight
    
    def __eq__(self, other):
        if not isinstance(other, Asset):
            return False
        return (self.product_id == other.product_id and
                self.name == other.name and
                self.category == other.category and
                self.stock_quantity == other.stock_quantity and
                self.weight == other.weight)
    
    def __repr__(self):
        return (f"Asset(product_id={self.product_id!r}, name={self.name!r}, "
                f"category={self.category!r}, stock_quantity={self.stock_quantity!r}, "
                f"weight={self.weight!r})")

You've put together a basic model of what a company asset is (probably given some specifications) and implemented two basic dunder methods, one for checking equality and the other for the string representation of an Asset.

This would work fine and is a great first step in the right direction. However, let's look at another way to achieve the exact same result, but with half the lines of code:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Asset:
    product_id: str  # or appropriate type
    name: str
    category: str
    stock_quantity: int
    weight: Optional[float] = None

Why Dataclasses Exist

The example above shows the core benefit of dataclasses - they eliminate boilerplate code. Introduced in Python 3.7, dataclasses were created to solve a common frustration: writing the same repetitive code for classes that primarily store data.

With that single @dataclass decorator, Python automatically generates:

  • An __init__ method that initializes all the fields
  • An __eq__ method that compares all fields
  • A __repr__ method that displays all fields
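
To see those generated methods in action, here is a quick sketch using the dataclass version of Asset from above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Asset:
    product_id: str
    name: str
    category: str
    stock_quantity: int
    weight: Optional[float] = None

# __init__ is generated for us
a = Asset("A-123", "MacBook Pro", "electronics", 10, 2.0)
b = Asset("A-123", "MacBook Pro", "electronics", 10, 2.0)

print(a == b)   # True: __eq__ compares field by field
print(a is b)   # False: they are still two distinct objects
print(a)        # __repr__ lists every field and its value
```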

And that's just the beginning. Before we dig deeper, let's clarify when you should actually use dataclasses.

When to Use Dataclasses

1. Structured Data Containers

Our Asset example is perfect because it's essentially a container for related data. When you need to group data with a clear structure, dataclasses provide a clean solution.

Consider an inventory report that needs to consolidate multiple pieces of information. Without dataclasses, you might be tempted to use dictionaries, which can become unwieldy and error-prone as they lack explicit structure and type information. With a dataclass, the structure becomes self-documenting:

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class InventoryReport:
    generated_at: datetime
    total_products: int
    low_stock_items: List[Asset]
    out_of_stock_items: List[Asset]
    highest_value_item: Asset
    total_inventory_value: float

This approach has several advantages over alternatives. The types in the annotations immediately tell you what kind of data to expect. Anyone reading the code can quickly understand what makes up an inventory report without digging through documentation. Additionally, your IDE can provide autocomplete suggestions and type checking tools like mypy can catch errors before they occur in production.

2. Configuration Objects

Configuration settings pose a particular challenge in many applications. They typically have sensible defaults that can be overridden, need to be passed between components, and should be easy to update and maintain. Traditionally, you might use dictionaries or a custom class, but both approaches have drawbacks.

Dictionaries lack type safety and don't communicate intent clearly. Custom classes require boilerplate code. Dataclasses, however, hit the sweet spot:

@dataclass
class InventorySettings:
    low_stock_threshold: int = 5
    critical_stock_threshold: int = 2
    auto_reorder: bool = False
    reorder_approval_required: bool = True
    default_supplier_id: str = "SUP-001"
    order_tax_rate: float = 0.07
    include_shipping_estimate: bool = True

With this approach, the default values are front and center, making it clear what happens if you don't specify a particular setting. The type annotations add clear documentation about what values are acceptable. When you need to create slightly different configurations, you can do so concisely:

# Default settings
default_settings = InventorySettings()

# Warehouse settings with automatic reordering
warehouse_settings = InventorySettings(
    low_stock_threshold=10,
    auto_reorder=True
)

# Retail store settings with lower thresholds
retail_settings = InventorySettings(
    low_stock_threshold=3,
    critical_stock_threshold=1
)

And if your configuration needs validation logic (for example, ensuring thresholds are positive numbers), you can add a __post_init__ method that enforces these rules.
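
A sketch of what that validation might look like, using a trimmed-down version of the settings class (the specific rules here are illustrative assumptions, not requirements from the system):

```python
from dataclasses import dataclass

@dataclass
class InventorySettings:
    low_stock_threshold: int = 5
    critical_stock_threshold: int = 2
    auto_reorder: bool = False

    def __post_init__(self):
        # Illustrative rules: thresholds must be positive and consistent
        if self.low_stock_threshold <= 0 or self.critical_stock_threshold <= 0:
            raise ValueError("Thresholds must be positive")
        if self.critical_stock_threshold > self.low_stock_threshold:
            raise ValueError("critical_stock_threshold cannot exceed low_stock_threshold")
```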

3. Data Transfer Objects (DTOs)

As applications grow, you often need to pass structured data between different layers or components of your system. Dataclasses excel at creating these Data Transfer Objects (DTOs) that package related information together.

In our inventory system, when transferring assets between locations, we need to track several pieces of information. Using dictionaries for this purpose can lead to inconsistencies and bugs when a developer forgets to include a required field or uses different key names. A dataclass solves these problems:

@dataclass
class AssetTransferRequest:
    asset_id: str
    source_location: str
    destination_location: str
    quantity: int
    transfer_date: datetime
    requested_by: str
    approval_status: str = "pending"
    
    @property
    def is_approved(self):
        return self.approval_status.lower() == "approved"

The benefits of this approach are numerous:

  1. Self-documentation: The class definition clearly indicates what information is required for an asset transfer.
  2. Type safety: The annotations help catch errors when incorrect types are provided.
  3. Default values: For optional fields like approval_status, we can provide sensible defaults.
  4. Behavior: We can add methods and properties like is_approved that encapsulate business logic.
  5. Validation: We could add a __post_init__ method to verify the transfer makes sense (e.g., source and destination aren't the same).

When these objects are passed between components, the receiving code knows exactly what to expect. This structure becomes especially valuable when your system grows or when you're working with a team where different developers handle different layers of the application.
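
Using the request object is then straightforward. A quick sketch (the field values here are made up for illustration):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AssetTransferRequest:  # as defined above
    asset_id: str
    source_location: str
    destination_location: str
    quantity: int
    transfer_date: datetime
    requested_by: str
    approval_status: str = "pending"

    @property
    def is_approved(self):
        return self.approval_status.lower() == "approved"

request = AssetTransferRequest(
    asset_id="A-123",
    source_location="WH-1",
    destination_location="STORE-4",
    quantity=3,
    transfer_date=datetime(2023, 7, 1),
    requested_by="alice",
)
print(request.is_approved)   # False: approval_status defaulted to "pending"

request.approval_status = "Approved"
print(request.is_approved)   # True: the property check is case-insensitive
```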

4. Immutable Records

Immutability—the inability to modify an object after creation—is a powerful concept in programming that can prevent entire categories of bugs. For certain types of data, especially records of events that have already occurred, immutability makes logical sense. Once a purchase has happened, the details of that purchase shouldn't change.

Dataclasses make creating immutable objects straightforward with the frozen=True parameter:

@dataclass(frozen=True)
class AssetPurchaseRecord:
    purchase_id: str
    asset_id: str
    purchase_date: datetime
    quantity: int
    price_per_unit: float
    supplier_id: str
    
    @property
    def total_cost(self):
        return self.quantity * self.price_per_unit

When you make a dataclass frozen:

  1. Immutability is enforced: Any attempt to modify an attribute after creation will raise a FrozenInstanceError. This prevents accidental modifications that could corrupt your data.

  2. Thread safety: Since the object can't change, it's safe to share between threads without locks or other synchronization mechanisms.

  3. Hashability: Frozen dataclasses are hashable by default (as long as all their components are hashable), meaning they can be used as dictionary keys or in sets.

  4. Design clarity: Using a frozen dataclass signals to other developers that this object represents a fixed record that shouldn't be altered.

In our inventory system, purchase records, shipping records, and audit logs are perfect candidates for frozen dataclasses. Even if there's an error in the record, the correct approach would be to create a new correcting record rather than modifying the original, maintaining an accurate audit trail.
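
A quick sketch of points 1 and 3, the FrozenInstanceError and hashability, using a trimmed-down record:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class AssetPurchaseRecord:
    purchase_id: str
    asset_id: str
    quantity: int
    price_per_unit: float

record = AssetPurchaseRecord("P-001", "A-123", 5, 999.0)

# Point 1: any mutation attempt raises FrozenInstanceError
try:
    record.quantity = 10
except FrozenInstanceError:
    print("cannot modify a frozen record")

# Point 3: frozen instances are hashable, so they can serve as dict keys
audit_log = {record: "received 2023-07-01"}
print(record in audit_log)   # True
```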

When NOT to Use Dataclasses

While dataclasses solve many problems elegantly, they're not right for every situation. Let's look at when you should avoid them:

1. Classes with Complex Behavior

While dataclasses excel at storing data with minimal behavior, they're not designed for classes where behavior is the primary focus. If your class has more methods than attributes or contains complex algorithms, a dataclass may obscure your intent rather than clarify it.

Consider the core manager class for our inventory system:

class InventoryManager:
    def __init__(self, database_connection, settings):
        self.db = database_connection
        self.settings = settings
        self._cached_items = None
        
    def add_inventory(self, asset, quantity, location):
        # Complex logic here
        pass
        
    def remove_inventory(self, asset, quantity, location):
        # More complex logic
        pass
        
    def transfer_inventory(self, asset, quantity, source, destination):
        # Even more complex logic
        pass
        
    def generate_reorder_list(self):
        # Algorithm for determining what to reorder
        pass

This class has several characteristics that make it inappropriate for a dataclass:

  1. Emphasis on behavior: The primary purpose is to provide methods that manipulate inventory, not to store inventory data itself.

  2. Few data attributes: It has only a few attributes (db, settings, and a private cache) compared to its many methods.

  3. Complex initialization logic: The __init__ method might need to do more than just assign parameters to attributes, such as setting up connections or initializing resources.

  4. Encapsulated state: The _cached_items attribute is meant to be private and managed by the class's methods, not directly accessed.

Using a dataclass here would send the wrong signal to other developers. A dataclass communicates "this is primarily about storing data," but this class is primarily about doing things with data stored elsewhere. Additionally, the auto-generated __eq__ and __repr__ methods from a dataclass would likely be inappropriate for this kind of service class.

2. Performance-Critical Code

Dataclasses introduce a small amount of overhead compared to bare-bones classes. For most applications, this difference is negligible, but in performance-critical code paths where you're creating millions of objects, it can become significant.

Consider a scenario where your inventory system needs to process a massive daily import of sales data, creating objects for each transaction:

# Performance-sensitive code handling bulk inventory imports
class MinimalAsset:
    __slots__ = ('id', 'count')
    
    def __init__(self, id, count):
        self.id = id
        self.count = count

# Instead of:
@dataclass
class AssetData:
    id: str
    count: int

The performance differences stem from several factors:

  1. Dictionary vs. slots: By default, dataclass instances (like ordinary class instances) store attributes in a per-instance dictionary, which is flexible but uses more memory than __slots__ storage. As shown in the example, combining __slots__ with a minimal class can significantly reduce memory usage per instance.

  2. Generated methods: Dataclasses generate __eq__, __repr__, and related methods on the class. These cost nothing per instance, but field-by-field equality checks and formatted representations are slower than the default identity-based versions if they land on a hot path.

  3. Initialization: The generated __init__ assigns every field unconditionally, whereas a hand-tuned class can skip or lazily compute attributes it rarely needs.

Exact numbers depend heavily on the Python version and the workload, so treat any micro-benchmark as a rough guide: instantiating a dataclass is usually comparable to a handwritten class, while any class without __slots__ (dataclass or not) uses noticeably more memory per instance than tuples, named tuples, or __slots__-based classes. Profile your own hot path before optimizing.

For most of your code, these differences won't matter. But for that critical path processing millions of records, using a more minimal approach can be worth the trade-off in readability.

3. Dynamic Attributes

Dataclasses are designed with the assumption that you know the structure of your data in advance. They shine when you have a fixed set of fields that are known at development time. However, sometimes you need to handle data with dynamic attributes that aren't known until runtime.

Consider a scenario where you're allowing users to define custom properties for assets in your inventory system:

class DynamicAssetProperties:
    def __init__(self, asset_id):
        self.asset_id = asset_id
        
    def add_property(self, name, value):
        setattr(self, name, value)
        
    def get_all_properties(self):
        return {k: v for k, v in self.__dict__.items() if k != 'asset_id'}

This class allows adding arbitrary attributes at runtime, which wouldn't work well with a dataclass for several reasons:

  1. Conceptual mismatch: Dataclasses are meant to represent a known structure. Adding random attributes contradicts this purpose.

  2. Missing features: Dynamically added attributes won't be included in the generated __repr__ or __eq__ methods unless you customize them.

  3. Type annotation issues: There's no way to type-annotate fields that don't exist when you write the class.

  4. Frozen clash: If you use frozen=True, you can't add attributes after initialization.

  5. Documentation issues: The class definition no longer documents all possible attributes, making the code harder to understand.

While you technically can add attributes to dataclass instances (unless they're frozen), doing so defeats many of the benefits of using dataclasses in the first place. For truly dynamic attribute sets, a regular class or dictionary makes more sense.

4. Complex Inheritance

Dataclasses do support inheritance, and simple inheritance hierarchies work fine. However, as your class hierarchy becomes deeper or more complex, the interactions between parent and child dataclass fields can become confusing and error-prone.

Consider an attempt to model different types of inventory items using inheritance:

@dataclass
class Item:
    id: str
    name: str
    category: str

@dataclass
class Asset(Item):
    stock_quantity: int
    weight: Optional[float] = None

@dataclass
class DigitalAsset(Asset):
    file_size: float
    download_url: str
    # This definition actually raises a TypeError: the non-default fields
    # file_size and download_url would follow the inherited default field
    # weight in the generated __init__. It gets confusing quickly.

This seemingly straightforward hierarchy introduces several subtle issues:

  1. Field ordering: Fields from parent classes come before fields from child classes in the generated __init__ method. This means the parameter order might not match what you'd expect if you're thinking about the child class in isolation.

  2. Default value complications: If a parent class field has a default value but a child class field doesn't, the generated __init__ would place a required parameter after an optional one, which Python rejects with a TypeError at class-definition time (exactly what happens with DigitalAsset above).

  3. Redefinition confusion: If a child class redefines a field from a parent class (to change its type or default), the behavior gets complex and can be surprising.

  4. Field options inheritance: Options specified with field() in the parent class might not work as expected in derived classes, especially for options like default_factory.

  5. InitVar fields: Fields marked with InitVar (used only in initialization) have special inheritance behavior that can be confusing.

For simple one-level inheritance, these issues are manageable. But as your hierarchy grows, the complexity increases exponentially. In such cases, consider composition over inheritance, or use regular classes with more explicit control over how fields and methods are defined and inherited.
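
As an illustration of the composition alternative, the digital-asset details could live in their own dataclass and be attached as a field. This is a sketch of one possible slicing, not the only design:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    id: str
    name: str
    category: str

@dataclass
class DigitalDetails:
    file_size: float
    download_url: str

@dataclass
class DigitalAsset:
    item: Item                               # has-an Item, rather than is-an Item
    stock_quantity: int
    details: Optional[DigitalDetails] = None

ebook = DigitalAsset(
    item=Item("D-1", "Python Guide", "ebooks"),
    stock_quantity=100,
    details=DigitalDetails(file_size=4.2, download_url="https://example.com/guide"),
)
print(ebook.item.name)   # Python Guide
```

Because no class inherits fields from another, there is no cross-class interaction between defaults and required parameters.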

Advanced Dataclass Features

Let's explore some powerful features that make dataclasses even more useful in our inventory system.

Post-Initialization Processing

Need to validate data or calculate derived values? Use __post_init__:

@dataclass
class Asset:
    product_id: str
    name: str
    category: str
    stock_quantity: int
    weight: Optional[float] = None
    is_low_stock: Optional[bool] = None
    
    def __post_init__(self):
        # Enforce business rules
        if self.stock_quantity < 0:
            raise ValueError("Stock quantity cannot be negative")
        
        # Calculate derived field
        if self.is_low_stock is None:
            self.is_low_stock = self.stock_quantity <= 5
            
        # Normalize data
        self.category = self.category.lower()

The __post_init__ method runs after the auto-generated __init__ completes, giving you a chance to apply business rules.
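
A quick check of these rules, using a trimmed-down version of the class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Asset:
    category: str
    stock_quantity: int
    is_low_stock: Optional[bool] = None

    def __post_init__(self):
        if self.stock_quantity < 0:
            raise ValueError("Stock quantity cannot be negative")
        if self.is_low_stock is None:
            self.is_low_stock = self.stock_quantity <= 5
        self.category = self.category.lower()

mouse = Asset("Electronics", 3)
print(mouse.category)       # electronics (normalized to lowercase)
print(mouse.is_low_stock)   # True (3 <= 5)

try:
    Asset("electronics", -1)
except ValueError as exc:
    print(exc)              # Stock quantity cannot be negative
```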

Field Customization

The field() function gives you fine-grained control:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Asset:
    product_id: str
    name: str
    category: str
    stock_quantity: int
    weight: Optional[float] = None
    secret_notes: str = field(default="", repr=False)  # Hide in string representation
    last_updated: datetime = field(default_factory=datetime.now)  # Dynamic default
    created_by: str = field(default="unknown", compare=False)  # Ignore in equality comparisons
    metadata: dict = field(default_factory=dict)  # Default empty dict

The options available in field() let you:

  • Control which fields appear in string representations
  • Set defaults that need to be calculated at runtime
  • Determine which fields participate in equality comparisons
  • Add metadata for documentation or validation

Easy Serialization

Need to convert your dataclass to a dictionary or JSON? Built-in functions make it simple:

from dataclasses import asdict, astuple

# Create an asset instance
laptop = Asset(
    product_id="A-123",
    name="MacBook Pro",
    category="electronics",
    stock_quantity=10,
    weight=2.0
)

# Convert to dictionary
laptop_dict = asdict(laptop)
# {'product_id': 'A-123', 'name': 'MacBook Pro', ...}

# Convert to tuple
laptop_tuple = astuple(laptop)
# ('A-123', 'MacBook Pro', 'electronics', 10, 2.0)

# JSON serialization
import json
json_data = json.dumps(asdict(laptop))

This makes dataclasses perfect for data that needs to be serialized for APIs, files, or databases.

Customizing Comparison Behavior

Control how instances are compared and sorted:

@dataclass(order=True)
class Asset:
    # Fields that determine sort order (compared in definition order)
    category: str = field(compare=True)
    name: str = field(compare=True)
    
    # Fields that don't affect sorting
    product_id: str = field(compare=False)
    stock_quantity: int = field(compare=False)
    weight: Optional[float] = field(compare=False, default=None)

The order=True parameter generates the comparison methods __lt__, __le__, __gt__, and __ge__, which compare instances as tuples of their compare-enabled fields in definition order. By controlling which fields participate in comparisons, you determine how assets are ordered; there is no need to build a sort key by hand.
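
Sorting then works out of the box. A trimmed sketch:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Asset:
    category: str
    name: str
    product_id: str = field(compare=False, default="")

assets = [
    Asset("furniture", "Desk", "F-2"),
    Asset("electronics", "Monitor", "E-7"),
    Asset("electronics", "Keyboard", "E-3"),
]

# Sorted by category first, then name; product_id is ignored
for a in sorted(assets):
    print(a.category, a.name)
# electronics Keyboard
# electronics Monitor
# furniture Desk
```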

Type Validation

While annotations are just hints by default, you can enforce them:

@dataclass
class Asset:
    product_id: str
    name: str
    category: str
    stock_quantity: int
    weight: Optional[float] = None
    
    def __post_init__(self):
        type_checks = {
            'product_id': str,
            'name': str,
            'category': str,
            'stock_quantity': int
        }
        
        for field_name, expected_type in type_checks.items():
            value = getattr(self, field_name)
            if not isinstance(value, expected_type):
                actual_type = type(value).__name__
                raise TypeError(f"{field_name} must be {expected_type.__name__}, got {actual_type}")
                
        if self.weight is not None and not isinstance(self.weight, (int, float)):
            raise TypeError(f"weight must be a number or None, got {type(self.weight).__name__}")

For more comprehensive validation, you might consider libraries like Pydantic that build on top of dataclasses.

Real-World Example: Building an Inventory System

Let's see how dataclasses can form the backbone of our inventory system:

from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List, Dict, Optional

@dataclass
class Supplier:
    id: str
    name: str
    contact_email: str
    contact_phone: str
    preferred: bool = False
    notes: str = ""

@dataclass
class Location:
    id: str
    name: str
    address: str
    is_warehouse: bool = False

@dataclass
class Asset:
    product_id: str
    name: str
    category: str
    stock_quantity: int
    supplier_id: str
    reorder_threshold: int = 5
    weight: Optional[float] = None
    locations: Dict[str, int] = field(default_factory=dict)
    last_updated: datetime = field(default_factory=datetime.now)
    
    @property
    def total_quantity(self):
        return sum(self.locations.values())
    
    @property
    def needs_reorder(self):
        return self.stock_quantity <= self.reorder_threshold
        
    def __post_init__(self):
        if self.stock_quantity < 0:
            raise ValueError("Stock quantity cannot be negative")
        
        # Ensure location quantities match total
        if self.locations and sum(self.locations.values()) != self.stock_quantity:
            raise ValueError("Location quantities must sum to total stock quantity")

@dataclass
class InventoryTransaction:
    transaction_id: str
    asset_id: str
    quantity: int
    transaction_type: str  # "receive", "ship", "transfer", "adjust"
    location_id: str
    destination_id: Optional[str] = None  # For transfers
    timestamp: datetime = field(default_factory=datetime.now)
    performed_by: str = "system"
    
    def __post_init__(self):
        valid_types = ["receive", "ship", "transfer", "adjust"]
        if self.transaction_type not in valid_types:
            raise ValueError(f"Transaction type must be one of: {valid_types}")
            
        if self.transaction_type == "transfer" and not self.destination_id:
            raise ValueError("Transfers require a destination_id")

These dataclasses form a clean, type-hinted representation of our inventory domain. The annotations make it clear what types we expect, and the defaults and validation ensure data integrity.

Wrapping it up

Dataclasses aren't magic - they simply automate code you'd otherwise write manually. Their real power lies in making your code more readable and maintainable by focusing on the essential: the data structure itself.

Our inventory management example demonstrates how dataclasses shine when modeling domain objects. They provide:

  1. Clear structure with type hints
  2. Built-in methods for equality and representation
  3. Validation via __post_init__
  4. Easy serialization with asdict()
  5. Properties for derived values

Next time you find yourself writing another class filled with boring __init__, __eq__, and __repr__ methods, remember our Asset class. Could a dataclass express your intent more clearly with half the code? For data-centric classes, the answer is usually yes.