Dataclasses and When to Use Them

Jul 14, 2024

If you've written Python for any meaningful amount of time, you've probably typed self.x = x enough times to develop muscle memory for it. You need a class to hold some data, so you write an __init__, then you realize you need __eq__ for comparisons, and then __repr__ so your debugging output isn't just <Instance object at 0x7f...>. Before you know it, you've written 20 lines of code that say absolutely nothing about the actual data you're trying to represent.

When I was at Intel working on a cloud cost optimization tool, we had to model EC2 instances--their type, vCPU count, memory, state, all the attributes you'd get back from describe_instances. We were building a system that could analyze a customer's AWS fleet and recommend right-sizing actions: shutting down idle instances, swapping an m5.xlarge for a t3.small if utilization didn't justify the cost, that sort of thing. Naturally, we needed a clean way to represent an instance in Python. Here's what the "normal" approach looks like:

class Instance:
    def __init__(self, instance_id, instance_type, state, vcpus, memory_gb, region):
        self.instance_id = instance_id
        self.instance_type = instance_type
        self.state = state
        self.vcpus = vcpus
        self.memory_gb = memory_gb
        self.region = region

    def __eq__(self, other):
        if not isinstance(other, Instance):
            return False
        return (self.instance_id == other.instance_id and
                self.instance_type == other.instance_type and
                self.state == other.state and
                self.vcpus == other.vcpus and
                self.memory_gb == other.memory_gb and
                self.region == other.region)

    def __repr__(self):
        return (f"Instance(instance_id={self.instance_id!r}, "
                f"instance_type={self.instance_type!r}, "
                f"state={self.state!r}, vcpus={self.vcpus}, "
                f"memory_gb={self.memory_gb}, region={self.region!r})")

There's nothing wrong with this, but look at how much of it is just ceremony. The actual information--what fields exist and what types they are--is buried under boilerplate. When you're dealing with a dozen different models like this (instances, volumes, snapshots, cost reports), the boilerplate adds up fast. Dataclasses fix this:

from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    instance_type: str
    state: str
    vcpus: int
    memory_gb: float
    region: str

That's it. The @dataclass decorator generates __init__, __eq__ and __repr__ for you based on the field annotations. You get the same behavior as the verbose version above, but now the class definition actually reads like a description of the data rather than a wall of assignment statements.
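
To see what you get for free, here's a quick sketch (the instance values are made up):

a = Instance("i-0abc123", "m5.xlarge", "running", 4, 16.0, "us-east-1")
b = Instance("i-0abc123", "m5.xlarge", "running", 4, 16.0, "us-east-1")

print(a)        # Instance(instance_id='i-0abc123', instance_type='m5.xlarge', ...)
print(a == b)   # True -- field-by-field comparison from the generated __eq__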

How It Actually Works

Under the hood, a dataclass is still just a regular Python class--the decorator is essentially a code generator that runs at class definition time. It reads your type annotations, figures out what fields you have and generates the dunder methods accordingly. The annotations serve double duty: they tell the dataclass machinery what to generate, and they give your IDE and other developers useful type information.
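
You can poke at this at runtime: dataclasses.fields() hands back the field metadata the decorator collected from your annotations, and the generated methods sit on the class like hand-written ones would. A quick sketch:

from dataclasses import fields, is_dataclass

print(is_dataclass(Instance))              # True
print([f.name for f in fields(Instance)])  # ['instance_id', 'instance_type', 'state', ...]
print("__eq__" in Instance.__dict__)       # True -- generated at class definition time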

One thing worth calling out is that dataclasses don't do runtime type checking. If you annotate a field as str and pass in an int, Python won't complain. The annotations are just hints. If you actually need runtime validation, you're looking at something like Pydantic, which is a whole different beast and honestly probably what you want if you're dealing with external input like API payloads.
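
For example, this goes through without a peep at runtime--a static checker like mypy would flag it, but Python itself doesn't care (values made up):

bad = Instance("i-0abc123", "m5.xlarge", "running", "four", 16.0, "us-east-1")
print(bad.vcpus)   # 'four' -- annotated as int, but nothing enforced it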

Where They Shine

The Instance example above is the straightforward case--you have some data, you want a clean container for it. But there are a couple other patterns where I've found dataclasses to be particularly useful.

DTOs Between Layers

When you're passing data between layers of your application--say, from a service layer to a serializer--dataclasses make great data transfer objects. Instead of passing around dicts (and inevitably misspelling a key somewhere), you get a typed, documented structure.

In our case, we had a service that would analyze an instance's utilization and produce a right-sizing recommendation. That recommendation needed to flow from the analysis engine to the API layer to be served to the user. A dataclass is a natural fit:

from dataclasses import dataclass

@dataclass
class ResizeRecommendation:
    instance_id: str
    current_type: str
    recommended_type: str
    current_monthly_cost: float
    projected_monthly_cost: float
    avg_cpu_utilization: float
    reason: str
    approved: bool = False

    @property
    def monthly_savings(self):
        return self.current_monthly_cost - self.projected_monthly_cost

You can still add methods and properties, just like any class. The dataclass decorator doesn't take away any functionality--it just handles the tedious parts.
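
Here's what that looks like in use--a hypothetical recommendation with made-up numbers, just to show the property doing its thing:

rec = ResizeRecommendation(
    instance_id="i-0abc123",
    current_type="m5.xlarge",
    recommended_type="t3.small",
    current_monthly_cost=140.0,
    projected_monthly_cost=15.5,
    avg_cpu_utilization=3.2,
    reason="Average CPU under 5% for the last 30 days",
)

print(rec.monthly_savings)   # 124.5 -- computed from the two cost fields
print(rec.approved)          # False -- the default kicked in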

Immutable Records with frozen=True

For data that shouldn't change after creation--think audit logs, completed actions, things like that--you can use frozen=True. We used this for recording actions that had already been executed against a customer's fleet, since once you've stopped an instance or swapped its type, that record shouldn't be mutable:

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ExecutedAction:
    action_id: str
    instance_id: str
    action_type: str  # "stop", "resize", "terminate"
    previous_type: str
    new_type: str | None
    executed_at: datetime
    executed_by: str

    @property
    def is_resize(self):
        return self.action_type == "resize"

Try to modify a field on a frozen instance and you'll get a FrozenInstanceError. As a bonus, frozen dataclasses are hashable, so you can use them in sets or as dictionary keys, which is handy.
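
A quick sketch of both behaviors (the action values are made up):

from dataclasses import FrozenInstanceError
from datetime import datetime

action = ExecutedAction(
    action_id="act-001",
    instance_id="i-0abc123",
    action_type="resize",
    previous_type="m5.xlarge",
    new_type="t3.small",
    executed_at=datetime(2024, 7, 1, 12, 0),
    executed_by="auto-resizer",
)

try:
    action.action_type = "terminate"
except FrozenInstanceError:
    print("nope--frozen means frozen")

seen = {action}          # hashable, so sets and dict keys just work
print(action in seen)    # True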

When Not to Use Them

I don't think dataclasses should be your default for every class. If a class is more about behavior than data--something like a FleetManager with methods like execute_resize, rollback_action or generate_cost_report--then forcing it into a dataclass feels weird and unnatural. The decorator is designed for data containers, and trying to stretch it beyond that just makes your code harder to read.

There's also the performance angle, though it's worth being precise about where the cost actually is. A dataclass instance is just a regular class instance, so it carries the usual per-instance __dict__; if you're creating millions of objects in a tight loop, that memory adds up, and __slots__ or NamedTuple might be better choices. But honestly, for the vast majority of applications, this isn't something you'll ever notice.
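
If you do land in that situation, it's worth knowing that on Python 3.10+ you can get the best of both: @dataclass(slots=True) generates __slots__ for you, dropping the per-instance __dict__. A minimal sketch, with a hypothetical UtilizationSample model:

from dataclasses import dataclass

@dataclass(slots=True)
class UtilizationSample:
    instance_id: str
    timestamp: float
    cpu_percent: float

sample = UtilizationSample("i-0abc123", 1720915200.0, 3.2)
# sample.extra = 1   # AttributeError -- slotted instances reject unknown attributes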

Getting More Control with field()

The basic decorator covers most cases, but sometimes you need to customize individual fields. That's where field() comes in:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Instance:
    instance_id: str
    instance_type: str
    state: str
    vcpus: int
    memory_gb: float
    region: str
    access_key: str = field(repr=False)
    last_checked: datetime = field(default_factory=datetime.now)
    tags: dict = field(default_factory=dict)

repr=False keeps access_key out of your __repr__ output--you don't want credentials showing up in your logs. default_factory calls the provided function to generate a fresh default value for each instance.

That default_factory bit is actually important--if you try to do tags: dict = {} directly, the decorator refuses outright and raises ValueError, precisely because a plain class attribute like that would be shared by every instance (the same trap as mutable default arguments). I've seen that shared-state bug in hand-rolled classes, where tagging one instance would mysteriously tag others. default_factory=dict sidesteps it entirely by creating a new dict for each instance.
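
A quick sketch of both sides of that, with a hypothetical Tagged class:

from dataclasses import dataclass, field

@dataclass
class Tagged:
    name: str
    tags: dict = field(default_factory=dict)   # fresh dict per instance
    # tags: dict = {} would fail at class definition time:
    # ValueError: mutable default <class 'dict'> for field tags is not allowed: use default_factory

a = Tagged("a")
b = Tagged("b")
a.tags["team"] = "infra"
print(b.tags)   # {} -- nothing shared between instances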

__post_init__ for Validation and Derived Fields

If you need to run logic right after construction--validation, computing derived values, that sort of thing--there's __post_init__:

from dataclasses import dataclass, field

@dataclass
class Instance:
    instance_id: str
    instance_type: str
    state: str
    vcpus: int
    memory_gb: float
    is_oversized: bool = field(init=False)

    def __post_init__(self):
        if self.vcpus < 0 or self.memory_gb < 0:
            raise ValueError("vCPUs and memory must be non-negative")
        self.is_oversized = self.vcpus >= 8 and self.memory_gb >= 32

field(init=False) excludes is_oversized from the constructor parameters entirely--it gets computed in __post_init__ instead. This is a clean pattern for fields that are derived from other fields and shouldn't be set directly by the caller.
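
A quick sketch of it in action (values made up):

big = Instance("i-0def456", "m5.2xlarge", "running", 8, 32.0)
print(big.is_oversized)   # True -- derived in __post_init__, not passed by the caller

Instance("i-0bad", "t3.micro", "running", -1, 1.0)   # raises ValueError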

At the end of the day, dataclasses are one of those standard library features that, once you start using them, you wonder how you went so long without. They don't do anything revolutionary--they just remove the friction between "I need a class to hold this data" and actually having that class ready to go.