Python Dataclasses

Introduction

Python dataclasses are great. Available from Python 3.7, not only are they fast but also incredibly powerful. While they do have their shortcomings, they're an invaluable part of my Python toolkit.

Basics

Let's start by modeling some data.

from dataclasses import dataclass

@dataclass
class Car:
  make: str
  model: str
  year: int
  price: float
  num_wheels: int = 4
  metadata: dict | None = None

Dataclasses automatically generate an __init__ method for you, with arguments in the same order of the class attributes you define. You can set defalults for these attirbutes as shown for num_wheels above.

c = Car('Honda', 'CRV', 2014, 32_000.00)

Note that the types of the arguments are not enforced out of the box, meaning we can do something like this without getting any error from Python.

c = Car('Honda', 'CRV', '32000', 2014)

Dataclasses come with a special method hook called __post_init__, which is run after the __init__ method. You can use it to enforce validations on your data, modify attributes, or add new attributes based on existing ones.

import json
from datetime import datetime
from dataclasses import field, fields


@dataclass
class Car:
  ...
  
  age: int = field(init=False)
  metadata: dict = None
  
  def __post_init__(self):
    # calculate new attribute based on existing attributes
    self.age = datetime.now().year - self.year
    
    # enforce types 
    self.validate_types()
    
    # serialize some fields to JSON
    self.metadata = json.dumps(self.metadata)

You can set the default value for each field of a dataclass, or otherwise customize it's behavior using field. Be sure to check out all of the options.

Similarly, you can use the fields method to inspect the fields of a dataclass.

Additionally, the __dict__ method of a dataclass instance gives you a dictionary representation of the data.

  def validate_types(self):
    for f in fields(self):
      val = self.__dict__[f.name]
      if not isinstance(val, f.type):
        raise TypeError(f"field '{f.name}' should be of type {f.type} instead of {type(val)}")

Sample Usage: A Minimal ORM

Let's use dataclasses to make a minimal ORM.

When starting a new project, typically we need some way to model data, and store and retrieve it from a database. Instead of reaching for a heavy-handed library and bloating your project with dependencies early on, try starting with some simple functions that cover most of the functionality that you need. Your needs may outgrow it in the future, sure, but along the way, you might learn about what you actually need, helping you make a better decision later.

Intializing Tables

To start, let's generate some SQL to create a table for the class above.

CREATE_TABLE_SQL = '''CREATE TABLE IF NOT EXISTS {} (\n  {}\n)'''

TYPE_SQL_MAP = {
    'int': 'INTEGER',
    'float': 'REAL',
    'str': 'TEXT',
    }

def create_table_sql(cls) -> str:
  fields_data: list[str] = []

  for f in fields(cls):
    type_name = f.type.__name__

    # default type to text, for example if a field is an anum
    sql_type = TYPE_SQL_MAP.get(type_name, 'TEXT')
    
    fields_data.append(f'{f.name} {sql_type}')
    
  return CREATE_TABLE_SQL.format(
      cls.__name__.lower(),
      ',\n  '.join(fields_data))

We can include a dictionary of metadata for each field in the field method, such as information that we want to include in the table defintion. By default, the generated __init__ method has all positional arguments. We can set kw_only in a field to make it a keyword argument, or apply it to all fields by setting it in the dataclass decorator.

from dataclasses import field
...

@dataclass(kw_only=True)
class Car:
  id: int = field(metadata={'PRIMARY KEY': True}, init=False, default=None)
  ...
  
  date: str = field(metadata={'DEFAULT': 'CURRENT_DATE'}, init=False, default=None)

Here we are processing the metadata dictionary to process the extra metadata and generate the proper SQL for each column.

def create_table_sql(cls) -> str:
  ...
  for f in fields(cls):
    ...
    # by default set NOT NULL for all fields
    # merge the two dicts
    metadata: dict[str, bool | str] = dict({'NOT NULL': True}, **f.metadata)
    extra_args = [
        f'{k}{"" if v == True else " "+str(v)}'
        for k, v in metadata.items()
        if v]

    fields_data.append(f'{f.name} {sql_type}{" ".join([""] + extra_args)}')
  ...

This would generate the following SQL.

CREATE TABLE IF NOT EXISTS car (
  id INTEGER NOT NULL PRIMARY KEY,
  make TEXT NOT NULL,
  model TEXT NOT NULL,
  year INTEGER NOT NULL,
  price REAL NOT NULL,
  num_wheels INTEGER NOT NULL,
  metadata TEXT NOT NULL,
  date TEXT NOT NULL DEFAULT CURRENT_DATE
)

Inserting Data

We can similarly generate SQL to insert a dataclass instance into the database.

INSERT_SQL = '''INSERT INTO {} ({}) VALUES ({})'''

def insert_sql(ins) -> tuple[str, list]:
  # look up values in the dict representation
  column_names = [f.name for f in fields(ins) if f.init]
  values = [ins.__dict__[name] for name in column_names]
  # automatically generate number of '?'
  sql = INSERT_SQL.format(
      ins.__class__.__name__.lower(),
      ', '.join(column_names),
      ', '.join(['?']*len(values)))

  return sql, values

Retrieving Data

TODO