SQLAlchemy Enums - Careful what goes into the database

The Situation

SQLAlchemy is an obvious choice when you need to throw together anything dealing with databases in Python. There might be other options, there might be faster options, but if you need it done then SQLAlchemy will do it for you pretty well and very ergonomically.

The problem I ran into recently was dealing with Python enums. Or more specifically: I had a user input problem which naturally turned into an enum on the application side - I had a limited set of inputs I wanted to allow, because those were what we supported - and I didn't want strings all through my code testing for values.

So on the client side it's obvious: check whether the string matches an enum value, and use that. The enum would look something like this:

In [1]:
from enum import Enum

class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"
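As a side note, a minimal sketch of that client-side check might look like the following - parse_color is a hypothetical helper for illustration, not something from the original code:

def parse_color(user_input: str) -> Color:
    # Look the string up by *value*; Color("red") -> Color.RED.
    # Anything we don't support raises ValueError, which we surface to the caller.
    try:
        return Color(user_input.strip().lower())
    except ValueError:
        raise ValueError(f"unsupported color: {user_input!r}") from None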

Now from this, we have our second problem: storing this in the database. We want to not do work here - that's why we're using SQLAlchemy, so we can have our common problems handled. And SQLAlchemy does help us - it has automatic enum type handling built in.

Easy - our model, using the declarative syntax and type hints, can be written as follows:

In [2]:
import sqlalchemy
from sqlalchemy.orm import Mapped, DeclarativeBase, Session, mapped_column
from sqlalchemy import create_engine, select, text

class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color]

This is essentially identical to the example in the SQLAlchemy documentation. And, if we run it in a sample program - it works!

In [3]:
engine = create_engine("sqlite://")

Base.metadata.create_all(engine)

with Session(engine) as session:
    # Create normal values
    for enum_item in Color:
        session.add(TestTable(value=enum_item))
    session.commit()

# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    for record in records:
        print(record.value)
Color.RED
Color.GREEN
Color.BLUE

Right? We stored some enums to the database and retrieved them with simple, elegant code. This is exactly what we want...right?

But the question is...what did we actually store? Let's extend the program to do a raw query to read back that table...

In [5]:
from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT * FROM test_table;")).all())
[(1, 'RED'), (2, 'GREEN'), (3, 'BLUE')]

Notice the tuples: in the second column we see "RED", "GREEN" and "BLUE"...but our enum defines RED as "red". What's going on? And is something wrong here?

Depending on how you view the situation, yes, but also no - but it's likely this isn't what you wanted either.

The primary reason to use SQLAlchemy enum types is to take advantage of databases like PostgreSQL which support native enum types. Everywhere else in SQLAlchemy, when we define a Python class - like we do with TestTable above - we're not defining a plain Python object, we're defining a Python object which describes the database objects we want and how they'll behave.

And so long as we're using things that come from SQLAlchemy - and under the hood SQLAlchemy is converting that enum.Enum to a sqlalchemy.Enum - then this makes complete sense. The enum we declare is declaring what values we store, and what data value they map to...in the sense that we might use the data elsewhere, in our application. Basically our database will hold the symbolic value RED and we interpret that as meaning "red" - but we reserve the right to change that interpretation.
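You can see this default on the type object itself. A quick, hedged sanity check (the exact attribute is an implementation detail, but current SQLAlchemy exposes the database-side strings on the derived Enum type):

# The Enum type SQLAlchemy derives from our Python enum lists the strings it
# will store/accept on the database side - by default these are the member names.
db_enum = sqlalchemy.Enum(Color)
print(db_enum.enums)   # expected, roughly: ['RED', 'GREEN', 'BLUE']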

But if we're coming at this from a Python application perspective - i.e. the reason we made an enum - we likely have a different view of the problem. We're thinking "we want the data to look a particular way, and then refer to it symbolically in code which we might change" - i.e. the immutable element is the data, the value of the enum, because that's what we'll present to the user but not what we want to have all over the application.

In isolation these are separate problems, but automatic enum handling makes the boundary here fuzzy: because while the database is defined in our code, from one perspective, it's also external to it - i.e. we may be writing code which is meant to simply interface with and understand a database not under our control. Basically, the enum.Enum object feels like it's us saying "this is how we'll interpret the external world" and not us saying "this is what the database looks like".

And in that case, our view of the enum is probably more like "the enum is the internal symbolic representation of how we plan to consume database values" - i.e. we expect to map "red" from the database to Color.RED, rather than reading the database and interpreting RED as "red".

Nobody's wrong - but you probably have your assumptions going into this (I know I did...but it compiled, it worked, and I never questioned it - and so long as I'm the sole owner, who cares, right?)

The Problem

There are a few problems with this interpretation, though. One is obvious: we're a simple, apparently safe refactor away from ruining our database schema, and we might not even be aware of it. In the naive interpretation above, changing Color.RED to Color.LEGACY_RED, for example, implies that RED is no longer a valid value in the database - yet if we think of the enum as an application-side mapping to an external interface, that rename looks like a perfectly sensible change.

This is the sort of change which crops up all the time. We know the string "red" is out there, hardcoded and compiled into a bunch of old systems, so we can't just go and change a color name in the database. Or we're doing rolling deployments and need consistency of values - or we share the database, or any number of other complex environment concerns. Either way: we want to avoid needlessly updating database values - changing our code, but not what looks like a constant value, should be safe.

However we're not storing the data we think we are. We expected "red", "green" and "blue" and got "RED", "GREEN" and "BLUE". It's worth noting that the SQLAlchemy documentation leads you astray here, since its second example, which uses typing.Literal for the mapping, reuses the string assignments from the first (and neither shows a sample table result, which would make this obvious on a quick read).

If we rename a member in this enum, the result is actually worse than storing the wrong strings - we stop being able to read models out of this table at all. So if we do the following:

In [6]:
class Color(Enum):
    LEGACY_RED = "red"
    GREEN = "green"
    BLUE = "blue"

Then try to read the models we've created, it won't work - in fact we can't read any part of that table anymore (this post is written as a Jupyter notebook, so the redefinition below is needed to set up the SQLAlchemy model again):

In [8]:
class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color]

with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    for record in records:
        print(record.value)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1608         try:
-> 1609             return self._object_lookup[elem]
   1610         except KeyError as err:

KeyError: 'RED'

The above exception was the direct cause of the following exception:

LookupError                               Traceback (most recent call last)
/tmp/ipykernel_69447/1820198460.py in <module>
      8 
      9 with Session(engine) as session:
---> 10     records = session.scalars(select(TestTable)).all()
     11     for record in records:
     12         print(record.value)

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in all(self)
   1767 
   1768         """
-> 1769         return self._allrows()
   1770 
   1771     def __iter__(self) -> Iterator[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _allrows(self)
    546         make_row = self._row_getter
    547 
--> 548         rows = self._fetchall_impl()
    549         made_rows: List[_InterimRowType[_R]]
    550         if make_row:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   1674 
   1675     def _fetchall_impl(self) -> List[_InterimRowType[Row[Any]]]:
-> 1676         return self._real_result._fetchall_impl()
   1677 
   1678     def _fetchmany_impl(

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   2268             self._raise_hard_closed()
   2269         try:
-> 2270             return list(self.iterator)
   2271         finally:
   2272             self._soft_close()

~/.local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py in chunks(size)
    217                     break
    218             else:
--> 219                 fetch = cursor._raw_all_rows()
    220 
    221             if single_entity:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _raw_all_rows(self)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in <listcomp>(.0)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy.BaseRow.__init__()

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy._apply_processors()

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in process(value)
   1727                 value = parent_processor(value)
   1728 
-> 1729             value = self._object_value_for_elem(value)
   1730             return value
   1731 

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1609             return self._object_lookup[elem]
   1610         except KeyError as err:
-> 1611             raise LookupError(
   1612                 "'%s' is not among the defined enum values. "
   1613                 "Enum name: %s. Possible values: %s"

LookupError: 'RED' is not among the defined enum values. Enum name: color. Possible values: LEGACY_RED, GREEN, BLUE

Even though we did a proper refactor, we can no longer read this table - in fact we can't even read part of it without using raw SQL and giving up on our models entirely. If we were writing an application, we've just broken all our queries - not because we messed anything up, but because we thought we were making a code change when in reality we were making a data change.

This behavior also makes it pretty much impossible to handle externally managed or pre-existing schemas - we don't really want our enum member names to have to follow someone else's data scheme, even if they're well behaved.

Finally it also highlights another danger we've walked into: what if we try to read this column and there are values there we don't recognize? We would get the same error - in this case, RED is unknown because we removed it. But if a new version of our application comes along and has inserted ORANGE, then we'd have the same problem - we've lost backwards and forwards compatibility, in a way which doesn't necessarily show up easily. There's just no easy way to deal with these LookupError validation problems when we're loading large chunks of models - they happen at the wrong part of the stack.

The Solution

Doing the obvious thing here got us a working application with a bunch of technical footguns - which is unfortunate, but it does work. There are plenty of situations where we'd never encounter these - although many more where we might. So what should we do instead?

To get the behavior we expected when we used an enum we can do the following in our model definition:

In [11]:
class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color] = mapped_column(
        sqlalchemy.Enum(Color, values_callable=lambda t: [str(item.value) for item in t])
    )

Notice the values_callable parameter. It is simply passed our Enum class and must return the list of values to store in the database, one per member, in the same order the enum defines them. In this case we just do a Python string conversion of each member's value (which for plain string values returns the literal string - but if you were doing something ill-advised like mixing in numbers, this keeps the database values sensible).
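To make the mechanism concrete, this is roughly what that lambda hands back to SQLAlchemy - these strings become the legal database-side values:

print([str(item.value) for item in Color])
# ['red', 'green', 'blue'] - the member values, rather than the names we got by default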

When we run this with a new database, we now see that we get what we expected in the underlying table:

In [13]:
engine = create_engine("sqlite://")

Base.metadata.create_all(engine)

with Session(engine) as session:
    # Create normal values
    for enum_item in Color:
        session.add(TestTable(value=enum_item))
    session.commit()

# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    print("We restored the following values in code...")
    for record in records:
        print(record.value)

print("But the underlying table contains...")
with engine.connect() as conn:
    print(conn.execute(text("SELECT * FROM test_table;")).all())
We restored the following values in code...
Color.LEGACY_RED
Color.GREEN
Color.BLUE
But the underlying table contains...
[(1, 'red'), (2, 'green'), (3, 'blue')]

Perfect. Now if we're connecting to an external database, or a schema we don't control, everything works great. But what about unknown values? What happens then? Well, we haven't fixed that - we're just much less likely to hit it by accident now. It's also worth noting that SQLAlchemy doesn't validate the values we put into this model against the enum before writing them either. So if we do this, we're back to it not working:

In [15]:
with Session(engine) as session:
    session.add(TestTable(value="reed"))
    session.commit()
In [16]:
# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    print("We restored the following values in code...")
    for record in records:
        print(record.value)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1608         try:
-> 1609             return self._object_lookup[elem]
   1610         except KeyError as err:

KeyError: 'reed'

The above exception was the direct cause of the following exception:

LookupError                               Traceback (most recent call last)
/tmp/ipykernel_69447/3460624042.py in <module>
      1 # Now try and read the values back
      2 with Session(engine) as session:
----> 3     records = session.scalars(select(TestTable)).all()
      4     print("We restored the following values in code...")
      5     for record in records:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in all(self)
   1767 
   1768         """
-> 1769         return self._allrows()
   1770 
   1771     def __iter__(self) -> Iterator[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _allrows(self)
    546         make_row = self._row_getter
    547 
--> 548         rows = self._fetchall_impl()
    549         made_rows: List[_InterimRowType[_R]]
    550         if make_row:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   1674 
   1675     def _fetchall_impl(self) -> List[_InterimRowType[Row[Any]]]:
-> 1676         return self._real_result._fetchall_impl()
   1677 
   1678     def _fetchmany_impl(

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   2268             self._raise_hard_closed()
   2269         try:
-> 2270             return list(self.iterator)
   2271         finally:
   2272             self._soft_close()

~/.local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py in chunks(size)
    217                     break
    218             else:
--> 219                 fetch = cursor._raw_all_rows()
    220 
    221             if single_entity:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _raw_all_rows(self)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in <listcomp>(.0)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy.BaseRow.__init__()

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy._apply_processors()

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in process(value)
   1727                 value = parent_processor(value)
   1728 
-> 1729             value = self._object_value_for_elem(value)
   1730             return value
   1731 

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1609             return self._object_lookup[elem]
   1610         except KeyError as err:
-> 1611             raise LookupError(
   1612                 "'%s' is not among the defined enum values. "
   1613                 "Enum name: %s. Possible values: %s"

LookupError: 'reed' is not among the defined enum values. Enum name: color. Possible values: red, green, blue

Broken again.
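(An aside, and only a sketch of one possible guard rather than anything from the original code: SQLAlchemy's validates hook can reject a bad assignment like "reed" at write time. It does nothing for bad data that already exists in the table, which is the read-side problem the rest of this post deals with. GuardedTable here is a hypothetical variant of TestTable.)

from sqlalchemy.orm import validates

class GuardedTable(Base):  # hypothetical variant of TestTable, for illustration only
    __tablename__ = "guarded_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color] = mapped_column(
        sqlalchemy.Enum(Color, values_callable=lambda t: [str(item.value) for item in t])
    )

    @validates("value")
    def _coerce_value(self, key, value):
        # Accept a Color member, or a raw string that maps to one; anything else
        # raises ValueError here, at assignment time, instead of LookupError at read time.
        return value if isinstance(value, Color) else Color(value)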

So how do we fix this?

Handling Unknown Values

All the LookupErrors we've seen boil down to the same problem: we have no handler for unknown values. In any application where the stored values could change - which I'd argue should be considered all of them - we really want an option that specifies how an unknown value is handled.

At this point we need to subclass the SQLAlchemy Enum type and specify that behavior directly - which we can do like so:

In [25]:
import typing as t

class EnumWithUnknown(sqlalchemy.Enum):
    def __init__(self, *enums, **kw: t.Any):
        super().__init__(*enums, **kw)
        # SQLAlchemy sometimes re-creates this type internally and passes the original
        # type via the _adapted_from keyword argument - but without the original keyword
        # arguments - so we need to handle that case here.
        if "_adapted_from" in kw:
            self._unknown_value = kw["_adapted_from"]._unknown_value
        else:
            self._unknown_value = kw.get("unknown_value", None)
        if self._unknown_value is None:
            raise ValueError("unknown_value should be a member of the enum")
    
    # This is the function which resolves the object for the DB value
    def _object_value_for_elem(self, elem):
        try:
            return self._object_lookup[elem]
        except LookupError:
            return self._unknown_value

And then we can use this type as follows:

In [26]:
class Color(Enum):
    UNKNOWN = "unknown"
    LEGACY_RED = "red"
    GREEN = "green"
    BLUE = "blue"

class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color] = mapped_column(
        EnumWithUnknown(
            Color,
            values_callable=lambda t: [str(item.value) for item in t],
            unknown_value=Color.UNKNOWN,
        )
    )

Let's run that against the database we just inserted "reed" into:

In [27]:
# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    print("We restored the following values in code...")
    for record in records:
        print(record.value)
We restored the following values in code...
Color.LEGACY_RED
Color.GREEN
Color.BLUE
Color.UNKNOWN

And fixed! We have obviously changed our application logic, but this is now much safer, and the code will work as we expect it to in all circumstances.

From a practical perspective we've had to expand our design space to assume indeterminate colors can exist - which might be awkward, but the trade-off is robustness: our application logic can now choose how it handles "unknown". We could still crash if we wanted, but we can also choose to ignore records we don't understand, or display them as "unknown" and prevent user interaction, or whatever else we want.
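For instance, the "ignore what we don't understand" option becomes a one-line filter on the sentinel member (a sketch, assuming the model above):

with Session(engine) as session:
    known_records = [
        record
        for record in session.scalars(select(TestTable))
        if record.value is not Color.UNKNOWN   # skip rows we couldn't map
    ]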

Discussion

This is an interesting case where, in my opinion, the "default" design isn't what you would want, but the logic behind it is sound. SQLAlchemy models define databases - they are principally built on the assumption that you are describing the actual state of a database, with constraints enforced by that database - i.e. in a database with first-class enumeration support, some of the tripwires here simply couldn't be hit without a schema upgrade.

Conversely, if you did a schema upgrade, your old applications still wouldn't know how to parse new values unless you did everything perfectly in lockstep - which in my experience isn't reality.

Basically it's an interesting case where everything is justifiably right, but some design footguns are left lying around which might be a bit of a surprise (hence this post). The kicker for me is the effect on using session.scalars calls to return models - since unless we're querying more specifically, a single unknown value we can't handle means we can't ergonomically list anything in the table at all.

Conclusions

Think carefully before using the automagic enum handling in SQLAlchemy. What it does by default is likely subtly different from what you want, and while there's a simple and elegant way to use enum.Enum with SQLAlchemy, the magic will give you working code quickly - with potentially nasty problems later from subtle bugs or data mismatches.

Listings

The full listing for the code samples here can also be found here.

DHCP Fixed IPs and ESPHome

The Problem

My Home Assistant installation runs in Docker, and ESPHome runs in a separate docker container. I use a separate Wifi SSID for my random ESP devices to give them some isolation from my main network, so mDNS doesn't work.

ESPHome, however, loves mDNS - for discovering and installing devices.

I've just bought a bunch of the Athom Smart Plugs, and want to rename some of their outputs to get sensible labels - as well as generally just manage them.

ESPHome's Config Files

ESPHome is actually very well documented, but it can sometimes be hard to figure out what it's documenting, since its YAML config files mix device and environment information. This is fine - it's a matter of approach - ESPHome likes to think of your environment as a dynamic thing.

For our purposes the issue is we need to make sure ESPHome knows to connect to our devices at their DHCP fixed IP addresses - and to do this we need the wifi.use_address setting - documented here.

This setting is how we solve the problem: we're not going to set a static IP on the ESPHome device itself (since we're letting DHCP handle that via a static reservation - i.e. a fixed IP in Unifi, where I'm actually doing this). Instead, we're just telling ESPHome how to contact this specific device at its static IP (or DNS name, but I'm choosing not to trust those on my local networks for IoT stuff).

Importantly: wifi.use_address isn't a setting which gets configured on the device. It's local to the ESPHome application - all it does is say "use this IP address to communicate with the device". i.e. you can have a device which currently has a totally different IP address from the one you're configuring, and as long as you set use_address to the address it's currently on, ESPHome will update it. This is very useful if you're changing IP addresses around, or only have a DNS name, or similar.

The other important thing to note about this solution is that when you're not using mDNS, you're going to want to set the environment variable ESPHOME_DASHBOARD_USE_PING=1 on the ESPHome dashboard process. This simply tells the dashboard to use ICMP ping, rather than mDNS, to determine device availability, so your devices show up properly as Online (though it doesn't much affect usability if you don't).

The Solution

User Level

To implement this solution for each of my smart devices, I have a stack of YAML files which layer up to provide the necessary functionality following some conventions.

At the top level is the "user" level - one specific device on the network. After it's been booted and initially joined to my IoT SSID, it gets a YAML file named after it that looks like this:

# sp-attic-ventilation.yaml
packages:
  athom.smart-plug-v2: !include .common.athom-smartplug-v2.yaml

esphome:
  name: "sp-attic-ventilation"
  friendly_name: "Attic Ventilation"
  name_add_mac_suffix: false

wifi:
  use_address: 192.168.210.66

There's not much here - just the IP address I assigned, plus a name which matches the hostname I assigned, following the nominal convention of <device-type-abbreviation>-<location>-<controlled device>. So smart plug - sp, located in the attic, controlling the ventilation. You don't have to do this - but it helps. Then we include the friendly name - this is what will appear in Home Assistant - and disable adding the MAC suffix, which is a handy default when you're initially installing and configuring multiple devices using fallback APs.

The important part here is the include file: ESPHome's web interface will automatically hide a file named secrets.yaml as well as any files prefixed with ., which is a convenient way to manage templates and packages.

Device Common Files

The next step up in the stack is a device-common file. Athom Technology publish these on their GitHub account. This sort of thing is why I love Athom and ESPHome - because we can customize the config to work how we want it to. The default smart plug listing is here, and we're going to customize it, though not extensively - namely by adding this line:

packages:
  home: !include .home.yaml

I've included my full listing here (note the removed "time" section).

The Home File

The Home file is the apex of my little ESPHome config stack. In short it's the definition of things which I want to be always true about ESP devices in my home. All of the settings here can be overridden in downstream files if needed, but it's how we get a very succinct config. There's not a lot here but it does capture the important stuff:

# Home-specific features
mdns:
  disabled: false

web_server:
  port: 80

# Common security parameters for all ESPHome devices.
wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

  domain: !secret domain

  ap:
    password: !secret fallback_wifi_password

ota:
  password: !secret ota_password

time:
  - platform: sntp
    id: my_time
    timezone: Australia/Sydney
    servers:
    - !secret ntp_server1

This file references extensively into secrets.yaml, which is templated by my Ansible deployment playbook for ESPHome (which in turn uses my KeePass database for these values). It mostly sets up the critical things I always want on my smart devices: namely, that the onboard HTTP server should always be available (a life-saver for debugging and a fallback for control - every ESP chip I have seems to run it fine).

One of the crucial things I do is hard-code the wifi parameters. The reason is that for as many devices as possible I disable persistent storage, to protect the ESP's flash from write wear. It's enabled for the smart plugs because they don't change state very often, but for something like a light controller it's a waste of flash cycles. This does mean that if the wifi settings were only ever configured via the fallback AP mode, they'll be lost in a power cut - and then all my devices would turn on AP mode and need to be reconfigured.

This is also the reason you definitely want to configure wifi.ap.password: if your devices are unable to connect to your wifi (by default for 1 minute), or don't persist settings and go down, the first thing they'll do (and out of the box Athom devices do this, because obviously you need to configure them yourself) is open a public wifi network to let them be configured by any random passer-by. The consequences of this range from someone having some fun toggling a button to someone implanting an advanced persistent threat.

For much the same reason, you should also configure an over-the-air password - ota.password. There's a difference between control of a device and being able to flash firmware, so this should be enforced. This value lives in my password manager, so I'll always have it around.

Beyond that it's just convenience: for example, I force NTP to point at the Unifi router on my network so everything has a common agreement on the definition of time.

Alternatives

Static IPs

ESPHome does have full support for static IPs via the wifi.manual_ip parameter. It would be entirely valid to take our wifi section from above and change it to look like this:

wifi:
  use_address: 192.168.210.66
  manual_ip:
    static_ip: 192.168.210.66
    subnet: 255.255.255.0
    gateway: 192.168.210.1
    dns1: 192.168.210.1

This device would work just fine on a network without DHCP - it would come up, take its IP and be happy. The reason I don't do this is convenience of management: having the devices send DHCPDISCOVER packets is a nice way to make sure they're alive, and it hands more control of the isolated network segment they're on to my Unifi router, which is what I want. If I want to re-IP a network, updating static address allocations centrally is more convenient (you do have to coordinate rebooting the devices, but they will "get it").

You could obviously do all sorts of fancy scripting around this, but all of that is a lot of work for a very limited gain.

Enable mDNS

ESPHome uses mDNS extensively, and even with an isolated network you can make it work: my Home Assistant and ESPHome docker containers have IP addresses on that network segment so they can talk to these devices, and as a result they can also receive mDNS from them provided I configure it to be bridged properly.

The reason I don't, ultimately, is that keeping track of a list of IPs is simple, whereas mDNS in a more complicated network arrangement like mine is not, and the complexity just isn't worth it - once configured, I never really have to think about these devices. I've lost my Unifi router config, restored it from a backup, and everything was fine. My configs are tracked in Git, my passwords in KeePass - rebuilding this environment is straightforward.

Conclusions

If you're trying to figure out how to flash an ESP device without mDNS, you need to set wifi.use_address to the known IP of the device.

In an environment with DHCP Fixed IP addresses, this means you'll include this value in your ESPHome YAML config files, and it should match your static reservations.

A convenient way to do this is to layer your ESPHome YAML files, with your vendor/device-type files in the middle of the "stack".

Logitech G815 Review / Impressions

I recently decided I wanted to upgrade my keyboard. I had two principal goals: the first was to find a keyboard still in production that I could actually buy. My former go-to was the Logitech K740 (Logitech Illuminated Keyboard), which has been out of production for a very long time. The last time I tried to replace one I ended up buying about 3 keyboards off eBay before I succeeded in getting what I was actually after.

With that one now on the way out - key caps breaking off frequently used keys like backspace, and some suspected trouble with key registration - it seemed like it was finally time to choose a new keyboard and adapt to it. The typing experience and its ergonomics have become important to me, between age and profession, so it's a big decision.

Why a mechanical keyboard?

I've been curious to try a mechanical keyboard essentially due to the hype, although there is some solid logic behind it. My K740s have failed due to the scissor-type plastic (nylon) mechanism failing, and once it goes there's nothing you can do. They also build up dust underneath the keys, but removing the key caps is not well supported - so I've lived with a very fiddly backspace for a while now, as well as some problems with key registration if I don't hit the larger keys (backspace, tab, enter) suitably dead-center.

To be clear: these are emergent problems - when new, the keyboards were solid, but they failed in a predictable way.

So what I'm looking for by going with a mechanical keyboard is improved durability for key registration, and a nice typing experience. With the G815 I'm buying a gaming keyboard, but I'm buying it because I want good key registration for typing.


G815: First impressions - there's an ergonomics change

The K740 is a very thin keyboard with a built-in palm rest. It is 9.3mm thick - incredibly slender, and no mechanical keyboard is going to beat that. The G815/915 series is the thinnest mechanical keyboard line on the market at 22mm thick, but that's still more than double. Up front: it's noticeable - my typing position changed substantially.

The G815 doesn't come with a palm rest out of the box. People have said they don't think it needs one; I disagree. The first thing I found myself doing was raising my arm rests to get my hands flat to the keyboard - it's what I'm doing while typing this review. I'll be buying a palm rest soon and will update this post when I do.

The G Keys

The bigger issue I found, which I did not see discussed in reviews before buying and which is probably universal to this type of gaming keyboard design, is the addition of the G keys to the left-hand side of the keyboard.

I did not realize this before I bought the keyboard, because it's a habit I have without thinking about it, but I essentially use my left pinky finger to find the top-left of the keyboard when typing. On a regular keyboard, holding the top-left of the chassis like this works fine because it's pretty well lined up with escape and the top row of number keys.

The addition of the G keys, however, changes this ergonomic in a big way - my initial attempts at typing were frustrated and difficult because all my instincts about where the keys are were wrong: I'm so used to using that pinky to find the top of the keyboard that it was very difficult to adapt without it. If you are considering this keyboard, or any gaming-style keyboard with extra left-hand macro keys, you would be well advised to check whether this is something you do: it was a huge surprise to me, and the change in how I type is, as of writing (about 45 minutes after unboxing it), still feeling rough. I'm expecting to adapt, but I'm also feeling a muscle strain in my left arm due to the new typing position, so it's not an easy adaptation, and as noted above it may involve more peripherals to get comfortable.

I strongly encourage you not to underestimate this - this is a peripheral I use for 8 hours a day for my job. Its function, and whether it causes muscle strain, is vital.

The Key Action

Mechanical keyboards are all about the key action. I can't give much advice here: YouTube will show you people using it, how it sounds, and tell you how it feels, but it's something which needs to be experienced for yourself. I can say that despite my complaints about the additional G keys, and the fact it's not as thin as the K740, the "Linear" switch model of the G815 feels great to type on when you're in the zone on it. The action is smooth, comfortable and solid - consistent with other reviews which noted that the Linear switches tended to feel the best after a little while of typing, and that I can believe.

Some very good advice when you get into reviewing keyboards and other "things you never think about" is that almost all of them can be criticized - perfect doesn't exist, and the criticisms always feel louder than the good points. The most I can add is: if you can try one in person, that's the best way to explore the space (this is an expensive keyboard, so just buying a whole lot of them - which I suspect is how most YouTubers get into making videos about keyboards - is a danger).

Conclusions - we'll see

It's no fun getting a fairly expensive new thing and feeling "hmmm" about how well it works. The G keys might be the real problem here - that change in typing experience was a huge surprise to me, so if you find this review, that's my core takeaway: be wary of layout changes like that. There is a numpad-less variant of the G815, but I like my media keys and numpad, which is why I bought the larger one. If you don't need or want a numpad, then I'd recommend that one at the present time - no G keys means no problems.

I'm hoping at the moment I'll adapt to the G keys: their potential utility is high (though you can't program them on Linux), but if I could buy a full-size variant without them tomorrow I'd do it and not bother with the adaptation.

But the keys feel great to use, so hence the conclusion: we'll see.

Conclusions Update (same day) - went back to the K740

This is probably a good gaming keyboard.

I say that because I'm sure the G keys are effective for gaming purposes. But for the way I type, which is not true touch typing, the presence of the G keys and the offset they introduce had two pronounced effects: (1) it was almost impossible for me to re-centre my typing on the keyboard after moving my hands away without a pronounced, noticeable process of feeling out where the top-left edge of the keyboard is.

The key-centering problem was replicable with my wife, who has much smaller hands, typing on the keyboard - she found the same subtle problem trying to line up, and inevitably ended up hitting the caps lock key when she did.

The second problem (2) was wrist strain: because the G keys are real keys and live on the left-hand side of the keyboard, my natural resting position for my left hand - off to the side with my palm free - introduced a great deal of strain, specifically in my left arm. The pictures below of my hands roughly show the problem - on the top is my backup K740 and on the bottom the G815:

[Images: K740 resting position (top), G815 resting position (bottom)]

This is with my hands trying to rest in a ready position on the keyboard: you can see the problem - I'm having to actively support the left hand to stop it from depressing the G keys. In my experience this put a strain through the tendon running right up my arm and was quite painful after a short amount of use. It's possible a wrist rest would help, but I'm not wild about the prospect since it's not an included feature of the keyboard, unlike the K740's, and I don't experience this problem using other normal-thickness keyboards - this seems to be an issue specifically with how I hold my hands to type and the existence of the extra column of macro keys.

Wrapping Up

None of the reviews I read or watched for this keyboard before buying it mentioned this possible issue with the full-size keyboard and G keys, though I do recall that most reviewers favored the TKL (tenkeyless) variants of the keyboard for endurance typing - which notably do not have the G keys.

Please keep in mind that if you're reading this, it's all based on quirks of typing which may be specific to how I hold my hands - I am not a touch typist, just a decently fast one from long practice, and most of my typing is done using two fingers on each hand. You may have a fundamentally different experience with this keyboard than I do.

But I have seen no reviews of gaming keyboards with these extra macro keys in this position which commented on the possible issues they may introduce in use - it was a huge surprise when I opened this one, and significantly impactful in a very direct way.

Easy Ephemeral Virtual Machines with libvirt

The Situation

At a previous job I finally got fed up with Docker containers: generally speaking I was always working to set up whole systems or test whole-system stuff, and Docker containers - even when suitable - don't look anything like a whole system.

While Vagrant does exist, there was always something slightly "off" about the feeling of using it - it did what you wanted, but it had a lot of opinions baked in.

So the question I asked myself was: what did I actually want to do?

What we want to do

Since this was a job-specific issue, the thing I wanted to do was boot cloud-like environments quickly, in a way which would let me deploy the codebase as it ran in the cloud. The company had by then simply moved to launching cloud VM instances on AWS for this, but that left holes in the experience - e.g. try getting access to the disk of a cloud VM. On my local machine I can just mount it directly, or dive in with wxHexEditor if I really want to; in the cloud I get to spend time security-managing an instance into the right environment, attaching EBS volumes and...just a lot of not the current problem.

So: the problem I wanted to solve is, given a cloud-init compatible disk image, give myself a single command line which would provision and boot the machine with sensible defaults, and give me an SSH login for it that would just work.

The Solution

What I ended up pulling together is called kvmboot, and for me at least it works pretty nicely. It has also accidentally become my repository of build recipes for getting various flavors of Windows VMs kicked out in a non-annoying state as quickly as possible - the result of the job I took after the original inspiration.

The environment currently works on Ubuntu (what I'm running at home) and should work on Fedora (what I was running when I developed it - hence the SELinux workarounds in the repository).

What it is is pretty simple - launch-cloud-image is a large bash script which spits out an opinionated take on a reasonable libvirt setup. libvirt ships with a number of tools to accomplish things like this, but no real set of instructions to produce something as useful as I've found this customization - of course that might just be me.

Usage

The basic usage I have for it today is setting up Amazon AMI provisioning scripts. Amazon provide a downloadable version of Amazon Linux 2 for KVM, and launch-cloud-image makes using it very easy:

kvmboot $ time ./launch-cloud-image --ram 2G --video amzn2-kvm-2.0.20210813.1-x86_64.xfs.gpt.qcow2 blogtest

xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project.

Drive current: -outdev '/tmp/lci.blogtest.userdata.3dQylgsKb.iso'
Media current: stdio file, overwriteable
Media status : is blank
Media summary: 0 sessions, 0 data blocks, 0 data, 51.0g free
xorriso : NOTE : -blank as_needed: no need for action detected
xorriso : WARNING : -volid text does not comply to ISO 9660 / ECMA 119 rules
xorriso : UPDATE :      12 files added in 1 seconds
Added to ISO image: directory '/'='/tmp/lci.blogtest.userdata.kq9RDblTKJ'
ISO image produced: 41 sectors
Written to medium : 192 sectors at LBA 32
Writing to '/tmp/lci.blogtest.userdata.3dQylgsKb.iso' completed successfully.

xorriso : NOTE : Re-assessing -outdev '/tmp/lci.blogtest.userdata.3dQylgsKb.iso'
xorriso : NOTE : Loading ISO image tree from LBA 0
xorriso : UPDATE :      12 nodes read in 1 seconds
Drive current: -dev '/tmp/lci.blogtest.userdata.3dQylgsKb.iso'
Media current: stdio file, overwriteable
Media status : is written , is appendable
Media summary: 1 session, 41 data blocks, 82.0k data, 51.0g free
Volume id    : 'config-2'
User Login: will
Root disk path: /home/will/.local/share/libvirt/images/lci.blogtest.root.qcow2
ISO file path: /home/will/.local/share/libvirt/images/lci.blogtest.userdata.3dQylgsKb.iso
Virtual machine created as: blogtest
blogtest.default.libvirt : will : aedeebootahnouD7Meig

real    0m16.764s
user    0m0.326s
sys 0m0.077s

16 seconds isn't bad to go from nothing to what I'd get in an EC2 VM - and since I have SSH access I can jump right into using Ansible or something else to provision that machine. Or just alias it so I can kick one up quickly to try silly things.

kvmboot $ ssh will@blogtest.default.libvirt

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
19 package(s) needed for security, out of 59 available
Run "sudo yum update" to apply all updates.
[will@blogtest ~]$ # and then you try stuff here

What's nice is that this is absolutely standard libvirt. It appears in virt-manager, and you can play around with it using all the standard virt-manager commands and management tools. It'll work with remote libvirtds if you have them, but it's a super-convenient way to use a barebones VM environment - about as easy as doing docker run -it ubuntu bash or something similar, but with way more isolation.

But it also works for Windows!

This was the real joy of this solution: when I stumbled into a bunch of Windows provisioning, I'd never had a good solution before. It turns out launch-cloud-image (I should probably rename it kvmboot like the repo) actually works really well for this use case too. With the addition of an installation mode, and some support scripting to build the automatic installation disk images, it can in fact support the whole lifecycle from "Windows ISO" to "cloud-initable Windows image" to "Windows workstation with all the cruft removed".

As a result the repository itself has grown to hold a lot of my research into how to easily get usable Windows environments, but it does work, and it works great - with Windows 10 we can automate the SSH installation and have it drop you straight into PowerShell, ready for provisioning.

Conclusion

I use this script all the time. It's the fastest way I know to get VM environments up which look like the kind of cloud instances you'd be using in the public cloud, and the dnsmasq integration and naming make them super easy to work with while remaining standard, boring libvirt - no magic.

Log OpenSSH public keys from failed logins

Problem

I setup an autossh dialback on a machine in the office and forgot to note down the public key.

While this is certainly not the safest thing to do, how hard could it really be to grab the public key from the machine with the fixed IP that's been hitting my server every 3 seconds for the last 24 hours, and give it a login? (To be clear: a login to my reverseit tool, which is only ever going to allow me to connect back to the other end - if it is in fact the machine I think it is.)

Solution

This StackOverflow solution looks like what I needed, only when I implemented it the keys I got back still didn't work.

The reason is: you don't need to do it.

As of OpenSSH 8.9 in Ubuntu Jammy, debug level 2 will produce log messages that start with

debug2: userauth_pubkey: valid user will querying public key rsa-sha2-512 AAAAB3Nz....

and just give you the whole public key...almost.

The problem is that OpenSSH log messages are truncated by default - at 1024 characters, to be precise, which is shorter than a modern RSA public key (an ECC key would fit).

This is controlled by a #define in log.c:

#define MSGBUFSIZ 1024

Upping this to 8192, I recompiled and...it still didn't work.

Pasting the log lines I was getting into VS Code, I found that all of them were exactly 500 characters. That sounds like a format string to me, so after some more spelunking, there it is - in log.c the do_log function has this line:

        openlog(progname, LOG_PID, log_facility);
        syslog(pri, "%.500s", fmtbuf);
        closelog();

I'm guessing this is to play nicely with legacy syslog implementations limited to roughly 512-byte messages. We're logging to journald here, so let's just increase that to 8192 and try it out.

debug2: userauth_pubkey: valid user will querying public key rsa-sha2-512 AAAAB3NzaC1yc2EAAAADAQABAAABgQCklLxvJWTabmkVDFpOyVUhKTynHtTGfL3ngRH41sdMoiIE7j5WWcA+zvJ2ZqXzH+b5qIAMwb13H4ZkXmu6HLidlaZ0T9VBkKGjUpeHDhJ4fd1p+uw9WTRisVV+Xmw9mjbpiR8+AGXnoNwIeX5tMukglAFwEIQ8GQtM8EV4tS36RWxZjOSoT5sQlAjYsgEzQ7PHXsH3hgM7dyIK1HXrr2XcwFZPCts2EhOyh4e0hyUsvm9Nix2Y7qlqhFA+nH4buuSNpJZ2LjNb9CmWo5bjiYvrRLnU0qJMuPXp0jJeV+LwGA+W/JMbsep9xoqSA6aEQvlRUQx5jRyaJZf9GKqGBNe+v55vEbaTb+PXBU4o7nVFGCygZj2fLrW475o7vZBXJJjdgW/rZ1Eh4G2/Aukz3kfrMiJynRQOc5sFHL1ogZhHEVDqViZVLAHA2aoMCYtrsBJ9BBr/r73bzs9HbsND1wqi5ejYSiODZwX0DGmWZD21OPAj/SDMPUap6Nt/tG7oqs0= [preauth]

Oh wow - there's a lot there! In fact there's the [preauth] tag at the end, which is normally completely cut off.

Full Patch

diff --git a/log.c b/log.c
index bdc4b6515..09474e23a 100644
--- a/log.c
+++ b/log.c
@@ -325,7 +325,7 @@ log_redirect_stderr_to(const char *logfile)
    log_stderr_fd = fd;
 }

-#define MSGBUFSIZ 1024
+#define MSGBUFSIZ 8192

 void
 set_log_handler(log_handler_fn *handler, void *ctx)
@@ -417,7 +417,7 @@ do_log(LogLevel level, int force, const char *suffix, const char *fmt,
        closelog_r(&sdata);
 #else
        openlog(progname, LOG_PID, log_facility);
-       syslog(pri, "%.500s", fmtbuf);
+       syslog(pri, "%.8192s", fmtbuf);
        closelog();
 #endif
    }
--

Use git apply in the working tree of OpenSSH, which I recommend editing with dgit.

Conclusions

OpenSSH does log offered public keys, at DEBUG2 level. But on a standard Ubuntu install, you will not get enough text to see them.

The giveaway that these log lines are being truncated is whether you can see [preauth] after them. This behavior is kind of silly (and should be configurable) - ideally we would at least get a ... or <truncated> marker when it happens, because with variable-length fields like public keys it is not obvious.

Jipi and the Paranoid Chip

This is a short story by Neal Stephenson which used to be hosted online here. It's outlined in more detail in the Wikipedia article here, and I've been wanting to read it again due to the recent furor surrounding Google's LaMDA (Is Google's LaMDA conscious? A philosopher's view).

But alas! The original hosting now returns a 404: fortunately the Google-cached version is still available, and I've downloaded that and made it part of my private collection.

So: to ensure this stays up I'm also including the cached copy as a part of this blog. It goes without saying that all rights to this story belong to the original author.

Click here to read Jipi and the Paranoid Chip (or any of the links above).

Install Firefox as a deb on Ubuntu 22.04

Introduction

Ubuntu 22.04 removes the native Firefox package in favor of a snap package. I'm sure this has advantages.

But the reality for me came down to a few things: startup times were noticeably slower, and the Selenium geckodriver just plain didn't work for me (issue here), with some debate online but no canonical solution. I also couldn't get JupyterLab to autolaunch it (minor, but annoying).

The solution below is reproduced from https://balintreczey.hu/blog/firefox-on-ubuntu-22-04-from-deb-not-from-snap/ with the adaptations which worked for me.

Solution

You can still install Firefox as a native deb from the Mozilla team PPA. The process which worked for me was:

Step 1

Add the (Ubuntu) Mozilla team PPA to your list of software sources by running the following command in the same Terminal window:

sudo add-apt-repository ppa:mozillateam/ppa

Step 2

Pin the Firefox package

echo '
Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
' | sudo tee /etc/apt/preferences.d/mozilla-firefox

Step 3

Ensure upgrades will work automatically

echo 'Unattended-Upgrade::Allowed-Origins:: "LP-PPA-mozillateam:${distro_codename}";' | sudo tee /etc/apt/apt.conf.d/51unattended-upgrades-firefox

Step 4

Install Firefox (this will warn of a downgrade - ignore it)

sudo apt install firefox

Step 5

Remove the Firefox snap

sudo snap remove firefox

Conclusion

This worked for me - Firefox starts, my existing Selenium scripts work.

Running npm install (and other weird scripts) safely

Situation

You do this:

$ git clone https://some.site/git/some.repo.git
$ cd some.repo
$ npm install

Pretty common right? What can go wrong?

What about this:

curl -L https://our-new-thing.xyz/install | bash

This looks a little unsafe. Who would recommend it? Well, it's still one of the ways to install pip in unfamiliar environments. Or Rust.

Now, installing from these places is safe - why? Because they're trusted: there's a huge reputational defense going on. But the reality is that for a lot of tools - npm being a big offender, pip too - there are all sorts of ways that, while sudo and user permissions will protect your system itself from going down, your data - $HOME and the like, basically all the important things on your system - is left exposed.

This is key: you are always running as the "superuser" of your own data - in fact of your entire operating environment: systemctl --user provides a very useful and complete way to schedule tasks and persistent daemons for your entire user session. There's a lot of power and persistence there.

Problem

There are two competing demands here: it's pretty easy to build isolated environments when you feel like you're under attack, but it takes time - time you don't really want to commit to the problem. It's inconvenient - and inconvenience is basically the currency we trade in when it comes to security.

But the convenience<->security exchange rate is not fixed. It has a floor price, but if we can build more convenient tools, then we can protect ourselves against some threats for almost no cost.

Goals

What we want is a safe way to do something like npm install and not be damaged by anything it might run. For our purposes, damage is data destruction or corruption beyond a sensible scope.

We also want this to be lightweight: it should be a momentary "that looks unsafe" sort of intervention, not "let me plan out my secure dev environment".

Enter Bubblewrap

bubblewrap is intended to be an unprivileged container sandboxing tool, with the elimination of container-escape CVEs as a specific goal. It's also available straight from the Ubuntu repositories, which makes things a lot easier.

This is a fairly low level tool, so let's just cut to the wrapper script usage:

#!/bin/bash
# Wrap an executable in a container and limit writes to the current directory only.
# This system does not attempt to limit access to system files, but it does limit writes.

# See: https://stackoverflow.com/questions/59895/how-to-get-the-source-directory-of-a-bash-script-from-within-the-script-itself
# Note: you can't refactor this out: its at the top of every script so the scripts can find their includes.
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR="$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )"
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE" # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
SCRIPT_DIR="$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )"

function log() {
  echo "$*" 1>&2
}

function fatal() {
  echo "$*" 1>&2
  exit 1
}

start_dir="$(pwd)"

bwrap="$(command -v bwrap)"
if [ ! -x "$bwrap" ]; then
    fatal "bubblewrap is not installed. Try running: apt install bubblewrap"
fi

export PS_TAG="$(tput setaf 14)[safe]$(tput sgr0) "

exec "$bwrap" \
    --die-with-parent \
    --tmpfs / \
    --dev /dev \
    --proc /proc \
    --tmpfs /run \
    --mqueue /dev/mqueue \
    --dir /tmp \
    --unshare-all \
    --share-net \
    --ro-bind /bin /bin \
    --ro-bind /etc /etc \
    --ro-bind /run/resolvconf/resolv.conf /run/resolvconf/resolv.conf \
    --ro-bind /lib /lib \
    --ro-bind /lib32 /lib32 \
    --ro-bind /libx32 /libx32 \
    --ro-bind /lib64 /lib64 \
    --ro-bind /opt /opt \
    --ro-bind /sbin /sbin \
    --ro-bind /srv /srv \
    --ro-bind /sys /sys \
    --ro-bind /usr /usr \
    --ro-bind /var /var \
    --ro-bind /home /home \
    --bind "${HOME}/.npm" "${HOME}/.npm" \
    --bind "${HOME}/.cache" "${HOME}/.cache" \
    --bind "${start_dir}" "${start_dir}" \
    -- \
    "$@"

In addition to this script, I also have this in my .bashrc file to get nice shell prompts if I spawn a shell with it:

if [ ! -z "$PS_TAG" ]; then
  export PS1="${PS_TAG}${PS1}"
fi

The basic structure of this invocation is that the resultant container has networking and my full operating environment available inside it...just no write access to any files beyond the current working directory (plus ~/.npm and ~/.cache, so package managers can still use their caches).

This is a handy safety feature for reasons beyond a malicious npm package - I've known more than one colleague to wipe out their home directory with a badly written make clean rule.

Usage

Usage could not be simpler. With the script in my PATH under the name saferun, I can isolate any command or script I'm about to run to only be able to write to the current directory with: saferun ./some-shady-command

I can also launch a protected session with saferun bash which gives me a prompt like:

[safe] $

This is about as low overhead as I can imagine for providing basic filesystem protection.
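
If you want to convince yourself the protection is actually there, a quick sanity check (assuming the script is on your PATH as saferun, and that you run it from a project directory rather than from $HOME itself) looks something like this:

# Inside the sandbox, the current directory stays writable...
saferun bash -c 'touch ./write-test && echo "current dir: writable"'
# ...but the rest of $HOME is mounted read-only, so this should fail.
saferun bash -c 'touch "$HOME/write-test"'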

Conclusions

This is not bullet-proof armor. And it certainly won't keep nosy code from poking around the rest of the filesystem. Are you 100% confident you never saved an important password to some file? I'm not. But I normally work with a lot of auxiliary commands and functions around my home directory, and I like them being mostly available when doing risky things. This strikes a good balance - at the very least it limits the scope of damage some random downloaded script can do.

I recommend checking out bubblewrap's full set of features to figure out what it can really do, but for something I knocked together after a few hours of reading, this has added a handy tool to my repertoire.

Reconditioning the Gen 2 Prius HV battery

The Problem

So I've had a Generation 2 Toyota Prius since 2004 - it's coming up on 17 years old now here in Australia - and recently I finally had the dreaded P0A80 fault code thrown. This is a general hybrid traction battery error.

Since the battery is relatively expensive compared to the value of the car and I don't like spending money anyway, the question becomes what can we do about this?

DIY Reconditioning

Fortunately, the car is old enough that this problem has happened before. Over at https://priuschat.com and elsewhere on the web, people have disassembled the Prius traction battery and fixed this problem themselves.

There are basically two issues at play: general NiMH degradation, and polarity reversal - outright cell failure.

Cell Failure

In general, the P0A80 code (at least in my experience) is thrown when a battery module suddenly drops its voltage by over 1V.

This happens due to a phenomenon in NiMH cells called "polarity reversal" - characterized by a discharge curve like this one:

[Figure: NiMH discharge curve showing polarity reversal. Source]

It is what it sounds like: under extreme discharge conditions the NiMH cell will go to 0V, and if it's left in this state too long (or sits in a battery pack where current continues to be pulled through it) it will enter polarity reversal - positive becomes negative, negative becomes positive. This is disastrous in a normal application, and devastating in a battery pack, because regular charging now drives current through the reversed cell, which simply soaks it up and produces heat.

At this point the cell is dead. In a Prius battery module of 6 cells, a drop of about 1V in module voltage tells you a cell has fallen into reverse polarity, and it's not coming back.

NiMH battery cells primer

It's important to understand NiMH cells to understand why "battery reconditioning" is possible and advisable.

[Figure. Source]

Standard NiMH battery chemistry has a nominal voltage of 1.2V. This has little bearing on the real voltages you see from the cells - a fully charged cell goes up to 1.5V, which is considered the absolute top (you're evolving hydrogen at that point) - and a single, standalone cell can be taken all the way down to 0V (this is not safe - miss the mark and you wind up in polarity reversal).

In a battery pack of NiMH cells the lower limit is raised for safety: the cells all have slightly different capacities, and once one cell hits 0V, continuing to pull the rest down drives the empty one into polarity reversal. At roughly 0.8V per cell you hit a cliff of voltage decay anyway, so that's generally the stopping point.
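
In Prius terms (simple arithmetic, using the 6-cell module described above), that per-cell floor corresponds to a discharge floor of roughly

$6 \times 0.8\,\text{V} \approx 4.8\,\text{V}$

per module.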

The graph below is an excellent primer on the voltage behavior of NiMH at different states of charge. Note that the nominal voltage is what you measure when the cell is nearly empty; for most of the discharge the voltage is very flat - almost linear - only rising meaningfully as the cell approaches full.

[Figure: NiMH cell voltage at different states of charge. Source]
$$\require{mhchem}$$

Degradation Mechanisms

The above explains the behavior of NiMH cells, but not why we can recondition them in a vehicle like the Prius. To understand this, we need to understand the common NiMH battery degradation mechanisms.

NiMH chemistry is based on the following 2 chemical reactions:

Anode: $\ce{H2O + M + e^- <=> OH^- + MH}$

Cathode: $\ce{Ni(OH)2 + OH^- <=> NiO(OH) + H2O + e^-}$

Note the M: this is an intermetallic compound rather than any specific metal, and it is essentially where a lot of the R&D in NiMH batteries goes.
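
Summing the two half-reactions (the water and hydroxide terms cancel) gives the overall cell reaction, which makes it clear the electrolyte itself is not consumed:

$\ce{Ni(OH)2 + M <=> NiO(OH) + MH}$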

Our target for recovery is the cathode reaction involving the nickel. In normal operation the Prius keeps the NiMH batteries between 20-80% of their rated capacity. This is, in general, the right call - deep discharging causes degradation of the electrode materials, which is a permanent killer (over the order of 500-1000 cycles, though).

Crystal Formation

The problem arrives in the form of an issue known as "crystal formation" when the batteries are operated this way over an extended period. Search around and you'll see this referenced often, usually without much explanation, and mostly in the context of nickel-cadmium (NiCd) batteries.

NiMH cells were meant to eliminate, and were a huge improvement on, most of the "memory effect" degradation mechanisms of NiCd batteries. However, some of the fundamental mechanisms still apply, because the cathode is based on the same basic active materials - nickel hydroxide and nickel oxide hydroxide.

There are many, many mechanisms of permanent and transient change in NiMH batteries, but two have been identified which can be treated by the deep charge-discharge cycling recommended for reconditioning.

The first is that observed by Sato et al.: nickel oxide hydroxide has two primary crystal structures when used in batteries - β‐NiOOH and γ‐NiOOH.

β‐NiOOH and γ‐NiOOH are generally recognized as two interconverting crystal states of the nickel electrode in any nickel-based battery, with a (simplified) phase scheme looking like the following:

[Figure: simplified β‐NiOOH/γ‐NiOOH phase scheme. Source]

γ‐NiOOH is the bulkier crystal form, and has more resistance to hydrogen ion diffusion - this is important because the overall ability of the battery to be recharged is entirely dependent on the accessibility of the surface to $\ce{H^+}$ ions to convert it back to $\ce{Ni(OH)2}$.

What Sato et al. observed is that during shallow discharging and overcharging of NiCd cells, a voltage depression effect appears, correlated with a rise in γ‐NiOOH peaks in XRD spectra. When they fully cycled the cells the peaks disappeared - over several cycles the γ‐NiOOH crystals are dissolved back to $\ce{Ni(OH)2}$ during recharge.

[Figure: SEM photographs captured at 10 μm of the positive plates of (a) a good battery, (b) an aged battery, and (c) a restored battery. Note: these were NiCds, but a similar process applies to the nickel electrode of an NiMH cell. Source]

Although the Prius works hard to avoid this sort of environment - the battery is never overcharged - it's worth remembering that "never overcharged" only holds in aggregate. This is a physical system with a physical environment: ions need to move around in solution, so while in aggregate you can avoid ever overcharging a cell, at a microscopic level random chance means an overcharge-like condition can manifest every now and again. That said, it took my car 17 years to get to this point.

There's more detail to this story - a lot more - and pulling a complete picture out of the literature is tricky. For example, the γ‐NiOOH phase here isn't considered true γ‐NiOOH but rather γ'‐NiOOH - the product of nickel intercalating into γ‐NiOOH instead of potassium ions (the $\ce{K^+}$ from the potassium hydroxide electrolyte in the cell). It's also a product of rest time: the phase grows while the battery sits in a partly charged state.

The punchline of all of this is the reason Prius battery reconditioning works: the Prius is exceptionally good at managing its NiMH cells and mostly fights the known memory effects while driving. But it can't fight them all the time, and with age you wind up with capacity degradation due to crystal formation in this ~50% state-of-charge (SOC) range. And importantly: it has been shown experimentally that several full charge-discharge cycles are highly effective at restoring capacity by dissolving away the unwanted phase.

Dehydration

There's a secondary degradation mechanism that's worth noting for those who have seemingly unrecoverable cells in a Prius: dehydration.

Looking again at the NiMH battery chemistry -

Anode: $\ce{H2O + M + e^- <=> OH^- + MH}$

Cathode: $\ce{Ni(OH)2 + OH^- <=> NiO(OH) + H2O + e^-}$

you can see that water - $\ce{H2O}$ - is involved but not consumed in the reactions. This is also kind of transparently obvious: you need an electrolyte for ion exchange. What is not obvious, though, is that during battery charging these reactions are technically in competition with a straight electrolytic water-splitting reaction:

$\ce{2H2O <=> 2H2 + O2}$

This is a known problem, though it is largely handled by normal recombination processes in the battery (a shared gas headspace allows the $\ce{H2}$ and $\ce{O2}$ to recombine back into water), can be assisted by adding specific recombination chemistry, and normally just shows up as a loss term when charging the cells, producing heat.

This is a tradeoff in battery design: a sealed cell doesn't leak gas, which ensures the gases can eventually recombine, but a sealed cell can also overpressure and rupture, at which point the cell is destroyed. The Prius cells are not fully sealed - each has a one-way overpressure blow-off valve which vents at 80-120 psi (550-828 kPa - substantial), and the cells depend on being clamped to prevent gas pressure from damaging them during charging.

But the result is the same: cells with failed seals, or cells that have run hot over a long period, may have permanently lost water through these electrolysis and venting processes.

There are ways to fix this sort of failure - and the results are spectacular - but this is definitely a "last resort for experimentalists" kind of intervention. Typical NiMH designs use a 20-40% w/v KOH solution in water; LiOH is added to improve low-temperature performance, and NaOH is partially or fully substituted to reduce corrosion in high-temperature applications.

Per this link, 30% w/v KOH and 1.5 g/L LiOH are suggested. For the purposes of cell rehydration an exact match is probably not important, since a "dried out" cell will still contain all of its salt components (though depending on redissolving them may not be the best option). A starting point for other mixes might be this paper, which concludes a 6M KOH solution is optimal.
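
For reference, those two figures are consistent with each other: using the molar mass of KOH (~56.1 g/mol), a 6M solution works out to roughly

$6\,\text{mol/L} \times 56.1\,\text{g/mol} \approx 337\,\text{g/L} \approx 34\%\ \text{w/v}$

which is in the same ballpark as the 30% w/v recipe above.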

The big results for anyone considering this are reported by this PriusChat member here - he notes he used 20% KOH. Of note: using deionized water and a salt suitably free of metal contamination is probably key to success (as is sealing the cells back up properly - the trickiest part by all accounts). That said, various metal dopants are deliberately used in NiMH cells to contribute all sorts of properties, so this may be a small effect. Polymeric impurities in the salt are worth worrying about - you can eliminate these by "roasting" the salt to turn them into carbon ash.

It is noted in the literature that 6-8M KOH is the sweet spot for discharge capacity - however, the use of a 1M solution to maximize total cycle life has also been noted here.

One key parameter for anyone considering this is the rule-of-thumb figure for electrolyte volume of 1.5-2.5 mL per Ah of capacity. For the 6.5 Ah Prius cells this corresponds to 9.75-16.25 mL per cell, or 58.5-97.5 mL per module (each module has 6 cells).
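
That's just the rule of thumb multiplied through the cell capacity and module layout:

$1.5\text{-}2.5\,\text{mL/Ah} \times 6.5\,\text{Ah} = 9.75\text{-}16.25\,\text{mL per cell} \quad\Rightarrow\quad \times 6\ \text{cells} = 58.5\text{-}97.5\,\text{mL per module}$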

Doing the Work

You'll need to remove the battery from the car and dismantle it to do this. It can be done quickly once you know what you're doing, but follow a YouTube tutorial and take a lot of photos as you go. Also read the following section and understand what we're dealing with.

Safety

This is the part of the story where we include the big "high voltage can kill" warning, but let me add some explanatory detail: the Prius HV battery is 201.6V nominal - in Australia this is lower than the voltage at the electrical outlets you use every day. But it is a battery - it has no shutoff, and it's DC power (so being shocked triggers muscle contraction that will stop you letting go).

Before you do anything to get the battery out of the car, make sure you pull the high voltage service plug, and then take a multimeter and always verify anything you're about to touch is showing 0V between the battery and car chassis.

Now, the tempering factor is that, handled properly, this battery is quite safe to work with once disassembled. High voltage is only present between the end terminals when the bus bars are connected - broken down into individual modules, the highest voltage you'll see is about 9V on a single NiMH module.
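
For a sense of scale (simple arithmetic from the cell voltages in the primer above and the 28-module Gen 2 pack layout):

$6 \times 1.2\,\text{V} = 7.2\,\text{V nominal per module}, \qquad 6 \times 1.5\,\text{V} = 9\,\text{V fully charged}, \qquad 28 \times 7.2\,\text{V} = 201.6\,\text{V nominal pack}$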

Specific Advice

What does the High Voltage disconnector do?

The big orange plug you pull out of the battery does two things. First, it breaks the circuit between positive and negative inside the battery, which takes the voltage at the battery terminals in the car to 0V. This makes the battery safe to handle with the cover on.

It does this by sitting between the two battery modules in block 10 and breaking the connection there. Because the battery output is wired from the last module's positive terminal to the first module's negative terminal, this breaks the circuit.

There's a secondary benefit once the battery is open: breaking the string here limits the maximum possible voltage inside the battery to ~130V (from block 1 to block 10). This is still a lethal voltage, though.
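
As a rough sanity check on that figure (a sketch assuming the usual layout of two 7.2V modules per block): the run from block 1 up to the break spans on the order of 18-19 modules, so

$18 \times 7.2\,\text{V} \approx 130\,\text{V}$

nominal, rising somewhat when the pack is fully charged.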