SQLAlchemy Enums - Careful what goes into the database

The Situation

SQLAlchemy is an obvious choice when you need to throw together anything dealing with databases in Python. There might be other options, there might be faster options, but if you need it done then SQLAlchemy will do it for you pretty well and very ergonomically.

The problem I ran into recently was dealing with Python enums. Or more specifically: I had a user input problem which, on the application side, obviously turned into an enum - I had a limited set of inputs I wanted to allow, because those were what we supported - and I didn't want raw strings being tested for values all through my code.

So on the client side it's obvious: check if the string matches an enum value, and use that. The enum would look something like below:

In [1]:
from enum import Enum

class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

Now from this, we have our second problem: storing this in the database. We don't want to do extra work here - that's why we're using SQLAlchemy, so the common problems are handled for us. And SQLAlchemy obliges - it provides automatic enum type handling out of the box.

Easy - so our model, using the declarative syntax and type hints, can be written as follows:

In [2]:
import sqlalchemy
from sqlalchemy.orm import Mapped, DeclarativeBase, Session, mapped_column
from sqlalchemy import create_engine, select, text

class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color]

This is essentially identical to the documentation example. And, if we run it in a sample program - it works!

In [3]:
engine = create_engine("sqlite://")

Base.metadata.create_all(engine)

with Session(engine) as session:
    # Create normal values
    for enum_item in Color:
        session.add(TestTable(value=enum_item))
    session.commit()

# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    for record in records:
        print(record.value)
Color.RED
Color.GREEN
Color.BLUE

Right? We stored some enums to the database and retrieved them with simple, elegant code. This is exactly what we want...right?

But the question is...what did we actually store? Let's extend the program to do a raw query to read back that table...

In [5]:
from sqlalchemy import text

with engine.connect() as conn:
    print(conn.execute(text("SELECT * FROM test_table;")).all())
[(1, 'RED'), (2, 'GREEN'), (3, 'BLUE')]

Notice the tuples: in the second column we see "RED", "GREEN" and "BLUE"...but our enum defines RED as "red". What's going on? And is something wrong here?

Depending on how you view the situation, yes, but also no - though it's likely this isn't what you wanted either.

The primary reason to use SQLAlchemy enum types is to take advantage of databases like PostgreSQL which support native enum types. Everywhere else in SQLAlchemy, when we define a Python class - like we do with TestTable above - we're not just defining a Python object, we're defining a Python object which describes the database objects we want and how they'll behave.

And so long as we're using things that come from SQLAlchemy - and under the hood SQLAlchemy is converting that enum.Enum to sqlalchemy.Enum - then this makes complete sense. The enum we declare describes what values we store, and what data values they map to...in the sense that we might use the data elsewhere, in our application. Basically, our database will hold the symbolic value RED and we interpret that as meaning "red" - but we reserve the right to change that interpretation.
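
You can actually watch SQLAlchemy make that decision by asking it for the DDL it would emit - a quick check along these lines (the exact output varies by dialect and SQLAlchemy version; on SQLite the column is sized to the longest name, and depending on options you may also get a CHECK constraint over the names):

from sqlalchemy.schema import CreateTable

print(CreateTable(TestTable.__table__).compile(engine))
# CREATE TABLE test_table (
#     id INTEGER NOT NULL,
#     value VARCHAR(5) NOT NULL,   -- sized for 'GREEN' - the names, not the values
#     PRIMARY KEY (id)
# )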

But if we're coming at this from a Python application perspective - i.e. the reason we made an enum - we likely have a different view of the problem. We're thinking "we want the data to look a particular way, and then to refer to it symbolically in code which we might change" - i.e. the immutable element is the data, the value, of the enum - because that's what we'll present to the user, but not what we want to have all over the application.

In isolation these are separate problems, but automatic enum handling makes the boundary here fuzzy: because while the database is defined in our code, from one perspective, it's also external to it - i.e. we may be writing code which is meant to simply interface with and understand a database not under our control. Basically, the enum.Enum object feels like it's us saying "this is how we'll interpret the external world" and not us saying "this is what the database looks like".

And in that case, our view of the enum is probably more like "the enum is the internal symbolic representation of how we plan to consume database values" - i.e. we expect to map "red" from the database to Color.RED, rather than reading the database and interpreting RED as "red".
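
In plain Python terms, the two mental models come down to which enum lookup you think happens when a row is read back - both lines below are standard enum behaviour, nothing SQLAlchemy-specific:

Color["RED"]   # lookup by name  -> Color.RED  (what the default mapping effectively does)
Color("red")   # lookup by value -> Color.RED  (what the application-centric view expects)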

Nobody's wrong - but you probably have your own assumptions going into this (I know I did...but it compiled, it worked, and I never questioned it - and so long as I'm the sole owner, who cares, right?)

The Problem

There are a few problems with this interpretation though. One is obvious: we're a simple, apparently safe refactor away from ruining our database schema, and we might not even be aware of it. In the naive interpretation above, changing Color.RED to Color.LEGACY_RED, for example, implies that RED is no longer a valid value in the database - which, if we think of the enum as an application mapping to an external interface, is something which might make sense.

This is the sort of change which crops up all the time. We know the string "red" is out there, hardcoded and compiled into a bunch of old systems, so we can't just go and change a color name in the database. Or we're doing rolling deployments and need consistency of values - or we share the database, or any number of other complex environment concerns. Either way: we want to avoid needlessly updating the database values - changing our code, but not what looks like a constant's value, should be safe.

However, we're not storing the data we think we are. We expected "red", "green" and "blue" and got "RED", "GREEN" and "BLUE". It's worth noting that the SQLAlchemy documentation leads you astray here, since the second example - showing the use of typing.Literal for the mapping - reuses the string assignments from the first (and neither shows a sample table result, which would make this obvious on a quick read).

If we change a name in this enum, then the result is actually bad if we've used it anywhere - we stop being able to read models out of this table at all. So if we do the following:

In [6]:
class Color(Enum):
    LEGACY_RED = "red"
    GREEN = "green"
    BLUE = "blue"

Then try to read the models we've created - it won't work; in fact we can't read any part of that table anymore (this post is written as a Jupyter notebook, so the redefinition below is needed to set up the SQLAlchemy model again).

In [8]:
class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color]

with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    for record in records:
        print(record.value)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1608         try:
-> 1609             return self._object_lookup[elem]
   1610         except KeyError as err:

KeyError: 'RED'

The above exception was the direct cause of the following exception:

LookupError                               Traceback (most recent call last)
/tmp/ipykernel_69447/1820198460.py in <module>
      8 
      9 with Session(engine) as session:
---> 10     records = session.scalars(select(TestTable)).all()
     11     for record in records:
     12         print(record.value)

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in all(self)
   1767 
   1768         """
-> 1769         return self._allrows()
   1770 
   1771     def __iter__(self) -> Iterator[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _allrows(self)
    546         make_row = self._row_getter
    547 
--> 548         rows = self._fetchall_impl()
    549         made_rows: List[_InterimRowType[_R]]
    550         if make_row:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   1674 
   1675     def _fetchall_impl(self) -> List[_InterimRowType[Row[Any]]]:
-> 1676         return self._real_result._fetchall_impl()
   1677 
   1678     def _fetchmany_impl(

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   2268             self._raise_hard_closed()
   2269         try:
-> 2270             return list(self.iterator)
   2271         finally:
   2272             self._soft_close()

~/.local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py in chunks(size)
    217                     break
    218             else:
--> 219                 fetch = cursor._raw_all_rows()
    220 
    221             if single_entity:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _raw_all_rows(self)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in <listcomp>(.0)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy.BaseRow.__init__()

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy._apply_processors()

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in process(value)
   1727                 value = parent_processor(value)
   1728 
-> 1729             value = self._object_value_for_elem(value)
   1730             return value
   1731 

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1609             return self._object_lookup[elem]
   1610         except KeyError as err:
-> 1611             raise LookupError(
   1612                 "'%s' is not among the defined enum values. "
   1613                 "Enum name: %s. Possible values: %s"

LookupError: 'RED' is not among the defined enum values. Enum name: color. Possible values: LEGACY_RED, GREEN, BLUE

Even though we did a proper refactor, we can no longer read this table - in fact we can't even read part of it without using raw SQL and giving up on our models entirely. If we were writing an application, we've just broken all our queries - not because we messed anything up, but because we thought we were making a code change when in reality we were making a data change.

This behavior also makes it pretty much impossible to handle externally managed or pre-existing schemas - we don't really want our enum member names to be forced to follow someone else's data scheme, even if they're well behaved.

Finally, it also highlights another danger we've walked into: what if we try to read this column and there are values there we don't recognize? We would get the same error - in this case, RED is unknown because we removed it, but if a newer version of our application comes along and inserts ORANGE then we have the same problem - we've lost backwards and forwards compatibility, in a way which doesn't necessarily show up easily. There's just no easy way to deal with these LookupError validation problems when we're loading large chunks of models - they happen at the wrong part of the stack.

The Solution

Doing the obvious thing here got us a working application with a bunch of technical footguns - which is unfortunate, but it does work. There are plenty of situations where we'd never encounter these - though many more where we might. So what should we do instead?

To get the behavior we expected when we used an enum we can do the following in our model definition:

In [11]:
class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color] = mapped_column(sqlalchemy.Enum(Color, values_callable=lambda t: [ str(item.value) for item in t ]))

Notice the values_callable parameter. It's passed our Enum class and must return the list of values to store in the database, in the same order as the enum's members. In this case we simply do a Python string conversion of each member's value (which just returns the literal string - but if you were doing something ill-advised like mixing in numbers, this keeps it sensible for the DB).
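
To make that concrete, here's what our lambda hands back for the Color enum - these strings are what end up stored (and, on databases with native enum types, what the database type is created with):

values = [str(item.value) for item in Color]
print(values)  # ['red', 'green', 'blue']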

When we run this with a new database, we now see that we get what we expected in the underlying table:

In [13]:
engine = create_engine("sqlite://")

Base.metadata.create_all(engine)

with Session(engine) as session:
    # Create normal values
    for enum_item in Color:
        session.add(TestTable(value=enum_item))
    session.commit()

# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    print("We restored the following values in code...")
    for record in records:
        print(record.value)

print("But the underlying table contains...")
with engine.connect() as conn:
    print(conn.execute(text("SELECT * FROM test_table;")).all())
We restored the following values in code...
Color.LEGACY_RED
Color.GREEN
Color.BLUE
But the underlying table contains...
[(1, 'red'), (2, 'green'), (3, 'blue')]

Perfect. Now if we're connecting to an external database, or a schema we don't control, everything works great. But what about when we have unknown values? What happens then? Well, we haven't fixed that - but we're much less likely to encounter it by accident now. It's also worth noting that SQLAlchemy doesn't validate the inputs we put into this model against the enum before writing, either. So if we do this, we're back to it not working:

In [15]:
with Session(engine) as session:
    session.add(TestTable(value="reed"))
    session.commit()
In [16]:
# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    print("We restored the following values in code...")
    for record in records:
        print(record.value)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1608         try:
-> 1609             return self._object_lookup[elem]
   1610         except KeyError as err:

KeyError: 'reed'

The above exception was the direct cause of the following exception:

LookupError                               Traceback (most recent call last)
/tmp/ipykernel_69447/3460624042.py in <module>
      1 # Now try and read the values back
      2 with Session(engine) as session:
----> 3     records = session.scalars(select(TestTable)).all()
      4     print("We restored the following values in code...")
      5     for record in records:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in all(self)
   1767 
   1768         """
-> 1769         return self._allrows()
   1770 
   1771     def __iter__(self) -> Iterator[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _allrows(self)
    546         make_row = self._row_getter
    547 
--> 548         rows = self._fetchall_impl()
    549         made_rows: List[_InterimRowType[_R]]
    550         if make_row:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   1674 
   1675     def _fetchall_impl(self) -> List[_InterimRowType[Row[Any]]]:
-> 1676         return self._real_result._fetchall_impl()
   1677 
   1678     def _fetchmany_impl(

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _fetchall_impl(self)
   2268             self._raise_hard_closed()
   2269         try:
-> 2270             return list(self.iterator)
   2271         finally:
   2272             self._soft_close()

~/.local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py in chunks(size)
    217                     break
    218             else:
--> 219                 fetch = cursor._raw_all_rows()
    220 
    221             if single_entity:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in _raw_all_rows(self)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

~/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py in <listcomp>(.0)
    539         assert make_row is not None
    540         rows = self._fetchall_impl()
--> 541         return [make_row(row) for row in rows]
    542 
    543     def _allrows(self) -> List[_R]:

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy.BaseRow.__init__()

lib/sqlalchemy/cyextension/resultproxy.pyx in sqlalchemy.cyextension.resultproxy._apply_processors()

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in process(value)
   1727                 value = parent_processor(value)
   1728 
-> 1729             value = self._object_value_for_elem(value)
   1730             return value
   1731 

~/.local/lib/python3.10/site-packages/sqlalchemy/sql/sqltypes.py in _object_value_for_elem(self, elem)
   1609             return self._object_lookup[elem]
   1610         except KeyError as err:
-> 1611             raise LookupError(
   1612                 "'%s' is not among the defined enum values. "
   1613                 "Enum name: %s. Possible values: %s"

LookupError: 'reed' is not among the defined enum values. Enum name: color. Possible values: red, green, blue

Broken again.

So how do we fix this?

Handling Unknown Values

All the LookupErrors we've seen boil down to the same problem: we have no handler for unknown values. In any application where the stored values could change - which I would argue is all of them - we really want a way to specify how an unknown value should be handled.

At this point we need to subclass the SQLAlchemy Enum type and specify that handling directly - which we can do like so:

In [25]:
import typing as t

class EnumWithUnknown(sqlalchemy.Enum):
    def __init__(self, *enums, **kw: t.Any):
        super().__init__(*enums, **kw)
        # SQLAlchemy sets the _adapted_from keyword argument sometimes, which contains a reference to the original type - but won't include
        # original keyword arguments, so we need to handle that here.
        self._unknown_value = kw["_adapted_from"]._unknown_value if "_adapted_from" in kw else kw.get("unknown_value",None)
        if self._unknown_value is None:
            raise ValueError("unknown_value should be a member of the enum")
    
    # This is the function which resolves the object for the DB value
    def _object_value_for_elem(self, elem):
        try:
            return self._object_lookup[elem]
        except LookupError:
            return self._unknown_value

And then we can use this type as follows:

In [26]:
class Color(Enum):
    UNKNOWN = "unknown"
    LEGACY_RED = "red"
    GREEN = "green"
    BLUE = "blue"

class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color] = mapped_column(EnumWithUnknown(Color, values_callable=lambda t: [ str(item.value) for item in t ], 
                                                         unknown_value=Color.UNKNOWN))
    

Let's run that against the database we just inserted reed into:

In [27]:
# Now try and read the values back
with Session(engine) as session:
    records = session.scalars(select(TestTable)).all()
    print("We restored the following values in code...")
    for record in records:
        print(record.value)
We restored the following values in code...
Color.LEGACY_RED
Color.GREEN
Color.BLUE
Color.UNKNOWN

And fixed! We have obviously changed our application logic, but this is now much safer - code which will work as we expect it to in all circumstances.

From a practical perspective we've had to expand our design space to assume indeterminate colors can exist - which might be awkward, but the trade-off is robustness: our application logic can now choose how it handles "unknown" - we could crash if we wanted, but we can also choose just to ignore those records we don't understand or display them as "unknown" and prevent user interaction or whatever else we want.
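
The write side is still on us though: as we saw, nothing stopped us handing the model the string "reed". If you want a belt-and-braces guard there too, one option is an ORM-level validator - a sketch below using SQLAlchemy's validates decorator (coercing through Color() is my own choice here, not something the earlier code does):

from sqlalchemy.orm import validates

class Base(DeclarativeBase):
    pass

class TestTable(Base):
    __tablename__ = "test_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    value: Mapped[Color] = mapped_column(EnumWithUnknown(Color, values_callable=lambda t: [ str(item.value) for item in t ],
                                                         unknown_value=Color.UNKNOWN))

    @validates("value")
    def _coerce_value(self, key, value):
        # Accept Color members as-is; push raw strings through the enum so a typo
        # like "reed" fails at assignment time instead of poisoning the table.
        return value if isinstance(value, Color) else Color(value)

With that in place, TestTable(value="reed") raises ValueError immediately, while TestTable(value="red") and TestTable(value=Color.GREEN) both still work.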

Discussion

This is an interesting case where, in my opinion, the "default" design isn't what you would want, but the logic behind it is actually sound. SQLAlchemy models define databases - they are principally built on the assumption that you are describing the actual state of a database, with constraints enforced by the database - i.e. in a database with first-class enumeration support, some of the tripwires here simply couldn't be hit without a schema upgrade.

Conversely, if you did a schema upgrade, your old applications still wouldn't know how to parse new values unless you did everything perfectly in lockstep - which in my experience isn't reality.

Basically it's an interesting case where everything is justifiably right, but it leaves some design footguns lying around which might be a bit of a surprise (hence this post). The kicker for me is the effect on using session.scalars calls to return models - since unless we're querying more specifically, having unknown values we can't handle in a table means we can't ergonomically list any of its rows at all.

Conclusions

Think carefully before using the automagic enum handling in SQLAlchemy. What it does by default is likely subtly different from what you want, and while there's a simple and elegant way to use enum.Enum with SQLAlchemy, the magic will give you working code quickly - with potentially nasty problems from subtle bugs or data mismatches later.

Listings

The full listing for the code samples here can also be found here.

DHCP Fixed IPs and ESPHome

The Problem

My Home Assistant installation runs in Docker, and ESPHome runs in a separate docker container. I use a separate Wifi SSID for my random ESP devices to give them some isolation from my main network, so mDNS doesn't work.

ESPHome, however, loves mDNS - it uses it to discover and install devices.

I've just bought a bunch of the Athom Smart Plugs, and want to rename some of their outputs to get sensible labels - as well as generally just manage them.

ESPHome's Config Files

ESPHome is actually very well documented, but it can be hard to figure out what it's documenting sometimes, since there's a combination of device and environment information in its YAML config files. This is fine - it's a matter of approach - ESPHome likes to think of your environment as a dynamic thing.

For our purposes the issue is we need to make sure ESPHome knows to connect to our devices at their DHCP fixed IP addresses - and to do this we need the wifi.use_address setting - documented here.

This setting is how we solve the problem: we're not going to set a static IP on the ESPHome device itself (since we're letting DHCP handle that via a static reservation - i.e. a fixed IP in Unifi, which is where I'm actually doing this). Instead, we're just telling ESPHome how to contact this specific device at its reserved IP (or DNS name, but I'm choosing not to trust those on my local networks for IoT stuff).

Importantly: wifi.use_address isn't a setting which gets configured on the device. It's local to the ESPHome application - all it does is say "use this IP address to communicate with the device". I.e. you can have a device which currently has a totally different IP address to the one you're configuring, and as long as you set use_address to the value it's currently on, ESPHome will update it. This is very useful if you're changing IP addresses around, or only have a DNS name or something.

The other important thing to note about this solution is that when you're not using mDNS, you're going to want to set the environment variable ESPHOME_DASHBOARD_USE_PING=1 on the ESPHome dashboard process. This simply tells the dashboard to use ICMP ping, rather than mDNS, to determine device availability, so your devices show up properly as Online (though it doesn't much affect usability if you don't).

The Solution

User Level

To implement this solution for each of my smart devices, I have a stack of YAML files which layer up to provide the necessary functionality following some conventions.

At the top level is the "user" level - one specific device on the network. After it's booted and been initially joined to my IoT SSID, it gets a YAML file named after it that looks like this:

# sp-attic-ventilation.yaml
packages:
  athom.smart-plug-v2: !include .common.athom-smartplug-v2.yaml

esphome:
  name: "sp-attic-ventilation"
  friendly_name: "Attic Ventilation"
  name_add_mac_suffix: false

wifi:
  use_address: 192.168.210.66

There's not much here - just the IP address I assigned, plus a name which matches the hostname I assigned and follows my nominal convention of <device-type-abbreviation>-<location>-<controlled device>. So smart plug - sp, located in the attic, controlling the ventilation. You don't have to do this - but it helps. Then we include the friendly name - this is what appears in Home Assistant - and disable adding the MAC suffix (the suffix is a handy default when you're initially installing and configuring multiple devices using their fallback APs, but we don't want it in the final name).

The important part here is to note the include file: ESPHome's web interface will automatically hide a file named secrets.yaml, as well as any files prefixed with ".", which is a convenient way to manage templates and packages.

Device Common Files

The next step up in the stack is a device-common file. Athom Technology publish these on their GitHub account. This sort of thing is why I love Athom and ESPHome - because we can customize this to work how we want it to. The default smart plug listing is here, but we're going to customize it, though not extensively - namely we're adding this line:

packages:
  home: !include .home.yaml

I've included my full listing here (note the removed "time" section).

The Home File

The Home file is the apex of my little ESPHome config stack. In short it's the definition of things which I want to be always true about ESP devices in my home. All of the settings here can be overridden in downstream files if needed, but it's how we get a very succinct config. There's not a lot here but it does capture the important stuff:

# Home-specific features
mdns:
  disabled: false

web_server:
  port: 80

# Common security parameters for all ESPHome devices.
wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

  domain: !secret domain

  ap:
    password: !secret fallback_wifi_password

ota:
  password: !secret ota_password

time:
  - platform: sntp
    id: my_time
    timezone: Australia/Sydney
    servers:
    - !secret ntp_server1

This file extensively references into secrets.yaml, which is templated by my Ansible deployment playbook for ESPHome (which in turn uses my Keepass database for these values). It mostly sets up the critical things I always want on my smart devices: namely, the onboard HTTP server should always be available (life-saver for debugging and a fallback for control - every ESP chip I have seems to run it fine).

One of the crucial things I do is hard-code the wifi parameters: the reason is that, for as many devices as possible, I disable persistent storage to protect the ESP's write flash. Persistence is enabled for the smart plugs because they don't change state very often, but for something like a light controller it's a waste of flash cycles. This does mean that if the wifi settings were only configured via the fallback AP mode, they'd be lost in a power cut - and then all my devices would turn on AP mode and need to be reconfigured.

This is also the reason you definitely want to configure wifi.ap.password: because if your devices are unable to connect to your wifi (by default for 1 minute), or don't persist settings and are down, then the first thing they'll do (and out of the box Athom devices do this, because obviously you need to configure them yourself) is open a public wifi network to let them be configured by just any random passer-by. The consequences of this range from someone having some fun toggling a button to someone implanting an advanced persistent threat.

For much the same reason, you should also configure an over-the-air password - ota.password. There's a difference between control of a device and being able to flash firmware, so this should be enforced. This value lives in my password manager, so I'll always have it around.

Beyond that there's just convenience: i.e. I force NTP to point to the Unifi router on my network so everyone has a common agreement on the definition of time.

Alternatives

Static IPs

ESPHome does have full support for static IPs via the wifi.manual_ip parameter. It would be entirely valid to take our wifi section from above and change it to look like this:

wifi:
  use_address: 192.168.210.66
  manual_ip:
    static_ip: 192.168.210.66
    subnet: 255.255.255.0
    gateway: 192.168.210.1
    dns1: 192.168.210.1

This device would work just fine on a network without DHCP - it would come up, grab its IP and be happy. The reason I don't do this is convenience of management: having the devices send DHCPDISCOVER packets is a nice way to make sure they're alive, and it hands more control of the isolated network segment they're on over to my Unifi router, which is what I want. If I want to re-IP a network, updating static address allocations centrally is more convenient (you do have to coordinate rebooting the devices, but they will "get it").

You could obviously do all sorts of fancy scripting around this, but all of that is a lot of work for a very limited gain.

Enable mDNS

ESPHome uses mDNS extensively, and even with an isolated network you can make it work: my Home Assistant and ESPHome docker containers have IP addresses on that network segment so they can talk to these devices, and as a result they can also receive mDNS from them provided I configure it to be bridged properly.

For me, the reason not to is ultimately just that keeping track of a list of IPs is simple, whereas mDNS in a more complicated network arrangement like mine is not, and the complexity just isn't worth it - once configured, I never really have to think about these devices. I've lost my Unifi router config, restored it from a backup, and everything was fine. My configs are tracked in Git, my passwords in Keepass - rebuilding this environment is straightforward.

Conclusions

If you're trying to figure out how to flash an ESPHome device you can't reach via mDNS, you need to set wifi.use_address to the known IP of the device.

In an environment with DHCP Fixed IP addresses, this means you'll include this value in your ESPHome YAML config files, and it should match your static reservations.

A convenient way to do this is to layer your ESPHome YAML files, with your vendor/device-type files in the middle of the "stack".

Logitech G815 Review / Impressions

I recently decided I wanted to upgrade my keyboard. I had two principal goals: the first was to find a production keyboard I could still buy. My former go-to was the Logitech K740 (Logitech Illuminated Keyboard), which had been out of production for a very long time. The last time I tried to replace one I ended up buying about 3 keyboards off eBay before I succeeded in getting what I was actually after.

With that one now on the way out - keycaps breaking off frequently used keys like backspace, and some suspected trouble with key registration - it seemed like it was finally time to choose a new keyboard and adapt to it. The typing experience and its ergonomics have become important to me, between age and profession, so it's a big decision.

Why a mechanical keyboard?

I've been curious to try a mechanical keyboard essentially due to hype, although there is some solid logic behind it. My K740s have failed due to the scissor-type plastic (nylon) mechanism failing, and once it goes there's nothing you can do. They also build up dust underneath the keys, but removing the key caps is not super-well supported - and I've lived with a very fiddly backspace for a while now, as well as some problems with key registration if I don't hit the larger keys (backspace, tab, enter) suitably dead-center.

To be clear: these are emergent problems - as new, the keyboards were solid but they failed in a predictable way.

So what I'm looking for by going with a mechanical keyboard is improved durability for key registration, and a nice typing experience. With the G815 I'm buying a gaming keyboard, but I'm buying it because I want good key registration for typing.

G815: First impressions - there's an ergonomics change

The K740 is a very thin keyboard, with a built-in palm rest. It is 9.3mm thick - that is incredibly slender, and no mechanical keyboard is going to beat it. The G815/915 series is the thinnest mechanical keyboard on the market at 22mm thick, but that's still more than double. Up front: it's noticeable - my typing position was substantially changed.

The G815 doesn't come with a palm rest out of the box: people have said they don't think it needs one; I would disagree. The first thing I found myself doing was raising my armrests to get my hands flat to the keyboard - it's what I'm doing while typing this review. I'll be buying a palm rest soon and will update this post when I do.

The G Keys

The bigger issue I found - which I did not see talked about in reviews before buying, and which is probably universal to this type of gaming keyboard design - is the addition of the G keys to the left-hand side of the keyboard.

I did not realize this before I bought the keyboard because it's a habit I have without thinking about it, but I essentially use my left pinky finger to find the top-left of the keyboard when typing. On a regular keyboard, holding the top-left of the chassis like this works fine because it's pretty well lined up with Escape and the top row of number keys.

The addition of the G keys, however, changes the ergonomics of this in a big way - my initial attempts at typing were frustrated and difficult because all my instincts about where the keys are were wrong: I'm so used to using that pinky to find the top of the keyboard that it was very difficult to adapt without it. If you are considering this keyboard, or any gaming-style keyboard with extra left-hand macro keys, you would be well advised to check whether this is something you do: it was a huge surprise to me, and the change in how I type is, as of writing (about 45 minutes after unboxing it), still feeling rough. I'm expecting to adapt, but I'm also feeling muscle strain in my left arm due to the new typing position, so it's not an easy adaptation - and as noted above it may involve more peripherals to get comfortable.

I strongly encourage not underestimating this - this is a peripheral I use for 8 hours a day for my job. Its function, and whether it causes muscle strain, are vital.

The Key Action

Mechanical keyboards are all about the key action. I can't give much advice here: YouTube will show you people using it, how it sounds, and tell you how it feels, but it's something which needs to be experienced for yourself. I can say that despite my complaints about the additional G keys, and the fact it's not as thin as the K740, the "Linear" switch model of the G815 feels great to type on when you're in the zone. The action is smooth, comfortable and feels solid - consistent with other reviews which noted that the Linear switches tended to feel the best after a little while of typing, and this I can believe.

Some very good advice when you get into reviewing keyboards and other "things you never think about" is that almost all of them can be criticized - perfect doesn't exist, and the criticisms always feel louder than the good points. The most I can add here is: if you can use one in person, that's the best way to explore the space (this is an expensive keyboard, so just buying a whole lot of them - which I suspect is what gets most YouTubers into making videos about keyboards - is a danger).

Conclusions - we'll see

It's no fun getting a fairly expensive new thing and feeling "hmmm" about how well it works. The G keys might be the real problem here - that change in typing experience was a huge surprise to me, so if you find this review then that's my core takeaway: be wary of layout changes like that. There is a numpad-less variant of the G815 which can be had, but I like my media keys and numpad, so that's why I bought the larger one. If you don't need or want a numpad, then I'd recommend that one at the present time - no G keys means no problems.

I'm hoping at the moment I'll adapt to the G keys: their potential utility is high (though you can't program them on Linux), but if I could buy a full-size variant without them tomorrow I'd do it and not bother with the adaptation.

But the keys feel great to use, so hence the conclusion: we'll see.

Conclusions Update (same day) - went back to the K740

This is probably a good gaming keyboard.

I say that because I'm sure the G keys are effective for gaming purposes. But for the way I type, which is not true touch typing, the presence of the G keys and the offset they introduce had two pronounced effects: (1) it was almost impossible for me to re-centre my typing on the keyboard when I moved my hands away, without a pronounced and noticeable process of feeling out where the top-left edge of the keyboard is.

The key-centering problem was replicable with my wife - who has much smaller hands - typing on the keyboard: she found the same subtle problem trying to line up, and inevitably ended up hitting the Caps Lock key when she did.

The second problem (2) was wrist strain: because the G keys are actual keys and live on the left-hand side of the keyboard, my natural resting position for my left hand - off to the side with my palm free - introduced a great deal of strain in my left arm specifically. The pictures below of my hands sort of show the problem - on the top is my backup K740, and on the bottom the G815:

K740 resting position / G815 resting position

This is with my hands trying to rest in a ready position on the keyboard: you can see the problem - I'm having to actively support the left hand to stop it from depressing the G keys. In my experience this put a strain through the tendon running right up my arm and was quite painful after a short amount of use. It's possible a wrist rest would help fix this problem, but I'm not wild about the prospect since, unlike the K740, it's not an included feature of the keyboard - and I also don't experience this problem using other normal-thickness keyboards. This seems to be an issue specifically with how I hold my hands to type and the existence of the extra macro keys.

Wrapping Up

None of the reviews I read or watched for this keyboard before buying it mentioned this possible issue with the full-size keyboard and G keys, though I do recall that most reviewers favored the TKL (tenkeyless) variants of the keyboard for endurance typing - which notably do not have the G keys.

Please keep in mind, if you're reading this, that it's all based on quirks of typing which may be specific to just how I hold my hands - I am not a touch typist, just a decently fast one from long practice, and most of my typing is done using two fingers on each hand. You may have a fundamentally different experience with this keyboard than I do.

But I have seen no reviews of gaming keyboards with extra macro keys in this position which commented on the possible issues they may introduce in use - it was a huge surprise when I unboxed this one, and it had a significant, very direct impact.

Easy Ephemeral Virtual Machines with libvirt

The Situation

At a previous job I was finally fed up with Docker containers: generally speaking I was always working to set up whole systems or test whole-system stuff, and Docker containers - even when suitable - don't look anything like a whole system.

While Vagrant does exist, there was always something slightly "off" about the feeling of using it - it did what you wanted, but it came with a lot of opinions.

So the question I asked myself was, what was I actually wanting to do?

What we want to do

Since this was a job-specific issue, the thing I wanted to do was boot cloud-specific environments quickly, in a way which would let me deploy the codebase as it ran in the cloud. The company had since simply moved to launching cloud VM instances on AWS for this, but ultimately that left holes in the experience - e.g. try getting access to the disk of a cloud VM. On my local machine I can just mount it directly, or dive in with wxHexEditor if I really want to; in the cloud I get to spend time wrangling security settings to get an instance into the right environment, attaching EBS volumes and...just a lot of not the current problem.

So: the problem I wanted to solve was, given a cloud-init compatible disk image, give myself a single command line which would provision and boot the machine with sensible defaults, and give me an SSH login for it that would just work.

The Solution

What I ended up pulling together to do this is called kvmboot and for me at least works pretty nicely. It has also accidentally become my repository for build recipes to get various flavors of Windows VMs kicked out in a non-annoying state as quickly as possible - the result of the job I took after the original inspiration.

The environment currently works on Ubuntu (what I'm running at home) and should work on Fedora (what I was running when I developed it - hence the SELinux workarounds in the repository).

What it is is pretty simple: launch-cloud-image is a large bash script which spits out an opinionated take on a reasonable libvirt VM definition. libvirt ships with a number of tools to accomplish things like this, but no real set of instructions for producing something as useful as I've found this customization to be - of course, that might just be me.

Usage

The basic usage I have for it today is setting up Amazon AMI provisioning scripts. Amazon provides a downloadable version of Amazon Linux 2 for KVM, and launch-cloud-image makes using it very easy:

kvmboot $ time ./launch-cloud-image --ram 2G --video amzn2-kvm-2.0.20210813.1-x86_64.xfs.gpt.qcow2 blogtest

xorriso 1.5.2 : RockRidge filesystem manipulator, libburnia project.

Drive current: -outdev '/tmp/lci.blogtest.userdata.3dQylgsKb.iso'
Media current: stdio file, overwriteable
Media status : is blank
Media summary: 0 sessions, 0 data blocks, 0 data, 51.0g free
xorriso : NOTE : -blank as_needed: no need for action detected
xorriso : WARNING : -volid text does not comply to ISO 9660 / ECMA 119 rules
xorriso : UPDATE :      12 files added in 1 seconds
Added to ISO image: directory '/'='/tmp/lci.blogtest.userdata.kq9RDblTKJ'
ISO image produced: 41 sectors
Written to medium : 192 sectors at LBA 32
Writing to '/tmp/lci.blogtest.userdata.3dQylgsKb.iso' completed successfully.

xorriso : NOTE : Re-assessing -outdev '/tmp/lci.blogtest.userdata.3dQylgsKb.iso'
xorriso : NOTE : Loading ISO image tree from LBA 0
xorriso : UPDATE :      12 nodes read in 1 seconds
Drive current: -dev '/tmp/lci.blogtest.userdata.3dQylgsKb.iso'
Media current: stdio file, overwriteable
Media status : is written , is appendable
Media summary: 1 session, 41 data blocks, 82.0k data, 51.0g free
Volume id    : 'config-2'
User Login: will
Root disk path: /home/will/.local/share/libvirt/images/lci.blogtest.root.qcow2
ISO file path: /home/will/.local/share/libvirt/images/lci.blogtest.userdata.3dQylgsKb.iso
Virtual machine created as: blogtest
blogtest.default.libvirt : will : aedeebootahnouD7Meig

real    0m16.764s
user    0m0.326s
sys 0m0.077s

16 seconds isn't bad to go from nothing to what I'd get in an EC2 VM - and since I have SSH access I can jump right into using Ansible or something else to provision that machine. Or just alias it so I can kick one up quickly to try silly things.

kvmboot $ ssh will@blogtest.default.libvirt

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
19 package(s) needed for security, out of 59 available
Run "sudo yum update" to apply all updates.
[will@blogtest ~]$ # and then you try stuff here

What's nice is that this is absolutely standard libvirt. It appears in virt-manager, and you can play around with it using all the standard virt-manager commands and management tools. It'll work with remote libvirtds if you have them, but it's a super-convenient way to use a barebones VM environment - about as easy as doing docker run -it ubuntu bash or something similar, but with way more isolation.

But it also works for Windows!

This was the real joy of this solution: when I stumbled into a bunch of Windows provisioning work, I'd never had a good solution for it. But it turns out launch-cloud-image (I should probably rename it kvmboot, like the repo) actually works really well for this use case. With the addition of an installation mode, and some support scripting to build the automatic installation disk images, it can in fact support the whole lifecycle from "Windows ISO" to "cloud-initable Windows image" to "Windows workstation with all the cruft removed".

As a result the repository itself has grown a lot of my research into how to easily get usable Windows environments, but it does work and it works great - with Windows 10 we can automate the SSH installation and have it drop you straight into PowerShell, ready for provisioning.

Conclusion

I use this script all the time. It's the fastest way I know to get VM environments up which look like the type of cloud instance machines you would be using in the public cloud, and the dnsmasq integration and naming makes them super easy to work with while being standard, boring libvirt - no magic.

Log OpenSSH public keys from failed logins

Problem

I setup an autossh dialback on a machine in the office and forgot to note down the public key.

While certainly not safe in general, how hard could it really be to grab the public key from the machine with the fixed IP that's been hitting my server every 3 seconds for the last 24 hours and give it a login? (To be clear: a login to my reverseit tool, which is only ever going to allow me to connect back to the other end if it is in fact the machine I think it is.)

Solution

This StackOverflow solution looks like what I needed, only when I implemented it the keys I got back still didn't work.

The reason is: you don't need to do it.

As of OpenSSH 8.9 in Ubuntu Jammy, debug level 2 will produce log messages that start with

debug2: userauth_pubkey: valid user will querying public key rsa-sha2-512 AAAAB3Nz....

and just give you the whole public key...almost.

The problem is that OpenSSH log messages are truncated by default - if longer than 1024 characters, to be precise - which modern RSA public keys are (an ECC key would fit).

This is controlled by a #define in log.c:

#define MSGBUFSIZ 1024

Upping this to 8192 I recompiled and...it still didn't work.

Pasting the log lines I was getting into VS Code, I found that all of them were exactly 500 characters. That sounds like a format string to me, so some more spelunking and there it is - in log.c there's the do_log function with this line:

        openlog(progname, LOG_PID, log_facility);
        syslog(pri, "%.500s", fmtbuf);
        closelog();

I'm guessing this is to work with legacy syslog implementations limited to roughly 512-byte messages. We're logging to journald, so let's just increase that to 8192 and try it out.

debug2: userauth_pubkey: valid user will querying public key rsa-sha2-512 AAAAB3NzaC1yc2EAAAADAQABAAABgQCklLxvJWTabmkVDFpOyVUhKTynHtTGfL3ngRH41sdMoiIE7j5WWcA+zvJ2ZqXzH+b5qIAMwb13H4ZkXmu6HLidlaZ0T9VBkKGjUpeHDhJ4fd1p+uw9WTRisVV+Xmw9mjbpiR8+AGXnoNwIeX5tMukglAFwEIQ8GQtM8EV4tS36RWxZjOSoT5sQlAjYsgEzQ7PHXsH3hgM7dyIK1HXrr2XcwFZPCts2EhOyh4e0hyUsvm9Nix2Y7qlqhFA+nH4buuSNpJZ2LjNb9CmWo5bjiYvrRLnU0qJMuPXp0jJeV+LwGA+W/JMbsep9xoqSA6aEQvlRUQx5jRyaJZf9GKqGBNe+v55vEbaTb+PXBU4o7nVFGCygZj2fLrW475o7vZBXJJjdgW/rZ1Eh4G2/Aukz3kfrMiJynRQOc5sFHL1ogZhHEVDqViZVLAHA2aoMCYtrsBJ9BBr/r73bzs9HbsND1wqi5ejYSiODZwX0DGmWZD21OPAj/SDMPUap6Nt/tG7oqs0= [preauth]

Oh wow - there's a lot there! In fact there's the [preauth] tag at the end, which is completely cut off normally.

Full Patch

diff --git a/log.c b/log.c
index bdc4b6515..09474e23a 100644
--- a/log.c
+++ b/log.c
@@ -325,7 +325,7 @@ log_redirect_stderr_to(const char *logfile)
    log_stderr_fd = fd;
 }

-#define MSGBUFSIZ 1024
+#define MSGBUFSIZ 8192

 void
 set_log_handler(log_handler_fn *handler, void *ctx)
@@ -417,7 +417,7 @@ do_log(LogLevel level, int force, const char *suffix, const char *fmt,
        closelog_r(&sdata);
 #else
        openlog(progname, LOG_PID, log_facility);
-       syslog(pri, "%.500s", fmtbuf);
+       syslog(pri, "%.8192s", fmtbuf);
        closelog();
 #endif
    }
--

Use git apply in the working tree of the OpenSSH source, which I recommend checking out with dgit.

Conclusions

OpenSSH does log offered public keys, at DEBUG2 level. But on any standard Ubuntu install, you will not get enough text to see them.

The giveaway that these log lines, at least, are being truncated is whether you can see [preauth] after them. This behavior is kind of silly (and should be configurable) - ideally we would at least get a ... or <truncated> marker when it happens, because with variable-length fields like public keys it is not obvious.

Jipi and the Paranoid Chip

This is a short story by Neal Stephenson which used to be hosted online here. It's outlined more in the Wikipedia article here, and I've been wanting to read it again due to the recent furor surrounding Google's LaMDA (Is Google’s LaMDA conscious? A philosopher’s view).

But alas! The original hosting returns a 404 now: fortunately the Google cached version is still available and I've downloaded that and made it part of my private collection.

So: to ensure this stays up I'm also including the cached copy as a part of this blog. It goes without saying that all rights to this story belong to the original author.

Click here to read Jipi and the Paranoid Chip (or any of the links above).

Install Firefox as a deb on Ubuntu 22.04

Introduction

Ubuntu 22.04 removes the native Firefox deb package in favor of a snap package. I'm sure this has advantages.

But the reality for me was severalfold: startup times were noticeably slower, and the Selenium geckodriver just plain didn't work for me (issue here), with some debate online but no canonical solution. I also couldn't get JupyterLab to auto-launch the browser (minor, but annoying).

Solution below reproduced from https://balintreczey.hu/blog/firefox-on-ubuntu-22-04-from-deb-not-from-snap/ with adaptations which worked for me.

Solution

You can still install Firefox as a native deb from the Mozilla team PPA. The process which worked for me was:

Step 1

Add the (Ubuntu) Mozilla team PPA to your list of software sources by running the following command in the same Terminal window:

sudo add-apt-repository ppa:mozillateam/ppa

Step 2

Pin the Firefox package

echo '
Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001
' | sudo tee /etc/apt/preferences.d/mozilla-firefox

Step 3

Ensure upgrades will work automatically

echo 'Unattended-Upgrade::Allowed-Origins:: "LP-PPA-mozillateam:${distro_codename}";' | sudo tee /etc/apt/apt.conf.d/51unattended-upgrades-firefox

Step 4

Install Firefox (this will warn of a downgrade - ignore it)

sudo apt install firefox

Step 5

Remove the Firefox snap

sudo snap remove firefox

Conclusion

This worked for me - Firefox starts, my existing Selenium scripts work.

Running npm install (and other weird scripts) safely

Situation

You do this:

$ git clone https://some.site/git/some.repo.git
$ cd some.repo
$ npm install

Pretty common right? What can go wrong?

What about this:

curl -L https://our-new-thing.xyz/install | bash

This looks a little unsafe. Who would recommend it? Well it's still one of the ways to install pip in unfamiliar environments. Or Rust.

Now, installing from these places is safe - why? Because they're trusted; there's a huge reputational defense going on. But the reality is that for a lot of tools - npm being a big offender, pip too - while sudo and user permissions will protect your system from going down, your data - $HOME and the like, basically all the important things on your system - is exposed.

This is key: you are always running as "superuser" of your own data - in fact of your entire operating environment: systemctl --user provides a very useful and complete way to schedule tasks and persistent daemons for your whole user session. There's a lot of power and persistence there.

Problem

There are two competing demands here: it's pretty easy to build isolated environments when you feel like you're under attack, but it takes time - time you don't really want to commit to the problem. It's inconvenient - and convenience is basically the currency we trade in when it comes to security.

But the convenience<->security exchange rate is not fixed. It has a floor price, but if we can build more convenient tools, then we can protect ourselves against some threats for almost no cost.

Goals

What we want to do is find a safe way to do something like npm install and not be damaged by anything which might get run by it. For our purposes, damage is data destruction or corruption beyond a sensible scope.

We also want this to be lightweight: this should be a momentary "that looks unsafe" sort of intervention, not "let me plan out my secure dev environment".

Enter Bubblewrap

bubblewrap is intended to be an unprivileged container sandboxing tool, with the specific goal of eliminating container-escape CVEs. It's also available straight from the Ubuntu repositories, which makes things a lot easier.

This is a fairly low level tool, so let's just cut to the wrapper script usage:

#!/bin/bash
# Wrap an executable in a container and limit writes to the current directory only.
# This system does not attempt to limit access to system files, but it does limit writes.

# See: https://stackoverflow.com/questions/59895/how-to-get-the-source-directory-of-a-bash-script-from-within-the-script-itself
# Note: you can't refactor this out: its at the top of every script so the scripts can find their includes.
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR="$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )"
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE" # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
SCRIPT_DIR="$( cd -P "$( dirname "$SOURCE" )" >/dev/null 2>&1 && pwd )"

function log() {
  echo "$*" 1>&2
}

function fatal() {
  echo "$*" 1>&2
  exit 1
}

start_dir="$(pwd)"

bwrap="$(command -v bwrap)"
if [ ! -x "$bwrap" ]; then
    fatal "bubblewrap is not installed. Try running: apt install bubblewrap"
fi

export PS_TAG="$(tput setaf 14)[safe]$(tput sgr0) "

exec "$bwrap" \
    --die-with-parent \
    --tmpfs / \
    --dev /dev \
    --proc /proc \
    --tmpfs /run \
    --mqueue /dev/mqueue \
    --dir /tmp \
    --unshare-all \
    --share-net \
    --ro-bind /bin /bin \
    --ro-bind /etc /etc \
    --ro-bind /run/resolvconf/resolv.conf /run/resolvconf/resolv.conf \
    --ro-bind /lib /lib \
    --ro-bind /lib32 /lib32 \
    --ro-bind /libx32 /libx32 \
    --ro-bind /lib64 /lib64 \
    --ro-bind /opt /opt \
    --ro-bind /sbin /sbin \
    --ro-bind /srv /srv \
    --ro-bind /sys /sys \
    --ro-bind /usr /usr \
    --ro-bind /var /var \
    --ro-bind /home /home \
    --bind "${HOME}/.npm" "${HOME}/.npm" \
    --bind "${HOME}/.cache" "${HOME}/.cache" \
    --bind "${start_dir}" "${start_dir}" \
    -- \
    "$@"

In addition to this script, I also have this in my .bashrc file to get nice shell prompts if I spawn a shell with it:

if [ ! -z "$PS_TAG" ]; then
  export PS1="${PS_TAG}${PS1}"
fi

The basic structure of this invocation is that the resultant container has networking and my full operating environment in it...just no write access to any files beyond the current working directory (plus ~/.npm and ~/.cache).

This is a handy safety feature for reasons beyond a malicious NPM package - I've known more than one colleague to wipe out their home directory writing make clean directives.

Usage

Usage could not be simpler. With the script in my PATH under the name saferun, I can isolate any command or script I'm about to run to only be able to write to the current directory with: saferun ./some-shady-command

I can also launch a protected session with saferun bash which gives me a prompt like:

[safe] $

This is about as low overhead as I can imagine for providing basic filesystem protection.

Conclusions

This is not bullet-proof armor. And it certainly won't keep nosy code from poking around the rest of the filesystem. Are you 100% confident you never saved an important password to some file? I'm not. But I do normally work with a lot of auxiliary commands and functions around my home directory, and I like them being mostly available when doing risky things. This strikes a good balance - at the very least it limits the scope of damage that running some random script you downloaded can cause.

I recommend checking out bubblewrap's full set of features to figure out what it can really do, but for something I knocked up after a few hours of reading, it's a handy addition to my toolkit.

Reconditioning the Gen 2 Prius HV battery

The Problem

So I've had a Generation 2 Toyota Prius since 2004 - coming up on 17 years old now here in Australia - and recently I finally had the dreaded P0A80 fault code get thrown. This is a general hybrid traction battery error.

Since the battery is relatively expensive compared to the value of the car and I don't like spending money anyway, the question becomes what can we do about this?

DIY Reconditioning

Fortunately, the car is old enough that this problem has happened before. Over at https://priuschat.com and elsewhere on the web, people have disassembled the Prius traction battery and fixed this problem themselves.

There are basically 2 issues at play: general NiMH degradation, and polarity reversal - cell failure.

Cell Failure

In general, the P0A80 code (at least in my experience) happens when a battery module suddenly drops its voltage by over 1V.

This happens due to a phenomenon in NiMH cells called "polarity reversal" - characterized by a discharge curve like this one:

image.png
Source

It is what it sounds like: under extreme discharge conditions, the NiMH cell will go to 0V, and if left in this state for too long (or in a battery pack where current continues to be pulled through the cell) it will then enter polarity reversal - positive becomes negative, negative becomes positive. This is disastrous in a normal application, and devastating in a battery pack, as regular charging now drives the reversed cell to keep soaking up current and producing heat.

At this point, the cell is dead. In a Prius battery module of 6 cells, a reduction in voltage of about 1V means you know you've had a cell drop into reverse polarity, and it's not coming back.

NiMH battery cells primer

It's important to understand NiMH cells to understand why "battery reconditioning" is possible and advisable.

image.png
Source

Standard NiMH battery chemistry has a nominal voltage of 1.2V. This has little bearing on the real voltages you see with the cells - a fully charged cell goes up to 1.5V, considered to be the absolute top (you're evolving hydrogen at that point) - and a single, standalone cell can be taken all the way to 0V (this is not safe - miss the mark and you wind up in polarity reversal).

In a battery pack of NiMH cells, these lower limits are raised for safety: pack cells all have slightly different capacities, and unless every cell hits 0V at the exact same time, the ones that empty first will get driven into polarity reversal by the cells still holding charge. At roughly 0.8V per cell you start running into a cliff of voltage decay anyway, so that's generally the stopping point.

The graph below is an excellent primer on the voltage behavior of NiMH at different states of charge. Note that the nominal voltage is measured right before the cell is practically empty; for most of the state-of-charge range the voltage is very constant - an almost flat plateau - only rising sharply as the cell approaches full.

image.png
Source
$$\require{mhchem}$$

Degradation Mechanisms

The above explains the behavior of NiMH cells, but not why we can recondition them in a vehicle like the Prius. To understand this, we need to understand the common NiMH battery degradation mechanisms.

NiMH chemistry is based on the following 2 chemical reactions:

Anode: $\ce{H2O + M + e^- <=> OH^- + MH}$

Cathode: $\ce{Ni(OH)2 + OH^- <=> NiO(OH) + H2O + e^-}$

Note the M: this is an intermetallic compound rather than any specific metal, and it is essentially where a lot of the R&D in NiMH batteries goes.
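Adding the two half-reactions together gives the overall cell reaction - essentially just hydrogen shuttling between the water/alloy side and the nickel electrode (left to right on charge, right to left on discharge):

$\ce{Ni(OH)2 + M <=> NiO(OH) + MH}$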

Our target of recovery is the cathodic reaction involving the nickel. In normal operation the Prius runs the NiMH batteries between 20-80% of their rated capacity. This is, in general, the correct answer - deep discharging batteries causes degradation of the electrode materials, which is a permanent killer (though over the order of 500-1000 cycles).

Crystal Formation

The problem enters with an issue known as "crystal formation" when the batteries are operated in this way over an extended period. Search around and you'll see this referenced a lot, without much explanation, and mostly in the context of Nickel-Cadmium (NiCd) batteries.

NiMH cells were meant to eliminate - and were a huge improvement on - most of the "memory effect" degradation mechanisms of NiCd batteries. However, some of the fundamental mechanisms still apply, as both chemistries are based on the same basic active materials on the positive electrode - the nickel hydroxide and nickel oxide hydroxide.

There are many, many mechanisms of permanent and transient change in NiMH batteries, but there are 2 identified which can be treated by the deep charge-discharge cycle recommended for reconditioning.

One is that observed by Sato et al.: nickel oxide hydroxide has 2 primary crystal structures when used in batteries - β‐NiOOH and γ‐NiOOH.

β‐NiOOH and γ‐NiOOH are generally recognized as two interconverting crystal states of the nickel electrode of any nickel-based battery, with a (simplified) schema looking like the following:

image.png
Source

γ‐NiOOH is the bulkier crystal form, and has more resistance to hydrogen ion diffusion - this is important because the overall ability of the battery to be recharged is entirely dependent on the accessibility of the surface to $\ce{H^+}$ ions to convert it back to $\ce{Ni(OH)2}$.

What Sato et al. observe is that during shallow discharging and overcharging of NiCd cells, a voltage depression effect appears, correlated with a rise in γ‐NiOOH peaks on XRD spectra. When they fully cycled the cells, the peaks disappeared - over several full cycles the γ‐NiOOH crystals are converted back to $\ce{Ni(OH)2}$.

image.png
SEM photographs captured at 10 μm of the positive plates of (a) a good battery, (b) an aged battery, and (c) a restored battery. Note: these were NiCd's, but a similar process applies to the nickel electrode of an NiMH cell.

Source

Although the Prius works hard to avoid this sort of environment - the battery is never overcharged - that's only true in aggregate. It's a physical system, in a physical environment: ions need to move around in solution, so while in aggregate a cell is never overcharged, at a microscopic level, through random chance, an overcharge-like condition can manifest every now and again. That said - it took my car 17 years to get to this point.

There's more detail to this story - a lot more - and pulling a complete picture out of the literature is tricky. For example, the γ‐NiOOH phase isn't considered true γ‐NiOOH but rather γ'‐NiOOH - the product of nickel intercalating into γ‐NiOOH, rather than potassium ions ($\ce{K^+}$, from the potassium hydroxide used as electrolyte in the cell). It's also a product of rest time on the battery - the phase grows when the battery is resting in a partly charged state.

The punchline of all of this is the reason Prius battery reconditioning works: the Prius is exceptionally good at managing its NiMH cells, and mostly fights known memory effects while driving. However, it can't fight them all the time, and with age you wind up with capacity degradation due to crystal formation in this ~50% state-of-charge (SOC) range. And importantly: it's experimentally shown that several full charge-discharge cycles are highly effective at restoring capacity by dissolving away the unwanted phase.

Dehydration

There's a secondary degradation mechanism that's worth noting for those who have seemingly unrecoverable cells in a Prius: dehydration.

Looking again at the NiMH battery chemistry -

Anode: $\ce{H2O + M + e^- <=> OH^- + MH}$

Cathode: $\ce{Ni(OH)2 + OH^- <=> NiO(OH) + H2O + e^-}$

you can see that water - $\ce{H2O}$ - is involved but not consumed in the reactions. This is also kind of transparently obvious: you need an electrolyte for ion exchange. What is not obvious, though, is that battery charging is technically in competition with a straight electrolytic water-splitting reaction:

$\ce{2H2O <=> 2H2 + O2}$

This is a known problem, though it's largely handled by normal recombination processes in the battery (a shared gas headspace allows the H2 and O2 to recombine back into water), can be assisted by adding specific recombination chemistry, and normally just shows up as a loss term when charging the cells, producing heat.

This is a tradeoff in battery design: a sealed cell doesn't leak gas, which ensures it can eventually recombine. But a sealed cell can overpressure and rupture, at which point the cell is destroyed. The Prius cells are not sealed - a one-way overpressure blow off valve is present which vents at 80-120 psi - 550-828 kPa (this is substantial) - and the cells themselves depend on being clamped to prevent gas pressure from damaging them during charging.

But the result is the same: cells with failed seals, or cells overheated over a long duration, may have lost water through electrolysis and venting.

There are ways to fix this sort of failure - and the results are spectacular - but this is definitely a "last resort for experimentalists" sort of intervention. Typical NiMH design uses a 20-40% w/v KOH solution in water. LiOH is added to improve low temperature performance, and NaOH is substituted partially or fully for reduced corrosion in high temperature applications.

Per this link, 30% w/v KOH and 1.5 g/L LiOH is suggested. For the purposes of cell rehydration an exact match is probably not important, as a "dried out cell" will still contain all its salt components (though depending on simply redissolving them may not be the best option). A starting point for other mixes might be this paper, which concludes a 6M KOH solution is optimal.

The big results for anyone considering this are reported by this PriusChat member here - where he notes he used 20% KOH. Of note: getting deionized water, and a salt suitably free of metal contamination, is probably key to success here (as well as sealing up the cells properly - the trickiest part by all accounts). That said - various metal dopants are used in NiMH cells to contribute all sorts of properties, so this may be a small effect. It is worth worrying about polymeric impurities in salts - you can eliminate these by "roasting" the salt to turn them into carbon ash.

It is noted in the literature that 6-8M KOH is the sweet spot for discharge capacity - however the use of a 1M solution for total cycle life has also been noted here.

One key parameter for anyone considering this is a rule-of-thumb figure for electrolyte volume of 1.5 - 2.5 mL per Ah of cell capacity. For the ~6.5 Ah Prius cells this corresponds to 9.75 - 16.25 mL per cell, or 58.5 - 97.5 mL per module (each module has 6 cells).
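As a quick sanity check, the arithmetic is just the rule of thumb multiplied out (a minimal sketch - the 6.5 Ah cell capacity is the figure used later in the charging section):

# Electrolyte volume rule of thumb: 1.5-2.5 mL per Ah of cell capacity.
CELL_CAPACITY_AH = 6.5   # approximate Prius cell capacity, per the charging section
CELLS_PER_MODULE = 6

for ml_per_ah in (1.5, 2.5):
    per_cell = ml_per_ah * CELL_CAPACITY_AH
    per_module = per_cell * CELLS_PER_MODULE
    print(f"{ml_per_ah} mL/Ah -> {per_cell:.2f} mL per cell, {per_module:.1f} mL per module")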

Doing the Work

You'll need to pull the battery out of your car and dismantle it to do this. This can be done quickly once you know what you're doing, but follow a YouTube tutorial and take a lot of photos while you do it. Also read the following section and understand what we're dealing with.

Safety

This is the part of the story where we include the big "high voltage can kill" warning, but let me add some explanatory detail here: the Prius HV battery is 201.6V nominal - in Australia this is lower than the voltage you use at an electrical outlet every day. But it is a battery - it has no shutoff, and it's DC power (so being shocked will trigger muscle contraction that will prevent you letting go).

Before you do anything to get the battery out of the car, make sure you pull the high voltage service plug, and then take a multimeter and always verify anything you're about to touch is showing 0V between the battery and car chassis.

Now the tempering factor to this is that, handled properly, this battery is quite safe to work with once disassembled. High voltage is only present between the end terminals when the bus bars are connected - broken down into individual modules, the highest voltage you'll encounter is about 9V from a single NiMH module.
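To put the numbers in one place, here's a quick sketch using figures quoted elsewhere in this post (28 modules of 6 cells each, 1.2V nominal per cell):

# Rough pack arithmetic for the Gen 2 HV battery.
CELL_NOMINAL_V = 1.2
CELLS_PER_MODULE = 6
MODULES = 28

module_nominal = CELL_NOMINAL_V * CELLS_PER_MODULE  # 7.2 V per module
pack_nominal = module_nominal * MODULES             # 201.6 V across the assembled pack
print(f"{module_nominal:g} V per module, {pack_nominal:g} V across the full pack")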

Specific Advice

What does the High Voltage disconnector do?

The big orange plug you pull out of the battery does two things. First, it breaks the circuit between positive and negative inside the battery, which takes the voltage at the battery's output terminals in the car to 0V - this makes the battery safe to handle with the cover on.

It does this specifically by sitting between the 2 battery modules in block 10, and breaking the connection there. Because the battery output is wired from the last module positive, to the first module negative, this breaks the circuit.

There's a secondary benefit to this once the battery is open: breaking the battery wire here limits the total possible voltage inside the battery to ~130V (from block 1 to block 10). This is still a lethal voltage though.

image.png
My Gen 2/2004 Prius HV battery on the bench. Note modules 19-20 don't have a busbar, and are instead disconnected when the service plug is pulled.

The service plug breaks the connection where it does because it has the most benefit for knocking the pack voltages down to more reasonable values. It also disconnects the high voltage at the main battery terminals when this bus bar is broken, because the path between ground and high voltage is broken on the other side of the battery (the terminals connect on the side opposite us).

Getting it down to a safe voltage

Until the bus bars are removed completely, the voltage the battery presents remains potentially high. The bus bars make direct contact to the batteries even without the fastening nuts, so until you have removed one of the orange busbars completely, the battery may still have lethal voltages present.

This is the point to stress the issue: unlike your house, there are no circuit breakers or GFCI interrupters here. If you get shocked, it will just keep shocking you and nothing will stop it. Wear lineman's gloves, don't touch metal unless you're 100% sure it's safe, and when in doubt use a multimeter to check anything you're about to bridge to ground with your tools.

The way to bring the battery down to a safe state is to start at the side of the battery with the service disconnector and remove the bus bars on that side (see image above). They're not connected to wires or other areas, so they come off more easily.

Wearing insulated gloves, insulated boots, and using an insulated tool (I used a battery powered drill driver with a hex bit on a low torque setting), knock off all the bus bar nuts working right to left - then pull the orange busbar carrier off in one go (again moving right to left - and you are wearing insulated gloves, right?)

Important: So long as you only touch a bus bar terminal inside an orange plastic carrier, there is no high voltage. The greater the distance between any two bus bars, though, the more voltage there is between them from the modules stacking up in series.

After you've removed the bus bars on this side, the battery is now safe - the maximum voltage present is 7.2V.

Important: the battery is still live: you will see sparks (as I did) if you accidentally bridge any two adjacent module terminals, since you'll be short-circuiting a 14.4V block. Bus bar nuts are just wide enough to do this.

Be aware that when you re-assemble the battery and pass back through this step in reverse, it very rapidly goes from "safe" to "potentially lethal".

Safety End Notes

It's really hard to stress how weird the "safe"/"unsafe" combination of this task actually is. Once the circuits are isolated, this is a safe task. Handled properly, the HV can be quite safely disconnected - many have done it just fine. But ground yourself against that unfused, constantly live 200+ V source, and you are in extreme danger of dying very suddenly. Batteries are not your house electrical: you could probably get away with grabbing live and neutral in a modern house, because the GFCI would save you. None of that exists in the battery.

Cycling the Batteries

Despite the complexity of the actual processes, what needs to be done to the batteries is pretty simple. For this role I used four (4) Tenergy T-180 hobby chargers. These are "okay" in the sense that I was successful with them - the software implementation leaves a lot to be desired, but the hardware seems solid. They worked for me, and they can discharge (the slowest part) faster than other models on the market. There are 28 modules to do, so optimizing time is important.

Charging Settings

As noted the T-180 software isn't great, so your charging settings need to be done in a specific order.

First: go to NiMH mode, then CHG, and scroll down to delta V: you want to set this value to 25 mV. Notionally this value is in millivolts per cell, but I could find no way to set the number of cells, and it didn't seem to be inferred - setting it to a lower value resulted in the charger not properly cycling the cells. Setting the value here also changes the value used in CYCLE mode.

Note: the capacity setting does nothing but try to set the rate of charge. For NiMH the rate of charge is considered to be 1C for a full charge, so about 6.5 amps for the Prius cells, though it has been reported 5 amps might be the better option. I set it to 7250 mAh, then reduced the rate of charge to 6.5A, just in case.

Under CYCLE, you'll want to set the discharge rate to maximum (5 amps) and the minimum discharge voltage to 6V, which takes the modules down to 1V per cell - empty. I didn't go lower: if you reverse polarize a cell, that module is dead and not coming back.

A setting I found important after I started was the wait time between cycles. I got much better cycle-over-cycle capacity gains setting this to 30 minutes rather than 5 or 10 minutes - in particular, since you are charging to full, the modules do get hot. There's also the benefit of giving the cell chemistry more time to equilibrate.

Note: When using the CYCLE option, always press START from the second page of settings - otherwise you will not be able to review and record individual cycle data.

Just the Settings
Charge Rate: 6.5A
Discharge Rate: 5A
Discharge End Volts: 6.0V
Delta V Peak: 25mV
Standby Time: 30 mins

Charging Setup

IMPORTANT: Never charge cells without compressing them. Prius batteries are prismatic and must have mechanical counter-pressure to prevent swelling. If they swell, the plates will short and warp, and the module will be destroyed. The total side-wall loading (across the whole area) is about 1300 kg, but basically the module walls must not be able to flex whatever they're in contact with. Keep them in the battery carrier and tightened down, or you are very likely to ruin them.

The hardware configuration of a charging setup is important. The T-180s come with many connectors, but not simple ring terminals, which is what I ended up using via some crimp adapters. This setup was not ideal: what is under-specified in other literature on the topic is that 6.5A at 9V is quite a lot of current to be throwing through a wire. I used a setup made up of 1.5mm$^2$ house electrical cable (what I had on hand) - this is rated for about 10A at 240VAC, but what this rating misses is that at 240VAC you don't really care about 0.3V of voltage drop.

At a much lower voltage, you really do.

Throughout the charging process the chargers would take the batteries up to over 9V from the perspective of the charger, but the battery terminals themselves would only be seeing around 8.8-9V at the end of the process. That's within spec for NiMH, but it means the voltages reported by the chargers can't be trusted.

This also in practice means my 6V discharge setting was probably closer to 6.3V-6.2V which is much gentler but also going to be less effective at restoring the batteries.

If I was doing this again, I would use a very heavy gauge wire (4mm$^2$ or so) and keep the run to 50cm rather than the meter or so I used - and use the power supply mode options on the chargers to check the voltage drop. I suspect people who have reported very deeply discharging cells may have had their results improved by this effect too - though again, NiMH is unique in that individual cells can safely go to 0V and will come back, but in a series pack you just can't get them all there safely (and if they reverse polarize, you're done).
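For a rough sense of why wire gauge and run length matter, here's a sketch of the copper-only voltage drop (this ignores crimps and connectors, which clearly added more in my case - the wire sizes and lengths are the ones discussed above):

# Copper-only voltage drop estimate for the charging leads.
RHO_CU = 1.68e-8  # resistivity of copper, ohm-metres
CURRENT = 6.5     # amps - the charge rate used above

def drop(area_mm2, run_m):
    # Round trip is twice the one-way run; area converted from mm^2 to m^2.
    resistance = RHO_CU * (2 * run_m) / (area_mm2 * 1e-6)
    return CURRENT * resistance

print(f"1.5 mm^2 over 1 m:   {drop(1.5, 1.0) * 1000:.0f} mV")  # roughly 0.15 V
print(f"4.0 mm^2 over 0.5 m: {drop(4.0, 0.5) * 1000:.0f} mV")  # roughly 0.03 V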

IMG20211127124810.jpg
My charging setup with four Tenergy T-180 chargers. They're angled so their fans don't blow on the adjacent unit.

Two other important things to note:

  1. Never charge modules side-by-side. I interleaved modules as I was charging to allow the module being cycled to use the ones next to it as a heatsink. The one time I charged 3 in a row, they definitely got way too hot.

  2. You really need to consider having multiple chargers for this. I had four and it took about 2 weeks, though that's including learning time. Generally speaking, once you have your baseline, underperforming batteries benefit from more cycles - I ran some modules up to 15 times and they recovered enormously from a position of looking like underperformers, but 5 cycles takes 14+ hours - and there's 28 modules total. If you're going to do this, get multiple chargers - at least 4 - and discharge capacity matters, because you spend most of your time discharging. Charging takes about 80-90 minutes at the rates I used, whereas discharge takes 5 hours at about 17W average. This is why I went with the T-180s.

Number of Cycles and How to Cycle

For most of the work I used Discharge -> Charge cycles. For your first cycle, though, you want to start out by charging the module with the CHG mode. There's a reason for this: if you have a weak cell or an out-of-balance cell in your module, the only way to bring it up is to top-balance it by slightly overcharging the module, which happens when you do a full charge (the Prius doesn't, because overall this reduces cell longevity). If you just use Discharge -> Charge first, there's a risk that weak cell might get pulled into reverse polarity and just die on you - when it otherwise might have been saved. So do 1 cycle of CHG first.

I did not do this - but I should have. It worked out okay, but the difference between a weak cell and a reverse polarity cell is big. One might restore - the other, regardless of how good it was, is gone, along with its module.

The other benefit is it makes your data collection easier - you can assume your cells start out with full charge on them for the initial discharge.

Discharge -> Charge cycles have the benefit of ending the module charged, so if you're done with it the last step is to discharge it down to 7.5V to be ready for balancing and reinstallation in the Prius (remember the Prius ECU expects the cells to be at about this voltage, and this was roughly where they were when I removed them).

Due to the voltage drop issue, I ended up reinstalling the cells at 7.8V - this worked out fine, as when I pulled them out initially they were charged to about 7.85V from the car.

Step-by-Step

What you want to do is get a sense of your battery pack's status quickly:

To start out: number your cells. You'll be tracking them in a spreadsheet by these numbers. You also have two other fields: Cycles - for individual cycles of each run, and Campaign - the word I used for each time I started a charger cycling a cell. Together I used these to keep the cells in order. The other variables you'll be tracking come from the charger - Charge Vp (voltage peak), Charge mAh, Discharge Va (voltage average), Discharge mAh.

Note: while I say number your cells, I actually gave them letters. Since blocks are numbered, I wanted to keep modules clearly distinguishable. The scheme goes from A-Z plus alpha and beta. Replacement cells are referred to as the letter plus a number - i.e. K1 is K's replacement.

I used LibreOffice Calc and saved my data as tab-separated values (TSV) for ease of use with numpy and matplotlib for graphing and data analysis - which I did in Jupyter Lab (VSCode's Jupyter plugin was excellent for this).
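For what it's worth, the analysis side is only a few lines. A minimal sketch along the lines of what I did - the file name and exact column names here are illustrative, yours will be whatever you put in your spreadsheet:

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative columns: Cell, Campaign, Cycle, Charge Vp, Charge mAh,
# Discharge Va, Discharge mAh - match these to your own spreadsheet headers.
data = pd.read_csv("battery-cycles.tsv", sep="\t")

# Plot discharge capacity against cycle number, one line per module.
fig, ax = plt.subplots(figsize=(10, 6))
for cell, group in data.groupby("Cell"):
    group = group.sort_values("Cycle")
    ax.plot(group["Cycle"], group["Discharge mAh"], marker="o", label=cell)

ax.set_xlabel("Cycle")
ax.set_ylabel("Discharge capacity (mAh)")
ax.legend(ncol=4, fontsize="small")
plt.show()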

image.png
Sample of my data loaded with pandas. You can omit "type" if you're not discharging partially charged cells, which in retrospect I don't recommend. "order" is just my note that it was a discharge/charge cycle.

To start out, get a voltmeter and measure the voltage of all your cells. They should all agree with each other, and any that are about 1.5V out are probably dead from a reverse polarity cell. I had no luck recovering these - so you can save yourself some time by just replacing them right away.

image.png
The initial sweep of cell voltages. K is obviously in polarity reversal, and triggered the initial warning light in the Prius.

Experience of Capacity Change While Cycling

When I started out I spent a bunch of time worrying I was killing the cells I was working with, because the chargers aren't limited in their maximum current delivery. This is a real concern, but I was reacting to observations of discharge behavior - and I did not have the delta V cutoff set to the 25mV I ended up using. In addition, the voltage drop between charger and battery tended to make the finishing voltages look higher than they were - in my experience the T-180s' delta-V detection seemed to work just fine.

cell-cycle-tracking.png
Final plots of the data for all cells. Charge/discharge cycles are corrected so each discharge data point corresponds to the charge data point that supplied its current. There were potentially days between subsequent cycles, so self-discharge is probably present.

The cells in my pack are overall in very good shape: I suspect I could've got most of them up to 6000+ mAh, but ultimately I wanted my car drivable again and was content with the results. The big news in this data is that initially under-performing cells showed massive improvements on being repeatedly cycled, and this is consistent with other PriusChat members' experience.

From this data I am content to say that cycling the modules definitely improved capacity: while the settings are not totally consistent across all cycles, the ones which ended up best started out looking like serious under-performers of the pack. I am actually now 4 modules in surplus: some were looking so poor I thought it would be better just to replace them and save the trouble - instead they improved before my replacements even arrived, and I kept them in.

One crucial pattern which stands out is the stair-step discharge performance on the cells which were improving: I saw this a lot, and it was the source of my initial concerns that I was just damaging them. I have no explanation for this from the scientific literature, although my guess would be that since there is an electrochemical potential associated with the gamma -> alpha/beta nickel oxy-hydroxide conversion, once the crystals get small enough they become much more reactive for that cycle of charge.

Final Data

The figure below was plotted by simply slicing the last set of data I got from the modules, and was used as a benchmark (the red line) for which modules to target for additional cycles.

final-battery-data.png
Final cycle plots. Note K vs K1 which was the replaced module.

Finishing up and Reinstalling

Balancing

Once I had my cells at a fairly consistent capacity I was happy with, it was time to finish up. This involves balancing the cells - specifically, we want to take them all to a similar voltage so the Prius can manage them. The recommendation is to take them to 7.5V - with NiMH this is a very weird process though.

From the outline of NiMH chemistry, it can be seen that the voltage of the cells is "stiff" - they don't vary much at all for a very long flat part of their discharge curve, which in turn means assessing state of charge from voltage is tricky.

One suggestion I did see was to simply take them to 7.5V + a known quantity of mAh to get them to a consistent state of charge. The Tenergy T-180s can't do this - they count charge, but can't trigger actions off it. The only control system you have is discharge end voltage.

My solution was to start with my fully charged cells and bring them down - I was aiming for 7.5V (this was the setting on the charger) but wound up with 7.88V or so - close enough to my Prius's charge state when I pulled the battery that I figured it should be okay and the ECU would figure it out.

The other thing I tried to do was balance the cells - this is where you connect them all together positive to positive, negative to negative, in order to let them equilibrate at a common voltage. This process is questionable with NiMH for equalizing charge, due to the fairly uniform voltage across multiple possible states of charge.

I made one big mistake here which I realized and undid before anything bad happened: don't do this -

IMG20211214164438.jpg
Don't do this to balance cells. It's a bomb.

You can see the problem: I've wired almost 1.3kWh of battery capacity in parallel, but the common positive and negative buses are haphazardly held maybe 1cm apart. It was tricky to wire, and even worse to unwire, since at every point it's at risk of negative and positive touching - at which point 1.3kWh of low voltage is going to dump 100+ amps into the terminal, or the wire, or something. After I did this, I realized I didn't feel comfortable with it, and very carefully unwired it.

Don't do this. It's basically a bomb.

For the time this was connected, I set up one of my Tenergy T-180s to pull 1 amp off the common bus with a target discharge voltage of 7.5V. The idea was that the draw would give the cells a chance to equalize their voltages: the higher-voltage cells would feed the draw, and the pack as a whole would be gently brought towards a common voltage. I only ran it for about an hour before coming to my senses and unhooking it, and afterwards the pack voltage was a flat 7.8V, which basically seemed good enough.

There is a correct way to do this: when you open up the pack, flip the modules around so positive and negative terminals sit side by side. No risk of shorts, and you can easily busbar the entire pack with a bare bit of copper wire. To implement the constant draw I would crimp a fuse onto the wire and hook it up to the charger.

My conclusion here is that I don't really have enough data. I think my approach to balancing is probably sensible in the abstract: by pulling a draw on the pack overall, you can center the cells on a target voltage more quickly than just waiting for the quiescent current between them to catch up. But others would say not to even bother once you have matching voltages - which I pretty much had, after the chargers' initial discharge to a consistent value.

Nickel Plated Bus Bars

When I pulled my pack out and opened it up, this was the state of the bus bars (after 17 years):

IMG20210925180957.jpg

There are a number of Chinese sellers (and American importers) selling replacement busbars which are nickel plated for corrosion resistance. You can just clean up the copper in some vinegar, but since I was going to the effort anyway and it was only about AUD$50 to buy a set of replacements, I ordered some nickel plated bus bars.

IMG20211213144503.jpg
Bus bars as they were
IMG20211213144646.jpg
Removing the old bars
IMG20211213145049.jpg
Installing the new ones

Thoroughly optional, but probably worth it if you're going to this trouble - especially since ensuring nice clean contact areas on your battery reassembly will make things easier.

Make sure you re-install the service plug properly

I got the battery back in the car, and disconnected the 12V battery for a few minutes to reset the various ECUs - I'm led to believe this means the car drops its memory of the expected state of charge and will relearn it.

Taking the car out for a drive, firstly it was working properly - with the dead module removed it trusted the battery again, and it felt snappy and responsive. But I still had a red triangle on the dash.

Using an OBD dongle and the Dr Prius Android app, when I ran "read battery codes" I found I was getting P3140. This code is not well explained, and lots of owners have had it for various hard-to-diagnose reasons. In short, what it means is "open circuit on safety interlock".

What that means is that there's a sense line somewhere in the high voltage system which has not been closed since the car was started. What sense line? It starts in the high voltage service connector and runs all the way to the inverter, and it can be triggered by any problem with that wire loop.

If it breaks while the car is running (say if you crash) then the system will immediately disconnect the high voltage as a safety feature.

But: if it is not connected when the car starts, you just get the fault code - and it's a weird code, because most OBD scanners won't see it (FIXD didn't), but Dr Prius can.

Here's the important part: part of this circuit is in the service plug for obvious reasons, but it is only engaged when the service plug is fully installed. The service plug has 3 actions to install it: insert, lift up to engage, and then pull down to lock. The final step - down to lock - engages the pins on the sense line. When I initially put the service plug back in, I didn't do the last step - the car would start, and run, but I would get the problem light and it would not be cleared.

image.png
image.png
Service plug installation procedure

tl;dr: If you've pulled the service plug for any reason, and have a warning light that won't clear - check you have fully engaged the service plug.

Conclusions

My car is working again. I'm at least $1500 ahead after parts and chargers, probably more accounting for labor, displacement etc. and frankly this is just the sort of thing I like doing.

I've replaced two modules which went into reverse polarity on a cell in the pack. I am wondering if an earlier recondition might have saved them.

The car is only just back together, so we'll see how the situation evolves.

Switching to nikola for blogging

Why?

The key to me for blogging is to keep it simple. This at first meant static sites that rendered nicely, and this was a function fulfilled by Wintersmith.

The Problems

The problem with Wintersmith was that it was a pure markdown solution. At the end of the day, it turned out that most of what I wanted to talk about probably had some type of interactive or graphical component to it, or I just wanted to be able to add photos and images easily.

None of this is easy with pure markdown.

The Solution

The new solution for me here is Nikola, which seems to hit a new sweet spot of codeability versus low friction. Specifically, its support for rendering posts out of jupyter-lab notebooks - something whose usefulness has just kept growing for me.

Perceived Benefits

Getting the transition from "jupyter notebook" to "blog" post down to as frictionless as possible feels important. When you want to share neat stuff about code, it helps to do it right next to the neat code you just wrote.

When you want to share data analysis - well that's a jupyter specialty. And when you want to show off something you're doing in your workshop, then at the very least pulling in some images with jupyter is a bit more practical and probably a lot more likely to be low enough effort that I'll do it - this last part is the bit that's definitely up in the air here.

Workarounds

Keeping media and posts together

The first problem I ran into with Nikola is that it and I disagree on how to structure posts.

Namely: out of the box Nikola has the notion of separate /images /files and /posts (or /pages) directories for determining content.

I don't like this, and it has a practical downside: when working on a Jupyter notebook for a post, suppose it requires data, or I'm using some tool which will do drag-and-drop images for me? What I'd like is to open that notebook - and just that notebook - in Jupyter or VSCode and work on just that file, including how I reference data.

Although this issue suggests a workaround, it has a number of drawbacks. But more importantly - one of the reasons we use Nikola is that it's written in Python, and better, its main configuration file conf.py is itself runnable Python.

This gives us a much better solution: we can just generate our post finding logic at config time.

To do this we need to find this stanza:

POSTS = (
    ("posts/*.ipynb", "posts", "post.tmpl"),
    ("posts/*.rst", "posts", "post.tmpl"),
    ("posts/*.md", "posts", "post.tmpl"),
    ("posts/*.txt", "posts", "post.tmpl"),
    ("posts/*.html", "posts", "post.tmpl"),
)

This is a pretty standard stanza which configures how posts are located. Importantly: the wildcard here isn't a regular glob, and all these paths act as recursive searches, with the directory names winding up in our output paths (i.e. posts/some path with spaces/mypost.ipynb winds up as https://yoursite/posts/some path with spaces/mypost-name-from-metadata).

So what do we want to have happen?

Ideally we want something like this to work:

my-post/
|- my-post.ipynb
|- images/some-image.jpg
|- files/data_for_the_folder.tsv

and then on the output it should end up in a sensible location.

We can do this by calculating all these paths at config time in the config file, to work around the default behavior.

So for our POSTS element we use this dynamic bit of Python code:

import os  # conf.py is plain Python, so we can compute the post paths at config time

# Calculate POSTS so they must follow the convention <post-name>/<post-name>.<supported extension>
_post_exts = (
    "ipynb",
    "rst",
    "md",
    "txt",
    "html",
)
_posts = []
_root_dir = os.path.join(os.path.dirname(__file__),"posts")
for e in os.listdir(_root_dir):
    fpath = os.path.join(_root_dir,e)
    if not os.path.isdir(fpath):
        continue
    _postmatchers = [ ".".join((os.path.join("posts",e,e),ext)) for ext in _post_exts ]
    _posts.extend([ (p, "posts", "post.tmpl") for p in _postmatchers ])

POSTS = tuple(_posts)

PAGES = (
    ("pages/*.ipynb", "posts", "page.tmpl"),
    ("pages/*.rst", "pages", "page.tmpl"),
    ("pages/*.md", "pages", "page.tmpl"),
    ("pages/*.txt", "pages", "page.tmpl"),
    ("pages/*.html", "pages", "page.tmpl"),
)

Testing this - it works. It means we can keep our actual post bodies, and any supporting files, nicely organized.
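To make that concrete: for a posts/ directory containing just the my-post/ folder from the example above, the generated POSTS comes out as something like the following - the same shape as the default stanza, just one entry per post directory:

POSTS = (
    ("posts/my-post/my-post.ipynb", "posts", "post.tmpl"),
    ("posts/my-post/my-post.rst", "posts", "post.tmpl"),
    ("posts/my-post/my-post.md", "posts", "post.tmpl"),
    ("posts/my-post/my-post.txt", "posts", "post.tmpl"),
    ("posts/my-post/my-post.html", "posts", "post.tmpl"),
)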

Now there are two additional problems: images and files. We'd like to handle images specially, because Nikola will do automatic thumbnailing and resizing for us in our posts - they're handled lossily - whereas files are not touched at all in the final output.

The solution I settled on is just to move these paths to images and files directories adjacent to the posts, respectively. This means that if the Jupyter notebook I'm using references data, it's reasonably well behaved.

For files we use this config stanza:

FILES_FOLDERS = {'files': 'files'}

for e in os.listdir(_root_dir):
    fpath = os.path.join(_root_dir,e)
    if not os.path.isdir(fpath):
        continue
    FILES_FOLDERS[os.path.join(fpath,"files")] = os.path.join("posts",e,"files")

and for images we use this:

IMAGE_FOLDERS = {
    "images": "images",
}

for e in os.listdir(_root_dir):
    fpath = os.path.join(_root_dir,e)
    if not os.path.isdir(fpath):
        continue
    IMAGE_FOLDERS[os.path.join(fpath,"images")] = os.path.join("posts",e,"images")

Setting up the publish workflow

Nikola comes out of the box with a publishing workflow for Github pages, which is where I host this blog.

Since I've switched over to running decentralized with my git repos stored in syncthing, I wanted to ensure I only pushed the content of this blog and kept the regular repo on my local systems since it leads to an easier drafting experience.

I configure the github publish workflow like so in conf.py:

GITHUB_SOURCE_BRANCH = "src"
GITHUB_DEPLOY_BRANCH = "master"

# The name of the remote where you wish to push to, using github_deploy.
GITHUB_REMOTE_NAME = "publish"

# Whether or not github_deploy should commit to the source branch automatically
# before deploying.
GITHUB_COMMIT_SOURCE = False

and then add my Github repo as the remote named publish:

git remote add publish https://github.com/wrouesnel/wrouesnel.github.io.git

and then synchronize my old blog so nikola can take it over:

git fetch publish
git checkout master
find . -mindepth 1 -not -path './.git*' -delete
git add -A
git commit -m "Transition to Nikola"
git checkout main

and then finally just do the deploy:

nikola github_deploy

Next Steps

This isn't perfect, but it's a static site and it looks okay and that's good enough.

I've got a few things I want to fix:

  • presentation of jupyter notebooks - I'd like it to look seamless to "writing things in Markdown"
  • a tag line under the blog title - the old site had it, the new one should have it.
  • using nikola new_post with this system probably doesn't put the new file anywhere sensible - it would be nice if it did
  • figure out how I want to use galleries