Reinforcement learning for plants | Steven Tang

Reinforcement learning for plants

January 25, 2025

This blog post describes our attempt to use reinforcement learning to grow plants by learning a lighting schedule from data.

Markov Decision Process (MDP)

We first frame the problem as a Markov Decision Process (MDP).

Reward function

Difference in log area

$$R_{t+1} = \ln(\text{Area}_{t+1}) - \ln(\text{Area}_{t})$$

The return is invariant to the initial area: it depends only on the ratio of the final area to the initial area.
The rewards are also invariant to the size of the plant: the same proportional growth earns the same reward whether the plant is small or large. This is desirable because, when training the neural network, the error should not be dominated by the largest plants. The return is the log-ratio of the final and initial area, so the exponential of the return is the ratio of the final and initial area. Dividing the return by $\ln 2$ gives the number of doublings of the plant area. In practice we use $\ln(x+1)$ instead of $\ln(x)$ to avoid issues with zero area.

$$G = \ln(\text{Area}_{T}) - \ln(\text{Area}_{0}) = \sum_{t=0}^{T-1} R_{t+1}$$
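To make the telescoping sum concrete, here is a minimal Python sketch. The area values are made up for illustration and the function name `log_area_rewards` is hypothetical, not taken from our codebase. It uses the $\ln(x+1)$ variant, so the return equals the difference of the first and last log-areas.

```python
import math

def log_area_rewards(areas):
    """Rewards R_{t+1} = ln(area_{t+1} + 1) - ln(area_t + 1) for a sequence of areas."""
    logs = [math.log1p(a) for a in areas]  # log1p(x) computes ln(x + 1)
    return [nxt - cur for cur, nxt in zip(logs, logs[1:])]

# Hypothetical areas in mm^2 (illustrative only, not measured data).
areas = [100.0, 110.0, 150.0, 400.0]

rewards = log_area_rewards(areas)
episode_return = sum(rewards)

# The sum telescopes: only the first and last areas matter.
telescoped = math.log1p(areas[-1]) - math.log1p(areas[0])

# With natural logs, dividing by ln 2 converts the return to a number of doublings.
doublings = episode_return / math.log(2)
```

Because the intermediate terms cancel, any two area sequences with the same first and last values get the same return, whatever happens in between.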

Alternative Reward functions

We also considered several alternative reward functions.

Difference in area

$$\text{Area}_{t+1} - \text{Area}_{t}$$

The return is not invariant to the initial area. The rewards scale with the size of the plant, so when the plant grows roughly exponentially the rewards grow exponentially over time.
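As a small illustration of the scale dependence, the sketch below compares two made-up plants that both double at every step but start at different sizes. The function name and the numbers are assumptions for illustration only.

```python
def area_difference_return(areas):
    """Sum of the rewards Area_{t+1} - Area_t over a sequence of areas."""
    return sum(nxt - cur for cur, nxt in zip(areas, areas[1:]))

# Two hypothetical plants with identical growth factors but different initial areas (mm^2).
small = [10.0, 20.0, 40.0]
large = [100.0, 200.0, 400.0]

small_return = area_difference_return(small)
large_return = area_difference_return(large)

print(small_return)  # 30.0
print(large_return)  # 300.0
```

Both plants quadrupled, yet the larger one collects ten times the return, so the larger plant would dominate the training signal.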

Percentage change in area

$$\frac{\text{Area}_{t+1}}{\text{Area}_{t}} - 1$$

While this reward function induces a return that is invariant to the initial area, it can favour volatility in area sequences.

For example, a plant with an area of $100 \text{ mm}^2$ that grows to $110 \text{ mm}^2$ and then shrinks back to $100 \text{ mm}^2$ gets a return of $G = \frac{110 - 100}{100} + \frac{100 - 110}{110} = 0.1 + (-0.091) = 0.009$. The plant did not grow, so the return should be zero.
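The same example can be checked numerically. This is a minimal sketch with hypothetical function names; it compares the percentage-change return with the log-area return (without the $+1$ offset, for simplicity) on the 100, 110, 100 sequence.

```python
import math

def percentage_return(areas):
    """Sum of the rewards Area_{t+1} / Area_t - 1."""
    return sum(nxt / cur - 1 for cur, nxt in zip(areas, areas[1:]))

def log_return(areas):
    """Sum of the rewards ln(Area_{t+1}) - ln(Area_t)."""
    return sum(math.log(nxt) - math.log(cur) for cur, nxt in zip(areas, areas[1:]))

areas = [100.0, 110.0, 100.0]  # grows, then shrinks back (mm^2)

pct = percentage_return(areas)
log_ret = log_return(areas)

print(round(pct, 4))          # 0.0091
print(abs(log_ret) < 1e-12)   # True
```

The percentage-change return is positive even though the plant ends where it started, while the log-area return cancels to zero.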

Difference in area divided by initial area

$$\frac{\text{Area}_{t+1} - \text{Area}_{t}}{\text{Area}_{0}}$$

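While this reward function induces a return that is invariant to the initial area, and the return equals the total relative growth $\text{Area}_{T}/\text{Area}_{0} - 1$ (the growth factor minus one), its rewards still scale with the current size of the plant, so they grow exponentially over time when the plant grows exponentially.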
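A quick sketch of both properties, using a made-up plant that doubles at every step (the function name and values are hypothetical):

```python
def normalized_rewards(areas):
    """Rewards (Area_{t+1} - Area_t) / Area_0 for a sequence of areas."""
    initial = areas[0]
    return [(nxt - cur) / initial for cur, nxt in zip(areas, areas[1:])]

# Hypothetical plant that doubles at every step (mm^2).
areas = [100.0, 200.0, 400.0, 800.0]

rewards = normalized_rewards(areas)
total = sum(rewards)

print(rewards)  # [1.0, 2.0, 4.0]
print(total)    # 7.0, which equals 800 / 100 - 1
```

The return matches the final-to-initial ratio minus one, but the per-step rewards double along with the plant, so later steps carry most of the signal.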

Methods

  • Offline RL

Deployment

How do we do inference? Using the mean plant embedding seems to be a reasonable choice (perhaps the mean over only those plants that are “alive”).
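One way this could look is a masked mean over per-plant embeddings. The sketch below is only an illustration under assumptions: the embedding format (a list of equal-length float vectors), the `alive` flags, and the fallback to all plants when none are alive are all hypothetical choices, not a description of our actual pipeline.

```python
def mean_embedding(embeddings, alive):
    """Average the embeddings of plants flagged as alive.

    Assumption: if no plant is flagged alive, fall back to averaging all plants.
    """
    selected = [e for e, is_alive in zip(embeddings, alive) if is_alive]
    if not selected:
        selected = list(embeddings)
    n = len(selected)
    # zip(*selected) iterates over embedding dimensions.
    return [sum(dim) / n for dim in zip(*selected)]

# Hypothetical 2-dimensional embeddings for three plants; the third is not alive.
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
alive = [True, True, False]

state = mean_embedding(embeddings, alive)
print(state)  # [2.0, 3.0]
```

Masking out dead plants keeps an outlier embedding from dragging the averaged state away from the plants the policy can still affect.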