StockTradingEnv_gym_anytrading
Action Space
The action space is discrete, with two actions:
0 : Sell
1 : Buy
State Space
The state space consists of the following components:
Account balance (1 element): current account balance
Stock price (1 element): current price for the stock
Holding position (1 element): number of shares held for the stock
Price difference (1 element): the change in price since the previous timestep
Position (1 element): 0 for short, 1 for long
Observation Space
The agent can observe the following components:
Stock price: 1 element
Price difference: 1 element
Each observation contains a rolling history of these elements over the most recent window_size time steps.
Thus, the total observation space dimension is 2*window_size.
Transition Dynamics
At each timestep, the environment performs the following updates:
Decode Actions (Discrete)
The action for each asset is a discrete flag:
0 = SELL/close long, 1 = BUY/open long. Current position flags are kept in self.position (0 = short, 1 = long).
acts = np.asarray(actions, dtype=int).reshape(-1)
pos = np.asarray(self.position, dtype=int).reshape(-1)
Execute Trades
We resolve the intended transitions into two disjoint sets:
sell_hits = np.where((acts == 0) & (pos == 1))[0]
buy_hits = np.where((acts == 1) & (pos == 0))[0]
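As a small illustration of the two masks above, the following sketch runs them on a hypothetical three-asset example. The input values are made up for illustration; only the np.asarray and np.where expressions come from the text.

```python
import numpy as np

# Hypothetical inputs for three assets (illustrative values only).
actions = [0, 1, 1]   # requested actions: 0 = SELL, 1 = BUY
position = [1, 0, 1]  # current flags: 0 = short, 1 = long

acts = np.asarray(actions, dtype=int).reshape(-1)
pos = np.asarray(position, dtype=int).reshape(-1)

# A SELL only takes effect on assets that are currently long,
# and a BUY only on assets that are currently short.
sell_hits = np.where((acts == 0) & (pos == 1))[0]
buy_hits = np.where((acts == 1) & (pos == 0))[0]

print(sell_hits.tolist())  # [0]
print(buy_hits.tolist())   # [1]
```

Asset 2 requests a BUY while already long, so it appears in neither set and no trade is executed for it.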
BUY (open long) — from short:
Calls _buy_stock(i, qty) with qty = self.hmax (the function ignores the requested quantity and buys the maximum affordable number of shares):
\[\text{unit_cost}^{(i)} = \text{price}^{(i)} \bigl(1 + c^{\text{buy}}_i\bigr),\qquad \text{actual}^{(i)} = \left\lfloor \frac{\text{cash}}{\text{unit_cost}^{(i)}} \right\rfloor.\]
Internal accounting:
Cash decreases by price * actual + fee, where fee = price * actual * buy_cost_pct[i].
Holdings increase by actual shares.
self.last_trade_price[i] is set to the entry price for later reward.
Trade counter and cumulative cost are updated.
Position flag becomes 1 (long).
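The buy-side accounting above can be checked with a short worked example. This is a minimal sketch, not the environment's _buy_stock: the function name buy_max_affordable and the numeric inputs are hypothetical, and only the formulas for the unit cost, share count, and fee follow the text.

```python
import math

def buy_max_affordable(cash, price, buy_cost_pct):
    """Hypothetical helper: return (shares_bought, cash_after, fee)."""
    # Cost of one share including the proportional buy cost.
    unit_cost = price * (1.0 + buy_cost_pct)
    # Largest whole number of shares the cash balance can cover.
    actual = math.floor(cash / unit_cost)
    fee = price * actual * buy_cost_pct
    cash_after = cash - (price * actual + fee)
    return actual, cash_after, fee

# Illustrative numbers: 1,000,000 in cash, price 100, 0.1% buy cost.
shares, cash_after, fee = buy_max_affordable(
    cash=1_000_000.0, price=100.0, buy_cost_pct=0.001
)
print(shares)  # 9990
```

With these inputs the unit cost is about 100.1, so 9,990 shares are bought, the fee is roughly 999, and about 1 unit of cash remains. Because the share count is rounded down, the cash balance never goes negative.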
SELL (close long) — from long:
Calls _sell_stock(i, qty) with qty = min(hmax, holding), but the function ignores the request and sells all current shares:
Cash increases by price * actual - fee, where fee = price * actual * sell_cost_pct[i].
Holdings drop to zero.
Trade counter and cumulative cost are updated.
Position flag becomes 0 (flat; this is the value labelled "short" in self.position).
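The sell-side accounting mirrors the buy side. The sketch below is an assumption-labelled illustration rather than the environment's _sell_stock: the function name sell_all and the numeric inputs are hypothetical, and only the cash and fee formulas follow the text.

```python
def sell_all(cash, holding, price, sell_cost_pct):
    """Hypothetical helper: liquidate the whole holding.

    Returns (shares_sold, cash_after, holding_after, fee).
    """
    actual = holding  # the requested quantity is ignored; everything is sold
    fee = price * actual * sell_cost_pct
    cash_after = cash + (price * actual - fee)
    return actual, cash_after, 0, fee

# Illustrative numbers: 9,990 shares sold at 110 with a 0.1% sell cost.
sold, cash_after, holding_after, fee = sell_all(
    cash=1.0, holding=9990, price=110.0, sell_cost_pct=0.001
)
print(sold, holding_after)  # 9990 0
```

Here the gross proceeds are 1,098,900 and the fee is about 1,098.9, so the cash balance ends near 1,097,802.1.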
Advance Time & Update State / Observation
Advance the market clock:
self.day += 1, and load the new row: self.data = self.df.loc[self.day, :].
Recompute turbulence if applicable.
Rebuild the internal state (cash, price, holdings, price_diff, etc.) via _update_state().
Append the observation frame [price, price_diff] to the rolling buffer:
self._frames.append(np.array([self.state[1], self.state[3]], dtype=np.float32))
_get_obs() concatenates the last window_size frames to produce the agent’s observation.
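The rolling observation buffer can be sketched as follows. This assumes the buffer is a collections.deque with maxlen=window_size, which the text does not confirm; push_frame, get_obs, and the price series are hypothetical stand-ins for the environment's internals.

```python
from collections import deque

import numpy as np

window_size = 3  # hypothetical window length

# Assumption: a bounded deque drops the oldest frame automatically.
frames = deque(maxlen=window_size)

def push_frame(price, price_diff):
    """Append one [price, price_diff] frame to the rolling buffer."""
    frames.append(np.array([price, price_diff], dtype=np.float32))

def get_obs():
    """Concatenate the buffered frames into one flat observation."""
    return np.concatenate(list(frames))

# Illustrative price series; price_diff is the change from the previous step.
prices = [100.0, 101.0, 103.0, 102.0]
prev = prices[0]
for p in prices:
    push_frame(p, p - prev)
    prev = p

obs = get_obs()
print(obs.shape)     # (6,)
print(obs.tolist())  # [101.0, 1.0, 103.0, 2.0, 102.0, -1.0]
```

Once the buffer is full, the observation has 2 * window_size elements, ordered from the oldest frame to the newest. How the buffer is filled before window_size steps have elapsed is not specified in the text and is left out of this sketch.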
Constraints
Costs: proportional buy/sell costs are applied via buy_cost_pct[i] and sell_cost_pct[i].
Reward Design
Reward is given only when a SELL closes an existing LONG. For each such asset i:
price_diff[i] = current_sell_price - self.last_trade_price[i]
The step reward is the sum over assets that closed a long this step:
reward = float(sum(price_diff))
All other actions (opening a long, staying flat, or forced no-trade) yield 0 reward.
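The reward rule can be illustrated with a few lines of Python. This is a sketch under stated assumptions: step_reward is a hypothetical helper, the prices are made up, and sell_hits stands for the indices of assets whose long position was closed this step.

```python
import numpy as np

def step_reward(sell_hits, prices, last_trade_price):
    """Hypothetical helper: sum of (sell price - entry price) over closed longs."""
    price_diff = [prices[i] - last_trade_price[i] for i in sell_hits]
    return float(sum(price_diff))

# Illustrative values for three assets.
prices = np.array([110.0, 50.0, 20.0])            # current sell prices
last_trade_price = np.array([100.0, 55.0, 20.0])  # recorded entry prices

# Assets 0 and 1 close a long this step: (110 - 100) + (50 - 55).
reward = step_reward([0, 1], prices, last_trade_price)
print(reward)  # 5.0

# No long is closed, so the step yields no reward.
no_trade_reward = step_reward([], prices, last_trade_price)
print(no_trade_reward)  # 0.0
```

As written in the formula, the reward is a per-share price difference and is not scaled by the number of shares sold.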
Initial State
At the start of each episode:
Account balance is initialized to 1,000,000.
Holding position is initialized to 0.
Stock price is set to the price from the first trading day.
Price difference is set to the change in price since the last day in the observation window.
Ending Condition
The episode terminates when the trading day would exceed the available data range.
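One way to express this check is sketched below. It assumes trading days are indexed 0 to n_days - 1 and that the check runs before the clock is advanced; is_last_step and n_days are hypothetical names, not part of the environment.

```python
def is_last_step(day, n_days):
    """True when advancing from `day` would leave the range 0 .. n_days - 1."""
    return day + 1 >= n_days

# Illustrative run over a 4-day dataset, starting on day 0.
n_days = 4
day = 0
steps = 0
while not is_last_step(day, n_days):
    day += 1
    steps += 1

print(day, steps)  # 3 3
```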