How I used AI to Simulate a Company Environment

Using RL to simulate a company environment

Berkshire Hathaway netted 81 billion dollars and Apple netted 55 billion dollars. Meanwhile, 1 in 12 businesses went out of business this year, and about 98,000 businesses closed permanently because of COVID-19.

So what separates big companies like Berkshire Hathaway and Apple from the 1 in 12 businesses that close each year? How can a company make as much money as possible so that it doesn’t go out of business?

Why not use AI? [Insert joke about AI start-ups getting funded like crazy by VCs]

So that’s what I did. Using reinforcement learning, I simulated a company environment to optimize the money the company makes.


What is reinforcement learning?

Reinforcement learning is similar to how humans or dogs learn: the agent is rewarded when it does something correctly and punished when it does something incorrectly.

Reinforcement learning works with five main components: an agent, a reward, a state, an action and an environment.

Visualization of an RL system

The actions are what the agent does. These could be things like: move left, move right, go forward, etc. These usually combine into bigger behaviors, like running a maze or driving up a hill.

The agent is the AI. It predicts which actions to take and which to avoid, and it learns from what it has done and the rewards it got for doing it.

The state describes the agent’s current situation in the environment. For example, if your agent is playing a video game with a display, the state would be a frame of the game.

The reward is what drives the agent to do what it does. It shows your agent which actions to take by rewarding desired actions and punishing undesirable ones.

The environment defines what happens. When an action is taken, the environment executes that action, works out whether the agent should get a reward, and gives it. In short, the environment is the game the agent is playing.

Using these five components, you can make a reinforcement learning system.
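To make the five pieces above concrete, here is a minimal sketch of how they fit together in code. Everything in it is a made-up illustration and not part of my company simulation: `LineWorld`, `run_episode`, and the two policies are hypothetical names, and the "agent" here just follows a fixed rule instead of learning.

```python
import random


class LineWorld:
    """Environment: a 5-cell corridor. The agent starts at cell 0, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is simply the agent's cell

    def step(self, action):
        # Action 0 = move left, action 1 = move right.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        # Reward reaching the goal, lightly punish every other step.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the agent picks an action
        state, reward, done = env.step(action)   # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    """An agent that has learned nothing yet: it acts at random."""
    return random.choice([0, 1])


def always_right(state):
    """An agent that already knows the best move in this corridor."""
    return 1


print("always right:", run_episode(LineWorld(), always_right))
print("random:", run_episode(LineWorld(), random_policy))
```

A learning agent would replace the fixed policies with something that improves from the rewards it sees, which is what the DQN does later in this article.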

Now that we know what reinforcement learning is, I can explain:

How I used reinforcement learning to simulate a company environment

The goal of my environment was to optimize the money made by a company so that it doesn’t go out of business for lack of money. So I started by making one boss and four workers.

The workers

The workers are all different, to represent the variety of employees you’d normally have at a company. All the workers have two actions they can take: make money or do nothing. If they do nothing, nothing happens; if they make money, the company makes money. The amount of money they make is modeled by this equation:

self.company_money += worker_1_skill + worker_1_pay_motivation - worker_1_endurance

This says: take the worker’s skill level, add the worker’s pay motivation, subtract the worker’s endurance value, and add the result to the company’s money. The workers take their actions randomly and are not agents because of a lack of computing power (which I found out the hard way).

The variables listed above are affected by the three variables that change between the workers: skill level, endurance level and money motivation level.

The skill level of the workers affects how much money they make for the company directly and is a value from 1–5. This means the higher the skill level, the more money they make for the company in the same amount of time.

The worker’s money motivation level indirectly affects the amount of money the company makes. It is a value from 0 to 1 that is multiplied by the worker’s pay to get the pay motivation, which is added in the make money equation. This means the higher the money motivation level, the more the worker is motivated by money.

self.worker_1_pay_motivation = (self.worker_1_pay * worker_1_money_motivation_level)

The endurance level of the worker affects the make money equation indirectly as well. The endurance level is a value from 0 to 5 that feeds into the equation below to calculate endurance. The constants in that equation were derived from a very small sample size because not many people want to answer how long of a break they need after how many days of work. But if you want to help, here is a form to help with data. To calculate the endurance, I raised the base constant to the power of the number of days worked, scaled it by the first constant, and multiplied the result by the endurance level. This means the higher the endurance level, the easier the worker gets tired.

self.worker_0_endurance = ((0.4641588837 * (1.165914401) ** self.worker_0_day_on) * (self.worker_0_endurance_level))
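The three equations above can be put together into one small worked example. This is a sketch rather than the code from my environment: the function names and the sample worker values are made up for illustration, and only the formulas and the two constants come from the equations in this article.

```python
def endurance(days_on, endurance_level):
    """Tiredness term: grows exponentially with consecutive days worked."""
    return 0.4641588837 * (1.165914401 ** days_on) * endurance_level


def pay_motivation(pay, money_motivation_level):
    """Pay scaled by how much the worker cares about money (0 to 1)."""
    return pay * money_motivation_level


def money_made(skill, pay, money_motivation_level, days_on, endurance_level):
    """Amount one 'make money' action adds to the company's money."""
    return (
        skill
        + pay_motivation(pay, money_motivation_level)
        - endurance(days_on, endurance_level)
    )


# Hypothetical worker: skill 3, paid 10 dollars, money motivation 0.5, endurance level 2.
for days_on in (0, 5, 10, 20):
    print(days_on, round(money_made(3, 10, 0.5, days_on, 2), 2))
```

Because the endurance term grows with every day worked, the same worker brings in less and less money the longer they go without a day off, which is what makes the boss's day-off action worth using.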

The values of the worker are as shown below:

The Boss

The boss is the agent for this reinforcement learning system. The boss has 22 actions he can take, but they can be summed up as four main things: change pay, make a worker take a day off, do nothing and make money.

The boss has 22 actions because the day-off action exists once for each worker, and the pay changes exist once for each worker at each magnitude.

All his actions are worker_0 day off, worker_1 day off, worker_2 day off, worker_3 day off, do nothing, make money (the boss makes 20 dollars for each action), worker_0 1 dollar pay raise, worker_0 2 dollar pay raise, worker_0 -1 dollar pay change, worker_0 -2 dollar pay change, worker_1 1 dollar pay raise, worker_1 2 dollar pay raise, worker_1 -1 dollar pay change, worker_1 -2 dollar pay change, worker_2 1 dollar pay raise, worker_2 2 dollar pay raise, worker_2 -1 dollar pay change, worker_2 -2 dollar pay change, worker_3 1 dollar pay raise, worker_3 2 dollar pay raise, worker_3 -1 dollar pay change and finally worker_3 -2 dollar pay change. If you read all that, mad respect!
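If you would rather see the count than read the list, the 22 actions can be generated in the same order as above. This is a sketch under my own assumptions: the tuple layout `(kind, worker, amount)` and the names `build_actions` and `ACTIONS` are hypothetical, not how the environment stores them.

```python
NUM_WORKERS = 4
PAY_CHANGES = [1, 2, -1, -2]  # dollars, in the order listed in the article


def build_actions():
    """Return the boss's actions as (kind, worker, amount) tuples."""
    # One day-off action per worker.
    actions = [("day_off", worker, None) for worker in range(NUM_WORKERS)]
    # Two actions that do not target a worker.
    actions.append(("do_nothing", None, None))
    actions.append(("make_money", None, None))
    # Four pay changes per worker.
    for worker in range(NUM_WORKERS):
        for amount in PAY_CHANGES:
            actions.append(("pay_change", worker, amount))
    return actions


ACTIONS = build_actions()
print(len(ACTIONS))  # 4 day-off + 2 general + 16 pay changes
print(ACTIONS[6])    # first pay change: worker_0, 1 dollar raise
```

With a table like this, the network only has to output an index from 0 to 21, and the environment looks up what that index means.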

We will talk about the AI and how it works later on, but the action is decided by the AI.

Using those actions, variables, and a bunch of other code, the environment is pieced together.

The Agent

Later is now! So let’s talk about how the AI works. For this environment, I used a DQN to predict the values. In short, a DQN is a Deep Q-Network, which combines deep learning and reinforcement learning. The Q stands for the Q-value: an estimate of the total future reward the agent can expect from taking a given action in a given state. So a DQN uses a neural network to estimate the Q-value of every action and picks the action with the highest one. A more in-depth explanation is here.

I used a sequential model with 4 dense layers, as shown here.

model = Sequential()
model.add(Dense(1, input_dim = (1), activation=relu))
model.add(Dense(150, activation=relu))
model.add(Dense(120, activation=relu))
model.add(Dense(self.action_space, activation=linear))
model.compile(loss="mse", optimizer=Adam(

Because in RL the dataset is constantly changing and has to be built as you go, I used replay memory. After every action the DQN takes, it samples a batch of past experiences from memory and trains on the gap between the Q-values it predicted and the target Q-values computed from the rewards it actually got, as shown here:

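def replay(self):
    # Wait until there are enough stored experiences to sample a full batch
    if len(self.memory) < self.batch_size:
        return
    sample = random.sample(self.memory, self.batch_size)
    states_boss = np.array([i[0] for i in sample])
    actions_boss = np.array([i[1] for i in sample])
    rewards_boss = np.array([i[2] for i in sample])
    next_states_boss = np.array([i[3] for i in sample])
    dones_boss = np.array([i[4] for i in sample])
    # Target: reward plus the discounted best Q-value of the next state (zero if the episode ended)
    Qtargets = rewards_boss + self.gamma*(np.amax(self.model.predict_on_batch(next_states_boss), axis=1))*(1-dones_boss)
    Qtarget = self.model.predict_on_batch(states_boss)
    batch_size_array = np.array([i for i in range(self.batch_size)])
    # NOTE: the listing was cut off between the line above and the end of the fit call;
    # the next two lines are an assumed reconstruction of the usual DQN update
    Qtarget[batch_size_array, actions_boss] = Qtargets
    self.model.fit(states_boss, Qtarget, epochs=1, verbose=0)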
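The key line in the replay function is the one that builds the targets, so here is that arithmetic on its own with tiny made-up numbers. This is a plain-Python sketch of the same formula, not a piece of my training code: `q_targets` is a hypothetical helper, and the value of `gamma` is picked only to keep the math easy to follow.

```python
def q_targets(rewards, next_q_values, dones, gamma):
    """Compute reward + gamma * max_a Q(next_state, a), zeroing the future part when done."""
    targets = []
    for reward, next_q, done in zip(rewards, next_q_values, dones):
        best_next = max(next_q)  # same role as np.amax(..., axis=1)
        targets.append(reward + gamma * best_next * (1 - done))
    return targets


# Two stored experiences, each with Q-value estimates for two possible next actions.
rewards = [1.0, 0.0]
next_q_values = [[0.5, 2.0], [3.0, 1.0]]
dones = [0, 1]  # the second experience ended the episode

print(q_targets(rewards, next_q_values, dones, gamma=0.5))
```

The first target is 1.0 + 0.5 * 2.0 = 2.0. The second is just its reward of 0.0, because the episode was over and there is no future reward to add.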


Through all this, the final results after running the simulation many times are as follows.

In general, the most profitable runs were the ones where the boss paid the most skilled worker much more than the median, and paid the lower-skilled workers much less than the high-skill worker but still triple the necessary pay. The lowest-skilled workers also had the most days off, while the highest-skilled worker had barely any.

One outlier, though, happened to be the most profitable run of all: the boss paid all the workers as little as possible, gave the low-skill workers lots of days off, and gave the high-skill worker a fair number of days off.

Future Potential

I know that right now this is a very basic model of a company, and it would probably have to be specialized to a specific company to provide more value. With more time and computational power, this program could also be expanded with features like:

  • Multiple companies (poaching, loyalty level)
  • Commissions vs Salary
  • Multiple bosses
  • Workers are Agents (lots more computational power)
  • More Jobs and Specialization of jobs

If you’re interested, I linked my code here.

If you like this article, you will probably like my other ones, so consider following me on Medium, and while you’re doing that, follow me on Twitter and LinkedIn and sign up for my newsletter.

I’m a curious 16-year-old. I’m interested in space, AI and many other things.