Abstract
Artificial Intelligence is increasingly being used to automate complex gaming environments, enabling agents to learn, adapt, and make decisions autonomously. Automating a game like Flappy Bird demonstrates how AI can perceive challenges, make real-time judgments, and optimize its behavior over time. This paper presents an approach to automating gameplay through the integration of three ML techniques: Curriculum Learning, Imitation Learning, and Proximal Policy Optimization (PPO), a reinforcement learning algorithm. The goal is not to determine which method is superior, but to show how these methods can be integrated and structured for better learning outcomes. Through problem decomposition and gradual difficulty increases, we cut training time from multiple hours to around an hour, achieving stable and accurate gameplay automation.
Key Findings:
- Computational time and accuracy can be enhanced by leveraging proper training design decisions.
- There is no single superior method; the optimal choice depends on computational time, accuracy, and design considerations.
Note: This research was conducted by GenITeam Solutions as part of an internal exploration into AI-driven gameplay automation. The experiment builds upon an open-source Unity project by Code Monkey (See link for full setup), with adaptations and training methodologies developed in-house. The source code used for this study is available here.
Why AI Training is Important for Game Development
AI training in games serves as a controlled setting to test autonomous decision-making, adaptability, and procedural reasoning.
Key benefits include:
- Automated testing: Continuous gameplay evaluation to detect flaws or balance issues.
- Procedural level generation: Validating levels generated by AI and assigning them complexity.
- Personalized level generation: Dynamically adjusting difficulty based on player performance for an engaging yet achievable experience.
Experiment Design
All experiments were conducted on a system equipped with 32 GB DDR4 RAM, an NVIDIA GTX 1050 Ti (4 GB) GPU, and an Intel Core i5-10400 CPU.
Progressive Training Setup: We broke the problem down into several approaches, each beginning with a no-pipe (ground and ceiling only) baseline in which the agent learned the basic physics of gravity, timing, and boundary awareness before progressing to obstacle environments. The end goal was sustaining 10 minutes of gameplay in a randomized environment at the smallest gap size of 40.
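The no-pipe baseline can be pictured as a minimal physics loop: the bird falls under gravity, gains upward velocity when it flaps, and fails on contact with the ground or ceiling. The sketch below is a simplified stand-in for the Unity environment; the physics constants and screen bounds are illustrative assumptions, not the project's actual values.

```python
# Minimal sketch of the no-pipe baseline environment (assumed constants).
GRAVITY = -0.5       # downward acceleration per step (assumption)
FLAP_IMPULSE = 4.0   # upward velocity applied on a flap (assumption)
GROUND, CEILING = 0.0, 50.0

def step(y, vy, flap):
    """Advance the bird one step; returns (y, vy, alive)."""
    vy = FLAP_IMPULSE if flap else vy + GRAVITY
    y += vy
    alive = GROUND < y < CEILING  # boundary awareness the agent must learn
    return y, vy, alive

# A naive hand-written policy: flap whenever the bird drops below mid-screen.
y, vy, alive = 25.0, 0.0, True
for _ in range(200):
    y, vy, alive = step(y, vy, flap=y < 25.0)
    if not alive:
        break
```

Even this trivial policy survives indefinitely in the no-pipe setting, which is why the baseline phase converges quickly and makes a good starting point for the curriculum.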
Approach 1: Fixed Pipe Heights with Decreasing Gaps
Following the baseline phase, the agent was first introduced to an environment with a gap size of 80 and fixed pipe heights to be trained via reinforcement learning. Once the agent achieved consistent performance, the gap size was gradually reduced to 70, 60, 50, and finally 40. After mastering the smallest gap environment, the agent was transitioned into a randomized environment featuring both varying gap sizes and pipe heights.
| Approach 1 | |
|---|---|
| Gap Size | Training Time with Fixed Pipe Heights (min) |
| 80.00 | 16.7 |
| 70.00 | 8.6 |
| 60.00 | 5.86 |
| 50.00 | 20.36 |
| 40.00 | 16.1 |
| Randomization | 54.83 |
| Total Time | 122.61 |
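The staged schedule above can be sketched as a simple curriculum loop: train at each gap size until performance is consistent, then shrink the gap, and finish with the fully randomized environment. `train_until_consistent` is a hypothetical placeholder for launching an actual PPO training run, not part of the project's codebase.

```python
# Sketch of the Approach 1 curriculum: fixed heights, gradually shrinking gaps,
# then a final randomized phase. The training function is a placeholder.

def train_until_consistent(stage, threshold=0.9):
    # Placeholder: a real version would run PPO training and return once the
    # mean episode success rate exceeds `threshold` (threshold is assumed).
    return {"stage": stage, "success": threshold}

curriculum = [80, 70, 60, 50, 40]          # gap sizes, largest to smallest
history = [train_until_consistent(gap) for gap in curriculum]

# After mastering the smallest gap, transition to the randomized environment.
history.append(train_until_consistent("randomized"))
```

The promotion criterion (when the agent counts as "consistent") is the key design knob here; promoting too early destabilizes later stages, while promoting too late wastes training time.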
Approach 2: Gradual Gap Reduction with Randomized Pipe Heights
Here, the environment was randomized from the start. The agent began training with a gap size of 80, but unlike the previous approach, the pipe heights were randomized throughout the process. The gap size was then progressively reduced from 80 to 40. This constant exposure to environmental randomness improved the agent’s ability to generalize its learning without requiring a separate randomized environment phase at the end. This approach achieved slightly better performance than Approach 1.
| Approach 2 | |
|---|---|
| Gap Size | Training Time with Randomized Pipe Heights (min) |
| 80.00 | 32.4 |
| 70.00 | 12.85 |
| 60.00 | 14.56 |
| 50.00 | 28.36 |
| 40.00 | 23.46 |
| Total Time | 111.65 |
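The randomization in Approach 2 amounts to domain randomization over pipe placement: every spawned pipe pair gets a random gap centre, constrained so the gap stays fully on screen. A minimal sketch, assuming illustrative screen bounds (the Unity project uses its own coordinates):

```python
import random

# Sketch of randomized pipe spawning (assumed screen bounds of 0..100).
SCREEN_BOTTOM, SCREEN_TOP = 0.0, 100.0

def spawn_pipe(gap_size):
    """Pick a random gap centre that keeps the gap fully on screen."""
    half = gap_size / 2
    centre = random.uniform(SCREEN_BOTTOM + half, SCREEN_TOP - half)
    return centre - half, centre + half  # (gap bottom, gap top)

random.seed(0)  # seeded only so the sketch is reproducible
bottom, top = spawn_pipe(80)
```

Because the agent never sees the same pipe layout twice, it is forced to generalize from the start, which is what removes the need for a separate randomized phase at the end of this approach.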
Approach 3: Sharply Reducing Gaps with Fixed Pipe Heights
This approach followed the same setup as Approach 1, with training beginning at a gap size of 80 and fixed pipe heights. However, instead of gradual decrements, the gap size was reduced sharply, from 80 to 60, then to 40, before transitioning to the randomized environment phase. This approach produced significantly better results compared to the two earlier approaches.
| Approach 3 | |
|---|---|
| Gap Size | Training Time with Fixed Pipe Heights (min) |
| 80.00 | 16.7 |
| 60.00 | 15.65 |
| 40.00 | 12.41 |
| Randomization | 26.1 |
| Total Time | 71.1 |
Approach 4: Sharply Reducing Gaps with Randomized Pipe Heights
This approach utilized the same setup as Approach 2, maintaining randomized pipe heights throughout, but with sharper gap reduction again, moving from gap sizes of 80 to 60 to 40. This approach outperformed Approaches 1 and 2 and achieved results comparable to Approach 3.
| Approach 4 | |
|---|---|
| Gap Size | Training Time with Randomized Pipe Heights (min) |
| 80.00 | 32.4 |
| 60.00 | 16.25 |
| 40.00 | 27.22 |
| Total Time | 75.37 |
Approach 5: Alternating Fixed and Randomized Heights with Reducing Gaps
Here, the agent was first trained at a fixed pipe height for a single gap size (starting at 80), then with randomized pipe heights while maintaining the same gap size. This process was repeated for each gap size down to 40. This improved training time compared to Approaches 1 and 2, though remained significantly longer than Approaches 3 and 4.
| Approach 5 | |
|---|---|
| Gap Size | Training Time (min) |
| 80.00 | 16.7 |
| 80 + Randomized | 18.68 |
| 70.00 | 3.65 |
| 70 + Randomized | 7.73 |
| 60.00 | 3.95 |
| 60 + Randomized | 5.8 |
| 50.00 | 9.41 |
| 50 + Randomized | 15.91 |
| 40.00 | 2.683 |
| 40 + Randomized | 14.3 |
| Total Time | 98.9 |
Approach 6: Hybrid Imitation-Reinforcement Learning (5% Imitation, 95% Reinforcement)
Here, expert inputs from recorded human gameplay guided the agent, with 5% imitation training to provide a quick learning kickstart. The remaining 95% involved reinforcement learning, where the agent refined its performance directly through interaction with the environment. For each gap size from 80 to 40, training began with fixed pipe heights and then progressed to randomized ones. This method resulted in the longest overall training time among all approaches.
| Approach 6 | |
|---|---|
| Gap Size | Training Time (min) |
| 80 – Imitation + RL | 7.2 |
| 80 Randomized – Pure RL | 17.5 |
| 70 – Imitation + RL | 26.9 |
| 70 Randomized – Pure RL | 20.3 |
| 60 – Imitation + RL | 6.5 |
| 60 Randomized – Pure RL | 44.9 |
| 50 – Imitation + RL | 13.1 |
| 50 Randomized – Pure RL | 27.42 |
| 40 – Imitation + RL | 13.1 |
| 40 Randomized – Pure RL | 25.58 |
| Total Time | 196.15 |
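Conceptually, the 5%/95% split in Approach 6 blends two training signals: a behavioral-cloning (imitation) loss that pulls the policy toward the recorded human actions, and the reinforcement-learning loss. A minimal sketch of that weighting, with purely illustrative loss values (the project's actual trainer and loss magnitudes are not shown here):

```python
# Sketch of the 5% imitation / 95% RL blend. The weights come from the
# approach description; the example loss values are illustrative assumptions.
IMITATION_WEIGHT = 0.05
RL_WEIGHT = 0.95

def combined_loss(bc_loss, ppo_loss):
    """Blend a behavioral-cloning loss with the RL (PPO) loss."""
    return IMITATION_WEIGHT * bc_loss + RL_WEIGHT * ppo_loss

loss = combined_loss(bc_loss=2.0, ppo_loss=0.4)  # 0.05*2.0 + 0.95*0.4 ≈ 0.48
```

Because the imitation term is applied with a fixed weight throughout training, it keeps pulling the policy toward the demonstrations even after RL has surpassed them, which is one plausible reading of why this approach converged slowest.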
Approach 7: Randomized Imitation-Reinforcement Learning
This approach trained the agent in a randomized environment with varying gaps from 80 to 40, starting with imitation learning based on recorded human gameplay. Once the agent handled demonstrated situations sufficiently well, training shifted fully to reinforcement learning, enabling independent exploration and better generalization. This combined strategy yielded the most efficient and stable performance, achieving the best results among all approaches.
| Approach 7 | Training Time (min) |
|---|---|
| Train agent on randomized environments (gaps 40-80, varying heights) using Imitation Learning | 12.5 |
| Resume training with Reinforcement Learning | 41 |
| Total Time | 54 |
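Unlike Approach 6's constant blend, Approach 7 runs the two phases strictly in sequence: imitation first, then pure RL resuming from the imitated policy. The sketch below captures that schedule with placeholder functions; the demonstration names and return values are hypothetical.

```python
# Sketch of Approach 7's two-phase schedule: behavioral cloning on recorded
# gameplay, then pure RL warm-started from the imitated policy.
# Both function bodies are placeholders for real training runs.

def imitation_phase(demos):
    # Placeholder: fit the policy to the human demonstrations.
    return {"phase": "imitation", "episodes_seen": len(demos)}

def rl_phase(pretrained):
    # Placeholder: continue with PPO, exploring beyond the demonstrations.
    return {"phase": "rl", "warm_start": pretrained["phase"] == "imitation"}

policy = rl_phase(imitation_phase(demos=["run1", "run2", "run3"]))
```

Separating the phases means the imitation signal can no longer fight the RL signal once exploration takes over, which is consistent with this approach achieving the fastest convergence.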
Results and Observations
Training performance varied notably across approaches, highlighting how different training configurations influenced convergence speed and adaptability.
- Approach 7 achieved the fastest convergence time (54.1 minutes), demonstrating the effectiveness of a sequential imitation–reinforcement strategy. The agent first received a kickstart through imitation learning, then refined its adaptability through reinforcement learning.
- Following this, Approach 3 (71.10 min) and Approach 4 (75.37 min) also showed efficient learning due to sharply reduced gap intervals, which accelerated precision acquisition. The slight slowdown in Approach 4 reflected the added variability of pipe heights, requiring greater adaptive control.
- Approach 5 (98.90 min), which alternated between fixed and randomized conditions at each gap level, balanced stability and adaptability but extended overall training time due to frequent environmental shifts.
- In contrast, Approaches 1 and 2 (122.61 min and 111.65 min, respectively) converged more slowly, as the agent had to gradually develop precision, control, and spatial awareness through incremental difficulty.
- Finally, Approach 6 recorded the longest training time (196.15 min), as applying imitation and reinforcement learning simultaneously introduced noise in policy stabilization. The static imitation ratio prevented optimal adjustment during reinforcement phases, slowing overall convergence.
Conclusion
This study developed an adaptive training framework to teach an AI agent to navigate increasingly complex game environments. By combining structured learning, environmental variability, and hybrid training strategies, the framework achieved a balance of stability, adaptability, and efficiency. The results show that no single method is best; effective performance arises from integrating curriculum design, controlled randomness, and reinforcement learning.

