Despite recent advances in Large Language Models (LLMs) and their proficiency across a wide range of language tasks, there are few benchmarks for these models' grasp of complex linguistic concepts such as frame blending, a basic mental operation of cognitively modern humans. Frame Semantics enables a deeper understanding of words by analyzing the conceptual structures they evoke, making it a valuable tool in fields such as linguistics, economics, and politics. This study investigates the ability of open-source LLMs, specifically Llama 2, to generate frame blending cases through fine-tuning and prompt engineering. Using FrameNet data, we build a robust data pipeline that trains and fine-tunes models to generate sentences with blended frames. Our system allows users to select specific frames for blending and to customize the level of in-context exposure (zero-shot, one-shot, or few-shot prompting). Once the model generates results, human annotators evaluate them on metrics such as clarity, relevance, and depth of understanding, enabling a comprehensive assessment of model performance. This pipeline facilitates large-scale, Human-in-the-Loop AI experiments that compare the frame-blending capabilities of different open-source LLMs. Our research not only benchmarks LLMs' proficiency in frame blending but also offers strategies for enhancing this ability. Our findings aim to set a new benchmark for evaluating LLMs' understanding of complex language and to improve their use in fields that require nuanced language skills. This study contributes to the development of more advanced language models that better simulate complex cognitive operations.
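As an illustration of the prompt-engineering component described above, the sketch below shows how a zero-, one-, or few-shot frame blending prompt might be assembled from frame names and definitions. It is a minimal sketch, not the paper's actual pipeline: the frame definitions, the exemplar sentence, and the `build_prompt` helper are hypothetical placeholders (a real pipeline would draw frames and exemplars from FrameNet data).

```python
from dataclasses import dataclass

@dataclass
class Frame:
    """A FrameNet-style frame: a name plus a short definition."""
    name: str
    definition: str

# Hypothetical frame definitions; a real pipeline would pull these from FrameNet.
MOTION = Frame("Motion", "A Theme moves from a Source to a Goal along a Path.")
COMMERCE = Frame("Commerce_buy", "A Buyer exchanges Money with a Seller for Goods.")

# Hypothetical exemplar of a frame-blended sentence, used for few-shot prompting.
EXEMPLARS = [
    ("Motion", "Revenge", "His anger traveled from a quiet grudge to open retaliation."),
]

def build_prompt(frames, exemplars, n_shots=1):
    """Assemble a prompt asking the model to blend the given frames.

    n_shots controls the level of in-context exposure: 0 yields a zero-shot
    prompt, 1 a one-shot prompt, and larger values a few-shot prompt.
    """
    lines = ["Task: write one sentence that blends the following semantic frames."]
    for f in frames:
        lines.append(f"- {f.name}: {f.definition}")
    for f1, f2, sentence in exemplars[:n_shots]:
        lines.append(f"Example (blending {f1} and {f2}): {sentence}")
    lines.append("Blended sentence:")
    return "\n".join(lines)

if __name__ == "__main__":
    # The resulting string would be passed to an LLM such as Llama 2.
    print(build_prompt([MOTION, COMMERCE], EXEMPLARS, n_shots=1))
```

Structuring the prompt this way keeps the user-facing choices (which frames to blend, how many exemplars to expose) as simple parameters, which is one plausible way to support the kind of user-selectable, variable-exposure experiments the abstract describes.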