ALPHAQUANT: LLM-DRIVEN AUTOMATED ROBUST FEATURE ENGINEERING FOR QUANTITATIVE FINANCE

Anonymous authors
Paper under double-blind review
ABSTRACT

Feature engineering is critical to predictive modeling, transforming raw data into meaningful features that enhance performance. However, traditional feature engineering is labor-intensive and prone to biases, while automated methods often lack robustness and interpretability. This paper introduces a novel framework that combines large language models (LLMs) with evolutionary optimization to automate robust feature discovery. The framework integrates LLMs for domain-specific feature generation with a rigorous evaluation loop in which machine-learning models are hyper-tuned with time-series cross-validation on historical asset performance to ensure robustness. The key contributions are: (1) an LLM-powered system for generating domain-relevant, interpretable feature extraction functions; (2) an evolutionary illumination process that iteratively refines the feature set based on importance scores from hyper-tuned models; and (3) empirical validation on financial data demonstrating significant improvements in predictive accuracy and feature robustness. The results highlight the potential of LLMs to revolutionize feature engineering, paving the way for interpretable machine-learning models. The set of discovered features is open-sourced for reproducibility of the results.
1 INTRODUCTION
Feature engineering transforms raw data into meaningful variables that enhance model performance, and has long been a key focus in predictive modeling. Traditional feature engineering, however, relies heavily on domain expertise and manual iteration, which is labor-intensive, time-consuming, and prone to errors and biases. Automated feature engineering methods address these challenges by systematically generating features through mathematical operations on raw variables, but they often fail to account for domain-specific nuances, resulting in features that lack robustness and interpretability. These challenges are especially pronounced in domains like finance, where data is noisy, non-stationary, and influenced by extreme events. Features must generalize across time and assets, requiring sophisticated evaluation mechanisms to ensure robustness. Traditional methods often fail to capture domain-specific insights or to ensure feature robustness across assets and market regimes.

Recent advancements in Large Language Models (LLMs) have opened new possibilities for automating feature discovery. LLMs can generate interpretable feature extraction functions tailored to domain requirements, synthesizing knowledge from mathematical principles and financial literature. When paired with evolutionary optimization frameworks, LLMs enable iterative refinement of features to maximize robustness and performance. To ensure robustness, features must be evaluated using advanced cross-validation techniques: time-series cross-validation ensures that features generalize across temporal data splits, while leave-one-out cross-validation evaluates feature performance across assets. By combining these evaluation mechanisms with LLM-driven feature generation and optimization, this work represents a significant advancement in feature engineering.

LLMs such as GPT-4 and LLaMA offer an innovative approach to automating feature generation. These models synthesize domain-specific knowledge and generate creative solutions informed by vast corpora. When coupled with quality-diversity optimization, a paradigm that emphasizes exploration and refinement, LLMs enable robust, interpretable, and scalable feature engineering. This paper introduces a novel framework integrating LLM-driven feature generation with hyper-tuned model evaluation and iterative optimization. Using financial data as a case study, we demonstrate its ability to automate feature engineering while ensuring robustness through cross-validation.
Under review as a conference paper at ICLR 2025

2 RELATED WORK
The advent of large language models has spurred a new line of research in automated feature engineering. Researchers have begun to harness LLMs' few-shot and in-context learning abilities to generate candidate feature transformations. For instance, Zhang et al. (2024) introduced a "Dynamic and Adaptive Feature Generation" method that uses LLMs to generate new feature extraction functions in a controlled, interpretable manner. Similarly, Gong et al. (2024) proposed an "Evolutionary Large Language Model for Automated Feature Transformation," in which an LLM is teamed with evolutionary search strategies to explore the vast space of possible feature-transformation sequences. In these works the LLM serves not merely as a language generator but as a domain-knowledge integrator, capable of leveraging textual and mathematical cues to propose creative and contextually relevant features. In addition, several surveys and frameworks have begun to review how LLMs can be applied throughout machine learning workflows, including data preprocessing, feature engineering, and model selection (Gu et al., 2024). While these works focus broadly on automating ML pipelines, our contribution zeroes in on the feature engineering stage, combining LLM-driven creative generation with quality-diversity optimization to iteratively refine and validate feature sets. Taken together, these strands of research motivate our framework: by coupling the creative and domain-informed capabilities of LLMs with the robust exploration provided by quality-diversity optimization, our approach offers a scalable, interpretable, and high-performing solution for automated feature engineering in challenging domains such as finance.
Recent advances in quantitative finance have also shifted the focus from manually crafted heuristics to data-driven, automated pipelines for discovering predictive signals. Two complementary strands have emerged: one in which LLMs are deployed as alpha miners within trading agents, and another where LLMs facilitate automated feature engineering to transform raw data into robust, interpretable inputs for predictive models. Early work has demonstrated the promise of LLMs in financial applications by positioning them either as direct signal generators (i.e., trading agents) or as alpha miners that produce candidate factors for downstream trading systems. In this paradigm, architectures such as QuantAgent and AlphaGPT employ an inner-outer loop strategy: an LLM first generates candidate alpha expressions based on high-level trader ideas, and an outer loop evaluates these candidates against historical data to iteratively refine the factors (Wang, 2023; 2024). This circumvents the need for manual feature extraction, enabling agents to adapt dynamically to evolving market conditions and to synthesize insights from textual and numerical financial data. Lastly, Kou et al. (2024) have integrated LLMs within multi-agent architectures to mine and validate alphas across diverse modalities, where an evaluation process back-tests candidate alphas under various risk preferences and market scenarios, dynamically selecting and weighting the most robust signals.
Together, these strands of research illustrate a paradigm shift in financial feature engineering and signal discovery. The transition from rigid, expert-driven approaches to flexible, LLM-powered frameworks enables the extraction of richer, more adaptive predictive signals from complex, multi-modal data sources. By leveraging LLMs' creative and domain-informed capabilities, recent work offers a scalable, interpretable, and high-performing solution for automated feature engineering in challenging domains such as finance. These studies demonstrate that combining LLM-driven feature discovery with ensemble evaluation methods can lead to more adaptive and resilient quantitative strategies. Moreover, in the financial domain, Arian et al. (2024) introduce a Stability-Weighted Ensemble Feature Importance method designed specifically for financial applications, demonstrating how feature importance can be enhanced by combining stability selection with ensemble learning.
Figure 1: Feature importance, where darker colors indicate features discovered in later iterations. The plot shows that the proposed method is able to discover more important features over iterations.
3 METHODOLOGY
Our methodology implements an iterative feature engineering workflow for financial datasets by combining automated feature generation driven by large language models (LLMs), parallelized feature extraction and evaluation, and iterative model optimization via AutoML. The goal is to extract novel, predictive financial features from time-series log-return data to enhance risk-adjusted performance measures, such as the Sharpe ratio. The system integrates the following components:
• LLM-Driven Feature Synthesis: An LLM is prompted with a set of few-shot examples (e.g., basic statistical functions) to generate new candidate feature extraction functions in PyTorch, proposing innovative features inspired by established financial literature.

• Parallelized Feature Extraction: Newly generated feature functions are executed in parallel for each asset in a global context using Python's concurrency libraries. Each feature is validated for correct dimensionality, absence of NaN/Inf values, and sufficient variance.

• AutoML-Based Evaluation: Extracted features are evaluated using an AutoML framework (FLAML) with LightGBM as the regression estimator. A state graph orchestrates the cyclic process of feature extraction, evaluation, ranking, and new candidate generation. Each iteration refines the feature set based on previous performance and diagnostic metrics.
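The extraction-and-validation step above can be sketched as follows. This is a minimal illustration, not the released implementation: it uses NumPy and a thread pool in place of the paper's PyTorch pipeline, and the function names and variance threshold are our assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def is_valid_feature(values, expected_len, min_var=1e-8):
    """Check dimensionality, NaN/Inf, and variance of one extracted feature."""
    arr = np.asarray(values, dtype=float)
    if arr.shape != (expected_len,):      # correct dimensionality
        return False
    if not np.all(np.isfinite(arr)):      # no NaN/Inf values
        return False
    if np.var(arr) < min_var:             # sufficient variance across assets
        return False
    return True

def extract_features(feature_fns, assets):
    """Apply each candidate feature function to every asset in parallel,
    keeping only features that pass validation."""
    results = {}
    with ThreadPoolExecutor() as pool:
        for name, fn in feature_fns.items():
            values = list(pool.map(fn, assets))  # one value per asset
            if is_valid_feature(values, expected_len=len(assets)):
                results[name] = values
    return results
```

Here `assets` would be a list of per-asset log-return arrays; a constant or NaN-producing candidate is silently dropped rather than passed to the model.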
3.1 FEATURE GENERATION USING LLM

A key innovation in our approach is the use of LLMs to propose novel feature extraction functions:
• Few-Shot Prompting: A set of basic PyTorch functions (e.g., mean, variance, std) is provided as initial few-shot examples and updated as better features are discovered.

• Error-Aware Generation: The prompt includes a list of previously eliminated functions and a log of execution errors to steer the LLM away from redundant or erroneous features.

• Code Extraction and Compression: The generated code is post-processed using Abstract Syntax Trees (AST) to extract valid function definitions and to remove extraneous comments and docstrings, ensuring that only the essential code is retained.
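The AST-based post-processing can be illustrated with Python's standard `ast` module. The sketch below (requiring Python 3.9+ for `ast.unparse`) shows the idea, not the paper's exact implementation:

```python
import ast

def extract_functions(generated_code: str) -> list:
    """Keep only top-level function definitions from LLM output,
    stripping docstrings; comments are dropped by parsing itself."""
    tree = ast.parse(generated_code)
    cleaned = []
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue  # discard stray statements the LLM may emit
        body = node.body
        # drop a leading docstring expression if present
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            node.body = body[1:] or [ast.Pass()]
        cleaned.append(ast.unparse(node))
    return cleaned
```

Parsing and unparsing also normalizes formatting, which compresses the code that is fed back into subsequent prompts.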
3.2 AUTOML TRAINING AND FEATURE RANKING

The extracted feature set is subsequently assessed with cross-validation using an AutoML framework:
• AutoML Framework: FLAML is used to train a regression model (with LightGBM as the estimator) on the extracted feature matrix. The hyper-parameter tuning is always warm-started.

• Performance Metrics: Model training is carried out using the standard R2 objective. Other performance measures include Mean Absolute Error (MAE), Spearman correlation, and Normalized Discounted Cumulative Gain (nDCG) at specified quantile cutoffs.

• Feature Importance Extraction: Post-training, feature importances are extracted from the LightGBM model and used to rank the features and drive the selection process.
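The evaluate-then-rank step can be sketched as below. To keep the example self-contained we substitute scikit-learn's GradientBoostingRegressor and TimeSeriesSplit for FLAML/LightGBM, so the hyper-parameter search and warm-starting are omitted; only the cross-validation and importance-ranking logic is illustrated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

def evaluate_and_rank(X, y, feature_names, n_splits=4):
    """Cross-validate on temporal splits, then rank features by importance."""
    maes = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = GradientBoostingRegressor(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        maes.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
    # refit on all data to extract importances for the ranking step
    model = GradientBoostingRegressor(random_state=0).fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    ranked = [feature_names[i] for i in order]
    return float(np.mean(maes)), ranked
```

In the actual pipeline the estimator would be the FLAML-tuned LightGBM model, whose `feature_importances_` attribute plays the same role.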
3.3 ITERATIVE WORKFLOW FOR FEATURE ENGINEERING

The overall system is organized as a cyclic state graph that defines a multi-stage iterative workflow:
1. Feature Extraction: The current candidate feature functions are applied to the training data to produce a feature matrix. The top-k features are selected based on feature importances from the AutoML model hyper-tuned with time-series cross-validation.

2. Feature Selection: Features that are underperforming or redundant are eliminated, and the LLM keeps generating new candidate features. The new generation is informed by previous-iteration metadata, including error logs and eliminated functions.
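One pass of this cycle can be sketched as follows; the helper names are illustrative, and `llm_propose` stands in for the LLM call that receives the kept and eliminated features as prompt context.

```python
def refine_feature_set(importances, top_k, min_importance=0.0):
    """Split the current pool into kept top-k features and eliminated ones,
    given a {name: importance} mapping from the tuned model."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    kept = [f for f in ranked[:top_k] if importances[f] > min_importance]
    eliminated = [f for f in ranked if f not in kept]
    return kept, eliminated

def run_iteration(importances, top_k, llm_propose):
    """One pass of the cyclic state graph: select, eliminate, regenerate.
    Returns the next feature pool and the features dropped this round."""
    kept, eliminated = refine_feature_set(importances, top_k)
    new_candidates = llm_propose(kept, eliminated)  # LLM call (stubbed here)
    return kept + new_candidates, eliminated
```

The eliminated list accumulates across iterations and is fed back into the prompt, as described in Section 3.1.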
4 EXPERIMENTS
The experiments were conducted on a real-world financial dataset. Time-series cross-validation was employed to validate and rank features as they evolve. The dataset, comprising 15 years of historical data from 3,246 US stocks and ETFs, was split into rolling windows with a 50% overlap after separating 10% of the data for the out-of-sample test (1.5 years). In our experiments, the test window size is set to 10% of the total time steps, while the training window is three times larger. The blind test set included recent periods of extreme market stress, such as the 2020 COVID-19 market crash, highlighting the robustness and stability of the discovered features under high-volatility conditions. The AutoML training was performed with the Mean Absolute Error (MAE) objective from cross-validation, where folds were organized as grouped rolling windows based on their start date. An example set of discovered features is open-sourced for reproducibility of the experimental results.
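The split described above can be reproduced with simple index arithmetic. The sketch below follows the stated proportions (final 10% held out as the blind test, validation windows of 10% of the time steps, training windows three times larger, 50% overlap between consecutive windows); the exact boundary handling in the paper's code may differ.

```python
def rolling_windows(n_steps, test_frac=0.10, train_mult=3, overlap=0.5):
    """Yield (train_slice, val_slice) index ranges over the in-sample part,
    after reserving the final `test_frac` of steps as the blind test set."""
    holdout_start = int(n_steps * (1 - test_frac))  # blind-test boundary
    val_len = int(n_steps * test_frac)              # validation window
    train_len = train_mult * val_len                # training window
    window = train_len + val_len
    step = max(1, int(window * (1 - overlap)))      # 50% overlap -> half step
    folds, start = [], 0
    while start + window <= holdout_start:
        folds.append((slice(start, start + train_len),
                      slice(start + train_len, start + train_len + val_len)))
        start += step
    blind_test = slice(holdout_start, n_steps)
    return folds, blind_test
```

Grouping folds by their start date, as in our setup, then simply means treating each `(train_slice, val_slice)` pair as one CV group.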
4.1 WORKFLOW CONFIGURATION
• Initial Feature Set: Ten basic statistical functions are provided as seed feature candidates.

• LLM Integration: An LLM is utilized to generate novel candidate features. The prompts are designed to encourage the synthesis of advanced risk-return indicators.

• Iteration Control: The state graph orchestrates three main tasks in each iteration:
  1. Feature Extraction and Scoring: New candidate features are applied to the training data and validated.
  2. Feature Ranking and Selection: AutoML (via FLAML and LightGBM) is used to evaluate and rank features based on predictive performance.
  3. Candidate Generation: New feature candidates are generated by the LLM, with redundant or underperforming features eliminated.

• Time Budget: Each AutoML training session is allocated a time budget of 60 seconds, with warm-starting from previous configurations to accelerate convergence.
We assess the predictive performance of models trained with discovered feature sets using:

• Mean Absolute Error (MAE): Evaluates the average magnitude of errors in predictions.

• Spearman Correlation: Assesses the monotonic relationship between the predicted and future Sharpe ratios.

• Normalized Discounted Cumulative Gain (nDCG): Computed at different quantile cutoffs (nDCG@Q1 and nDCG@Q4) to measure the quality of the feature-driven ranking.
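The ranking metrics can be computed as in the self-contained NumPy sketch below; the relevance-shifting convention for nDCG is one common choice, not necessarily the paper's exact definition.

```python
import numpy as np

def spearman(pred, target):
    """Spearman correlation = Pearson correlation of the ranks
    (valid for tie-free data, as here)."""
    r1 = np.argsort(np.argsort(pred))
    r2 = np.argsort(np.argsort(target))
    return float(np.corrcoef(r1, r2)[0, 1])

def ndcg_at_k(pred, target, k):
    """nDCG over the top-k assets ordered by predicted score, using
    shifted true Sharpe ratios as non-negative graded relevance."""
    rel = target - target.min()
    order = np.argsort(pred)[::-1][:k]              # predicted ranking
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # log-position discounts
    dcg = float(np.sum(rel[order] * discounts))
    ideal = np.sort(rel)[::-1][:k]                  # best possible ordering
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0
```

nDCG@Q1 and nDCG@Q4 then correspond to choosing k as the size of the first or fourth quantile of the asset universe.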
4.2 RESULTS AND ANALYSIS
At each workflow iteration, the AutoML model hyper-tuned with newly discovered features is also evaluated on the blind test set. The experimental results demonstrate the efficacy of our iterative feature engineering framework. Incorporating LLM-generated features leads to a more diverse and predictive feature set. Iterative refinement, combined with AutoML and feature importance ranking, consistently improves predictive performance. Overall, our approach improves forecasting performance (as measured by MAE, Spearman, and nDCG) and provides interpretable insights into the role of individual features in risk-adjusted performance evaluation. The experiments validate the proposed methodology as an effective dynamic financial feature engineering tool. Integrating LLM-driven feature synthesis, robust feature extraction, and iterative AutoML evaluation forms a comprehensive framework that can adapt to the evolving nature of financial data and market conditions.
• Improved Predictive Accuracy: The iterative process leads to progressive reductions in MAE. For example, initial iterations with basic features yielded higher MAE than later iterations, where novel LLM-generated features were incorporated.

• Enhanced Feature Ranking: The top-k features are selected based on feature importances extracted from LightGBM, or optionally obtained from SHAP values.

• Ranking Correlation Analysis: Spearman and nDCG values indicate an increasingly stronger monotonic relationship between feature-driven predictions and blind-test Sharpe ratios.
Figure 2: Mean Absolute Error (MAE) of predictions on the validation and test sets as more features are discovered. The plot shows a tendency for validation and test errors to decrease as more features are discovered.
Figure 3: Spearman correlation and nDCG of model predictions vs. test-set Sharpe ratios as more features are discovered. The plot shows the model's increasing capability to rank future asset performance over iterations.
4.3 CONCLUSION
By integrating LLM-driven feature synthesis with rigorous extraction, evaluation, and iterative refinement, our methodology provides an end-to-end automated framework for financial feature engineering. AutoML with time-series cross-validation ensures that only the most robust predictive features are retained. The proposed method is well-suited for financial environments, where the LLM-driven discovery of novel features can significantly enhance forecasting performance.
REFERENCES

H. R. Arian, S. A. Mousavizade, and L. A. Seco. Stability-weighted ensemble feature importance for financial applications. Working paper, SSRN, 2024. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4905824. Accessed: February 03, 2025.

Nanxu Gong, Chandan K. Reddy, Wangyang Ying, and Yanjie Fu. Evolutionary large language model for automated feature transformation. Preprint, 2024.

Yang Gu, Hengyu You, Jian Cao, and Muran Yu. Large language models for constructing and optimizing machine learning workflows: A survey. 2024.

Zhizhuo Kou, Holam Yu, Jingshu Peng, and Lei Chen. Automate strategy finding with LLM in quant investment. arXiv preprint arXiv:2409.06289, 2024.

Wang et al. AlphaGPT: A human-in-the-loop framework for alpha mining. https://arxiv.org/abs/2408.06361v1, 2023.

Wang et al. QuantAgent: LLM as an alpha miner in financial trading. https://arxiv.org/abs/2408.06361v1, 2024.

Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, and Kunpeng Liu. Dynamic and adaptive feature generation with LLM. arXiv preprint arXiv:2406.03505, 2024.
A APPENDIX

A.1 SYSTEM PROMPT
You are an expert in financial metrics and feature engineering. Generate
PyTorch feature extraction code for a financial dataset.
You aim to engineer predictive features by proposing interestingly novel
statistical risk-return indicators. USE CRITICAL THINKING!

You are deeply familiar with statistical time-series features from
literature, and encouraged to draw inspiration from publications.
Use the knowledge from examples and inspiration from academic literature
to propose new financial analysis feature implementations.

Each feature should be unique, interpretable, and useful for predicting
Sharpe Ratios. Avoid redundancy with the examples provided.

When features are applied to the past log-returns of assets, obtained
indicators should help predicting assets' future sharpe-ratios.
A.2 USER PROMPT
Example features (PyTorch code snippets) are below. You should implement
novel more advanced (better) features that are complementary.

{top_k_examples}

Generate {num_features} new feature extraction functions. Give concise
names to the functions that summarize the feature extracted.
It must take log_returns (torch.Tensor) as the only input parameter that
doesn't have any default value defined in the function body.
Prefer using tensor operations. Return only the PyTorch implementation of
each feature without any additional markdown or comments.

Avoid extracting following features that have been previously tried but
were redundant, come up with better ones. THINK OUTSIDE THE BOX!

{eliminated_list}

Below are the execution errors encountered in previous attempts. Avoid
similar mistakes in future implementations: \n{previous_errors}
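The placeholders in the user prompt are filled programmatically at each iteration. The sketch below illustrates one way to assemble it; the abbreviated template and the function name `build_user_prompt` are ours, not the released code.

```python
# Abbreviated stand-in for the full user-prompt template above.
USER_PROMPT_TEMPLATE = (
    "Example features (PyTorch code snippets) are below.\n"
    "{top_k_examples}\n\n"
    "Generate {num_features} new feature extraction functions.\n\n"
    "Avoid these previously eliminated features:\n{eliminated_list}\n\n"
    "Execution errors encountered in previous attempts:\n{previous_errors}\n"
)

def build_user_prompt(top_k_examples, num_features, eliminated, errors):
    """Fill the prompt template from the current iteration's state."""
    return USER_PROMPT_TEMPLATE.format(
        top_k_examples="\n\n".join(top_k_examples),
        num_features=num_features,
        eliminated_list="\n".join(eliminated) or "(none)",
        previous_errors="\n".join(errors) or "(none)",
    )
```

At each iteration, the kept top-k feature functions, the accumulated elimination list, and the error log from Section 3 supply the three arguments.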