TableLlama: Towards Open Large Generalist Models for Tables
Figure 1: An overview of TableInstruct and TableLlama. TableInstruct includes a wide variety of realistic
tables and tasks with instructions. We make the first step towards developing open-source generalist models for
tables with TableInstruct and TableLlama.
as a large number of candidates for classification and ranking tasks.

In pursuing this goal, we realize there lacks a comprehensive collection of realistic tables and tasks that can support the development and evaluation of generalist models. Therefore, we construct TableInstruct by meticulously selecting representative table-based tasks from widely used datasets, unifying the format for all tasks, and manually annotating instructions. TableInstruct, shown in Table 1, offers the following unique features: (1) Diverse coverage of tables and tasks. TableInstruct boasts a collection of 14 datasets of 11 tasks in total, with both in-domain and out-of-domain evaluation settings. Our training data includes 8 tasks, which are curated from 1.24M tables containing 2.6M instances spanning table interpretation, table augmentation, table-based QA, and table-based fact verification. We choose 8 datasets for these 8 tasks for in-domain evaluation and leave the other 6 datasets for 4 tasks for out-of-domain evaluation. The in-domain training tasks can enable the model to learn more fundamental table understanding abilities such as table interpretation and table augmentation, while we choose tasks that require more high-level reasoning abilities such as table QA and cell description to test the model's generalization ability. This extensive range of tables and diverse tasks not only provides valuable resources for table modeling, but also fosters a more comprehensive evaluation of generalist models. (2) The use of real-world tables and realistic tasks. TableInstruct uses authentic real-world task data instead of the overly simplified synthetic task data used in existing work (Li et al., 2023b). We incorporate a large number of Wikipedia tables and spreadsheets from statistical scientific reports with varied lengths of contents, realistic and complex semantic types from Freebase (Google, 2015) for column type annotation and relation extraction, and a large referent entity corpus with rich metadata from Wikidata (Vrandečić and Krötzsch, 2014) for entity linking. In addition, we include complicated numerical reasoning tasks with hierarchical table structure and existing manually annotated table QA and fact verification tasks. By doing so, we aim to equip models with the capability to cope with realistic and complex table-based tasks.

TableInstruct requires models to accommodate long inputs (Table 1). We adopt LongLoRA (Chen et al., 2023b) based on Llama 2 (7B) (Touvron et al., 2023) as our backbone model, which has been shown efficient and effective in handling long contexts. We fine-tune it on TableInstruct and name our model TableLlama. We conducted extensive experiments under both in-domain and out-of-domain settings. Our experiments show TableLlama has strong capabilities for various in-domain table understanding and augmentation tasks, and also achieves promising performance in generalizing to unseen tasks and datasets.

In summary, our main contributions are:

• We construct TableInstruct, a large-scale instruction tuning dataset with diverse, realistic tasks based on real-world tables. We unify their format and manually annotate instructions to guarantee quality.
[Figure 2(a), Column Type Annotation example: the instruction describes the task; the input serializes the "Pitching leaders" table from the 1958 Nippon Professional Baseball season page (columns: stat, player, team, total); the question lists the candidate types for the "player" column; the response is "sports.pro_athlete, baseball.baseball_player, people.person". The full prompt is reproduced in Figure 3.]
Figure 2: Illustration of three exemplary tasks: (a) Column type annotation. This task is to annotate the selected
column with the correct semantic types. (b) Row population. This task is to populate rows given table metadata and
partial row entities. (c) Hierarchical table QA. For subfigures (a) and (b), we mark candidates with red color in the
“task instruction” part. The candidate set size can be hundreds to thousands in TableInstruct.
Table 1: Statistics of train/test tasks and datasets in our TableInstruct. For each task, we explain its definition and
show an example in Appendix E.
Table QA is to answer questions with tables and optional highlighted cells or passages as evidence. Fact verification is to discriminate whether the tables can support or refute the claims. Dialogue generation is to generate a response grounded on the table and dialogue history. Data-to-text is to generate a description based on the highlighted cells. By choosing the tasks that require models to learn more fundamental table understanding abilities such as table interpretation and table augmentation for training, we hope the model can demonstrate generalization ability on out-of-domain datasets such as high-level table QA and table cell description tasks.

In-domain: The tasks for training the generalist table model include column type annotation, relation extraction, entity linking, row population, schema augmentation, hierarchical table QA, highlighted cells QA, and table fact verification. These tasks require the model to understand the semantics of table columns, the relations between table column pairs, and the semantics of table cells, and require the model to gain reasoning ability to answer table-related questions and verify the facts. For the dataset of each task, we intentionally pick those that enjoy realistic task complexity without simplifying assumptions. For example, column type annotation and relation extraction are multi-choice classification tasks in essence. We use real-world column semantic types and relation types from Freebase (Google, 2015), which contains hundreds of complex choices such as "government.politician.party-government.political_party_tenure.party", as shown in Figure 4 in Appendix E. For entity linking, the referent entities are from real-world Wikidata (Vrandečić and Krötzsch, 2014), which contains hundreds of complex metadata, such as "<2011-12 Melbourne Victory season [DESCRIPTION] Association football club 2011/12 season for Melbourne Victory [TYPE] SoccerClubSeason>", as shown in Figure 5 in Appendix E. For schema augmentation and row population, there are a huge number of candidates that LLMs need to rank. For hierarchical table QA, all the tables are engaged with intricate structures with multi-level column names and row names. In addition, it is intensive in numerical reasoning, which requires LLMs to understand table structure, identify related cells and do calculations. By doing so, we hope to enable LLMs to become truly powerful generalist models that can handle sophisticated table tasks, and TableInstruct can be a realistic benchmark to evaluate LLMs' abilities compared with specially designed table models.

Out-of-domain: A powerful generalist table model is expected to not only demonstrate strong performance on in-domain tasks, but also generalize well to unseen tasks or unseen datasets of the same tasks. We choose tasks such as table QA and cell description that require the model's high-level table understanding and reasoning ability as out-of-domain datasets. We involve HybridQA (Chen et al., 2020b), KVRET (Eric et al., 2017), FEVEROUS (Aly et al., 2021), ToTTo (Parikh et al., 2020), WikiSQL (Zhong et al., 2017) and WikiTQ (Pasupat and Liang, 2015) as 6 out-of-domain datasets to test our model's generalization ability.

2.2 Task Formulation and Challenges

The primary objective of TableInstruct is to design one generalist model for all table-based tasks.
As Figure 2 (a)-(c) shows, each instance in our dataset maps three components, <instruction, table input, question>, to an output. The instruction is manually designed to point out the task and give a detailed task description. We concatenate table metadata such as the Wikipedia page title, section title and table caption with the serialized table as the table input. In the question, we put all the information the model needs to complete the task and prompt the model to generate an answer. For example, for the column type annotation task, as Figure 2 (a) shows, the column named "Player" needs to be annotated with its semantic types. In this format, the "instruction" gives the description of the task. The "input" contains the table-related information. Then we provide the entire candidate pool in the "question" and ask the model to choose one or multiple correct semantic types for this column.
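To make this format concrete, the following minimal Python sketch shows how a table and its metadata could be serialized into the [TLE]/[TAB]/[SEP] input format illustrated in Figure 2. The function and argument names are our own illustration under these assumptions, not the released TableLlama code.

def serialize_instance(instruction, page_title, section_title, caption, header, rows, question):
    # Serialize table metadata and cells in the [TLE] ... [TAB] ... [SEP] style shown in Figure 2.
    table_input = (
        f"[TLE] The Wikipedia page is about {page_title}. "
        f"The Wikipedia section is about {section_title}. "
        f"The table caption is {caption}. "
        f"[TAB] col: | " + " | ".join(header) + " |"
    )
    for i, row in enumerate(rows, start=1):
        table_input += f" [SEP] row {i}: | " + " | ".join(row) + " |"
    # The instruction, serialized table and question form one instance; the answer is the target.
    return {"instruction": instruction, "input": table_input, "question": question}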
Challenges. Since we select realistic tasks and tables, the table length can vary from several to thousands of rows. Besides, for some tasks that are essentially multi-choice classification or ranking, the entire candidate pool can be very large, up to thousands. Furthermore, as the candidates are from real-world Freebase and Wikidata, each candidate is long: for example, "<2011-12 Melbourne Victory season [DESCRIPTION] Association football club 2011/12 season for Melbourne Victory [TYPE] SoccerClubSeason>" is one candidate for entity linking. These characteristics not only make it difficult for the model to learn, but also introduce the challenge of handling long contexts.

3 Experimental Setup
Model Construction. Although a few existing LLMs (Chen et al., 2023a; Tworkowski et al., 2023) can handle longer than 4K contexts, their training time increases quadratically with context length, which becomes very costly for us to further fine-tune them on TableInstruct due to our large data scale. As LongLoRA (Chen et al., 2023b) has been shown to be an effective and efficient technique to train long-context LLMs with shift short attention, we adopt it as our backbone model. Shift short attention splits the context length into several groups and conducts attention in each group individually. The tokens are shifted by half the group size in half of the attention heads to ensure information flow between neighboring groups. For example, LongLoRA can use shift short attention with group size 2048 to approximate a total 8192 context length during training, which leads to less computation cost with similar performance compared to fine-tuning with vanilla attention. We fine-tune LongLoRA on TableInstruct to get our generalist model TableLlama.
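As a rough illustration of the shift-short-attention pattern described above (a minimal sketch of the grouping and shifting only, not the LongLoRA implementation itself), the hidden states can be split into groups, with half of the heads rolled by half a group so neighboring groups overlap:

import torch

def group_for_shift_short_attention(x, group_size):
    # x: (batch, seq_len, num_heads, head_dim); seq_len must be divisible by group_size.
    # Half of the heads are rolled by half a group along the sequence dimension, then
    # attention would be computed independently inside each group.
    bsz, seq_len, num_heads, head_dim = x.shape
    assert seq_len % group_size == 0
    shifted = x.clone()
    shifted[:, :, num_heads // 2:] = torch.roll(
        shifted[:, :, num_heads // 2:], shifts=-group_size // 2, dims=1
    )
    # Reshape so that each group of tokens forms its own attention window.
    return shifted.reshape(bsz * (seq_len // group_size), group_size, num_heads, head_dim)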
Existing SOTA Models. In our evaluation settings, 9 out of 14 SOTA models utilize table pretraining and/or have special model architecture designs for tables. The detailed description of each SOTA model is in Appendix A.
Evaluation Metrics. We follow the above baselines to use their evaluation metrics. For column type annotation, relation extraction and KVRET, we use Micro F1. For entity linking, TabFact, FEVEROUS, HybridQA, WikiSQL and WikiTQ, we use accuracy. For row population and schema augmentation, we use MAP. For HiTab, we use execution accuracy (Zhong et al., 2017). For FeTaQA and ToTTo, we use BLEU (Papineni et al., 2002).
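For the two ranking tasks, MAP is the mean over test queries of the per-query average precision. A minimal sketch of average precision for a single query, written as our own illustration of the standard definition, is:

def average_precision(ranked_candidates, gold_set):
    # ranked_candidates: the model's ranking, best first; gold_set: the correct entities/headers.
    hits, ap = 0, 0.0
    for rank, cand in enumerate(ranked_candidates, start=1):
        if cand in gold_set:
            hits += 1
            ap += hits / rank
    return ap / max(len(gold_set), 1)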
Training and Inference Details. We choose LongLoRA 7B (Chen et al., 2023b), the fully fine-tuning version with an 8K context length limit, as our base model. The fully fine-tuning version replaces the vanilla attention in Llama 2 with shift short attention. We fine-tune the model with the Huggingface transformers library (Wolf et al., 2020). We merge all eight datasets, repeat the three smaller datasets (i.e., FeTaQA, HiTab and TabFact) six times, and randomly shuffle them as our final training data.
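A minimal sketch of this mixing step, assuming each dataset is already a list of serialized instances (names and the helper itself are illustrative):

import random

def build_training_mixture(datasets, upsampled=("FeTaQA", "HiTab", "TabFact"), factor=6, seed=0):
    mixture = []
    for name, examples in datasets.items():
        # Repeat the three smaller datasets six times; keep the others once.
        mixture.extend(examples * (factor if name in upsampled else 1))
    random.Random(seed).shuffle(mixture)
    return mixture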
We use a learning rate of 2e-5 and set the batch size at 3. We train the model in a streaming fashion on 48 A100 80GB GPUs and use a cosine scheduler with a 3% warm-up period for 2 epochs. To efficiently train the model, we employ DeepSpeed training with the ZeRO-2 stage (Rajbhandari et al., 2020). For both training and inference, we set the input length as 8192. For inference on TableLlama, as different tasks have different lengths of the ground truth, we use 64 as the output length for column type annotation, relation extraction, entity linking, HiTab, TabFact, FEVEROUS, HybridQA, WikiSQL and WikiTQ, 128 for schema augmentation, FeTaQA, KVRET and ToTTo, and 512 for row population. For column type annotation and entity linking, we uniformly sample a subset from the original test data as our test set due to the large test size. For row population, we filter out the examples with more than 500 candidate entities from the original test set and randomly sample a subset as our test set. For all the downsampled test sets, we reproduce the SOTA results using the SOTA model.
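Assuming the Huggingface Trainer is used, the hyperparameters above correspond roughly to a configuration like the following sketch; the output path, precision choice and DeepSpeed config file name are assumptions, not taken from the released training scripts.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tablellama-checkpoints",   # hypothetical output path
    per_device_train_batch_size=3,
    learning_rate=2e-5,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    deepspeed="ds_config_zero2.json",      # ZeRO-2 stage config; file name assumed
    bf16=True,                              # precision choice is an assumption
)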
For closed-source LLMs, we use the gpt-4-1106-preview version for GPT-4, which is the latest version that supports 128K context and reports the best performance. For GPT-3.5, we use the gpt-3.5-turbo-1106 version, which supports 16K context.
In-domain Evaluation
Datasets Metric Base TableLlama SOTA GPT-3.5 GPT-4§
Column Type Annotation F1 3.01 94.39 94.54*† (Deng et al., 2020) 30.88 31.75
Relation Extraction F1 0.96 91.95 94.91*† (Deng et al., 2020) 27.42 52.95
Entity Linking Accuracy 31.80 93.65 84.90*† (Deng et al., 2020) 72.15 90.80
Schema Augmentation MAP 36.75 80.50 77.55*† (Deng et al., 2020) 49.11 58.19
Row Population MAP 4.53 58.44 73.31*† (Deng et al., 2020) 22.36 53.40
HiTab Exec Acc 14.96 64.71 47.00*† (Cheng et al., 2022a) 43.62 48.40
FeTaQA BLEU 8.54 39.05 33.44 (Xie et al., 2022) 26.49 21.70
TabFact Accuracy 41.65 82.55 84.87* (Zhao and Yang, 2022) 67.41 74.40
Table 2: In-domain evaluation results. “Base”: LongLoRA model w/o fine-tuning on TableInstruct; “*”: w/
special model architecture design for tables/tasks; “†”: w/ table pretraining; “§”: for GPT-4, we uniformly sample
500 examples from test set for each task due to limited budget.
4 Result Analysis

4.1 Main Results

In-domain Results. As Table 2 shows, we train TableLlama on eight table-based tasks and evaluate it on their test sets as the in-domain results. Due to the special semi-structured nature of tables, for most table-based tasks, existing work achieves SOTA results by using pretraining on large-scale tables and/or special model architecture design tailored for tables. Nonetheless, we observe that:

By simply fine-tuning a large language model on TableInstruct, TableLlama can achieve comparable or even better performance on almost all the tasks without any table pretraining or special table model architecture design. For most of the tasks, the performance gap is within 3 absolute points, except for row population. For entity linking, schema augmentation, HiTab and FeTaQA, TableLlama can exceed the SOTA performance by up to 17.71 absolute points. This demonstrates that empowering open-source LLMs with more powerful table understanding abilities via instruction tuning can be a promising research direction to further explore.

TableLlama displays advantages in table QA tasks. HiTab and FeTaQA are two table question answering tasks we include for training. By comparing the results, we found that TableLlama can surpass the SOTA by 5.61 points for FeTaQA and 17.71 points for HiTab, which is full of numerical reasoning on tables. As LLMs have been shown superior in interacting with humans and answering questions, this indicates that the existing underlying strong language understanding ability of LLMs may be beneficial for such table QA tasks despite the semi-structured nature of tables.

For entity linking, which requires the model to link the mention in a table cell to the correct referent entity in Wikidata, TableLlama also presents superior performance with an 8-point gain over the SOTA. Since the candidates are composed of referent entity names and descriptions, we hypothesize LLMs have certain abilities to understand the descriptions, which help identify the correct entities.

Row population is the only task where TableLlama has a large performance gap compared to the SOTA. Here we provide a large number of candidates for the model to rank given the table metadata and the seed row entity. By analyzing the errors, we found that the model can easily identify the entities containing similar numbers in sequence, such as the first example shown in Table 6 in Appendix D. However, for entities that share high similarities, as the second example in Table 6 shows, the target row entities are the competitions in which "Oleg Veretelnikov" got achievements. To correctly populate the entities from the large number of given candidates highly related to "competitions", the model needs to understand the inherent relation between the athlete and each given candidate, which is still challenging for the current model.

Out-of-domain results. We evaluate TableLlama on six out-of-domain datasets. We observe that:

By comparing with the base model, TableLlama can achieve a gain of 5-44 points on the 6 out-of-domain datasets, which demonstrates TableInstruct can enhance the model's generalization ability. By learning from the table-based training tasks, the model has acquired essential underlying table understanding ability, which can be transferred to other table-based tasks/datasets and facilitate their performance. Among these 6 datasets, we found …
Out-of-domain Evaluation
Datasets Metric Base TableLlama SOTA ∆Base GPT-3.5 GPT-4§
FEVEROUS Accuracy 29.68 73.77 85.60 (Tay et al., 2022) +44.09 60.79 71.60
HybridQA Accuracy 23.46 39.38 65.40* (Lee et al., 2023) +15.92 40.22 58.60
KVRET Micro F1 38.90 48.73 67.80 (Xie et al., 2022) +9.83 54.56 56.46
ToTTo BLEU 10.39 20.77 48.95 (Xie et al., 2022) +10.38 16.81 12.21
WikiSQL Accuracy 15.56 50.48 92.70 (Xu et al., 2023b) +34.92 41.91 47.60
WikiTQ Accuracy 29.26 35.01 57.50† (Liu et al., 2022) +5.75 53.13 68.40
Table 3: Out-of-domain evaluation results. “Base”: LongLoRA model w/o fine-tuning on TableInstruct; “*”: w/
special model architecture design for tables/tasks; “†”: w/ table pretraining; “§”: for GPT-4, we uniformly sample
500 examples from test set for each task due to limited budget. We put the SOTA performances here in grey for
reference and note that they were achieved under full-dataset training for each task while TableLlama is zero-shot.
Table 4: Transfer between different datasets. Bold numbers are the best results for each evaluation dataset. For
models trained on schema augmentation (ScheAug) and row population (RowPop), their predictions on other
datasets tend to repeat the candidates in the training data, which means they cannot generalize to other datasets, and
hence we use “-” to represent their performances.
Compared with the models individually trained on two table-based QA datasets (i.e., HiTab and FeTaQA), we can see TableLlama achieves better zero-shot performance. This indicates that including the other tasks (i.e., TableInstruct) to train the model can further enhance the model's underlying table question answering ability.

Individually fine-tuning models on tasks that are highly different from others tends to make models overfit and hardly generalize to others. As Table 4 shows, the models individually fine-tuned on 4 tasks (column type annotation, relation extraction, entity linking and TabFact) tend to have weaker performance when evaluated on other tasks. We hypothesize that these four tasks are highly different from the others, so a model individually trained on such tasks will overfit to the task itself, thus becoming hard to generalize to other unseen tasks.

5 Related Work

Table Representation Learning. Given the vast amount of knowledge stored in tables, various table-based tasks have been proposed (Pujara et al., 2021), such as column type annotation (Hulsebos et al., 2019), row population (Zhang and Balog, 2017), table QA (Sun et al., 2016; Pasupat and Liang, 2015; Cheng et al., 2022b; Nan et al., 2022), etc. In order to handle semi-structured tables, existing work puts effort into designing special model architectures, such as TURL with structure-aware attention (Deng et al., 2020), TUTA with tree-based attention (Wang et al., 2021) and TaBERT with a vertical self-attention mechanism (Yin et al., 2020); or designing special encodings such as table position encoding (Herzig et al., 2020; Wang et al., 2021) and numerical encoding (Wang et al., 2021) to better encode the table structure and infuse more information into the neural architecture. In addition, some work focuses on table pretraining (Liu et al., 2022; Yin et al., 2020; Deng et al., 2020; Iida et al., 2021) to encode knowledge in large-scale tables. However, although such existing works have shown promising progress, they are still data-specific and downstream task-specific, which requires special designs tailored for tables and table-based tasks.

Our work proposes TableInstruct to unify different table-based tasks and develops a one-for-all LLM, TableLlama, to reduce those extra efforts during modeling. This high-level insight is similar to UnifiedSKG (Xie et al., 2022), which unifies a diverse set of structured knowledge grounding tasks into a text-to-text format. However, UnifiedSKG deals with different knowledge sources such as databases, knowledge graphs and web tables and does not explore instruction tuning, while we focus on a wide range of realistic tasks based on real-world tables via instruction tuning. In addition, a concurrent work (Li et al., 2023b) synthesizes diverse table-related tasks and finetunes closed-source LLMs such as GPT-3.5 via instruction tuning. Compared to theirs, we collect more realistic and complex task data such as HiTab as well as classification and ranking tasks with candidates from Freebase and Wikidata, and develop open-source LLMs for table-based tasks. We believe both our constructed high-quality table instruction tuning dataset and the trained model can be valuable resources for facilitating this line of research.

Instruction Tuning. Instruction tuning that trains
LLMs using <instruction, output> pairs in a supervised fashion is a crucial technique to enhance the capabilities and controllability of LLMs (Chung et al., 2022; Wang et al., 2022; Mishra et al., 2022). The instructions serve to constrain the model's outputs to align with the desired response characteristics or domain knowledge and can help LLMs rapidly adapt to a specific domain without extensive retraining or architecture designs (Zhang et al., 2023). Therefore, different instruction tuning datasets have been proposed to guide LLMs' behaviors (Wang et al., 2022; Honovich et al., 2022; Longpre et al., 2023; Xu et al., 2023a; Yue et al., 2024). Different instruction tuning models such as InstructGPT (Ouyang et al., 2022), Vicuna (Zheng et al., 2023) and Claude2 have emerged and demonstrate boosted performance compared with the pre-trained models. In addition, instruction tuning has been applied to different modalities such as images, videos and audio (Li et al., 2023a) and has shown promising results. This signals that instruction tuning can be a promising technique to enable large pre-trained models to handle various tasks. However, how to utilize instruction tuning to guide LLMs to complete table-based tasks is still underexplored. Our work fills this gap by constructing a high-quality table instruction tuning dataset, TableInstruct, which covers large-scale diverse and realistic tables and tasks to enable both modeling and evaluation. We also release TableLlama, an open-source LLM-based generalist model fine-tuned on TableInstruct, to promote this avenue of research.

6 Conclusion

This paper makes the first step towards developing open-source large generalist models for a diversity of table-based tasks. Towards that end, we construct TableInstruct and develop the first open-source generalist model for tables, TableLlama. We evaluate under both in-domain and out-of-domain settings, and the experiments show that TableLlama has gained strong table understanding ability and generalization ability.
Limitations

… which are not included in TableInstruct. Therefore, even if TableLlama has demonstrated the generalization ability on different out-of-domain datasets and tasks, the model's performance may vary based on the complexity and specifics of the new unseen table tasks and datasets. As we have made the first step towards building an open large generalist model for tables, we encourage future work to further explore this line of research and to further enhance the model's generalization ability for tables.

Acknowledgements

The authors would like to thank all members of the OSU NLP group for providing feedback about the project. This research was sponsored in part by NSF IIS-1815674, NSF CAREER #1942980, and NSF OAC-2112606. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.

References

Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), pages 1–13, Dominican Republic. Association for Computational Linguistics.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023a. Extending context window of large language models via positional interpolation.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020a. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations.
Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained representations of tabular data. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3446–3456, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1173–1186, Online. Association for Computational Linguistics.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.

Jay Pujara, Pedro Szekely, Huan Sun, and Muhao Chen. 2021. From tables to knowledge: Recent advances in table understanding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 4060–4061.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models.

Dominique Ritze, Oliver Lehmberg, and Christian Bizer. 2015. Matching html tables to dbpedia. In Proceedings of the 5th international conference on web intelligence, mining and semantics, pages 1–6.

Huan Sun, Hao Ma, Xiaodong He, Wen-tau Yih, Yu Su, and Xifeng Yan. 2016. Table cell search for question answering. In Proceedings of the 25th International Conference on World Wide Web, pages 771–782.

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, et al. 2022. Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.

Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. 2023. Focused transformer: Contrastive training for context scaling.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, and Dongmei Zhang. 2021. Tuta: Tree-based transformers for generally structured table pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1780–1790.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I Wang, et al. 2022. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 602–631.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023a. Wizardlm: Empowering large language models to follow complex instructions.

Kuan Xu, Yongbo Wang, Yongliang Wang, Zujie Wen, and Yang Dong. 2023b. Sead: End-to-end text-to-sql generation with schema-aware denoising.

Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online. Association for Computational Linguistics.

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MAmmoTH: Building math generalist models through hybrid instruction tuning. In The Twelfth International Conference on Learning Representations.

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. 2023. Instruction tuning for large language models: A survey.

Shuo Zhang and Krisztian Balog. 2017. Entitables: Smart assistance for entity-focused tables. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pages 255–264.

Guangzhen Zhao and Peng Yang. 2022. Table-based fact verification with self-labeled keypoint alignment. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1401–1411, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

A Existing SOTA Models

TURL (Deng et al., 2020) is an encoder-based BERT-like model pre-trained on 570K tables. Though TURL has shown SOTA performance on various table tasks such as column type annotation, relation extraction, entity linking, row population and schema augmentation, it requires fine-tuning task-specific modules on labeled data. The SOTA method for HiTab builds on 1) TUTA (Wang et al., 2021), which uses tree attention as the encoder to capture table structures, and 2) FORTAP (Cheng et al., 2022a), which leverages spreadsheet formulas for table pre-training to better handle numerical reasoning. The SOTA method for TabFact designs a self-labeled keypoint alignment (Zhao and Yang, 2022) to align salient evidence and aggregate essential information between the statement and table. For HybridQA, the SOTA method MAFiD (Lee et al., 2023) deploys special fusion in the decoder and uses a gated cross-attention layer to enhance the reasoning ability on tables. The SOTA method for WikiTQ is TAPEX (Liu et al., 2022), which fuses table pre-training by learning a neural SQL executor over a synthetic corpus. The SOTA method for WikiSQL uses two denoising objectives and a clause-sensitive execution guided (EG) decoding strategy to generate better SQL and then get the answer (Xu et al., 2023b). For FeTaQA, KVRET and ToTTo, the SOTA results come from T5-3B fine-tuned on their own individual training data (Xie et al., 2022). For FEVEROUS, the SOTA is from a 20B large language model: FLAN UL2 (Tay et al., 2022).

B More details about TableInstruct

B.1 Data Selection

We choose the datasets and tasks based on three criteria: diversity, realisticness and reliability.
• Diversity: we hope to cover table-based tasks
as comprehensively as possible both in the
NLP community and database community.
That’s why we include 14 datasets of 11 tasks.
• Realisticness: we include table sources from Wikipedia tables and National Science Foundation reports (e.g., https://www.nsf.gov/statistics/2019/nsf19319/), which makes sure the table types are realistic and include both simple tables and hierarchical tables with complex table structures.
• Reliability: we compile existing datasets that are widely used in the NLP community and database community.

We split TableInstruct into in-domain (for training and evaluation) and out-of-domain (for evaluation) sets based on three constraints:

• to make the tasks in the training and out-of-domain evaluation set as disjoint as possible;

• if there are two datasets for the same task, we will divide them into the training set and the out-of-domain evaluation set;

• since tables have special two-dimensional
structures, we need the model to gain fundamental table understanding abilities, i.e., to recognize the relations of cells within and among different columns and rows, and to correlate the headers and row names with the corresponding columns and rows. So
we mainly select different table interpretation
and table augmentation tasks to encourage the
model to understand table structures. In addi-
tion, we try to engage the model with strong
numerical reasoning ability, open-ended table
QA and fact verification ability, so we include
HiTab, FeTaQA and TabFact for training as
well. For out-of-domain tasks, we mainly test
the more high-level ability to see the model’s
generalization. For example, the table ques-
tion answering datasets in the training set are
two types: one is full of numerical reasoning
on hierarchical tables and the other is to gener-
ate open-ended answer based on highlighted
table cells. We hope the learned table QA abil-
ity can transfer to different kinds of unseen
table QA tasks such as adding extra compo-
nents (passages or dialogues, etc) as evidence
and letting the model infer the answer from
both tables and added components.
B.2 Data Annotation
The raw tables in our collected datasets are stored
in JSON, CSV or text files. We mainly annotate
instructions and questions based on the metadata
of each task, serialize the table format and put the
ground truth as the response (more details and example cases are in Appendix E). We sampled 30 instances for each task to double check the data and make sure there are no errors. We also have two annotators to do the cross-checking.
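As a concrete sketch of this annotation step, a final training example can be assembled with the prompt template used throughout Appendix E; the helper function below is our own illustration, not the released data-construction code.

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Question:\n{question}\n\n"
    "### Response:\n"
)

def make_training_example(instruction, table_input, question, answer):
    # The prompt holds the annotated instruction, serialized table and question;
    # the ground-truth answer is kept as the response/target.
    prompt = PROMPT_TEMPLATE.format(instruction=instruction, input=table_input, question=question)
    return {"prompt": prompt, "response": answer}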
C More detailed statistics of TableInstruct

Table 5 shows more detailed statistics of TableInstruct in terms of the average word count of different parts of the datasets (i.e., instruction, input, question and response), table size (average column size and row size per table), table type (Wikipedia tables or NSF reports), task type (ranking or classification) and whether the tables are hierarchical or not.
Table 5: More detailed statistics of TableInstruct in terms of the average word count of different parts of the
datasets (i.e., instruction, input, question and response), table size (average column size and row size per table), table
type (Wikipedia tables or NSF reports), task type (ranking or classification) and whether the tables are hierarchical
or not. ’Y’ indicates ’Yes’ and ’N’ indicates ’No’.
D Case Study
Table 6: Case study for row population task. “Query Caption" refers to the table metadata such as Wikipedia page
title and table caption. “AP" means average precision.
E Example Prompts
Column Type Annotation
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a column type annotation task. The goal for this task is to choose the correct types for one selected column of the
table from the given candidates. The Wikipedia page, section and table caption (if any) provide important information
for choosing the correct column types.
### Input:
[TLE] The Wikipedia page is about 1958 Nippon Professional Baseball season. The Wikipedia section is about Central
League. The table caption is Pitching leaders. [TAB] col: | stat | player | team | total | [SEP] row 1: | Wins | Masaichi
Kaneda | Kokutetsu Swallows | 31| [SEP] row 2: | Losses | Noboru Akiyama | ...
### Question:
The column ’player’ contains the following entities: <Masaichi Kaneda>, <Noboru Akiyama>, etc. The column type
candidates are: tv.tv_producer, astronomy.star_system_body, location.citytown, sports.pro_athlete, biology.organism,
medicine.muscle, baseball.baseball_team, baseball.baseball_player, aviation.aircraft_owner, people.person, ... What are
the correct column types for this column (column name: player; entities: <Masaichi Kaneda>, <Noboru Akiyama>, etc)?
### Response:
sports.pro_athlete, baseball.baseball_player, people.person.
Figure 3: Column type annotation task. This task is to annotate the selected column with the correct semantic
types. We mark candidates with red color in the "task instruction" part. The candidate set size can range from hundreds to thousands in TableInstruct.
Relation Extraction
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a relation extraction task. The goal for this task is to choose the correct relations between two selected columns
of the table from the given candidates. The Wikipedia page, section and table caption (if any) provide important
information for choosing the correct relation types.
### Input:
[TLE] The Wikipedia page is about Yukon Legislative Assembly. The Wikipedia section is about Current members.
[TAB] col: | | name | party | riding | row 1: | | Kevin Barr | New Democratic Party | Mount Lorne-Southern Lakes | [SEP]
row 2: | | Brad Cathers | ...
### Question:
The two selected column names are: <(name),(party)>. The entity pairs for these two columns are:
<(Kevin Barr),(New Democratic Party)>, <(Brad Cathers),(Yukon Party)>, <(Currie Dixon),(Yukon Party)>,
<(Darius Elias),(Yukon Party)>, ... The relation type candidates are: location.location.contains, avia-
tion.airline.hubs, film.film.written_by, time.event.instance_of_recurring_even , people.person.place_of_birth, mu-
sic.composer.compositions, sports.sports_team.roster- sports.sports_team_roster.player, location.location.containedby,
soccer.football_player.statistics- soccer.football_player_stats.team... What are the correct relation types for the two
selected columns (column names: <(name),(party)>. entity pairs: <(Kevin Barr),(New Democratic Party)>, <(Brad
Cathers),(Yukon Party)>, <(Currie Dixon), (Yukon Party)>, <(Darius Elias),(Yukon Party)>, etc)?
### Response:
government.politician.party-government.political_party_tenure.party.
Figure 4: Relation extraction task. This task is to annotate the selected column pairs with the correct relations. We
mark candidates with red color in the "task instruction" part. The candidate set size can range from hundreds to thousands in TableInstruct.
Entity Linking
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction: This is an entity linking task. The goal for this task is to link the selected entity mention in the table
cells to the entity in the knowledge base. You will be given a list of referent entities, with each one composed of an
entity name, its description and its type. Please choose the correct one from the referent entity candidates. Note that the
Wikipedia page, Wikipedia section and table caption (if any) provide important information for choosing the correct
referent entity.
### Input: [TLE] The Wikipedia page is about A-League all-time records. The Wikipedia section is about Average
season attendances. [TAB] col: | season | league average | total gate receipts | highest club | average | lowest club |
average | row 1: | 2005-06 | 10,955 | 920,219 | Sydney FC | 16,669 | New Zealand Knights | 3,909 | [SEP] row 2: |
2006-07 | 12,927 | ...
### Question: The selected entity mention in the table cell is: Melbourne Victory. The column name for ’Melbourne
Victory’ is highest club. The referent entity candidates are: <Melbourne Victory FC W-League [DESCRIPTION] None
[TYPE] SoccerClub>, <2016-17 Melbourne Victory FC season [DESCRIPTION] None [TYPE] SoccerClubSeason>,
<2011-12 Melbourne Victory season [DESCRIPTION] Association football club 2011/12 season for Melbourne Victory
[TYPE] SoccerClubSeason>, ... What is the correct referent entity for the entity mention ’Melbourne Victory’ ?
### Response: <Melbourne Victory [DESCRIPTION] association football team from Australia [TYPE] SoccerClub>.
Figure 5: Entity linking task. This task is to link the selected entity mention in the table cells to the entity in the
knowledge base. We mark candidates with red color in the "task instruction" part. The candidate set size can range from hundreds to thousands in TableInstruct.
Row Population
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction: This is a table row population task. The goal of this task is to populate the possible entities
of the selected column for a table, given the Wikipedia page title, Wikipedia section title, table caption (if any)
and table headers. You will be given a list of entity candidates. Please rank them so that the most likely entities come first.
### Input: [TLE] The Wikipedia page is about NBA conference finals. The Wikipedia section is about eastern
conference finals. The table headers are: | year | champion | coach | result | runner-up | coach |. You need to populate the
column: year. [SEED] The seed entity is <1971_NBA_playoffs>.
Figure 6: Row population task. This task is to populate the possible entities of the selected column for a table given
a partial table and table metadata. We mark candidates with red color in the "task instruction" part. The candidate set size can range from hundreds to thousands in TableInstruct.
Schema Augmentation
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a table schema augmentation task. The goal of this task is to populate the possible headers for a table, given the
table caption and the seed table header. You will be given a list of table header candidates. Please rank them so that the
most likely headers come first.
### Input:
[TLE] The table caption is 2010-11 rangers f.c. season. [SEED] The seed table header is <competition>.
### Question:
The header candidates are: <from>, <fee (£)>, <opponents>, <final position / round>, <started round>, <player>,
<fee>, <scorers>, <position (s)>, <name>, <venue>. Please rank the headers in the header candidates.
### Response:
<town/city>, <summary>, <suburb>, <county>, <region>, <district>, <stadium>, <city>, <home team>, <remarks>,
<city name>, <film name>
Figure 7: Schema augmentation task. This task is to populate the possible headers for a table, given the table
caption and the seed table header. The targets in this case are: <started round>, <final position/round>.
Hierarchical Table QA
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction: This is a hierarchical table question answering task. The goal for this task is to answer the given
question based on the given table. The table might be hierarchical.
### Input: [TLE] The table caption is department of defense obligations for research, development, test, and evaluation,
by agency: 2015-18. [TAB] | agency | 2015 | 2016 | 2017 | 2018 | [SEP] | department of defense | department of defense |
department of defense | department of defense | department of defense | [SEP] | rdt&e | 61513.5 | 69306.1| 70866.1 |
83725 | [SEP] | total research | 6691.5 | 7152 | 7178 | 7652.7 | [SEP] | basic research | 2133.4 | 2238.7 | 2110.1 | 2389.9 |
[SEP] | defense advanced research projects agency | defense advanced research projects agency | defense advanced
research projects agency | ...
### Question: How many dollars are the difference for total research of department of the air force increase between
2016 and 2018?
### Response:
142.3.
Figure 8: Hierarchical table QA task. This task is to answer the question based on the tables with complex
hierarchical structures.
Highlighted Cells QA
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a free-form table question answering task. The goal for this task is to answer the given question based on the
given table and the highlighted cells.
### Input:
[TLE] The Wikipedia page title of this table is Holly Dunn. The Wikipedia section title of this table is Singles. [TAB] |
Year | Single | Peak chart positions | Peak chart positions | Album | [SEP] | Year | Single | US Country | CAN Country |
Album | [SEP] | 1985 | ...
### Question:
The highlighted cells of the table are: [HIGHLIGHTED_BEGIN] [1988], [Across the Rio Grande in 1988 included the
singles \"That’s What Your Love Does to Me\" and \"(It’s Always Gonna Be) Someday\".], [\"That’s What Your Love
Does to Me\"], [Across the Rio Grande], [1988], [\"(It’s Always Gonna Be) Someday\"], [Across the Rio Grande]
[HIGHLIGHTED_END] What singles were Included in Across the Rio Grande in 1988?
### Response:
Across the Rio Grande in 1988 included the singles \"That's What Your Love Does to Me\" and \"(It's Always Gonna Be) Someday\".
Figure 9: Highlighted cells QA task. This task is to answer the question based on the tables with highlighted cells.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a table fact verification task. The goal of this task is to distinguish whether the given statement is entailed or
refuted by the given table.
### Input:
[TLE] The table caption is about tony lema. [TAB] | tournament | wins | top - 5 | top - 10 | top - 25 |
events | cuts made [SEP] | masters tournament | 0 | 1 | 2 | 4 | 4 | 4 | [SEP] | us open | 0 | 2 | 3 | 4 | 6 | 5 | [SEP] |
the open championship | 1 | 2 | 2 | 2 | 3 | 3 | [SEP] | pga championship | 0 | 0 | 1 | 2 | 5 | 4 | [SEP] | totals | 1 | 5 | 8 | 12 | 18 | 16 |.
### Question:
The statement is: <tony lema be in the top 5 for the master tournament, the us open, and the open championship>. Is it
entailed or refuted by the table above?
### Response:
Entailed.
Figure 10: Table fact verification task. This task is to discriminate whether the claim can be entailed or refuted by
the given table.
Hybrid Question Answering
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a hybrid question answering task. The goal of this task is to answer the question given tables and passages.
### Input:
[TAB] col: | rank | player | team (s) by season | carries | yards | average | [SEP] | 1 | emmitt smith | dallas cowboys ( 1990
- 2002 ) arizona cardinals ( | 4,409 | 18,355 | 4.2 | [SEP] | 3 | frank gore | san francisco 49ers ( 2005 - 2014 ) indianapolis
colts | 3,548 | 15,347 | 4.3 | [SEP] | ...
### Question:
The passage may also provide related context. You can refer to both the passages and the table when you answer the
question. Passages: emmitt smith: smith led the league in rushing and won the super bowl in the same year three times
( 1992 , 1993 , and 1995 ) when to that point it had never been done . | walter payton: walter jerry payton ( july 25 ,
1954 - november 1 , 1999 ) was an american professional football player who was a running back for the chicago bears
of the national football league ( nfl ) for thirteen seasons . | ... The question: what is the middle name of the player with
the second most national football league career rushing yards?
### Response:
Jerry.
Figure 11: HybridQA task. This task is to answer the question based on the table and passages.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a dialogue response generation task grounded on tables. The goal of this task is to generate response based
on the given dialogue history and the given table. The dialogues are grounded through underlying tables and span
three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and
point-of-interest navigation.
### Input:
col : poi | distance | traffic_info | poi_type | address row 1 : chevron | 5_miles | moderate_traffic | gas_station |
783_arcadia_pl row 2 : town_and_country | 5_miles | no_traffic | shopping_center | 383_university_ave
### Question:
The dialogue history is: <what is the address ? || taking you to chevron | that s good ! please pick the quickest route to
get there and avoid all heavy_traffic ! | there is a chevron | what gas_station are here ?>. Please generate the response
based on the given table and the given dialogue history.
### Response:
783_arcadia_pl is the address for chevron gas_station.
Figure 12: Table grounded dialogue generation task. This task is to generate the response based on the given
table and the dialogue history.
Highlighted Cells Description
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a highlighted cells description task. The goal of this task is to generate the language description given table cells.
### Input:
<page_title> List of Governors of South Carolina </page_title> <section_title> Governors under the Constitution of
1868 </section_title> <table> <cell> 76 <col_header> # </col_header> <col_header> 74 </col_header> <col_header>
75 </col_header> </cell> <cell> Daniel Henry Chamberlain <col_header> Governor </col_header> <row_header>
76 </row_header> </cell> <cell> December 1, 1874 <col_header> Took Office </col_header> <row_header> 76
</row_header> </cell> </table>.
### Question:
Please generate one natural language description to describe the given highlighted table cells.
### Response:
Daniel Henry Chamberlain was the 76th Governor of South Carolina from 1874.
Figure 13: Highlighted cells description task. This task is to generate the language description for the highlighted
table cells.
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a table fact verification task. The goal of this task is to distinguish whether the given statement is entailed or
refuted by the given table.
### Input:
[TAB] col: | logical system | lindenbaum–tarski algebra | [SEP] | classical sentential logic | boolean algebra | [SEP] |
intuitionistic propositional logic | heyting algebra | [SEP] | ...
### Question:
The statement is: <algebraic logic has five logical system and lindenbaum–tarski algebra which includes physics
algebra and nodal algebra (provide models of propositional modal logics).>. Is it entailed or refuted by the table above?
If you think the current information can not provide enough evidence for determining it, please choose ’not enough
info’, otherwise please choose the answer from ’supports’ or ’refutes’.
### Response:
Refutes.
Figure 14: Table fact verification task. This task is to discriminate whether the claim can be entailed or refuted by
the given table.
Table QA (WikiSQL)
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a table QA task. The goal of this task is to answer the question given the table.
### Input:
[TAB] col: | player | no. | nationality | position | years in toronto | school/club team | [SEP] | aleksandar radojević |
25 | serbia | center | 1999-2000 | barton cc (ks) | [SEP] | shawn respert | 31 | united states | guard | 1997-98 | michigan
state | [SEP] | ...
### Question:
What is terrence ross’ nationality?
### Response:
United states.
Figure 15: Table QA task. This task is to answer the question based on the given table.
Table QA (WikiTQ)
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that
appropriately completes the request.
### Instruction:
This is a table QA task. The goal of this task is to answer the question given the table.
### Input:
[TAB] col: | series # | season # | title | notes | original air date | [SEP] | 1 | 1 | "the charity" | alfie, dee dee, and melanie are
supposed to be helping | october 15, 1994 | [SEP] | 2 | 1 | "the practical joke war" | alfie and goo unleash harsh practical
jokes on dee dee | october 22, 1994 | [SEP] | ...
### Question:
Alfie’s birthday party aired on january 19. What was the airdate of the next episode?
### Response:
January 26, 1995.
Figure 16: Table QA task. This task is to answer the question based on the given table.