Project 3
Introduction:
Large Language Models (LLMs) are algorithms trained on massive datasets of text and code, allowing them to understand and generate human-like language. Among these, two models stand out due to their unique capabilities and applications: Galactica and Codex. Each of these models has been designed with distinct strengths tailored to specific domains and applications.
Objective of the Study:
The objective of this study is to compare Galactica and Codex, identifying the applications and specific use cases for which each model is best suited, and examining what these models mean for various industries and fields of research, providing insights into their respective strengths and limitations. By achieving this goal, the study hopes to give a thorough and nuanced comparison of the two models and to offer insightful knowledge to researchers and practitioners.
Literature Review:
Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights, a problem Galactica was designed to address by organizing and reasoning over scientific knowledge [4]. Codex, by contrast, specializes in programming and is available through OpenAI's API [7]. However, Codex's focus on programming languages limits its general knowledge [7]. Additionally, research suggests that accuracy and control over generated code can vary, and human oversight remains crucial [8]. Finally, access to Codex and details about its inner workings are currently limited [9]. Looking ahead, potential areas of development include expanding its capabilities to more languages and purposes, and exploring how Codex can best collaborate with human developers.
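To make the access route above concrete, the sketch below shows how a completion might be requested from a Codex model. It assumes the legacy openai Python SDK (pre-1.0) and the now-retired code-davinci-002 model, neither of which is specified in the sources cited here.

```python
# Minimal sketch, assuming the legacy openai Python SDK (pre-1.0) and the
# now-retired code-davinci-002 Codex model; the model name, prompt, and key
# handling are illustrative assumptions, not taken from the sources above.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="code-davinci-002",   # Codex-family completion model (since retired)
    prompt='"""Return the n-th Fibonacci number."""\ndef fib(n):',
    max_tokens=128,
    temperature=0.0,            # low temperature keeps generated code more deterministic
)

print(response["choices"][0]["text"])  # the model's suggested function body
```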
Comparison Criteria:
The two LLMs are compared along the following criteria:
Task Specialization
Knowledge Base
Factual Grounding
Task Completion
Human Oversight
Availability
Transparency
Integration
Methodology:
Galactica and Codex are the two Large Language Models we are going to compare in this study.
Comparison:
Galactica:
Knowledge Base: Galactica was trained on a large corpus of scientific sources, including research papers, textbooks, lecture notes, and more specialized databases (e.g., known compounds and proteins, scientific websites, and encyclopedias), which equips it for a broad range of scientific tasks [4]. This includes summarizing research papers, solving math problems, and answering questions about science.
Task Completion: Galactica excels at handling scientific tasks, but limitations exist regarding broader creative text generation.
Human Oversight: Due to the potential for bias and limited scope
outside of science, human oversight is likely necessary for critical tasks [5].
Transparency: Limited information is publicly available about Galactica's inner workings and training data, raising concerns about potential bias and lack of explainability [5].
Codex:
Codex is trained to take a natural-language docstring and generate a working Python function that performs the task outlined in the docstring [10]. It is evaluated on the HumanEval benchmark, a set of hand-written programming problems, each paired with unit tests. The model is evaluated on its ability to generate a program that passes the tests for each programming problem given a certain number of attempts (the pass@k metric). Generating several potential scripts, then choosing the one with the highest average log-probability as the solution (i.e., "mean logp reranking") also helps improve performance [10].
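To make the reranking idea concrete, here is a minimal sketch of mean logp reranking; the Candidate structure and the example log-probabilities are assumptions for illustration, since the cited sources do not give an implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    code: str
    token_logprobs: List[float]  # log-probability the model assigned to each token

def mean_logp(candidate: Candidate) -> float:
    # Average per-token log-probability, so longer completions are not penalized
    # simply for containing more tokens.
    return sum(candidate.token_logprobs) / len(candidate.token_logprobs)

def rerank(candidates: List[Candidate]) -> Candidate:
    # "Mean logp reranking": keep the sample the model itself rates as most likely.
    return max(candidates, key=mean_logp)

# Example with made-up numbers: the second candidate has the higher mean logp.
best = rerank([
    Candidate("def f(x): ...", [-0.9, -1.2, -0.8]),
    Candidate("def f(x): ...", [-0.3, -0.4, -0.5]),
])
```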
If we move beyond allowing Codex a single attempt to solve each problem, we can get some pretty incredible results. For example, given 100 attempts at solving each problem (i.e., meaning that Codex generates 100 functions and we check to see whether any one of them solves the problem), Codex passes over 70% of the problems in the HumanEval dataset [10]!
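For intuition about how such multi-attempt results are scored, the snippet below implements the unbiased pass@k estimator used in the original Codex evaluation; the sample counts in the example are made up.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples for a problem,
    of which c pass the unit tests, estimate the probability that at least
    one of k samples would pass.  pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Made-up example: 100 samples per problem, 30 of which pass the tests.
print(round(pass_at_k(n=100, c=30, k=1), 3))    # 0.3
print(round(pass_at_k(n=100, c=30, k=100), 3))  # 1.0: some sample passed
```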
Overall, Codex specializes in code-related tasks, with capabilities for generating different code functionalities and translating between programming languages [7, 8]. However, the accuracy and control over the generated code can vary depending on complexity and instructions [8]. As a result, human oversight remains crucial, especially for critical coding tasks [8]. In addition, limited access to the model and details about its inner workings constrain transparency into its capabilities.
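As a rough illustration of the translation capability described above, the sketch below shows one way such a request might be framed as a prompt; the exact wording is an assumption rather than something specified in the cited sources.

```python
# Hypothetical prompt for code translation with a Codex-style model: the source
# function and the target-language comment frame the task, and the model is asked
# to continue the text from the final comment.
translation_prompt = '''# Python 3
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

# The same function, rewritten in JavaScript:
'''
# Passing translation_prompt to a completion endpoint (as in the earlier API
# sketch) asks the model to emit the JavaScript version after the last comment.
```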
Criterion | Galactica | Codex
Task Completion | Excels in scientific tasks | Strong in code generation, translation, and analysis
Human Oversight | Crucial due to limited scope | Necessary for critical coding tasks
References:
Liu, Y., Wu, Y., Wang, Y., Wu, Y., Zhou, Y., Li, S., ... & Liu, Z. (2023).
[4] Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., ... & Stojnic, R. (2022). Galactica: A large language model for science [PDF].
[5] Heaven, W. D. (2022). Why Meta's latest large language model only survived three days online. MIT Technology Review. Retrieved from https://www.technologyreview.com/2024/03/04/1089403/large-language-models-amazing-but-nobody-knows-why/
[8] Bolukbasi, T., Chang, S. W., Hernandez, J. L., Jha, F., Li, Y., Peng, B., ... & Wu, J. (2023, April). Scaling Large Language Models for Open-Ended. Retrieved from https://arxiv.org/abs/2304.08212
[9] OpenAI. API Reference. Retrieved from https://beta.openai.com/docs/api-reference/introduction
[10] Wolfe, C. R. Specialized LLMs: ChatGPT, LaMDA, Galactica. Retrieved from https://cameronrwolfe.substack.com/p/specialized-llms-chatgpt-lamda-galactica