Project 3
Introduction:
Large Language Models (LLMs) are algorithms trained on massive datasets of text and code, allowing them to understand and generate human-like language. Among these, two models stand out due to their unique capabilities and applications: Galactica and Codex. Each of these models has been designed with distinct strengths tailored to specific domains and applications.
Objective of the Study:
The objective of this study is to compare Galactica and Codex, identifying the applications and specific use cases for which each model is best suited, and examining what these models mean for various industries and fields of research, providing insights into their respective strengths and limitations. By achieving this goal, the study hopes to give a thorough and nuanced comparison of the two models and to offer insightful knowledge to researchers and practitioners.
Literature Review:
Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights, a problem Galactica was designed to address by organizing and reasoning over scientific knowledge [4]. Codex, by contrast, specializes in programming and is available through OpenAI's API [7]. However, Codex's focus on programming languages limits its general knowledge [7]. Additionally, research suggests that accuracy and control over generated code can vary, and human oversight remains crucial [8]. Finally, access to Codex and details about its inner workings are currently limited [9]. Looking ahead, potential areas of development include expanding its capabilities to more languages and purposes, and exploring how Codex can best collaborate with human developers.
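To make the access route above concrete, the sketch below shows how a completion might be requested from a Codex model. It assumes the legacy openai Python SDK (pre-1.0) and the now-retired code-davinci-002 model, neither of which is specified in the sources cited here.

```python
# Minimal sketch, assuming the legacy openai Python SDK (pre-1.0) and the
# now-retired code-davinci-002 Codex model; the model name, prompt, and key
# handling are illustrative assumptions, not taken from the sources above.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="code-davinci-002",   # Codex-family completion model (since retired)
    prompt='"""Return the n-th Fibonacci number."""\ndef fib(n):',
    max_tokens=128,
    temperature=0.0,            # low temperature keeps generated code more deterministic
)

print(response["choices"][0]["text"])  # the model's suggested function body
```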
Comparison Criteria:
The two LLMs are compared along the following criteria:
Task Specialization
Knowledge Base
Factual Grounding
Task Completion
Human Oversight
Availability
Transparency
Integration
Methodology:
Galactica and Codex are the two Large Language Models we are going to compare in this study.
Comparison:
Galactica:
Knowledge Base: Galactica was trained on a large corpus of scientific sources, including research papers, textbooks, lecture notes, and more specialized databases (e.g., known compounds and proteins, scientific websites, and encyclopedias), which equips it for a broad range of scientific tasks [4]. This includes summarizing research papers, solving math problems, and answering questions about science.
Task Completion: Galactica excels at handling scientific tasks, but limitations exist regarding broader creative text generation.
Human Oversight: Due to the potential for bias and limited scope
outside of science, human oversight is likely necessary for critical tasks [5].
Transparency: Limited information is publicly available about Galactica's inner workings and training data, raising concerns about potential bias and lack of explainability [5].
Codex:
Codex is trained to take a natural-language docstring and generate a working Python function that performs the task outlined in the docstring [10]. It is evaluated on the HumanEval benchmark, a set of hand-written programming problems, each paired with unit tests. The model is evaluated on its ability to generate a program that passes the tests for each programming problem given a certain number of attempts (the pass@k metric). Generating several potential scripts, then choosing the one with the highest average log-probability as the solution (i.e., "mean logp reranking") also helps improve performance [10].
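To make the reranking idea concrete, here is a minimal sketch of mean logp reranking; the Candidate structure and the example log-probabilities are assumptions for illustration, since the cited sources do not give an implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    code: str
    token_logprobs: List[float]  # log-probability the model assigned to each token

def mean_logp(candidate: Candidate) -> float:
    # Average per-token log-probability, so longer completions are not penalized
    # simply for containing more tokens.
    return sum(candidate.token_logprobs) / len(candidate.token_logprobs)

def rerank(candidates: List[Candidate]) -> Candidate:
    # "Mean logp reranking": keep the sample the model itself rates as most likely.
    return max(candidates, key=mean_logp)

# Example with made-up numbers: the second candidate has the higher mean logp.
best = rerank([
    Candidate("def f(x): ...", [-0.9, -1.2, -0.8]),
    Candidate("def f(x): ...", [-0.3, -0.4, -0.5]),
])
```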
If we move beyond allowing Codex a single attempt to solve each problem, we can get some pretty incredible results. For example, given 100 attempts at solving each problem (i.e., meaning that Codex generates 100 functions and we check to see whether any one of them solves the problem), Codex passes over 70% of the problems in the HumanEval dataset [10]!
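For intuition about how such multi-attempt results are scored, the snippet below implements the unbiased pass@k estimator used in the original Codex evaluation; the sample counts in the example are made up.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples for a problem,
    of which c pass the unit tests, estimate the probability that at least
    one of k samples would pass.  pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Made-up example: 100 samples per problem, 30 of which pass the tests.
print(round(pass_at_k(n=100, c=30, k=1), 3))    # 0.3
print(round(pass_at_k(n=100, c=30, k=100), 3))  # 1.0: some sample passed
```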
Overall, Codex specializes in code-related tasks, with capabilities for generating different code functionalities and translating between programming languages [7, 8]. However, the accuracy and control over the generated code can vary depending on complexity and instructions [8]. As a result, human oversight remains crucial, especially for critical coding tasks [8]. In addition, limited access to the model and details about its inner workings constrain transparency into its capabilities.
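As a rough illustration of the translation capability described above, the sketch below shows one way such a request might be framed as a prompt; the exact wording is an assumption rather than something specified in the cited sources.

```python
# Hypothetical prompt for code translation with a Codex-style model: the source
# function and the target-language comment frame the task, and the model is asked
# to continue the text from the final comment.
translation_prompt = '''# Python 3
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

# The same function, rewritten in JavaScript:
'''
# Passing translation_prompt to a completion endpoint (as in the earlier API
# sketch) asks the model to emit the JavaScript version after the last comment.
```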
Criterion | Galactica | Codex
Task Completion | Excels in scientific tasks | Strong in code generation, translation, and analysis
Human Oversight | Crucial due to limited scope | Necessary for critical coding tasks
References:
Liu, Y., Wu, Y., Wang, Y., Wu, Y., Zhou, Y., Li, S., ... & Liu, Z. (2023).
[4] Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., ... & Stojnic, R. (2022). Galactica: A large language model for science [PDF].
[5] Heaven, W. D. (2022). Why Meta's latest large language model only survived three days online. MIT Technology Review. Retrieved from https://www.technologyreview.com/2024/03/04/1089403/large-language-models-amazing-but-nobody-knows-why/
[8] Bolukbasi, T., Chang, S. W., Hernandez, J. L., Jha, F., Li, Y., Peng, B., ... & Wu, J. (2023, April). Scaling Large Language Models for Open-Ended. Retrieved from https://arxiv.org/abs/2304.08212
[9] OpenAI. API Reference. Retrieved from https://beta.openai.com/docs/api-reference/introduction
[10] Wolfe, C. R. Specialized LLMs: ChatGPT, LaMDA, Galactica. Retrieved from https://cameronrwolfe.substack.com/p/specialized-llms-chatgpt-lamda-galactica