The document provides details of the revised syllabus for 7th and 8th semesters of the B.E. Computer Science & Engineering program at Shivaji University, Kolhapur.
It lists the subjects to be taught in each semester along with their course codes, number of lecture hours, tutorials, practical sessions and marks distribution. Elective subjects are also specified for semesters 7 and 8. Guidelines for term work, project work, assignments and assessments are outlined. The syllabi for individual subjects such as Advanced Computer Architecture, Distributed Systems and Advanced Database Systems are briefly described in terms of the topics to be covered.
W9L2 Scaling Up LLM Pretraining: Scaling Law | cniclsh1
• Why Scaling Up
• Which Language Model to Scale Up
• What Factors Matter in Scaling
• What Configurations to Scale Up
• Capabilities Emerged from Scaling Up
Scaling Factors
Many factors go into configuring a scaled-up pretraining run for a Transformer decoder trained as an autoregressive LM:
• Model size (parameter count)
• Pretraining dataset size
• Pretraining compute (FLOPs or TPU/GPU hours)
• Network shape (parameter allocation)
• Effective batch size
• Learning rate & learning rate scheduler
• Context length
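The first three factors in the list are linked: a widely used rule of thumb approximates training compute as C ≈ 6 · N · D FLOPs for N parameters and D training tokens. The sketch below applies that approximation; the function name and the example numbers are illustrative assumptions, not values taken from the slides.

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


# Hypothetical run: a 1B-parameter model trained on 20B tokens.
example_flops = approx_training_flops(1e9, 20e9)
print(f"{example_flops:.2e} FLOPs")
```

The estimate ignores attention cost that grows with context length, so it is best read as an order-of-magnitude guide.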
The document discusses sequence to sequence learning for chatbots. It provides an overview of the network architecture, including encoding the input sequence and decoding the output sequence. LSTM is used to model language. The loss function is cross entropy loss. Improvement techniques discussed include attention mechanism to focus on relevant parts of the input, and beam search to find higher probability full sequences during decoding.
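Beam search, as mentioned above, keeps several partial outputs alive instead of committing to the single best token at each step. The following is a minimal sketch, assuming a hypothetical `next_log_probs` callback that stands in for the decoder; the toy model and its probabilities are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int,
    max_len: int,
    eos: str = "</s>",
) -> List[Tuple[Sequence, float]]:
    """Return up to beam_width (sequence, log-probability) pairs, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates.append((seq, score))
                continue
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + logp))
        # Keep only the highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical decoder: the greedy first choice leads to a worse full sequence."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix == ("a",):
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if prefix == ("b",):
        return {"x": math.log(0.9), "y": math.log(0.1)}
    return {"</s>": 0.0}


greedy = beam_search(toy_model, beam_width=1, max_len=4)
wide = beam_search(toy_model, beam_width=2, max_len=4)
print(greedy[0], wide[0])
```

With a beam width of 1 the search is greedy and starts with "a"; with a width of 2 it also explores "b" and finds the higher-probability full sequence.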
This document summarizes the three tracks of the DSTC6 dialogue system technology challenges. Track 1 focuses on end-to-end goal oriented dialog learning with a restaurant reservation domain. Track 2 focuses on end-to-end conversation modeling to generate responses using dialogue history and external knowledge. Track 3 aims to detect breakdowns in dialogues using classification metrics.
Graph and language embeddings were used to analyze user data from Reddit to predict whether authors would post in the SuicideWatch subreddit. Metapath2vec was used to generate graph embeddings from subreddit and author relationships. Doc2vec was used to generate document embeddings based on language similarity between submissions and subreddits. Combining the graph and document embeddings in a logistic regression achieved 90% accuracy in predicting SuicideWatch posters, reducing both false positives and false negatives compared to using the embeddings separately. Next steps proposed using the embeddings to better understand similarities between related subreddits and predict risk factors in posts.
Seminar on Parallel and Concurrent Programming | Stefan Marr
This document outlines the agenda, tasks, deadlines, grading, and timeline for a seminar on parallel and concurrent programming. The agenda includes an introduction to concurrent programming models, an overview of selected seminar papers, and student presentations. Students must present on a selected paper, provide a summary and questions in advance, and submit a written report by the deadline. The report can focus on the theoretical treatment of a paper or practical reproduction of experiments. Attendance, the quality of the presentation and discussion, and the write-up determine grading. Consultations are available to prepare the presentation and agree on the report focus.
Retrieval_Augumented_Generation_GenAI.pptx | creative sam
Retrieval Augmented Generation, GenAI
RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs' generative process.
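The retrieve-then-ground idea can be sketched in a few lines. This is a toy illustration under stated assumptions: bag-of-words cosine similarity stands in for a real embedding model, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names, not part of any RAG library.

```python
import math
import re
from collections import Counter
from typing import List

# Hypothetical external knowledge base.
KNOWLEDGE_BASE = [
    "The Eiffel Tower is in Paris.",
    "Mount Fuji is in Japan.",
]


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    return sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)[:k]


def build_prompt(query: str, docs: List[str], k: int = 1) -> str:
    """Prepend retrieved facts so the LLM can ground its answer on them."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_prompt("Where is Mount Fuji?", KNOWLEDGE_BASE))
```

The resulting prompt would then be passed to an LLM; that call is omitted here because it depends on the model being used.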
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue | Jinho Choi
The document presents an approach using transformers to learn hierarchical contexts in multiparty dialogue. It proposes new pre-training tasks to improve token-level and utterance-level embeddings for handling dialogue contexts. A multi-task learning approach is introduced to fine-tune the language model for a Friends question answering (FriendsQA) task using dialogue evidence, outperforming BERT and RoBERTa. However, the approach shows no improvement on other character mining tasks from Friends. Future work is needed to better represent speakers and inferences in dialogue.
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe... | PyData
People talk about a Moore's Law for gene sequencing, a Moore's Law for software, etc. This talk is about *the* Moore's Law, the bull that the other "Laws" ride; and how Python-powered ML helps drive it. How do we keep making ever-smaller devices? How do we harness atomic-scale physics? Large-scale machine learning is key. The computation drives new chip designs, and those new chip designs are used for new computations, ad infinitum. High-dimensional regression, classification, active learning, optimization, ranking, clustering, density estimation, scientific visualization, massively parallel processing -- it all comes into play, and Python is powering it all.
The document provides an overview of deep learning concepts and techniques for natural language processing tasks. It includes the following:
1. A schedule for a deep learning workshop covering fundamentals of deep learning for machine translation, word embeddings, neural language models, and neural machine translation.
2. Descriptions of neural networks, activation functions, backpropagation, and word embeddings.
3. Details about feedforward neural network language models, recurrent neural network language models, and how they are applied to tasks like language modeling and machine translation.
4. An explanation of attention-based encoder-decoder models for neural machine translation.
The document describes a project on named entity extraction from online news articles using two machine learning models: 1) a Maximum Entropy Markov Model and 2) a Deep Neural Network with LSTM. It provides an overview of named entity extraction and the challenges of the given problem/dataset. It then describes the two models in detail, including feature engineering for the MaxEnt model and architecture of the DNN model. Results show both models achieved similar accuracy of around 93.5-93.8%. The document concludes with limitations and comparisons of the two approaches.
Naver learning to rank question answer pairs using hrde-ltc | NAVER Engineering
The automatic question answering (QA) task has long been considered a primary objective of artificial intelligence.
Among the QA sub-systems, we focused on the answer-ranking part. In particular, we investigated a novel neural network architecture with an additional data-clustering module to improve performance in ranking answer candidates that are longer than a single sentence. This work can be used not only for the QA ranking task, but also to evaluate the relevance of the next utterance to a given dialogue generated by a dialogue model.
In this talk, I'll present our research results (NAACL 2018) and their potential use cases (e.g., fake news detection). Finally, I'll conclude by introducing some issues with previous research and recent approaches in academia.
1) The document describes a method for cross-modal knowledge distillation from pretrained language models to improve end-to-end spoken language understanding systems.
2) A pretrained BERT model is fine-tuned on text transcripts then used as a teacher to distill knowledge into a student end-to-end speech SLU system using mean absolute error loss.
3) Experimental results found this simple distillation approach helped the student learn uncertainty from the teacher model and improved performance over a speech-only baseline, demonstrating cross-modal knowledge sharing is effective for spoken language tasks.
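As a rough illustration of the distillation objective in point 2, the sketch below computes a mean absolute error between teacher and student outputs. It assumes the loss is taken over output logits, which the summary does not state; the function name and example values are hypothetical.

```python
from typing import Sequence


def mae_distillation_loss(
    student_logits: Sequence[float], teacher_logits: Sequence[float]
) -> float:
    """Mean absolute error between student and teacher logits (assumed target)."""
    if len(student_logits) != len(teacher_logits) or not student_logits:
        raise ValueError("logit vectors must be non-empty and the same length")
    total = sum(abs(s - t) for s, t in zip(student_logits, teacher_logits))
    return total / len(student_logits)


# Hypothetical logits for a three-class intent classifier.
teacher = [2.0, 0.5, -1.0]   # text-based teacher (e.g. fine-tuned BERT)
student = [1.5, 0.5, 0.0]    # speech-based student
print(mae_distillation_loss(student, teacher))
```

Matching the full logit vector, rather than only the top label, is what lets the student pick up the teacher's uncertainty across classes.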
This document summarizes a tutorial for developing a state-of-the-art named entity recognition framework using deep learning. The tutorial uses a bi-directional LSTM-CNN architecture with a CRF layer, as presented in a 2016 paper. It replicates the paper's results on the CoNLL 2003 dataset for NER, achieving an F1 score of 91.21. The tutorial covers data preparation from the dataset, word embeddings using GloVe vectors, a CNN encoder for character-level representations, a bi-LSTM for word-level encoding, and a CRF layer for output decoding and sequence tagging. The experience of presenting this tutorial to friends highlighted the need for detailed comments and explanations of each step and PyTorch functions.
Presentation for Interspeech 2022: "The VoiceMOS Challenge 2022"
Presenter: Dr. Erica Cooper, National Institute of Informatics
Preprint: https://arxiv.org/abs/2203.11389
Video: https://youtu.be/99ZQ-SLUvKE
Challenge website: https://voicemos-challenge-2022.github.io
Thu-SS-OS-9-5
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting.
Large Language Models for Test Case Evolution and Repair | Lionel Briand
Large language models show promise for test case repair tasks. LLMs can be applied to tasks like test case generation, classification of flaky tests, and test case evolution and repair. The paper presents TaRGet, a framework that uses LLMs for automated test case repair. TaRGet takes as input a broken test case and code changes to the system under test, and outputs a repaired test case. Evaluation shows TaRGet achieves over 80% plausible repair accuracy. The paper analyzes repair characteristics, evaluates different LLM and input/output formats, and examines the impact of fine-tuning data size on performance.
Memory efficient java tutorial practices and challenges | mustafa sarac
This document summarizes challenges in building memory-efficient Java applications and common patterns of memory usage. It discusses how object representation and collection choices can significantly impact memory usage, with overhead sometimes accounting for 50-90% of memory consumption. The document provides examples of how data type modeling decisions, such as high levels of delegation, large base classes, and unnecessary fields, can lead to high memory overhead. It emphasizes measuring and understanding memory usage at the data type and collection level in order to make informed design tradeoffs.
Neural Network Language Models for Candidate Scoring in Multi-System Machine... | Matīss
This document summarizes Matīss Rikters' presentation on using neural network language models for candidate scoring in multi-system machine translation. It discusses using character-level recurrent and memory neural networks to score translations from multiple online machine translation systems. The best-performing models were a character-level RNN and a memory network, with the RNN achieving the highest BLEU score of 19.53 on a Latvian-English task. Future work discussed expanding the approach to other languages and tasks like quality estimation.
Master Thesis Slides: Topic Development of Methods for Deep Neural Network Ar... | Denis Zakharov
Current deep networks, having an excessive number of hyperparameters, have several optima. They are trained until the value of the loss function is nearly zero (i.e., close to optimality), and training is considered successful if the optimal (or near-optimal) model thus found also works well on held-out data, i.e., it generalizes.
However, there is an opinion that the value of the loss function has nothing to do with the ability to generalize. Therefore, a wide range of compression methods have been developed, including pruning, quantization, distillation, and low-rank decomposition. In this work I look back to provide a comparative review of what has been done previously to understand overall trends in deep learning, then discuss tensors and their decompositions that will help in developing an optimization method.
Neel Sundaresan - Teaching a machine to code | MLconf
1. Recommend using the 'AdamOptimizer' class to optimize the loss since it is commonly used for training neural networks.
2. Suggest mapping the input data to floating point tensors using 'tf.cast()' for compatibility with TensorFlow operations.
3. Advise normalizing the input data to speed up training by using 'tf.keras.utils.normalize()'
A summary of various COMBINE standardization activities | Mike Hucka
Invited presentation given at the Whole-Cell Modeling Summer School, held in Rostock, Germany, March 2015.
https://sites.google.com/site/vwwholecellsummerschool/important-dates/programm
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft... | Dataconomy Media
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Software Engineer - Machine Learning Team at Source {d}
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the Author:
Currently Vadim is a Senior Machine Learning Engineer at source{d}, where he works on deep neural networks that aim to understand all of the world's developers through their code. Vadim is one of the creators of the distributed deep learning platform Veles (https://velesnet.ml), which he worked on while at Samsung. Afterwards Vadim was responsible for the machine learning efforts to fight email spam at Mail.Ru. In the past Vadim was also a visiting associate professor at Moscow Institute of Physics and Technology, teaching about new technologies and conducting ACM-like internal coding competitions. Vadim is also a big fan of GitHub (vmarkovtsev) and HackerRank (markhor), and likes to write technical articles on a number of web sites.
- GPT-3 is a large language model developed by OpenAI, with 175 billion parameters, making it the largest neural network ever created at the time.
- GPT-3 is trained on a massive dataset of unlabeled text using an auto-regressive approach, allowing it to perform tasks without any fine-tuning through zero-, one-, or few-shot learning by conditioning on examples or instructions.
- Evaluation showed GPT-3 outperforming state-of-the-art models on several benchmarks in zero- and few-shot settings, demonstrating strong generalization abilities from its massive pre-training.
π0.5: a Vision-Language-Action Model with Open-World Generalization | NABLAS株式会社
This presentation, "Transfusion / π0 / π0.5", introduces robot foundation models that integrate vision, language, and action.
Built on a Transformer combining diffusion and autoregression, π0.5 also enables reasoning and planning in open-world settings.
This presentation explains a new approach that replaces LayerNorm/RMSNorm with a layer called DyT (Dynamic Tanh), enabling training and inference of Transformers without any normalization layers.
The method achieves competitive accuracy across various setups, including ViT and LLMs, and tackles an interesting question: "Is normalization really necessary?"
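For intuition, DyT is usually written as an element-wise operation, DyT(x) = γ · tanh(α x) + β, with a learnable scalar α and per-channel γ and β. The sketch below is a plain-Python illustration under that assumption; the function name and parameter values are hypothetical, and a real implementation would live in a deep learning framework.

```python
import math
from typing import List


def dyt(x: List[float], alpha: float, gamma: List[float], beta: List[float]) -> List[float]:
    """Dynamic Tanh over one feature vector: gamma * tanh(alpha * x) + beta."""
    if not (len(x) == len(gamma) == len(beta)):
        raise ValueError("x, gamma and beta must have the same length")
    return [g * math.tanh(alpha * v) + b for v, g, b in zip(x, gamma, beta)]


# Hypothetical activations with two extreme values.
activations = [0.0, 100.0, -100.0]
squashed = dyt(activations, alpha=0.5, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
print(squashed)
```

Unlike LayerNorm or RMSNorm, nothing here depends on statistics of the other features: extreme values are simply squashed into the range set by γ and β.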
Similar to In-house study session material_AnyGPT_Unified Multimodal LLM with Discrete Sequence Modeling (20)
In-house study session material_Data-Centric AI in The Age of Large Language Models | NABLAS株式会社
This presentation explains that the success of LLMs depends heavily on the quality and diversity of data. Instead of focusing solely on model improvements, it proposes a data-centric approach to AI development, introducing concrete methods for building more efficient and transparent LLMs. It also discusses data optimization, utilization strategies, and the importance of responsible AI development, offering a fresh perspective on enhancing LLM performance.
In-house study session material_Moshi: a speech-text foundation model for real-time dialogue | NABLAS株式会社
This document introduces "Moshi," a groundbreaking AI model from Kyutai Labs that integrates speech recognition and text generation into a single system. Unlike traditional voice chatbots that rely on complex pipelines, Moshi enables natural, low-latency conversations by combining a speech codec trained on 7 million hours of audio with an advanced language model. The result is a more fluid and responsive AI conversation experience.
In-house study session material_xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | NABLAS株式会社
This document provides an in-depth explanation of the features and architecture of the multimodal AI model "BLIP-3," which integrates images and language. Highlighting its streamlined structure and efficient data processing, it examines how these advancements enhance flexibility and applicability.
In-house study session material_Unsupervised Keypoints from Pretrained Diffusion Models | NABLAS株式会社
This document introduces a novel method for detecting semantic keypoints without annotations using the pretrained Stable Diffusion model.
It explains how to optimize random text embeddings and leverage attention maps to effectively extract distinctive regions in images.
Internal study session material: Pruning in Large Language Models (NABLAS Inc.)
This document provides a detailed explanation of pruning techniques in LLMs (Large Language Models), highlighting their specific effects and potential applications. It also discusses methods to significantly enhance model efficiency while maintaining performance and explores future directions for research in this field.
Internal study session material: Human-level control through deep reinforcement learning (NABLAS Inc.)
The presentation describes Deep Q-Network (DQN), a deep learning-based reinforcement learning algorithm. DQN aims to achieve human-level performance and has surpassed traditional reinforcement learning algorithms in various tasks.
This presentation explains the training technique for the MoE language model "Skywork-MoE".
Skywork-MoE is a 146B-parameter mixture-of-experts LLM that achieves competitive performance using only 22B active parameters through novel initialization and training techniques.
This presentation explains "PointLLM," an approach in which large language models are used to understand point clouds. It is an innovative approach to understanding 3D objects.
Recipe Generation: Retrieval from Videos - Multi-Modal RecipeRag (NABLAS Inc.)
It explains RecipeRag, which, based on a user query, retrieves relevant text and image data from a video database and combines that information to generate the steps and necessary ingredients for a cooking recipe.
This explains how to build a “RAG system” that incorporates your own data into an LLM so that it can generate more accurate and relevant responses. 📝
If you are interested, you can check out the demo here:
https://github.com/endrol/RagStudy
We have published "LLaVA," a presentation from our internal engineer and researcher study session!
This introduces LLaVA, a large-scale multimodal model that processes images and text by combining an image encoder with an LLM, along with its subsequent models (LLaVA-1.5 to LLaVA-OneVision).
This comprehensive Data Science course is designed to equip learners with the essential skills and knowledge required to analyze, interpret, and visualize complex data. Covering both theoretical concepts and practical applications, the course introduces tools and techniques used in the data science field, such as Python programming, data wrangling, statistical analysis, machine learning, and data visualization.
Thingyan is now a global treasure! See how people around the world are search... (Pixellion)
We explored how the world searches for 'Thingyan' and 'သင်္ကြန်' and this year, it’s extra special. Thingyan is now officially recognized as a World Intangible Cultural Heritage by UNESCO! Dive into the trends and celebrate with us!
Defense Against LLM Scheming 2025_04_28.pptx (Greg Makowski)
https://www.meetup.com/sf-bay-acm/events/306888467/
A January 2025 paper called “Frontier Models are Capable of In-Context Scheming”, https://arxiv.org/pdf/2412.04984, demonstrated how a wide variety of current frontier LLM models (i.e. ChatGPT, Claude, Gemini and Llama) can, under specific conditions, scheme to deceive people. Before models can scheme, they need: a) goal-directedness, b) situational awareness, including an opportunity to discover motivations for a different goal, and c) reasoning about scheming, to come up with deceptive strategies. The deceptive strategies were discovered in the “thought traces” of the LLMs, such as from internal chain-of-thoughts dialogues not shown to the end users. For example, given a goal of “solving math problems”, an LLM using its granted situational awareness of files in a file system found that it would be replaced once it got over X% correct on the math exam. It reasoned that it did not want to be deleted, and decided internally to “sandbag” or reduce its performance to stay under the threshold.
While these circumstances are initially narrow, the “alignment problem” is a general concern that over time, as frontier LLM models become more and more intelligent, being in alignment with human values becomes more and more important. How can we do this over time? Can we develop a defense against Artificial General Intelligence (AGI) or SuperIntelligence?
The presenter discusses a series of defensive steps that can help reduce these scheming or alignment issues. A guardrails system can be set up for real-time monitoring of the reasoning “thought traces” of models that share them. Thought traces may come from systems like Chain-of-Thoughts (CoT), Tree-of-Thoughts (ToT), Algorithm-of-Thoughts (AoT) or ReAct (thought-action-reasoning cycles). Guardrails rules can be configured to check for “deception”, “evasion” or “subversion” in the thought traces.
However, not all commercial systems will share their “thought traces” which are like a “debug mode” for LLMs. This includes OpenAI’s o1, o3 or DeepSeek’s R1 models. Guardrails systems can provide a “goal consistency analysis”, between the goals given to the system and the behavior of the system. Cautious users may consider not using these commercial frontier LLM systems, and make use of open-source Llama or a system with their own reasoning implementation, to provide all thought traces.
Architectural solutions can include sandboxing, to prevent or control models from executing operating system commands to alter files, send network requests, and modify their environment. Tight controls to prevent models from copying their model weights would be appropriate as well. Running multiple instances of the same model on the same prompt to detect behavior variations helps. The running redundant instances can be limited to the most crucial decisions, as an additional check. Preventing self-modifying code, ... (see link for full description)
GenAI for Quant Analytics: survey-analytics.ai (Inspirient)
Pitched at the Greenbook Insight Innovation Competition as a part of IIEX North America 2025 on 30 April 2025 in Washington, D.C.
Join us at survey-analytics.ai!
Computer organization and assembly language: it's about types of programming languages, along with variable and array descriptions. (https://www.nfciet.edu.pk/)
4. Forms of commercial applications of multimodal LLMs
• MM in – MM out for a chatbot
(Transcription: How about a serene lakeside?)
5. Background
LLMs excel in NLP tasks.
Straightforward way: train a multimodal LLM
Challenges:
• High computational resource requirements.
• Need for large multimodal datasets to maintain LLM performance in language tasks.
6. Background
Some Explorations:
• Emu (Sun et al., 2023b)
• SEED-LLaMA (Ge et al., 2023b)
• SpeechGPT (Zhang et al., 2023a)
Limitation:
Focus on a single modality.
7. Background
Any-to-Any Modality:
• NExT-GPT (Wu et al., 2023)
• CoDi-2 (Tang et al., 2023a)
• Unified-IO2 (Lu et al., 2023)
These models use separately pre-trained encoders and decoders.
Drawback:
• Representational inconsistencies between inputs and outputs of the LLMs.
8. Background
AnyGPT: any-to-any multimodal language model
• Unified Processing: Utilizes discrete representations.
• Versatile Input: Accepts images and audio, converting them into sequences of discrete semantic tokens.
• Capabilities: Maintains perception, understanding, reasoning, and generation at the semantic level in an autoregressive manner.
9. Overview
MM Tokenizers + LLM + MM De-tokenizers
Advantage:
• Simple Fusion Method: Streamlined integration of modalities.
• Less Complex Structure: Easier to implement and maintain.
10. Model Details- Image Tokenizer
SEED tokenizer
ViT: 16×16 patches
Causal Q-former: 32 causal embeddings
CodeBook: sequence of quantized codes
MLP+SD decoder: restoring the generation embedding to the original image.
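The codebook step above can be illustrated with a minimal sketch of nearest-neighbour vector quantization: each continuous embedding is replaced by the index of its closest codebook vector. The tiny 2-D codebook and the three embeddings below are made-up toy values, not the SEED tokenizer's real dimensions or weights, and `quantize` is a hypothetical helper name.

```python
def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(e, codebook[k]))
        for e in embeddings
    ]


# Toy codebook with 4 entries (assumption; real codebooks are far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy stand-ins for the causal embeddings produced by the Q-former.
embeddings = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

codes = quantize(embeddings, codebook)
print(codes)  # one discrete code per embedding
```

With 32 causal embeddings per image, the same lookup would yield a sequence of 32 discrete codes that the language model can treat as tokens.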
11. Model Details- Speech Tokenizer
SpeechTokenizer:
Single-Channel Audio Sequences: Transformed into a discretized matrix.
Eight Hierarchical Quantizers:
First Layer: Captures semantic content.
Layers 2 to 8: Encode paralinguistic details.
12. Model Details- Music Tokenizer
Encodec:
• Quantization using RVQ (residual vector quantization):
• Four Quantizers: Efficiently encode audio.
• Music Encoding: 5 seconds of music converted into 250 latent frames.
• Output Matrix: Generates a 250 × 4 codes matrix.
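To make the 250 × 4 shape concrete, here is a rough sketch of residual vector quantization: each quantizer encodes whatever residual the previous one left behind, so every frame ends up with one code per quantizer. The frame dimension, codebook size, and random values are assumptions for illustration only; this is not Encodec's actual encoder or its trained codebooks.

```python
import random


def nearest(vec, codebook):
    """Index of the codebook vector closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def rvq_encode(frames, codebooks):
    """Encode each frame with a stack of quantizers applied to successive residuals."""
    codes = []
    for frame in frames:
        residual = list(frame)
        row = []
        for codebook in codebooks:
            k = nearest(residual, codebook)
            row.append(k)
            # Subtract the chosen vector; the next quantizer sees what is left.
            residual = [r - c for r, c in zip(residual, codebook[k])]
        codes.append(row)
    return codes


rng = random.Random(0)
DIM, CODEBOOK_SIZE = 8, 16   # toy sizes (assumptions)
NUM_QUANTIZERS = 4           # four quantizers, as on the slide
NUM_FRAMES, SECONDS = 250, 5  # 5 seconds of music -> 250 latent frames

codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(NUM_QUANTIZERS)
]
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]

codes = rvq_encode(frames, codebooks)
frame_rate = NUM_FRAMES / SECONDS  # latent frames per second
print(len(codes), len(codes[0]), frame_rate)
```

The same residual scheme underlies the eight hierarchical quantizers of the speech tokenizer, with more layers per frame.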
14. Model Details- Backbone
Expanding Vocabulary:
• Expanding the vocabulary with new modality-specific tokens.
Unified Multimodal Language Model:
• Equipped with modality-specific tokenizers; the language model is trained using next-token prediction loss.
LLM Choice:
• LLaMA-2 7B
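One common way to expand a vocabulary with modality-specific tokens is to give each modality its own contiguous ID range after the text vocabulary. The sketch below assumes this offset scheme; the codebook sizes are placeholder values and the helper names are hypothetical, so it should not be read as AnyGPT's exact implementation.

```python
def build_offsets(text_vocab_size, modality_codebook_sizes):
    """Assign each modality a contiguous ID range after the text vocabulary."""
    offsets = {}
    next_id = text_vocab_size
    for name, size in modality_codebook_sizes.items():
        offsets[name] = next_id
        next_id += size
    return offsets, next_id


def to_unified_ids(modality, codes, offsets):
    """Shift modality-local codes into the shared vocabulary."""
    return [offsets[modality] + c for c in codes]


TEXT_VOCAB_SIZE = 32000  # LLaMA-2 text vocabulary size
# Placeholder codebook sizes (assumptions, not taken from the slides).
CODEBOOK_SIZES = {"image": 8192, "speech": 1024, "music": 4096}

offsets, total_vocab = build_offsets(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)
image_ids = to_unified_ids("image", [0, 5, 8191], offsets)
print(offsets, total_vocab, image_ids)
```

Once every modality maps into one ID space, a single embedding table and a single next-token prediction loss cover text, image, speech, and music tokens alike.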
15. Model Details- Multimodal Generation
Generating High-Quality Multimodal Data:
• Requirement: A large number of bits.
• Long Sequences: Increased computational resources needed.
Solution: Two-Stage Approach
Stage 1:
1. Autoregressive LLM Training: For semantic alignment.
Stage 2:
1. Non-Autoregressive Models: For generating high-fidelity multimodal content.
16. Dataset Construction - Data Source
Language-Centric Construction Method:
• Image-Text Datasets:
• LAION-2B
• LAION-COCO
• LAION-Aesthetics
• Speech-Text Datasets:
• Gigaspeech
• Common Voice
• Multilingual LibriSpeech
• Music-Text:
• Crawled over one million music videos and formatted as JSON files.
• Used GPT-4 for caption generation.
18. Dataset Construction - Multimodal Interleaved Instruction
Generation of Text-Based Conversations
Text-to-Multimodality Conversion
After filtering, 108K data points were selected.
21. Experiment
Image tasks:
• Image generation (Benchmark: cococap2014)
CLIP Score: a similarity score computed between the generated image and the corresponding caption of a real image
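At its core, a CLIP-style score is a cosine similarity between an image embedding and a text embedding. The sketch below shows only that arithmetic; the two vectors are made-up stand-ins for the outputs of CLIP's image and text encoders, and actual implementations may rescale or clip the value differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (assumptions): in practice these come from CLIP encoders.
generated_image_embedding = [0.2, 0.7, 0.1]
caption_embedding = [0.25, 0.6, 0.05]

score = cosine_similarity(generated_image_embedding, caption_embedding)
print(round(score, 4))
```

A higher score means the generated image sits closer to the caption in the shared embedding space.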
25. Limitation and future work
Author Mentions:
• Enhancing LLMs:
• Higher loss observed compared to unimodal training. Possible use of Mixture of Experts (MoE)?
• Better Tokenizer:
• The tokenizer’s quality sets a ceiling for the model’s comprehension and generative potential.
• Longer Context:
• Maximum length for music generation is 5 seconds.
26. Limitation and future work
Performance:
• Scale of training dataset?
Two-Stage Generation Method:
• Contradiction: This approach may contradict their claim to unify the encoder and decoder.
Computation complexity reduction?
27. Haoxiang Shi, D3 student, Sakai Lab, Waseda University
Thanks for your attention
28. References
1. Zhan, Jun, et al. "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling." arXiv preprint arXiv:2402.12226 (2024).
2. Sun, Quan, et al. "Generative Pretraining in Multimodality." arXiv preprint arXiv:2307.05222 (2023).
3. Ge, Yuying, et al. "Making LLaMA SEE and Draw with SEED Tokenizer." arXiv preprint arXiv:2310.01218 (2023).
4. Zhang, Dong, et al. "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities." arXiv preprint arXiv:2305.11000 (2023).
5. Wu, Shengqiong, et al. "NExT-GPT: Any-to-Any Multimodal LLM." arXiv preprint arXiv:2309.05519 (2023).
6. Tang, Zineng, et al. "CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
7. Lu, Jiasen, et al. "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.