1. Abstract
- Goal of the Competition
- Dialogue Summarization
- Task : Summarization
- Evaluation Metric : ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Timeline
- Start Date : May 13, 2024
- Final submission deadline : May 27, 2024 (13:00)
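As a rough illustration of the ROUGE metric named above, the sketch below computes ROUGE-N precision, recall, and F1 from n-gram overlap. It assumes plain whitespace tokenization, which is likely not what the competition scorer used (Korean text is usually split into morphemes first), and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N between a reference and a candidate summary.

    Assumption: whitespace tokenization; the official scorer may differ.
    """
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
print(scores)
```

Calling `rouge_n(..., n=2)` gives the bigram variant in the same way.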
- Description of the work
Dataset overview
- train : 12457
- dev : 499
- test : 250
- hidden-test : 249
EDA
- Length distributions of dialogue, topic, and summary.
- Analysis of special tokens in dialogues.
- Summary ratio before and after tokenizing (train/dev).
- train
- dev
- train
Data Processing
- train : 12457 → 12403
- Removed dialogues with more than 3 special tokens.
- Removed samples with a summary ratio over 0.5.
- dev : 499 → 486
- Removed dialogues with more than 2 special tokens.
- Removed samples with a summary ratio over 0.5.
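The two filtering rules above could be sketched roughly as follows. This is not the code used in the competition: it assumes special tokens look like `#Person1#` or `#PhoneNumber#`, counts distinct tokens per dialogue, and defines the summary ratio as summary length divided by dialogue length in characters. All function names and the sample fields `dialogue` and `summary` are hypothetical.

```python
import re

# Assumed shape of a special token, e.g. "#Person1#" or "#PhoneNumber#".
SPECIAL_TOKEN = re.compile(r"#[A-Za-z]+\d*#")


def count_special_tokens(dialogue):
    """Number of distinct special tokens in a dialogue (an assumption)."""
    return len(set(SPECIAL_TOKEN.findall(dialogue)))


def summary_ratio(dialogue, summary):
    """Summary length relative to dialogue length, in characters."""
    return len(summary) / max(len(dialogue), 1)


def filter_split(samples, max_special, max_ratio=0.5):
    """Keep samples within both thresholds; drop the rest."""
    return [
        s for s in samples
        if count_special_tokens(s["dialogue"]) <= max_special
        and summary_ratio(s["dialogue"], s["summary"]) <= max_ratio
    ]


samples = [
    {"dialogue": "#Person1#: hello there, how are you today?",
     "summary": "A greeting."},
    {"dialogue": "#Person1#: hi",
     "summary": "Person one says hi to someone."},
    {"dialogue": "#Person1#: call #PhoneNumber# or mail #Email# at #Address#, "
                 "it is the best way to reach me quickly.",
     "summary": "Contact info."},
]
train_kept = filter_split(samples, max_special=3)  # train threshold from the post
dev_kept = filter_split(samples, max_special=2)    # dev threshold from the post
print(len(train_kept), len(dev_kept))
```

The same ratio could be computed on token counts instead of characters to compare the values before and after tokenizing.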
Data Augmentation
- Data augmentation using Cohere API
- original data
- augmented data
2. Process : Competition Model
- Solar : beomi/OPEN-SOLAR-KO-10.7B + 4-bit quantization + LoRA
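A setup like this is commonly built with the `transformers`, `bitsandbytes`, and `peft` libraries. The sketch below shows one plausible way to do it. Only the model name comes from this post; the quantization settings, LoRA rank, and target modules are assumed values, not the ones used in the competition.

```python
MODEL_NAME = "beomi/OPEN-SOLAR-KO-10.7B"  # model named in the post

# Assumed settings for illustration only.
BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def build_qlora_model(model_name=MODEL_NAME):
    """Load the base model in 4-bit and wrap it with LoRA adapters."""
    # Imported lazily so the settings above can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **BNB_KWARGS
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    # Prepares the quantized model for training (e.g. casts norm layers).
    model = prepare_model_for_kbit_training(model)
    # Only the small LoRA adapter weights are trainable after this call.
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Calling `build_qlora_model()` downloads the full checkpoint and needs a CUDA GPU with `bitsandbytes` installed, so it is left uncalled here.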

3. Process : Issues
- KoBART (used in the baseline) and T5 expect differently structured data, so the data handling had to be restructured before the baseline approach could be reused.
- Rewrote the code as separate cells in a Jupyter notebook, without using config.yaml.
- Reusing newly referenced code left the configuration improperly set for the T5 model.
- Adjusted the configuration to fit the T5 model and proceeded with fine-tuning.
- The model produced no output. The investigation is on hold while switching to the Solar model.
4. Role
- Modeling & Finetuning
- T5 (Main) : eenzeenee/t5-base-korean-summarization
- Solar (Sub) : shared by teammate
5. Results
- Public Score
- Final standings of the Leaderboard(Private Score)
6. Conclusion
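- Adjusting the train config through a yaml file with the script-style baseline was a fresh experience.
- As a beginner I may be more comfortable with notebooks, but switching to a script format is necessary for efficiency.
- It was also good to be able to manage experiments by logging them with various tools (wandb, Hugging Face uploads, etc.).
- This field is even less familiar to me than CV, so I value simply having experienced it, and I plan to try various approaches when I review the work later.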