5/30 :: Fast-up report

1. Abstract

  • Goal of the Competition
    • Dialogue Summarization
    • Task : Summarization
    • Evaluation Metric : ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
  • Timeline
    • Start Date : May 13, 2024
    • Final submission deadline : May 27, 2024 (13:00)
  • Description of the work

    Dataset overview

    • train : 12457
    • dev : 499
    • test : 250
    • hidden-test : 249

    EDA

    • Data distribution of dialogue/topic/summary length.
    • Analysis on special tokens in dialogues.
    • Summary ratio before and after tokenizing. (train/dev)
      • train
      • dev

    Data Processing

    • train : 12457 → 12403, after removing
      • dialogues with more than 3 special tokens
      • examples with a summary ratio over 0.5
    • dev : 499 → 486, after removing
      • dialogues with more than 2 special tokens
      • examples with a summary ratio over 0.5
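
The two filters above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the competition: it assumes special tokens look like `#Person1#` or `#PhoneNumber#`, that "number of special tokens" means distinct tokens per dialogue, and that the summary ratio is summary length divided by dialogue length in characters. The function names are hypothetical.

```python
import re

# Assumption: special tokens are wrapped in '#', e.g. #Person1# or #PhoneNumber#.
SPECIAL_TOKEN = re.compile(r"#\w+#")


def count_special_tokens(dialogue: str) -> int:
    """Count the distinct special tokens that appear in a dialogue."""
    return len(set(SPECIAL_TOKEN.findall(dialogue)))


def summary_ratio(dialogue: str, summary: str) -> float:
    """Summary length divided by dialogue length, measured in characters."""
    return len(summary) / max(len(dialogue), 1)


def keep_example(dialogue: str, summary: str,
                 max_special_tokens: int, max_ratio: float = 0.5) -> bool:
    """Keep an example only if it passes both filters."""
    return (count_special_tokens(dialogue) <= max_special_tokens
            and summary_ratio(dialogue, summary) <= max_ratio)


# Toy rows, made up for illustration only.
rows = [
    {"dialogue": "#Person1#: Hello, how are you today?\n#Person2#: Fine, thanks. And you?",
     "summary": "Two people greet."},
    {"dialogue": "#Person1#: Hi.",
     "summary": "#Person1# says hi to someone."},  # summary longer than the dialogue
]

# Threshold 3 for train and 2 for dev, as listed above.
filtered = [r for r in rows
            if keep_example(r["dialogue"], r["summary"], max_special_tokens=3)]
```

The second toy row is dropped because its summary is longer than half of its dialogue. The same ratio could be computed on token counts instead of characters if the tokenized lengths are what matter.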

    Data Augmentation

    • Data augmentation using Cohere API
    • original data
    • augmented data

2. Process : Competition Model

  • Solar : beomi/OPEN-SOLAR-KO-10.7B + 4-bit quantization + LoRA
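
For reference, a 4-bit quantization + LoRA setup like the one above is commonly written with `transformers`, `bitsandbytes`, and `peft`. The sketch below shows one way it might look. Every hyperparameter value (rank, alpha, dropout, target modules, NF4, bfloat16) is an illustrative assumption and not the setting used by the team; only the model name comes from this report.

```python
# Model name taken from the report; everything else below is an assumption.
MODEL_NAME = "beomi/OPEN-SOLAR-KO-10.7B"

# 4-bit quantization settings (illustrative values).
QUANT_KWARGS = dict(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# LoRA adapter settings (illustrative values).
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)


def build_configs():
    """Build the quantization and LoRA config objects.

    Imports are deferred so the constants above can be inspected
    without torch/transformers/peft installed.
    """
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS
    )
    lora_config = LoraConfig(**LORA_KWARGS)
    return bnb_config, lora_config


def load_model():
    """Load the 4-bit base model and wrap it with LoRA adapters (needs a GPU)."""
    from transformers import AutoModelForCausalLM
    from peft import get_peft_model, prepare_model_for_kbit_training

    bnb_config, lora_config = build_configs()
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, lora_config)
```

With this kind of setup only the small LoRA matrices are trained while the 4-bit base weights stay frozen, which is what makes a 10.7B model feasible to fine-tune on a single GPU.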

Modeling Process

3. Process : Issues

  • The data structures used for KoBART and T5 in the baseline differ, so the code has to be restructured before the baseline approach can be followed.
    • Fix: drop config.yaml and split the code into separate cells inside the Jupyter notebook.
  • The T5 configuration was set incorrectly because newly referenced code was reused as-is.
    • Fix: adjust the configuration to fit the T5 model, then proceed with fine-tuning.
  • The model produces no output. The cause is still under investigation, and the issue is on hold while work switches to the Solar model.
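
One way to keep per-model settings inside a notebook without config.yaml is a small dataclass, roughly as sketched below. The class name, field names, and all default values are hypothetical; in particular the `input_prefix` value is an assumption that should be checked against the model card.

```python
from dataclasses import dataclass, asdict


@dataclass
class SummarizationConfig:
    """Notebook-local replacement for config.yaml (hypothetical fields)."""
    model_name: str
    encoder_max_len: int = 512   # assumed value
    decoder_max_len: int = 128   # assumed value
    learning_rate: float = 1e-5  # assumed value
    num_train_epochs: int = 3    # assumed value
    input_prefix: str = ""       # T5-style models often expect a task prefix


# Model name from section 4; the prefix is an assumption.
t5_cfg = SummarizationConfig(
    model_name="eenzeenee/t5-base-korean-summarization",
    input_prefix="summarize: ",
)


def build_input(cfg: SummarizationConfig, dialogue: str) -> str:
    """Prepend the model-specific prefix to a dialogue."""
    return cfg.input_prefix + dialogue


# asdict() gives a plain dict that can be logged or printed in a cell.
config_dict = asdict(t5_cfg)
```

Keeping one config object per model makes it harder to accidentally reuse settings written for a different architecture.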

4. Role

  • Modeling & Finetuning
    • T5 (Main) : eenzeenee/t5-base-korean-summarization
    • Solar (Sub) : shared by teammate

5. Results

  • Public Score
  • Final leaderboard standings (Private Score)
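
Since both scores are ROUGE-based, here is a minimal sketch of how a ROUGE-N F1 score is computed from n-gram overlap. It assumes plain whitespace tokenization; the competition's official scorer may tokenize Korean text differently and may combine several ROUGE variants, so this is for intuition only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N F1 between a reference and a candidate summary."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 of 4 unigrams match in each direction.
score = rouge_n_f1("a b c d", "a b x y", n=1)
```

In the toy example precision and recall are both 0.5, so the unigram F1 is 0.5.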

6. Conclusion

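  • Adjusting the train config through a yaml file in a script-based baseline was a fresh experience.
    • As a beginner the notebook format may feel more familiar, but switching to a script-based workflow is necessary for efficiency.
  • It was also good to be able to manage experiments by logging them with several other tools (wandb, uploading to Hugging Face, etc.).
  • Since this field is even less familiar to me than CV, I see value in simply having experienced it, and I plan to review the work later and try out various ideas.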
