Data Preprocessing - Naver Movie Reviews

joyHong 2020. 8. 11. 22:54

Getting the Naver movie review data

https://ratsgo.github.io/embedding/downloaddata.html

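The Naver movie review data can be downloaded from this blog.

Korean Wikipedia and KorQuAD data are also available there.

After downloading the Naver movie review data and opening it,

you can see that each line holds three tab-separated fields in the order id, document, label.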
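To make the layout concrete, here is a minimal sketch for peeking at the first few rows of a tab-separated file. The helper name `peek_tsv` and the sample rows are made up for illustration (they are not real NSMC data); only the id, document, label column order comes from the file itself.

```python
import csv
import io
from itertools import islice


def peek_tsv(lines, n=3):
    """Return (header, first n rows) from an iterable of tab-separated lines."""
    # QUOTE_NONE: review text may contain stray quote characters, so they
    # are treated as ordinary characters rather than as field quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = list(islice(reader, n))
    return header, rows


# Made-up sample in the same layout as ratings.txt (not real NSMC rows).
sample = io.StringIO(
    "id\tdocument\tlabel\n"
    "1\tgreat movie\t1\n"
    "2\ttoo long and boring\t0\n"
)
header, rows = peek_tsv(sample, n=2)
print(header)
for row in rows:
    print(row)
```

For the real file, the same function should work on an open file handle, for example `open(path, encoding='utf-8', newline='')`.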

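From this file, we extract only the document column, which is the part we need, and save it to a separate file.

(The label column, where 0 marks a negative review and 1 a positive one, is not used here, so it is not extracted.)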

Code

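def process_nsmc(corpus_path, output_fname, with_label=True):
    """Extract the review text (and optionally the label) from an NSMC file."""
    with open(corpus_path, 'r', encoding='utf-8') as f1, \
            open(output_fname, 'w', encoding='utf-8') as f2:
        next(f1)  # skip the header line (id, document, label)
        for line in f1:
            fields = line.strip().split('\t')
            if len(fields) != 3:
                continue  # skip malformed lines instead of raising ValueError
            _, sentence, label = fields
            if not sentence:
                continue  # skip reviews with empty text
            if with_label:
                # U+241E separates the review text from its label
                f2.write(sentence + "\u241E" + label + "\n")
            else:
                f2.write(sentence + "\n")


corpus_path = 'D:/Data/embedding/data/raw/ratings.txt'
output_fname = 'D:/Data/embedding/data/processed/processed_ratings.txt'
# with_label=False: write only the review text, one review per line
process_nsmc(corpus_path, output_fname, False)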

Result

Only the actual review text from the Naver movie review data has now been extracted and saved to a file.
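As a quick sanity check on the output, one option is to count the lines and confirm that none of them are blank. The sketch below is only an illustration: the helper name `summarize_reviews` is hypothetical, and the demo runs on a throwaway file with made-up lines rather than on the real output.

```python
import os
import tempfile


def summarize_reviews(path):
    """Count total and blank lines in a one-review-per-line text file."""
    total = blank = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            total += 1
            if not line.strip():
                blank += 1
    return total, blank


# Demo on a temporary file with made-up lines; in practice, pass the
# processed_ratings.txt path used above instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_reviews.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("great movie\ntoo long and boring\n")
    total, blank = summarize_reviews(demo_path)
    print(total, blank)
```

Because the extraction code skips empty reviews, the blank count on the real output would be expected to be zero.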

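Next, I plan to write up how to read this text, extract only the nouns, and save them to another file.

※ The output file for the steps above can also be downloaded from the site linked at the beginning (https://ratsgo.github.io/embedding/downloaddata.html).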

Reference:

The code used above was taken from https://github.com/ratsgo/embedding/blob/master/preprocess/dump.py and partially modified.
