Following Along with Building ML Pipelines (4) - Data Validation

johanjun 2021. 11. 22. 18:57

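Garbage In, Garbage Out

Anyone who has worked with machine learning has heard this phrase ad nauseam (~~though most of the time you get garbage out even when no garbage goes in~~).

If you do not curate and validate your data, the model cannot learn properly.

Data validation is the task of checking that the data in the pipeline is what the feature engineering step expects. The following tasks all count as data validation:

Comparing multiple datasets
Flagging changes in the data as it is updated over time
Checking for anomalies and for changes to the schema
Checking that the statistics of a new dataset match those of the previous one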

Data Validation with TFDV (TensorFlow Data Validation)

Let's validate data with TFDV, a package provided by TFX. TFDV accepts two input file formats: TFRecord and CSV.

It can be installed with the following command.

!pip3 install tensorflow-data-validation

Generating Summary Statistics (Feature Statistics)

Let's generate feature statistics and store them in a variable named stats.

import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(data_location='../data/consumer_complaints_with_narrative.csv',
                                         delimiter=',')

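The result is hard to take in at a glance, but it stores a variety of statistics: the total number of records, the number of missing values, the number of unique records, min/max values, and more.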
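To make those numbers concrete, here is a minimal pure-Python sketch of the kind of per-feature summary described above. It is not how TFDV computes its statistics. The helper `summarize_column` and the toy CSV are made up for illustration, and empty strings are assumed to mean missing values.

```python
import csv
import io

def summarize_column(rows, column):
    """Compute a few summary numbers for one column of a list of dict rows.

    Hypothetical helper for illustration only; empty strings are treated
    as missing values.
    """
    values = [row.get(column, "") for row in rows]
    present = [v for v in values if v != ""]
    summary = {
        "num_records": len(values),
        "num_missing": len(values) - len(present),
        "num_unique": len(set(present)),
    }
    # min/max are reported only when every present value parses as a number.
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        numbers = []
    if numbers:
        summary["min"] = min(numbers)
        summary["max"] = max(numbers)
    return summary

# Toy CSV standing in for a real file (made-up data).
toy_csv = io.StringIO("state,amount\nCA,10\nNY,\nCA,30\n,20\n")
rows = list(csv.DictReader(toy_csv))

state_summary = summarize_column(rows, "state")
amount_summary = summarize_column(rows, "amount")
print(state_summary)
print(amount_summary)
```

The string column gets only counts, while the numeric column also gets min/max, which mirrors how the statistics differ by feature type.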

Generating a Schema

A schema describes the representation of a dataset: it defines the data type and range of each feature.

schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)
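As a rough picture of what "inferring a schema" means, the sketch below derives a type, a range or value domain, and a required flag from a handful of rows. This is a toy under my own assumptions, not TFDV's inference algorithm; `infer_simple_schema`, `max_domain_size`, and the sample rows are all hypothetical.

```python
def infer_simple_schema(rows, max_domain_size=10):
    """Infer a toy schema from a list of dict rows (illustration only)."""
    schema = {}
    columns = sorted({key for row in rows for key in row})
    for column in columns:
        # None stands for a missing value in this toy example.
        values = [row[column] for row in rows if row.get(column) is not None]
        if values and all(isinstance(v, int) for v in values):
            entry = {"type": "INT", "min": min(values), "max": max(values)}
        elif values and all(isinstance(v, (int, float)) for v in values):
            entry = {"type": "FLOAT", "min": min(values), "max": max(values)}
        else:
            entry = {"type": "STRING"}
            domain = sorted({str(v) for v in values})
            # Keep an explicit value domain only for low-cardinality features.
            if len(domain) <= max_domain_size:
                entry["domain"] = domain
        # A feature is "required" if it was present in every row.
        entry["required"] = len(values) == len(rows)
        schema[column] = entry
    return schema

sample_rows = [
    {"state": "CA", "amount": 10, "score": 0.5},
    {"state": "NY", "amount": 30, "score": None},
    {"state": "CA", "amount": 20, "score": 0.9},
]
toy_schema = infer_simple_schema(sample_rows)
for name, rule in toy_schema.items():
    print(name, rule)
```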

Comparing Datasets

Datasets that have been split can be compared as follows.

train_stats = tfdv.generate_statistics_from_csv(
    data_location='../data/data_validation/dataset_1.csv',
    delimiter=',')
val_stats = tfdv.generate_statistics_from_csv(
    data_location='../data/data_validation/dataset_2.csv',
    delimiter=',')

tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')

Detecting Anomalies

The following code detects anomalies.

anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
tfdv.display_anomalies(anomalies)
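The idea of checking data against a schema can be sketched in plain Python. Note that TFDV validates aggregate statistics rather than individual rows, so this row-level toy only illustrates the concept. `find_anomalies`, the schema dictionary layout, and the sample rows are assumptions made up for this example.

```python
def find_anomalies(rows, schema):
    """Return {feature: [messages]} for rows that break a toy schema."""
    anomalies = {}

    def report(feature, message):
        # Record each distinct message once per feature.
        messages = anomalies.setdefault(feature, [])
        if message not in messages:
            messages.append(message)

    for row in rows:
        for feature, rule in schema.items():
            value = row.get(feature)
            if value is None:
                if rule.get("required"):
                    report(feature, "missing value in a required feature")
                continue
            domain = rule.get("domain")
            if domain is not None and value not in domain:
                report(feature, f"unexpected value: {value!r}")
        # Features that the schema does not know about are anomalies too.
        for feature in row:
            if feature not in schema:
                report(feature, "feature not in schema")
    return anomalies

toy_schema = {
    "state": {"required": True, "domain": ["CA", "NY"]},
    "product": {"required": False},
}
new_rows = [
    {"state": "CA", "product": "loan"},
    {"state": "TX", "product": None},
    {"state": None, "channel": "web"},
]
found = find_anomalies(new_rows, toy_schema)
print(found)
```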

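Data Skew Comparator

In TFDV, skew does not mean statistical skewness. It is defined as the L-infinity norm of the difference between two datasets, where the L-infinity norm of a vector is the largest absolute value among its entries. If the difference between the two datasets exceeds the L-infinity norm threshold set for a particular feature, it is flagged as an anomaly.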

tfdv.get_feature(schema, 'company').skew_comparator.infinity_norm.threshold=0.01
skew_anomalies = tfdv.validate_statistics(statistics=train_stats,
                                         schema=schema,
                                         serving_statistics=stats)
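For a categorical feature, the L-infinity comparison can be sketched as follows: turn each dataset's values into relative frequencies, then take the largest absolute difference. This is a minimal sketch of the concept, not TFDV's code. The function names and the sample 'company' values are made up, and 0.01 is simply the threshold used above.

```python
from collections import Counter

def value_distribution(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def l_infinity_distance(values_a, values_b):
    """Largest absolute difference in relative frequency over all values."""
    dist_a = value_distribution(values_a)
    dist_b = value_distribution(values_b)
    all_values = set(dist_a) | set(dist_b)
    return max(abs(dist_a.get(v, 0.0) - dist_b.get(v, 0.0)) for v in all_values)

# Made-up 'company' values for a training set and a serving set.
training = ["A", "A", "A", "B"]
serving = ["A", "B", "B", "B"]

distance = l_infinity_distance(training, serving)
threshold = 0.01
is_skewed = distance > threshold
print(distance, is_skewed)
```

Here "A" makes up 75% of the training values but only 25% of the serving values, so the distance is 0.5 and the feature would be flagged.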

Slicing Data

You can also slice the dataset on a selected feature to check for bias.

from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0 import statistics_pb2

# Feature values must be passed as a list of bytes.
slice_fn1 = slicing_util.get_feature_value_slicer(features={'state': [b'CA']})
slice_option = tfdv.StatsOptions(slice_functions=[slice_fn1])
slice_stats = tfdv.generate_statistics_from_csv(data_location='../data/data_validation/dataset_1.csv',
                                               stats_options=slice_option)

def display_slice_keys(stats):
    # Print the name of every slice contained in the statistics.
    print([dataset.name for dataset in stats.datasets])

def get_sliced_stats(stats, slice_key):
    # Return the statistics of the slice named slice_key as its own list.
    for sliced_dataset in stats.datasets:
        if sliced_dataset.name == slice_key:
            result = statistics_pb2.DatasetFeatureStatisticsList()
            result.datasets.add().CopyFrom(sliced_dataset)
            return result
    # Reached only when no slice matched.
    print('Invalid Slice Key')

def compare_slices(stats, slice_key1, slice_key2):
    lhs_stats = get_sliced_stats(stats, slice_key1)
    rhs_stats = get_sliced_stats(stats, slice_key2)
    tfdv.visualize_statistics(lhs_stats, rhs_stats)

compare_slices(slice_stats, 'state_CA', 'All Examples')
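What slicing does conceptually can be shown without TFDV: group the rows by a feature value, keep an overall slice alongside, and compute the same statistic on each. The sketch below is only an illustration with made-up data and hypothetical helper names. The slice keys mimic the 'state_CA' and 'All Examples' names used above.

```python
from collections import defaultdict

def slice_by_feature(rows, feature, values):
    """Group rows into named slices, plus an 'All Examples' slice."""
    slices = defaultdict(list)
    for row in rows:
        slices["All Examples"].append(row)
        if row.get(feature) in values:
            slices[f"{feature}_{row[feature]}"].append(row)
    return dict(slices)

def mean(rows, feature):
    """Average of a numeric feature over the given rows."""
    return sum(row[feature] for row in rows) / len(rows)

sample_rows = [
    {"state": "CA", "amount": 10},
    {"state": "NY", "amount": 30},
    {"state": "CA", "amount": 20},
]
slices = slice_by_feature(sample_rows, "state", {"CA"})
slice_means = {name: mean(members, "amount") for name, members in slices.items()}
print(slice_means)
```

A large gap between a slice's statistic and the overall one is the kind of signal worth inspecting for bias.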

Running Data Validation on GCP

Once you have authenticated, the following code lets you submit a data validation job from your local machine to GCP Dataflow. The job itself runs on GCP instances.

#code from https://www.tensorflow.org/tfx/data_validation/get_started#running_on_google_cloud
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions

PROJECT_ID = ''
JOB_NAME = ''
GCS_STAGING_LOCATION = ''
GCS_TMP_LOCATION = ''
GCS_DATA_LOCATION = ''
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = ''

PATH_TO_WHL_FILE = ''


# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
options.view_as(StandardOptions).runner = 'DataflowRunner'

setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]

tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION,
                                       output_path=GCS_STATS_OUTPUT_PATH,
                                       pipeline_options=options)

That wraps up this look at TFDV, the TFX package for carrying out the data validation process efficiently.
