Python 7 - Pandas 1

2022-09-23 2 분 소요

pandas

구조화된 데이터의 처리를 지원하는 python 라이브러리

panel data

numpy와 통합되어 좋아짐.

기존 데이터들을 접근하는 Indexing이 지정되어 있음

DataFrame 이용

data 용어 정리

전체 : data table, sample

column : attribute, feature, field

row : instance, tuple

->tabular 라는 데이터 형식 사용

data_url =  "./what.data"
df_data = pd.read_csv(
		data_url, sep = "\s+", header = None 
)

#처음 n줄 출력
df_data.head(n)

df_data.columns = ["1", "2", "3"]
#column header 이름 지칭
df_data.head()


df_data.values

pandas의 구성

series

: column vector를 표현하는 object

list_data = [1,2,3,4,5]
list_name = ["a","b","c","d","e"]

example = Series(data = list_data, index = list_name)

type 는 numpy.ndarray

index 기본인 pandas 특성상 dict로 넣을 수 있음

dict_data = {"a":1, "b":2, "c":3}
example = Series(dict_data, dtype = float32, name = "example")

data index에 접근하기

example["a"]
#값 할당도 가능
example["a"] = 3.2

#값 추출, 할당 가능
example.values
example.index
example.name
example.index.name

*series는 index 기준으로 생성되기 때문에 data가 수가 적으/면 NaN으로 표시되더라도 생성된다. (column도 마찬가지)

dataframe

: series가 모인, dat table 전체를 포함하는 object

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

row_data = {'key' : [value1, value2], 'key2 ': [value3, value4]}
df = pd.DataFrame(row_data, columns = ['key', 'key2'])

#column 추출법
df.key
df["key"]

dataframe indexing

# loc - index location : index 이름
df.loc[:3]
df.loc[:, ["first", "second"]]
# iloc - index position : index number
df.iloc[:3]
df["first"].iloc[:3]

dataframe handling

#column에 새로운 값 할당
df.dept = df.age >40

#column 삭제
del df["debt"]
df.drop("debt",axis = 1)

selection and drop

#한개 column
df["account"].head(3)
#1개 이상 column
df["ac", "st","fs"].head(3)

팁) .T 하면 transpose되어서 데이터 보기 좋음

# series 형태 추출
df["acount"]
# dataframe 형태 추출
df[["account"]]

#column 이름 없이 사용하는 index number는 row 기준 표시 (일관성 부족한 pandas의 모습)
df[:3]

데이터 뽑는 다양한 방법

#column 과 index number
df[["name","str"]][:2]
#column 과 index name
df.loc[[23,24],["name","str"]]
#column number와 index number
df.iloc[:2,:2]

-> 아무래도 pandas는 행, 열 의 순서가 중요하지 않은 느낌. 걍 있는걸로 돌아가는 느낌

reindex

#직접
df.index= list(range(5))
#리셋된 인덱스 추가 
#(drop 하면 기존 index 없어짐)
#(inplace 하면 dataframe자체가 변화함)
df.reset_index(drop = True, inplace= True)

data drop

삭-제

df.drop([0,1,2,3])

df.drop("city", axis =1, inplace = True)

dataframe operation

column과 index 모두 고려

겹치는 index 없는 경우 NaN 값으로 반환

이를 변환 하기 위해 fill_value=0 사용하면 됨

df.add

.mul 

등등 연산 있음

lambda, map, apply

map for series

function 대신 dict, sequence 형 자료로 대체 가능

s1 = Series(np.arange(10))
#함수
s1.map(lambda x: x**2).head()
#dict
s1.map({1:'a',2:"b"}).head()
#series
s2 = Series(np.arange(10,29))
s1.map(s2).head()


#ex)
df["sex_code"] = df.sex.map({'male' : 0, 'female' : 1})

replace function

df.sex.replace({"male":0,"female":1})

df.sex.replace(["male","female"],[0,1], inplace = True)

항 뽑기

df.sex.unique()

apply for dataframe

둘이 같은거

df.sum()
df.apply(sum)

applymap for dataframe

series 단위가 아닌 element 단위로 함수를 적용함

series 단위에 apply를 적용시킬 때와 같은 효과

그냥 dataframe내 모든 값에 함수 적용 시킬 때 쓴다

f = lambda x: -x
df_info.applymap(f)

pandas builtin functions

describe()

numeric type 데이터의 요약 정보를 보여줌
unique()

series data의 유일한 값을 list로 반환
연산

sum, sub, mean, min, max, count, median, mad, var

isnull()

sum()을 뒤에 같이 써서 얼마나 비어져 있나 확인하는 용도로 쓸 수 있음

sort_values([“age”,”earn”],ascending = True)

column 기준으로 데이터 sorting

ascending : 오름차순

corr, cov, corrwith

Twitter Facebook LinkedIn

SunTae Hwang