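(파이썬 한권으로 끝내기) Correlation coefficients, linear regression, multiple regression, multicollinearity, and variable selection (forward selection, backward elimination, stepwise selection)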

나르시스트 2026. 4. 5. 18:21

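*Correlation coefficients

*Pearson correlation coefficient

Measures the strength of the linear relationship between continuous variables (interval or ratio scale).

from scipy import stats

stats.pearsonr(x=data['GRE'], y=data['LOR'])  # returns (r, p-value)
#data[['GRE', 'TOEFL', 'LOR']].corr(method='pearson')  # correlation matrix, values in [-1, 1]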
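
As a small illustration of what `pearsonr` computes, the sketch below derives r from its definition (covariance divided by the product of the standard deviations) and compares it with SciPy. The five data points are toy values chosen for the example, not the GRE/LOR data.

```python
import numpy as np
from scipy import stats

# Toy data (illustrative values only)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# SciPy returns the same coefficient together with a two-sided p-value
r_scipy, p_value = stats.pearsonr(x, y)

print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.4f}")
```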

q45['pair'] = q45.apply(lambda x: tuple(sorted([x['v1'], x['v2']])), axis=1)  # sorted tuple, so (a, b) and (b, a) count as the same pair
q45 = q45.drop_duplicates('pair').drop(columns='pair')  # drop rows with a duplicate 'pair', then drop the helper column

import pandas as pd
import pingouin as pg

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 5, 4, 5]
})

pg.corr(x=df['x'], y=df['y'], method='pearson')

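*BF10: the Bayes factor in the pingouin output. It compares how strongly the data support the alternative hypothesis relative to the null hypothesis (values above 1 favor the alternative) and is a key quantity in Bayesian statistics.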

*Spearman rank correlation coefficient

Used when the observations are on an ordinal scale.

from scipy import stats

stats.spearmanr(X, Y)

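*Kendall's tau

Used when the observations are on an ordinal scale, or when numeric observations have an extreme (heavily skewed) distribution.

from scipy.stats import kendalltau

corr, p = kendalltau(X, Y)

※ Monotonic relationship: as one variable increases or decreases, the other consistently increases or decreases as well.

→ When the X and Y variables differ markedly in how they are distributed, use Kendall's tau.

→ p < 0.05: the correlation is statistically significant, so the null hypothesis of no association is rejected.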
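
To make the idea of a monotonic relationship concrete, here is a minimal sketch on made-up data where y = x³. The relationship is perfectly monotonic but not linear, so the rank-based coefficients reach 1 while Pearson's r stays below 1.

```python
import numpy as np
from scipy import stats

# Made-up data: strictly increasing, but far from a straight line
x = np.arange(1, 11, dtype=float)
y = x ** 3

r, _ = stats.pearsonr(x, y)       # linear association
rho, _ = stats.spearmanr(x, y)    # rank-based (monotonic) association
tau, _ = stats.kendalltau(x, y)   # rank-based, from concordant/discordant pairs

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```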

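*Cramér's V (coefficient of association) (p.224)

Measures the association between two categorical variables.
It extends the chi-square test of independence and expresses how strongly two categorical variables are associated as a value between 0 and 1.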

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import math

data = np.array([[50, 30, 20], [30, 40, 30], [20, 10, 50]])

# Chi-square test of independence on the contingency table
chi2, p, dof, expected = chi2_contingency(data)

# Total number of observations
n = np.sum(data)

# Cramér's V
k = min(data.shape)  # the smaller of the number of rows and columns
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p}")
print(f"Cramér's V: {cramers_v:.4f}")

import numpy as np
from scipy.stats.contingency import association

data = np.array([[50, 30, 20], [30, 40, 30], [20, 10, 50]])

cramers_v = association(data, method='cramer')

print(f"Cramér's V: {cramers_v:.4f}")

→ A value of 0.6 or higher is taken to indicate a strong association

*Partial correlation analysis

import pandas as pd
import pingouin as pg

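# Z must not be an exact linear function of X (e.g. Z = X / 10); otherwise nothing of X
# is left after controlling for Z and the partial correlation is undefined.
data = {
    'X': [10, 20, 30, 40, 50, 60],
    'Y': [9, 19, 29, 41, 49, 59],
    'Z': [1, 3, 2, 5, 4, 6]
}

df = pd.DataFrame(data)

partial_corr = pg.partial_corr(data=df, x='X', y='Y', covar='Z')  # to control for several variables: covar=['Z', 'W']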
print(partial_corr)
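
For intuition, a partial correlation can be reproduced without pingouin. The sketch below, on simulated data (all names and values are assumptions for illustration), regresses x and y on the control variable z and correlates the residuals, then checks the result against the closed-form expression for a single covariate.

```python
import numpy as np

# Simulated data: x and y are related only through the shared driver z
rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + rng.normal(size=200)
y = z + rng.normal(size=200)

def residuals(target, control):
    """Residuals of `target` after a least-squares fit on `control` plus an intercept."""
    A = np.column_stack([np.ones_like(control), control])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ coef

# Partial correlation = correlation of the two residual series
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

# Closed form for one covariate
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_formula = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"raw r_xy = {r_xy:.3f}, partial r_xy.z = {r_partial:.3f}")
```

The raw correlation is sizeable because both variables follow z, and most of it disappears once z is controlled for.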

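*Linear regression analysis

① Is the regression model statistically significant? - F statistic
→ The model is significant if the p-value is below the significance level (α)
② How much of the variation in the data does the model explain? - R²
③ Are the individual regression coefficients significant? - t values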
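
The three checks above can be sketched by hand for a simple regression. This is a minimal illustration on toy data (not from the book), using only NumPy and SciPy; in practice the same numbers appear in the statsmodels `summary()` output.

```python
import numpy as np
from scipy import stats

# Toy data (illustrative values only)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
n, k = len(y), 1  # k = number of predictors, not counting the intercept

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
resid = y - X @ beta

rss = (resid ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()

# (2) R^2: share of the variance explained by the model
r2 = 1 - rss / tss

# (1) F statistic for the model as a whole, and its p-value
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
f_pvalue = stats.f.sf(f_stat, k, n - k - 1)

# (3) t statistic and two-sided p-value for each coefficient
sigma2 = rss / (n - k - 1)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se
t_pvalues = 2 * stats.t.sf(np.abs(t_stats), n - k - 1)

print(f"R^2 = {r2:.3f}, F = {f_stat:.3f} (p = {f_pvalue:.4f})")
print("t values:", t_stats, "p-values:", t_pvalues)
```

With a single predictor, the F test and the slope's t test are the same test (F = t²), so their p-values coincide.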

import statsmodels.api as sm
import pandas as pd

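# Note: X2 = X1 + 1 in this toy data, so X1, X2, and the constant are perfectly collinear
# and the individual coefficients are not uniquely identified. Use it only to see the API.
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 3, 4, 5, 6],
    'y': [1, 2, 1.3, 3.75, 2.25]
}
df = pd.DataFrame(data)

X = df[['X1', 'X2']]
y = df['y']

X = sm.add_constant(X)  # add a constant column so the model includes an intercept

model = sm.OLS(y, X).fit()
print(model.summary())

new_data = pd.DataFrame({
    'const': 1,  # constant column, matching the training design matrix
    'X1': [6, 7],
    'X2': [7, 8]
})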
predictions = model.predict(new_data)
print(predictions)

*Regression model with an interaction term

# Add the interaction term
df['X1_X2'] = df['X1'] * df['X2']  # interaction between X1 and X2

# Predictors (including the interaction term and a constant)
X_interaction = sm.add_constant(df[['X1', 'X2', 'X1_X2']])

model_interaction = sm.OLS(df['y'], X_interaction).fit()
print(model_interaction.summary())

import pandas as pd

from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

house = pd.read_csv('data/kc_house_data_gpt.csv')
house = house[['price', 'sqft_living']]

house.corr()  # check the correlation

y = house['price']
x = house[['sqft_living']]

lr = ols('price ~ sqft_living', data=house).fit()
pred = lr.predict(x)

print(lr.summary())

plt.scatter(x['sqft_living'], y)
plt.plot(x['sqft_living'], pred, color='red')
plt.show()

test_data = pd.DataFrame({
    'sqft_living': [1200, 1500, 2000, 2500, 3000]
})
test_pred = lr.predict(test_data)
print("Test Predictions:")
print(test_pred)

plt.scatter(test_data['sqft_living'], test_pred, color='blue', marker='x', label='Test Predictions')
plt.plot(x['sqft_living'], pred, color='red', label='Training Data Fit')
plt.xlabel('Square Feet of Living Space')
plt.ylabel('Price')
plt.legend()
plt.show()

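The regression model is statistically significant.
It explains 97.6% of the variance (R²), so the fit is very good.
The regression coefficient is also significant, which gives the equation Price = 279.7792 * sqft_living + 4069.0715.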

*Identifying influential points (Cook's distance, etc.)

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
import matplotlib.pyplot as plt

data = {
    'Area': [150, 200, 250, 300, 350, 400, 450, 500, 550, 600],
    'Bedrooms': [2, 3, 3, 4, 4, 4, 5, 5, 5, 6],
    'Distance': [5, 4, 4, 3, 3, 2, 2, 1, 1, 1],
    'Price': [300, 400, 350, 500, 550, 600, 700, 650, 700, 800]
}
df = pd.DataFrame(data)

X = df[['Area', 'Bedrooms', 'Distance']]
y = df['Price']

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

influence = OLSInfluence(model)
cooks_d = influence.cooks_distance[0]

print(cooks_d)

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

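# df here is a DataFrame with numeric columns 'x' and 'y' (e.g. the toy data from the pingouin example)
X = sm.add_constant(df['x'])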
model = sm.OLS(df['y'], X).fit()

influence = model.get_influence()
cooks_d, _ = influence.cooks_distance
dfbetas = influence.dfbetas
dffits = influence.dffits[0]
leverage = influence.hat_matrix_diag
# Visualization: sm.graphics.influence_plot(model)
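
To show where Cook's distance comes from, the sketch below computes it from the residuals and leverages on simulated data (all values are assumptions for illustration) and verifies one entry against the leave-one-out definition, i.e. how far the fitted coefficients move when that observation is dropped.

```python
import numpy as np

# Simulated data with one planted unusual observation
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 5.0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]  # number of coefficients, including the intercept

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
s2 = resid @ resid / (n - p)                  # residual variance estimate
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverage: diagonal of the hat matrix

# Cook's distance from residuals and leverage
cooks_d = resid ** 2 * h / (p * s2 * (1 - h) ** 2)

# Leave-one-out check for observation 0: refit without it and measure the coefficient shift
mask = np.arange(n) != 0
beta_loo, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
diff = beta - beta_loo
cooks_d0_loo = diff @ (X.T @ X) @ diff / (p * s2)

print(f"Cook's D[0]: formula = {cooks_d[0]:.4f}, leave-one-out = {cooks_d0_loo:.4f}")
```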

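*Multiple linear regression

*Multicollinearity

A pairwise correlation of 0.9 or higher between predictors suggests multicollinearity
A VIF (Variance Inflation Factor) of 10 or more indicates that multicollinearity is present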
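
As a rough sketch of what the VIF measures: each predictor is regressed on all the others, and VIF = 1 / (1 - R²) of that regression. The example uses simulated predictors (an assumption for illustration, not the Cars93 data) in which x2 is almost a copy of x1.

```python
import numpy as np

# Simulated predictors: x2 nearly duplicates x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1 - resid @ resid / ((target - target.mean()) ** 2).sum()
    return 1 / (1 - r2)

vifs = np.array([vif(X, j) for j in range(X.shape[1])])
print("VIF:", np.round(vifs, 2))

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_from_corr = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
```

Here x1 and x2 get VIFs far above 10, while x3 stays close to 1.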

import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

Cars = pd.read_csv('data/Cars93.csv')
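Cars.columns = Cars.columns.str.replace('.', '', regex=False)  # literal dot: 'MPG.city' -> 'MPGcity' (with regex=True, '.' would match every character)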

model = smf.ols(formula='Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=Cars)

result = model.fit()
print(result.summary())
print(Cars[['EngineSize', 'RPM', 'Weight', 'Length', 'MPGcity', 'MPGhighway']].corr())

# Split the response (y) and the predictors (X) into separate DataFrames
y, X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=Cars, return_type='dataframe')

vif_list = []
for i in range(1, len(X.columns)):
    vif_list.append([variance_inflation_factor(X.values, i), X.columns[i]])
print(pd.DataFrame(vif_list, columns=['vif', 'variable']))

→ MPGcity and MPGhighway have a correlation above 0.9, which indicates multicollinearity

→ Checking the VIF values shows that MPGcity should be removed

*Refit after removing MPGcity

model = smf.ols(formula='Price ~ EngineSize + RPM + Weight + Length + MPGhighway', data=Cars)
result = model.fit()
print(result.summary())

→ The p-value of MPGhighway drops markedly

*Variable selection: forward selection / backward elimination / stepwise selection (p.241)

import time
import itertools
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from patsy import dmatrices

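data = pd.read_csv('data/Cars93.csv')
data.columns = data.columns.str.replace('.', '', regex=False)  # 'MPG.city' -> 'MPGcity', so the formula below can use the names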
y, X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=data, return_type="dataframe")

def processSubset(X, y, feature_set):
    model = sm.OLS(y, X[list(feature_set)])
    regr = model.fit()
    AIC = regr.aic
    return {"model": regr, "AIC": AIC}

def forward(X, y, predictors):
    tic = time.time()
    remaining_predictors = [p for p in X.columns.difference(['Intercept']) if p not in predictors]

    results = []
    for p in remaining_predictors:
        results.append(processSubset(X=X, y=y, feature_set=predictors+[p]+['Intercept']))
    models = pd.DataFrame(results)
    best_model = models.loc[models['AIC'].argmin()]
    toc = time.time()

    print("Processed", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic))
    print("Selected predictors:", best_model['model'].model.exog_names, "AIC:", best_model['AIC'])
    return best_model

def backward(X, y, predictors):
    tic = time.time()

    results = []
    for combo in itertools.combinations(predictors, len(predictors) -1):
        results.append(processSubset(X=X, y=y, feature_set=list(combo)+['Intercept']))
    models = pd.DataFrame(results)
    best_model = models.loc[models['AIC'].argmin()]
    toc = time.time()

    print("Processed", models.shape[0], "models on", len(predictors)-1, "predictors in", (toc-tic))
    print("Selected predictors:", best_model['model'].model.exog_names, "AIC:", best_model['AIC'])
    return best_model

### Stepwise selection
def stepwise_model(X, y):
    stepModels = pd.DataFrame(columns=["AIC", "model"])
    tic = time.time()

    predictors = []
    # AIC of the intercept-only model is the starting benchmark
    smodelBefore = processSubset(X, y, predictors + ['Intercept'])['AIC']

    for i in range(1, len(X.columns.difference(['Intercept'])) + 1):
        forwardResult = forward(X, y, predictors)
        print("forward")
        stepModels.loc[i] = forwardResult
        predictors = stepModels.loc[i]["model"].model.exog_names
        predictors = [k for k in predictors if k != 'Intercept']
        backwardResult = backward(X, y, predictors)
        if backwardResult['AIC'] < forwardResult['AIC']:
            stepModels.loc[i] = backwardResult
            predictors = stepModels.loc[i]["model"].model.exog_names
            predictors = [k for k in predictors if k != 'Intercept']
            print('backward')
        if stepModels.loc[i]["AIC"] > smodelBefore:
            break  # this step made the AIC worse, so stop
        else:
            smodelBefore = stepModels.loc[i]["AIC"]
    toc = time.time()

    print("Total elapsed time : ", (toc - tic), "seconds")
    # return the model with the lowest AIC among the recorded steps
    best_step = stepModels['AIC'].astype(float).idxmin()
    return stepModels.loc[best_step, 'model']

Stepwise_best_model = stepwise_model(X=X, y=y)
print(Stepwise_best_model.summary())

### Forward selection
def forward_model(X, y):
    fModels = pd.DataFrame(columns=["AIC", "model"])
    tic = time.time()

    predictors = []
    for i in range(1, len(X.columns.difference(['Intercept'])) + 1):
        forwardResult = forward(X, y, predictors)
        if i > 1:
            if forwardResult['AIC'] > fmodelBefore:
                break
        fModels.loc[i] = forwardResult
        predictors = fModels.loc[i]["model"].model.exog_names
        fmodelBefore = fModels.loc[i]["AIC"]
        predictors = [k for k in predictors if k != 'Intercept']
    toc = time.time()
    print("Total elapsed time : ", (toc - tic), "seconds.")
    return (fModels['model'][len(fModels['model'])])

#Forward_best_model = forward_model(X=X, y=y)
#print(Forward_best_model.summary())

### Backward elimination
def backward_model(X, y):
    BModels = pd.DataFrame(columns=["AIC", "model"])
    tic = time.time()

    predictors = list(X.columns.difference(['Intercept']))
    # benchmark: the full model (all predictors plus the intercept)
    BmodelBefore = processSubset(X, y, predictors + ['Intercept'])['AIC']

    while (len(predictors) > 1):
        backwardResult = backward(X, y, predictors)
        if backwardResult['AIC'] > BmodelBefore:
            break
        BModels.loc[len(predictors) - 1] = backwardResult
        predictors = BModels.loc[len(predictors) - 1]["model"].model.exog_names
        BmodelBefore = backwardResult["AIC"]
        predictors = [k for k in predictors if k != 'Intercept']
    toc = time.time()
    print("Total elapsed time :", (toc - tic), "seconds.")
    # rows are appended as predictors are removed, so the last row is the final model
    return (BModels["model"].dropna().iloc[-1])

#Backward_best_model = backward_model(X=X, y=y)
#print(Backward_best_model.summary())

→ The final multiple linear regression model keeps Weight, RPM, and EngineSize, but its explanatory power is low (R² = 54.7%)
y = 0.0073*Weight + 0.0071*RPM + 4.3054*EngineSize - 51.7933

*A shorter version generated by GPT

import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices

data = pd.read_csv('data/Cars93.csv')
data.columns = data.columns.str.replace('.', '', regex=False)  # 'MPG.city' -> 'MPGcity'
y, X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=data, return_type="dataframe")

X = sm.add_constant(X)  # no-op here: dmatrices already added an 'Intercept' column

def forward_selection(X, y, alpha=0.05):
    selected = []
    candidates = [c for c in X.columns if c != 'Intercept']  # the intercept is always kept

    while candidates:
        pvals = pd.Series(index=candidates, dtype=float)
        for feature in candidates:
            # fit with the intercept, the features chosen so far, and one candidate
            model = sm.OLS(y, X[['Intercept'] + selected + [feature]]).fit()
            pvals[feature] = model.pvalues[feature]
        if pvals.min() < alpha:
            best_feature = pvals.idxmin()
            selected.append(best_feature)
            candidates.remove(best_feature)
        else:
            break

    return selected

selected_features = forward_selection(X, y)
print("Selected features:", selected_features)
Forward_best_model = sm.OLS(y, X[['Intercept'] + selected_features]).fit()
print(Forward_best_model.summary())

def backward_elimination(X, y):
    features = X.columns.tolist()
    
    while len(features) > 0:
        model = sm.OLS(y, X[features]).fit()
        pvals = model.pvalues.drop('Intercept')  # p-values of the predictors only (the intercept is never removed)
        max_pval = pvals.max()
        if max_pval > 0.05:
            worst_feature = pvals.idxmax()
            features.remove(worst_feature)
        else:
            break
    
    return features

selected_features = backward_elimination(X, y)
print("Selected features:", selected_features)
Backward_best_model = sm.OLS(y, sm.add_constant(X[selected_features])).fit()
print(Backward_best_model.summary())

def stepwise_selection(X, y, initial_list=None, threshold_in=0.05, threshold_out=0.05):
    included = list(initial_list or [])
    candidates = [c for c in X.columns if c != 'Intercept']  # the intercept is always kept
    while True:
        changed = False
        # forward step: add the candidate with the smallest p-value, if below threshold_in
        excluded = [c for c in candidates if c not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, X[['Intercept'] + included + [new_column]]).fit()
            new_pval[new_column] = model.pvalues[new_column]
        if not new_pval.empty and new_pval.min() < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True

        # backward step: drop the included feature with the largest p-value, if above threshold_out
        if included:
            model = sm.OLS(y, X[['Intercept'] + included]).fit()
            pvalues = model.pvalues.drop('Intercept')
            if pvalues.max() > threshold_out:
                worst_feature = pvalues.idxmax()
                included.remove(worst_feature)
                changed = True

        if not changed:
            break

    return included

selected_features = stepwise_selection(X, y)
print("Selected features:", selected_features)
Stepwise_best_model = sm.OLS(y, sm.add_constant(X[selected_features])).fit()
print(Stepwise_best_model.summary())

*An even shorter GPT version

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import pandas as pd

housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name="PRICE")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()

# Forward selection
sfs_forward = SFS(model, k_features='best', forward=True, floating=False, scoring='r2', cv=5)
sfs_forward.fit(X_train, y_train)
print('Forward selection, chosen features:', sfs_forward.k_feature_names_)

# Backward elimination
sfs_backward = SFS(model, k_features='best', forward=False, floating=False, scoring='r2', cv=5)
sfs_backward.fit(X_train, y_train)
print('Backward elimination, chosen features:', sfs_backward.k_feature_names_)

# Stepwise selection: floating=True lets the search also remove features after each addition
sfs_stepwise = SFS(model, k_features='best', forward=True, floating=True, scoring='r2', cv=5)
sfs_stepwise.fit(X_train, y_train)
print('Stepwise selection, chosen features:', sfs_stepwise.k_feature_names_)

*Selecting and storing the model with the lowest AIC

→ Evaluates every combination of variables (exhaustive search)

import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
import itertools
import time
pd.set_option('display.max_colwidth', None)

data = pd.read_csv('data/Cars93.csv')
data.columns = data.columns.str.replace('.', '', regex=False)  # 'MPG.city' -> 'MPGcity'
y, X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=data, return_type="dataframe")

X = sm.add_constant(X)  # no-op if dmatrices already added 'Intercept'
X.rename(columns={'const': 'Intercept'}, inplace=True)

# Fit and score the model for one particular subset of columns
def processSubset(X, y, feature_set):
    model = sm.OLS(y, X[feature_set]).fit()
    AIC = model.aic
    return {"model": model, "AIC": AIC, "features": feature_set}

# Pick the best model over all column combinations
def getBestModel(X, y):
    best_AIC = float('inf')
    best_model = None
    best_features = None
    
    # Try every possible combination of columns (the intercept is treated like any other column)
    for k in range(1, len(X.columns) + 1):
        for combo in itertools.combinations(X.columns, k):
            result = processSubset(X, y, list(combo))
            if result['AIC'] < best_AIC:
                best_AIC = result['AIC']
                best_model = result['model']
                best_features = result['features']
    
    return best_model, best_AIC, best_features

# Select the best model
best_model, best_AIC, best_features = getBestModel(X, y)

# Print the best model
print("\nSelected features:", best_features)
print("Best AIC:", best_AIC)
print("\nBest Model Summary:")
print(best_model.summary())
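
For reference, the AIC being minimized can be written out by hand. The sketch below assumes the usual Gaussian OLS form, AIC = -2 * log-likelihood + 2 * (number of coefficients), which is my understanding of what statsmodels reports as `aic`; the data are simulated for illustration and the function name `ols_aic` is hypothetical.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian OLS fit: -2 * log-likelihood + 2 * number of coefficients."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)  # maximized log-likelihood
    return -2 * llf + 2 * k

# Simulated data: y depends on x1 only
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
aic_null = ols_aic(np.column_stack([ones]), y)      # intercept only
aic_x1 = ols_aic(np.column_stack([ones, x1]), y)    # intercept + x1

print(f"AIC intercept-only = {aic_null:.2f}, AIC with x1 = {aic_x1:.2f}")
```

A lower AIC is better: the model with x1 fits far better, and the penalty of 2 per extra coefficient is small by comparison.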

*Variable selection with Mallows' Cp

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from itertools import combinations
import statsmodels.api as sm

np.random.seed(42)
n = 100
X = np.random.randn(n, 5)  # 5 predictors
beta = np.array([2, -1, 0.5, 0, 0])  # true coefficients (X4 and X5 have no effect)
y = X @ beta + np.random.randn(n) * 0.5  # response

df = pd.DataFrame(X, columns=['X1', 'X2', 'X3', 'X4', 'X5'])
df['y'] = y

# 2. Fit the full model and estimate the error variance from its residuals
full_X = sm.add_constant(df[['X1', 'X2', 'X3', 'X4', 'X5']])
full_model = sm.OLS(df['y'], full_X).fit()
sigma2 = full_model.mse_resid  # residual variance estimate

# 3. Mallows' Cp
def mallows_cp(rss, p, sigma2, n):
    return rss / sigma2 - (n - 2 * p)

# 4. Compute Mallows' Cp for every combination of variables
X_cols = ['X1', 'X2', 'X3', 'X4', 'X5']
n = df.shape[0]

results = []
for k in range(1, len(X_cols) + 1):
    for combo in combinations(X_cols, k):
        # Fit a regression model on this combination of variables
        X_train = df[list(combo)]
        model = LinearRegression().fit(X_train, df['y'])
        
        # Residual sum of squares (RSS)
        y_pred = model.predict(X_train)
        rss = mean_squared_error(df['y'], y_pred) * len(df['y'])
        
        # Cp
        p = len(combo) + 1  # number of parameters, including the intercept
        cp = mallows_cp(rss, p, sigma2, n)
        results.append((combo, cp))

# 5. Sort by Cp and print
results.sort(key=lambda x: x[1])
for combo, cp in results:
    print(f"Variables: {combo}, Cp: {cp:.2f}")
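
A short derivation helps when reading the output. With $p$ counting the intercept, as in the code above, and $\hat\sigma^2$ estimated from the full model, the full model's Cp always equals its own parameter count, so it cannot be used to judge the full model itself. Candidate subsets are usually judged by how small Cp is and how close it is to $p$.

```latex
C_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2} - (n - 2p),
\qquad
\hat\sigma^2 = \frac{\mathrm{RSS}_{\text{full}}}{n - p_{\text{full}}}

% For the full model, RSS_full / \hat\sigma^2 = n - p_full, hence
C_{p_{\text{full}}} = (n - p_{\text{full}}) - (n - 2\,p_{\text{full}}) = p_{\text{full}}

% In the example above: 5 predictors + intercept, so C_p(full) = 6.
```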

*Homoscedasticity check

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
bp_test = sms.het_breuschpagan(model.resid, model.model.exog)

labels = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
print(dict(zip(labels, bp_test)))

# Residuals vs. fitted values: a flat lowess line around 0 with even spread suggests constant variance
fitted = model.fittedvalues
residual = model.resid
sns.regplot(x=fitted, y=residual, lowess=True, line_kws={'color': 'red'})
plt.plot([fitted.min(), fitted.max()], [0, 0], '--', color='grey')

*When homoscedasticity is violated

→ Use weighted least squares (WLS) or compute heteroscedasticity-robust standard errors

X = sm.add_constant(X)

ols_model = sm.OLS(y, X).fit()

# Estimate the weights by regressing the absolute residuals on the predictors
residuals_abs = abs(ols_model.resid)
aux_model = sm.OLS(residuals_abs, X).fit()
weights = 1 / aux_model.predict(X) ** 2

# Fit the WLS model with the estimated weights
wls_model = sm.WLS(y, X, weights=weights).fit()
print(wls_model.summary())

X = sm.add_constant(X)

ols_model = sm.OLS(y, X).fit()
robust_ols_model = ols_model.get_robustcov_results()
print(robust_ols_model.summary())

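*Normality check

sr = stats.zscore(residual)
(theoretical_q, sample_q), _ = stats.probplot(sr)  # Q-Q plot coordinates
sns.scatterplot(x=theoretical_q, y=sample_q)
plt.plot([-3, 3], [-3, 3], '--', color='grey')

stats.shapiro(residual)

→ If the p-value is below the significance level, the residuals are judged not to be normally distributed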

*Kolmogorov-Smirnov test (two-sample): tests whether two samples come from the same distribution

import numpy as np
from scipy import stats

# Generate two samples (one normal, one uniform)
np.random.seed(0)
data1 = np.random.normal(0, 1, 100)
data2 = np.random.uniform(-1, 1, 100)

statistic, p_value = stats.ks_2samp(data1, data2)

print(f"KS statistic: {statistic}")
print(f"p-value: {p_value}")

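*Kolmogorov-Smirnov test (goodness of fit): checks whether a sample follows a normal distribution

import numpy as np
from scipy import stats

# Draw a sample from a normal distribution
np.random.seed(0)
data = np.random.normal(0, 1, 100)

# KS test against the standard normal N(0, 1); standardize the data first if its mean/std differ
statistic, p_value = stats.kstest(data, 'norm')

print(f"KS statistic: {statistic}")
print(f"p-value: {p_value}")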

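*When normality is violated

→ Fit a robust regression, which is less sensitive to outliers and departures from normality
→ If the response follows a specific distribution (e.g. binomial, Poisson), a generalized linear model, which does not require the normality assumption, is another option
→ Transform the variable: log transform, Box-Cox transform, etc.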
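
A minimal sketch of the transformation option, on simulated right-skewed data (an assumption for illustration). Note that both the log and the Box-Cox transform require strictly positive values; `scipy.stats.boxcox` picks the power parameter lambda by maximum likelihood when none is given.

```python
import numpy as np
from scipy import stats

# Simulated positive, right-skewed variable
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)

y_log = np.log(y)                 # log transform
y_boxcox, lam = stats.boxcox(y)   # Box-Cox transform with estimated lambda

print(f"skewness: raw = {stats.skew(y):.2f}, "
      f"log = {stats.skew(y_log):.2f}, "
      f"Box-Cox (lambda = {lam:.2f}) = {stats.skew(y_boxcox):.2f}")
```

After transforming, the model is refit on the transformed response and the normality check is repeated on the new residuals.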

import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
from patsy import dmatrices

data = pd.read_csv('data/Cars93.csv')
data.columns = data.columns.str.replace('.', '', regex=False)  # 'MPG.city' -> 'MPGcity'
y, X = dmatrices('Price ~ EngineSize + RPM + Weight + Length + MPGcity + MPGhighway', data=data, return_type="dataframe")

X = sm.add_constant(X)

robust_model = sm.RLM(y, X).fit()

print(robust_model.summary())

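*Independence check

→ Check the Durbin-Watson statistic in the summary() output
→ Values near 2 indicate no autocorrelation; values near 0 indicate positive autocorrelation; values near 4 indicate negative autocorrelation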

from statsmodels.stats.stattools import durbin_watson
durbin_watson(model.resid)
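
The statistic itself is simple: the sum of squared successive differences of the residuals divided by their sum of squares. The sketch below computes it by hand on two simulated series (assumptions for illustration): independent noise and a positively autocorrelated AR(1) series; the helper name `durbin_watson_manual` is hypothetical.

```python
import numpy as np

def durbin_watson_manual(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    diff = np.diff(resid)
    return (diff @ diff) / (resid @ resid)

rng = np.random.default_rng(0)
n = 2000
white = rng.normal(size=n)        # independent "residuals"

# AR(1) series with coefficient 0.8: each value carries over most of the previous one
ar1 = np.empty(n)
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson_manual(white)
dw_ar1 = durbin_watson_manual(ar1)
print(f"DW (independent) = {dw_white:.2f}, DW (positively autocorrelated) = {dw_ar1:.2f}")
```

Since DW is approximately 2 * (1 - r), where r is the lag-1 autocorrelation, the independent series lands near 2 and the AR(1) series well below it.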

*When the independence assumption is violated

→ Check for multicollinearity, or use ridge, lasso, principal component, or partial least squares regression
