Data Science Notes
6/24
Week 3 https://www.edx.org/course/introducti...
Week 4: Algorithms https://www.coursera.org/courses?lang...
Month 2
Month 3 (deep learning)
Linear regression
Logistic regression
Random forest
Gradient boosting
PCA
k-means clustering
k-nearest neighbors
Natural language processing (2 sessions)
Exploratory data analysis
Python web APIs
Feature engineering (2 sessions)
Object-oriented programming
Forecasting
Linear regression
Logistic regression
SVM
Random forest
Gradient boosting
PCA
k-means
Collaborative filtering
kNN
ARIMA
Gather data from various data sources (balanced vs. unbalanced datasets).
Check whether the data is in the right format: cleanse, wrangle, and explore it (EDA), and decide how
to handle missing values, so that it is better suited to the ML algorithm. (Feature engineering: also
apply some statistics, e.g. check the mean, median, and mode.)
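A minimal sketch of the missing-value step described above, assuming pandas is used. The `ages` column and its values are made up for illustration; they are not from any real dataset.

```python
import pandas as pd

# Hypothetical column with two missing values and one outlier (90).
ages = pd.Series([22.0, 25.0, 25.0, None, 31.0, None, 90.0], name="age")

n_missing = int(ages.isna().sum())  # how many values are missing
mean_age = ages.mean()              # NaN values are skipped by default
median_age = ages.median()
mode_age = ages.mode().iloc[0]      # mode() returns a Series; take the first mode

# Fill the gaps with the median, which is less affected by the outlier than the mean.
filled = ages.fillna(median_age)

print(n_missing, mean_age, median_age, mode_age)
```

Comparing the mean with the median is a quick way to see whether outliers would distort a mean-based fill.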
Coding library
Python:
The inplace parameter (pandas methods that accept it):
dropna()
drop_duplicates()
fillna()
query()
rename()
reset_index()
sort_index()
sort_values()
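A short sketch of how `inplace` changes the behavior of the methods listed above. The toy DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one duplicated row and one missing value.
df = pd.DataFrame({
    "city": ["A", "B", "B", "C"],
    "sales": [10.0, 20.0, 20.0, None],
})

# Default (inplace=False): a new DataFrame is returned and df is left unchanged.
cleaned = df.dropna()

# inplace=True: the object itself is modified and the call returns None.
work = df.copy()
result = work.drop_duplicates(inplace=True)

# Method chaining only works with the default form, because each call returns a DataFrame.
tidy = (
    df.dropna()
      .drop_duplicates()
      .rename(columns={"sales": "revenue"})
      .sort_values("revenue", ascending=False)
      .reset_index(drop=True)
)
print(tidy)
```

Because `inplace=True` returns None, assigning its result back (for example `df = df.dropna(inplace=True)`) replaces the DataFrame with None, which is a common mistake.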
import itertools
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
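from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

# Grid search over ARIMA orders with walk-forward validation.
# Assumes train, test (array-like) and p_values, d_values, q_values are already defined.
for p in p_values:
    for d in d_values:
        for q in q_values:
            order = (p, d, q)
            history = list(train)  # observations seen so far
            predictions = list()
            try:
                for i in range(len(test)):
                    model = ARIMA(history, order=order)
                    model_fit = model.fit()  # the current ARIMA API takes no disp argument
                    pred_y = model_fit.forecast()[0]
                    predictions.append(pred_y)
                    history.append(test[i])  # reveal the true value before the next step
                error = mean_squared_error(test, predictions)
                print(order, error)
            except Exception:
                continue  # skip orders that fail to fit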
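The three nested loops over p, d, and q can also be written as a single loop with `itertools.product`, which the notes already import. This is only a sketch: the candidate ranges below are hypothetical, since the notes do not say which values were searched.

```python
import itertools

# Hypothetical candidate ranges; the notes do not give the actual values.
p_values = range(0, 3)
d_values = range(0, 2)
q_values = range(0, 3)

# Every (p, d, q) combination, in the same order the nested loops would visit them.
orders = list(itertools.product(p_values, d_values, q_values))

for order in orders:
    # Fit and score one ARIMA model per order here.
    pass

print(len(orders), orders[0], orders[-1])
```

A flat list of orders also makes it easy to see how many models the search will fit before running it.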