# [ Data Preprocessing with NumPy ] {CheatSheet}
Basics and Array Creation:
● Create NumPy Array: np.array([1, 2, 3])
● Array Shape: array.shape
● Array Dimensions: array.ndim
● Array Size: array.size
● Reshape Array: array.reshape((rows, cols))
● Concatenate Arrays Vertically: np.vstack((array1, array2))
● Concatenate Arrays Horizontally: np.hstack((array1, array2))
● Transpose Array: array.T
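A minimal sketch tying the creation, reshaping, and stacking calls above together (array names and values here are illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])       # 1-D array, shape (6,)
m = a.reshape((2, 3))                   # reshape to 2 rows x 3 cols
print(m.shape, m.ndim, m.size)          # (2, 3) 2 6

b = np.array([[7, 8, 9]])
stacked = np.vstack((m, b))             # vertical concat  -> shape (3, 3)
wide = np.hstack((m, m))                # horizontal concat -> shape (2, 6)
print(stacked.T.shape)                  # transpose         -> (3, 3)
```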
Indexing and Slicing:
● Indexing: array[0]
● Slicing: array[1:4]
● Boolean Indexing: array[array > 5]
● Fancy Indexing: array[[1, 3, 5]]
Missing Data:
● Replace NaN with Zero: np.nan_to_num(array)
● Remove NaN Values: array = array[~np.isnan(array)]
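A short sketch of the two NaN-handling idioms above, assuming a 1-D float array:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

filled = np.nan_to_num(x)        # NaN -> 0.0 (returns a copy; x is unchanged)
cleaned = x[~np.isnan(x)]        # drop NaN entries entirely

print(filled)                    # [1. 0. 3. 0. 5.]
print(cleaned)                   # [1. 3. 5.]
```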
Mathematical Operations:
● Element-wise Addition: array1 + array2
● Element-wise Multiplication: array1 * array2
● Matrix Multiplication: np.dot(matrix1, matrix2)
● Element-wise Square Root: np.sqrt(array)
Statistical Operations:
● Mean: np.mean(array)
● Median: np.median(array)
● Standard Deviation: np.std(array)
● Variance: np.var(array)
● Minimum Value: np.min(array)
● Maximum Value: np.max(array)
Data Cleaning:
● Remove Duplicates: np.unique(array)
● Replace Values: np.where(array == 0, 1, array)
● Clip Values: np.clip(array, min_val, max_val)
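A quick sketch of the cleaning helpers above (values are illustrative):

```python
import numpy as np

x = np.array([0, 2, 2, 9, -3, 0])

unique_vals = np.unique(x)          # sorted unique values: [-3  0  2  9]
replaced = np.where(x == 0, 1, x)   # zeros replaced by 1:  [ 1  2  2  9 -3  1]
clipped = np.clip(x, 0, 5)          # bound values to [0, 5]: [0 2 2 5 0 0]
```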
Filtering and Sorting:
● Filter by Condition: array[array > threshold]
● Sort Array: np.sort(array)
● Sort by Column/Axis: array.sort(axis=0)
Random Sampling:
● Random Permutation: np.random.permutation(array)
● Random Sampling with Replacement: np.random.choice(array, size=n, replace=True)
● Shuffle Array: np.random.shuffle(array)
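A sketch of the sampling calls above; the legacy np.random.* functions work the same way, but a seeded Generator (np.random.default_rng) is used here so results are reproducible (the seed is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)                 # seeded generator
x = np.arange(10)

permuted = rng.permutation(x)                   # new shuffled copy
sample = rng.choice(x, size=5, replace=True)    # sample with replacement

rng.shuffle(x)                                  # shuffles x itself, in place
```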
Vectorization:
● Vectorized Operations: np.vectorize(function)(array)
File I/O:
● Read CSV: np.genfromtxt('data.csv', delimiter=',')
● Write CSV: np.savetxt('output.csv', array, delimiter=',')
Linear Algebra:
● Dot Product: np.dot(array1, array2)
● Matrix Inversion: np.linalg.inv(matrix)
● Eigenvalues and Eigenvectors: eigenvalues, eigenvectors = np.linalg.eig(matrix)
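A small sketch of the linear-algebra calls above on a 2x2 matrix (values illustrative):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

product = np.dot(A, v)                        # matrix-vector product -> [2. 7.]
A_inv = np.linalg.inv(A)                      # inverse; A @ A_inv is ~ identity
eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvectors are the columns, paired with eigenvalues
```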
Broadcasting:
● Broadcasting: array += 5
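Broadcasting stretches smaller shapes across compatible axes; a sketch with a (2, 3) matrix, a length-3 row, and a (2, 1) column (all illustrative):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)             # shape (2, 3)
row = np.array([10, 20, 30])               # shape (3,)

shifted = m + 5                            # scalar broadcast to every element
per_column = m + row                       # row added to each row of m
per_row = m + np.array([[100], [200]])     # shape (2, 1) broadcasts across columns
```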
Data Transformation:
● Log Transformation: np.log(array)
● Exponential Transformation: np.exp(array)
● Box-Cox Transformation: scipy.stats.boxcox(array)
Scaling and Normalization:
● Min-Max Scaling: (array - array.min()) / (array.max() - array.min())
● Standardization: (array - np.mean(array)) / np.std(array)
● Z-Score Transformation: scipy.stats.zscore(array)
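A sketch of the three scaling recipes above; scipy.stats.zscore matches the manual standardization under the default ddof=0 convention:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

min_max = (x - x.min()) / (x.max() - x.min())   # rescaled into [0, 1]
standardized = (x - np.mean(x)) / np.std(x)     # zero mean, unit (population) std
z = stats.zscore(x)                             # same result with default ddof=0
```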
Handling Categorical Data:
● One-Hot Encoding: np.eye(num_classes)[array]
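The np.eye trick above assumes integer class labels in the range 0..num_classes-1; a sketch:

```python
import numpy as np

labels = np.array([0, 2, 1, 2])        # integer class ids
num_classes = 3

one_hot = np.eye(num_classes)[labels]
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```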
Reshaping and Flattening:
● Flatten Array: array.flatten()
● Ravel Array: np.ravel(array)
Interpolation:
● Linear Interpolation: np.interp(x, xp, yp)
Polynomial Fitting:
● Polynomial Fitting: np.polyfit(x, y, degree)
Time Series Operations:
● Time Lag Transformation: np.roll(array, shift=n)
● Moving Average: np.convolve(array, np.ones(window)/window, mode='valid')
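A sketch of the lag and moving-average idioms above; note that np.roll wraps values around the ends, so lagged positions are usually masked or dropped:

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

lagged = np.roll(series, shift=1)     # [5. 1. 2. 3. 4.]; first value wrapped around
lagged[0] = np.nan                    # typical fix: invalidate the wrapped slot

window = 3
moving_avg = np.convolve(series, np.ones(window) / window, mode='valid')
# [2. 3. 4.]  (mean of each length-3 window)
```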
Image Processing:
● Image Resizing: scipy.ndimage.zoom(image, zoom=(2, 2, 1))
● Image Rotation: scipy.ndimage.rotate(image, angle=45, reshape=False)
Handling Strings:
● String Operations on Array: np.char.add(array1, array2)
Set Operations:
● Set Union: np.union1d(array1, array2)
● Set Intersection: np.intersect1d(array1, array2)
● Set Difference: np.setdiff1d(array1, array2)
Handling Dates:
● Convert to DateTime: np.datetime64('2022-01-01')
● Date Arithmetic: np.datetime64('2022-01-01') + np.timedelta64(5, 'D')
Handling Complex Numbers:
● Create Complex Numbers: complex(real, imag) (the np.complex alias is deprecated and removed in recent NumPy; use the built-in complex or dtype=np.complex128)
● Complex Conjugate: np.conjugate(complex_array)
Handling Inf and NaN:
● Replace Inf with NaN: array[np.isinf(array)] = np.nan
● Replace NaN with Mean: array[np.isnan(array)] = np.nanmean(array)
Distance Metrics:
● Euclidean Distance: np.linalg.norm(vector1 - vector2)
● Cosine Similarity: np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2)) (or sklearn.metrics.pairwise.cosine_similarity for 2-D arrays)
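A pure-NumPy sketch of both metrics above (vectors are illustrative):

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])

euclidean = np.linalg.norm(v1 - v2)                                      # straight-line distance
cosine_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))  # angle-based similarity
```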
Statistical Testing:
● T-Test for Independent Samples: t_stat, p_value = scipy.stats.ttest_ind(sample1, sample2)
● ANOVA Test: f_stat, p_value = scipy.stats.f_oneway(group1, group2, group3)
Outlier Detection:
● Z-Score Outliers: z_scores = scipy.stats.zscore(array); outliers = array[np.abs(z_scores) > 3]
Handling Logarithmic Data:
● Log Transformation for Skewed Data: log_array = np.log1p(skewed_array)
Handling Exponential Data:
● Exponential Transformation (for left-skewed data, or to invert a log transform): exp_array = np.exp(original_array)
Handling Power Law Data:
● Power Transformation: power_transformed_array, lambda_value = scipy.stats.boxcox(array)
● Yeo-Johnson Transformation: yeo_johnson_transformed_array, lambda_value = scipy.stats.yeojohnson(array)
Principal Component Analysis (PCA):
● PCA (scikit-learn): pca = PCA(n_components=2); transformed_data = pca.fit_transform(data)
Singular Value Decomposition (SVD):
● SVD: U, S, Vt = np.linalg.svd(matrix)
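A sketch of SVD plus a rank-k reconstruction, which is the usual preprocessing use; the matrix and k are illustrative:

```python
import numpy as np

matrix = np.random.default_rng(0).normal(size=(6, 4))

U, S, Vt = np.linalg.svd(matrix, full_matrices=False)

k = 2                                         # keep the top-2 singular values
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]   # best rank-2 approximation of matrix
```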
Handling Outliers:
● Winsorizing Outliers: winsorized_array = scipy.stats.mstats.winsorize(original_array, limits=[0.05, 0.05])
Time Window Operations:
● Rolling Window Mean: rolling_mean = pd.Series(array).rolling(window=3).mean()
Handling JSON Data:
● Convert NumPy Array to JSON: json_data = json.dumps(array.tolist())
● Convert JSON to NumPy Array: numpy_array = np.array(json.loads(json_data))
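A round-trip sketch for the JSON entries above; note the parse step (json.loads) before rebuilding the array:

```python
import json
import numpy as np

array = np.array([[1, 2], [3, 4]])

json_data = json.dumps(array.tolist())         # '[[1, 2], [3, 4]]'
restored = np.array(json.loads(json_data))     # back to a (2, 2) ndarray
```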
Handling CSV Data:
● Read CSV into NumPy Array: data = np.genfromtxt('data.csv', delimiter=',')
● CSV File Reading with Pandas: data = pd.read_csv('data.csv').values
Handling Excel Data:
● Read Excel into NumPy Array: data = pd.read_excel('data.xlsx', header=None).values
Handling Text Data:
● Convert Text to NumPy Array: text_array = np.array(list(text))
● Tokenization with CountVectorizer: vectorizer = sklearn.feature_extraction.text.CountVectorizer(); tokenized_matrix = vectorizer.fit_transform(text_data)
● TF-IDF Transformation: tfidf_transformer = sklearn.feature_extraction.text.TfidfTransformer(); tfidf_matrix = tfidf_transformer.fit_transform(count_matrix)
Handling Time Series Data:
● Time Series Rolling Mean: rolling_mean = pd.Series(array).rolling(window=3).mean()
● Time Series Differencing: differenced_series = np.diff(time_series, n=1)
Handling Multidimensional Arrays:
● Reshape to 3D Array: reshaped_array = original_array.reshape((num_samples, num_rows, num_cols))
Handling Spatial Data:
● Distance between Two Points in 2D Space: distance = np.linalg.norm(point1 - point2)
● Haversine Distance: haversine_distance = haversine(lon1, lat1, lon2, lat2) (requires a user-defined haversine helper or a third-party package; not part of NumPy)
Data Binning:
● Binning Numerical Data: binned_data = np.digitize(array, bins)
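A sketch of np.digitize; with the default right=False, a value falls in bin i when bins[i-1] <= value < bins[i]:

```python
import numpy as np

values = np.array([0.2, 1.5, 2.7, 3.3, 9.9])
bins = np.array([1.0, 2.0, 3.0])

bin_ids = np.digitize(values, bins)   # [0 1 2 3 3]
```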
Handling Imbalanced Data:
● Under-sampling with Random Choice (1-D data; see the sketch below for index-based sampling): undersampled_data = np.concatenate([np.random.choice(data[data_label == label], size=min_class_samples, replace=False) for label in unique_labels])
● Over-sampling with Repetition: oversampled_data = np.concatenate([data[data_label == minority_label] for _ in range(int(max_class_samples / min_class_samples))])
● Synthetic Over-sampling with SMOTE (imbalanced-learn): oversampled_data, oversampled_labels = SMOTE().fit_resample(data, labels)
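A pure-NumPy sketch of index-based under- and over-sampling for a two-class problem (label values and sizes are illustrative; SMOTE itself comes from the imbalanced-learn package):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
labels = np.array([0] * 90 + [1] * 10)          # class 1 is the minority

minority_idx = np.where(labels == 1)[0]
majority_idx = np.where(labels == 0)[0]

# Under-sample the majority class down to the minority size
under_idx = rng.choice(majority_idx, size=minority_idx.size, replace=False)
under_data = np.concatenate([data[under_idx], data[minority_idx]])

# Over-sample the minority class up to the majority size (with replacement)
over_idx = rng.choice(minority_idx, size=majority_idx.size, replace=True)
over_data = np.concatenate([data[majority_idx], data[over_idx]])
```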
Handling Image Data:
● Flatten 2D Image: flat_image = image.flatten()
● Reshape 1D Image to 2D: reshaped_image = flat_image.reshape((height, width))
● Convert Image to Grayscale: grayscale_image = np.dot(image[..., :3], [0.2989, 0.5870, 0.1140])
● Resize Image: resized_image = skimage.transform.resize(image, (new_height, new_width), mode='constant')
● Image Rotation with Scipy: rotated_image = scipy.ndimage.rotate(image, angle=45, reshape=False)
● Image Histogram Equalization: equalized_image = skimage.exposure.equalize_hist(image)
● Image Gaussian Blurring: blurred_image = skimage.filters.gaussian(image, sigma=2)
● Image Edge Detection: edges = skimage.feature.canny(image, sigma=1)
● Image Segmentation with K-Means Clustering: segmented_image = skimage.segmentation.slic(image, n_segments=100)
● Image Feature Extraction with Histogram of Oriented Gradients (HOG): features, hog_image = skimage.feature.hog(image, visualize=True)
● Image Cropping: cropped_image = original_image[y1:y2, x1:x2]
● Image Histogram: hist, bins = np.histogram(image.flatten(), bins=256, range=[0, 256])
● Image Thresholding: thresholded_image = cv2.threshold(grayscale_image, threshold_value, 255, cv2.THRESH_BINARY)[1]
● Image Morphological Operations: kernel = np.ones((5, 5), np.uint8); morph_image = cv2.morphologyEx(thresh_image, cv2.MORPH_OPEN, kernel)
● Image Contour Detection: contours, hierarchy = cv2.findContours(thresh_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
● Image Color Space Conversion: hsv_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
● Image Filtering with OpenCV: filtered_image = cv2.bilateralFilter(image, d=9, sigmaColor=75, sigmaSpace=75)
● Image Edge Detection with OpenCV: edges = cv2.Canny(image, low_threshold, high_threshold)
● Image Feature Extraction with OpenCV: sift = cv2.SIFT_create(); keypoints, descriptors = sift.detectAndCompute(gray_image, None)
● Image Template Matching with OpenCV: result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
● Image Superpixel Segmentation with OpenCV: segments = cv2.ximgproc.createSuperpixelSLIC(image, algorithm=0, region_size=10)
● Image Corner Detection with OpenCV: corners = cv2.goodFeaturesToTrack(image, maxCorners=25, qualityLevel=0.01, minDistance=10)
● Image Affine Transformation with OpenCV: rows, cols = image.shape[:2]; M = cv2.getRotationMatrix2D((cols/2, rows/2), angle, scale); rotated_image = cv2.warpAffine(image, M, (cols, rows))
● Image Perspective Transformation with OpenCV: pts1 = np.float32([[56,65],[368,52],[28,387],[389,390]]); pts2 = np.float32([[0,0],[300,0],[0,300],[300,300]]); M = cv2.getPerspectiveTransform(pts1, pts2); transformed_image = cv2.warpPerspective(image, M, (300, 300))
● Image Color Histogram with OpenCV: hist = cv2.calcHist([image], [0, 1, 2], None, [256, 256, 256], [0, 256, 0, 256, 0, 256])
● Image Color Quantization with K-Means Clustering: image_reshaped = image.reshape((-1, 3)); kmeans = KMeans(n_clusters=k).fit(image_reshaped); quantized_image = kmeans.cluster_centers_.astype(int)[kmeans.labels_].reshape(image.shape)
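A small NumPy-only sketch of the grayscale conversion, cropping, and histogram entries above, using a synthetic RGB image so it runs without any image file (sizes and crop bounds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)   # fake RGB image

gray = np.dot(image[..., :3], [0.2989, 0.5870, 0.1140])          # luminosity grayscale
crop = image[10:50, 20:60]                                       # rows y1:y2, cols x1:x2
hist, bin_edges = np.histogram(gray.flatten(), bins=256, range=[0, 256])
```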
Advanced Operations with NumPy:
● Handling Sparse Data: sparse_matrix = scipy.sparse.csr_matrix(array)
● Matrix Factorization with NMF (scikit-learn): nmf = sklearn.decomposition.NMF(n_components=2); W = nmf.fit_transform(data); H = nmf.components_
● Sparse Matrix Operations: result = sparse_matrix1.dot(sparse_matrix2)
Handling HDF5 Data:
● Read HDF5 File into NumPy Array: data = pd.read_hdf('data.h5').values
Handling XML Data:
● XML Parsing with BeautifulSoup: soup = BeautifulSoup(xml_data, 'xml'); values = [float(tag.text) for tag in soup.find_all('value')]
Handling SQLite Data:
● Read SQLite Database into NumPy Array: connection = sqlite3.connect('database.db'); query = 'SELECT * FROM table'; data = pd.read_sql(query, connection).values
Handling Pickle Data:
● Read Pickle File into NumPy Array: with open('data.pkl', 'rb') as f: data = pickle.load(f)
Handling Avro Data:
● Read Avro File into NumPy Array: import fastavro; with open('data.avro', 'rb') as f: records = list(fastavro.reader(f)) (materialize the lazy reader before the file closes)
Handling Parquet Data:
● Read Parquet File into NumPy Array: import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); data = table.to_pandas().values
Handling Feather Data:
● Read Feather File into NumPy Array: import pyarrow.feather as feather; table = feather.read_table('data.feather'); data = table.to_pandas().values
Handling Video Data:
● Read Video Frames into NumPy Array: import cv2; video_capture = cv2.VideoCapture('video.mp4'); frames = []; success, frame = video_capture.read(); while success: frames.append(frame); success, frame = video_capture.read() (then stack with video_array = np.array(frames))
Handling Audio Data:
● Read Audio File into NumPy Array: import librosa; audio_data, sampling_rate = librosa.load('audio.wav', sr=None)
Handling NumPy Datetime:
● NumPy Datetime Operations: date1 = np.datetime64('2022-01-01'); date2 = np.datetime64('2022-01-05'); days_difference = date2 - date1
Handling Complex Numbers:
● Complex Number Operations: complex_result = complex_array1 + complex_array2
Handling Units:
● Convert Units with Pint: import pint; ureg = pint.UnitRegistry(); quantity = 5 * ureg.meter; converted_quantity = quantity.to(ureg.feet)
Handling Heterogeneous Data:
● Structured Arrays: structured_array = np.array([(1, 'John', 25), (2, 'Alice', 30)], dtype=[('id', int), ('name', 'U10'), ('age', int)])
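A sketch of field access and boolean filtering on the structured array above (the age threshold is illustrative):

```python
import numpy as np

people = np.array([(1, 'John', 25), (2, 'Alice', 30)],
                  dtype=[('id', int), ('name', 'U10'), ('age', int)])

names = people['name']                        # array(['John', 'Alice'], dtype='<U10')
older_than_26 = people[people['age'] > 26]    # structured rows matching the mask
```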
Handling Point Cloud Data:
● Point Cloud Operations with Open3D: import open3d; point_cloud = open3d.io.read_point_cloud('point_cloud.ply'); downsampled_cloud = point_cloud.voxel_down_sample(voxel_size=0.05)
By: Waleed Mousa