NumPy Cheatsheet

The document provides a comprehensive cheatsheet for data preprocessing techniques using NumPy. It covers topics such as array creation and manipulation, indexing and slicing, handling missing data, mathematical and statistical operations, data cleaning, filtering and sorting, random sampling, vectorization, file I/O, linear algebra, broadcasting, data transformation, scaling and normalization, handling categorical data, reshaping, interpolation, time series operations, image processing, handling strings, sets, dates, complex numbers, and distances. Statistical testing, outlier detection, handling different data types and imbalanced data are also discussed.


# [ Data Preprocessing with NumPy ] {CheatSheet}

Basics and Array Creation:

● Create NumPy Array: np.array([1, 2, 3])


● Array Shape: array.shape
● Array Dimensions: array.ndim
● Array Size: array.size
● Reshape Array: array.reshape((rows, cols))
● Concatenate Arrays Vertically: np.vstack((array1, array2))
● Concatenate Arrays Horizontally: np.hstack((array1, array2))
● Transpose Array: array.T
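
The basics above can be combined in a short sketch. The array values are made up for illustration; only the NumPy calls come from the list.

```python
import numpy as np

# Build a 1-D array and inspect its basic attributes.
array = np.array([1, 2, 3, 4, 5, 6])
print(array.shape, array.ndim, array.size)

# Reshape into 2 rows x 3 columns; the total element count must stay the same.
matrix = array.reshape((2, 3))

# Stack two 2x3 arrays vertically (4x3) and horizontally (2x6).
stacked_v = np.vstack((matrix, matrix))
stacked_h = np.hstack((matrix, matrix))

# Transposing swaps the axes: (2, 3) becomes (3, 2).
transposed = matrix.T
```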

Indexing and Slicing:

● Indexing: array[0]
● Slicing: array[1:4]
● Boolean Indexing: array[array > 5]
● Fancy Indexing: array[[1, 3, 5]]
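
A small sketch of the four indexing styles on one illustrative array (the values are assumptions chosen so each result is easy to check by eye):

```python
import numpy as np

array = np.array([2, 4, 6, 8, 10, 12])

first = array[0]            # single element at position 0
window = array[1:4]         # slice: positions 1, 2 and 3
large = array[array > 5]    # boolean mask keeps values greater than 5
picked = array[[1, 3, 5]]   # fancy indexing selects positions 1, 3 and 5
```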

Missing Data:

● Replace NaN with Zero: np.nan_to_num(array)


● Remove NaN Values: array = array[~np.isnan(array)]
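
The two options behave differently: one keeps the array length, the other shortens it. A minimal sketch with made-up values:

```python
import numpy as np

array = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# Option 1: keep the length, replacing each NaN with 0.0.
zero_filled = np.nan_to_num(array)

# Option 2: drop the NaN entries, which shortens the array.
without_nan = array[~np.isnan(array)]
```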

Mathematical Operations:

● Element-wise Addition: array1 + array2


● Element-wise Multiplication: array1 * array2
● Matrix Multiplication: np.dot(matrix1, matrix2)
● Element-wise Square Root: np.sqrt(array)

Statistical Operations:

● Mean: np.mean(array)
● Median: np.median(array)
● Standard Deviation: np.std(array)

By: Waleed Mousa


● Variance: np.var(array)
● Minimum Value: np.min(array)
● Maximum Value: np.max(array)

Data Cleaning:

● Remove Duplicates: np.unique(array)


● Replace Values: np.where(array == 0, 1, array)
● Clip Values: np.clip(array, min_val, max_val)
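
A brief sketch of the three cleaning calls on illustrative data. Note that np.unique also sorts its result; the clip bounds 0 and 10 are assumed values standing in for min_val and max_val.

```python
import numpy as np

array = np.array([0, 7, 3, 7, 0, 15, -4])

# Unique values, returned in sorted order.
unique_values = np.unique(array)

# Replace every 0 with 1 and leave other entries unchanged.
replaced = np.where(array == 0, 1, array)

# Limit all values to the closed interval [0, 10].
clipped = np.clip(array, 0, 10)
```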

Filtering and Sorting:

● Filter by Condition: array[array > threshold]


● Sort Array: np.sort(array)
● Sort by Column/Axis: array.sort(axis=0)

Random Sampling:

● Random Permutation: np.random.permutation(array)


● Random Sampling with Replacement: np.random.choice(array, size=n, replace=True)
● Shuffle Array: np.random.shuffle(array)

Vectorization:

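● Vectorized Operations: np.vectorize(function)(array) (a convenience wrapper that still loops in Python; prefer built-in ufuncs such as np.sqrt where one exists)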

File I/O:

● Read CSV: np.genfromtxt('data.csv', delimiter=',')


● Write CSV: np.savetxt('output.csv', array, delimiter=',')

Linear Algebra:

● Dot Product: np.dot(array1, array2)


● Matrix Inversion: np.linalg.inv(matrix)
● Eigenvalues and Eigenvectors: eigenvalues, eigenvectors = np.linalg.eig(matrix)

Broadcasting:

● Broadcasting: array += 5
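
Broadcasting also applies between arrays of different shapes, not only between an array and a scalar. The sketch below centers each column of an illustrative 3x2 array; the data is made up.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [3.0, 20.0],
                 [5.0, 30.0]])

# A scalar is broadcast to every element.
data_plus_five = data + 5

# A (2,) vector of column means is broadcast across all 3 rows.
column_means = data.mean(axis=0)
centered = data - column_means
```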

Data Transformation:

● Log Transformation: np.log(array)


● Exponential Transformation: np.exp(array)
● Box-Cox Transformation: transformed, lmbda = scipy.stats.boxcox(array) (array must be strictly positive)

Scaling and Normalization:

● Min-Max Scaling: (array - array.min()) / (array.max() - array.min())
● Standardization: (array - np.mean(array)) / np.std(array)
● Z-Score Transformation: scipy.stats.zscore(array)
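
A NumPy-only sketch of the first two formulas on made-up values. Both divide by a quantity that is zero for a constant array, so real data may need a guard that this sketch omits.

```python
import numpy as np

array = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
value_range = array.max() - array.min()
min_max_scaled = (array - array.min()) / value_range

# Standardization yields zero mean and unit (population) standard deviation.
standardized = (array - np.mean(array)) / np.std(array)
```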

Handling Categorical Data:

● One-Hot Encoding: np.eye(num_classes)[array]
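
The one-hot idiom works by using integer labels to pick rows of an identity matrix. A minimal sketch, assuming the labels are integers in the range 0 to num_classes - 1:

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # integer class labels (illustrative)
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes)[labels]
```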

Reshaping and Flattening:

● Flatten Array: array.flatten()


● Ravel Array: np.ravel(array)

Interpolation:

● Linear Interpolation: np.interp(x, xp, yp)

Polynomial Fitting:

● Polynomial Fitting: np.polyfit(x, y, degree)

Time Series Operations:

● Time Lag Transformation: np.roll(array, shift=n)


● Moving Average: np.convolve(array, np.ones(window)/window, mode='valid')
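
One caveat with np.roll is that it wraps values around the end of the array, so a lag built with it usually needs the wrapped positions masked. The sketch below shows that, plus the moving average, on made-up data.

```python
import numpy as np

array = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Lag by one step; np.roll moves the last value to the front,
# so the wrapped position is overwritten with NaN.
lagged = np.roll(array, shift=1)
lagged[:1] = np.nan

# Moving average over a window of 3; 'valid' keeps only full windows,
# so the result has len(array) - window + 1 entries.
window = 3
moving_average = np.convolve(array, np.ones(window) / window, mode='valid')
```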

Image Processing:

● Image Resizing: scipy.ndimage.zoom(image, zoom=(2, 2, 1))


● Image Rotation: scipy.ndimage.rotate(image, angle=45, reshape=False)

Handling Strings:

● String Operations on Array: np.char.add(array1, array2)

Set Operations:

● Set Union: np.union1d(array1, array2)


● Set Intersection: np.intersect1d(array1, array2)
● Set Difference: np.setdiff1d(array1, array2)

Handling Dates:

● Convert to DateTime: np.datetime64('2022-01-01')


● Date Arithmetic: np.datetime64('2022-01-01') + np.timedelta64(5, 'D')

Handling Complex Numbers:

● Create Complex Numbers: complex(real, imag) for scalars, or real + 1j * imag for arrays (the np.complex alias was removed in NumPy 1.24)


● Complex Conjugate: np.conjugate(complex_array)

Handling Inf and NaN:

● Replace Inf with NaN: array[np.isinf(array)] = np.nan


● Replace NaN with Mean: array[np.isnan(array)] = np.nanmean(array)
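
Run in sequence, the two steps turn every non-finite entry into the mean of the finite ones. A sketch on made-up values, working on a copy so the input is left untouched:

```python
import numpy as np

array = np.array([1.0, np.inf, 3.0, np.nan, -np.inf, 5.0])
cleaned = array.copy()

# Step 1: mark +inf and -inf as missing.
cleaned[np.isinf(cleaned)] = np.nan

# Step 2: fill every missing entry with the mean of the remaining values.
# np.nanmean ignores NaN, so here it averages 1.0, 3.0 and 5.0.
cleaned[np.isnan(cleaned)] = np.nanmean(cleaned)
```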

Distance Metrics:

● Euclidean Distance: np.linalg.norm(vector1 - vector2)


● Cosine Similarity: sklearn.metrics.pairwise.cosine_similarity(array1, array2) (expects 2D inputs of shape (n_samples, n_features))
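
Both metrics can also be computed with NumPy alone. In the sketch below, cosine_similarity_1d is a hypothetical helper name for two 1-D vectors, not a library function, and it assumes neither vector is all zeros.

```python
import numpy as np

vector1 = np.array([1.0, 0.0])
vector2 = np.array([0.0, 1.0])

# Euclidean distance is the L2 norm of the difference.
euclidean = np.linalg.norm(vector1 - vector2)

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two non-zero 1-D vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

orthogonal = cosine_similarity_1d(vector1, vector2)
parallel = cosine_similarity_1d(vector1, 2 * vector1)
```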

Statistical Testing:

● T-Test for Independent Samples: t_stat, p_value = scipy.stats.ttest_ind(sample1, sample2)
● ANOVA Test: f_stat, p_value = scipy.stats.f_oneway(group1, group2, group3)

Outlier Detection:

● Z-Score Outliers: z_scores = scipy.stats.zscore(array)

Handling Logarithmic Data:

● Log Transformation for Skewed Data: log_array = np.log1p(skewed_array)

Handling Exponential Data:

● Exponential Transformation for Highly Skewed Data: exp_array = np.exp(original_array)

Handling Power Law Data:

● Power Transformation: power_transformed_array, lambda_value = scipy.stats.boxcox(array)
● Yeo-Johnson Transformation: yeo_johnson_transformed_array, lambda_value = scipy.stats.yeojohnson(array)

Principal Component Analysis (PCA):

● PCA: pca = sklearn.decomposition.PCA(n_components=2); transformed_data = pca.fit_transform(data)

Singular Value Decomposition (SVD):

● SVD: U, S, Vt = np.linalg.svd(matrix)
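
A quick way to sanity-check an SVD is to rebuild the matrix from its factors. The sketch below uses a random 5x3 matrix (an assumption for illustration) and full_matrices=False so the shapes line up for the product.

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.normal(size=(5, 3))

# Reduced SVD: U is 5x3, S holds 3 singular values, Vt is 3x3.
U, S, Vt = np.linalg.svd(matrix, full_matrices=False)

# Multiplying the factors back together recovers the original matrix
# up to floating-point error.
reconstructed = U @ np.diag(S) @ Vt
```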

Handling Outliers:

● Winsorizing Outliers: winsorized_array = scipy.stats.mstats.winsorize(original_array, limits=[0.05, 0.05])

Time Window Operations:

● Rolling Window Mean: rolling_mean = pd.Series(array).rolling(window=3).mean()

Interpolation:

● Linear Interpolation: interpolated_values = np.interp(x, xp, yp)

Handling JSON Data:

● Convert NumPy Array to JSON: json_data = json.dumps(array.tolist())


● Convert JSON to NumPy Array: numpy_array = np.array(json.loads(json_data))
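
A round-trip sketch, assuming the JSON string is parsed with json.loads before it is handed to np.array (passing the raw string would produce a zero-dimensional string array instead):

```python
import json
import numpy as np

array = np.array([[1, 2], [3, 4]])

# NumPy arrays are not JSON-serializable, so convert to nested lists first.
json_data = json.dumps(array.tolist())

# Parse the string back into nested lists, then rebuild the array.
restored = np.array(json.loads(json_data))
```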

Handling CSV Data:

● Read CSV into NumPy Array: data = np.genfromtxt('data.csv', delimiter=',')
● CSV File Reading with Pandas: data = pd.read_csv('data.csv').values

Handling Excel Data:

● Read Excel into NumPy Array: data = pd.read_excel('data.xlsx', header=None).values

Handling Text Data:

● Convert Text to NumPy Array: text_array = np.array(list(text))


● Tokenization with CountVectorizer: vectorizer = sklearn.feature_extraction.text.CountVectorizer(); tokenized_matrix = vectorizer.fit_transform(text_data)
● TF-IDF Transformation: tfidf_transformer = sklearn.feature_extraction.text.TfidfTransformer(); tfidf_matrix = tfidf_transformer.fit_transform(count_matrix)

Handling Time Series Data:

● Time Series Rolling Mean: rolling_mean = pd.Series(array).rolling(window=3).mean()
● Time Series Differencing: differenced_series = np.diff(time_series, n=1)

Handling Multidimensional Arrays:

● Reshape to 3D Array: reshaped_array = original_array.reshape((num_samples, num_rows, num_cols))

Handling Spatial Data:

● Distance between Two Points in 2D Space: distance = np.linalg.norm(point1 - point2)
● Calculate Haversine Distance: haversine_distance = haversine(lon1, lat1, lon2, lat2) (haversine is a user-defined helper, not a NumPy function)
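
NumPy has no built-in haversine function, so the call above needs a helper. The following is one possible sketch, assuming coordinates in decimal degrees, a spherical Earth, and a mean radius of 6371 km; the constant name is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # Haversine formula: a is the squared half-chord length between the points.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two points on the equator, 90 degrees of longitude apart.
quarter_equator = haversine(0.0, 0.0, 90.0, 0.0)
```

Because the helper uses only NumPy ufuncs, it also accepts arrays of coordinates and returns an array of distances.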

Data Binning:

● Binning Numerical Data: binned_data = np.digitize(array, bins)
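
np.digitize returns, for each value, the index of the bin it falls into. A sketch with made-up ages and bin edges, using the default right=False so each bin includes its left edge:

```python
import numpy as np

array = np.array([3, 18, 25, 40, 67])
bins = np.array([0, 18, 35, 65])   # illustrative bin edges

# Index i means bins[i-1] <= value < bins[i]; values at or above the
# last edge get index len(bins).
binned_data = np.digitize(array, bins)
```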

Handling Imbalanced Data:

● Under-sampling with Random Choice: undersampled_data = np.concatenate([np.random.choice(data[data_label == label], size=min_class_samples, replace=False) for label in unique_labels]) (data must be 1-D for np.random.choice)
● Over-sampling with Repetition: oversampled_data = np.concatenate([data[data_label == minority_label] for _ in range(max_class_samples // min_class_samples)]) (minority_label is the label of the smallest class)
● Synthetic Over-sampling with SMOTE: oversampled_data, oversampled_labels = imblearn.over_sampling.SMOTE().fit_resample(data, labels)
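
For 2-D feature matrices, under-sampling is easier to do on row indices than on the rows themselves. The sketch below is one way to do it, assuming a made-up dataset with 7 samples of class 0 and 3 of class 1; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

data = np.arange(20).reshape(10, 2)        # 10 samples, 2 features
data_label = np.array([0] * 7 + [1] * 3)   # imbalanced labels

unique_labels, counts = np.unique(data_label, return_counts=True)
min_class_samples = counts.min()

# For each class, draw min_class_samples row indices without replacement.
selected = np.concatenate([
    rng.choice(np.flatnonzero(data_label == label),
               size=min_class_samples, replace=False)
    for label in unique_labels
])

undersampled_data = data[selected]
undersampled_labels = data_label[selected]
```

Sampling indices keeps features and labels aligned and avoids drawing the same row twice.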

Handling Image Data:

● Flatten 2D Image: flat_image = image.flatten()


● Reshape 1D Image to 2D: reshaped_image = flat_image.reshape((height, width))

● Convert Image to Grayscale: grayscale_image = np.dot(image[..., :3], [0.2989, 0.5870, 0.1140])
● Resize Image: resized_image = skimage.transform.resize(image, (new_height, new_width), mode='constant')
● Image Rotation with Scipy: rotated_image = scipy.ndimage.rotate(image, angle=45, reshape=False)
● Image Histogram Equalization: equalized_image = skimage.exposure.equalize_hist(image)
● Image Gaussian Blurring: blurred_image = skimage.filters.gaussian(image, sigma=2)
● Image Edge Detection: edges = skimage.feature.canny(image, sigma=1)
● Image Segmentation with K-Means Clustering: segmented_image = skimage.segmentation.slic(image, n_segments=100)
● Image Feature Extraction with Histogram of Oriented Gradients (HOG): features, hog_image = skimage.feature.hog(image, visualize=True)
● Image Cropping: cropped_image = original_image[y1:y2, x1:x2]
● Image Histogram: hist, bins = np.histogram(image.flatten(), bins=256, range=[0,256])
● Image Thresholding: thresholded_image = cv2.threshold(grayscale_image, threshold_value, 255, cv2.THRESH_BINARY)[1]
● Image Morphological Operations: kernel = np.ones((5,5),np.uint8); morph_image = cv2.morphologyEx(thresh_image, cv2.MORPH_OPEN, kernel)
● Image Contour Detection: contours, hierarchy = cv2.findContours(thresh_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
● Image Color Spaces Conversion: hsv_image = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HSV) (use cv2.COLOR_BGR2HSV for images loaded with cv2.imread, which are in BGR order)
● Image Filtering with OpenCV: filtered_image = cv2.bilateralFilter(image, d=9, sigmaColor=75, sigmaSpace=75)
● Image Edge Detection with OpenCV: edges = cv2.Canny(image, low_threshold, high_threshold)
● Image Feature Extraction with OpenCV: sift = cv2.SIFT_create(); keypoints, descriptors = sift.detectAndCompute(gray_image, None)
● Image Template Matching with OpenCV: result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
● Image Superpixel Segmentation with OpenCV: slic = cv2.ximgproc.createSuperpixelSLIC(image, algorithm=cv2.ximgproc.SLIC, region_size=10); slic.iterate(); segments = slic.getLabels() (requires the opencv-contrib build)
● Image Corner Detection with OpenCV: corners = cv2.goodFeaturesToTrack(image, maxCorners=25, qualityLevel=0.01, minDistance=10)
● Image Affine Transformation with OpenCV: rows, cols = image.shape[:2]; M = cv2.getRotationMatrix2D((cols/2, rows/2), angle, scale); rotated_image = cv2.warpAffine(image, M, (cols, rows))
● Image Perspective Transformation with OpenCV: pts1 = np.float32([[56,65],[368,52],[28,387],[389,390]]); pts2 = np.float32([[0,0],[300,0],[0,300],[300,300]]); M = cv2.getPerspectiveTransform(pts1,pts2); transformed_image = cv2.warpPerspective(image,M,(300,300))
● Image Color Histogram with OpenCV: hist = cv2.calcHist([image], [0, 1, 2], None, [256, 256, 256], [0, 256, 0, 256, 0, 256])
● Image Color Quantization with K-Means Clustering: image_reshaped = image.reshape((-1, 3)); kmeans = sklearn.cluster.KMeans(n_clusters=k).fit(image_reshaped); quantized_image = kmeans.cluster_centers_.astype(int)[kmeans.labels_].reshape(image.shape)

Advanced Operations with NumPy:

● Handling Sparse Data: sparse_matrix = scipy.sparse.csr_matrix(array)
● Matrix Factorization with NMF: model = sklearn.decomposition.NMF(n_components=2); W = model.fit_transform(data); H = model.components_
● Sparse Matrix Operations: result = sparse_matrix1.dot(sparse_matrix2)

Handling HDF5 Data:

● Read HDF5 File into NumPy Array: data = pd.read_hdf('data.h5').values

Handling XML Data:

● XML Parsing with BeautifulSoup: soup = BeautifulSoup(xml_data, 'xml'); values = [float(tag.text) for tag in soup.find_all('value')]

Handling SQLite Data:

● Read SQLite Database into NumPy Array: connection = sqlite3.connect('database.db'); query = 'SELECT * FROM my_table'; data = pd.read_sql(query, connection).values (my_table is a placeholder name; table itself is a reserved word in SQL)

Handling Pickle Data:

● Read Pickle File into NumPy Array: with open('data.pkl', 'rb') as f: data = pickle.load(f)

Handling Avro Data:

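● Read Avro File into NumPy Array: with open('data.avro', 'rb') as f: records = list(fastavro.reader(f)) (requires import fastavro; the reader is an iterator of dicts, so it is consumed while the file is still open)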

Handling Parquet Data:

● Read Parquet File into NumPy Array: import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); data = table.to_pandas().values

Handling Feather Data:

● Read Feather File into NumPy Array: import pyarrow.feather as feather; table = feather.read_table('data.feather'); data = table.to_pandas().values

Handling Video Data:

● Read Video Frames into NumPy Array (requires import cv2):
  video_capture = cv2.VideoCapture('video.mp4')
  frames = []
  success, frame = video_capture.read()
  while success:
      frames.append(frame)
      success, frame = video_capture.read()
  video_capture.release()
  video_array = np.array(frames)

Handling Audio Data:

● Read Audio File into NumPy Array: import librosa; audio_data, sampling_rate = librosa.load('audio.wav', sr=None)

Handling NumPy Datetime:

● NumPy Datetime Operations: date1 = np.datetime64('2022-01-01'); date2 = np.datetime64('2022-01-05'); days_difference = date2 - date1

Handling Complex Numbers:

● Complex Numbers Operations: complex_result = complex_array1 + complex_array2

Handling Units:

● Convert Units with Pint: import pint; ureg = pint.UnitRegistry(); quantity = 5 * ureg.meter; converted_quantity = quantity.to(ureg.feet)

Handling Heterogeneous Data:

● Structured Arrays: structured_array = np.array([(1, 'John', 25), (2, 'Alice', 30)], dtype=[('id', int), ('name', 'U10'), ('age', int)])
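
Once built, a structured array can be queried by field name much like a small table. A short sketch using the same two records; the filter threshold of 26 is an arbitrary illustration.

```python
import numpy as np

structured_array = np.array(
    [(1, 'John', 25), (2, 'Alice', 30)],
    dtype=[('id', int), ('name', 'U10'), ('age', int)],
)

# Access one field across all records.
ages = structured_array['age']

# Filter records with a boolean mask built from a field.
older = structured_array[structured_array['age'] > 26]

# Sort records by a named field.
by_name = np.sort(structured_array, order='name')
```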

Handling Point Cloud Data:

● PointCloud Operations with Open3D: import open3d; point_cloud = open3d.io.read_point_cloud('point_cloud.ply'); downsampled_cloud = point_cloud.voxel_down_sample(voxel_size=0.05)
