
DSF Gourav-2

Uploaded by

abhishek9582822

Department of CSE-DS Engineering
DPG Institute of Technology and Management
Gurugram 122004, Haryana

DATA SCIENCE LAB FILE (LC-DS-341G)
V SEMESTER
CSE-DS ENGINEERING

Submitted to:
Dr. Poonam Sharma
Assoc. Professor
CSE Department

Submitted by:
Shivam Kumar
Roll No.:
B.Tech (CSE-DS), 5th sem

INDEX

S.No.  Program  Date  Sign.

1. Downloading, installing, and setting the path for R.
2. Give an idea of R data types.
3. R as a calculator: perform some arithmetic operations in R.
4. Perform some logical operations in R.
5. Write an R script to demonstrate loops.
6. Write an R script to change the structure of a data frame.
7. Write an R script to demonstrate the aggregate function in R.
8. Write an R script to handle missing values in R.
9. Write an R script to handle outliers.

PROGRAM-1
AIM :- Downloading, installing, and setting the path for R.
INTRODUCTION :- RStudio is an integrated development environment (IDE) for R. An IDE is
a GUI in which you can write your code, see the results, and inspect the variables that are
created during the course of programming.
R is a language and environment for statistical computing and graphics. It is a GNU
project similar to the S language and environment, which was developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.
• RStudio is available as both open-source and commercial software.
• RStudio is available in both Desktop and Server versions.
• RStudio is available for various platforms, such as Windows, Linux, and macOS.
Why use RStudio?
• It is a powerful IDE designed specifically for the R language.
• It provides literate programming tools, which allow R scripts, outputs, text, and images
to be combined into reports, Word documents, and even HTML files.
• Shiny (an open-source R package) allows us to create interactive content in
reports and presentations.
Advantages of R Programming
• Open Source. R is an open-source programming language. ...
• Exemplary Support for Data Wrangling. R provides exemplary
support for data wrangling. ...
• The Array of Packages. ...
• Quality Plotting and Graphing. ...
• Highly Compatible. ...
• Platform Independent. ...
• Eye-Catching Reports. ...
• Machine Learning Operations.

Installing RStudio on Windows:-

To install RStudio on Windows, follow these steps.
Step 1: First, set up an R environment on your local machine. You can download R from
CRAN (the Comprehensive R Archive Network).
Step 2: After downloading R for the Windows platform, install it by double-clicking the
installer.
Step 3: Download RStudio from its official page. Note: It is free of cost (under AGPL
licensing).
Step 4: After downloading, you will get a file named “RStudio-1.x.xxxx.exe” in your
Downloads folder.
Step 5: Double-click the installer, and install the software.
Step 6: Test the RStudio installation.
• Search for RStudio in the Windows search bar on the taskbar.
• Start the application.
• Enter the following code in the console.

Input : print('Hello world!')
Output : [1] "Hello world!"

Step 7: Your installation is successful.
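
The aim also mentions setting the path, which the steps above do not show. The following is a minimal sketch, assuming a default installation, of how to check from the R console which version of R is running and where its executables live. The directory reported by R.home("bin") is the one that would be added to the PATH environment variable in order to start R from a command prompt. The variable names r_bin, path_dirs, and on_path are illustrative only.

```r
# Report the version of R that is currently running
print(R.version.string)

# Directory that contains the R executables; adding this directory to the
# PATH environment variable lets you start R from a command prompt
r_bin <- R.home("bin")
print(r_bin)

# Split PATH into its individual directories
path_dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]

# Check whether the R executable directory is already listed in PATH
on_path <- normalizePath(r_bin, mustWork = FALSE) %in%
  normalizePath(path_dirs, mustWork = FALSE)
print(on_path)
```

If on_path is FALSE, R still works inside RStudio; the PATH entry matters only when R is launched from a terminal.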


Result:- RStudio has been successfully installed on the system.
PROGRAM-2
AIM:-Give an idea of R Data Types.
R supports a variety of data types, which can be broadly categorized as follows:
1. Basic Data Types:
• Numeric: Represents real numbers (both integers and decimals). By default, R stores
numbers as double-precision floating-point numbers.
x <- 3.14 # numeric
y <- 5 # numeric (treated as double)
• Integer: Represents whole numbers. Use the L suffix to explicitly declare integers.
x <- 5L # integer
• Complex: Represents complex numbers with real and imaginary parts.
z <- 2 + 3i # complex number
• Logical: Boolean values, TRUE or FALSE.
x <- TRUE
y <- FALSE
• Character: Represents text or string data.
name <- "John Doe"
2. Data Structures:
• Vector: A one-dimensional array that holds elements of the same type. Vectors can
be numeric, logical, character, or integer.
v <- c(1, 2, 3, 4) # numeric vector
names <- c("Alice", "Bob") # character vector
• Matrix: A two-dimensional array that holds elements of the same type. All elements
in a matrix must be of the same data type.
m <- matrix(1:9, nrow = 3, ncol = 3)
• Array: A multi-dimensional generalization of a matrix that can store elements of the
same type. Arrays can have more than two dimensions.
a <- array(1:8, dim = c(2, 2, 2)) # 3D array
• List: A generic vector that can hold elements of different types (numeric, character,
lists, etc.).
lst <- list(1, "apple", TRUE)
• Data Frame: A table-like structure, where each column can be of a different data
type (similar to a spreadsheet or SQL table).
df <- data.frame(Name = c("John", "Alice"), Age = c(23, 25))
3. Factor:
Represents categorical data and stores it as integers with a corresponding set of levels.
Factors are useful for handling categorical data.
factor_data <- factor(c("Male", "Female", "Female", "Male"))
4. NULL:
Represents an empty or undefined value.
x <- NULL
5. NA:
Represents missing values or undefined data.
x <- c(1, 2, NA, 4)
6. NaN (Not a Number):
Represents the result of an undefined mathematical operation such as 0/0. (Dividing a
nonzero number by zero gives Inf or -Inf, not NaN.)
x <- 0/0 # NaN
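
The types listed above can be checked directly in the console. The sketch below is one way to do it, using class() and typeof() on example values; the list name values and the loop are illustrative and not part of the original program.

```r
# Example values, one per basic type described above
values <- list(
  numeric   = 3.14,
  integer   = 5L,
  complex   = 2 + 3i,
  logical   = TRUE,
  character = "John Doe"
)

# class() gives the high-level type; typeof() gives the internal storage type
for (label in names(values)) {
  cat(label, ": class =", class(values[[label]]),
      ", typeof =", typeof(values[[label]]), "\n")
}

# A factor is stored as integer codes that index into its levels
factor_data <- factor(c("Male", "Female", "Female", "Male"))
print(levels(factor_data))      # levels are sorted alphabetically by default
print(as.integer(factor_data))  # integer codes for each element

# Special values
print(is.null(NULL))        # TRUE
print(is.na(c(1, NA, 3)))   # FALSE TRUE FALSE
print(is.nan(0 / 0))        # TRUE
print(1 / 0)                # Inf, not NaN
```

Note that typeof(3.14) reports "double" while class(3.14) reports "numeric", which matches the statement that R stores numbers as doubles by default.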
PROGRAM-3
AIM:-R as a Calculator: Perform some arithmetic operations in R.
Creating a simple calculator in R is straightforward. You can implement basic operations like
addition, subtraction, multiplication, and division using a function that takes user input for
the numbers and the operation. Here's how you can build a simple calculator:
Simple Calculator in R
# Define the calculator function
calculator <- function() {
  # Display menu options
  cat("Simple R Calculator\n")
  cat("1. Addition (+)\n")
  cat("2. Subtraction (-)\n")
  cat("3. Multiplication (*)\n")
  cat("4. Division (/)\n")

  # Get user input for numbers and operation
  num1 <- as.numeric(readline(prompt = "Enter the first number: "))
  num2 <- as.numeric(readline(prompt = "Enter the second number: "))
  operator <- readline(prompt = "Choose an operation (+, -, *, /): ")

  # Perform the operation based on user input
  result <- switch(operator,
    "+" = num1 + num2,
    "-" = num1 - num2,
    "*" = num1 * num2,
    "/" = if (num2 != 0) num1 / num2 else "Error: Division by zero",
    "Invalid operator" # Default when no case matches
  )

  # Display the result
  cat("Result:", result, "\n")
}

# Call the calculator function to run it (readline() needs an interactive session)
calculator()
Explanation:
1. User Input: The readline() function is used to take user input for the two numbers
and the operator. The as.numeric() function converts the input to numeric data type.
2. Switch Statement: The switch() function is used to handle the different operations
based on the user's choice. It selects the correct operation (addition, subtraction,
multiplication, or division) based on the operator entered by the user.
3. Division Handling: To avoid division by zero, a condition is used to check if the
second number is zero when performing division.
4. Result Display: The cat() function is used to display the result.
Example Output:
Simple R Calculator
1. Addition (+)
2. Subtraction (-)
3. Multiplication (*)
4. Division (/)
Enter the first number: 10
Enter the second number: 5
Choose an operation (+, -, *, /): +
Result: 15
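
R can also be used as a calculator directly, without defining a function. The sketch below shows the built-in arithmetic operators on two example values; the variable names are illustrative.

```r
a <- 17
b <- 5

sum_ab        <- a + b    # addition: 22
difference_ab <- a - b    # subtraction: 12
product_ab    <- a * b    # multiplication: 85
quotient_ab   <- a / b    # division: 3.4
power_ab      <- a ^ 2    # exponentiation: 289
remainder_ab  <- a %% b   # modulo (remainder): 2
int_div_ab    <- a %/% b  # integer division: 3

# Operator precedence follows the usual rules; parentheses override it
precedence_demo <- 2 + 3 * 4    # 14
grouped_demo    <- (2 + 3) * 4  # 20

# Common built-in mathematical functions
root_demo  <- sqrt(16)           # 4
round_demo <- round(3.14159, 2)  # 3.14

print(c(sum_ab, difference_ab, product_ab, quotient_ab))
print(c(power_ab, remainder_ab, int_div_ab))
print(c(precedence_demo, grouped_demo, root_demo, round_demo))
```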
PROGRAM-4
AIM:- Perform some Logical Operations in R.
Logical operations in R are used to compare values and return Boolean results (TRUE or
FALSE). These operations can be applied to vectors, matrices, or individual values. Here's an
overview of logical operators and how they work in R:
1. Basic Logical Operators:
• AND (& for element-wise and && for short-circuit):
o &: Element-wise logical AND. It checks each element pair.
o &&: Short-circuit logical AND for single (length-one) values. The right-hand side is
evaluated only if the left-hand side is TRUE. Since R 4.3.0, passing a vector of length
greater than one to && is an error; earlier versions used only the first element.
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Element-wise AND
x & y # Returns: TRUE FALSE FALSE

# Short-circuit AND on single values (here, the first elements)
x[1] && y[1] # Returns: TRUE
• OR (| for element-wise and || for short-circuit):
o |: Element-wise logical OR.
o ||: Short-circuit logical OR for single (length-one) values. The right-hand side is
evaluated only if the left-hand side is FALSE. The same length restriction as for &&
applies.
x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, TRUE, FALSE)

# Element-wise OR
x | y # Returns: TRUE TRUE TRUE

# Short-circuit OR on single values (here, the first elements)
x[1] || y[1] # Returns: TRUE
• NOT (!):
o Negates a logical value (TRUE becomes FALSE, and vice versa).
x <- TRUE
!x # Returns: FALSE
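
The difference between the element-wise and short-circuit forms is easiest to see by recording which operands actually get evaluated. The sketch below does this with a hypothetical helper, note(), and a hypothetical log environment, log_env; neither is part of base R.

```r
# Environment used to record which operands were evaluated
log_env <- new.env()
log_env$evaluated <- character(0)

# Hypothetical helper: logs its label, then returns the given value
note <- function(label, value) {
  log_env$evaluated <- c(log_env$evaluated, label)
  value
}

# && stops after a FALSE left-hand side: the right-hand side is never evaluated
and_result <- note("and-left", FALSE) && note("and-right", TRUE)

# || stops after a TRUE left-hand side: the right-hand side is never evaluated
or_result <- note("or-left", TRUE) || note("or-right", FALSE)

print(log_env$evaluated)  # only the left-hand operands were evaluated

# Typical use: guard a test that would fail on an empty vector
v <- numeric(0)
safe_check <- length(v) > 0 && v[1] > 0
print(safe_check)  # FALSE, and v[1] is never inspected
```

This is why && and || are the usual choice inside if conditions, while & and | are used for filtering vectors.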
2. Comparison Operators:
These operators return logical values based on comparisons between two values or vectors.
• Equal to (==):
5 == 5 # Returns: TRUE
5 == 6 # Returns: FALSE
• Not equal to (!=):
5 != 6 # Returns: TRUE
5 != 5 # Returns: FALSE
• Greater than (>):
7 > 3 # Returns: TRUE
• Less than (<):
2 < 8 # Returns: TRUE
• Greater than or equal to (>=):
5 >= 5 # Returns: TRUE
• Less than or equal to (<=):
4 <= 6 # Returns: TRUE
3. Logical Functions:
• any(): Returns TRUE if at least one of the elements is TRUE.
x <- c(FALSE, TRUE, FALSE)
any(x) # Returns: TRUE
• all(): Returns TRUE only if all elements are TRUE.
x <- c(TRUE, TRUE, FALSE)
all(x) # Returns: FALSE
• xor(): Returns TRUE when exactly one of the two operands is TRUE, but not both.
xor(TRUE, FALSE) # Returns: TRUE
xor(TRUE, TRUE) # Returns: FALSE
4. Combining Logical Operations:
Logical operators can be combined to form more complex expressions.
x <- 5
y <- 10

(x < 6) & (y > 5) # Returns: TRUE (both conditions are TRUE)

!(x > 6) | (y == 10) # Returns: TRUE (both operands are TRUE; one would be enough for OR)


5. Logical Operations with Vectors:
When logical operations are applied to vectors, they are evaluated element-wise.
a <- c(1, 2, 3)
b <- c(3, 2, 1)

a > b # Returns: FALSE FALSE TRUE


These logical operations help in data manipulation, filtering, and decision-making in R.
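
As an illustration of filtering and decision-making, the sketch below uses a logical vector to select elements and to label values. The data in ages and people are made up for the example.

```r
# Example data (hypothetical)
people <- c("John", "Alice", "Bob", "Emma", "Ravi")
ages   <- c(25, 30, 28, 22, 41)

# Element-wise condition: older than 24 AND younger than 35
in_range <- ages > 24 & ages < 35
print(in_range)  # TRUE TRUE TRUE FALSE FALSE

# Filtering: a logical vector used as an index keeps the TRUE positions
selected <- people[in_range]
print(selected)  # "John" "Alice" "Bob"

# Counting: TRUE counts as 1 and FALSE as 0 when summed
n_match <- sum(in_range)
print(n_match)  # 3

# Decision-making: ifelse() chooses a value per element
age_group <- ifelse(ages >= 30, "30 or over", "under 30")
print(age_group)
```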
PROGRAM-5
AIM:-Write an R Script to demonstrate loops.
In R, loops are used to iterate over a sequence of elements or execute code repeatedly. The
most common types of loops in R are for, while, and repeat loops. Here’s an overview of
each with examples.
1. For Loop
The for loop in R iterates over a sequence, executing the code block for each element.
Syntax:
for (variable in sequence) {
  # Code to execute
}
Example:
# Print numbers 1 to 5
for (i in 1:5) {
  print(i)
}
Example with Vector:
# Loop through a vector
vec <- c("apple", "banana", "cherry")
for (fruit in vec) {
  print(fruit)
}
2. While Loop
A while loop keeps executing the block of code as long as the condition is TRUE.
Syntax:
while (condition) {
  # Code to execute
}
Example:
# Print numbers from 1 to 5
i <- 1
while (i <= 5) {
  print(i)
  i <- i + 1
}
3. Repeat Loop
A repeat loop is an infinite loop unless a condition is met and the loop is broken using break.
Syntax:
repeat {
  # Code to execute
  if (condition) {
    break
  }
}
Example:
# Print numbers from 1 to 5 using repeat loop
i <- 1
repeat {
  print(i)
  i <- i + 1
  if (i > 5) {
    break
  }
}
4. Loop Control Statements
• break: Used to exit a loop early.
• next: Skips the current iteration and moves to the next iteration.
Example of break:
# Stop the loop when i is equal to 3
for (i in 1:5) {
  if (i == 3) {
    break
  }
  print(i)
}
Example of next:
# Skip printing the number 3
for (i in 1:5) {
  if (i == 3) {
    next
  }
  print(i)
}
5. Nested Loops
Loops can be nested within other loops.
Example:
# Nested for loop that prints every (i, j) index pair of a 3x3 grid
for (i in 1:3) {
  for (j in 1:3) {
    print(paste("i:", i, "j:", j))
  }
}
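
Loops are often used to accumulate a result or to fill a vector. The sketch below, using made-up values in scores, shows both patterns with seq_along(), which yields an empty sequence for an empty vector (unlike 1:length(x)), and compares the results with the equivalent vectorized expressions.

```r
scores <- c(88.5, 92.0, 79.5, 85.0)

# Accumulate a running total with a for loop
total <- 0
for (i in seq_along(scores)) {
  total <- total + scores[i]
}
print(total)

# Fill a preallocated vector inside a loop
squares <- numeric(length(scores))
for (i in seq_along(scores)) {
  squares[i] <- scores[i]^2
}
print(squares)

# The same results without explicit loops (vectorized)
print(sum(scores))
print(scores^2)

# seq_along() on an empty vector produces no iterations at all
iterations <- 0
for (i in seq_along(numeric(0))) {
  iterations <- iterations + 1
}
print(iterations)  # 0
```

For simple element-wise work the vectorized form is usually shorter and faster; explicit loops remain useful when each step depends on the previous one.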
PROGRAM-6
AIM:-Write an R script to change the structure of a Data Frame.
In R, data frames are one of the most commonly used data structures for handling tabular
data. A data frame is essentially a table where each column can contain different types of data
(e.g., numeric, character, or logical). Data frames are widely used for data manipulation and
analysis in R, especially in the context of datasets.
1. Creating a Data Frame
You can create a data frame using the data.frame() function by specifying vectors of equal
length as columns.
Example:
# Creating vectors for each column
names <- c("John", "Alice", "Bob")
ages <- c(25, 30, 28)
scores <- c(88.5, 92.0, 79.5)

# Creating a data frame
df <- data.frame(Name = names, Age = ages, Score = scores)

# Display the data frame
print(df)
Output:
Name Age Score
1 John 25 88.5
2 Alice 30 92.0
3 Bob 28 79.5
2. Exploring and Accessing Data in a Data Frame
Once you have a data frame, you can access and explore the data in various ways.
Accessing Columns:
You can access individual columns of a data frame using the $ operator or by indexing.
# Accessing the "Name" column
df$Name
# Using indexing
df[, "Name"] # Same as df$Name
df[["Name"]] # Same as df$Name
Accessing Rows:
You can use row indexing to access specific rows of a data frame.
# Access the first row
df[1, ]

# Access multiple rows
df[1:2, ]
Accessing Specific Elements:
To access specific elements, you can use row and column indexing.
# Access the element in the first row and second column
df[1, 2] # Output: 25 (John's age)

# Access the element in the second row and "Score" column
df[2, "Score"] # Output: 92
Viewing the Structure:
To view the structure of a data frame (including data types of each column), use the str()
function.
str(df)
Example Output:
'data.frame': 3 obs. of 3 variables:
$ Name : chr "John" "Alice" "Bob"
$ Age : num 25 30 28
$ Score: num 88.5 92 79.5
3. Adding and Removing Columns
You can easily add or remove columns in a data frame.
Adding a New Column:
# Adding a new column for gender
df$Gender <- c("Male", "Female", "Male")
print(df)
Output:
Name Age Score Gender
1 John 25 88.5 Male
2 Alice 30 92.0 Female
3 Bob 28 79.5 Male
Removing a Column:
To remove a column, you can use the NULL assignment.
# Removing the "Gender" column
df$Gender <- NULL
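
Besides adding and removing columns, the structure of a data frame can be changed by converting column types, renaming columns, and reordering them. The following is a minimal sketch of these operations on a separate copy, here called students (a name chosen for this example so that df is left unchanged); the new column name Marks is also illustrative.

```r
# A separate data frame with the same starting structure as df
students <- data.frame(
  Name  = c("John", "Alice", "Bob"),
  Age   = c(25, 30, 28),
  Score = c(88.5, 92.0, 79.5),
  stringsAsFactors = FALSE
)

# Structure before the changes
str(students)

# Convert column types
students$Age  <- as.integer(students$Age)  # numeric -> integer
students$Name <- as.factor(students$Name)  # character -> factor

# Rename a column
names(students)[names(students) == "Score"] <- "Marks"

# Reorder the columns
students <- students[, c("Name", "Marks", "Age")]

# Structure after the changes
str(students)
```

Comparing the two str() outputs shows the changed types (Factor and int), the new column name, and the new column order.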
4. Adding and Removing Rows
Adding a Row:
To add a row, you can use rbind() to combine an existing data frame with a new row.
# Adding a new row
new_row <- data.frame(Name = "Emma", Age = 22, Score = 85.0)
df <- rbind(df, new_row)
print(df)
Output:
Name Age Score
1 John 25 88.5
2 Alice 30 92.0
3 Bob 28 79.5
4 Emma 22 85.0
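Removing a Row:
To remove a row, you can use negative indexing. The result is stored in a new variable here
so that df keeps all four rows for the examples that follow.
# Removing the first row
df_without_first <- df[-1, ]
print(df_without_first)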
5. Subsetting Data Frames
You can extract subsets of data based on conditions.
Example:
# Subset rows where Age is greater than 25
subset_df <- subset(df, Age > 25)
print(subset_df)
Output:
Name Age Score
2 Alice 30 92.0
3 Bob 28 79.5
6. Handling Missing Data
Missing values in data frames are represented as NA. You can detect, remove, or replace
missing data as needed.
Detecting Missing Values:
# Check for missing values
is.na(df)
Removing Rows with Missing Values:
# Remove rows with any missing values
df_clean <- na.omit(df)
Replacing Missing Values:
# Replace missing values in the "Score" column with the mean score
df$Score[is.na(df$Score)] <- mean(df$Score, na.rm = TRUE)
7. Sorting Data Frames
You can sort a data frame by one or more columns using the order() function.
Sorting by One Column:
# Sorting by the "Age" column
df_sorted <- df[order(df$Age), ]
print(df_sorted)
Sorting by Multiple Columns:
# Sorting by "Age" and then by "Score"
df_sorted <- df[order(df$Age, df$Score), ]
print(df_sorted)
8. Merging Data Frames
You can merge two data frames using the merge() function, which performs a join operation.
Example:
# Create another data frame with additional information
df2 <- data.frame(Name = c("John", "Alice", "Emma"),
                  Country = c("USA", "UK", "Canada"))

# Merge the two data frames by the "Name" column
df_merged <- merge(df, df2, by = "Name")
print(df_merged)
Output:
Name Age Score Country
1 Alice 30 92.0 UK
2 Emma 22 85.0 Canada
3 John 25 88.5 USA
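
Bob does not appear in the merged output because merge() performs an inner join by default: only names present in both data frames are kept. The sketch below, using two small stand-in data frames named people and countries (names chosen for this example), contrasts the default with a left join requested through the all.x argument.

```r
# Stand-in data frames mirroring df and df2 above
people <- data.frame(
  Name = c("John", "Alice", "Bob", "Emma"),
  Age  = c(25, 30, 28, 22),
  stringsAsFactors = FALSE
)
countries <- data.frame(
  Name    = c("John", "Alice", "Emma"),
  Country = c("USA", "UK", "Canada"),
  stringsAsFactors = FALSE
)

# Inner join (default): only names found in both data frames
inner_join <- merge(people, countries, by = "Name")
print(inner_join)

# Left join: keep every row of the first data frame; unmatched rows get NA
left_join <- merge(people, countries, by = "Name", all.x = TRUE)
print(left_join)
```

In the left join, Bob is retained with NA in the Country column, which then needs the missing-value handling described in the next section of this program.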
9. Summary Statistics of Data Frames
You can calculate summary statistics for numerical columns using the summary() function.
Example:
# Summarize the data frame
summary(df)
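Example Output (values shown exactly; summary() prints them rounded to four significant
digits, and in R 4.0 or later Name is a character column, so it is summarized by length,
class, and mode rather than by counts):
Name : Length 4, Class character, Mode character
Age : Min. 22.00, 1st Qu. 24.25, Median 26.50, Mean 26.25, 3rd Qu. 28.50, Max. 30.00
Score: Min. 79.50, 1st Qu. 83.625, Median 86.75, Mean 86.25, 3rd Qu. 89.375, Max. 92.00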
PROGRAM-7
AIM:- Write an R script to demonstrate Aggregate Functions in R
The aggregate() function is used to compute summary statistics of the data by group. The
statistics include mean, min, sum, max, etc.
Syntax:
aggregate(dataframe$aggregate_column, list(dataframe$group_column), FUN)
where
• dataframe is the input data frame.
• aggregate_column is the column to be aggregated in the data frame.
• group_column is the column whose values define the groups.
• FUN is the function applied to each group, such as sum, mean, min, or max.
Example 1: R program to create a data frame with 4 columns, group it by subjects, and get
aggregates such as the sum, minimum, and maximum.

# create a data frame with 4 columns
data <- data.frame(subjects = c("java", "python", "java",
                                "java", "php", "php"),
                   id = c(1, 2, 3, 4, 5, 6),
                   names = c("manoj", "sai", "mounika",
                             "durga", "deepika", "roshan"),
                   marks = c(89, 89, 76, 89, 90, 67))

# display
print(data)

# aggregate sum of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = sum))

# aggregate minimum of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = min))

# aggregate maximum of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = max))
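Output (grouped results for sum, minimum, and maximum, collected in one table):
subject  sum  min  max
java     254   76   89
php      157   67   90
python    89   89   89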

Example 2: R program to create a data frame with 4 columns, group it by subjects, and get
the average (mean).

# create a data frame with 4 columns
data <- data.frame(subjects = c("java", "python", "java",
                                "java", "php", "php"),
                   id = c(1, 2, 3, 4, 5, 6),
                   names = c("manoj", "sai", "mounika",
                             "durga", "deepika", "roshan"),
                   marks = c(89, 89, 76, 89, 90, 67))

# display
print(data)

# aggregate average of marks with subjects
print(aggregate(data$marks, list(data$subjects), FUN = mean))
Output (grouped means): java 84.67 (254/3, rounded), php 78.5, python 89
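
aggregate() also accepts a formula, which keeps the original column names in the result instead of the generic Group.1 and x. The sketch below shows this alternative on the same marks data; the data frame name marks_data is chosen for the example.

```r
# Same subjects and marks as in the examples above
marks_data <- data.frame(
  subjects = c("java", "python", "java", "java", "php", "php"),
  marks    = c(89, 89, 76, 89, 90, 67)
)

# Formula interface: aggregate marks, grouped by subjects
mean_by_subject <- aggregate(marks ~ subjects, data = marks_data, FUN = mean)
print(mean_by_subject)

sum_by_subject <- aggregate(marks ~ subjects, data = marks_data, FUN = sum)
print(sum_by_subject)
```

The result columns are named subjects and marks, so they can be used directly in later steps such as merging or plotting.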
PROGRAM-8
AIM:- Write an R script to handle missing values in R.
Handling missing values is crucial in data preprocessing before performing any analysis in R.
Here's an R script that demonstrates various methods to handle missing values (NAs) in a
dataset:
Sample R Script: Handling Missing Values
# Sample data frame with missing values (NA)
data <- data.frame(
  Name = c("John", "Alice", "Sam", NA, "Kate"),
  Age = c(28, NA, 34, 25, NA),
  Salary = c(50000, 60000, NA, 45000, 52000),
  stringsAsFactors = FALSE
)

# Display the original data
print("Original Data:")
print(data)

# 1. Identify Missing Values
print("Identifying Missing Values (TRUE if missing):")
print(is.na(data)) # Returns TRUE for missing values

# 2. Count Missing Values in Each Column
print("Count of Missing Values in Each Column:")
print(colSums(is.na(data))) # Sum of TRUE values (i.e., NAs) for each column

# 3. Removing Rows with Missing Values (Complete Cases)
print("Data with Rows Containing NAs Removed:")
data_clean <- na.omit(data) # Removes rows where any NA is present
print(data_clean)

# 4. Replace Missing Values with a Specific Value (e.g., Mean, Median, etc.)
# Work on a copy so that the original NAs stay available for the later steps
data_filled <- data

# Replace missing Age with the mean of non-missing values
mean_age <- mean(data$Age, na.rm = TRUE) # Calculate mean, excluding NAs
data_filled$Age[is.na(data_filled$Age)] <- mean_age # Replace NA with mean
print("Data After Replacing Missing Age with Mean:")
print(data_filled)

# Replace missing Salary with the median of non-missing values
median_salary <- median(data$Salary, na.rm = TRUE)
data_filled$Salary[is.na(data_filled$Salary)] <- median_salary
print("Data After Replacing Missing Salary with Median:")
print(data_filled)

# 5. Fill Missing Values Using Linear Interpolation (For Numeric Data)
# Using the zoo package for interpolation
# install.packages("zoo") # Uncomment to install the package if needed
library(zoo)
data_interp <- data
# na.rm = FALSE keeps the vector length; the trailing NA has no later
# neighbour to interpolate toward, so it remains NA
data_interp$Age <- na.approx(data$Age, na.rm = FALSE)
print("Data After Linear Interpolation for Age:")
print(data_interp)

# 6. Filter Rows with Missing Values in Specific Columns
print("Rows where 'Age' is not missing:")
data_age_present <- data[!is.na(data$Age), ] # Keep rows where 'Age' is not missing
print(data_age_present)

# 7. Imputation using Mean/Median (Multiple Columns)
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}

# Apply imputation to all numeric columns
data_imputed <- data
data_imputed$Age <- impute_mean(data$Age)
data_imputed$Salary <- impute_mean(data$Salary)
print("Data After Imputation of Missing Values:")
print(data_imputed)
Explanation of the Script:
1. Identify Missing Values:
o is.na() is used to check for missing values in the dataset.
2. Count Missing Values:
o colSums(is.na(data)) gives the count of missing values for each column.
3. Removing Rows with Missing Values:
o na.omit() removes rows containing any NA values.
4. Replacing Missing Values:
o You can replace missing values with specific values (e.g., mean or median)
using mean() and median() functions.
5. Linear Interpolation:
o With the help of the zoo package's na.approx(), missing values in numeric data
can be interpolated linearly.
6. Filtering Rows Based on Missing Values in Specific Columns:
o Rows with missing values in specific columns can be filtered out using logical
conditions.
7. Imputation for Multiple Columns:
o A custom function impute_mean() replaces NA values with the mean of the
column. It can be applied to multiple columns.
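
When a data frame has many numeric columns, listing each one by hand becomes tedious. The sketch below is one possible way to apply a mean-imputation function to every numeric column at once; the data frame name staff and the use of vapply() and lapply() are choices made for this example, not part of the script above.

```r
# Mean imputation for a single numeric vector
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Example data with the same values as the script above
staff <- data.frame(
  Name   = c("John", "Alice", "Sam", NA, "Kate"),
  Age    = c(28, NA, 34, 25, NA),
  Salary = c(50000, 60000, NA, 45000, 52000),
  stringsAsFactors = FALSE
)

# Find the numeric columns
numeric_cols <- vapply(staff, is.numeric, logical(1))
print(numeric_cols)

# Impute each numeric column; non-numeric columns are left untouched
staff_imputed <- staff
staff_imputed[numeric_cols] <- lapply(staff[numeric_cols], impute_mean)
print(staff_imputed)

# Confirm that no NA remains in the numeric columns
print(colSums(is.na(staff_imputed)))
```

The missing Name is deliberately left as NA, since a mean has no meaning for text; such rows need a separate decision (for example removal with na.omit()).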
PROGRAM-9
AIM:- Write an R script to handle outliers.
Outliers are data points that significantly differ from other observations in a dataset. Handling
outliers is an essential step in data preprocessing. Here's an R script that demonstrates how to
detect and handle outliers using several common techniques:
Sample R Script: Handling Outliers
# Sample data with potential outliers
set.seed(123) # Not strictly needed: the values below are fixed, not random
data <- data.frame(
  ID = 1:20,
  # Age has outliers
  Age = c(25, 28, 22, 27, 35, 30, 24, 29, 100, 26, 23, 27, 28, 31, 26, 25, 200, 29, 28, 26),
  # Salary has outliers
  Salary = c(30000, 32000, 29000, 33000, 1000000, 31000, 29500, 30500, 29500, 31500,
             30000, 32000, 28000, 33000, 31000, 100000, 32000, 30000, 31000, 30000)
)

# Display original data
print("Original Data:")
print(data)

# 1. Identifying Outliers Using the IQR (Interquartile Range) Method
# For numeric columns (Age and Salary)
outliers_iqr <- function(x) {
  Q1 <- quantile(x, 0.25) # First quartile (25th percentile)
  Q3 <- quantile(x, 0.75) # Third quartile (75th percentile)
  iqr <- Q3 - Q1 # Interquartile range (named iqr to avoid masking the IQR() function)

  lower_bound <- Q1 - 1.5 * iqr # Lower bound for outliers
  upper_bound <- Q3 + 1.5 * iqr # Upper bound for outliers
  return(x < lower_bound | x > upper_bound) # TRUE if outlier
}

# Apply the IQR method to detect outliers in the 'Age' column
data$outlier_age <- outliers_iqr(data$Age)
print("Identified Outliers in Age (TRUE indicates an outlier):")
print(data$outlier_age)

# Apply the IQR method to detect outliers in the 'Salary' column
data$outlier_salary <- outliers_iqr(data$Salary)
print("Identified Outliers in Salary (TRUE indicates an outlier):")
print(data$outlier_salary)

# 2. Visualizing Outliers Using Boxplots
# Boxplot for Age
boxplot(data$Age, main = "Boxplot of Age", ylab = "Age", col = "lightblue")

# Boxplot for Salary
boxplot(data$Salary, main = "Boxplot of Salary", ylab = "Salary", col = "lightgreen")

# 3. Handling Outliers: Removal
# Keep only the rows that are outliers in neither Age nor Salary
data_clean <- data[!data$outlier_age & !data$outlier_salary, ]
print("Data After Removing Outliers:")
print(data_clean)

# 4. Handling Outliers: Capping (Winsorization)
# Capping replaces extreme values with lower/upper bounds
cap_outliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  iqr <- Q3 - Q1

  lower_bound <- Q1 - 1.5 * iqr
  upper_bound <- Q3 + 1.5 * iqr

  x[x < lower_bound] <- lower_bound # Cap lower outliers
  x[x > upper_bound] <- upper_bound # Cap upper outliers
  return(x)
}

# Apply capping to the Age and Salary columns
data$Age_capped <- cap_outliers(data$Age)
data$Salary_capped <- cap_outliers(data$Salary)
print("Data After Capping Outliers:")
print(data)

# 5. Handling Outliers: Replacing with Mean/Median
# Replace outliers with the median of the column
replace_with_median <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  iqr <- Q3 - Q1

  lower_bound <- Q1 - 1.5 * iqr
  upper_bound <- Q3 + 1.5 * iqr

  # Replace outliers with the median
  x[x < lower_bound | x > upper_bound] <- median(x, na.rm = TRUE)
  return(x)
}

# Replace outliers in Age and Salary with median
data$Age_replaced <- replace_with_median(data$Age)
data$Salary_replaced <- replace_with_median(data$Salary)
print("Data After Replacing Outliers with Median:")
print(data)
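
To make the IQR rule used in the script concrete, the sketch below works through the bounds for the Age values step by step, using quantile() with its default settings. The variable names are chosen for this example.

```r
# Age values from the sample data above
age <- c(25, 28, 22, 27, 35, 30, 24, 29, 100, 26,
         23, 27, 28, 31, 26, 25, 200, 29, 28, 26)

# Quartiles (unname() drops the "25%" / "75%" labels)
q1 <- unname(quantile(age, 0.25))  # 25.75
q3 <- unname(quantile(age, 0.75))  # 29.25
iqr_age <- q3 - q1                 # 3.5

# Bounds of the 1.5 * IQR rule
lower_bound <- q1 - 1.5 * iqr_age  # 20.5
upper_bound <- q3 + 1.5 * iqr_age  # 34.5
print(c(lower = lower_bound, upper = upper_bound))

# Values outside the bounds are flagged as outliers
flagged <- age[age < lower_bound | age > upper_bound]
print(flagged)  # 35 100 200
```

Besides the obvious values 100 and 200, the rule also flags 35, which is a plausible age. Flagged values are therefore best reviewed before being removed, capped, or replaced.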
