AI Report


Building and evaluating data analysis and Artificial Intelligence pipelines

3/30/2024

Module Title: Artificial Intelligence Fundamentals


Module Code: CMP7247
Coordinator Name: Debashish Das
Student Name: Sauravpreet Singh
Student Number: 23171175
Contents
1. Abstract:
2. Introduction:
3. Background, Aims and Objectives:
4. Dataset Description:
5. Problem to be Addressed:
6. Artificial Intelligence models:
6.1. Summary of the approach
6.2. Data Preprocessing, visualisation, feature selection:
7. Model training, Evaluation and Testing:
8. Results and discussion:
9. Results Comparison across models:
10. Analysis of the AI project life cycle compliance with the AI ethics:
11. Conclusion:
12. Recommendations and Future Work:
13. References:

Dataset Link:
https://drive.google.com/file/d/1iECyRJ2zen67tBymhYbu5wE201rZkDg/view?usp=drivesdk

1. Abstract:
In this report, we investigate how machine learning can characterize the Adult dataset in terms of
the socioeconomic factors that most strongly determine income level. The Adult dataset, which is
derived from the 1994 U.S. Census database, contains a number of demographic and employment
features, including age, education, occupation, and hours worked per week. Our focus is on the
factors that best distinguish individuals whose income is greater than $50K annually. To this end,
we employed three popular classification models: Logistic Regression, Decision Tree, and Random
Forest. Every model went through the same machine learning pipeline: data preprocessing, feature
encoding, feature selection with SelectKBest, and training and validation. The performance of each
model was evaluated using accuracy, precision, recall, and F1-score, with Random Forest scoring
highest overall, which we attribute to its ensemble approach and the way it balances bias and
variance. Feature importance analysis showed that education level, age, hours worked per week, and
specific occupations contribute most to higher income. Box plots, scatter plots, and stacked bar
charts illustrate the relationships between these key features and the income classes and give
deeper insight into how the classes differ. The results show a strong relationship between
education, employment, and income, and suggest that policymakers should treat economic welfare as
a focus area and that individuals can use the findings for career planning. This research applies
machine learning to social and economic analysis and identifies areas for future research and
methodological improvement.

2. Introduction:
In this report, we examine a pressing socioeconomic question, namely what determines the living
standard of adults, by using machine learning techniques to predict income level and to identify
the key factors that influence it. The Adult dataset comes from the 1994 U.S. Census database and
is composed of several demographic and employment-related features, such as age, education level,
occupation, and hours worked per week. Our main task is to identify the determinants that
correlate most strongly with an individual's income exceeding $50,000 per year. To achieve this,
we employed three popular classification models: Logistic Regression (LR), Decision Tree (DT), and
Random Forest (RF). The models were trained and evaluated using the same machine learning
pipeline, which comprised data cleaning, feature encoding, feature selection with SelectKBest, and
model validation. Every model was evaluated on accuracy, precision, recall, and F1-score. Random
Forest was found to perform best, owing to its ensemble approach, which helps the model balance
bias and variance.
Feature importance analysis identified education level, age, hours worked per week, and specific
occupational roles as the chief predictors of higher income. Box plots, scatter plots, and stacked
bar charts were used to explain the relationships between the key features and the income
categories. The charts also gave insight into the economic dynamics of the data.
This study demonstrates the role of education and employment in determining income and calls for
policies and career advisory services for the general public. The results not only demonstrate the
applicability of machine learning in social and economic research but also indicate possible
improvements and new research themes that should be pursued in the future.

3. Background, Aims and Objectives:


The Adult dataset, derived from 1994 census data released by the U.S. Census Bureau, plays an
important role in analysing the socioeconomic conditions of the U.S. working population. The
dataset comprises diverse attributes such as age, education, marital status, occupation and, most
importantly, income, and so serves as a strong base for investigating economic status and
participation in the labor market. It is meaningful because it shows how different factors combine
to produce varying income levels among population subgroups, and therefore gives a picture of
income inequality and the distribution of economic opportunities.
In this project, we use machine learning to predict income levels, focusing on two income
categories: earners above $50,000 per year and those at or below this threshold. The interplay of
employment and demographic factors in determining income potential forms the basis of our study.
Identifying the key determinants of income is not just an academic exercise; it also gives
policymakers, researchers, and financial planners evidence with which to formulate policies,
target interventions, and support personal financial planning.
Our objectives are twofold: first, to identify the degree to which different features predict
income level and, second, to extract usable information that can help shape economic policies and
personal decisions. Our analysis therefore combines theoretical research with practical
application, and has the potential to inform the discourse on economic mobility and equality and,
ultimately, the allocation of resources and opportunities. This project demonstrates how machine
learning can turn socioeconomic research into a dynamic tool for decision making.

4. Dataset Description:
The Adult dataset, one of the best-known datasets in machine learning and socioeconomic research,
derives from the 1994 U.S. Census database. It includes 32,561 entries, each representing an
individual described by a set of attributes such as age and occupation. The dataset is structured
into 15 key columns, as outlined in the table below:
Column Name      Description                                    Data Type
Age              Age of the individual                          Integer
WorkClass        Employment sector                              Categorical
Fnlwgt           Final sampling weight                          Integer
Education        Highest level of education                     Categorical
EducationNum     Numerical representation of education level    Integer
MaritalStatus    Marital status                                 Categorical
Occupation       Type of occupation                             Categorical
Relationship     Family relationship                            Categorical
Race             Race of the individual                         Categorical
Sex              Gender of the individual                       Categorical
CapitalGain      Capital gains recorded                         Integer
CapitalLoss      Capital losses recorded                        Integer
HoursPerWeek     Hours worked per week                          Integer
NativeCountry    Native country of the individual               Categorical
Income           Income level (<=50K, >50K)                     Categorical

The demographic attributes cover age, race, sex, and native country, and so reflect the diversity
of the people surveyed. Labor-market attributes include employment sector, occupation, and hours
worked per week, which indicate the extent of labor market participation. Educational attributes
record the highest level of education achieved and its numerical representation, which lets us
examine how education level relates to economic outcomes.
The dataset's focal point, income level, is categorized into two groups: lower-income individuals
(at or below $50,000) and higher-income individuals (above $50,000). This classification is the
target for the predictive models in the modeling part of our analysis.
The breadth of the Adult dataset makes it an ideal source for investigating the connections
between educational attainment, employment, income level, and demographic attributes. Using
machine learning models, our analysis aims to predict the income class of these individuals from
their attributes, and so to bring to light the factors that matter most for economic success and
stability.
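As a rough illustration of how the columns above might be read into a DataFrame, the sketch below
uses pandas. The file layout (comma-separated, no header row), the helper name load_adult, the
extra HighIncome column, and the two sample rows are our assumptions for illustration; the linked
file may be laid out differently.

```python
import io

import pandas as pd

# Column names follow the table above; the file layout (comma-separated,
# no header row) is an assumption and may differ from the linked file.
COLUMNS = [
    "Age", "WorkClass", "Fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Sex",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income",
]


def load_adult(source):
    """Read the Adult data from a path or file-like object into a DataFrame."""
    df = pd.read_csv(source, header=None, names=COLUMNS, skipinitialspace=True)
    # Binary target: 1 for ">50K", 0 for "<=50K".
    df["HighIncome"] = (df["Income"].str.strip() == ">50K").astype(int)
    return df


# Two made-up rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "39, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n"
    "50, Private, 120000, Masters, 14, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 50, United-States, >50K\n"
)
df = load_adult(sample)
print(df[["Age", "Education", "Income", "HighIncome"]])
```

Passing a file path instead of the in-memory sample would load the full dataset, provided the
layout assumption holds.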

5. Problem to be Addressed:
The essence of this study is predicting the economic status of individuals from demographic and
employment information, which can help identify low-income groups and guide appropriate measures.
Income prediction through machine learning and data analytics has become increasingly popular, and
many studies apply such techniques to datasets like the Adult dataset. The dataset is rich, which
gives us an opportunity to analyse in detail the factors that determine income, and makes it well
suited to understanding patterns in economic outcomes.
The field has seen significant progress, for example in the work of Gomez-Cravioto et al. (2022),
where supervised machine learning is employed to predict alumni income. Their results demonstrate
how machine learning can handle complex data structures to identify the factors that affect
income, and provide a precedent for our research. Moreover, the studies of Sharath et al. (2016)
and Srinivasa et al. (2018) demonstrated the significance of demographic and employment data for
mapping economic strata and estimating individual income. These studies, among others, show that
data analytics is a powerful tool for this task and support the methods used in the present
research.
Chakrabarty and Biswas (2018) present a statistical approach to income prediction on the Adult
dataset. They demonstrate the predictive power of the models and also indicate that feature
selection is key to better model accuracy. This motivates the combined use of machine learning and
statistical techniques to predict income levels accurately. Building on these studies, our
research project examines the factors driving income differences, with the aim of contributing to
the debate around economic inequality and to a more level playing field for different demographic
groups.

6. Artificial Intelligence models:


6.1. Summary of the approach
6.1.1. Logistic Regression:
Logistic Regression is one of the most popular and effective statistical techniques for binary
classification, and is well suited to predicting categorical outcomes. Its usefulness for
forecasting income levels lies in its simplicity and its effectiveness in modelling linear
relationships between the predictor variables and the log-odds of the response. Gomez-Cravioto
et al. (2022) point out the appropriateness of logistic regression in predictive analytics,
noting that the technique is important in socioeconomic projects where both prediction accuracy
and model transparency are vital. The model estimates probabilities via the logistic function,
which gives a straightforward way of assigning individuals to income brackets based on their
demographic and employment attributes.
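The probability estimate described above can be sketched in a few lines. This is a minimal
illustration, not the fitted model from this project: the function names, the two standardized
features, the weights, the bias, and the 0.5 threshold are all hypothetical.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_high_income(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one individual from a linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    prob = logistic(score)
    return prob, int(prob >= threshold)


# Hypothetical weights for two standardized features; not fitted values.
weights = [0.8, 0.5]
bias = -1.0
prob, label = predict_high_income([1.5, 1.0], weights, bias)
print(round(prob, 3), label)
```

In practice the weights and bias are learned from the training data, for example by
LogisticRegression in scikit-learn.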
6.1.2. Decision Tree Classifier:
A Decision Tree is one of the simplest classification methods and uses a tree structure as its
model. It repeatedly splits the dataset into subsets based on feature values, and can therefore
capture non-linear relationships and interactions among variables. Chakrabarty and Biswas (2018)
show the value of decision trees for predicting income, since they can model a wide range of
complex decision-making processes. The model's clarity, represented by the flow of decisions from
root to leaf, makes it easy to understand how individual attributes affect the level of income.
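To make the idea of splitting on feature values concrete, the sketch below scores one candidate
split with Gini impurity, which is the default criterion of scikit-learn's
DecisionTreeClassifier. The report does not state which criterion was used, so Gini is an
assumption here, and the function names and toy rows are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, feature, threshold):
    """Weighted Gini impurity after splitting rows on feature <= threshold."""
    left = [r["income"] for r in rows if r[feature] <= threshold]
    right = [r["income"] for r in rows if r[feature] > threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up toy rows, for illustration only.
rows = [
    {"education_num": 9, "income": "<=50K"},
    {"education_num": 10, "income": "<=50K"},
    {"education_num": 13, "income": ">50K"},
    {"education_num": 14, "income": ">50K"},
]
print(gini([r["income"] for r in rows]))
print(split_impurity(rows, "education_num", 10))
```

A tree learner evaluates many such candidate splits and keeps the one with the lowest weighted
impurity, then repeats the process within each resulting subset.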
6.1.3. Random Forest Classifier:
The Random Forest classifier extends decision trees by combining several trees to raise the
reliability of predictions and reduce overfitting. It operates by building a collection of
decision trees at training time and predicting the mode of the classes predicted by the
individual trees. This ensemble method contributes to the model's robustness and effectiveness;
Srinivasa et al. (2018) have demonstrated that Random Forest achieves excellent accuracy in
income classification. A major advantage of this model is that it can pick up sophisticated
patterns in the data without requiring extensive preprocessing or feature engineering.
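The voting step can be illustrated with a small sketch. The function name and the list of votes
are hypothetical, and the example covers only how individual tree outputs are combined, not how
the trees are grown.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the individual tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up votes from five trees for one individual.
votes = [">50K", "<=50K", ">50K", ">50K", "<=50K"]
print(majority_vote(votes))
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class
probabilities rather than counting hard votes, so this sketch shows the textbook form of the idea.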
6.2. Data Preprocessing, visualisation, feature selection:
6.2.1. Data Preprocessing: This step cleans the dataset by handling missing values and
encoding categorical variables into numerical format so that the data is ready for
machine learning algorithms. For example, missing values in the categorical columns
for work class and occupation appear as 'Unknown'; these entries were kept as their
own category to preserve data integrity without biasing the results. Categorical
variables were then encoded using OneHotEncoder from sklearn.preprocessing, which
represents them in a format that machine learning models can read
(OneHotEncoder(sparse=False, drop='first'); in scikit-learn 1.2 and later the
sparse argument is named sparse_output).

6.2.2. Visualization: Data exploration used graphical representations to understand
the distribution of the variables and their associations in the dataset. The
Seaborn and Matplotlib libraries made it easier to create informative figures.
For example, we used sns.boxplot for the income-by-education-level plot and a
scatter plot for age versus hours per week by income class.
• Income by Education Level (Box Plot): This diagram depicts the
income distribution with respect to education level. It indicates that
post-secondary education is a strong predictor of income, with income
tending to rise with the level of degree attained.

• Age vs Hours per Week by Income Class (Scatter Plot): The scatter
plot illustrates the association between age, hours worked per week,
and income class. It shows that higher income is more common among
older individuals and those working more hours, which suggests that
experience and work commitment are important factors in attaining
higher financial status.

• Proportion of Income Classes by Education Level (Stacked Bar
Chart): The stacked bar chart shows the proportion of people in each
income class for the different education levels. It shows a strong
connection: the more education individuals have, the higher the
proportion who make over $50,000 a year, which indicates that
education is a key factor in economic progress.
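The proportions behind a stacked bar chart of this kind can be computed with a normalized
cross-tabulation. The sketch below uses made-up rows and only computes the table; it is one
plausible way to prepare the data, not necessarily how the original figure was produced.

```python
import pandas as pd

# Made-up toy rows, for illustration only.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "HS-grad", "HS-grad",
                  "Bachelors", "Bachelors", "Masters", "Masters"],
    "Income": ["<=50K", "<=50K", "<=50K", ">50K",
               "<=50K", ">50K", ">50K", ">50K"],
})

# Share of each income class within every education level (rows sum to 1).
proportions = pd.crosstab(df["Education"], df["Income"], normalize="index")
print(proportions)
```

With Matplotlib installed, proportions.plot(kind="bar", stacked=True) would draw the
corresponding stacked bars.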

6.2.3. Feature Selection: The SelectKBest method from sklearn.feature_selection was
used to select the features that contribute most to predicting income levels.
This method applies a statistical test to each feature and keeps those with the
highest scores (SelectKBest(score_func=f_classif, k=20)), which supports better
model performance by considering only the most relevant predictors.
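The selection call quoted above can be tried on synthetic data as follows. The data here is
generated with make_classification purely as a stand-in for the encoded Adult features; the
sample size, feature count, and random seed are our assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the encoded Adult features (not the real data).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)

# Keep the 20 features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)  # indices of the retained columns
print(X_selected.shape, kept)
```

To avoid leaking information from the test set, the selector would normally be fitted on the
training split only and then applied to the test split with transform.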

7. Model training, Evaluation and Testing:
Model training, evaluation, and testing are the major phases of the machine learning process, in
which predictive models are built and assessed. In this project, we trained three models,
Logistic Regression, Decision Tree, and Random Forest, using their scikit-learn implementations
(LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier()). Each model was
trained on the feature subset chosen by the feature selection step (SelectKBest), that is, the
features most strongly associated with income level.
Accuracy, precision, recall, and F1-score were used together for model evaluation and testing.
These metrics measured the performance of the models on a separate test set. The scikit-learn
functions accuracy_score, precision_score, recall_score, and f1_score allowed quantitative
comparison of the models' predictive abilities. In addition, confusion matrices and
classification reports gave a detailed account of how well each model classified the dataset
into the two income categories.
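The training and evaluation loop described above might look like the sketch below. It runs on
synthetic data rather than the Adult dataset, and the 80/20 split, the random seeds, and
max_iter=1000 are our assumptions; the report does not state these settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected Adult features (not the real data).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)  # predictions on the held-out test set
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

for name, scores in results.items():
    summary = {k: round(v, 3) for k, v in scores.items() if k != "confusion_matrix"}
    print(name, summary)
```

The scores printed by this sketch come from synthetic data and are not comparable with the
results reported for the Adult dataset.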

8. Results and discussion:


In the analysis of the Adult dataset for predicting individual income levels, we trained and
evaluated three models: Logistic Regression, Decision Tree, and Random Forest. Accuracy,
precision, recall, and F1-score were computed for each model to give a full picture of its
performance.
Model                  Accuracy    Precision    Recall    F1-Score
Logistic Regression    84.98%      71.91%       58.27%    64.46%
Decision Tree          82.45%      62.88%       60.34%    61.59%
Random Forest          85.26%      73.05%       59.17%    65.49%

Confusion Matrix:

Discussion:
The evaluation of the Logistic Regression, Decision Tree, and Random Forest models on the Adult
dataset gives insight into their predictive power and highlights both where they perform well
and where they are limited. The Random Forest model performed best overall, reaching the highest
accuracy (85.26%), precision (73.05%), and F1-score (65.49%); only on recall was it slightly
behind the Decision Tree (59.17% against 60.34%). This is largely a result of the ensemble
method, which pools the predictions of many decision trees and so lets the model handle complex
data more effectively while limiting overfitting. The method is most applicable when the outcome
is influenced by complex interactions between features.
Logistic Regression, though slightly less accurate than Random Forest, still yields strong
results. Its main assets are simplicity and interpretability, which matter most in contexts
where understanding the relationship between the variables and the outcome is more important
than predictive accuracy alone. Its precision (71.91%) is also close to that of Random Forest,
which makes it useful in contexts where false positives must be kept low.
The Decision Tree model, even though it lagged behind on accuracy, precision, and F1-score, has
the advantage that its operation is much simpler. Its decision process is easy to follow, which
makes it suitable for educational use and initial data exploration. Nevertheless, it can be
sensitive to small variations in the data, which affects its stability and ability to
generalize, reflecting the fact that ease of interpretation and robustness often conflict in
machine learning.

9. Results Comparison across models:
In our evaluation, we compared the three predictive models, Logistic Regression, Decision Tree,
and Random Forest, using an accuracy comparison plot. The Logistic Regression model achieved the
second-highest accuracy, 84.98%, which shows that it captures the main patterns within the
dataset reliably. Together with its balance between simplicity and predictive performance, this
makes it well suited to scenarios that require both accuracy and interpretability.
The Decision Tree model had the lowest accuracy of the three, 82.45%. Its lower accuracy could
be due to its reliance on a single tree, which can overfit or underfit depending on the nature
of the data. This suggests that it is less effective at capturing complicated relationships
between the features in the dataset.
Random Forest, an ensemble approach made up of several decision trees, was the best-performing
model with an accuracy of 85.26%, slightly higher than Logistic Regression. The main merit of
this model is that it reduces overfitting, which improves the ability of the results to
generalize to unseen data. Although the method is already accurate, tuning how features are
selected at each split could further improve its performance in future work.

10. Analysis of the AI project life cycle compliance with the AI ethics:
The use of AI, such as classification models that predict income levels from the Adult dataset,
requires AI ethics to be considered at every stage of the project. Ethical compliance is
assessed against the principles of fairness, accountability, transparency, and privacy.
• Fairness: The project aims to reduce bias by using a diverse and representative dataset,
handling missing values without skewing the data, and choosing features on the basis of
statistical significance and relevance rather than assumptions. Although these steps help
to counter skews inherited from the census data, the models could still perpetuate
inequality if no further steps are taken. Regular monitoring for biased results,
particularly against vulnerable communities, is therefore of critical importance.
• Accountability: The use of simple models like Logistic Regression and Decision Trees is
an attempt to keep decisions traceable. Such models provide an audit trail of the
decision-making process, so that outcomes can be traced back to the model's logic. The
Random Forest model, though less transparent, can draw on interpretability aids such as
feature importance scores.
• Transparency: The project provides transparency by documenting the data sources, the
preprocessing steps, the choice of models, and the evaluation metrics. Transparency also
comes from the use of ROC curves and confusion matrices, which help show the models'
performance.
• Privacy: The dataset used for the project is publicly available and anonymized, and the
project does not attempt to re-identify individuals. Compliance with privacy guidelines
and regulations such as GDPR is assumed, with emphasis on the security and
confidentiality of data throughout the whole project.
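One simple way to carry out the monitoring mentioned under Fairness is to compare the share of
positive predictions across groups. The sketch below is an illustration only: the function name,
the choice of Sex as the grouping column, and the predictions are all made up, and a gap in
prediction rates is a prompt for further investigation rather than proof of unfairness.

```python
import pandas as pd


def positive_rate_by_group(df, group_col, pred_col):
    """Share of positive (>50K) predictions within each group."""
    return df.groupby(group_col)[pred_col].mean()


# Made-up predictions, for illustration only (1 = predicted >50K).
preds = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "Predicted": [1, 1, 0, 0, 1, 0, 0, 0],
})
rates = positive_rate_by_group(preds, "Sex", "Predicted")
gap = rates.max() - rates.min()
print(rates)
print("gap:", gap)
```

The same check can be repeated for other attributes, such as Race or NativeCountry, and rerun
whenever the model or the data changes.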

11. Conclusion:
The application of machine learning models to the Adult dataset to reveal the key predictors of
higher income levels has been accomplished. Among the models tested, Random Forest achieved the
highest accuracy (85.26%), while Logistic Regression came close (84.98%) and offers superior
interpretability, making it especially useful in identifying education level, occupation, and
working hours as consistent structural factors, in line with theory.
Alongside this, we have addressed AI ethical concerns, including data bias and privacy
safeguards, in order to create trustworthy models. Incorporating a range of socioeconomic
factors in this type of analysis helps to make its results more credible and useful, leading the
way to better informed policies and interventions. The comparative visualizations, specifically
the accuracy comparison plot, were informative and made the models' performance easier to
appreciate, showing the value of visualization techniques in data science.
In sum, this study met its aim of predicting income levels, with a best accuracy of 85.26%, and
also laid a basis for responsible AI use in socioeconomic research. It underlines the need for
rigorous model selection, careful data handling, and adherence to ethical standards, and
provides groundwork for more refined research.

12. Recommendations and Future Work:


The report advocates further exploration of AI models to support better policies in the
socioeconomic sector. Stakeholders should consider Logistic Regression for its interpretability
and Random Forest for its predictive robustness. Continuous ethical audits are advisable to
ensure responsible use of AI, with the primary focus on transparency and fairness. Additionally,
data scientists should seek out diverse datasets to help prevent bias, and use advanced feature
selection techniques to tune the models. Organizations can use these insights for strategic
planning and for developing interventions that advance educational opportunity and job training,
which may eventually raise income levels across the country and across different demographic
groups.
It is essential to continue research, in particular into hybrid models that may attain higher
precision while coping with the complexity of socioeconomic data. The effects of new economic
factors on employment, including the gig economy, and longitudinal data that expose economic
trends should not be ignored. Building models that are easy to understand would increase their
adoption in policy making. Similarly, strong tools for identifying and correcting bias are
critical so that AI technologies promote equity and fairness in socioeconomic studies.

13. References:
Gomez-Cravioto, D.A., Diaz-Ramos, R.E., Hernandez-Gress, N., Preciado, J.L. and Ceballos,
H.G., 2022. Supervised machine learning predictive analytics for alumni income. Journal of Big
Data, 9(1), p.11.

Sharath, R., Nirupam, K.N., Sowmya, B.J. and Srinivasa, K.G., 2016. Data analytics to predict
the income and economic hierarchy on census data. In: 2016 International Conference on
Computation System and Information Technology for Sustainable Solutions (CSITSS). IEEE, pp.
249-254.
Srinivasa, K.G., Sharath, R., Chaitanya, S.K., Nirupam, K.N. and Sowmya, B.J., 2018. Data
analytics on census data to predict the income and economic hierarchy. International Journal of
Data Analysis Techniques and Strategies, 10(3), pp.223-240.
Chakrabarty, N. and Biswas, S., 2018. A statistical approach to adult census income level
prediction. In: 2018 International Conference on Advances in Computing, Communication
Control and Networking (ICACCCN). IEEE, pp. 207-212.
