AI Report
3/30/2024
Dataset Link:
https://drive.google.com/file/d/1iECyRJ2zen67tBymhYbu5wE201rZkDg/view?usp=drivesdk
1. Abstract:
In this document, we investigate how machine learning can characterize the Adult dataset in
terms of the socioeconomic factors that predominantly determine income level. The Adult data,
which stem from the 1994 U.S. Census database, encompass a number of demographic and
employment features, including age, education, occupation, and hours worked per week. Our
focus is on the factors that most strongly determine whether a person's income is greater than
$50K annually. To achieve this, we employed three popular classification models: Logistic
Regression, Decision Tree, and Random Forest. Every model was subjected to a standardized
machine learning process, which involved data preprocessing, feature encoding, feature
selection with SelectKBest, and finally training and validation. The performance of each model
was evaluated on accuracy, precision, recall, and F1-score, with Random Forest achieving the
highest scores, which we attribute to its ensemble approach and the balance it strikes between
bias and variance. Feature importance analysis showed that education level, age, hours worked
per week, and specific job sectors were most strongly associated with higher income. Box plots,
scatter plots, and stacked bar charts illustrated the relationships between these key features
and the income classes and gave deeper insight into the dynamics between them. The results
point to a strong relation between education, employment, and income, and suggest that
policymakers focus on education and career planning support to improve the economic welfare of
the population. This research not only applies machine learning to social and economic analysis
but also identifies areas for future research and methodological improvement.
2. Introduction:
In this report, we use machine learning techniques to identify the key socioeconomic factors
influencing the income levels of adults. The Adult dataset comes from the 1994 U.S. Census
database and comprises several demographic and employment-related features, for instance age,
education level, occupation, and hours worked per week. Our main task is to identify the
determinants that correlate most strongly with an individual's income exceeding $50,000 per
year. To achieve this, we employed three popular classification models: Logistic Regression
(LR), Decision Tree (DT), and Random Forest (RF). The models were trained and evaluated using
the same machine learning pipeline, which comprised data cleaning, feature encoding, feature
selection with SelectKBest, and model validation. Every model was evaluated on accuracy,
precision, recall, and F1-score. Random Forest was found to be the best, which we attribute to
its ensemble approach, which helps the model balance bias and variance.
Feature importance analysis identified education level, age, hours per week, and specific
occupational roles as the chief predictors of higher income. Box plots, scatter plots, and
stacked bar charts were used to illustrate the relationships between the key features and the
income categories. The charts also gave insight into the economic dynamics of the data.
This study demonstrates the role of education and employment in determining income and supports
policies and career advisory services for the general public. The results not only demonstrate
the applicability of machine learning in social and economic research but also indicate possible
improvements and new research themes to pursue in the future.
4. Dataset Description:
The Adult dataset, one of the best known in machine learning and socioeconomic research,
derives from the 1994 U.S. Census database. It includes 32,561 entries, each representing an
individual described by a set of attributes such as age and occupation. The dataset is
structured into 15 key columns, as outlined in the table below:
Column Name      Description                                    Data Type
Age              Age of the individual                          Integer
WorkClass        Employment sector                              Categorical
Fnlwgt           Final sampling weight                          Integer
Education        Highest level of education                     Categorical
EducationNum     Numerical representation of education level    Integer
MaritalStatus    Marital status                                 Categorical
Occupation       Type of occupation                             Categorical
Relationship     Family relationship                            Categorical
Race             Race of the individual                         Categorical
Sex              Gender of the individual                       Categorical
CapitalGain      Capital gains recorded                         Integer
CapitalLoss      Capital losses recorded                        Integer
HoursPerWeek     Hours worked per week                          Integer
NativeCountry    Native country of the individual               Categorical
Income           Income level (<=50K, >50K)                     Categorical
The dataset's demographic attributes cover age, race, sex, and native country, showing the
diversity of the people surveyed. Labor-market attributes include employment sector, occupation,
and hours worked per week, providing information about the extent of labor market
participation. Educational attributes give the highest level of education achieved and its
numerical representation, allowing us to examine how education level relates to economic
outcomes. The dataset's focal point, the income level, is categorized into two groups:
lower-income individuals (at most $50,000 per year) and higher-income individuals (more than
$50,000 per year). This classification is the basis for our predictive models in the modeling
part of our analysis.
The extensive nature of the Adult dataset makes it an ideal source for investigating the
connections between educational attainment, employment, income levels, and demographic
attributes. Using machine learning models, our analysis aims to predict the income class of
individuals from these attributes. In doing so, we aim to bring to light the factors that matter
most for economic success and stability.
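As a minimal sketch of how the 15 columns and the two income groups described above might be represented in pandas: the column names follow the table, but the three rows are invented purely for illustration and are not records from the dataset, and the HighIncome column name is our own assumption.

```python
import pandas as pd

# Column names as listed in the table above.
COLUMNS = [
    "Age", "WorkClass", "Fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Sex",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income",
]

# Columns the table marks as Categorical; the rest are Integer.
CATEGORICAL = [
    "WorkClass", "Education", "MaritalStatus", "Occupation",
    "Relationship", "Race", "Sex", "NativeCountry", "Income",
]
NUMERIC = [c for c in COLUMNS if c not in CATEGORICAL]

# Hypothetical rows for illustration only (not taken from the real data).
rows = [
    [35, "Private", 120000, "Bachelors", 13, "Married-civ-spouse",
     "Exec-managerial", "Husband", "White", "Male", 0, 0, 45,
     "United-States", ">50K"],
    [22, "Private", 200000, "HS-grad", 9, "Never-married",
     "Sales", "Own-child", "Black", "Female", 0, 0, 30,
     "United-States", "<=50K"],
    [51, "Self-emp-not-inc", 90000, "Masters", 14, "Divorced",
     "Prof-specialty", "Not-in-family", "Asian-Pac-Islander", "Female",
     5000, 0, 50, "India", ">50K"],
]
df = pd.DataFrame(rows, columns=COLUMNS)

# Binary target: 1 if income is above $50K, otherwise 0.
df["HighIncome"] = (df["Income"].str.strip() == ">50K").astype(int)

print(df[["Age", "Education", "HoursPerWeek", "Income", "HighIncome"]])
```

Encoding the target as 0/1 up front keeps the later modeling steps independent of the exact string labels used in the Income column.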
5. Problem to be Addressed:
The essence of this study is the prediction of an individual's economic status from
demographic and employment information, which can help identify low-income groups and guide
appropriate measures. Income prediction through machine learning and data analytics has seen
rising popularity, with a number of studies applying such techniques to datasets like the Adult
dataset. The richness of this dataset allows a detailed analysis of the factors determining
income, which makes it well suited to understanding patterns of economic outcomes.
The field has seen significant progress, for example in the work of Gomez-Cravioto et al.
(2022), where supervised machine learning is employed to predict alumni income. Their results
demonstrate how machine learning can handle complex data structures to identify the factors
that affect income, which provides a useful precedent for our research. Moreover, the studies
of Sharath et al. (2016) and Srinivasa et al. (2018) have shown the significance of demographic
and employment data for mapping economic strata and estimating the income of individuals. These
studies, among others, demonstrate that data analytics is a powerful tool and support the
methods used in the present research.
Chakrabarty and Biswas (2018) present a notable example of statistical techniques applied to
the Adult dataset. They not only demonstrate the predictive power of the models but also
indicate that feature selection is key to better model accuracy. This motivates the use of
advanced machine learning and statistical techniques to predict income levels accurately.
Building on these studies, our research project explores the factors driving income inequality,
ultimately contributing to the debate on economic inequality and to a more level playing field
for different demographic groups.
the class of the individual trees. This ensemble method contributes to the model's robustness
and effectiveness; Srinivasa et al. (2018) have demonstrated that Random Forest achieves
excellent accuracy in income classification. A major advantage of this model is that it can
capture complex patterns in the data with comparatively little feature engineering.
6.2. Data Preprocessing, visualisation, feature selection:
6.2.1. Data Preprocessing: This step cleans the dataset by handling missing values and
encoding categorical variables in numerical format so that the data are ready for
machine learning algorithms. For missing values in categorical columns, the
occurrences of 'Unknown' in the work class and occupation columns were retained as
their own category, which preserves data integrity without biasing the results.
Categorical variables were then encoded using OneHotEncoder from
sklearn.preprocessing, which represents them in a format that machine learning
models can read (OneHotEncoder(sparse=False, drop='first')).
6.2.2. Visualization: Data exploration used graphical representations to understand the
distribution of the variables and their associations in the dataset. The Seaborn and
Matplotlib libraries made it easier to create informative figures. For example, we
used sns.boxplot for income distribution by education level and a scatter plot for
age versus hours per week by income class.
Income by Education Level (Box Plot): This diagram depicts the
income distribution with respect to education level. It highlights that
post-secondary education is a strong predictor of income: income
tends to rise with the level of degree attained.
Age vs Hours per Week by Income Class (Scatter Plot): The scatter
plot illustrates the association between age, hours worked per week,
and income class. It shows that higher age and longer working hours
are associated with higher income, which suggests that experience
and work commitment are important factors in attaining higher
financial status.
Proportion of Income Classes by Education Level (Stacked Bar
Chart): The stacked bar chart shows the proportion of people in each
income class at each education level. It shows a strong connection:
the more education individuals have, the higher the proportion who
make over $50,000 a year, which indicates that education is a key
factor in economic progress.
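The proportions behind a stacked bar chart of this kind can be computed with a row-normalized cross-tabulation. The following is a minimal sketch on hypothetical toy data, not the values plotted in the report.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "HS-grad", "HS-grad",
                  "Bachelors", "Bachelors", "Masters", "Masters"],
    "Income": ["<=50K", "<=50K", "<=50K", ">50K",
               "<=50K", ">50K", ">50K", ">50K"],
})

# Row-normalized counts: the shares within each education level sum to 1.
proportions = pd.crosstab(df["Education"], df["Income"], normalize="index")
print(proportions)
```

Calling proportions.plot(kind="bar", stacked=True), which requires Matplotlib, would then draw one bar per education level split by income class.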
6.2.3. Feature Selection: The SelectKBest method from sklearn.feature_selection was
used to select the top features that contribute most to predicting income levels.
This method applies a statistical test to score each feature and retains those with
the highest scores (SelectKBest(score_func=f_classif, k=20)), which allows for
better model performance by considering only the most relevant predictors.
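To illustrate the selection step, the sketch below applies SelectKBest with the ANOVA F-test (f_classif) and k=20, as quoted above. The synthetic matrix from make_classification is an assumption standing in for the encoded Adult features, so the selected indices carry no meaning for the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the encoded feature matrix (assumption).
X, y = make_classification(
    n_samples=500, n_features=40, n_informative=8, random_state=0
)

# Score every feature with the ANOVA F-test and keep the 20 best.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

# Indices of the retained columns, ordered by descending F-score.
kept = selector.get_support(indices=True)
ranked = kept[np.argsort(selector.scores_[kept])[::-1]]

print(X_selected.shape)
print("Top five feature indices:", ranked[:5])
```

Because f_classif scores each feature on its own, this filter is fast but does not account for interactions between features.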
7. Model training, Evaluation and Testing:
Model training, evaluation, and testing are major phases of the ML process, in which predictive
algorithms are built and assessed. In this project, we trained three models, Logistic
Regression, Decision Tree, and Random Forest, using their scikit-learn implementations
(LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier()). Each model was
trained on the feature subset chosen by the feature selection step (SelectKBest), that is, the
features most strongly tied to income levels.
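A minimal sketch of training the three models on a common split is shown below. The synthetic data, the 80/20 stratified split, and the hyperparameters (max_iter, n_estimators, random_state) are all assumptions made for illustration; the report does not state the settings actually used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected feature subset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit every model on the same training data and score it on the same test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Using one split for all three models keeps the comparison fair, since every model sees exactly the same training and test rows.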
Accuracy, precision, recall, and F1-score were used together for model evaluation and testing.
These metrics measured the performance of the models on a separate test set. The scikit-learn
functions accuracy_score, precision_score, recall_score, and f1_score allowed quantitative
comparison of the models' predictive abilities. In addition, confusion matrices and
classification reports gave a detailed account of how well each model classified the dataset
into the two income categories.
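The relationship between the confusion matrix and the four metrics can be checked on a small hand-made example. The labels below are hypothetical and chosen so the arithmetic is easy to follow: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = income >50K, 0 = income <=50K.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)   = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)   = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

Here one high earner is missed and one lower earner is wrongly flagged, so precision and recall coincide at 0.75 while accuracy is 0.8.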
Confusion Matrix:
8. Discussion:
The evaluation of the Logistic Regression, Decision Tree, and Random Forest models on the
Adult dataset gives insight into their predictive power and highlights the cases where they
perform well and those where their predictive ability is limited. The Random Forest model
performed best, reaching the highest accuracy, precision, recall, and F1-score. This is largely
the outcome of the ensemble method, which pools the predictions of many decision trees and
therefore allows the model to handle complex data more effectively while limiting overfitting.
This method is most useful when the outcome is influenced by complex interactions between
features.
Logistic Regression, though slightly less accurate than Random Forest, still yields strong
results. Its main assets are simplicity and interpretability, which matter most in contexts
where human understanding of how the variables relate to the outcome is more important than
predictive accuracy. This is one of the model's stronger points, and it makes the model useful
in contexts where it is necessary to reduce false positives.
The Decision Tree model, even though it lagged behind on the performance metrics, still has the
advantage that its operation is much simpler to follow. Its decision-making process can be
understood easily, which makes it suitable for educational use and initial data exploration.
Nevertheless, it can be sensitive to small variations in the data, which affects its stability
and ability to generalize, reflecting the fact that ease of interpretation and robustness are
often in tension in machine learning.
9. Results Comparison across models:
In our evaluation, we compared three predictive models, Logistic Regression, Decision Tree,
and Random Forest, using an Accuracy Comparison Plot. The Logistic Regression model achieved an
accuracy of 84.98%, the second highest of the three, a result that underlines its stability in
capturing the patterns within the dataset. Combined with its balance between complexity and
interpretability, this makes it well suited to scenarios that require high precision.
The Decision Tree model achieved an accuracy of 82.45%, the lowest of the three. Its lower
accuracy could be because a single tree can overfit or underfit, depending on the nature of the
data being processed. This suggests that the method is less effective at handling complicated
relationships between the features in the dataset.
Random Forest, an ensemble approach made up of several decision trees, was the best-performing
model with an accuracy of 85.26%, only slightly higher than Logistic Regression. The main merit
of this model is that it mitigates overfitting, thereby improving how well the results
generalize to unseen data. Although this method is highly accurate, tuning the feature
selection performed at each split could further improve its performance in the future.
10. Analysis of the AI project life cycle compliance with AI ethics:
Using AI models such as Logistic Regression to predict income levels from the Adult dataset
requires AI ethics to be considered at every stage of the project. Ethical compliance is
assessed against the principles of fairness, accountability, transparency, and privacy.
Fairness: The project aims to reduce bias by using a diverse and representative dataset,
preprocessing the data to handle missing values without skewing the data, and choosing
features on the basis of statistical significance and relevance rather than assumptions.
Although these measures help to counter skews inherited from the historical census data,
the models may still perpetuate inequities if no further steps are taken. Therefore,
recurrent monitoring for biased results, particularly against vulnerable communities,
will be of critical importance.
Accountability: The use of simple models like Logistic Regression and Decision Trees is
an attempt to maintain transparency. Such models facilitate an audit trail of the
decision-making process, allowing outcomes to be traced back to the model's logic. The
Random Forest model, though less transparent, can draw on interpretability aids such as
feature importance scores.
Transparency: The project provides transparency by documenting the data sources, the
preprocessing steps, the choice of models, and the evaluation metrics. Transparency also
comes from the use of ROC curves and confusion matrices, which help show the models'
performance.
Privacy: The dataset used for the project is publicly available and anonymized, and the
project does not attempt to re-identify individuals. Adherence to privacy guidelines and
regulations such as GDPR is assumed, emphasizing the security and confidentiality of data
throughout the whole project.
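The report does not specify how the monitoring for biased results mentioned under Fairness would be carried out. One simple possibility, sketched here on hypothetical predictions, is to compare the share of positive predictions and the recall across the groups of a sensitive attribute such as Sex; the column names y_true and y_pred are our own.

```python
import pandas as pd

# Hypothetical predictions for illustration only (1 = predicted >50K).
results = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Selection rate: share of each group predicted to earn more than $50K.
selection_rate = results.groupby("Sex")["y_pred"].mean()

# Recall per group: among true >50K earners, the share predicted as such.
positives = results[results["y_true"] == 1]
recall_by_group = positives.groupby("Sex")["y_pred"].mean()

# A large gap between groups is a signal to investigate further.
gap = selection_rate.max() - selection_rate.min()

print(selection_rate)
print(recall_by_group)
print("Selection-rate gap:", gap)
```

In this toy example both groups have the same true rate of high earners, yet the predictions favor one group, which is the kind of pattern such a check is meant to surface.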
11. Conclusion:
Machine learning models have been applied to the Adult dataset to reveal the key predictors of
elevated income levels. Among the models tested, Random Forest achieved the highest accuracy
(85.26%), with Logistic Regression close behind (84.98%) and offering superior
interpretability, which makes it especially useful for identifying the role of education level,
occupation, and working hours, factors that theory also identifies as structural aspects of
employment.
In the course of this project, we also addressed AI ethics concerns, including data bias and
privacy safeguards, in order to build trustworthy models. Incorporating a range of socioeconomic
factors in this type of analysis makes its results more credible and useful, paving the way for
better informed policies and interventions. The comparative visualizations, specifically the
accuracy comparison plot, were informative and made the models' relative performance easier to
appreciate, showing the value of visualization techniques in data science.
In sum, this study met its aim of predicting income levels, with a best accuracy of 85.26%, and
laid a basis for responsible AI use in socioeconomic research. It underlines the need for
rigorous model selection, careful data handling, and adherence to ethical standards, and it lays
the groundwork for more refined research.
13. References:
Gomez-Cravioto, D.A., Diaz-Ramos, R.E., Hernandez-Gress, N., Preciado, J.L. and Ceballos,
H.G., 2022. Supervised machine learning predictive analytics for alumni income. Journal of Big
Data, 9(1), p.11.
Sharath, R., Nirupam, K.N., Sowmya, B.J. and Srinivasa, K.G., 2016. Data analytics to predict
the income and economic hierarchy on census data. In: 2016 International Conference on
Computation System and Information Technology for Sustainable Solutions (CSITSS). IEEE, pp.
249-254.
Srinivasa, K.G., Sharath, R., Chaitanya, S.K., Nirupam, K.N. and Sowmya, B.J., 2018. Data
analytics on census data to predict the income and economic hierarchy. International Journal of
Data Analysis Techniques and Strategies, 10(3), pp.223-240.
Chakrabarty, N. and Biswas, S., 2018. A statistical approach to adult census income level
prediction. In: 2018 International Conference on Advances in Computing, Communication
Control and Networking (ICACCCN). IEEE, pp. 207-212.