Notes: Data Science and Big Data
1. Introduction
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured data.
Big Data refers to datasets so large, fast-growing, or varied that traditional data-processing
tools cannot handle them efficiently. They are analyzed computationally to reveal patterns,
trends, and associations, especially relating to human behavior and interactions.
2. Components of Data Science
- Data Collection and Storage
- Data Cleaning and Preprocessing
- Exploratory Data Analysis (EDA)
- Statistical Analysis and Machine Learning
- Data Visualization
- Deployment and Communication of Results
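The components above can be sketched end to end on a toy dataset. This is a minimal illustration using only the Python standard library, not a prescribed workflow. The records, the field names, and the helper names `clean` and `summarize` are hypothetical and exist only for this example.

```python
import statistics

# Hypothetical raw records standing in for the collection step.
raw_records = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "61000"},             # missing age
    {"age": "45", "income": "not available"},   # unparseable income
    {"age": "29", "income": "48000"},
    {"age": "51", "income": "75000"},
]


def clean(records):
    """Cleaning step: keep only rows where every field parses as a number."""
    cleaned_rows = []
    for row in records:
        try:
            cleaned_rows.append({key: float(value) for key, value in row.items()})
        except ValueError:
            continue  # drop rows with missing or malformed values
    return cleaned_rows


def summarize(records, field):
    """Exploratory step: basic descriptive statistics for one field."""
    values = [row[field] for row in records]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


cleaned = clean(raw_records)
income_summary = summarize(cleaned, "income")
print(income_summary)
```

In practice, libraries such as Pandas would usually replace the hand-written cleaning and summary steps. The standard library is used here only to keep the sketch dependency-free.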
3. Tools and Technologies in Data Science
- Programming Languages: Python, R
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
- Tools: Jupyter Notebook, Tableau, Power BI
- Databases: Relational (SQL), NoSQL (e.g., MongoDB)
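As a small illustration of querying a relational database with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table and its rows are made up for the example. SQLite is an assumption chosen because it needs no setup, not a tool named in these notes.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 200.0)],
)

# Aggregate the total sales amount per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)

conn.close()
```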
4. Introduction to Big Data
Big Data is characterized by the 5 V’s:
- Volume: Large amount of data
- Velocity: Speed of data generation
- Variety: Different types of data (structured, semi-structured, unstructured)
- Veracity: Accuracy and trustworthiness of data (uncertainty, noise, inconsistency)
- Value: Insights derived from data
5. Big Data Technologies
- Hadoop Ecosystem: HDFS, MapReduce, Hive, Pig, HBase
- Apache Spark: Fast in-memory big data processing
- NoSQL Databases: MongoDB, Cassandra
- Data Lakes and Cloud Storage (e.g., on AWS, Azure)
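To make the MapReduce model from the Hadoop ecosystem more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to word counting. It only imitates the programming model: a real Hadoop job would distribute these phases across many machines. The function names and sample documents are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input; each string stands in for one input split.
documents = ["big data needs big tools", "data tools process data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each call to `map_phase` depends only on its own document, the map step could run in parallel on separate nodes, which is the property that lets the model scale to large volumes.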
6. Applications
- Business Intelligence and Analytics
- Healthcare and Genomics
- Social Media and Web Analytics
- Fraud Detection and Cybersecurity
- E-commerce and Recommendation Systems
7. Challenges
- Data Privacy and Security
- Data Integration and Cleaning
- Real-time Processing
- Lack of Skilled Professionals