Chap1_Introduction

The document outlines the course DS 644: Introduction to Big Data, covering topics such as big data analytics, technological infrastructure, and applications across various fields. It details the course structure, including class attendance, resources, and core faculty, while also highlighting challenges in big data and funded projects. Additionally, it emphasizes the mission of the Center for Big Data at NJIT to facilitate research collaboration and scientific discovery through advanced big data technologies.

Uploaded by

Pavan Frustum

DS 644: Introduction to Big Data

Chapter 1. Introduction

Yijie Zhang

New Jersey Institute of Technology


yz829@njit.edu

Course Website

• Check the course website regularly for homework/project assignments, reading materials, tutorials, etc.
Class Attendance Check
Purpose:
• Have an opportunity to learn about each other’s education/research background or work experience to possibly form a team with common interests for homework or projects.
• Name
• Program/Year
• Why do you take this course?
• What is the largest data size you’ve ever personally handled and in what context?
  − application domain
  − data type
  − storage format
  − processing/analysis purposes
  − etc.

Order of Magnitude:
Binary   Prefix   Decimal   Name
2^0      1        10^0      One
2^10     K        10^3      Thousand
2^20     M        10^6      Million
2^30     G        10^9      Billion
2^40     T        10^12     Trillion
2^50     P        10^15     Quadrillion
2^60     E        10^18     Quintillion
2^70     Z        10^21     Sextillion
2^80     Y        10^24
2^90     ……
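The order-of-magnitude table pairs each binary power (a step of 2^10) with its decimal neighbor (a step of 10^3). A minimal Python sketch of how the two conventions drift apart as sizes grow is given here; the helper names `humanize` and `binary_to_decimal_ratio` are illustrative assumptions, not part of the course material.

```python
# Prefix letters from the table: index i stands for 2**(10*i) or 10**(3*i).
PREFIXES = ["", "K", "M", "G", "T", "P", "E", "Z", "Y"]


def humanize(n_bytes: int, binary: bool = True) -> str:
    """Format a byte count using the largest prefix that keeps the value >= 1.

    Hypothetical helper for illustration only.
    """
    base = 1024 if binary else 1000
    value = float(n_bytes)
    index = 0
    while value >= base and index < len(PREFIXES) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {PREFIXES[index]}B"


def binary_to_decimal_ratio(step: int) -> float:
    """Ratio 2**(10*step) / 10**(3*step) for the given row of the table."""
    return 2 ** (10 * step) / 10 ** (3 * step)


if __name__ == "__main__":
    print(humanize(2 ** 40))                # 1.00 TB (binary convention)
    print(humanize(2 ** 40, binary=False))  # 1.10 TB (decimal convention)
    print(binary_to_decimal_ratio(1))       # 1.024 at the K row
```

The gap is 2.4% at the K row and keeps growing with each row, which is why it matters to state which convention a reported data size uses.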
About this course
• Recent Developments and Future Trends on Big Data Computing
  − Continuum Computing: from Edge to Cloud
  − High-performance Computing: Supercomputer, Cluster, etc.

• Overview of Big Data Analytics

• Big Data Ecosystem


• Systems, Platforms, Tools, and Techniques for Big Data Transfer, Storage,
Management, Computing, Processing, and Analysis

• Machine Learning for Big Data

• Advanced Topics:
  − Big Data Meets Large Models
  − Big Data Visualization
  − Big Data Transfer
  − Big Data Workflows
  − Big Data Security

State of the art in big data (slide figure dated June 2024):
• Networking: 100s of Gbps (backbone)
• Storage: PB/EB
• Computing: EFlop/s, a first ever!
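To relate the networking and storage figures, a back-of-the-envelope sketch can estimate how long a dataset takes to cross a link. The function name `transfer_time_seconds` and the inputs (1 PB, 100 Gbps) are illustrative assumptions. The sketch uses decimal units and an ideal link with no protocol overhead, so it gives a lower bound rather than a measured value.

```python
def transfer_time_seconds(size_bytes: float, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbps (10**9 bits/s).

    Ignores protocol overhead, congestion, and storage I/O limits, so real
    transfers will take longer.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    PB = 10 ** 15  # decimal petabyte (assumed unit convention)
    seconds = transfer_time_seconds(1 * PB, link_gbps=100)
    print(f"1 PB over 100 Gbps: {seconds / 3600:.1f} hours")  # 22.2 hours
```

Even on an ideal 100 Gbps path, a single petabyte takes most of a day to move, which is one reason big data transfer appears as its own topic in the course outline.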

Textbooks and Reference Books
• Overview
• MapReduce / Hadoop
• Popular Frameworks
• Machine Learning / Data Mining
• Learning Theory
Four V’s of Big Data

Big Data and HPC for LLMs

Center for Big Data

Director: Chase Wu

URL: https://centers.njit.edu/bigdata

Location: GITC 4416

Mission Statement

• Synergize the strong expertise in various
disciplines across the NJIT campus
• Build a unified big data platform that embodies a
rich set of big data enabling technologies and
services with optimized performance to facilitate
research collaboration and scientific discovery
• Investigate, develop, and apply cutting-edge
technologies to address unprecedented
challenges in big data with high Volume, high
Velocity, high Variety, and high Veracity,
in order to create high VALUE!

A Three-layer Structure of the CBD
− Layer 3: Big Data Applications
• Domains: Transportation, Solar-Terrestrial, Brain injury, Physics, Healthcare, Business, Smart city, etc.
• Goals: Advance sciences in various domains
• Tasks: Adapt, customize, and refine application-specific solutions

(North-bound User Interface)

− Layer 2: Big Data Technological Infrastructure
• Components: Systems/Platforms, Tools/Libraries, Services, Algorithms
• Goals: Provide generic and special big-data enabling solutions
• Tasks: Investigate, design, develop, implement, and test big data-oriented analytics, visualization, computing, networking, workflow, storage, and retrieval solutions

(Data Access and Retrieval)

− Layer 1: Big Data Repository
• Contents: Raw data (experimental, simulation, observational); metadata, markup data; analysis results (intermediate, final); models, views, tables, forms, animations, etc.; workflow templates, provenance data
• Goals: Share data and analysis results for community building
• Tasks: Standardize, categorize and benchmark datasets

− Layer 1: Big Data Repository
• Store, manage, and provide a wide variety of data such as raw data
(experimental, simulation, observational, and user-generated
content), metadata, markup data, analysis results (intermediate
and final) in various forms including models, views, tables, images,
and videos, and workflow templates with provenance data.
• Build a dedicated one-stop portal to share research data and
analysis results for community building.
− Layer 2: Big Data Technological Infrastructure
• Provide generic and domain-specific big data enabling solutions for
data management, movement, and analytics.
• Host and maintain a set of practical technical resources in the form
of systems/platforms, tools/libraries, services, and algorithms in
various areas including database management, data mining,
machine learning, and parallel and distributed computing, which
are needed to compose big data solutions in different application
domains.

− Layer 3: Big Data Applications
• Present a common portal to big data applications spanning a
wide spectrum of research fields, including
− transportation
− solar-terrestrial
− brain injury
− physics
− healthcare
− business
− smart city
• Provide researchers with powerful, customized big data solutions to
advance the frontier of science in various application domains.

Core Faculty of CBD
• Chase Wu (Director): Professor, Dept of Data Science
• Dantong Yu (Co-Director): Associate Professor, Leir Chair, School of Management
• Yi Chen: Professor, Leir Chair, School of Management, Dept of Computer Science
• Andrew Gerrard: Professor, Dept of Physics, Center for Solar-Terrestrial Research
• Lazar Spasovic: Professor, Dept of Civil and Environmental Engineering
• Steven Chien: Professor, Dept of Civil and Environmental Engineering
• Joyoung Lee: Assistant Professor, Dept of Civil and Environmental Engineering
• Namas Chandra: Professor, Dept of Biomedical Engineering, Center for Injury Biomechanics, Materials and Medicine
• Jason Wang: Professor, Dept of Computer Science
• Usman Roshan: Associate Professor, Dept of Computer Science
• Zhi Wei: Professor, Dept of Computer Science
• Dimitri Theodoratos: Associate Professor, Dept of Computer Science
• Vincent Oria: Professor, Dept of Computer Science
• Senjuti Roy: Associate Professor, Dept of Computer Science
• Brook Wu: Associate Professor, Dept of Informatics
• Hai Phan: Assistant Professor, Dept of Data Science

Funded Projects
• DOE: Technologies and Tools for Synthesis of Source-to-Sink High-
Performance Flows, DOE Office of Science, Big Data-Aware Terabits
Networking.
• NSF: An Integrated Approach to Performance Modeling and Optimization of
Big-data Scientific Workflows, Computer and Network Systems.
• DOE: Towards a Scalable and Adaptive Application Support Platform for Large-
Scale Distributed E-Sciences in High-Performance Network Environments,
DOE Office of Science, High-Performance Networks for Distributed Petascale
Science.
• Google Research Award: Understanding and Processing Subjective Queries on
Structured Data.
• NSF: CAREER: Analyzing and Exploiting Meta-information for Keyword Search
on Semi-structured Data.
• EarthCube IA: Magnetosphere-Ionosphere-Atmosphere Coupling, Abstract
#1541009.
• Intelligent Transportation Systems Resource Center - Task: Data Acquisition,
Integration, Analysis, and Visualization.

Application 1: Transportation
[Figure: data flow of the real-time traffic system at NJIT]
• Data sources: TrafficCast Bluetooth data (real-time speed and travel time); ASTI real-time traffic volume devices; TRANSCOM server (Transmit & OpenReach); other available data sources (Inrix, Plan4Safety, ATR, etc.)
• Transport: NJIT devices transmit data using the Verizon 3G/4G network; Internet
• NJIT side: database server and web application server on the NJIT internal network
• Delivered to end users over the Internet: device location and status, travel times, travel time indexes, real-time speed, level of service and map, volume, real-time traffic signal status

Big Data Challenges:
• Standardization of data format
• Accurate modeling
• Clustering and classifying
• Integrating data from independent sources
• Uncovering patterns, correlation, etc.
• Interpretation
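
Two of the challenges listed for this application, standardizing data formats and integrating independent sources, can be sketched in Python. Everything specific here is an assumption made for illustration: the two feed layouts, every field name, and the `SegmentReading` schema are hypothetical. The travel time index is taken to be observed travel time divided by free-flow travel time, a common definition that the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentReading:
    """Hypothetical common schema for one road-segment observation."""
    segment_id: str
    source: str
    speed_kmh: float
    travel_time_s: float


MPH_TO_KMH = 1.609344


def from_bluetooth(raw: dict) -> SegmentReading:
    # Hypothetical feed A: speed in mph, travel time in minutes.
    return SegmentReading(
        segment_id=str(raw["seg"]),
        source="bluetooth",
        speed_kmh=raw["speed_mph"] * MPH_TO_KMH,
        travel_time_s=raw["tt_min"] * 60.0,
    )


def from_probe(raw: dict) -> SegmentReading:
    # Hypothetical feed B: already metric, but with different field names.
    return SegmentReading(
        segment_id=str(raw["segmentId"]),
        source="probe",
        speed_kmh=float(raw["kph"]),
        travel_time_s=float(raw["seconds"]),
    )


def travel_time_index(readings: list[SegmentReading], free_flow_s: float) -> float:
    """Mean observed travel time divided by the free-flow travel time."""
    if not readings or free_flow_s <= 0:
        raise ValueError("need at least one reading and a positive free-flow time")
    mean_tt = sum(r.travel_time_s for r in readings) / len(readings)
    return mean_tt / free_flow_s


if __name__ == "__main__":
    merged = [
        from_bluetooth({"seg": 17, "speed_mph": 30.0, "tt_min": 4.0}),
        from_probe({"segmentId": "17", "kph": 50.0, "seconds": 260.0}),
    ]
    print(travel_time_index(merged, free_flow_s=200.0))  # 1.25
```

Converting each source to one schema at the point of ingestion keeps unit and naming differences out of the analysis code, so a new feed only needs its own small adapter function.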
Application 2: Solar-Terrestrial Research
• OVSA: 50 GB/day
• Van Allen Probes: 2 GB/day
• SWRL: 10 GB/day
• Jeffer Lidar
• BBSO: 6000 GB/day
• Other: 0.25 GB/day
• PEDC/Antarctic: 0.5 GB/day

Big Data Challenges:
• Complex Process: Plasma Physics + Fluid Dynamics
• Expensive Equipment: Remote Sensing/Instrumentation
• Data Reduction and Inversion
• Modeling and Prediction (sunspot cycle, solar flare)
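
The daily rates quoted on this slide can be added up to see the aggregate load. The Python sketch here rests on several assumptions: constant rates, a 365-day year, and decimal units (1 PB = 10^6 GB). "Jeffer Lidar" is left out because the slide gives no rate for it, and the function names are illustrative.

```python
# Daily data rates in GB/day as listed on the slide. "Jeffer Lidar" is
# omitted because no rate is given for it.
DAILY_GB = {
    "OVSA": 50.0,
    "Van Allen Probes": 2.0,
    "SWRL": 10.0,
    "BBSO": 6000.0,
    "Other": 0.25,
    "PEDC/Antarctic": 0.5,
}


def total_gb_per_day(rates: dict) -> float:
    """Sum of all listed daily rates, in GB/day."""
    return sum(rates.values())


def yearly_pb(rates: dict, days: int = 365) -> float:
    """Convert GB/day to PB/year, assuming constant rates and 1 PB = 10**6 GB."""
    return total_gb_per_day(rates) * days / 1e6


if __name__ == "__main__":
    total = total_gb_per_day(DAILY_GB)
    print(f"Total: {total} GB/day")                       # Total: 6062.75 GB/day
    print(f"BBSO share: {DAILY_GB['BBSO'] / total:.1%}")  # BBSO share: 99.0%
    print(f"Per year: {yearly_pb(DAILY_GB):.2f} PB")      # Per year: 2.21 PB
```

Under these assumptions a single instrument (BBSO) accounts for nearly all of the volume, so storage and data reduction planning is dominated by that one source.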
Application 3: Brain Injury Research

• Ballistics (bullet, shrapnel)
• Blunt (motor vehicle, sports, fall from height)
• Blast (explosions)
[Figure labels: Ballistic (bullet); Blunt injury, most prevalent; Blast (military); Blunt impacts: MVA, fall, sports injury; concussion]

Exascale Computing and Big Data

By Daniel A. Reed and Jack Dongarra

Communications of the ACM, July 2015

https://vimeo.com/129742718

Thanks!☺
Questions?
