EDA Python for Data Analysis

The document provides a comprehensive guide on using Apache Spark for data manipulation, including data loading, cleaning, analysis, visualization, and machine learning integration. It covers various operations such as reading/writing different file formats, performing statistical analysis, and handling complex data types. Additionally, it discusses performance optimization techniques and advanced features like window functions, graph analysis, and real-time data processing.


1.

Data Loading

• Read CSV File:

df = spark.read.csv('filename.csv', header=True, inferSchema=True)

• Read Parquet File:

df = spark.read.parquet('filename.parquet')

• Read from JDBC (Databases):

df = spark.read.format("jdbc").options(url="jdbc_url", dbtable="table_name").load()
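In practice a JDBC read usually also needs credentials and a driver class. A minimal sketch, assuming a PostgreSQL database (the URL, table name, credentials, and driver are placeholders):

# Hypothetical connection details; adjust for your database
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/dbname")
      .option("dbtable", "table_name")
      .option("user", "username")
      .option("password", "password")
      .option("driver", "org.postgresql.Driver")
      .load())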

2. Show Data

• Display Top Rows: df.show()

• Print Schema: df.printSchema()

• Summary Statistics: df.describe().show()

• Count Rows: df.count()

• Display Columns: df.columns

3. Data Cleaning

• Drop Missing Values: df.na.drop()

• Fill Missing Values: df.na.fill(value)

• Drop Irrelevant Columns: df.drop('column_name')

• Rename Column: df.withColumnRenamed('old_name', 'new_name')

• Check for Duplicates: df.count() - df.dropDuplicates().count()

• Drop Duplicates on Specific Columns: df.dropDuplicates(['column1', 'column2'])

• Remove All Duplicate Rows: df.dropDuplicates()


• Check for Outliers:
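A minimal sketch for an outlier check, assuming a numeric placeholder column named 'column', using approximate quantiles and the 1.5×IQR rule:

# IQR-based outlier check; 'column' is a placeholder numeric column
q1, q3 = df.approxQuantile('column', [0.25, 0.75], 0.01)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df.filter((df['column'] < lower) | (df['column'] > upper))
outliers.count()  # number of flagged rows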

6. Statistical Analysis

• Describe data: df.describe()

• Show Distribution (seaborn, on a pandas DataFrame): sns.histplot(pandas_df['column'], bins=20, kde=True)

• Correlation Matrix (needs a vector column, e.g. built with VectorAssembler): from pyspark.ml.stat import Correlation; Correlation.corr(df, 'features')

• Pairwise Correlation of Two Columns: df.stat.corr('column1', 'column2')

• Covariance: df.stat.cov('column1', 'column2')

• Frequency Items: df.stat.freqItems(['column1', 'column2'])

7. Data Visualization

• Bar Chart: df.groupBy('column').count().show()

• Histogram: df.select('column').rdd.flatMap(lambda x: x).histogram(10)

• Scatter Plot: df.select('column1', 'column2').show()

• Box Plot: pandas_df[['column']].boxplot()

• ……………………
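Spark itself does not render charts; the snippets above prepare the data, and plotting normally happens after converting to pandas. A minimal sketch, assuming seaborn and matplotlib are installed and 'column1'/'column2' are placeholder columns:

import seaborn as sns
import matplotlib.pyplot as plt

# Pull only the columns needed for plotting into pandas
pandas_df = df.select('column1', 'column2').toPandas()

sns.histplot(pandas_df['column1'], bins=20, kde=True)      # distribution / histogram
plt.show()
sns.scatterplot(data=pandas_df, x='column1', y='column2')  # scatter plot
plt.show()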

8. Export Data in Python

• Convert to Pandas DataFrame: pandas_df = df.toPandas()


• Convert to CSV (Pandas): pandas_df.to_csv('path_to_save.csv', index=False)

• Write DataFrame to CSV: df.write.csv('path_to_save.csv')

• Write DataFrame to Parquet: df.write.parquet('path_to_save.parquet')


9. Advanced Data Processing

• Window Functions: from pyspark.sql.window import Window; from pyspark.sql.functions import rank; df.withColumn('rank', rank().over(Window.partitionBy('column').orderBy('other_column')))

• Pivot Table: df.groupBy('column').pivot('pivot_column').sum('sum_column')

• UDF (User Defined Functions): from pyspark.sql.functions import udf; my_udf = udf(my_python_function); df.withColumn('new_col', my_udf(df['col']))
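udf() defaults to a string return type, so declaring the return type avoids surprises. A minimal sketch, assuming a hypothetical my_python_function that returns a float:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def my_python_function(x):
    # placeholder logic; return None for missing input
    return float(x) * 2 if x is not None else None

my_udf = udf(my_python_function, DoubleType())   # explicit return type
df = df.withColumn('new_col', my_udf(df['col']))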

10. Performance Optimization

• Caching DataFrame: df.cache()

• Repartitioning: df.repartition(10)

• Broadcast Join Hint: from pyspark.sql.functions import broadcast; df.join(broadcast(df2), 'key', 'inner')

11. Exploratory Data Analysis Specifics

• Column Value Counts: df.groupBy('column').count().show()

• Distinct Values in a Column: df.select('column').distinct().show()

• Aggregations (sum, max, min, avg): df.groupBy().sum('column').show()
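Several aggregates can be computed in one pass with agg(). A minimal sketch, with placeholder column names:

from pyspark.sql import functions as F

df.groupBy('group_column').agg(
    F.min('column').alias('min'),
    F.max('column').alias('max'),
    F.avg('column').alias('avg'),
    F.sum('column').alias('sum')
).show()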

12. Working with Complex Data Types

• Exploding Arrays: df.withColumn('exploded', explode(df['array_column']))

• Working with Structs: df.select(df['struct_column']['field'])

• Handling Maps: df.select(map_keys(df['map_column']))

13. Joins

• Inner Join: df1.join(df2, df1['id'] == df2['id'])

• Left Outer Join: df1.join(df2, df1['id'] == df2['id'], 'left_outer')


• Right Outer Join: df1.join(df2, df1['id'] == df2['id'], 'right_outer')
14. Saving and Loading Models

• Saving ML Model: model.save('model_path')

• Loading ML Model:

from pyspark.ml.classification import LogisticRegressionModel; model = LogisticRegressionModel.load('model_path')

15. Handling JSON and Complex Files

• Read JSON: df = spark.read.json('path_to_file.json')

• Flatten a Nested JSON Struct: df.selectExpr('json_column.*')

16. Custom Aggregations

• Custom Aggregate Function:

from pyspark.sql import functions as F; df.groupBy('group_column').agg(F.sum('sum_column'))

17. Working with Null Values

• Counting Nulls in Each Column:

from pyspark.sql import functions as F; df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

• Drop Rows with Null Values: df.na.drop()

18. Data Import/Export Tips

• Read Text Files: df = spark.read.text('path_to_file.txt')

• Write Data to JDBC:

df.write.format("jdbc").options(url="jdbc_url", dbtable="table_name").save()

19. Advanced SQL Operations

• Register DataFrame as Table: df.createOrReplaceTempView('temp_table')


• Perform SQL Queries: spark.sql('SELECT * FROM temp_table WHERE condition')

20. Dealing with Large Datasets

• Sampling Data: sampled_df = df.sample(False, 0.1)

• Approximate Count Distinct:

from pyspark.sql.functions import approx_count_distinct; df.select(approx_count_distinct('column')).show()

21. Data Quality Checks

• Checkpointing Intermediate Results (set a checkpoint directory first with spark.sparkContext.setCheckpointDir): df.checkpoint()

• Asserting Conditions: df.filter(df['column'] > 0).count()

22. Advanced File Handling

• Specify Schema While Reading: from pyspark.sql.types import StructType; schema = StructType([...]); df = spark.read.csv('file.csv', schema=schema)

• Writing in Overwrite Mode: df.write.mode('overwrite').csv('path_to_file.csv')

23. Debugging and Error Handling

• Collecting Data Locally for Debugging: local_data = df.take(5)

• Handling Exceptions in UDFs:

def safe_udf(my_udf):
    def wrapper(*args, **kwargs):
        try:
            return my_udf(*args, **kwargs)
        except Exception:
            return None
    return wrapper
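To use the wrapper, register the wrapped function as a UDF so that failing rows yield null instead of failing the job. A minimal sketch, assuming the placeholder my_python_function from section 9 returns a string:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

safe_fn = udf(safe_udf(my_python_function), StringType())  # returns null on error
df = df.withColumn('safe_col', safe_fn(df['col']))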

24. Machine Learning Integration

• Creating Feature Vector:

from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=['col1', 'col2'], outputCol='features'); feature_df = assembler.transform(df)

25. Advanced Joins and Set Operations

• Cross Join: df1.crossJoin(df2)

• Set Operations (Union, Intersect, Minus): df1.union(df2); df1.intersect(df2); df1.subtract(df2)

26. Dealing with Network Data

• Reading Data from an HTTP Source: Spark's built-in readers take file-system paths rather than HTTP URLs, so download the file first (for example with sparkContext.addFile) and then read it locally, as sketched below.

27. Integration with Visualization Libraries

• Convert to Pandas for Visualization: pandas_df = df.toPandas(); pandas_df.plot(kind='bar')

28. Spark Streaming for Real-Time EDA

• Reading from a Stream: df = spark.readStream.format('source').load()

• Writing to a Stream: df.writeStream.format('console').start()

29. Advanced Window Functions

• Cumulative Sum: from pyspark.sql.window import Window; from pyspark.sql import functions as F; df.withColumn('cum_sum', F.sum('column').over(Window.partitionBy('group_column').orderBy('order_column')))

• Row Number: df.withColumn('row_num', F.row_number().over(Window.orderBy('column')))
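A common EDA use of row_number is keeping only the most recent row per group. A minimal sketch, with placeholder column names:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('group_column').orderBy(F.col('order_column').desc())
latest = (df.withColumn('row_num', F.row_number().over(w))
            .filter(F.col('row_num') == 1)
            .drop('row_num'))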

30. Handling Complex Analytics

• Rollup: df.rollup('column1', 'column2').agg(F.sum('column3'))

• Cube for Multi-Dimensional Aggregation: df.cube('column1', 'column2').agg(F.sum('column3'))

31. Dealing with Geospatial Data

• Using GeoSpark for Geospatial Data:

from geospark.register import GeoSparkRegistrator; GeoSparkRegistrator.registerAll(spark)

32. Advanced File Formats

• Reading ORC Files: df = spark.read.orc('filename.orc')

• Writing Data to ORC: df.write.orc('path_to_file.orc')

33. Dealing with Sparse Data

• Using Sparse Vectors:

from pyspark.ml.linalg import SparseVector; sparse_vec = SparseVector(size, {index: value})

34. Handling Binary Data

• Reading Binary Files:

df = spark.read.format('binaryFile').load('path_to_binary_file')

35. Efficient Data Transformation

• Using mapPartitions for Transformation:

rdd = df.rdd.mapPartitions(lambda partition: [transform(row) for row in partition])

36. Advanced Machine Learning Operations

• Using ML Pipelines:

from pyspark.ml import Pipeline; pipeline = Pipeline(stages=[stage1, stage2]); model = pipeline.fit(df)

• Model Evaluation:

from pyspark.ml.evaluation import BinaryClassificationEvaluator; evaluator = BinaryClassificationEvaluator(); evaluator.evaluate(predictions)
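Putting the pieces together: a minimal end-to-end sketch, assuming numeric feature columns 'col1' and 'col2' and a binary 'label' column:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=['col1', 'col2'], outputCol='features')
lr = LogisticRegression(featuresCol='features', labelCol='label')
pipeline = Pipeline(stages=[assembler, lr])

train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
print(BinaryClassificationEvaluator(labelCol='label').evaluate(predictions))  # area under ROC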

37. Optimization Techniques

• Broadcast Joins for Efficiency: from pyspark.sql.functions import broadcast; df.join(broadcast(df2), 'key')

• Using Accumulators for Global Aggregates: accumulator = spark.sparkContext.accumulator(0); rdd.foreach(lambda x: accumulator.add(x))

38. Advanced Data Import/Export

• Reading Data from Multiple Sources: df = spark.read.format('format').option('option', 'value').load(['path1', 'path2'])

• Writing Data to Multiple Formats: df.write.format('format').save('path', mode='overwrite')

39. Utilizing External Data Sources

• Connecting to External Data Sources (e.g., Kafka, S3):

df = spark.read.format('kafka').option('kafka.bootstrap.servers', 'host1:port1').option('subscribe', 'topic_name').load()

40. Efficient Use of SQL Functions

• Using Built-in SQL Functions:

from pyspark.sql.functions import col, lit; df.withColumn('new_column', col('existing_column') + lit(1))

41. Exploring Data with GraphFrames

• Using GraphFrames for Graph Analysis:


from graphframes import GraphFrame; g = GraphFrame(vertices_df, edges_df)

42. Working with Nested Data

• Exploding Nested Arrays:

df.selectExpr('id', 'explode(nestedArray) as element')

• Handling Nested Structs: df.select('struct_column.*')

43. Advanced Statistical Analysis

• Hypothesis Testing:

from pyspark.ml.stat import ChiSquareTest; r = ChiSquareTest.test(df, 'features', 'label')

• Statistical Functions (e.g., mean, stddev):

from pyspark.sql.functions import mean, stddev; df.select(mean('column'), stddev('column'))

44. Customizing Spark Session

• Configuring SparkSession:

spark = SparkSession.builder.appName('app').config('spark.some.config.option', 'value').getOrCreate()
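A minimal, self-contained sketch (the config key and value shown are just examples):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('app')
         .config('spark.sql.shuffle.partitions', '200')   # example setting; tune as needed
         .getOrCreate())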
