# NLP Interview Questions
A collection of technical interview questions for machine learning and natural language processing engineering positions.

The answers to all of these questions were generated using ChatGPT!

### 1. What is the difference between stemming and lemmatization? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)

Stemming and lemmatization are both techniques used in natural language processing to reduce words to their base form. The main difference between the two is that stemming is a crude heuristic process that chops off the ends of words, while lemmatization is a more sophisticated process that uses vocabulary and morphological analysis to determine the base form of a word. Lemmatization is more accurate but also more computationally expensive.

Example: The word "better"
* Stemming: The stem of the word "better" is likely to be "better" (e.g., using the Porter stemmer)
* Lemmatization: The base form of the word "better" is "good" (e.g., using WordNetLemmatizer with a POS tag)
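
A minimal sketch of this example using NLTK (assuming the library is installed and the WordNet data has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("better"))                   # "better" -- no suffix rule applies
print(lemmatizer.lemmatize("better", pos="a"))  # "good" -- the adjective POS hint enables the mapping
```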

### 2. What do you know about Latent Semantic Indexing (LSI)? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
Latent Semantic Indexing (LSI) is a technique used in NLP and information retrieval to extract the underlying meaning or concepts from a collection of text documents. LSI uses mathematical techniques such as Singular Value Decomposition (SVD) to identify patterns and relationships in the co-occurrence of words within a corpus of text. LSI is based on the idea that words that are used in similar contexts tend to have similar meanings.
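
A minimal LSI sketch using scikit-learn (TF-IDF followed by truncated SVD); the toy documents and number of components are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "The cat sat on the mat",
    "Dogs and cats are popular pets",
    "Stock markets fell sharply today",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)  # term-document matrix
lsi = TruncatedSVD(n_components=2, random_state=0)                 # SVD down to 2 latent concepts
doc_concepts = lsi.fit_transform(tfidf)                            # documents in concept space
print(doc_concepts.shape)  # (3, 2)
```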

### 3. What do you know about Dependency Parsing? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
Dependency parsing is a technique used in natural language processing to analyze the grammatical structure of a sentence and to identify the relationships between its words. It builds a directed graph where words are represented as nodes and grammatical relationships between words are represented as edges. Each word has a single head (parent) and can have multiple dependents (children), representing the grammatical relations between the words.

There are different algorithms for dependency parsing, such as transition-based (shift-reduce) parsers and graph-based parsers (e.g., the Eisner and Chu-Liu/Edmonds algorithms).
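
A minimal dependency parsing sketch using spaCy (assuming the small English model `en_core_web_sm` has been downloaded):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw the man with the telescope")

for token in doc:
    # token.dep_ is the grammatical relation, token.head is the parent node in the graph
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")
```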

### 4. Name different approaches for text summarization. [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
There are several different approaches to text summarization, including:
* Extractive summarization: Selects the most important sentences or phrases from the original text.
* Abstractive summarization: Generates new sentences that capture the key concepts and themes of the original text.
* Latent Semantic Analysis (LSA) based summarization: Uses LSA to identify the key concepts in a text.
* Latent Dirichlet Allocation (LDA) based summarization: Uses LDA to identify the topics in a text.
* Neural-based summarization: Uses deep neural networks to generate a summary.

Each approach has its own strengths and weaknesses; the choice depends on the specific use case and the desired summary quality.
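
A minimal sketch of the extractive approach (a simple word-frequency heuristic, not any particular library's algorithm): score sentences by the frequencies of the words they contain and keep the top-scoring ones.

```python
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    word_freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total frequency of its words.
    scores = {s: sum(word_freq[w] for w in re.findall(r"\w+", s.lower())) for s in sentences}
    top = set(sorted(sentences, key=scores.get, reverse=True)[:n_sentences])
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```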

### 5. What approach would you use for part of speech tagging? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
There are a few different approaches that can be used for part-of-speech (POS) tagging, such as:
* Rule-based tagging: Using pre-defined rules to tag text
* Statistical tagging: Using statistical models to tag text
* Hybrid tagging: Combining rule-based and statistical methods
* Neural-based tagging: Using deep neural networks to tag text
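
A minimal sketch using NLTK's off-the-shelf statistical tagger (assuming the `punkt` and `averaged_perceptron_tagger` resources have been downloaded):

```python
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```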

### 6. Explain what is a n-gram model. [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
An n-gram model is a type of statistical language model used in NLP. It is based on the idea that the probability of a word in a sentence depends only on the n-1 words that precede it.

The model represents the text as a sequence of n-grams, where each n-gram is a sequence of n words. The model uses the frequency of each n-gram in a large corpus of text to estimate the probability of each word in a sentence, given the n-1 preceding words.
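
A minimal sketch of a bigram (n = 2) model estimated by maximum likelihood from a toy corpus; the corpus and the queried words are illustrative assumptions:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word: str, prev: str) -> float:
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("cat", "the"))  # 2/3 -- "the" is followed by "cat" in 2 of its 3 occurrences
```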

### 7. Explain how TF-IDF measures word importance. [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document or collection of documents. It is calculated as the product of the term frequency (TF) and the inverse document frequency (IDF) of a word.

The term frequency (TF) of a word is the number of times the word appears in a document, normalized by the total number of words in the document.

The inverse document frequency (IDF) of a word is the logarithm of the total number of documents in the corpus divided by the number of documents in which the word appears.
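
A minimal sketch using scikit-learn's `TfidfVectorizer` on a toy corpus (scikit-learn applies a smoothed variant of the IDF formula described above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)    # document-term matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())   # vocabulary
print(weights.toarray().round(2))           # higher weight = more important in that document
```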

### 8. What is perplexity used for? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
Perplexity is a statistical measure used to evaluate the quality of a probability model, particularly language models. It is used to quantify the uncertainty of a model when predicting the next word in a sequence of words. The lower the perplexity, the better the model is at predicting the sequence of words.

Formally, the perplexity of a model on a sequence of words is:

$\text{Perplexity} = 2^{H(D)}$

$H(D) = - \frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i)$

where:

* $w_i$ = the i-th word in the sequence
* $N$ = the number of words in the sequence
* $P(w_i)$ = the probability of the i-th word according to the model
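
A minimal sketch computing perplexity from the per-word probabilities assigned by a model; the probability values here are made-up illustrative numbers:

```python
import math

word_probs = [0.1, 0.25, 0.05, 0.2]  # P(w_i) assigned by the model to each word in the sequence
entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
perplexity = 2 ** entropy
print(perplexity)  # lower is better
```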

### 9. What is the Bag-of-Words model? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
The bag-of-words model is a representation of text data where a text is represented as a bag (multiset) of its words, disregarding grammar and word order but keeping track of the frequency of each word. It is simple to implement and computationally efficient, but it discards grammatical information and word order, which can be important for some NLP tasks.
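
A minimal bag-of-words sketch using scikit-learn's `CountVectorizer` on two toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)     # rows = documents, columns = word counts
print(vectorizer.get_feature_names_out())   # vocabulary (word order is discarded)
print(counts.toarray())
```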

### 10. Explain how the Markov assumption affects the bi-gram model? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
The Markov assumption is central to the bi-gram model: it states that the probability of a word in a sentence depends only on the immediately preceding word. This assumption simplifies the bi-gram model by reducing the number of variables that need to be considered, making the model computationally efficient, but it also limits the context that the model takes into account, which can lead to errors in the probability estimates. In practice, increasing the order of the n-gram model increases the context taken into account and can improve accuracy.
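
As an illustration, under the Markov assumption the probability of a sentence of $N$ words factorizes into bigram probabilities (with $w_0$ taken to be a start-of-sentence marker):

$P(w_1, w_2, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})$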

### 11. What are the most common word embedding methods? Explain each briefly. [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
Common word embedding methods include:
* Count-based methods: Create embeddings by counting the co-occurrence of words in a corpus. Example: Latent Semantic Analysis (LSA)
* Prediction-based methods: Create embeddings by training a model to predict a target word from its surrounding context. Example: Word2Vec (with its CBOW and Skip-gram architectures)
* Hybrid methods: Combine both co-occurrence counts and context prediction to generate embeddings. Example: GloVe (Global Vectors for Word Representation)
* Neural language model based methods: Create embeddings by training a neural network-based language model on a large corpus of text. Example: BERT (Bidirectional Encoder Representations from Transformers)
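
A minimal sketch of training prediction-based embeddings with gensim's Word2Vec; the toy corpus and hyperparameters are illustrative assumptions:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 -> CBOW
print(model.wv["cat"].shape)          # (50,) embedding vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```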

### 12. What are the first few steps that you will take before applying an NLP algorithm to a given corpus? [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
* Text pre-processing: Clean and transform the text into a format that can be processed by the model. Specific methods include: removing special characters, lowercasing, removing stop words.

* Tokenization: Break the text into individual words or phrases that can be used as input. Specific methods include: word tokenization, sentence tokenization, and n-gram tokenization.

* Text normalization: Transform the text into a consistent format. Specific methods include: stemming, lemmatization.

* Feature extraction: Select relevant features from the text to be used as input. Specific methods include: creating a vocabulary of the most common words in the corpus, creating a term-document matrix.

* Splitting the data: Divide the data into training, validation and testing sets.

* Annotating the data: Manually tag the data with relevant information. Specific methods include: POS tagging, NER tagging, and so on.
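
A minimal pre-processing sketch of the first few steps using NLTK (assuming the `punkt`, `stopwords` and `wordnet` resources have been downloaded):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())   # remove special characters, lowercase
    tokens = nltk.word_tokenize(text)                  # word tokenization
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]  # normalization

print(preprocess("The cats were sitting on the mats!"))  # ['cat', 'sitting', 'mat']
```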

### 13. List a few types of linguistic ambiguities. [[src]](https://www.projectpro.io/article/nlp-interview-questions-and-answers/439)
* Lexical ambiguity: A word has multiple meanings. Example: "bass" can refer to a type of fish or a low-frequency sound.

* Syntactic ambiguity: A sentence can be parsed in more than one way. Example: "I saw the man with the telescope" can mean that the speaker saw a man who had a telescope, or that the speaker saw a man through a telescope.

* Semantic ambiguity: A word or phrase can have more than one meaning in a given context. Example: "bank" can refer to a financial institution or the edge of a river.

* Pragmatic ambiguity: A sentence can have different interpretations depending on the speaker's intended meaning. Example: "I'm fine" can mean that the speaker is feeling well or that the speaker does not want to talk about their feelings.

* Anaphora resolution: A pronoun or noun phrase refers to an antecedent with multiple possible referents.

* Homonymy: Words that are written and pronounced the same but have different meanings. Example: "bass" as a type of fish and as a low-frequency sound.

* Polysemy: Words that have multiple meanings that are related in some way. Example: "bass" as a low-frequency sound and the bass guitar.