Introduction to Data Science Basics: Part 1


Learn univariate analysis, skewness, outlier identification, missing value imputation, and key statistical concepts in data science

How to perform univariate analysis for numerical and categorical variables?

Univariate analysis is a statistical technique used to analyze and describe the characteristics of a single variable. It is a useful tool for understanding the distribution, central tendency, and dispersion of a variable, as well as identifying patterns and relationships within the data. Here are the steps for performing univariate analysis for numerical and categorical variables, followed by a short code sketch.

For numerical variables:

  • Calculate descriptive statistics such as the mean, median, mode, and standard deviation to summarize the distribution of the data.
  • Visualize the distribution of the data using plots such as histograms, box-plots, or density plots.
  • Check for outliers and anomalies in the data.
  • Check for normality in the data using statistical tests or visualization such as a Q-Q plot.

For categorical variables:

  • Calculate the frequency or count of each category in the data.
  • Calculate the percentage or proportion of each category in the data.
  • Visualize the distribution of the data using plots such as bar plots or pie charts.
  • Check for imbalances or abnormalities in the distribution of the data.
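
As a minimal sketch of both workflows, assuming a hypothetical pandas DataFrame with one numerical column (age) and one categorical column (gender):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data; in practice, df would be your own dataset.
df = pd.DataFrame({
    "age": [23, 25, 31, 35, 35, 40, 52, 60, 75],
    "gender": ["F", "M", "F", "F", "M", "M", "F", "M", "F"],
})

# Numerical variable: descriptive statistics and distribution plots.
print(df["age"].describe())           # count, mean, std, min, quartiles, max
print("median:", df["age"].median())
print("mode:", df["age"].mode().tolist())

df["age"].plot(kind="hist", title="Histogram of age")
plt.show()
df["age"].plot(kind="box", title="Box plot of age")
plt.show()

# Categorical variable: counts, proportions, and a bar plot.
print(df["gender"].value_counts())                # frequency of each category
print(df["gender"].value_counts(normalize=True))  # proportion of each category
df["gender"].value_counts().plot(kind="bar", title="Counts per gender")
plt.show()
```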



What is skewness in statistics, and what are its types?

Skewness is a measure of the asymmetry of a distribution. A distribution is symmetrical if it is shaped like a bell curve, with data points spread evenly on both sides of the mean; a skewed distribution has more of its data points concentrated on one side of the mean than on the other. A short sketch of how to measure skewness in code follows the list of types.

Types of Skewness:

  1. Positive Skewness: Occurs when the distribution has a long tail on the right side, with the majority of the data points concentrated on the left side of the mean. It indicates that there are a few extreme values on the right side pulling the mean to the right.

  2. Negative Skewness: Occurs when the distribution has a long tail on the left side, with the majority of the data points concentrated on the right side of the mean. It indicates that there are a few extreme values on the left side pulling the mean to the left.


Skewness Image
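
As a small, hedged illustration, skewness can be estimated with pandas or SciPy; the data below are hypothetical and deliberately right-skewed:

```python
import pandas as pd
from scipy import stats

# Hypothetical right-skewed data: a few large values pull the mean to the right.
incomes = pd.Series([30, 32, 35, 36, 38, 40, 42, 45, 120, 250])

print("mean:", incomes.mean())      # pulled toward the long right tail
print("median:", incomes.median())
print("skewness (pandas):", incomes.skew())      # > 0 indicates positive (right) skew
print("skewness (scipy):", stats.skew(incomes))  # same idea, slightly different estimator
```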



What are the different ways in which we can find outliers in the data?

Outliers are data points that are significantly different from the majority of the data. They can be caused by errors, anomalies, or unusual circumstances, and they can have a significant impact on statistical analyses and machine learning models. Therefore, it is important to identify and handle outliers appropriately to obtain accurate and reliable results.

Common ways to find outliers:

  • Visual inspection: Outliers can often be identified by visually inspecting the data using plots such as histograms, scatterplots, or boxplots.

  • Summary statistics: Outliers can sometimes be identified by calculating summary statistics such as the mean, median, or interquartile range and comparing them; for example, if the mean is very different from the median, it could indicate the presence of outliers.

  • Z-score: The z-score of a data point is a measure of how many standard deviations it is from the mean. Data points with a z-score greater than a certain threshold (e.g., 3 or 4) can be considered outliers, as in the sketch below.
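
A minimal sketch of the z-score and IQR approaches on a hypothetical series (the thresholds used are the conventional defaults, not a rule):

```python
import pandas as pd

# Hypothetical data with one obvious outlier (100).
values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 11,
                    12, 13, 12, 14, 11, 13, 12, 14, 15, 100])

# Z-score method: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers.tolist())
print("IQR outliers:", iqr_outliers.tolist())
```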



What are the different ways by which you can impute the missing values in the dataset?

Imputing missing values is a crucial step in data pre-processing. Here are various methods to handle missing values in a dataset, with a short code sketch after the list:

  1. Drop rows: One option is to simply drop rows with null values from the dataset. This is a simple and fast method, but it can be problematic if a large number of rows are dropped, impacting the statistical power of the analysis.

  2. Drop columns: Another option is to drop columns with null values from the dataset. This can be a good option if the number of null values is large compared to the number of non-null values or if the column is not relevant to the analysis.

  3. Imputation with mean or median: One common method is to replace null values with the mean or median of the non-null values in the column. This is suitable if the data are missing at random, and the mean or median is a reasonable representation of the data.

  4. Imputation with mode: Another option is to replace null values with the mode (most common value) of the non-null values in the column. This is useful for categorical data where the mode is a meaningful representation.

  5. Imputation with a predictive model: Use a predictive model to estimate missing values based on other available data. This is a more complex method but can be more accurate if the data are not missing at random and there is a strong relationship between the missing values and the other data.
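
A short sketch of methods 1-4 with pandas on a hypothetical dataset (column names are illustrative); model-based imputation is noted only in a comment:

```python
import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age": [25, 30, None, 40, 35, None],
    "city": ["NY", "LA", "NY", None, "LA", "NY"],
})

# 1-2. Drop rows or columns containing nulls.
rows_dropped = df.dropna()        # drop any row with a missing value
cols_dropped = df.dropna(axis=1)  # drop any column with a missing value

# 3. Numerical imputation with the mean or median.
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# 4. Categorical imputation with the mode (most frequent value).
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# 5. Model-based imputation could use, for example, scikit-learn's
#    KNNImputer or IterativeImputer trained on the other columns.

print(df)
```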



Mention the two kinds of target variables for predictive modeling.

Numerical/Continuous Variables: Variables whose values lie within a range and can take any value in that range at the time of prediction. New values are not bound to come from the same range.

For example: Height of students - 5, 5.1, 6, 6.7, 7, 4.5, 5.11. Here, the observed values fall in the range (4.5, 7), and the height of a new student may or may not fall within this range.

Categorical Variables: Variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group based on some qualitative property.

For example: Exam Result - Pass, Fail (Binary categorical variable); Blood Type - A, B, O, AB (Polytomous categorical variable).



What will be the case in which the Mean, Median, and Mode will be the same for the dataset?

The mean, median, and mode of a dataset are all equal when the distribution of the data is symmetric and unimodal, the classic example being a normal (bell-shaped) distribution. A trivial special case is a dataset that consists of a single value repeated throughout.

For example, consider the dataset: 3, 3, 3, 3, 3, 3. The mean, median, and mode are all 3, because only one value occurs. A symmetric dataset with several distinct values, such as 1, 2, 2, 3, also has mean, median, and mode all equal to 2.

On the other hand, if the distribution is skewed, the three measures will generally differ. For example, consider the dataset: 1, 2, 2, 2, 8. The median and mode are both 2, but the single extreme value pulls the mean up to 3.

It is important to note that the mean is the most sensitive of the three measures to outliers or extreme values. If the dataset contains extreme values, the mean may differ significantly from the median and mode, even when most of the data are concentrated around a single value.
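
A quick check of these examples with Python's standard statistics module:

```python
from statistics import mean, median, mode

symmetric = [1, 2, 2, 3]   # symmetric, unimodal
skewed = [1, 2, 2, 2, 8]   # one extreme value on the right

print(mean(symmetric), median(symmetric), mode(symmetric))  # 2 2 2
print(mean(skewed), median(skewed), mode(skewed))           # 3 2 2 (mean pulled up)
```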



What is the difference between Variance and Bias in Statistics?

In statistics, variance and bias are two measures of the quality or accuracy of a model or estimator.

Variance: Variance measures the amount of spread or dispersion in a dataset. It is calculated as the average squared deviation from the mean. High variance indicates that the data are spread out and may be more prone to error, while low variance indicates that the data are concentrated around the mean and may be more accurate.

Bias: Bias refers to the difference between the expected value of an estimator and the true value of the parameter being estimated. High bias indicates that the estimator is consistently under or overestimating the true value, while low bias indicates that the estimator is more accurate.

It is important to consider both variance and bias when evaluating the quality of a model or estimator. A model with low bias and high variance may be prone to overfitting, while a model with high bias and low variance may be prone to underfitting. Finding the right balance between bias and variance is an important aspect of model selection and optimization.

In machine learning, the bias-variance trade-off is crucial for finding the right balance between two sources of error: bias and variance.

Bias-Variance Trade-off Scenarios:

Scenario                  | Bias | Variance | Outcome
--------------------------|------|----------|-----------------------------------------------------------------
Best fit (ideal scenario) | Low  | Low      | Model captures the underlying patterns well.
Underfitting              | High | Low      | Model is too simple and doesn't capture patterns (high bias).
Overfitting               | Low  | High     | Model is too complex and fits noise in the data (high variance).
Worst case                | High | High     | Does not capture underlying patterns and fits noise (high bias and variance).


Bias & Variance Image
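
As a hedged illustration of the trade-off, the sketch below compares polynomial models of increasing degree on synthetic data with scikit-learn; the degrees and settings are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    print(f"degree={degree:2d}  mean CV MSE={-scores.mean():.3f}")
```

In this setup, the low-degree model typically shows high bias (underfitting) and the very high-degree model high variance (overfitting), with an intermediate degree giving the lowest cross-validated error.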



Explain the concepts of correlation and covariance.

Correlation: Correlation is a statistical measure that describes the strength and direction of a linear relationship between two variables. A positive correlation indicates that the two variables increase or decrease together, while a negative correlation indicates that the two variables move in opposite directions.

Covariance: Covariance is a measure of the joint variability of two random variables. It is used to measure how two variables are related. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase when the other decreases.

While covariance gives the direction of the relationship between variables, it does not provide a standardized measure. Correlation, on the other hand, standardizes the measure, giving a value between -1 and 1, making it easier to interpret and compare relationships between different pairs of variables.
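
A minimal sketch with NumPy, using two hypothetical, positively related variables:

```python
import numpy as np

# Hypothetical, positively related variables.
hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score = np.array([52, 55, 61, 64, 70, 74])

cov_matrix = np.cov(hours_studied, exam_score)        # covariance (unstandardized)
corr_matrix = np.corrcoef(hours_studied, exam_score)  # Pearson correlation, in [-1, 1]

print("covariance:", cov_matrix[0, 1])
print("correlation:", corr_matrix[0, 1])
```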



Why is hypothesis testing useful for a data scientist?

Hypothesis testing is a statistical technique used in data science to evaluate the validity of a claim or hypothesis about a population. It is used to determine whether there is sufficient evidence to support a claim or hypothesis and to assess the statistical significance of the results.

There are many situations in data science where hypothesis testing is useful. For example, it can be used to test the effectiveness of a new marketing campaign, to determine if there is a significant difference between the means of two groups, to evaluate the relationship between two variables, or to assess the accuracy of a predictive model.

Hypothesis testing is an important tool in data science because it allows data scientists to make informed decisions based on data, rather than relying on assumptions or subjective opinions. It helps data scientists to draw conclusions about the data that are supported by statistical evidence, and to communicate their findings in a clear and reliable manner. Hypothesis testing is therefore a key component of the scientific method and a fundamental aspect of data science practice.



What is the significance of the p-value?

The p-value is used to determine the statistical significance of a result. In hypothesis testing, the p-value assesses the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. If the p-value is less than the predetermined level of significance (usually denoted as alpha, α), the result is considered statistically significant and the null hypothesis is rejected.

The significance of the p-value lies in its ability to allow researchers to make decisions about the data based on a predetermined level of confidence. By setting a level of significance before conducting the statistical test, researchers can determine whether the results are likely to have occurred by chance or if there is a real effect present in the data.
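
As a hedged sketch, a two-sample t-test with SciPy illustrates this decision rule; the data and the significance level of 0.05 are purely illustrative:

```python
from scipy import stats

# Hypothetical measurements from two groups (e.g., two versions of a campaign).
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2]
group_b = [12.8, 13.1, 12.9, 13.3, 12.7, 13.0, 13.2]

alpha = 0.05  # predetermined level of significance
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```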



What are the different types of sampling techniques used by data analysts?

Data analysts commonly use a variety of sampling techniques. Here are some of the most common ones, with a short code sketch after the list:

  • Simple Random Sampling: This is a basic form of sampling where each member of the population has an equal chance of being selected for the sample.

  • Stratified Random Sampling: Involves dividing the population into subgroups (strata) based on certain characteristics and selecting a random sample from each stratum.

  • Cluster Sampling: Divides the population into smaller groups (clusters), selects a random sample of clusters, and includes the members of the chosen clusters in the sample.

  • Systematic Sampling: Involves selecting every kth member of the population to be included in the sample.
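
A short sketch of these four schemes with pandas and NumPy on a hypothetical customer table (all names and sizes are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1,000 customers in four regions.
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "customer_id": range(1000),
    "region": rng.choice(["North", "South", "East", "West"], size=1000),
})

# Simple random sampling: every member has an equal chance of selection.
simple = population.sample(n=100, random_state=42)

# Stratified random sampling: sample 10% from each region (stratum).
stratified = population.groupby("region").sample(frac=0.1, random_state=42)

# Systematic sampling: take every k-th member after a random start.
k = 10
start = rng.integers(0, k)
systematic = population.iloc[start::k]

# Cluster sampling: randomly choose whole regions (clusters) and keep all their members.
chosen_clusters = rng.choice(population["region"].unique(), size=2, replace=False)
cluster = population[population["region"].isin(chosen_clusters)]
```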