
Statistical techniques and their applications in data science 📊

 



  In any data science task, after preparing and understanding the data, data scientists want to know which features/attributes can be extracted from it: how many categorical variables and how many numerical variables are in the dataset. In this blog we will talk about numerical data only, since we want to understand how statistical methods help us summarise and understand data better.

  I will focus on the various statistical techniques that exist and when to apply them to get a particular outcome from a given dataset.


Here are the topics to be covered:


1) Summary statistics

2) Sampling methods

3) Hypothesis testing

4) Estimation statistics



1) Summary statistics: 

  Summary statistics are some of the most basic methods for summarising a given data distribution. They include:

1) Mean

2) Median

3) Mode

4) Standard deviation

5) Variance

6) Range

7) Percentiles

8) Interquartile range (IQR)

9) Min/Max values 


While the mean and standard deviation are generally useful only for normally distributed data, one of the most frequently used summaries is the five-number summary, which includes the min, the max, and the 25th, 50th, and 75th percentiles (the quartiles).

The five-number summary can be presented as a box-and-whisker plot to visualize the data distribution, its outliers, and its range.
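As a minimal sketch of how these summaries can be computed (assuming NumPy is installed; the sample values are made up for illustration):

```python
import numpy as np

# A small made-up sample, including one outlier (95)
data = np.array([12, 15, 17, 19, 21, 22, 25, 28, 30, 95])

# Five-number summary: min, 25th, 50th, 75th percentiles, max
five_num = np.percentile(data, [0, 25, 50, 75, 100])
print("min, q1, median, q3, max:", five_num)

# The interquartile range follows directly from the quartiles
iqr = five_num[3] - five_num[1]
print("IQR:", iqr)
```

A box-and-whisker plot (for example, matplotlib's plt.boxplot(data)) draws exactly these five numbers and flags points far beyond the whiskers as outliers.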



2) Sampling Methods:

   Data is everything in any data science task; without data, how can we extract insights? This is where data sampling comes into the picture. We usually have a lot of data available for training and testing purposes, and we want to find the best data sample: one that contains all sets of features without bias.

  Ultimately, we use sample data to estimate population parameters and get an idea about the population. There are various sampling methods in classical statistics, but in machine learning we typically use historical data as our sample to train and test the model.

  Sometimes we use multiple samples to train the model so that we can get optimal, more accurate predictions on test data.

 Cross-validation is a method of fitting a machine learning model on various samples (folds) of the data to measure its accuracy and skill, as in the sketch below.
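A minimal sketch of k-fold cross-validation (assuming scikit-learn is installed; the dataset and model here are just illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```

Averaging the fold scores gives a more reliable estimate of model skill than a single train/test split.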



3) Hypothesis testing:

   Statistical hypothesis testing is used to test the statistical significance of a particular data sample. In hypothesis testing, we assume the null hypothesis is true and calculate a test statistic to get a p-value, which tells us whether the sample result is likely due to random chance or is statistically significant. If the p-value is higher than our significance level alpha, the result is consistent with random chance, so we do not reject the null hypothesis. If the p-value is at or below alpha, the result is statistically significant, so we reject the null hypothesis and accept the alternative hypothesis.


   H0: there is no difference between our sample statistic and the population parameter.

   Ha: there is a difference from the population parameter.


For the significance level alpha, we generally take 0.05 in most cases.

For hypothesis tests, we have the standard Python library SciPy, with which we can run hypothesis tests on various datasets, as shown below.
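A minimal sketch of a one-sample t-test with SciPy (the sample values and the hypothesised population mean of 50 are made up for illustration):

```python
import numpy as np
from scipy import stats

# Made-up sample; H0: the population mean is 50
sample = np.array([48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 49.1, 50.4])

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
alpha = 0.05

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value <= alpha:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0: the sample is consistent with the population mean")
```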



4) Estimation statistics:

  Estimation statistics is a branch of inferential statistics in which we try to estimate population parameters from a sample of data. There are three kinds of intervals in estimation statistics:


1) Prediction intervals

2) Confidence intervals

3) Tolerance intervals


In very basic terms, we are trying to find an interval, computed from sample statistics, within which the population parameter is likely to fall. We use Z-statistics or T-statistics to construct these intervals; a small sketch follows below.
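A minimal sketch of a 95% confidence interval for the population mean using the t-distribution (assuming SciPy; the sample is made up for illustration):

```python
import numpy as np
from scipy import stats

sample = np.array([48.2, 51.5, 49.8, 50.9, 47.6, 52.3, 49.1, 50.4])
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
n = len(sample)

# 95% confidence interval using the t-distribution (n - 1 degrees of freedom)
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"95% CI for the population mean: ({low:.2f}, {high:.2f})")
```

The t-distribution is used here because the sample is small and the population standard deviation is unknown; with a large sample, a Z-statistic gives nearly the same interval.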


So those are the basics of the statistical methods that we use extensively in data science and machine learning to get insights out of data.

Thank you for taking the time to read this page. In case you want to connect with me, here's my email: avikumar.talaviya@gmail.com




