The concepts in statistics and mathematics are the building blocks of the techniques and tools we use to gain deeper insights into structured and unstructured data. Statistical concepts lie at the heart of data science.
In a session at SkillUp 2021, a two-day event organised by Analytics India Magazine, Rajeeva Karandikar of Chennai Mathematical Institute presented a few examples from history to explain how to make the most of available data and enormous computing power by combining statistical ideas with modern AI/ML tools.
Rajeeva Karandikar is the Director at Chennai Mathematical Institute. He is a Fellow of the Indian Academy of Sciences and Indian National Science Academy. His research interests include probability theory and stochastic processes, applications of statistics and cryptography.
Statistics quo
“Perhaps in 90% of the problems that need some decision based on available data, the standard tools in artificial intelligence or machine learning and statistics will yield the best or nearly the best answer. But the remaining 10% will need something more than just the tools,” said Karandikar.
He said not all data problems have the benefit of big data; examples include opinion polls, quality control, vaccine identification and approval, and drug discovery and approval. Statistical ideas and techniques remain relevant in such cases.
Karandikar drew on instances from history to show the significant role of statistics and data. His first example was Sir Francis Galton, a cousin of Charles Darwin, who studied the inheritance of genetic traits.
“It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre than they - to be smaller than the parents, if the parents were large; to be larger than the parents if the parents were very small.” - Galton
Karandikar explained that Galton then obtained data on the heights of parents and their grown-up sons, choosing height because such data was easy to collect. His analysis of the data confirmed his hypothesis.
“Today we can obtain data on the heights of a large number of individuals and their fathers’ heights (say, from India’s passport database). It can be seen that any data-driven tool will confirm the conclusion reached by Galton. However, interchanging the roles of the heights of sons and fathers leads to an exactly opposite conclusion. This nature can also be seen in simulated data,” Karandikar said.
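The effect described here can be reproduced with simulated data. The following is a minimal Python sketch and is not taken from the talk: the mean, standard deviation and father-son correlation are assumed values chosen only for illustration, and the helper names `slope` and `mean` are hypothetical.

```python
import random

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def mean(values):
    return sum(values) / len(values)

random.seed(0)
# Assumed, illustrative parameters: heights in cm, correlation 0.5.
MEAN, SD, RHO = 170.0, 7.0, 0.5
N = 20000

# Draw correlated (father, son) pairs with identical mean and spread.
fathers, sons = [], []
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    fathers.append(MEAN + SD * z1)
    sons.append(MEAN + SD * (RHO * z1 + (1 - RHO ** 2) ** 0.5 * z2))

# Regressing in either direction gives a slope well below 1.
son_on_father = slope(fathers, sons)
father_on_son = slope(sons, fathers)

# Compare tall fathers with their sons, then tall sons with their fathers.
tall = MEAN + SD
tall_fathers = mean([f for f in fathers if f > tall])
sons_of_tall_fathers = mean([s for f, s in zip(fathers, sons) if f > tall])
tall_sons = mean([s for s in sons if s > tall])
fathers_of_tall_sons = mean([f for f, s in zip(fathers, sons) if s > tall])

print(f"slope, son on father: {son_on_father:.2f}")
print(f"slope, father on son: {father_on_son:.2f}")
print(f"tall fathers {tall_fathers:.1f} cm, their sons {sons_of_tall_fathers:.1f} cm")
print(f"tall sons {tall_sons:.1f} cm, their fathers {fathers_of_tall_sons:.1f} cm")
```

Under these assumptions, sons of tall fathers are on average shorter than their fathers, yet fathers of tall sons are also shorter than their sons. Both are consequences of imperfect correlation, so neither direction on its own says that heights are becoming more or less extreme over generations.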
Correlation and regression
Next, Karandikar discussed correlation and regression. Most data-driven analysis tries to discover relationships among variables, which is what correlation and regression are about. It is also important to understand that correlation does not imply causation.
While correlation and regression are techniques for discovering linear relationships, one needs transformations to capture more complex ones. Artificial intelligence, machine learning and other data-driven techniques likewise try to find relationships, linear or otherwise, among the variables.
Karandikar said one must not use any such tools without understanding the domain of the task. He illustrated this with several examples involving correlation and standard error, including the prediction of the 2007 Cricket ODI World Cup and the relationship between IIT-JEE/CAT scores and performance at IIT/IIM.
Karandikar also covered other important statistical concepts, including spurious (or nonsense) correlations, standard error, Simpson’s paradox (also called the amalgamation paradox), omitted variable bias, GIGO (garbage in, garbage out) and GMGO.
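As a rough illustration of how a transformation can expose a non-linear relationship, the sketch below generates data that grows exponentially and compares the correlation before and after taking logarithms. It is not from the talk: the constants `A` and `B`, the noise level and the helper names are all assumptions made for this example.

```python
import math
import random

def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def pearson(x, y):
    """Pearson correlation coefficient."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

random.seed(1)
A, B = 2.0, 0.5  # assumed parameters of y = A * exp(B * x)
xs = [float(i) for i in range(1, 21)]
# Exponential growth with small multiplicative noise.
ys = [A * math.exp(B * x + random.gauss(0, 0.1)) for x in xs]

# The log transform turns y = A * exp(B * x) into log y = log A + B * x.
log_ys = [math.log(y) for y in ys]

r_raw = pearson(xs, ys)
r_log = pearson(xs, log_ys)
b_hat = covariance(xs, log_ys) / covariance(xs, xs)  # slope on the log scale

print(f"correlation of x with y:     {r_raw:.3f}")
print(f"correlation of x with log y: {r_log:.3f}")
print(f"estimated growth rate:       {b_hat:.3f}")
```

The raw correlation understates a relationship that is almost perfectly predictable, while the fit on the log scale recovers a growth rate close to the assumed `B`. Choosing the right transformation usually depends on knowing how the quantities are expected to relate.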
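Simpson’s paradox is easiest to see with numbers. The sketch below uses made-up counts (hypothetical, not from the talk or from any real study) for two methods, A and B, each tried on easy and hard cases. The names `counts`, `rate` and `overall` are likewise only illustrative.

```python
# Hypothetical (successes, trials) for each method and case type.
counts = {
    "A": {"easy": (18, 20), "hard": (48, 80)},
    "B": {"easy": (68, 80), "hard": (10, 20)},
}

def rate(successes, trials):
    return successes / trials

def overall(method):
    """Success rate after pooling easy and hard cases together."""
    successes = sum(s for s, _ in counts[method].values())
    trials = sum(t for _, t in counts[method].values())
    return rate(successes, trials)

for group in ("easy", "hard"):
    a = rate(*counts["A"][group])
    b = rate(*counts["B"][group])
    print(f"{group}: A = {a:.0%}, B = {b:.0%}")

print(f"overall: A = {overall('A'):.0%}, B = {overall('B'):.0%}")
```

With these counts, A has the higher success rate on easy cases and on hard cases, yet B looks better once the groups are pooled, because A was mostly tried on hard cases. Leaving case difficulty out of the analysis is also a simple instance of omitted variable bias.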
“Techniques in AI/ML have made great advances and it is a great exploratory step. But for now, it has not matured enough to reach an inferential step. When used in conjunction with domain knowledge, we can do wonders,” said Karandikar while concluding the session.