# Statistical Methods In Bioinformatics: An Intro...

Advances in computers and biotechnology have had a profound impact on biomedical research, and as a result complex data sets can now be generated to address extremely complex biological questions. Correspondingly, advances in the statistical methods necessary to analyze such data are following closely behind the advances in data generation methods. The statistical methods required by bioinformatics present many new and difficult problems for the research community.

This book provides an introduction to some of these new methods. The main biological topics treated include sequence analysis, BLAST, microarray analysis, gene finding, and the analysis of evolutionary processes. The main statistical techniques covered include hypothesis testing and estimation, Poisson processes, Markov models and Hidden Markov models, and multiple testing methods.

The second edition features new chapters on microarray analysis and on statistical inference, including a discussion of ANOVA, and discussions of the statistical theory of motifs and methods based on the hypergeometric distribution. Much material has been clarified and reorganized.
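As an illustration of the kind of calculation that hypergeometric-based methods rest on, the following is a minimal sketch rather than anything taken from the book. It computes an upper-tail hypergeometric probability, as used in a one-sided enrichment test. The function name `hypergeom_upper_tail` and all of the counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from a
    population of N items, K of which carry the property of interest."""
    total = comb(N, n)
    lo = max(k, 0)
    hi = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1))
    return favourable / total


# Hypothetical counts: 20 genes in total, 5 of them annotated to a pathway,
# 6 genes selected by an experiment, 3 of the selected genes annotated.
p_value = hypergeom_upper_tail(k=3, N=20, K=5, n=6)
print(f"P(X >= 3) = {p_value:.4f}")
```

A small probability here would suggest that the selected set contains more annotated genes than random sampling would typically produce. With these made-up counts the overlap is not unusual.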

The book is written so as to appeal to biologists and computer scientists who wish to know more about the statistical methods of the field, as well as to trained statisticians who wish to become involved with bioinformatics. The earlier chapters introduce the concepts of probability and statistics at an elementary level, but with an emphasis on material relevant to later chapters and often not covered in standard introductory texts. Later chapters should be immediately accessible to the trained statistician. Sufficient mathematical background consists of introductory courses in calculus and linear algebra. The basic biological concepts that are used are explained, or can be understood from the context, and standard mathematical concepts are summarized in an Appendix. Problems are provided at the end of each chapter allowing the reader to develop aspects of the theory outlined in the main text.

"This book provides an excellent survey of statistical analyses of biological sequence data and brief treatments of other areas of bioinformatics...The explanations and derivations of difficult ideas are usually clear. Frequent examples of bioinformatics applications help to maintain interest and to elucidate the statistical concepts presented. Without being excessively mathematical, the authors succeed in accurately presenting the assumptions and limitations of the statistical methods...This book describes an impressive breadth of applications including methods of sequencing, modeling, searching, aligning and comparing DNA and protein sequences...this book is strongly recommended for an overview of statistical sequence analyses and for use in an advanced class in bioinformatics."

Bioinformatics (/ˌbaɪ.oʊˌɪnfərˈmætɪks/) is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, chemistry, physics, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using computational and statistical techniques.

There has been a tremendous advance in speed and cost reduction since the completion of the Human Genome Project, with some labs able to sequence over 100,000 billion bases each year, and a full genome can be sequenced for a thousand dollars or less.[14] Computers became essential in molecular biology when protein sequences became available after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually turned out to be impractical. A pioneer in the field was Margaret Oakley Dayhoff.[15] She compiled one of the first protein sequence databases, initially published as books[16] and pioneered methods of sequence alignment and molecular evolution.[17] Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences released with Tai Te Wu between 1980 and 1991.[18] In the 1970s, new techniques for sequencing DNA were applied to bacteriophage MS2 and ΦX174, and the extended nucleotide sequences were then parsed with informational and statistical algorithms. These studies illustrated that well known features, such as the coding segments and the triplet code, are revealed in straightforward statistical analyses and were thus proof of the concept that bioinformatics would be insightful.[19][20]

This course looks at the various software packages, databases and statistical methods which may be of use in performing such an analysis. As well as being a practical guide to performing these types of analysis, the course will look at the types of artefacts and bias which can lead to false conclusions about functionality, and at the appropriate ways to both run the analysis and present the results for publication.

The goal of this course is to introduce trainees to the fundamental algorithms needed to understand and analyze genome-scale expression data sets. The course will cover three major kinds of applications. (1) Class comparison seeks to describe which features differ between two or more known classes of patient samples (such as normal vs. tumor). The methodology includes (generalized) linear models with careful attention to the issue of multiple comparisons. (2) Class discovery seeks to discover the inherent structure present in a data set. The methodology includes a wide variety of techniques for clustering samples (including K-means as well as various forms of hierarchical clustering) and assessing the number of clusters and the robustness of cluster assignments. We also cover methods such as principal components analysis that help visualize the data. (3) Class prediction seeks to discover and validate models that can accurately predict the class or the outcomes of new samples. Methods include a wide variety of machine learning and statistical methods for feature selection and model construction. We will also discuss methods for cross-validation and independent validation of predictive models. The course will include an introduction to, and hands-on experience with, the R statistical software environment and the use of R packages that can be applied to these kinds of problems.
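To make the multiple-comparisons issue in class comparison concrete, here is a minimal sketch of the Benjamini-Hochberg false discovery rate adjustment. It is one common way of handling the problem and is not necessarily the method taught in the course. It is written in plain Python for illustration, although the course itself uses R. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a normal-vs-tumor comparison.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60]
adjusted = benjamini_hochberg(p_values)

# Genes still called significant at a 5% false discovery rate.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted)
print(significant)
```

In this made-up example, five of the six raw p-values fall below 0.05, but only the first two genes remain significant after adjustment. This is the kind of difference that careful attention to multiple comparisons is meant to expose.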

BIOS 500 (3) Statistical Methods I: Fall. This course is designed to teach students the fundamentals of applied statistical data analysis. Students successfully completing this course will be able to: choose appropriate statistical analyses for a variety of data types; perform exploratory data analyses; implement commonly used one- and two-sample hypothesis testing and confidence interval methods for continuous variables; perform tests of association for categorical variables; conduct correlation and simple linear regression analyses; produce meaningful reports of statistical analyses and provide sound interpretations of analysis results. Students will be able to implement the statistical methods learned using SAS and JMP software on personal computers.

BIOS 500 Lab (1): Fall. The lab portion of BIOS 500 is designed with two purposes in mind: 1) to illustrate concepts and methods presented in the lectures using hands-on demonstrations and 2) to introduce SAS, a widely used statistical software package, as a data analysis tool. By the end of the semester, you should be able to produce and interpret the statistical output for methods learned in BIOS 500 lecture.

BIOS 501 (4) Statistical Methods II: Spring. Prerequisites: BIOS 500 or permission of instructor. This course is the follow-up to Statistical Methods I (BIOS 500). Students will apply many of the concepts learned in BIOS 500 in a broader field of statistical analysis: model construction. Topics covered include Linear Regression, Analysis of Variance, Logistic Regression, and Survival Analysis. Students who successfully complete this course will have a deep understanding of many analytical methods used by public health researchers in daily life. BIOS 501 Lab is a component of this course.

BIOS 502 (2) Statistical Methods III: Fall. Prerequisites: BIOS 500 & BIOS 501 or permission of instructor. We start with data analytic methods not covered in BIOS 500 & BIOS 501 (Statistical Methods I & II): two-way ANOVA, polynomial regression, count regression, Kaplan-Meier analysis, multiple imputation, propensity scores. After the first exam, we focus on multilevel modeling of intra- and inter-individual change. Other hierarchical models will also be examined to analyze other types of clustered data. As in the prerequisite courses, we will learn how to specify an appropriate model so that specific research questions of interest can be addressed in a methodologically sound way. Students will use SAS to perform statistical analyses.

BIOS 505 (4) Statistics for Experimental Biology: Spring. Intended for PhD candidates in the biological and biomedical sciences. Introduces the most frequently used statistical methods in those fields, including linear regression, ANOVA, logistic regression, and nonparametric methods. Students learn the statistical skills necessary to read scientific articles in their fields, do simple analyses on their own, and be good consumers of expert statistical advice.

BIOS 506 (4) Foundations of Biostatistical Methods: Fall. Prerequisite: Multivariate Calculus (Calculus III) or permission of instructor. This course is a mathematically sophisticated introduction to the concepts and methods of biostatistical data analysis. The topics include descriptive statistics; probability; detailed development of the binomial, Poisson and normal distributions; sampling distributions; point and confidence interval estimation; hypothesis testing; a variety of one- and two-sample parametric and non-parametric methods for analyzing continuous or discrete data; and simple linear regression. The course will also equip students with computer skills for implementing these statistical methods using the standard statistical software R.