Lecture 2 Data Types in computational biology/Systems biology Useful websites

Lecture 2 outline:
  • Data types in computational biology/systems biology
  • Useful websites
  • Handling multivariate data: concept and types of metrics, distances, etc.
  • Introduction to PCA and PLS
  • K-means clustering

What is systems biology?

Each lab or group has its own definition of systems biology. This is because systems biology requires understanding and integrating different levels of omics information using knowledge from different branches of science, and individual labs and groups work in different areas.

Theoretical target: understanding life as a system.

Practical targets: serving humanity by developing new-generation medical tests, drugs, foods, fuels, materials, sensors, logic gates, and so on.

Understanding life, or even a single cell, as a system is complicated and requires comprehensive analysis of different data types and/or sub-systems. Individual groups or researchers mostly work on different sub-systems.

Data types in computational biology/systems biology

Some of the currently (partially) available and useful data types:
  • Genome sequences
  • Binding motifs in DNA sequences or cis-regulatory regions
  • Codon usage
  • Gene expression levels for global gene sets/microRNAs
  • Protein sequences
  • Protein structures
  • Protein domains
  • Protein-protein interactions
  • Binding relations between proteins and DNA
  • Regulatory relations between genes
  • Metabolic pathways
  • Metabolite profiles
  • Species-metabolite relations
  • Plant usage in traditional medicines

Usually, experiments are conducted in wet labs to generate such data. In dry labs like ours, we analyze these data to extract targeted information using different algorithms, statistics, etc.
Sequence data (genome/protein sequence)

>gi|15223276|ref|NP_171609.1| ANAC001 (Arabidopsis NAC domain containing protein 1); transcription factor [Arabidopsis thaliana]
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRE
NNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDL
LPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQ
QFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLV
DERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPL
HNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFIS
VISWIILVG

Alignment algorithms based on dynamic programming, and faster heuristic search tools such as BLAST, are usually used to determine how well two or more sequences match each other.

[Figure: sequence matching/alignments]
[Figures: codons; codon usage]

Multivariate data (gene expression data/metabolite profiles)

Many types of clustering algorithms are applicable to multivariate data, e.g. hierarchical clustering, K-means, and self-organizing maps (SOM). Multivariate data can also be modeled using multivariate probability distribution functions.

Binary relational data (protein-protein interactions, regulatory relations between genes, metabolic pathways) are networks. Clustering is usually used to extract information from networks. Multivariate data and sequence data can also be easily converted to networks, after which network clustering can be applied.

Example of binary relational data (one pair per row):
AtpB  AtpA
AtpG  AtpE
AtpA  AtpH
AtpB  AtpH
AtpG  AtpH
AtpE  AtpH
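As a minimal sketch of how binary relational data become a network, the Python snippet below reads the Atp names listed above as six consecutive pairs (an assumption about how the slide's table was laid out) and builds an undirected adjacency list. The function name `build_adjacency` is illustrative and does not come from the lecture.

```python
from collections import defaultdict

# Interaction pairs, assuming the names above are read two at a time.
pairs = [
    ("AtpB", "AtpA"), ("AtpG", "AtpE"), ("AtpA", "AtpH"),
    ("AtpB", "AtpH"), ("AtpG", "AtpH"), ("AtpE", "AtpH"),
]

def build_adjacency(edges):
    """Return an undirected adjacency list: node -> set of neighbours."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)

adjacency = build_adjacency(pairs)

# Degree = number of distinct interaction partners of each node.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
hub = max(degree, key=degree.get)

for node in sorted(degree):
    print(node, degree[node], sorted(adjacency[node]))
print("highest-degree node:", hub)
```

Under this reading of the pairs, AtpH has the most interaction partners. A network clustering method would take a structure like `adjacency` as its input.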
Some websites where we can find different types of data and links to other databases:
  • www.geneontology.org
  • www.genome.ad.jp/kegg
  • www.ncbi.nlm.nih.gov
  • www.ebi.ac.uk/databases
  • http://www.ebi.ac.uk/uniprot/
  • http://www.yeastgenome.org/
  • http://mips.helmholtz-muenchen.de/proj/ppi/
  • http://www.ebi.ac.uk/trembl
  • http://dip.doe-mbi.ucla.edu/dip/Main.cgi
  • www.ensembl.org

[Figures, including a NETWORK TOOLS overview, not reproduced.]
Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation, Gil Alterovitz, Marco Ramoni (Editors)

Handling multivariate data: concept and types of metrics

[Figures: multivariate data example; multivariate data format]

Distances, metrics, dissimilarities and similarities are related concepts.

A metric is a function d that satisfies the following properties:
  (i) non-negativity: d(a, b) ≥ 0
  (ii) symmetry: d(a, b) = d(b, a)
  (iii) identification mark: d(a, a) = 0
  (iv) definiteness: d(a, b) = 0 if and only if a = b
  (v) triangle inequality: d(a, b) + d(b, c) ≥ d(a, c)

A function that satisfies only conditions (i)-(iii) is referred to as a distance.

Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)

These measures consider the
expression measurements as points in some metric space.

Example: let X = (4, 6, 8) and Y = (5, 3, 9).

A widely used function for measuring similarity is correlation. Correlation gives a measure of linear association between variables and ranges from -1 to +1.

Statistical distance between points

The statistical distance (Mahalanobis distance) between two vectors can be calculated if the variance-covariance matrix is known or estimated.

[Figure: scatter plot with points P and Q and the origin O]
The Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to be part of the same cluster while Q and O do not.

Distances between distributions

In contrast to the previous approach (considering expression measurements as points in some metric space), the data for each feature can be considered as an independent sample from a population. The data therefore reflect the underlying population, and we need to measure similarities between two densities/distributions.

  • Kullback-Leibler information (KLI) measures how much the shape of one distribution resembles the other.
  • Mutual information (MI) is large when the joint distribution is quite different from the product of the marginals.
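The worked values for the example vectors X and Y are not preserved in this transcript, so the sketch below computes them from the standard textbook definitions of Euclidean distance, Manhattan distance and Pearson correlation. The Mahalanobis and Kullback-Leibler parts use a made-up 2x2 covariance matrix and made-up discrete distributions, purely for illustration. All function names are illustrative.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation: linear association, between -1 and +1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mahalanobis_2d(x, y, cov):
    """Statistical distance for 2-D points, given a 2x2 covariance matrix."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T cov^{-1} (x - y), with the 2x2 inverse written out.
    return math.sqrt((c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det)

def kl_divergence(p, q):
    """Kullback-Leibler information of discrete p relative to q (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Example vectors from the lecture.
X, Y = (4, 6, 8), (5, 3, 9)
d_euclid = euclidean(X, Y)      # sqrt(1 + 9 + 1) = sqrt(11)
d_manhattan = manhattan(X, Y)   # 1 + 3 + 1 = 5
r_xy = pearson(X, Y)

# Hypothetical covariance matrix with two strongly correlated variables.
cov = ((1.0, 0.9), (0.9, 1.0))
origin = (0.0, 0.0)
along = mahalanobis_2d((1.0, 1.0), origin, cov)    # step along the correlation
across = mahalanobis_2d((1.0, -1.0), origin, cov)  # step against the correlation

# Hypothetical discrete distributions over two outcomes.
p, q = (0.5, 0.5), (0.9, 0.1)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

print(f"Euclidean={d_euclid:.4f} Manhattan={d_manhattan} r={r_xy:.4f}")
print(f"Mahalanobis along={along:.3f} across={across:.3f}")
print(f"KL(p||q)={kl_pq:.4f} KL(q||p)={kl_qp:.4f}")
```

The two test points are equally far from the origin in Euclidean terms, but the one lying against the correlation direction is much farther in statistical distance. This is the same kind of effect as in the P, Q, O figure. The two KL values differ, which shows that KL information is not symmetric and so is not a metric.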
Principal Component Analysis (PCA) and Partial Least Squares (PLS)
  • Two major common effects of using PCA or PLS:
      • Convert a group of correlated predictive variables into a group of uncorrelated variables
      • Construct a “strong” predictive variable from several “weaker” predictive variables
  • Major difference between PCA and PLS:
      • PCA is performed without considering the target variable, so PCA is an unsupervised analysis
      • PLS is performed to maximize the covariance between the target variable and the predictive variables, so PLS is a supervised analysis
[Figure: schematic comparing PLS and PCA. PLS: X (n x p) and Y (n x q) are decomposed into factors T (n x c) and U (n x c) with maximum covariance (step 1, decomposition), followed by regression (step 2). PCA: A (n x p) is transformed into PC (n x p).]

PLS notation: X = matrix of predictors, Y = matrix of responses, T = factors of predictors, U = factors of responses, n = number of observations, p = number of predictors, q = number of responses, c = number of extracted factors.

PCA notation: A = data matrix, PC = principal component matrix, n = number of observations, p = number of variables.

Principal Component Analysis (PCA)
  • In Principal Component Analysis, we look for a few linear combinations of the predictive variables that can be used to summarize the data without losing too much information.
  • Intuitively, principal component analysis is a method of extracting information from higher-dimensional data by projecting it onto a lower dimension.
  • Example: consider the scatter plot of a 3-dimensional data set (3 variables). Data across the 3 variables are highly correlated, and the majority of the points cluster around the center of the space. This is also the direction of the 1st PC, which gives roughly equal weight to the 3 variables:
  • PC1 = -0.56 X1 - 0.57 X2 - 0.59 X3

Properties of Principal Components
  • Var(PCi) = λi, the i-th eigenvalue of the covariance matrix of the variables
  • Cov(PCi, PCj) = 0 for i ≠ j
  • Var(PC1) ≥ Var(PC2) ≥ … ≥ Var(PCp)
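The three properties above can be checked numerically. The following is a minimal sketch, assuming a small made-up data set of 3 highly correlated variables (not the lecture's data) and using power iteration with deflation as a simple stand-in for a library eigen-solver. All names are illustrative.

```python
import math

# Hypothetical data: 6 observations of 3 highly correlated variables.
data = [
    (2.0, 2.1, 1.9), (3.0, 3.2, 2.7), (4.0, 3.9, 4.3),
    (5.0, 5.1, 4.8), (6.0, 5.8, 6.4), (7.0, 7.2, 6.8),
]

def center(rows):
    """Subtract the column mean from every column."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance(u, v):
    """Sample covariance of two already-centered columns."""
    return sum(a * b for a, b in zip(u, v)) / (len(u) - 1)

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def leading_eigenpair(m, iters=2000):
    """Power iteration: largest eigenvalue and eigenvector of a symmetric matrix."""
    v = [1.0 / (j + 1) for j in range(len(m))]
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return lam, v

def pca(rows):
    xc = center(rows)
    p = len(xc[0])
    cols = [[r[j] for r in xc] for j in range(p)]
    s = [[covariance(cols[i], cols[j]) for j in range(p)] for i in range(p)]
    eigenvalues, loadings = [], []
    for _ in range(p):
        lam, v = leading_eigenpair(s)
        eigenvalues.append(lam)
        loadings.append(v)
        # Deflation: remove the component just found before the next pass.
        s = [[s[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    # scores[k] holds the values of PC(k+1) for every observation.
    scores = [[sum(a * b for a, b in zip(r, v)) for r in xc] for v in loadings]
    return eigenvalues, loadings, scores

eigenvalues, loadings, scores = pca(data)

for k, (lam, v) in enumerate(zip(eigenvalues, loadings), start=1):
    weights = ", ".join(f"{w:+.2f}" for w in v)
    print(f"PC{k}: variance={lam:.4f} weights=({weights})")
```

The sign of each weight vector is arbitrary: the PC1 on the slide has negative weights, while this sketch may return positive ones for the same direction.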
Numerical Example
  • The following are the high school grades of 10 students in 6 subjects (scale 1-10):
  • Math = Mathematics
  • Chem = Chemistry
  • Phy = Physics
  • Bio = Biology
  • Eco = Economy
  • Soc = Sociology

Results
[Figure: PCA results for the grade data]

Partial Least Squares (PLS)
  • Unlike PCA, the PLS technique works by successively extracting factors from both the predictive and the target variables such that the covariance between the extracted factors is maximized.
  • Decomposition step (E and F are residual terms):
      • X = T W^T + E
      • Y = U V^T + F
  • Regression step (D is a residual term):
      • Y = T B + D = X W B + D = X B_PLS + D, where B_PLS = W B
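To make the regression step concrete, here is a minimal sketch of PLS restricted to a single extracted factor and a single response. In that case the weight vector is proportional to X^T y, which is the unit vector that maximizes the covariance between the factor t = X w and y. The grades and GPA values are made up for illustration, and the function names are hypothetical. A full PLS implementation would also deflate X and y and extract further factors, which this sketch does not do.

```python
import math

# Hypothetical data: high school grades in 3 subjects and later GPA.
grades = [(7, 6, 8), (5, 5, 6), (9, 8, 9), (6, 7, 5), (8, 9, 7), (4, 5, 4)]
gpa = [3.2, 2.6, 3.8, 2.9, 3.5, 2.3]

def center_columns(rows):
    """Subtract the column mean from every column."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def first_pls_factor(xc, yc):
    """One PLS factor for a single centered response (sketch only)."""
    p = len(xc[0])
    # w is proportional to X^T y: the unit weight vector that maximizes
    # the covariance between the factor t = X w and the response y.
    xty = [sum(r[j] * yi for r, yi in zip(xc, yc)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in xty))
    w = [v / norm for v in xty]
    t = [sum(a * b for a, b in zip(r, w)) for r in xc]  # factor scores T
    # Regression step: least-squares coefficient of y on t (B in Y = T B + D).
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    b_pls = [wj * b for wj in w]                        # B_PLS = W B
    return w, t, b, b_pls

xc = center_columns(grades)
gpa_mean = sum(gpa) / len(gpa)
yc = [v - gpa_mean for v in gpa]

w, t, b, b_pls = first_pls_factor(xc, yc)
predicted = [gpa_mean + b * ti for ti in t]

rss = sum((yi - pi) ** 2 for yi, pi in zip(gpa, predicted))
tss = sum(v * v for v in yc)
print("weights:", [round(v, 3) for v in w])
print("B_PLS:", [round(v, 4) for v in b_pls])
print(f"share of GPA variance explained by one factor: {1 - rss / tss:.3f}")
```

With a single factor, predicting from the factor score (`b * t`) and predicting from the original variables (`X B_PLS`) give the same values, which mirrors the chain Y = T B + D = X B_PLS + D above.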
Numerical Example
  • The following are the high school grades of 10 students in 6 subjects (scale 1-10), together with their corresponding GPA scores at undergraduate level:
  • Math = Mathematics
  • Chem = Chemistry
  • Phy = Physics
  • Bio = Biology
  • Eco = Economy
  • Soc = Sociology
  • Objective: can we use information on students' performance during high school to predict their GPA scores at undergraduate level?

K-means clustering

Source: “Clustering Challenges in Biological Networks”, edited by S. Butenko et al.

Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

Initial value of centroids: suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
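Starting from these two initial centroids, the K-means iteration can be sketched as below. Only the coordinates of medicines A and B come from the text above. The coordinates for medicines C and D, (4, 3) and (5, 4), are assumed here for illustration, since the rest of the slide data is not preserved in this transcript. The function names are illustrative.

```python
import math

# A and B are the initial centroids given above; C and D are assumed values.
points = {"A": (1.0, 1.0), "B": (2.0, 1.0), "C": (4.0, 3.0), "D": (5.0, 4.0)}

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(coords, initial_centroids, max_iter=100):
    """Plain K-means: assign to nearest centroid, recompute means, repeat."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
            for p in coords
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(coords, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

names = list(points)
coords = [points[name] for name in names]
labels, centroids = kmeans(coords, [points["A"], points["B"]])

for name, lab in zip(names, labels):
    print(f"medicine {name} -> cluster {lab + 1}")
print("final centroids:", centroids)
```

With the assumed coordinates, B starts as its own centroid but moves into A's cluster after the first update, and the procedure stops once no assignment changes.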