Data Management in Bioinformatics/Data Querying
Chapter Navigation | |
Top | E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages |
A Wikibookian has nominated this page for cleanup. You can help make it better. Please review any relevant discussion. |
A Wikibookian has nominated this page for cleanup. You can help make it better. Please review any relevant discussion. |
Data Querying
[edit | edit source]Query Languages
[edit | edit source]- SQL (Structured Query Language ). SQL is a Data Definition and Manipulation Language (both DDL and DML)
- RA (Relational Algebra)
- Datalog
Creating Tables
[edit | edit source]- Command : CREATE TABLE
For the relations:
Gene(gid, name, annotation)
Experiment(eid, name,date)
Microarray(gid, eid, exprlevel)
Tables can be created using the following set of commands in SQL:
CREATE TABLE Gene( gid INTEGER, name VARCHAR(20), annotation
VARCHAR(50), PRIMARY KEY (gid));
CREATE TABLE Experiment(eid INTEGER, name VARCHAR(10), date
DATE, PRIMARY KEY (eid));
CREATE TABLE Microarray (a INTEGER, b INTEGER, exprlevel REAL,
PRIMARY KEY (gid, eid));
A primary key is used to uniquely identify each row in a table. It can either be part of the actual record itself, or it can be an artificial field. SQL will automatically index the table based on the primary key. Indexing with other keys (other than the primary key) is also possible; it can be done by using the CREATE INDEX command.
Note that the primary key for the "Microarray" table is composed of keys from the other two tables i.e. Gene and Experiment. These fields are called Foreign Keys. It is important to reference where the table gets its primary key from. The syntax is as follows.
CREATE TABLE Microarray (a INTEGER, b INTEGER, exprlevel REAL,
PRIMARY KEY (gid, eid), FOREIGN KEY (a) REFERENCES Gene
(gid), FOREIGN KEY (b) Experiment (eid) );
These types of constraints are often called Referential Integrity Constraints.
Storing Data
[edit | edit source]- Command : INSERT INTO
Example:
INSERT INTO Gene VALUES(1, "1433E", "enzyme binding");
Most database management systems have the capability to load bulk data at once.
Querying
[edit | edit source]Projection
[edit | edit source]- To display all the data stored in the Gene table
- SQL
SELECT * FROM Gene;
- RA
- Datalog
- SQL
- To display the names of all genes in the Gene table
- SQL
SELECT name FROM Gene;
- RA
- Datalog Note: to list out only distinct values use SELECT DISTINCT in SQL. RA and Datalog return distinct values by default.
- SQL
- To display gene names that meet certain criterion
- SQL
SELECT * FROM Gene WHERE annotation like '%phosphates%';
- RA
- Datalog
- SQL
Cartesian Product
[edit | edit source]Definition : The Cartesian product of two tables of size nxa and mxb is a table having n*m rows and a+b columns
- SQL
SELECT * FROM Gene, Experiment;
- RA
- Datalog
Join
[edit | edit source]Example 1
[edit | edit source]- Table
|
|
|
- E-R Diagram
- SQL
SELECT Genes.*, Expts.*, exp_level FROM Genes, Expression, Expts WHERE Genes.gid = Expression.gid AND Expression.eid = Expts.gid;
- RA
Genes ?? Expression ?? Expts (A ?? B) ?? C = A ?? (B ?? C)
- Datalog
Answer(x, y, t, a, b, w) ← Genes(x, y, z), Expression(x, t, w), Expts(t, a, b) Genes' x match Expression's x Expts' t match Expression's t
Example 2
[edit | edit source]Question: ...
|
- RA CannotbeMax = ... Allpairs = ... Allpairs - CannotbeMax
- SQL
CREATE VIEW CannotbeMax(gid, eid) AS SELECT E1.gid, E1.eid FROM Expression AS E1, Expression AS E2 WHERE E1.explevel < E2.explevel; CREATE VIEW Allpairs(gid, eid) AS SELECT gid, eid FROM expression; SELECT * FROM Allpairs EXCEPT SELECT * FROM CannotbeMax;
- EXCEPT ∪ UNION ∩ INTERSECT
- Datalog
CannotbeMax(x, y) <- Expression(x, y, t), Expression(a, b, w), t < w. Answer(x, y) <- Expression(x, y, a), NOT CannotbeMax(x, y).
Interesting Query
[edit | edit source]Without MAX() operator
- max
- ref. Join section Example 2
- min
- second largest
- median
Relational Division
[edit | edit source]A X B = C C / B = ?
|
|
Through Relational Algebra
1. sid =sidStudent&Courses
2. Ideal = sid X R4G
3. Reality = Student&Courses
4. Defaulters = Ideal - Reality
5. Answer = sid - sidDefaulters
Through SQL
1.
CREATE VIEW SID(sid)<BR>
SELECT sid<BR>
FROM Student&Courses<BR>
2.
CREATE VIEW IDEAL(sid,courseId)<BR>
SELECT sid,courseID<BR>
FROM SID,R4G<BR>
3.
CREATE VIEW REALITY(sid,courseId)<BR>
SELECT *<BR>
FROM Student&Courses<BR>
4.
CREATE VIEW Defaulters(sid,courseId)<BR>
SELECT * <BR>
FROM IDEAL<BR>
EXCEPT<BR>
SELECT *<BR>
FROM REALITY<BR>
5.
SELECT * <BR>
FROM SID<BR>
EXCEPT <BR>
SELECT sid<BR>
FROM Defaulters <BR>
Exercise:
- Find students who satisfied graduate requirements except for 1.
- Find students who satisfied most of the graduate requirements.
A View is not really computed, an row most of times cannot be inserted because of ambiguities.
SQL Injection, is a hacking technique where the input is formatted such that it executes a SQL query which is not expected.
Bells and Whistles
1. ORDER BY
2. MAX,AVG,SUM,MIN
3. GROUP BY, HAVING
4. SELECT has a functionality of Printing