Introduction to Chemical Engineering Processes/Basic Statistics and Data Analysis

Mean and Standard Deviation

A lot of the time, when you're conducting an experiment, you will run it more than once, especially if it is inexpensive. Scientists run experiments more than once so that the random errors that result from taking measurements, such as having to guess a length between two hash marks on a ruler, tend to cancel out and leave a more precise result. However, the question remains: how should you consolidate all of the data into something that's more manageable to use?

Suppose you have n data points taken under the same conditions and you wish to consolidate them into as few values as feasible. One thing that could help is to use some centralized value which is, in some way, "between" all of the original data points. This value is called the mean of the data set.

There are many ways of computing the mean of a data set, depending on how the data are believed to be distributed. One of the most common methods is to use the arithmetic mean, which is defined as:

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

Other types of mean include the w:Geometric mean, which should be used when the data are very widely distributed (e.g. an exponential distribution), and the "log-mean", which occurs often in transport phenomena.

Standard Deviation

Having a value for the mean tells you what value the data points "cluster" around, but it does not tell you how spread out they are from the center. A second statistical quantity, the standard deviation, is used for that. The standard deviation is essentially the average distance between the data points and their mean. The distance is expressed as a squared distance in order to prevent negative deviations from lessening the effect of positive deviations.

The mathematical formulation for the standard deviation is:

s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1}}

The denominator is n-1 instead of n because dividing by n-1 gives an unbiased estimate of the variance when the mean is itself computed from the same data, which matters most for small numbers of experiments; see w:Standard deviation for a more thorough explanation of this.

Putting it together

The standard deviation of a data set measured under constant conditions is a measure of how precise the data set is. For this reason, the standard deviation of a data set is often used in conjunction with the mean in order to report experimental results. Typically, results are reported as:

\bar{x} \pm s

If a distribution is assumed, knowing both the mean and standard deviation can help us estimate the probability that the actual value of the variable lies within a certain range, provided there is no systematic bias in the data. If there is (such as from broken equipment, negligence, and so on), then no statistics can predict its effects.
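
To make the formulas concrete, here is a minimal Python sketch of the mean and sample standard deviation; the function name and measurement values are made up purely for illustration (the standard library functions statistics.mean and statistics.stdev give the same results).

    import math

    def mean_and_std(data):
        """Arithmetic mean and sample standard deviation (n - 1 in the denominator)."""
        n = len(data)
        mean = sum(data) / n
        # Sum of squared deviations from the mean
        squared_deviations = sum((x - mean) ** 2 for x in data)
        std = math.sqrt(squared_deviations / (n - 1))
        return mean, std

    # Five repeated measurements of the same quantity (illustrative values)
    measurements = [10.2, 9.8, 10.1, 10.4, 9.9]
    m, s = mean_and_std(measurements)
    print(f"{m:.2f} +/- {s:.2f}")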

Linear Regression

Suppose you have a set of data points (x_i, y_i) taken under differing conditions which you suspect, from a graph, can be reasonably estimated by drawing a line through the points. Any line that you could draw can be written in the following form:

y = mx + b

where m is the slope of the line and b is the y-intercept.

We seek the best line that we could possibly use to estimate the pattern of the data. This line will be most useful both for interpolating between points that we know and for extrapolating to unknown values (as long as they are close to measured values). In the most usual measure, how "good" the fit is depends on the vertical distance between the line and each data point y_i, which is called the residual:

r_i = y_i - (mx_i + b)

So that the residuals don't cancel when one is positive and one is negative (which would bias the fit), we are usually concerned with the square of the residual, r_i^2, when doing least-squares regression. We use squared terms and not absolute values so that the function is differentiable; don't worry about this if you haven't taken calculus yet.

In order to take into account all of the data points, we simply seek to minimize the sum of the squared residuals:

SS = \sum_{i=1}^{n}\left(y_i - mx_i - b\right)^2

Using calculus, we can take the derivative of this sum with respect to m and with respect to b and solve the resulting equations to come up with the values of m and b that minimize the sum of squares (hence the alternate name of this technique: least-squares regression). The formulas are as follows, where n is the total number of data points you are regressing[1]:

m = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2}

b = \frac{\sum y_i - m\sum x_i}{n}
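
As a sketch of how these formulas can be evaluated by machine, here is a short Python function; the name linear_regression is just illustrative, and the body implements the two formulas above directly.

    def linear_regression(x, y):
        """Least-squares slope m and intercept b for paired data x and y."""
        n = len(x)
        sum_x = sum(x)
        sum_y = sum(y)
        sum_xy = sum(xi * yi for xi, yi in zip(x, y))
        sum_x2 = sum(xi ** 2 for xi in x)
        m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
        b = (sum_y - m * sum_x) / n
        return m, b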

Example of linear regression

Example:

Suppose you wanted to measure how fast you get to school by a less direct method than reading the speedometer of your car. Instead, you look at a map and read the distances between each intersection, and then you measure how long it takes to travel each distance. Suppose the results were as shown in the table below. How far from home did you start, and what is the best estimate of your average speed?

t(min) D (yards)
1.1 559.5
1.9 759.5
3.0 898.2
3.8 1116.3
5.3 1308.7

The first thing we should do with any data like this is to graph it and see if a linear fit would be reasonable. Plotting this data, we can see by inspection that a linear fit appears to be reasonable.

Now we need to compute all of the values in our regression formulas, and to do this (by hand) we set up a table:

Trial t t^2 D D^2 t*D
1 1.1 1.21 559.5 313040 615.45
2 1.9 3.61 759.5 576840 1443.05
3 3.0 9.00 898.2 806763 2694.6
4 3.8 14.44 1116.3 1246126 4241.94
5 5.3 28.09 1308.7 1712695 6936.11
TOTAL 15.1 56.35 4642.2 4655464 15931.15


Now that we have these totals, we can plug them into our linear regression formulas:

m = \frac{n\sum tD - \left(\sum t\right)\left(\sum D\right)}{n\sum t^2 - \left(\sum t\right)^2} = \frac{5(15931.15) - (15.1)(4642.2)}{5(56.35) - (15.1)^2} = \frac{9558.53}{53.74} \approx 177.87 \text{ yards/min}

So

b = \frac{\sum D - m\sum t}{n} = \frac{4642.2 - (177.87)(15.1)}{5} \approx 391.3 \text{ yards}

Hence the equation for the line of best fit is:

D \approx 177.9t + 391.3

That is, you started about 390 yards from home, and the best estimate of your average speed is about 178 yards per minute.

A plot of this line together with the original data confirms that the fit is reasonable.
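
As a numerical check, the illustrative linear_regression function sketched earlier reproduces these values when applied to the data from this example:

    t = [1.1, 1.9, 3.0, 3.8, 5.3]
    D = [559.5, 759.5, 898.2, 1116.3, 1308.7]
    m, b = linear_regression(t, D)
    print(round(m, 1), round(b, 1))  # roughly 177.9 and 391.3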

How to tell how good your regression is

In the previous example, we visually determined whether it would be reasonable to perform a linear fit, but it is certainly possible to have a less clear-cut case! If there is some slight curve to the data, is it still "close enough" to be useful? Though it will always come down to your own judgment after seeing the fit line graphed against the data, there is a mathematical tool to help you called the correlation coefficient, r, which can be defined in several different ways. One of them is as follows[1]:

r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}

It can be shown that this value always lies between -1 and 1. The closer it is to 1 (or -1), the more reasonable the linear fit. In general, the more data points you have, the smaller r can be while still indicating a good fit, but a useful rule of thumb is to look for a high value (above about 0.85 or 0.9) and then graph the line against the data to see whether it makes sense. Sometimes it will, sometimes it won't; the method is not foolproof.
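
Here is a minimal Python sketch of this definition; the function name is illustrative only.

    import math

    def correlation_coefficient(x, y):
        """Correlation coefficient r for paired data x and y."""
        n = len(x)
        sum_x, sum_y = sum(x), sum(y)
        sum_xy = sum(xi * yi for xi, yi in zip(x, y))
        sum_x2 = sum(xi ** 2 for xi in x)
        sum_y2 = sum(yi ** 2 for yi in y)
        numerator = n * sum_xy - sum_x * sum_y
        denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
        return numerator / denominator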

In the above example we have:

r = \frac{5(15931.15) - (15.1)(4642.2)}{\sqrt{\left[5(56.35) - (15.1)^2\right]\left[5(4655464) - (4642.2)^2\right]}} \approx 0.99

Hence the data correlate very well with a linear model.

Linearization

In general

Whenever you have to fit one or more parameters to data, it is a good idea to try to linearize the function first, because linear regression is much less computationally intensive and better behaved than nonlinear regression. The goal of any linearization is to reduce the function to the form:

\text{Variable 1} = m \cdot \text{Variable 2} + b

The difference between this and "standard" linear regression is that Variable 1 and Variable 2 can be any functions of x and y, as long as they are not combined in any way (i.e. you can't have x and y combined into a single variable, such as a product xy). The technique can be extended to more than two variables using a method called w:multiple linear regression, but as that is more difficult to perform, this section will focus on two-dimensional regression.

Power Law

To see some of the power of linearization, let's suppose that we have two variables, x and y, related by a power law:

y = Ax^b

where A and b are constants. If we have data connecting changes in y to changes in x, we would like to know the values of A and b. This is difficult to do while the equation is in its current form, but we can change it into a linear-type function!

The trick here is that we need to get rid of the exponent b, so we take the natural log of both sides:

\ln(y) = \ln\left(Ax^b\right)

Using the laws of logarithms we can simplify the right-hand side to obtain the following:

\ln(y) = \ln(A) + b\ln(x)

The beauty of this equation is that it is, in a sense, linear: if we graph ln(y) vs. ln(x), we obtain a straight line with slope b and y-intercept ln(A).
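
As an illustrative sketch of this procedure, the following Python function fits a power law by regressing ln(y) against ln(x); it uses statistics.linear_regression from the standard library (Python 3.10+), and the function name is made up for this example.

    import math
    from statistics import linear_regression

    def fit_power_law(x, y):
        """Fit y = A * x**b by least squares on the log-log transformed data."""
        ln_x = [math.log(v) for v in x]
        ln_y = [math.log(v) for v in y]
        slope, intercept = linear_regression(ln_x, ln_y)
        return math.exp(intercept), slope  # A, b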

Exponentials

Another common use of linearization is with exponentials, where x and y are related by an expression of the form:

y = Ab^x

This works for any base, but the most common base encountered in practice is Euler's number, e. Again, we take the natural log of both sides in order to get rid of the exponent:

\ln(y) = \ln(A) + x\ln(b)

This time, graph ln(y) vs. x to obtain a line with slope ln(b) and y-intercept ln(A).
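
A matching sketch for the exponential case, under the same assumptions (standard-library statistics.linear_regression, illustrative function name):

    import math
    from statistics import linear_regression

    def fit_exponential(x, y):
        """Fit y = A * b**x by least squares on ln(y) versus x."""
        ln_y = [math.log(v) for v in y]
        slope, intercept = linear_regression(list(x), ln_y)
        return math.exp(intercept), math.exp(slope)  # A, b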

Linear Interpolation

Often, when you look up properties in a table, you will be looking them up at conditions in between two tabulated conditions. For example, if you were looking up the specific enthalpy of steam at 10 MPa and 430°C, you would look in the steam tables and see something like this:[2]

T (°C)   H (kJ/kg)
400      2832.4
450      2943.4

How can you figure out the value in between? We can't know it exactly, but we can assume that H(T) is a linear function between the two tabulated points. If we assume that it is linear, then we can easily find the intermediate value. First, we set up a table including the unknown value, like this:

T (°C)   H (kJ/kg)
400      2832.4
430      x
450      2943.4

Then, since we are assuming the relationship between T and H is linear, and the slope of a line is constant, the slope between points 3 and 2 must equal the slope between points 3 and 1.

Therefore, we can write that:

\frac{2943.4 - x}{450 - 430} = \frac{2943.4 - 2832.4}{450 - 400}

Solving gives x = 2899 kJ/kg

The same method can be used to find an unknown T for a given H between two tabulated values.

General formula

To derive a more general formula (though I always derive it from scratch anyway, it's nice to have a formula), let's replace the numbers with variables and give them more generic symbols:

x     y
x_1   y_1
x*    y*
x_2   y_2

Setting the slope between points 3 and 2 equal to that between points 3 and 1 yields:

\frac{y_2 - y^*}{x_2 - x^*} = \frac{y_2 - y_1}{x_2 - x_1}

This equation can then be solved for x* or y* as appropriate.
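
For completeness, here is a minimal Python sketch of this interpolation formula, solved for y*; the function name is illustrative, and the check uses the steam-table numbers from above.

    def linear_interpolate(x_star, x1, y1, x2, y2):
        """Estimate y* at x_star, assuming y varies linearly between (x1, y1) and (x2, y2)."""
        return y1 + (y2 - y1) * (x_star - x1) / (x2 - x1)

    # Steam-table example: H at 430 °C, between the 400 °C and 450 °C entries
    print(linear_interpolate(430, 400, 2832.4, 450, 2943.4))  # approximately 2899 kJ/kg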

Limitations of Linear Interpolation

It is important to remember that linear interpolation is not exact. How inexact it is depends on two major factors:

  1. What the real relationship between x and y is (the more curved it is, the worse the linear approximation)
  2. The difference between consecutive x values on the table (the smaller the distance, the closer almost any function will resemble a line)

Therefore, it is not recommended to use linear interpolation if the tabulated values are very widely separated. However, if no other data are available, linear interpolation is often the only practical option, aside from other forms of interpolation which may be just as inaccurate, depending on what the actual function is.

See also w:interpolation.

References

[1]: Smith, Karl J. The Nature of Mathematics, 6th ed. Pacific Grove, CA: Brooks/Cole Publishing Company, p. 683.

[2]: Sandler, Stanley I. Chemical, Biochemical, and Engineering Thermodynamics, 4th ed. John Wiley & Sons, p. 923.