Stata/Data Management

Read and import data

Usually, data are loaded into memory using the use command. The clear option ensures that the dataset currently in memory is discarded, without saving any changes made to it.

use "W:\Data\…\table.dta" , clear

The cd command sets the working directory, which makes it easier to load tables into memory.

cd "W:\Data\" 
use table, clear

Stata 9 users can read Stata 10 datasets using the user-written use10 command.

use10 table, clear

Some example datasets are stored in the Stata directory. They can be loaded into memory using the sysuse command.

sysuse cancer, clear
sysuse smoking, clear
sysuse auto, clear
sysuse jspmix, clear

You can import a comma-separated values (CSV) file using insheet. The delim() option sets the delimiter, here a semicolon.

insheet using "W:\Data\…\table.csv", delim(";")

Save and export data

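The save command writes the dataset in memory to disk. The replace option overwrites an existing file of the same name.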

save table, replace

If you use Stata 10, you can save a dataset in Stata 9 format using saveold.

saveold table, replace

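The outsheet command exports data to a tab-delimited or CSV file. Tab-delimited is the default; the comma option writes comma-separated values instead.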

outsheet using "W:\Data\…\table.csv", replace comma
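As a rough illustration of the export and import commands together, the sketch below writes three variables of the shipped auto dataset to a CSV file and reads them back. The file name auto_subset.csv is hypothetical and is created in the current working directory.

```stata
* Export a few variables to a comma-separated file (hypothetical file name)
sysuse auto, clear
outsheet make price mpg using "auto_subset.csv", comma replace

* Read the file back; the first line is taken as the variable names
insheet using "auto_subset.csv", comma clear
describe
```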

Append and merge

The standard Stata command for combining two datasets on key variables is merge. The user-written command mmerge is designed to be safer and reports the result of the merge more clearly. It can be installed with ssc install mmerge or located with findit mmerge.
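A minimal sketch of a one-to-one merge with the built-in command, assuming Stata 11 or later (the merge 1:1 syntax; earlier versions use a different form). The variables id, x and y and the temporary file are made up for illustration.

```stata
* Illustrative data only: two small datasets sharing the key variable id
clear
set obs 5
gen id = _n
gen y = 10 * id
tempfile right
save `right'

clear
set obs 5
gen id = _n
gen x = id^2

* One-to-one merge on id; _merge records where each observation came from
merge 1:1 id using `right'
tab _merge
```

Checking _merge after every merge is a cheap way to catch unmatched keys.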

dmerge
joinby forms all pairwise combinations of observations between the two datasets within groups defined by the key variables.
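To make the pairwise behaviour of joinby concrete, here is a small sketch with made-up data: the dataset in memory has three observations per group, the using dataset has two, so each group yields 3 x 2 = 6 pairs. The variable names g, a and b are hypothetical.

```stata
* Using dataset: 2 observations in each of the groups g = 1 and g = 2
clear
set obs 4
gen g = ceil(_n / 2)
gen b = _n
tempfile right
save `right'

* Dataset in memory: 3 observations in each group
clear
set obs 6
gen g = ceil(_n / 3)
gen a = _n

* Every observation is paired with every matching observation in the same group
joinby g using `right'
count
```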

If you have two datasets with the same variables but different observations, you can stack them into one dataset using the append command.

use data_1, clear
append using data_2
br
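A self-contained sketch of the same idea, using the shipped auto dataset instead of data_1 and data_2: the data are split into domestic and foreign cars, and the two parts are then stacked back together. The temporary file name is arbitrary.

```stata
* Keep the domestic cars and store them in a temporary file
sysuse auto, clear
keep if foreign == 0
tempfile domestic
save `domestic'

* Keep the foreign cars, then add the domestic ones underneath
sysuse auto, clear
keep if foreign == 1
append using `domestic'
count
```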

Describe a dataset

des
des, s
codebook
codebook2

Detect missing values

tabmiss
npresent
nmissing

You can replace missing values with a numeric value using the mvencode command.

mvencode exg ga dvg verts eco dr dvd fn reg mnr div, mv(0) override
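The variable names in the line above come from the author's own data. A small sketch on made-up data shows the effect; note that without the override option, mvencode refuses to use a value that already occurs in the variable, which protects against mixing real zeros with recoded missing values.

```stata
* Illustrative data with one missing value in each variable
clear
input x y
1 .
. 2
3 4
end

* Replace system missing values (.) by 0 in x and y
mvencode x y, mv(0)
list
```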

Variables

Very often you have to convert a variable from string to numeric format. There are several ways to do it. If the string variable already contains numbers, use destring. Otherwise, use encode, which creates a new numeric variable and uses the strings of the original variable as its value labels.
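The difference between the two commands can be sketched on a small made-up dataset; the variable names price_str, colour, price and colour_num are hypothetical.

```stata
* Illustrative data: numbers stored as text, and a text category
clear
input str5 price_str str6 colour
"12.5" "red"
"7"    "blue"
"30"   "red"
end

* price_str holds numbers stored as text: destring converts them directly
destring price_str, gen(price)

* colour holds categories: encode numbers them in alphabetical order
* and attaches the original strings as value labels
encode colour, gen(colour_num)
list
label list colour_num
```

Using encode on a variable like price_str would produce arbitrary codes 1, 2, 3 instead of the numbers themselves, which is why destring is the right choice there.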

gen
egen
replace
recode
drop
keep
rename
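The commands listed above can be tried on the shipped auto dataset. The following is only a sketch; the new variable names (price_k, mean_mpg, rep3, miles_per_gallon) are invented for the example.

```stata
sysuse auto, clear

* gen creates a variable; replace changes its values
gen price_k = price / 1000
replace price_k = round(price_k, 0.1)

* egen provides functions that work across observations, e.g. a group mean
egen mean_mpg = mean(mpg), by(foreign)

* recode maps old values to new ones, here into a new variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3)

* rename, keep and drop manage the list of variables
rename mpg miles_per_gallon
keep make price_k mean_mpg rep3 miles_per_gallon foreign
drop price_k
describe
```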

The vallist command gives the list of all categories of a categorical variable in Stata.

vallist codep

Dealing with labels

lab var
lab list
lab define
lab value
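A short sketch of how these label commands fit together, written with the unabbreviated command names on made-up data; the variable sex and the label name sexlbl are hypothetical.

```stata
* Illustrative data: a 0/1 indicator
clear
set obs 4
gen sex = mod(_n, 2)

* Describe the variable itself
label variable sex "Sex of respondent"

* Define a set of value labels, then attach it to the variable
label define sexlbl 0 "Female" 1 "Male"
label values sex sexlbl

* Inspect the result
label list sexlbl
tab sex
```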

Expand

You can expand a dataset (i.e., replace each observation with a given number of copies of itself) using the expand command.

This is useful for simulating panel data. In the first example, we draw 10 observations from a standard normal distribution and duplicate each of them, which gives 20 observations.

clear
set obs 10
gen u = invnorm(uniform())
expand 2
sort u
br

It is also possible to pass an integer variable as an argument to expand.

clear
set obs 10
gen u = uniform()
gen var = 1 + int(10 * uniform())
expand var
sort u 
br

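The expandcl command duplicates whole clusters of observations and generates a new identifier for the resulting clusters. It requires the cluster() option, so the example first creates a cluster identifier (here, each observation is its own cluster).

clear
set obs 10
gen u = invnorm(uniform())
gen id = _n
expandcl 2, cluster(id) gen(cl)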

Data Storage types

All numeric types in Stata are ordinary signed quantities, except that the highest 27 values are reserved for the missing values (., .a, .b, ..., .z). The storage size of each type is as follows:

Variable	Size (in bytes)
byte	1
int	2
long	4
float	4
double	8
string	1 per character (therefore only ASCII characters, not full Unicode/UTF-8)
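A brief sketch of how storage types and missing values behave in practice; the variable names small and precise are made up. Because the missing values occupy the top of each type's range, they compare as larger than any nonmissing number.

```stata
* Request a storage type explicitly when generating a variable
clear
set obs 3
gen byte small = _n
gen double precise = _n / 3
describe

* Missing values sort above every nonmissing number, and . < .a < ... < .z
* Each of the following displays 1 (true)
display (. > 1e300)
display (.a > .)
display (.z > .a)
```

This ordering is why a condition such as x > 100 also selects observations where x is missing, unless they are excluded explicitly.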
