Introducing Julia/Working with text files

From Wikibooks, open books for an open world

Working with text files

Reading from files

The standard approach for getting information from a text file is using the open(), read(), and close() functions.

To read text from a file, first obtain a file handle:

f = open("sherlock-holmes.txt")

f is now Julia's connection to the file on disk. When you've finished with the file, you should close the connection, using:

close(f)

In general, the recommended way to work with a file in Julia is to wrap any file-processing functions inside a do block:

open("sherlock-holmes.txt") do file
    # do stuff with the open file
end

The open file is automatically closed when this block finishes. See Controlling the flow for more about do blocks.
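As a rough illustration of what the do block saves you from writing, the sketch below compares it with an explicit try/finally. The temporary file and the variable names are made up for this example.

```julia
# Create a small throwaway file so the example is self-contained.
path, io = mktemp()
write(io, "first line\nsecond line\n")
close(io)

# The do-block form: the handle is closed for you, even if an error occurs.
firstline = open(path) do file
    readline(file)
end

# Roughly the same thing written out by hand.
f = open(path)
manual = try
    readline(f)
finally
    close(f)
end

println(firstline == manual)   # both forms read the same line
println(isopen(f))             # the handle is no longer open
```

The value of the last expression in the do block becomes the value of the whole open() call, which is why firstline can be assigned directly.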

Variables created inside the block are local to it, so to keep some of the information that was processed, return it as the value of the block:

totaltime, totallines = open("sherlock-holmes.txt") do f
    linecounter = 0
    timetaken = @elapsed for l in eachline(f)
        linecounter += 1
    end
    (timetaken, linecounter)
end
julia> totaltime, totallines
(0.004484679, 76803)

Slurp – reading a file all at once

You can read the entire contents of an open file at once with read():

julia> s = read(f, String)

This stores the contents of the file in s. Here is the same thing wrapped in a do block:

s = open("sherlock-holmes.txt") do file
    read(file, String)
end
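If all you need is the contents, read() also accepts a filename, so you can skip the file handle altogether. A minimal sketch, using a temporary file invented for the example:

```julia
# Write a throwaway file to read back.
path, io = mktemp()
write(io, "To Sherlock Holmes she is always the woman.\n")
close(io)

text = read(path, String)      # the whole file as one String
bytes = read(path)             # the whole file as a Vector{UInt8}

println(text)
println(length(bytes) == sizeof(text))   # same number of bytes either way
```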

You can use readlines() to read the whole file into an array, with each line an element. By default the line endings are stripped; pass keep=true to retain them, as here:

julia> f = open("sherlock-holmes.txt");

julia> lines = readlines(f, keep=true)
76803-element Array{String,1}:
"THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE\r\n"
"\r\n"
"   I. A Scandal in Bohemia\r\n"
"  II. The Red-headed League\r\n"
...
"Holmes, rather to my disappointment, manifested no further\r\n"
"interest in her when once she had ceased to be the centre of one\r\n"
"of his problems, and she is now the head of a private school at\r\n"
"Walsall, where I believe that she has met with considerable success.\r\n"
julia> close(f)

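Now you can step through the lines. chomp() drops any line ending that is still attached, so that println() doesn't add a second one:

counter = 1
for l in lines
   println("$counter $(chomp(l))")
   counter += 1
end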
1 THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2
3    I. A Scandal in Bohemia
4   II. The Red-headed League
5  III. A Case of Identity
6   IV. The Boscombe Valley Mystery
...
12638 interest in her when once she had ceased to be the centre of one
12639 of his problems, and she is now the head of a private school at
12640 Walsall, where I believe that she has met with considerable success.

There's a better way to do this – see enumerate(), below.

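The chomp() function removes a single trailing newline ("\n" or "\r\n") from a string. You need it only when line endings have been kept: by default, readlines() and eachline() already strip them.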
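The following sketch shows the difference between the default behaviour and keep=true, using a small temporary file with Windows-style line endings (the file and its contents are invented for the example):

```julia
# A two-line file with "\r\n" line endings.
path, io = mktemp()
write(io, "alpha\r\nbeta\r\n")
close(io)

stripped = readlines(path)             # default: line endings removed
kept = readlines(path, keep=true)      # keep=true retains them

println(stripped)
println(kept)
println(chomp.(kept) == stripped)      # chomp removes the trailing "\r\n"
```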

Line by line

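The eachline() function turns a source into an iterator. This allows you to process a file a line at a time. (The output shown here was produced by an older version of Julia that kept each line's two-character "\r\n" ending, so the lengths are two larger than current versions report; eachline() now strips line endings unless you pass keep=true.)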

open("sherlock-holmes.txt") do file
    for ln in eachline(file)
        println("$(length(ln)), $(ln)")
    end
end
61, THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2,
28,    I. A Scandal in Bohemia
29,   II. The Red-headed League
26,  III. A Case of Identity
35,   IV. The Boscombe Valley Mystery
…
62, the island of Mauritius. As to Miss Violet Hunter, my friend
60, Holmes, rather to my disappointment, manifested no further
66, interest in her when once she had ceased to be the centre of one
65, of his problems, and she is now the head of a private school at
70, Walsall, where I believe that she has met with considerable success.

Another approach is to read until you reach the end of the file. You might want to keep track of which line you're on:

open("sherlock-holmes.txt") do f
    line = 1
    while !eof(f)
        x = readline(f)
        println("$line $x")
        line += 1
    end
end

A better approach is to use enumerate() on an iterable object – you'll get the line numbering 'for free':

open("sherlock-holmes.txt") do f
    for (i, ln) in enumerate(eachline(f))
        println(i, ": ", ln)
    end
end
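eachline() also accepts a filename directly, which pairs well with enumerate() in a comprehension. A small sketch, with a temporary file standing in for a real text:

```julia
# Three short lines to number.
path, io = mktemp()
write(io, "one\ntwo\nthree\n")
close(io)

# Passing a filename means there is no handle to open or close yourself.
numbered = ["$i: $ln" for (i, ln) in enumerate(eachline(path))]
foreach(println, numbered)
```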

If you have a specific function that you want to call on a file, you can use this alternative syntax:

function shout(f::IOStream)
    return uppercase(read(f, String))
end
julia> shoutversion = open(shout, "sherlock-holmes.txt");
julia> shoutversion[30237:30400]
"ELEMENTARY PROBLEMS. LET HIM, ON MEETING A\nFELLOW-MORTAL, LEARN AT A GLANCE TO DISTINGUISH THE HISTORY OF THE\nMAN, AND THE TRADE OR  PROFESSION TO WHICH HE BELONGS. "

This opens the file, runs the shout() function on it, then closes it again, assigning the processed contents to the variable.
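The function passed to open() doesn't have to be one you wrote; any function that takes a stream will do, including anonymous ones. A short sketch under that assumption, again using a made-up temporary file:

```julia
path, io = mktemp()
write(io, "a\nb\nc\n")
close(io)

# A built-in function that accepts a stream.
nlines = open(countlines, path)

# An anonymous function works the same way.
firstchar = open(io -> read(io, Char), path)

println(nlines)
println(firstchar)
```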

For comma-separated values (.csv) files, the CSV.jl package is recommended: it handles more corner cases than the DelimitedFiles.readdlm() function and can be faster, especially for larger files. readdlm() remains useful for reading lines delimited with certain characters, such as data files, arrays stored as text files, and tables. If you use the DataFrames package, CSV.read(path, DataFrame) reads the data straight into a table; the older readtable() function is no longer part of DataFrames.
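For simple, well-behaved data, readdlm() from the DelimitedFiles standard library may be all you need. The sketch below reads a tiny comma-separated file with a header row; the file contents are invented for the example.

```julia
using DelimitedFiles

# A tiny comma-separated file with one header row.
path, io = mktemp()
write(io, "x,y\n1,2\n3,4\n")
close(io)

# With header=true, readdlm returns the data and the header row separately.
data, header = readdlm(path, ',', Int; header=true)

println(header)
println(data)
```

Quoted fields containing commas, missing values, and mixed column types are the kind of corner cases where a dedicated CSV package is likely to be the better choice.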

Working with paths and filenames

These functions will be useful for working with filenames:

  • cd(path) changes the current directory.
  • pwd() gets the current working directory.
  • readdir(path) returns a list of the contents of a named directory, or the current directory.
  • abspath(path) adds the current directory's path to a filename to make an absolute pathname.
  • joinpath(str, str, ...) assembles a pathname from pieces.
  • isdir(path) tells you whether the path is a directory.
  • splitdir(path) – split a path into a tuple of the directory name and file name.
  • splitdrive(path) – on Windows, split a path into the drive letter part and the path part. On Unix systems, the first component is always the empty string.
  • splitext(path) – if the last component of a path contains a dot, split the path into everything before the dot and everything including and after the dot. Otherwise, return a tuple of the argument unmodified and the empty string.
  • expanduser(path) – replaces a tilde character at the start of a path with the current user's home directory.
  • normpath(path) – normalizes a path, removing "." and ".." entries.
  • realpath(path) – canonicalizes a path by expanding symbolic links and removing "." and ".." entries.
  • homedir() – gets current user's home directory.
  • dirname(path) – gets the directory part of a path.
  • basename(path) – gets the file name part of a path.
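Here is a brief sketch of how a few of these functions fit together. The path is hypothetical and doesn't need to exist, because these functions only manipulate strings.

```julia
# Build a path from pieces using the separator for the current platform.
path = joinpath("projects", "notes", "chapter1.txt")

dir, file = splitdir(path)      # directory part and file name
stem, ext = splitext(file)      # name without extension, and the extension

println(dir)
println(file)
println(stem, " + ", ext)
println(basename(path) == file && dirname(path) == dir)
println(isabspath(abspath(path)))   # abspath always yields an absolute path
```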

To work on a restricted selection of files in a directory, use filter() and an anonymous function to filter the file names and just keep the ones you want. (filter() is more of a fishing net or sieve, rather than a coffee filter, in that it catches what you want to keep.)

for f in filter(x -> endswith(x, ".jl"), readdir())
    println(f)
end

Astro.jl
calendar.jl
constants.jl
coordinates.jl
...
pseudoscience.jl
riseset.jl
sidereal.jl
sun.jl
utils.jl
vsop87d.jl

If you want to match a group of files using a regular expression, then use occursin(). Let's look for files with ".jpg" or ".png" suffixes (remembering to escape the ".", and to anchor the pattern to the end of the name with "$"):

for f in filter(x -> occursin(r"(?i)\.(jpg|png)$", x), readdir())
    println(f)
end
034571172750.jpg
034571172750.png
51ZN2sCNfVL._SS400_.jpg
51bU7lucOJL._SL500_AA300_.jpg
Voronoy.jpg
kblue.png
korange.png
penrose.jpg
r-home-id-r4.png
wave.jpg

To examine a file hierarchy, use walkdir(), which lets you work through a directory, and examine the files in each directory in turn.
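As a sketch of how walkdir() might be used, the following builds a small throwaway directory tree and collects every file with a given suffix. The helper name find_files and the directory layout are invented for this example.

```julia
# Build a small directory tree to search.
root = mktempdir()
mkpath(joinpath(root, "src", "utils"))
touch(joinpath(root, "README.md"))
touch(joinpath(root, "src", "main.jl"))
touch(joinpath(root, "src", "utils", "helpers.jl"))

# walkdir yields a (directory, subdirectory names, file names) tuple
# for each directory it visits.
function find_files(root, suffix)
    found = String[]
    for (dir, _, files) in walkdir(root)
        for file in files
            endswith(file, suffix) && push!(found, joinpath(dir, file))
        end
    end
    return sort(found)
end

jlfiles = find_files(root, ".jl")
foreach(println, jlfiles)
```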

File information

If you want information about a specific file, use stat("pathname"), and then use one of the fields to find out the information. Here's how to get all the information and the field names listed for a file "i":

s = stat("i")
for n in fieldnames(typeof(s))
    println(n, ": ", getfield(s, n))
end
device: 16777219
inode: 2955324
mode: 16877
nlink: 943
uid: 502
gid: 20
rdev: 0
size: 32062
blksize: 4096
blocks: 0
mtime: 1.409769933e9
ctime: 1.409769933e9

You can access these fields via a 'stat' structure:

julia> s = stat("Untitled1.ipynb")
StatStruct(mode=100644, size=64424)
julia> s.ctime
1.446649269e9

and you can also use some of them directly:

julia> ctime("Untitled2.ipynb")
1.446649269e9

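although not size, because size() is used for the dimensions of arrays. Use the size field, or the filesize() function, instead: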

julia> s.size
64424
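The sketch below shows a few of the standalone query functions alongside the stat fields, and converts a modification time into something readable. The temporary file is made up for the example.

```julia
using Dates

# A five-byte file to inspect.
path, io = mktemp()
write(io, "hello")
close(io)

info = stat(path)
println(info.size == filesize(path))        # same number of bytes
println(isfile(path), " ", isdir(path))     # a regular file, not a directory
println(Dates.unix2datetime(mtime(path)))   # modification time as a DateTime (UTC)
```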

To work on specific files that meet conditions – all Jupyter files (i.e. files with the extension "ipynb") modified after a certain date, for example – you could use something like this:

using Dates
function output_file(path)
    println(stat(path).size, ": ", path)
end 

for afile in filter!(f -> endswith(f, "ipynb") && (mtime(f) > Dates.datetime2unix(DateTime("2015-11-03T09:00"))),
    readdir())
    output_file(realpath(afile))
end

Interacting with the file system

The cp(), mv(), rm(), and touch() functions have the same names and functions as their Unix shell counterparts.
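A minimal sketch of these four functions, working inside a temporary directory so that nothing of yours is touched. The file names are arbitrary.

```julia
dir = mktempdir()
original = joinpath(dir, "a.txt")
duplicate = joinpath(dir, "b.txt")
renamed = joinpath(dir, "c.txt")

touch(original)               # create an empty file (or update its timestamp)
cp(original, duplicate)       # copy a.txt to b.txt
mv(duplicate, renamed)        # rename b.txt to c.txt
rm(original)                  # delete a.txt

println(readdir(dir))         # only c.txt is left
```

By default cp() and mv() refuse to overwrite an existing destination, and rm() refuses to delete a non-empty directory; the force and recursive keyword arguments change that.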

To convert filenames to pathnames, use abspath(). You can map this over a list of files in a directory:

julia> map(abspath, readdir())
67-element Array{String,1}:
"/Users/me/.CFUserTextEncoding"
"/Users/me/.DS_Store"
"/Users/me/.Trash"
"/Users/me/.Xauthority"
"/Users/me/.ahbbighrc"
"/Users/me/.apdisk"
"/Users/me/.atom"
...

To restrict the list to filenames that contain a particular substring, use an anonymous function inside filter() – something like this:

julia> filter(x -> occursin("re", x), map(abspath, readdir()))
4-element Array{String,1}:
"/Users/me/.DS_Store"
"/Users/me/.gitignore"
"/Users/me/.hgignore_global"
"/Users/me/Pictures"
...

To restrict the list to regular expression matches, try this:

julia> filter(x -> occursin(r"recur.*\.jl", x), map(abspath, readdir()))
2-element Array{String,1}:
 "/Users/me/julia/recursive-directory-scan.jl"
 "/Users/me/julia/recursive-text.jl"

Writing to files

To write to a text file, open it using the "w" flag and make sure that you have permission to create the file in the specified directory:

open("/tmp/t.txt", "w") do f
    write(f, "A, B, C, D\n")
end
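The "w" flag replaces whatever the file already contains. To add to the end of a file instead, open it with "a". A short sketch, using a file in a temporary directory:

```julia
path = joinpath(mktempdir(), "log.txt")

open(path, "w") do f          # "w" creates the file, or empties an existing one
    write(f, "first\n")
end

open(path, "a") do f          # "a" appends to the end instead of overwriting
    println(f, "second")      # println adds the newline for you
end

print(read(path, String))
```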

Here's how to write 20 lines of 4 random numbers between 1 and 10, separated by commas:

function fourrandom()
    return rand(1:10,4)
end

open("/tmp/t.txt", "w") do f
    for i in 1:20
        n1, n2, n3, n4 = fourrandom()
        write(f, "$n1, $n2, $n3, $n4 \n")
    end
end

A quicker alternative to this is to use the DelimitedFiles.writedlm() function, described next:

using DelimitedFiles
writedlm("/tmp/test.txt", rand(1:10, 20, 4), ", ")

Writing and reading arrays to and from a file

The DelimitedFiles standard library (load it with using DelimitedFiles) provides two convenient functions, writedlm() and readdlm(). These let you write an array or collection to a file and read it back.

writedlm() writes the contents of an object to a text file, and readdlm() reads the data from a file into an array:

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
0.913583  0.312291  0.0855798  0.0592331  0.371789
0.13747   0.422435  0.295057   0.736044   0.763928
0.360894  0.434373  0.870768   0.469624   0.268495
0.620462  0.456771  0.258094   0.646355   0.275826
0.497492  0.854383  0.171938   0.870345   0.783558

julia> writedlm("/tmp/test.txt", numbers)

You can see the file using the shell (type a semicolon ";" to switch):

shell> cat "/tmp/test.txt"
.9135833328830523	.3122905420350348	.08557977218948465	.0592330821115965	.3717889559226475
.13747015238054083	.42243494637594203	.29505701073304524	.7360443978397753	.7639280496847236
.36089432672073607	.43437288984307787	.870767989032692	.4696243851552686	.26849468736154325
.6204624598015906	.4567706404666232	.25809436255988105	.6463554854347682	.27582613759302377
.4974916625466639	.8543829989347014	.17193814498701587	.8703447748713236	.783557793485824

The elements are separated by tabs unless you specify another delimiter. Here, a colon is used to delimit the numbers:

julia> writedlm("/tmp/test.txt", rand(1:6, 10, 10), ":")
shell> cat "/tmp/test.txt"
3:3:3:2:3:2:6:2:3:5
3:1:2:1:5:6:6:1:3:6
5:2:3:1:4:4:4:3:4:1
3:2:1:3:3:1:1:1:5:6
4:2:4:4:4:2:3:5:1:6
6:6:4:1:6:6:3:4:5:4
2:1:3:1:4:1:5:4:6:6
4:4:6:4:6:6:1:4:2:3
1:4:4:1:1:1:5:6:5:6
2:4:4:3:6:6:1:1:5:5

To read in data from a text file, you can use readdlm().

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
0.862955  0.00827944  0.811526  0.854526  0.747977
0.661742  0.535057    0.186404  0.592903  0.758013
0.800939  0.949748    0.86552   0.113001  0.0849006
0.691113  0.0184901   0.170052  0.421047  0.374274
0.536154  0.48647     0.926233  0.683502  0.116988
julia> writedlm("/tmp/test.txt", numbers)

julia> numbers = readdlm("/tmp/test.txt")
5x5 Array{Float64,2}:
0.862955  0.00827944  0.811526  0.854526  0.747977
0.661742  0.535057    0.186404  0.592903  0.758013
0.800939  0.949748    0.86552   0.113001  0.0849006
0.691113  0.0184901   0.170052  0.421047  0.374274
0.536154  0.48647     0.926233  0.683502  0.116988
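When the file uses a delimiter other than whitespace, or you want a particular element type back, pass both to readdlm(). A small round-trip sketch, with an invented file name and array:

```julia
using DelimitedFiles

path = joinpath(mktempdir(), "grid.txt")
grid = [1 2 3; 4 5 6]

writedlm(path, grid, ':')            # colon-delimited rows
print(read(path, String))

restored = readdlm(path, ':', Int)   # read back as integers
println(restored == grid)
```

Without the Int argument, readdlm() chooses an element type based on what it finds in the file, so whole numbers would typically come back as Float64.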

There are also a number of Julia packages specifically designed for reading and writing data to files, including DataFrames.jl and CSV.jl. Search in JuliaHub or JuliaPackages for these and more. Many of these packages live at the home of the JuliaData organization.