Combine two files with awk

File1.txt



item1 carA
item2 carB
item3 carC
item4 platD
item5 carE


File2.txt



carA platA
carB platB
carC platC
carE platE


Desired output:



item1 platA
item2 platB
item3 platC
item4 platD
item5 platE


How can I do it?

Tags: command-line, text-processing, awk






asked Mar 20 at 12:33 by pawana, edited Mar 20 at 12:38 by Zanna


2 Answers




Answer 1 (score 11), answered Mar 20 at 12:38 by Yaron (edited Mar 20 at 14:25)

The answer below is based on a similar Q&A on Stack Overflow, with some relevant modifications:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2 = ($2 in dict) ? dict[$2] : $2} 1' file2.txt file1.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE

The idea is to build a hash map (an awk associative array) from file2.txt, keyed on its first column, and then use it as a dictionary while processing file1.txt.
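
For readability, here is the same logic expanded into a commented script (a sketch with identical behavior; the file name combine.awk is illustrative, run as awk -f combine.awk file2.txt file1.txt):

# combine.awk: translate column 2 of file1.txt via the file2.txt dictionary
FNR == NR {            # true only while reading the first argument, file2.txt
    dict[$1] = $2      # remember e.g. dict["carA"] = "platA"
    next               # skip the rules below for file2.txt lines
}
{
    # now reading file1.txt: replace column 2 if it is a known key
    $2 = ($2 in dict) ? dict[$2] : $2
}
1                      # always-true pattern with no action: print the record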



For the second question you asked in a comment (what changes if the lookup key is in the sixth column of file1.txt rather than the second):

If the input file looks like file1b.txt:

item1 A5 B C D carA
item2 A4 1 2 3 carB
item3 A3 2 3 4 carC
item4 A2 4 5 6 platD
item5 A1 7 8 9 carE

The following command will do it:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2 = ($6 in dict) ? dict[$6] : $6; $3=""; $4=""; $5=""; $6=""} 1' file2.txt file1b.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE
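
One caveat: blanking $3 through $6 rebuilds each record with the field separators still in place, so every output line ends with trailing spaces. If that matters, a variant that prints only the two wanted columns avoids it (same dictionary logic, a sketch):

$ awk 'FNR==NR {dict[$1]=$2; next} {print $1, (($6 in dict) ? dict[$6] : $6)}' file2.txt file1b.txt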





Comments:

• @pawana - I've updated my answer to also solve the second question from your comment. If it answers your question, please accept it. – Yaron, Mar 20 at 13:43

















Answer 2 (score 6), answered Mar 20 at 16:11 by JoL (edited Mar 21 at 1:04)

I know you said awk, but there is a join command for this purpose...

{ join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
  join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
} | sort -k 1


The first join command alone would be sufficient if it weren't for this line:

item4 platD

The command basically says: join on the second column of the first file (-1 2) and the first column of the second file (-2 1), and output the first column of the first file and the second column of the second file (-o 1.1,2.2). That shows only the lines that were paired. The second join command says almost the same thing, except that it shows the lines from the first file that couldn't be paired (-v 1) and outputs the first column of the first file and the second column of the first file (-o 1.1,1.2). Then we sort the combined output of both. sort -k 1 means sort on the first column, and sort -k 2 means sort on the second. It's important to sort the files on the join column before passing them to join.
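
Tracing the two commands on the sample files makes the split concrete (hand-traced, not taken from the original post): the first join yields the paired lines, the second the leftover.

$ join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
item1 platA
item2 platB
item3 platC
item5 platE
$ join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
item4 platD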



Now, I wrote the sorting twice because I don't like to litter my directories with files if I can help it. However, as David Foerster said, depending on the size of the files you might want to sort them into saved files first, to avoid waiting for each file to be sorted twice. To give an idea of sizes, here is the time it takes to sort 1 million and 10 million lines on my computer:



$ ruby -e '(1..1000000).each{|i| puts "item#{i} plat#{i}"}' | shuf > 1million.txt
$ ruby -e '(1..10000000).each{|i| puts "item#{i} plat#{i}"}' | shuf > 10million.txt
$ head 10million.txt
item530284 plat530284
item7946579 plat7946579
item1521735 plat1521735
item9762844 plat9762844
item2289811 plat2289811
item6878181 plat6878181
item7957075 plat7957075
item2527811 plat2527811
item5940907 plat5940907
item3289494 plat3289494
$ TIMEFORMAT=%E
$ time sort 1million.txt >/dev/null
1.547
$ time sort 10million.txt >/dev/null
19.187


That's 1.5 seconds for 1 million lines, and 19 seconds for 10 million lines.
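
At those sizes, per David Foerster's suggestion in the comments below, you could sort each input once into a temporary file and join those instead; a minimal sketch (the .sorted file names are illustrative):

# sort each input once, keyed on its join column
sort -k 2 File1.txt > file1.sorted
sort -k 1 File2.txt > file2.sorted

# paired lines, then the unpaired leftovers from File1, then restore item order
{ join -o 1.1,2.2 -1 2 -2 1 file1.sorted file2.sorted
  join -v 1 -o 1.1,1.2 -1 2 -2 1 file1.sorted file2.sorted
} | sort -k 1

rm file1.sorted file2.sorted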






Comments:

• In this case it would be better to store the sorted input data in (temporary) intermediate files, because sorting takes quite long for non-trivially sized data sets. Otherwise +1. – David Foerster, Mar 20 at 21:16

• @David It's a good point. Personally, I really dislike having to create intermediate files, but I'm also impatient with long-running processes. I wondered what "trivially sized" would be, so I made a small benchmark and added it to the answer along with your suggestion. – JoL, Mar 21 at 1:11

• Sorting 1 million records is fast enough on reasonably modern desktop computers. Two or three more orders of magnitude and things start to become interesting. In any case, elapsed (real) time (the %E in the time format) is less interesting for measuring computational performance. User-mode CPU time (%U, or simply an unset TIMEFORMAT variable) would be much more meaningful. – David Foerster, Mar 21 at 1:47

• @David I'm not really familiar with the use cases for the different times. Why is it more interesting? Elapsed time coincides with the time I'm actually waiting. For the 1.5-second command, I'm getting 4.5 seconds with %U. – JoL, Mar 21 at 1:54

• Elapsed time is affected by time spent waiting on other tasks running on the same system and on blocking I/O requests. (User) CPU time is not. Usually, when comparing the speed of computationally bound algorithms, one wants to disregard I/O and avoid measurement errors due to other background tasks. The important question is "How much computation does this algorithm require on that data set?" rather than "How much time did my computer spend on all its tasks while it waited for that computation to complete?" – David Foerster, Mar 21 at 2:01

