Combine two files with awk

File1.txt



item1 carA
item2 carB
item3 carC
item4 platD
item5 carE


File2.txt



carA platA
carB platB
carC platC
carE platE


Desired output:



item1 platA
item2 platB
item3 platC
item4 platD
item5 platE


How can I do it?

Tags: command-line, text-processing, awk






asked Mar 20 at 12:33 by pawana, edited Mar 20 at 12:38 by Zanna


2 Answers




Answer 1 (score 11), answered Mar 20 at 12:38 by Yaron (edited Mar 20 at 14:25)

The answer below is based on a similar Q&A on Stack Overflow, with some relevant modifications:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2 = ($2 in dict) ? dict[$2] : $2} 1' file2.txt file1.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE

The idea is to build a hash map (an awk associative array) from file2.txt, keyed on its first column, and then use it as a dictionary while processing file1.txt.
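
For readability, here is the same logic expanded into a commented script (a sketch with identical behavior; the file name combine.awk is illustrative, run as awk -f combine.awk file2.txt file1.txt):

# combine.awk: translate column 2 of file1.txt via the file2.txt dictionary
FNR == NR {            # true only while reading the first argument, file2.txt
    dict[$1] = $2      # remember e.g. dict["carA"] = "platA"
    next               # skip the rules below for file2.txt lines
}
{
    # now reading file1.txt: replace column 2 if it is a known key
    $2 = ($2 in dict) ? dict[$2] : $2
}
1                      # always-true pattern with no action: print the record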



For the second question you asked in a comment (what changes if the lookup key is in the sixth column of file1.txt rather than the second):

If the input file looks like file1b.txt:

item1 A5 B C D carA
item2 A4 1 2 3 carB
item3 A3 2 3 4 carC
item4 A2 4 5 6 platD
item5 A1 7 8 9 carE

The following command will do it:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2 = ($6 in dict) ? dict[$6] : $6; $3=""; $4=""; $5=""; $6=""} 1' file2.txt file1b.txt
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE
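
One caveat: blanking $3 through $6 rebuilds each record with the field separators still in place, so every output line ends with trailing spaces. If that matters, a variant that prints only the two wanted columns avoids it (same dictionary logic, a sketch):

$ awk 'FNR==NR {dict[$1]=$2; next} {print $1, (($6 in dict) ? dict[$6] : $6)}' file2.txt file1b.txt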





Comments:

• @pawana - I've updated my answer to also solve the second question from your comment. If it answers your question, please accept it. – Yaron, Mar 20 at 13:43

















Answer 2 (score 6), answered Mar 20 at 16:11 by JoL (edited Mar 21 at 1:04)

I know you said awk, but there is a join command for this purpose...

{ join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
  join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
} | sort -k 1


The first join command alone would be sufficient if it weren't for this line:

item4 platD

The command basically says: join on the second column of the first file (-1 2) and the first column of the second file (-2 1), and output the first column of the first file and the second column of the second file (-o 1.1,2.2). That shows only the lines that were paired. The second join command says almost the same thing, except that it shows the lines from the first file that couldn't be paired (-v 1) and outputs the first column of the first file and the second column of the first file (-o 1.1,1.2). Then we sort the combined output of both. sort -k 1 means sort on the first column, and sort -k 2 means sort on the second. It's important to sort the files on the join column before passing them to join.
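
Tracing the two commands on the sample files makes the split concrete (hand-traced, not taken from the original post): the first join yields the paired lines, the second the leftover.

$ join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
item1 platA
item2 platB
item3 platC
item5 platE
$ join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)
item4 platD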



Now, I wrote the sorting twice because I don't like to litter my directories with files if I can help it. However, as David Foerster said, depending on the size of the files you might want to sort them into saved files first, to avoid waiting for each file to be sorted twice. To give an idea of sizes, here is the time it takes to sort 1 million and 10 million lines on my computer:



$ ruby -e '(1..1000000).each{|i| puts "item#{i} plat#{i}"}' | shuf > 1million.txt
$ ruby -e '(1..10000000).each{|i| puts "item#{i} plat#{i}"}' | shuf > 10million.txt
$ head 10million.txt
item530284 plat530284
item7946579 plat7946579
item1521735 plat1521735
item9762844 plat9762844
item2289811 plat2289811
item6878181 plat6878181
item7957075 plat7957075
item2527811 plat2527811
item5940907 plat5940907
item3289494 plat3289494
$ TIMEFORMAT=%E
$ time sort 1million.txt >/dev/null
1.547
$ time sort 10million.txt >/dev/null
19.187


That's 1.5 seconds for 1 million lines, and 19 seconds for 10 million lines.
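
At those sizes, per David Foerster's suggestion in the comments below, you could sort each input once into a temporary file and join those instead; a minimal sketch (the .sorted file names are illustrative):

# sort each input once, keyed on its join column
sort -k 2 File1.txt > file1.sorted
sort -k 1 File2.txt > file2.sorted

# paired lines, then the unpaired leftovers from File1, then restore item order
{ join -o 1.1,2.2 -1 2 -2 1 file1.sorted file2.sorted
  join -v 1 -o 1.1,1.2 -1 2 -2 1 file1.sorted file2.sorted
} | sort -k 1

rm file1.sorted file2.sorted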






Comments:

• In this case it would be better to store the sorted input data in (temporary) intermediate files, because sorting takes quite long for non-trivially sized data sets. Otherwise +1. – David Foerster, Mar 20 at 21:16

• @David It's a good point. Personally, I really dislike having to create intermediate files, but I'm also impatient with long-running processes. I wondered what "trivially sized" would be, so I made a small benchmark and added it to the answer along with your suggestion. – JoL, Mar 21 at 1:11

• Sorting 1 million records is fast enough on reasonably modern desktop computers. Two or three more orders of magnitude and things start to become interesting. In any case, elapsed (real) time (the %E in the time format) is less interesting for measuring computational performance. User-mode CPU time (%U, or simply an unset TIMEFORMAT variable) would be much more meaningful. – David Foerster, Mar 21 at 1:47

• @David I'm not really familiar with the use cases for the different times. Why is it more interesting? Elapsed time coincides with the time I'm actually waiting. For the 1.5-second command, I'm getting 4.5 seconds with %U. – JoL, Mar 21 at 1:54

• Elapsed time is affected by time spent waiting on other tasks running on the same system and on blocking I/O requests. (User) CPU time is not. Usually, when comparing the speed of computationally bound algorithms, one wants to disregard I/O and avoid measurement errors due to other background tasks. The important question is "How much computation does this algorithm require on that data set?" rather than "How much time did my computer spend on all its tasks while it waited for that computation to complete?" – David Foerster, Mar 21 at 2:01

