How do I search for lines in a file that only contain ASCII characters and then act on them?

I have a text file that looks like this:



English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only
Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ


Note that in the middle there, there are two lines, English words only and Also English words only, one right after the other.



What I need to do is take those two lines and combine them into one line separated by a /, like this:



English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only / Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ


I've found that I can search for lines with ASCII characters with the following regular expression, [[:ascii:]], and for non-ASCII with [^[:ascii:]]. However, I'm having a little trouble using regular expressions to find instances of not matching a condition, since what I need to search on are lines without non-ASCII characters.



I found this question about "inverse matching", but the answers there are beyond me.



Then, of course, it's another problem to match lines based on their relationship to each other. Can I match these lines when they are one after the other? I'm not even sure that is possible.



Is there a way I can search for all lines with no non-ASCII characters, and then combine them, using LibreOffice, Gedit, or the command line?



Note that the file is thousands of lines long. I am not sure, but there may also be runs of three or four consecutive English-only lines.







asked Apr 26 at 14:41 by Questioner
edited Apr 26 at 16:54 by Zanna




















          2 Answers
            Accepted answer







            You can use sed for this job, even though it doesn't know about the [[:ascii:]] character class. Instead, we can specify all ASCII characters with a range of decimal escape sequences, [\d0-\d127] (the \dNNN escapes are a GNU sed extension), as long as we use the C or POSIX locale.



            Here's a command that should be reliable:



            LC_ALL=C sed -r ':a;N;s|^([\d0-\d127]+)\n([\d0-\d127]+)$|\1 / \2|;ta' file


            Notes




            • LC_ALL=C Use C locale settings only for this command (otherwise you get an error)


            • -r Use extended regex to make the command more readable (we need fewer backslashes) (GNU sed also recognises -E with the same meaning).


            • :a Label - loop starts here


            • ; Separates commands, like in the shell


            • N Read the next line into the pattern space, so we can replace \n


            • s|old|new| Replace old with new


            • ^([\d0-\d127]+)\n([\d0-\d127]+)$ - match two lines with only ASCII and capture the first line in \1 and the second line in \2. ^ is start of line, \n is a newline, and $ is end of line, so ^line 1\nline 2$ tests the whole of line 1 and line 2.


            • \1 / \2 The first and second lines, separated by  /  instead of a newline.


            • ta - If the last search-and-replace command succeeded, execute the loop again. This allows us to process all the lines of the file, handling any instances where there are more than two all-ASCII lines together.


            Many thanks to Eliah Kagan for showing me how to use escape sequences to match ASCII characters.
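
For comparison, the same joining logic can be sketched in Python. This is only an illustrative alternative and not part of the sed answer: the function name merge_ascii_runs and the default separator are made up for the example. It compares every line with its neighbour instead of reading lines two at a time, which may matter because, as far as I can tell, N pulls lines in as pairs, so a run that begins on the second line of a pair could be left unjoined by the sed loop.

```python
def merge_ascii_runs(lines, sep=" / "):
    """Join each run of consecutive non-empty, ASCII-only lines with sep."""
    merged = []
    run = []  # ASCII-only lines collected since the last other line

    def flush():
        # Emit the pending run (if any) as a single joined line.
        if run:
            merged.append(sep.join(run))
            run.clear()

    for line in lines:
        # str.isascii() is also True for "", so require a non-empty line,
        # mirroring the + quantifier in the sed pattern.
        if line and line.isascii():
            run.append(line)
        else:
            flush()
            merged.append(line)
    flush()
    return merged


# Hypothetical sample data modelled on the question.
sample = [
    "日本語のみ",
    "English words only",
    "Also English words only",
    "English and 日本語",
]
print("\n".join(merge_ascii_runs(sample)))
```

To run it on a real file, one option is to pass in open("file", encoding="utf-8").read().splitlines() and write the joined result to a new file, assuming the input is UTF-8.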







            answered Apr 26 at 16:29 by Zanna (edited Apr 26 at 17:28)











            • Thank you for this command, it worked like a charm. Just an additional note for newbies like myself, one has to add > outputfile.txt at the end of the command (after where it says file in the command above, which is the input file) so that the results will actually get saved in a file.
              – Questioner
              Apr 27 at 0:38

















                If you want whole lines consisting only of ASCII characters, you need to anchor your pattern to the start and end of the line, e.g. with grep:



                $ grep -P '^[[:ascii:]]*$' file
                English words only
                English words only
                English words only
                Also English words only
                English words only


                Some tools provide a whole-line flag such as grep's -x or --line-regexp:




                 -x, --line-regexp
                Select only those matches that exactly match the whole line.
                For a regular expression pattern, this is like parenthesizing
                the pattern and then surrounding it with ^ and $.



                allowing you to use:



                $ grep -Px '[[:ascii:]]*' file
                English words only
                English words only
                English words only
                Also English words only
                English words only
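
The same whole-line test can be sketched in Python, where re.fullmatch plays the role of grep's -x by requiring the pattern to cover the entire string. The helper name ascii_only_lines is hypothetical, and the explicit range [\x00-\x7F] stands in for [[:ascii:]], which Python's re module does not provide.

```python
import re

# Any number of characters in the ASCII range (code points 0 to 127).
ASCII_LINE = re.compile(r"[\x00-\x7F]*")


def ascii_only_lines(lines):
    """Return the lines made up entirely of ASCII characters."""
    # fullmatch succeeds only if the pattern spans the whole line,
    # which is what the ^...$ anchors or grep -x achieve.
    return [line for line in lines if ASCII_LINE.fullmatch(line)]


# Hypothetical sample data modelled on the question.
sample = ["English words only", "English and 日本語", "日本語のみ"]
print(ascii_only_lines(sample))
```

Like the grep pattern with *, this also accepts empty lines; swapping * for + would exclude them.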



                Multiline matching adds a whole other layer of complexity, since many of the common command-line text processing utilities are line based. You can force grep to slurp a whole file using the -z flag; however, tools such as pcregrep or perl itself are probably more appropriate at that point.



                The next issue you need to solve is how to interpret the concepts "start of line" and "end of line" in the context of a multiline match. Some tools provide flags for that, as described in Regex Tutorial: Anchors. perl is one of these: it provides the /m modifier. You still need to slurp the file by unsetting the default record separator (done here using -0777); for example:



                $ perl -0777 -pe 's{^([[:ascii:]]+)\n([[:ascii:]]+)$}{$1 / $2}mg' file
                English words only
                English and 日本語
                日本語のみ
                English words only
                English and 日本語
                日本語のみ
                English words only / Also English words only
                English and 日本語
                日本語のみ
                English words only
                English and 日本語
                日本語のみ
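
The slurp-and-substitute idea carries over to Python's re module, where the re.MULTILINE flag corresponds to Perl's /m. The sketch below is a variant under stated assumptions rather than a translation of the command above: join_ascii_runs is a made-up name, and the character class deliberately leaves out the newline (which is itself an ASCII character) so that a run of three or four lines is matched as a whole and joined in one pass.

```python
import re

# ASCII characters except newline (\x0A), so a match can only cross a line
# boundary through the explicit \n in the pattern.
ASCII_NO_NL = r"[\x00-\x09\x0B-\x7F]"

# Two or more consecutive non-empty lines consisting solely of ASCII characters.
# With re.MULTILINE, ^ and $ match at the start and end of each line.
RUN = re.compile(rf"^{ASCII_NO_NL}+(?:\n{ASCII_NO_NL}+)+$", re.MULTILINE)


def join_ascii_runs(text, sep=" / "):
    """Replace the newlines inside each run of ASCII-only lines with sep."""
    return RUN.sub(lambda m: m.group(0).replace("\n", sep), text)


# Hypothetical sample data modelled on the question.
text = "日本語のみ\nEnglish words only\nAlso English words only\nEnglish and 日本語\n"
print(join_ascii_runs(text), end="")
```

The whole file is held in memory as one string, just as with -0777, which should be fine for a file of a few thousand lines.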






                answered Apr 26 at 14:48 by steeldriver (edited Apr 26 at 18:00)



























                     
