How do I search for lines in a file that only contain ASCII characters and then act on them?
I have a text file that looks like this:

    English words only
    English and 日本語
    日本語のみ
    English words only
    English and 日本語
    日本語のみ
    English words only
    Also English words only
    English and 日本語
    日本語のみ
    English words only
    English and 日本語
    日本語のみ

Note that in the middle there are two lines, `English words only` and `Also English words only`, one right after the other.
What I need to do is take those two lines and combine them into one line separated by a `/`, like this:

    English words only
    English and 日本語
    日本語のみ
    English words only
    English and 日本語
    日本語のみ
    English words only / Also English words only
    English and 日本語
    日本語のみ
    English words only
    English and 日本語
    日本語のみ
I've found that I can search for lines with ASCII characters with the regular expression `[[:ascii:]]`, and for non-ASCII with `[^[:ascii:]]`. However, I'm having a little trouble using regular expressions to find instances of not matching a condition, since what I need to search for are lines without non-ASCII characters.
I found this question about "inverse matching", but the answers there are beyond me.
Then, of course, it's another problem to match lines based on their relationship to each other. Can I match these lines when they are one after the other? I'm not even sure that is possible.
Is there a way I can search for all lines with no non-ASCII characters, and then combine them, using LibreOffice, Gedit, or the command line?
Note that the file is thousands of lines long. I am not sure, but there may also be English-only lines in groups of 3 or 4.
command-line libreoffice text-processing language regex
edited Apr 26 at 16:54
Zanna
asked Apr 26 at 14:41
Questioner
2 Answers
accepted
It seems like you can use `sed` to do this job, even though it doesn't know about the `[[:ascii:]]` character class. Instead, we can specify all ASCII characters with a range of escape sequences, `[\d0-\d127]`, as long as we use the `C` or `POSIX` locale.
Here's a command that should be reliable:

    LC_ALL=C sed -r ':a;N;s|^([\d0-\d127]+)\n([\d0-\d127]+)$|\1 / \2|;ta' file
Notes

- `LC_ALL=C` Use `C` locale settings only for this command (otherwise you get an error).
- `-r` Use extended regex to make the command more readable (we need fewer backslashes). GNU `sed` also recognises `-E` with the same meaning.
- `:a` Label: the loop starts here.
- `;` Separates commands, like in the shell.
- `N` Read the next line into the pattern space, so we can replace `\n`.
- `s|old|new|` Replace `old` with `new`.
- `^([\d0-\d127]+)\n([\d0-\d127]+)$` Match two lines with only ASCII and capture the first line in `\1` and the second line in `\2`. `^` is start of line, `\n` is a newline, and `$` is end of line, so `^line 1\nline 2$` tests the whole of `line 1` and `line 2`.
- `\1 / \2` The first and second lines, separated by ` / ` instead of a newline.
- `ta` If the last search-and-replace command succeeded, execute the loop again. This allows us to process all the lines of the file, handling any instances where there are more than two all-ASCII lines together.
Many thanks to Eliah Kagan for showing me how to use escape sequences to match ASCII characters.
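The loop in the command above can also be sketched in a slightly different form. This is a variation under stated assumptions, not part of the original answer: it adds the standard sed commands `$!N`, `P` and `D`, so that after a failed match only the first line of the pattern space is printed and dropped. That lets a run of ASCII-only lines be joined regardless of which line it starts on. The function name `join_ascii_runs` is made up for illustration, and GNU sed is assumed (for `-E` and the `\d` escapes).

```shell
# join_ascii_runs FILE   (hypothetical helper name, assumes GNU sed)
# Print FILE with each run of consecutive ASCII-only lines joined
# into one line, separated by " / ".
join_ascii_runs() {
    # :a    label, start of the loop
    # $!N   append the next line, unless we are already on the last one
    # s|||  join two ASCII-only lines; "ta" loops again if that worked
    # P     print the pattern space up to the first newline
    # D     delete that part and restart without reading a new line
    LC_ALL=C sed -E ':a;$!N;s|^([\d0-\d127]+)\n([\d0-\d127]+)$|\1 / \2|;ta;P;D' "$1"
}
```

Called as `join_ascii_runs file > outputfile.txt`, it writes the joined text to a new file and leaves the input untouched.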
edited Apr 26 at 17:28
answered Apr 26 at 16:29
Zanna
Comment (Questioner, Apr 27 at 0:38): Thank you for this command, it worked like a charm. Just an additional note for newbies like myself: one has to add `> outputfile.txt` at the end of the command (after where it says `file` in the command above, which is the input file) so that the results will actually get saved in a file.
If you want whole lines consisting only of ASCII characters, you need to anchor your pattern to the start and end of the line, e.g. with `grep`:

    $ grep -P '^[[:ascii:]]*$' file
    English words only
    English words only
    English words only
    Also English words only
    English words only
Some tools provide a whole-line flag, such as `grep`'s `-x` or `--line-regexp`:

    -x, --line-regexp
           Select only those matches that exactly match the whole line.
           For a regular expression pattern, this is like parenthesizing
           the pattern and then surrounding it with ^ and $.

allowing you to use:

    $ grep -Px '[[:ascii:]]*' file
    English words only
    English words only
    English words only
    Also English words only
    English words only
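The same lines can also be selected by inverting the match, which is the "lines without non-ASCII characters" formulation from the question. A minimal sketch, assuming a GNU `grep` built with `-P` support; the function name `ascii_only_lines` is made up for illustration:

```shell
# ascii_only_lines FILE   (hypothetical helper name)
# Print the lines of FILE that contain no non-ASCII characters.
ascii_only_lines() {
    # [^[:ascii:]] matches a single non-ASCII character (a single byte
    # here, because of LC_ALL=C). -v inverts the match, so only lines
    # without any such character are printed.
    LC_ALL=C grep -vP '[^[:ascii:]]' "$1"
}
```

Because the pattern only has to find one offending character anywhere in the line, no `^` and `$` anchors are needed in this form.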
Multiline matching adds a whole other layer of complexity, since many of the common command-line text processing utilities are line based. You can force `grep` to slurp a whole file using the `-z` flag; however, tools such as `pcregrep` or `perl` itself are probably more appropriate at that point.
The next issue you need to solve is how to interpret the concepts "start of line" and "end of line" in the context of a multiline match. Some tools provide flags for that, as described in Regex Tutorial: Anchors. `perl` is one of these, providing the `/m` modifier. You still need to slurp the file by unsetting the default record separator (done here using `-0777`); for example:

    $ perl -0777 -pe 's/^([[:ascii:]]+)\n([[:ascii:]]+)$/$1 \/ $2/mg' file
    English words only
    English and 日本語
    日本語のみ
    English words only
    English and 日本語
    日本語のみ
    English words only / Also English words only
    English and 日本語
    日本語のみ
    English words only
    English and 日本語
    日本語のみ
edited Apr 26 at 18:00
answered Apr 26 at 14:48
steeldriver