Saturday, April 30, 2016

Prepare to use Regular Expressions in Java



Regular expression quick reference:
  .        Wildcard: any character
  *        Repeat: zero or more occurrences of previous character or class
  ^        Line position: beginning of line
  $        Line position: end of line
  [class]  Character class: any one character in set
  [^class] Inverse class: any one character not in set
  [x-y]    Range: any characters within the specified range
  \x       Escape: literal use of metacharacter x
  \<xyz    Word position: beginning of word
  xyz\>    Word position: end of word

For full information on FINDSTR regular expressions refer to the online Command Reference.

Change to a directory where we can work with files.
C:\>cd c:\temp

Create a file that we can copy later.  The contents are not important.  ( Press Ctrl+Z to close the file )
C:\temp>copy con a.txt
hello
^Z
        1 file(s) copied.

Create a few files that we can work with.
C:\temp>for %i in ( 1 2 3 4 5 6 7 8 9 ) do copy a.txt myFile%i.txt

List the files and find a pattern to match.  Your times and dates will be different, but think about the pattern not the exact command.
C:\temp>dir
 Volume in drive C is OS
 Volume Serial Number is E828-D083

 Directory of C:\temp

11/11/2015  10:18 PM    <DIR>          .
11/11/2015  10:18 PM    <DIR>          ..
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt
              16 File(s)        992,802 bytes
               3 Dir(s)  415,352,266,752 bytes free

You can learn about the options we can use with FINDSTR utility by typing findstr /? To learn about the regular expression syntax as we pipe the DIR output as an input to FINDSTR.  In this example, we’ll match any line that contains two digitsfollowed by a forward slash.
C:\temp>dir|findstr /r [0-9][0-9]/
11/11/2015  10:18 PM    <DIR>          .
11/11/2015  10:18 PM    <DIR>          ..
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

In this example, we’ll match an line that contains 11 followed by a forward slash.
Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r 11/
11/11/2015  10:18 PM    <DIR>          .
11/11/2015  10:18 PM    <DIR>          ..
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

In this example, we’ll match any line that starts with 11 followed by a forward slash.
Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/
11/11/2015  10:18 PM    <DIR>          .
11/11/2015  10:18 PM    <DIR>          ..
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

In this example, we’ll match any line that starts with 11 followed by a forward slash and any string.
Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*
11/11/2015  10:18 PM    <DIR>          .
11/11/2015  10:18 PM    <DIR>          ..
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

In this example, we’ll match any line that starts with 11 followed by forward slash, any number of characters followed by the string File.  Notice how the directory entries and file a.txt does not show up anymore. Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*File
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

In this example, we’ll match any line that starts with 11 followed by a forward slash, any number of characters, and ends with the letter t. Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*t$
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

In this example, we’ll match any lines that starts with 11 followed by a forward slash, followed by any number of characters followed by a digit.txt. Notice the meaning of the period is escaped by the forward slash to take it as a literal character. Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*[1-9]\.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile4.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt

Let’s change one of the files to a unique sequence of characters.
C:\temp>ren myFile4.txt myFileZ.tat

In this example, we’ll match any lines that start with 11 followed by a forward slash, any number of characters followed by a digit, followed by any number of characters, and the line will end with a letter t.
Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*[1-9].*t$
11/11/2015  10:17 PM                 7 a.txt
11/11/2015  10:17 PM                 7 myFile1.txt
11/11/2015  10:17 PM                 7 myFile2.txt
11/11/2015  10:17 PM                 7 myFile3.txt
11/11/2015  10:17 PM                 7 myFile5.txt
11/11/2015  10:17 PM                 7 myFile6.txt
11/11/2015  10:17 PM                 7 myFile7.txt
11/11/2015  10:17 PM                 7 myFile8.txt
11/11/2015  10:17 PM                 7 myFile9.txt
11/11/2015  10:17 PM                 7 myFileZ.tat

In this example, we’ll match any line that matches all the lines as before, but now we’ll end the line with at and not just a t. Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*[1-9].*at$
11/11/2015  10:17 PM                 7 myFileZ.tat

In this example, we’ll match our uniquely named file ignoring everything else.
Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*[1-9]\.[a-zA-Z].*at$

Oops, something went wrong.  Why did the previous search pattern not match our file name, but the following pattern will?
Note: the first two digits will be different in your case, just type what you see as the first two digits after running dir in the current directory
C:\temp>dir|findstr /r ^11/.*[a-zA-Z].*at$
11/11/2015  10:17 PM                 7 myFileZ.tat

Write your answer here ( or the paper handed to you by your instructor ) and hand it back to your instructor.



Now, create a new file named myfile.txt using notepad and add the text (214)234-4567 to the file.  Find the text in all text files using FINDSTR utility.
C:\temp>findstr /r ^[(\[][0-9][0-9] *.txt

myfile.txt:(214)234-4567

Example Java code using regular expression to match pattern.


import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * This program tests a customer number to determine
 * whether it is in the proper format.
 */
public class SuspectNumbers  {
   public static void main(String[] args) throws IOException   {  
         File inputFile = new File("zip.txt");
         Scanner input=new Scanner(inputFile);
         boolean isEqual,isEqual2;
         Matcher m;
         String value;
         String match="75201";
         Pattern p = Pattern.compile("^75201");

      while(input.hasNext()){
        value=input.nextLine();
        if(value.startsWith(match))
              System.out.println(value);
        
        m = p.matcher(value);
        isEqual=Pattern.matches("^75201.*", value);
        if(isEqual)
              System.out.println("Regular Expression matched: "+value);
        
        isEqual2=m.find();
        if(isEqual2)
              System.out.println("Regular Expression 2 matched: "+value);
        
        if(Pattern.matches(" [0-9]{3}-[0-9]{2}-[0-9]{4} ", value))
              System.out.println("Social Security number matched: "+value);
        
        if(Pattern.matches("[\\[\\(][0-9]{3}[\\]\\)][0-9]{3}-[0-9]{4}.*", value))
              System.out.println("Phone number matched: "+value);
        
        if(Pattern.matches("[\\[\\(][0-9]{3}[\\]\\)][0-9]{3}-[0-9]{3}[a-zA-Z].*", value))
             System.out.println("Fake phone number matched: "+value);
      }
   }
}

Based on the code above, what sample value would you place in the input file to match fake phone numbers?




Appendix A: How to use regular expressions in MS Word


Save a Word Document with a phone number in it in a format (214)345-3456. Click on Find->Advanced Find …., check Use wildcards and type the following into the “Find what:”  <[0-9]{3}\)[0-9]{3}-[0-9]{4}>

      

Appendix A: How to use regular expressions in Notepad++


Save a Document with a phone number in it like [124]123-3456. Click on Search, check “Regular expression” and type the following into the “Find what:”  \[[0-9]{3}\][0-9]{3}-[0-9]{4}>


You can generate simple text files with patterns that you can search for and match with regular expressions.


75201    Dallas    TX
75202    Dallas    TX
75203    Dallas    TX
75204    Dallas    TX
75205    Dallas    TX
75206    Dallas    TX
 75378    Dallas    TX
(214)234-456I                                     // notice 456I is not ending with a one, it is the capital letter 'i'
75379    Dallas    TX
75047    Garland    TX
75048    Garland    TX
[123]345_4567                                  //phone numbers can be in different format
75049    Garland    TX
75080    Richardson    TX
75081    Richardson    TX
75082    Richardson    TX
76016    Arlington    TX
76017    Arlington    TX
234-05-3456                                     //social security numbers should be easy to match
76018    Arlington    TX
76019    Arlington    TX

No comments:

Post a Comment