STAT 29000: Project 3 — Fall 2021
Regular expressions, irregularly satisfying, introduction to grep
and regular expressions
Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process — after all, projects in industry will not typically provide you with a path to your dataset and call it a day. grep
is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.
Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less-interested in UNIX tools (which you shouldn’t be, they can be awesome), you should definitely take the time to learn regular expressions. |
Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep
, and experiment with regular expressions using grep
, R, and later on, Python.
Scope: grep
, regular expression basics, utilizing regular expression tools in R and Python
Dataset(s)
The following questions will use the following dataset(s):
/anvil/projects/tdm/data/consumer_complaints/complaints.csv
Questions
Question 1
grep
stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep
, we will be using it with textual data.
Let’s assume for a second that we didn’t provide you with the location of this projects dataset, and you didn’t know the name of the file either. With all of that being said, you do know that it is the only dataset with the text "That’s the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. (When you search for this sentence in the file, make sure that you type the single quote in "That’s" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence.)
Write a grep
command that finds the dataset. You can start in the /depot/datamine/data
directory to reduce the amount of text being searched. In addition, use a wildcard to reduce the directories we search to only directories that start with a c
inside the /depot/datamine/data
directory.
-
Code used to solve this problem.
-
Output from running the code.
Question 2
In the previous project, you learned about a command that could quickly print out the first n lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and only the first line of the file.
Great, now that you know what each column holds, repeat question (1), but, format the output so that it shows the complaint_id
, consumer_complaint_narrative
, and the state
.
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Imagine a scenario where we are dealing with a much bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in in the complaint_id
, state
, consumer_complaint_narrative
, and tags
.
Use UNIX tools to, in one line, create a new dataset called southeast.csv
that only contains the data for the five states mentioned above, and only the columns listed above.
Be careful you don’t accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL"). |
How many rows of data remain? How many megabytes is the new file? Use cut
to isolate just the data we ask for. For example, just print the number of rows, and just print the value (in Mb) of the size of the file.
-
Code used to solve this problem.
-
Output from running the code.
Question 4
We want to isolate some of our southeast complaints. Return rows from our new dataset, southeast.csv
, that have one of the following words: "wow", "irritating", or "rude" followed by at least 1 exclamation mark. Do this with just a single grep
command. Ignore case (whether or not parts of the "wow", "rude", or "irritating" words are capitalized or not).
-
Code used to solve this problem.
-
Output from running the code.
Question 5
If you pay attention to the consumer_complaint_narrative
column, you’ll notice that some of the narratives contain dollar amounts in curly braces {
and }
. Use grep
to find the narratives that contain at least one dollar amount enclosed in curly braces. Use head
to limit output to only the first 5 results.
Use the option |
Use the option |
There are instances like
And that the following are not matched:
|
-
Code used to solve this problem.
-
Output from running the code.
Question 6
As mentioned earlier on, every major language has some sort of regular expression package. Use either the re
package in Python (or string methods in pandas
, for example, findall
), or the grep
, grepl
, and stringr
packages in R to perform the same operation in question (5).
If you are using
Then, dat['amounts'] will be a
|
-
Code used to solve this problem.
-
Output from running the code.
Question 7 (optional, 0 pts)
As mentioned earlier on, every major language has some sort of regular expression package. Use either the re
package in Python, or the grep
, grepl
, and stringr
packages in R to create a new column in your data frame (pandas
or R data frame) named amounts
that contains a semi-colon separated string of dollar amounts without the dollar sign. For example, if the dollar amounts are $100, $200, and $300, the amounts column should contain 100.00;200.00;300.00
.
One good way to do this is to use the
|
This is one way to test if a value is
|
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |