2.Text Representation Word Embeddings 1.Ipynb

The document introduces key terms related to text analysis, such as corpus, vocabulary, document, and word. It demonstrates the Bag of Words model using Python code to create a DataFrame and apply the CountVectorizer from sklearn to extract features from text data. The output includes a vocabulary dictionary and a matrix representation of the text data.
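
The Bag of Words idea summarized above can be sketched without scikit-learn. This is a minimal illustration, not the notebook's own code: the helper names `ngrams` and `bag_of_words` are hypothetical, and it assumes plain lowercase whitespace tokenization, which happens to agree with `CountVectorizer`'s default tokenizer on this small corpus but will differ on text with punctuation or one-letter words.

```python
from collections import Counter

# Same four documents the notebook places in its DataFrame.
docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]

def ngrams(text, n=1):
    """Split on whitespace and join every run of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_words(corpus, n=1):
    """Return (term -> column index, count matrix) for the given n-gram size."""
    # Sorting the terms gives the alphabetical column order seen in vocabulary_.
    vocab = sorted({gram for doc in corpus for gram in ngrams(doc, n)})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = []
    for doc in corpus:
        counts = Counter(ngrams(doc, n))
        matrix.append([counts[term] for term in vocab])
    return index, matrix

vocab_index, matrix = bag_of_words(docs)               # unigrams
bigram_index, bigram_matrix = bag_of_words(docs, n=2)  # bigrams

print(vocab_index)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the second document's count of 2 for `dswithbappy` lands in column 1. Passing `n=2` plays the role that `ngram_range=(2,2)` plays in the notebook.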

Uploaded by

v2170688
Copyright
© All Rights Reserved

{"cells":[{"cell_type":"markdown","source":["Some common terms to remember:\n","1. Corpus\n","2. Vocabulary\n","3. Document\n","4. Word"],"metadata":
{"id":"6Nqr2Ga5Zl_I"}},{"cell_type":"markdown","metadata":
{"id":"Dc96CHeAfXMD"},"source":["# Bag of words"]},
{"cell_type":"code","execution_count":1,"metadata":
{"id":"X13eTxrzfXMH","executionInfo":
{"status":"ok","timestamp":1702122181123,"user_tz":-360,"elapsed":3,"user":
{"displayName":"colab0 ineuron","userId":"16851312232179065356"}}},"outputs":
[],"source":["import numpy as np\n","import pandas as pd"]},
{"cell_type":"code","execution_count":2,"metadata":{"colab":{"base_uri":"https://
localhost:8080/","height":174},"id":"YiAM6J7HfXMK","executionInfo":
{"status":"ok","timestamp":1702122186580,"user_tz":-360,"elapsed":15,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"9723045f-c4fa-4f17-f4be-
8820393e4517"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["
text output\n","0 people watch dswithbappy 1\n","1 dswithbappy watch
dswithbappy 1\n","2 people write comment 0\n","3
dswithbappy write comment 0"]},"metadata":{},"execution_count":2}],"source":["df =
pd.DataFrame({\"text\":[\"people watch dswithbappy\",\
n"," \"dswithbappy watch dswithbappy\",\n","
\"people write comment\",\n"," \"dswithbappy write
comment\"],\"output\":[1,1,0,0]})\n","\n","df"]},
{"cell_type":"code","execution_count":3,"metadata":
{"id":"34_ERQexfXMM","executionInfo":
{"status":"ok","timestamp":1702122227516,"user_tz":-360,"elapsed":1913,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}}},"outputs":[],"source":["from
sklearn.feature_extraction.text import CountVectorizer\n","cv =
CountVectorizer()"]},{"cell_type":"code","execution_count":4,"metadata":
{"id":"XcuEAVN7fXMM","executionInfo":
{"status":"ok","timestamp":1702122231980,"user_tz":-360,"elapsed":4,"user":
{"displayName":"colab0 ineuron","userId":"16851312232179065356"}}},"outputs":
[],"source":["bow = cv.fit_transform(df['text'])"]},
{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://
localhost:8080/"},"id":"L6g9cuGAfXMM","executionInfo":
{"status":"ok","timestamp":1702122235583,"user_tz":-360,"elapsed":7,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"0044c5cb-b6af-417d-8886-
126ae17d1b2c"},"outputs":[{"output_type":"stream","name":"stdout","text":
["{'people': 2, 'watch': 3, 'dswithbappy': 1, 'write': 4, 'comment': 0}\
n"]}],"source":["#vocabulary\n","print(cv.vocabulary_)"]},
{"cell_type":"code","source":["bow.toarray()"],"metadata":{"colab":
{"base_uri":"https://localhost:8080/"},"id":"124u1tNE_3C3","executionInfo":
{"status":"ok","timestamp":1702122254278,"user_tz":-360,"elapsed":5,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"c661e979-d720-4786-cd7d-
8a55ef435255"},"execution_count":6,"outputs":
[{"output_type":"execute_result","data":{"text/plain":["array([[0, 1, 1, 1, 0],\
n"," [0, 2, 0, 1, 0],\n"," [1, 0, 1, 0, 1],\n"," [1, 1, 0, 0,
1]])"]},"metadata":{},"execution_count":6}]},
{"cell_type":"code","execution_count":null,"metadata":{"colab":
{"base_uri":"https://localhost:8080/"},"id":"Fj8Ie1C5fXMN","executionInfo":
{"status":"ok","timestamp":1668843798600,"user_tz":-360,"elapsed":4,"user":
{"displayName":"Boktiar Ahmed
Bappy","userId":"10381972055342951581"}},"outputId":"36e2833c-a8ca-4860-bde0-
0bc5723dc4ee"},"outputs":[{"output_type":"stream","name":"stdout","text":["[[0 1 1
1 0]]\n","[[0 2 0 1 0]]\n","[[1 0 1 0 1]]\n"]}],"source":["print(bow[0].toarray())\
n","print(bow[1].toarray())\n","print(bow[2].toarray())"]},
{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://
localhost:8080/"},"id":"ykzvZJonfXMN","executionInfo":
{"status":"ok","timestamp":1702122646509,"user_tz":-360,"elapsed":5,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"b0138701-c82e-4066-fd29-
ba5d3dc5e465"},"outputs":[{"output_type":"execute_result","data":{"text/plain":
["array([[0, 1, 0, 1, 0]])"]},"metadata":{},"execution_count":8}],"source":["# new\
n","cv.transform(['Bappy watch dswithbappy']).toarray()"]},
{"cell_type":"code","source":["X = bow.toarray()\n","y = df['output']"],"metadata":
{"id":"scbhgP7Hk2de"},"execution_count":null,"outputs":[]},
{"cell_type":"code","source":[],"metadata":
{"id":"v71DcxiEk2Sr"},"execution_count":null,"outputs":[]},
{"cell_type":"markdown","metadata":{"id":"tRyCsTt9fXMO"},"source":["# N-grams"]},
{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://
localhost:8080/","height":174},"id":"SnFlULCvfXMO","executionInfo":
{"status":"ok","timestamp":1702122671507,"user_tz":-360,"elapsed":531,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"52135bfc-2d77-49eb-dca9-
ed377def8558"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["
text output\n","0 people watch dswithbappy 1\n","1 dswithbappy watch
dswithbappy 1\n","2 people write comment 0\n","3
dswithbappy write comment 0"]},"metadata":{},"execution_count":9}],"source":["df =
pd.DataFrame({\"text\":[\"people watch dswithbappy\",\
n"," \"dswithbappy watch dswithbappy\",\n","
\"people write comment\",\n"," \"dswithbappy write
comment\"],\"output\":[1,1,0,0]})\n","\n","df"]},
{"cell_type":"code","execution_count":null,"metadata":
{"id":"ehq1YwC3fXMP"},"outputs":[],"source":["# BI grams\n","from
sklearn.feature_extraction.text import CountVectorizer\n","cv =
CountVectorizer(ngram_range=(2,2))"]},
{"cell_type":"code","execution_count":null,"metadata":{"id":"aAO3z9I-
fXMP"},"outputs":[],"source":["bow = cv.fit_transform(df['text'])"]},
{"cell_type":"code","execution_count":null,"metadata":{"colab":
{"base_uri":"https://localhost:8080/"},"id":"8X5poK0tfXMP","executionInfo":
{"status":"ok","timestamp":1694935340805,"user_tz":-360,"elapsed":10,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"fed2e3d8-3228-40ca-f5b3-
0f75083fbf88"},"outputs":[{"output_type":"stream","name":"stdout","text":["{'people
watch': 2, 'watch dswithbappy': 4, 'dswithbappy watch': 0, 'people write': 3,
'write comment': 5, 'dswithbappy write': 1}\n"]}],"source":
["print(cv.vocabulary_)"]},{"cell_type":"code","execution_count":null,"metadata":
{"colab":{"base_uri":"https://
localhost:8080/"},"id":"I_xGj2QtfXMQ","executionInfo":
{"status":"ok","timestamp":1668843949359,"user_tz":-360,"elapsed":461,"user":
{"displayName":"Boktiar Ahmed
Bappy","userId":"10381972055342951581"}},"outputId":"d2b212e5-09cd-4855-bb71-
09512da0120d"},"outputs":[{"output_type":"stream","name":"stdout","text":["[[0 0 1
0 1 0]]\n","[[1 0 0 0 1 0]]\n","[[0 0 0 1 0 1]]\n"]}],"source":
["print(bow[0].toarray())\n","print(bow[1].toarray())\
n","print(bow[2].toarray())"]},{"cell_type":"code","source":["# Trigrams\n","from sklearn.feature_extraction.text import CountVectorizer\n","cv =
CountVectorizer(ngram_range=(3,3))"],"metadata":{"id":"kK2i-
zEYmWN7","executionInfo":{"status":"ok","timestamp":1702122980968,"user_tz":-
360,"elapsed":4,"user":{"displayName":"colab0
ineuron","userId":"16851312232179065356"}}},"execution_count":10,"outputs":[]},
{"cell_type":"code","source":["bow = cv.fit_transform(df['text'])"],"metadata":
{"id":"pyPfTgn-mawd","executionInfo":
{"status":"ok","timestamp":1702122981465,"user_tz":-360,"elapsed":2,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}}},"execution_count":11,"outputs":[]},
{"cell_type":"code","source":["print(cv.vocabulary_)"],"metadata":{"colab":
{"base_uri":"https://localhost:8080/"},"id":"P9frxWA3mdE3","executionInfo":
{"status":"ok","timestamp":1702122981982,"user_tz":-360,"elapsed":10,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"3bf35e41-c665-4bcb-c037-
4c961da91924"},"execution_count":12,"outputs":
[{"output_type":"stream","name":"stdout","text":["{'people watch dswithbappy': 2,
'dswithbappy watch dswithbappy': 0, 'people write comment': 3, 'dswithbappy write
comment': 1}\n"]}]},{"cell_type":"code","source":["print(bow[0].toarray())\
n","print(bow[1].toarray())\n","print(bow[2].toarray())"],"metadata":{"colab":
{"base_uri":"https://localhost:8080/"},"id":"Cs96lmRImhIP","executionInfo":
{"status":"ok","timestamp":1686757843245,"user_tz":-360,"elapsed":3,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"5b0bcb61-9c62-4764-a3f0-
f4ecb24fe3d8"},"execution_count":null,"outputs":
[{"output_type":"stream","name":"stdout","text":["[[0 0 1 0]]\n","[[1 0 0 0]]\
n","[[0 0 0 1]]\n"]}]},{"cell_type":"markdown","metadata":
{"id":"7RaMbIHLfXMQ"},"source":["# TF-IDF (Term frequency- Inverse document
frequency)"]},{"cell_type":"code","execution_count":13,"metadata":{"colab":
{"base_uri":"https://
localhost:8080/","height":174},"id":"_uiptQYdfXMQ","executionInfo":
{"status":"ok","timestamp":1702123498808,"user_tz":-360,"elapsed":14,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"fbb4a688-dc56-4125-a221-
f53765362579"},"outputs":[{"output_type":"execute_result","data":{"text/plain":["
text output\n","0 people watch dswithbappy 1\n","1 dswithbappy watch
dswithbappy 1\n","2 people write comment 0\n","3
dswithbappy write comment 0"]},"metadata":{},"execution_count":13}],"source":["df =
pd.DataFrame({\"text\":[\"people watch dswithbappy\",\
n"," \"dswithbappy watch dswithbappy\",\n","
\"people write comment\",\n"," \"dswithbappy write
comment\"],\"output\":[1,1,0,0]})\n","\n","df"]},
{"cell_type":"code","execution_count":14,"metadata":
{"id":"4sgfJYCqfXMR","executionInfo":
{"status":"ok","timestamp":1702123524543,"user_tz":-360,"elapsed":573,"user":
{"displayName":"colab0 ineuron","userId":"16851312232179065356"}}},"outputs":
[],"source":["from sklearn.feature_extraction.text import TfidfVectorizer\n","tfid=
TfidfVectorizer()"]},{"cell_type":"code","execution_count":15,"metadata":
{"id":"vrWuOQQBfXMR","executionInfo":
{"status":"ok","timestamp":1702123529200,"user_tz":-360,"elapsed":504,"user":
{"displayName":"colab0 ineuron","userId":"16851312232179065356"}}},"outputs":
[],"source":["arr = tfid.fit_transform(df['text']).toarray()"]},
{"cell_type":"code","source":["arr"],"metadata":{"colab":{"base_uri":"https://
localhost:8080/"},"id":"7dm6OMmdc9Py","executionInfo":
{"status":"ok","timestamp":1702123532119,"user_tz":-360,"elapsed":557,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"aaa7504e-5959-44d4-fb4e-
b23e6cd0306b"},"execution_count":16,"outputs":
[{"output_type":"execute_result","data":{"text/plain":["array([[0. ,
0.49681612, 0.61366674, 0.61366674, 0. ],\n"," [0. ,
0.8508161 , 0. , 0.52546357, 0. ],\n"," [0.57735027,
0. , 0.57735027, 0. , 0.57735027],\n"," [0.61366674,
0.49681612, 0. , 0. , 0.61366674]])"]},"metadata":
{},"execution_count":16}]},{"cell_type":"code","execution_count":17,"metadata":
{"colab":{"base_uri":"https://
localhost:8080/"},"id":"yvhynBOBfXMR","executionInfo":
{"status":"ok","timestamp":1702123541417,"user_tz":-360,"elapsed":526,"user":
{"displayName":"colab0
ineuron","userId":"16851312232179065356"}},"outputId":"36e79001-5188-486e-9d2c-
aaa03736be1f"},"outputs":[{"output_type":"stream","name":"stdout","text":
["[1.51082562 1.22314355 1.51082562 1.51082562 1.51082562]\n"]}],"source":
["print(tfid.idf_)"]},{"cell_type":"code","execution_count":null,"metadata":
{"id":"J6vC6Jx9fXMS"},"outputs":[],"source":[]}],"metadata":{"kernelspec":
{"display_name":"Python 3","language":"python","name":"python3"},"language_info":
{"codemirror_mode":
{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-
python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","
version":"3.8.5"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":0}
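
The TF-IDF numbers printed by the notebook can be checked by hand. The sketch below assumes the `TfidfVectorizer` defaults (smoothed IDF, raw term counts, L2 row normalization) and plain whitespace tokenization; the function name `tfidf_row` is hypothetical. Under those assumptions the weight of a term is `tf * (ln((1 + n) / (1 + df)) + 1)`, after which each row is divided by its Euclidean length.

```python
import math

# Same four documents used throughout the notebook.
docs = [
    "people watch dswithbappy",
    "dswithbappy watch dswithbappy",
    "people write comment",
    "dswithbappy write comment",
]

tokens = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokens for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = {term: sum(term in doc for doc in tokens) for term in vocab}

# Smoothed inverse document frequency, as in scikit-learn's default setting.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc_tokens):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weights = [doc_tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(doc) for doc in tokens]

print([round(idf[term], 8) for term in vocab])
for row in rows:
    print([round(w, 8) for w in row])
```

`dswithbappy` appears in three of the four documents, so it receives the smallest IDF (about 1.2231), while every other term appears in two documents and receives about 1.5108, matching the `tfid.idf_` output in the notebook.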
