# Nakafa Learning Content

URL: https://nakafa.com/en/subjects/mathematics/statistics-regression/linear-regression-concept
Source: https://raw.githubusercontent.com/nakafaai/nakafa.com/refs/heads/main/packages/contents/material/lesson/mathematics/statistics-regression/linear-regression-concept/en.mdx

Learn linear regression to create best-fit lines through data points. Understand prediction, slope calculations, and how to model variable relationships.

---

## What Is Linear Regression?

With [Scatter Diagrams](/en/subjects/mathematics/statistics-regression/scatter-diagram), we can see the relationship between two variables, namely $$X$$ data and $$Y$$ data.

Now, if the points on the scatter diagram seem to form a straight pattern (there's a linear correlation, whether positive or negative), we can try to draw a straight line that best fits through the middle of that cluster of points. This line is called the **Linear Regression Line**. The process of finding this line is called **Linear Regression**.

## The "Best-Fit" Line

The Linear Regression Line is often called the _best-fit_ line. Why? Because out of the many possible straight lines that could be drawn, this is the line whose position is "closest" to all the data points overall. This line attempts to summarize the trend or linear pattern present in the data.

## Example of a Regression Line

Let's say we have data on study time (hours) and exam scores again. The points tend to rise (positive correlation).

Component: ScatterDiagram
Props:
- title: Regression Line for the Relationship Between Study Time and Exam Scores
- description: The line shows the linear trend (regression line) of the data.
- xAxisLabel: Study Time (Hours)
- yAxisLabel: Exam Score
- datasets: [
{
name: "Student",
color: "var(--chart-1)",
points: [
{ x: 1, y: 60 },
{ x: 1.5, y: 68 },
{ x: 2, y: 65 },
{ x: 2.5, y: 75 },
{ x: 3, y: 72 },
{ x: 3.5, y: 80 },
{ x: 4, y: 85 },
{ x: 4, y: 82 },
{ x: 4.5, y: 90 },
{ x: 5, y: 88 },
{ x: 5.5, y: 95 },
{ x: 6, y: 92 },
],
},
]
- calculateRegressionLine: true
- regressionLineStyle: {
color: "var(--chart-3)",
}

See the line above? That is the linear regression line. It shows the **general trend**: as study time ($$X$$) increases, the exam score ($$Y$$) also tends to increase, following the direction of the line.

## What Is the Regression Line Used For?

One of its main uses is for **prediction**. For example, if a new student studies for $$7 \text{ hours}$$, we can use this regression line to estimate what their exam score might be, even though we don't have exact data for $$7 \text{ hours}$$.

## Mathematical Concept

The linear regression line (the _best-fit_ line) is found using a method called the **Least Squares Method**. The idea is to find the straight line that **minimizes the sum of the squared vertical distances** from each data point to the line.

Mathematically, the linear regression line has the form:

```math
\hat{y} = a + bx
```

Where:

- $$\hat{y}$$ (read: y-hat) is the **value of $$y$$ predicted** by the regression line.
- $$x$$ is the value of the independent variable.
- $$b$$ is the **slope** of the line, indicating how much $$\hat{y}$$ changes for each one-unit increase in $$x$$.
- $$a$$ is the **$$y$$-intercept**, which is the value of $$\hat{y}$$ when $$x = 0$$.

The values of $$b$$ and $$a$$ are calculated from the $$(x, y)$$ data we have using the following formulas:

Component: MathContainer
Children:

```math
b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}
```

```math
a = \bar{y} - b\bar{x}
```

Formula key:

- $$n$$ is the number of data pairs.
- $$\sum x$$ is the sum of all $$x$$ values.
- $$\sum y$$ is the sum of all $$y$$ values.
- $$\sum xy$$ is the sum of the products of each $$x$$ and $$y$$ pair.
- $$\sum x^2$$ is the sum of the squares of the $$x$$ values.
- $$\bar{x}$$ is the mean of the $$x$$ values ($$\frac{\sum x}{n}$$).
- $$\bar{y}$$ is the mean of the $$y$$ values ($$\frac{\sum y}{n}$$).
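
As a worked illustration, the sums for the twelve study-time points in the scatter diagram above can be substituted into these formulas. The figures below were computed from that data, and the final values are rounded to two decimal places.

```math
n = 12,\quad \sum x = 42.5,\quad \sum y = 952,\quad \sum xy = 3563,\quad \sum x^2 = 178.25
```

```math
b = \frac{12(3563) - (42.5)(952)}{12(178.25) - (42.5)^2} = \frac{42756 - 40460}{2139 - 1806.25} = \frac{2296}{332.75} \approx 6.90
```

```math
a = \bar{y} - b\bar{x} \approx 79.33 - (6.90)(3.54) \approx 54.90
```

So for this data the regression line is approximately:

```math
\hat{y} = 54.90 + 6.90x
```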

With these formulas, we obtain the one straight line that best represents the linear pattern in our data, in the least-squares sense.
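
The formulas for $$b$$ and $$a$$ translate fairly directly into code. The following is a minimal sketch in plain Python, not a definitive implementation; the function name `fit_line` is hypothetical, and the data is the study-time example from the scatter diagram.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x using the sum-based formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # All x values are identical, so no unique slope exists.
        raise ValueError("x values must not all be equal")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = sum_y / n - b * (sum_x / n)                 # intercept: y_bar - b * x_bar
    return a, b


# Study time (hours) and exam scores from the scatter diagram.
hours = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4, 4.5, 5, 5.5, 6]
scores = [60, 68, 65, 75, 72, 80, 85, 82, 90, 88, 95, 92]

a, b = fit_line(hours, scores)
print(f"y_hat = {a:.2f} + {b:.2f}x")
```

The check on the denominator covers the one case where the formulas break down: if every $$x$$ value is the same, the points lie on a vertical line and no slope can be computed.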